Hugging Face shows how test-time scaling helps small language models punch above their weight




In a new case study, Hugging Face researchers have demonstrated how small language models (SLMs) can be configured to outperform much larger models. Their findings show that a Llama 3 model with 3B parameters can outperform the 70B version of the model on complex math problems.

Hugging Face has documented the entire process and provides a roadmap for enterprises that want to create their own customized reasoning models.

Image source: Hugging Face

Scaling test-time compute

The work is inspired by OpenAI’s o1, which uses extra “thinking” during inference to solve complex math, coding and reasoning problems.

The key idea behind models like o1 is to scale “test-time compute,” which effectively means using more compute cycles during inference to test and verify different responses and reasoning paths before producing the final answer. Scaling test-time compute is especially useful when there is not enough memory to run a large model. 

Since o1 is a private model and OpenAI has remained tight-lipped about its internal workings, researchers have been speculating about how it works and trying to reverse engineer the process. There are already several open alternatives to o1.

Hugging Face’s work is based on a DeepMind study released in August, which investigates the tradeoffs between inference-time and pre-training compute. The study provides comprehensive guidelines on how to balance training and inference compute to get the best results for a fixed budget.

In addition to using extra inference-time compute, the success of the technique hinges on two key components: a reward model that evaluates the SLM’s answers, and a search algorithm that optimizes the path it takes to refine its answers.

Image source: Hugging Face

Different reasoning algorithms

The simplest way to use test-time scaling is “majority voting,” in which the same prompt is sent to the model multiple times and the most frequent answer is chosen. On simple problems, majority voting can prove useful, but its gains quickly plateau on complex reasoning problems or on tasks where errors are consistent across generations.
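As a rough illustration, here is a minimal Python sketch of majority voting. The `generate` function is a hypothetical stand-in for any sampling call that returns the model’s final answer as a string (for math problems, typically an extracted boxed answer):

```python
from collections import Counter

def majority_vote(generate, prompt, n=16):
    """Sample n answers for the same prompt and return the most frequent one.

    `generate` is an assumed stand-in for a sampling call that
    returns the model's final answer as a string.
    """
    answers = [generate(prompt) for _ in range(n)]  # repeated sampling
    return Counter(answers).most_common(1)[0][0]    # most common answer wins
```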

A more advanced reasoning method is “Best-of-N.” In this technique, the SLM generates multiple answers, but instead of majority voting, a reward model evaluates the answers and chooses the best one. “Weighted Best-of-N,” a more nuanced version of this method, factors in consistency by favoring answers that are both high-scoring and frequent.
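A minimal sketch of Weighted Best-of-N follows, again with hypothetical `generate` and `reward` stand-ins; plain Best-of-N would simply take the single highest-scoring candidate instead of pooling scores:

```python
from collections import defaultdict

def weighted_best_of_n(generate, reward, prompt, n=16):
    """Weighted Best-of-N: identical answers pool their reward scores,
    so an answer that is both high-scoring and frequent wins.

    `generate` and `reward` are assumed stand-ins: one samples an
    answer, the other scores an answer for a given prompt.
    """
    totals = defaultdict(float)
    for _ in range(n):
        answer = generate(prompt)                 # sample one candidate
        totals[answer] += reward(prompt, answer)  # accumulate its score
    return max(totals, key=totals.get)            # highest total wins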

The researchers used a “process reward model” (PRM) that scores the SLM’s response not only on the final answer but also on the intermediate stages it goes through to reach it. Their experiments showed that Weighted Best-of-N with a PRM brought Llama-3.2 1B near the level of Llama-3.1 8B on the difficult MATH-500 benchmark.
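Conceptually, a PRM scores each partial solution rather than just the final answer. The sketch below assumes a hypothetical `prm` call that returns a score in [0, 1] for a partial solution; how per-step scores are aggregated is a design choice, and the product shown here is only one common variant (the minimum or the last step’s score are others):

```python
def prm_score(prm, prompt, steps):
    """Score a multi-step solution with a process reward model.

    `prm(prompt, partial_steps)` is assumed to return a score in
    [0, 1] for the partial solution so far. Aggregating by product
    (shown here) is illustrative, not the study's exact rule.
    """
    score = 1.0
    for i in range(len(steps)):
        score *= prm(prompt, steps[: i + 1])  # score each partial solution
    return score
```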

Image source: Hugging Face

To further improve the model’s performance, the researchers added search algorithms to the model’s reasoning process. Instead of generating the answer in a single pass, they used “beam search,” an algorithm that guides the model’s generation step by step.

At each step, the SLM generates multiple partial answers. The search algorithm uses the reward model to evaluate them and chooses a subset worth exploring further. The process repeats until the model exhausts its inference budget or reaches a final answer. This way, the inference budget is concentrated on the most promising answers.
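A minimal sketch of this loop, under the same assumptions as above: `extend(prompt, steps)` is a hypothetical call that samples one more reasoning step, `prm(prompt, steps)` scores a partial solution, and `is_complete(steps)` checks whether a beam has produced a final answer:

```python
def beam_search(extend, prm, is_complete, prompt,
                beam_width=4, expand=4, max_steps=8):
    """Step-wise beam search guided by a process reward model.

    `extend`, `prm` and `is_complete` are assumed stand-ins, not
    any specific library API.
    """
    beams = [[]]  # each beam is the list of reasoning steps taken so far
    for _ in range(max_steps):
        # propose several candidate next steps for every surviving beam
        candidates = [beam + [extend(prompt, beam)]
                      for beam in beams for _ in range(expand)]
        # rank partial solutions with the PRM and keep only the best
        candidates.sort(key=lambda steps: prm(prompt, steps), reverse=True)
        beams = candidates[:beam_width]
        if all(is_complete(steps) for steps in beams):
            break  # every surviving beam has reached a final answer
    return beams[0]  # highest-scoring reasoning path
```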

The researchers found that while beam search improves the model’s performance on complex problems, it tends to underperform other techniques on simple problems. To address this challenge, they added two more elements to their inference strategy.

First was Diverse Verifier Tree Search (DVTS), a variant of beam search that diversifies the SLM’s response branches so it doesn’t get stuck in false reasoning paths. Second, they developed a “compute-optimal scaling strategy,” as suggested in the DeepMind paper, which dynamically chooses the best test-time scaling strategy based on the difficulty of the input problem.
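To make the idea concrete, a compute-optimal strategy might look like the dispatcher below, built on the earlier sketches. The `estimate_difficulty` helper is hypothetical (it might, say, measure the spread of reward scores over a few quick samples), and the thresholds and strategy mapping are illustrative only, not the DeepMind paper’s actual rule:

```python
def compute_optimal_answer(prompt, budget=16):
    """Dispatch to a test-time strategy by estimated problem difficulty.

    `estimate_difficulty`, `generate`, `reward`, `extend`, `prm` and
    `is_complete` are the assumed stand-ins from the earlier sketches;
    the thresholds below are illustrative.
    """
    difficulty = estimate_difficulty(prompt)  # assumed to return a value in [0, 1]
    if difficulty < 0.3:
        return majority_vote(generate, prompt, n=budget)  # easy: voting suffices
    if difficulty < 0.7:
        return weighted_best_of_n(generate, reward, prompt, n=budget)
    # hard problems benefit most from guided step-wise search
    return beam_search(extend, prm, is_complete, prompt, expand=budget // 4)
```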

The combination of these techniques enabled Llama-3.2 1B to punch above its weight and outperform the 8B model by a significant margin. They also found the strategy scalable: applied to Llama-3.2 3B, it outperformed the much larger 70B model.

Not a perfect solution yet

Scaling test-time compute changes the dynamics of model costs. Enterprises now have the ability to choose where to allocate their compute resources. For example, if you are short on memory or can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.

However, test-time scaling also has its limitations. For example, in Hugging Face’s experiments, the researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models in parallel (though this is still far more resource-efficient than running the 70B model). The researchers acknowledge that the holy grail of test-time scaling is “self-verification,” where the original model verifies its own answers instead of relying on an external verifier. This remains an open area of research.

The test-time scaling technique presented in this study is also limited to problems where the answer can be clearly evaluated, such as coding and math. Creating reward models and verifiers for subjective tasks such as creative writing and product design requires further research.

But what is clear is that test-time scaling has generated a lot of interest and activity, and we can expect more tools and techniques to emerge in the coming months. Enterprises would be wise to keep an eye on how the landscape develops.


