GPT, Claude, Llama? How to tell which AI model is best


When Meta, the parent company of Facebook, announced its latest open-source large language model (LLM) on July 23rd, it claimed that the most powerful version of Llama 3.1 had “state-of-the-art capabilities that rival the best closed-source models” such as GPT-4o and Claude 3.5 Sonnet. Meta’s announcement included a table, showing the scores achieved by these and other models on a series of popular benchmarks with names such as MMLU, GSM8K and GPQA.

On MMLU, for example, the most powerful version of Llama 3.1 scored 88.6%, against 88.7% for GPT-4o and 88.3% for Claude 3.5 Sonnet, rival models made by OpenAI and Anthropic, two AI startups, respectively. Claude 3.5 Sonnet had itself been unveiled on June 20th, again with a table of impressive benchmark scores. And on July 24th, the day after Llama 3.1’s debut, Mistral, a French AI startup, announced Mistral Large 2, its latest LLM, with—you’ve guessed it—yet another table of benchmarks. Where do such numbers come from, and can they be trusted?

Having accurate, reliable benchmarks for AI models matters, and not just for the bragging rights of the firms making them. Benchmarks “define and drive progress”, telling model-makers where they stand and incentivising them to improve, says Percy Liang of the Institute for Human-Centred Artificial Intelligence at Stanford University. Benchmarks chart the field’s overall progress and show how AI systems compare with humans at specific tasks. They can also help users decide which model to use for a particular job and identify promising new entrants in the space, says Clémentine Fourrier, a specialist in evaluating LLMs at Hugging Face, a startup that provides tools for AI developers.

But, says Dr Fourrier, benchmark scores “should be taken with a pinch of salt”. Model-makers are, in effect, marking their own homework—and then using the results to hype their products and talk up their company valuations. Yet all too often, she says, their grandiose claims fail to match real-world performance, because existing benchmarks, and the ways they are applied, are flawed in various ways.

One problem with benchmarks such as MMLU (massive multi-task language understanding) is that they are simply too easy for today’s models. MMLU was created in 2020 and consists of 15,908 multiple-choice questions, each with four possible answers, across 57 topics including maths, American history, science and law. At the time, most language models scored little better than 25% on MMLU, which is what you would get by picking answers at random; OpenAI’s GPT-3 did best, with a score of 43.9%. But since then, models have improved, with the best now scoring between 88% and 90%.
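The arithmetic behind such scores is simple enough to sketch in a few lines of code. The sketch below uses invented questions rather than real MMLU items; it shows how a multiple-choice benchmark score is computed, and why guessing at random among four options lands near 25%.

```python
import random

# Invented stand-ins for benchmark items; real MMLU questions are not reproduced here.
QUESTIONS = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Lyon", "Nice", "Paris", "Lille"], "answer": 2},
]

def score(model, items):
    """Fraction of items on which the model picks the correct choice index."""
    return sum(model(item) == item["answer"] for item in items) / len(items)

def random_guesser(item):
    """A 'model' that guesses uniformly at random among the options."""
    return random.randrange(len(item["choices"]))

# Averaged over many trials, the guesser converges on 1/4, i.e. about 25%.
trials = [score(random_guesser, QUESTIONS * 50) for _ in range(200)]
print(f"random baseline: {sum(trials) / len(trials):.1%}")
```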

This means it is difficult to draw meaningful distinctions from their scores, a problem known as “saturation”. “It’s like grading high-school students on middle-school tests,” says Dr Fourrier. More difficult benchmarks have been devised—MMLU-Pro has tougher questions and ten possible answers rather than four. GPQA is like MMLU at PhD level, on selected science topics; today’s best models tend to score between 50% and 60% on it. Another benchmark, MuSR (multi-step soft reasoning), tests reasoning ability using, for example, murder-mystery scenarios. When a person reads such a story and works out who the killer is, they are combining an understanding of motivation with language comprehension and logical deduction. AI models are not so good at this kind of “soft reasoning” over multiple steps. So far, few models score better than random on MuSR.

MMLU also highlights two other problems. One is that the answers in such tests are sometimes wrong. A study carried out by Aryo Gema of the University of Edinburgh and colleagues, published in June, found that, of the questions they sampled, 57% of MMLU’s virology questions and 26% of its logical-fallacy ones contained errors. Some had no correct answer; others had more than one. (The researchers cleaned up the MMLU questions to create a new benchmark, MMLU-Redux.)

Then there is a deeper issue, known as “contamination”. LLMs are trained using data from the internet, which may include the exact questions and answers for MMLU and other benchmarks. The models may, in short, be cheating, intentionally or not, because they have seen the tests in advance. Indeed, some model-makers may deliberately train a model with benchmark data to boost its score. But the score then fails to reflect the model’s true ability. One way to get around this problem is to create “private” benchmarks for which the questions are kept secret, or released only in a tightly controlled manner, to ensure that they are not used for training (GPQA does this). But then only those with access can independently verify a model’s scores.
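Contamination checks often look for verbatim overlap between benchmark text and training data. A minimal sketch of that idea, using word n-grams and invented strings (not real training data or a real benchmark question):

```python
def ngrams(text, n=8):
    """Set of lower-cased word n-grams; checks of this kind often use grams of 8-13 words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item, training_text, n=8):
    """Flag the item if any of its n-grams appears verbatim in the training text."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_text, n))

# Illustrative strings only.
corpus = ("a study guide notes that which of the following best describes "
          "the function of a ribosome in protein synthesis and explains it")
question = "Which of the following best describes the function of a ribosome?"
print(looks_contaminated(question, corpus))  # True: the phrasing leaked verbatim
```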

To complicate matters further, it turns out that small changes in the way questions are posed to models can significantly affect their scores. In a multiple-choice test, asking an AI model to state the answer directly, or to reply with the letter or number corresponding to the correct answer, can produce different results. That affects reproducibility and comparability.
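To see why, consider two phrasings of the same (invented) item, either of which a test harness might legitimately use:

```python
question = "Which planet is closest to the sun?"
choices = ["Venus", "Mercury", "Earth", "Mars"]

# Variant 1: ask for the letter of the correct option.
options = "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", choices))
prompt_a = f"{question}\n{options}\nReply with the letter of the correct option."

# Variant 2: ask the model to state the answer directly.
prompt_b = f"{question} Reply with the name of the planet."

# The same model can score differently on the two variants: grading a
# free-form reply ("Mercury", "It's Mercury.") is not the same as matching "B".
print(prompt_a, prompt_b, sep="\n\n")
```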

Automated testing systems are now used to test models against benchmarks in a standardised manner. Dr Liang’s team at Stanford has built one such system, called HELM (holistic evaluation of language models), which generates leaderboards showing how a range of models perform on various benchmarks. Dr Fourrier’s team at Hugging Face uses another such system, EleutherAI’s LM Evaluation Harness, to generate leaderboards for open-source models. These leaderboards are more trustworthy than the tables of results provided by model-makers, because the benchmark scores have been generated in a consistent way.
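Neither HELM’s nor EleutherAI’s code is reproduced here, but the principle is easy to sketch: fix one prompt template and one scoring rule, then run every model through both. A minimal, hypothetical harness (the models and item below are toys, not real systems):

```python
from typing import Callable

Model = Callable[[str], str]  # a model maps a prompt string to a reply string

def evaluate(models: dict[str, Model], items: list[dict], template: str) -> dict[str, float]:
    """Score every model on every item with one shared template and one scoring rule."""
    scores = {}
    for name, model in models.items():
        correct = 0
        for item in items:
            prompt = template.format(question=item["question"],
                                     choices=", ".join(item["choices"]))
            reply = model(prompt).strip().lower()
            correct += reply == item["choices"][item["answer"]].lower()
        scores[name] = correct / len(items)
    # Leaderboard order: best score first.
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

items = [{"question": "2 + 2 = ?", "choices": ["3", "4"], "answer": 1}]
models = {"model-a": lambda prompt: "4", "model-b": lambda prompt: "3"}
print(evaluate(models, items, "{question} Options: {choices}. Answer:"))
```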

The greatest trick AI ever pulled

As models gain new skills, new benchmarks are being developed to assess them. GAIA, for example, tests AI models on real-world problem-solving. (Some of the answers are kept secret to avoid contamination.) NoCha (novel challenge), announced in June, is a “long context” benchmark consisting of 1,001 questions about 67 recently published English-language novels. The answers depend on having read and understood the whole book, which is supplied to the model as part of the test. Recent novels were chosen because they are unlikely to have been used as training data. Other benchmarks assess models’ ability to solve biology problems or their tendency to hallucinate.

But new benchmarks can be expensive to develop, because they often require human experts to create a detailed set of questions and answers. One answer is to use LLMs themselves to develop new benchmarks. Dr Liang is doing this with a project called AutoBencher, which extracts questions and answers from source documents and identifies the hardest ones.
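In outline, the recipe might look something like the sketch below. It is only a guess at the general approach, not AutoBencher’s actual code; ask_llm is a hypothetical stand-in for a call to any LLM API, and the prompt and answer format are invented.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call; wire one in to run this."""
    raise NotImplementedError

def build_benchmark(documents: list[str], models: list, keep: int = 100) -> list[dict]:
    """Sketch: mine Q&A pairs from source documents, then keep the hardest ones."""
    candidates = []
    for doc in documents:
        # Ask an LLM for a question answerable only from this document.
        reply = ask_llm(f"Write one hard question and its answer, based on:\n{doc}\n"
                        "Format: the question, then a line starting 'Answer:'.")
        question, answer = reply.split("Answer:", 1)
        candidates.append({"question": question.strip(), "answer": answer.strip()})
    # Difficulty: the share of candidate models that get the question wrong.
    for item in candidates:
        wrong = sum(model(item["question"]).strip() != item["answer"] for model in models)
        item["difficulty"] = wrong / max(len(models), 1)
    return sorted(candidates, key=lambda c: -c["difficulty"])[:keep]
```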

Anthropic, the startup behind the Claude LLM, has started funding the creation of benchmarks directly, with a particular emphasis on AI safety. “We are super-undersupplied on benchmarks for safety,” says Logan Graham, a researcher at Anthropic. “We are in a dark forest of not knowing what the models are capable of.” On July 1st the company began inviting proposals for new benchmarks, and tools for generating them, which it will co-fund, with a view to making them available to all. This might involve devising ways to assess a model’s ability to develop cyber-attack tools, say, or its willingness to provide advice on making chemical or biological weapons. These benchmarks can then be used to assess the safety of a model before public release.

Historically, says Dr Graham, AI benchmarks have been devised by academics. But as AI is commercialised and deployed in a range of fields, there is a growing need for reliable and specific benchmarks. Startups that specialise in providing AI benchmarks are starting to appear, he notes. “Our goal is to pump-prime the market,” he says, to give researchers, regulators and academics the tools they need to assess the capabilities of AI models, good and bad. The days of AI labs marking their own homework could soon be over.

© 2024, The Economist Newspaper Limited. All rights reserved. From The Economist, published under licence. The original content can be found on www.economist.com
