DeepMind’s Michelangelo benchmark reveals limitations of long-context LLMs


Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More


Large language models (LLMs) with very long context windows have been making headlines lately. The ability to cram hundreds of thousands or even millions of tokens into a single prompt unlocks many possibilities for developers. 

But how well do these long-context LLMs really understand and utilize the vast amounts of information they receive?

Researchers at Google DeepMind have introduced Michelangelo, a new benchmark designed to evaluate the long-context reasoning capabilities of LLMs. Their findings, published in a new research paper, show that while current frontier models have progressed in retrieving information from large in-context data, they still struggle with tasks that require reasoning over the data structure.

The need for better long-context benchmarks

The emergence of LLMs with extremely long context windows, ranging from 128,000 to over 1 million tokens, has prompted researchers to develop new benchmarks to evaluate their capabilities. However, most of the focus has been on retrieval tasks, such as the popular “needle-in-a-haystack” evaluation, where the model is tasked with finding a specific piece of information within a large context.

“Over time, models have grown considerably more capable in long context performance,” Kiran Vodrahalli, research scientist at Google DeepMind, told VentureBeat. “For instance, the popular needle-in-a-haystack evaluation for retrieval has now been well saturated up to extremely long context lengths. Thus, it has become important to determine whether the harder tasks models are capable of solving in short context regimes are also solvable at long ranges.”

Retrieval tasks don’t necessarily reflect a model’s capacity for reasoning over the entire context. A model might be able to find a specific fact without understanding the relationships between different parts of the text. Meanwhile, existing benchmarks that evaluate a model’s ability to reason over long contexts have limitations.

“It is easy to develop long reasoning evaluations which are solvable with a combination of only using retrieval and information stored in model weights, thus ‘short-circuiting’ the test of the model’s ability to use the long-context,” Vodrahalli said.

Michelangelo

To address the limitations of current benchmarks, the researchers introduced Michelangelo, a “minimal, synthetic, and unleaked long-context reasoning evaluation for large language models.” 

Michelangelo is based on the analogy of a sculptor chiseling away irrelevant pieces of marble to reveal the underlying structure. The benchmark focuses on evaluating the model’s ability to understand the relationships and structure of the information within its context window, rather than simply retrieving isolated facts.

The benchmark consists of three core tasks:

Latent list: The model must process a long sequence of operations performed on a Python list, filter out irrelevant or redundant statements, and determine the final state of the list. “Latent List measures the ability of a model to track a latent data structure’s properties over the course of a stream of code instructions,” the researchers write.

Multi-round co-reference resolution (MRCR): The model must produce parts of a long conversation between a user and an LLM. This requires the model to understand the structure of the conversation and resolve references to previous turns, even when the conversation contains confusing or distracting elements. “MRCR measures the model’s ability to understanding ordering in natural text, to distinguish between similar drafts of writing, and to reproduce a specified piece of previous context subject to adversarially difficult queries,” the researchers write.

“I don’t know” (IDK): The model is given a long story and asked to answer multiple-choice questions about it. For some questions, the context does not contain the answer, and the model must be able to recognize the limits of its knowledge and respond with “I don’t know.” “IDK measures the model’s ability to understand whether it knows what it doesn’t know based on the presented context,” the researchers write.

Latent Structure Queries

The tasks in Michelangelo are based on a novel framework called Latent Structure Queries (LSQ). LSQ provides a general approach for designing long-context reasoning evaluations that can be extended to arbitrary lengths. It can also test the model’s understanding of implicit information as opposed to retrieving simple facts. LSQ relies on synthesizing test data to avoid the pitfalls of test data leaking into the training corpus.

“By requiring the model to extract information from structures rather than values from keys (sculptures from marble rather than needles from haystacks), we can more deeply test language model context understanding beyond retrieval,” the researchers write.

LSQ has three key differences from other approaches to evaluating long-context LLMs. First, it has been explicitly designed to avoid short-circuiting flaws in evaluations that go beyond retrieval tasks. Second, it specifies a methodology for increasing task complexity and context length independently. And finally, it is general enough to capture a large range of reasoning tasks. The three tests used in Michelangelo cover code interpretation and reasoning over loosely written text.

“The goal is that long-context beyond-reasoning evaluations implemented by following LSQ will lead to fewer scenarios where a proposed evaluation reduces to solving a retrieval task,” Vodrahalli said.

Evaluating frontier models on Michelangelo

The researchers evaluated ten frontier LLMs on Michelangelo, including different variants of Gemini, GPT-4 and 4o, and Claude. They tested the models on contexts up to 1 million tokens. Gemini models performed best on MRCR, GPT models excelled on Latent List, and Claude 3.5 Sonnet achieved the highest scores on IDK.

However, all models exhibited a significant drop in performance as the complexity of the reasoning tasks increased, suggesting that even with very long context windows, current LLMs still have room to improve in their ability to reason over large amounts of information.

Frontier LLMs struggle with reasoning on long-context windows (source: arxiv)

“Frontier models have room to improve on all of the beyond-retrieval reasoning primitives (Latent List, MRCR, IDK) that we investigate in Michelangelo,” Vodrahalli said. “Different frontier models have different strengths and weaknesses – each class performs well on different context ranges and on different tasks. What does seem to be universal across models is the initial drop in performance on long reasoning tasks.”

The Michelangelo evaluations capture basic primitives necessary for long-context reasoning and the findings can have important implications for enterprise applications. For example, in real-world applications where the model can’t rely on its pretraining knowledge and must perform multi-hop reasoning over many disparate locations in very long contexts, Vodrahalli expects performance to drop as the context length grows.

“This is particularly true if the documents have a lot of information that is irrelevant to the task at hand, making it hard for a model to easily immediately distinguish which information is relevant or not,” Vodrahalli said. “It is also likely that models will continue to perform well on tasks where all of the relevant information to answer a question is located in one general spot in the document.”

The researchers will continue to add more evaluations to Michelangelo and hope to make them directly available so that other researchers can test their models on them.



Source link

Share

Latest Updates

Frequently Asked Questions

Related Articles

Top Divorce Lawyers in KL: Legal Support for a Stress-Free Divorce

Divorce can be one of the most emotionally and financially taxing experiences in life....

Yale Law Scholar Suspended After AI Calls Her a Terrorist

Since the first publicly available large language model (LLM) ChatGPT hit the scene...

Carl Lundstrom, Who Financed the Pirate Bay, Dies in Plane Crash

Carl Lundstrom, the heir to a Swedish crisp bread fortune who financed the...

So long, Google Assistant. It’s Gemini’s world now

The writing was already on the wall, but now it’s official: The Google...
PORN VIDEO
PORN VIDEO
PORN VIDEO
SULTAN88
SULTANSLOT
RAJA328
JOIN88
GFC88
HOKIBET
RUSIASLOT88
TAHU69
BONANZA99
PRAGMABET
MEGA55
LUXURY777
LUXURY333
BORJU89
QQGAMING
KEDAI168
MEGA777
NAGASLOT777
TAKSU787
KKSLOT777
MAS77TOTO
bandar55
BOS303
HOKI99
NUSA365
YUHUSLOT
KTP168
GALAXY138
NEXIA138
PETIR33
BOOM138
MEGA888
CABE888
FOSIL777
turbospin138
KAPAKBET
SUPERJP
sultankoin99
dragon88
raffi888
kenzobet
aladin666
rgo365
ubm4d
GERCEP88
VIVA99
CR777
VOXY88
delman567
intan69
CABE888
RNR303
LOGO303
PEMBURUGACOR
mpo383
cermin4d
bm88
ANGKA79
WOWHOKI
ROKET303
MPOXL
GURITA168
SUPRASLOT
SGCWIN
DESA88
ARWANA388
DAUNEMAS
ALADDIN666
BIOWIN69
SKY77
DOTA88
NAGA138
API5000
y200m
PLAYBOOK88
LUXURY12
A200M
MPO700
KENANGAN4D
cakrabola
PANDAGENDUT
MARVEL77
UG300
HOKI178
MONTE77
JASABOLA
UNTAR4D
LIDO88
MAFIABOLA77
GASPOL189
mpo999
untung138
TW88
JAGUAR33
MPOBOS
SHIO88
VIVO4D
MPOXL
JARISAKTI
BBO303
AONCASH
ANGKER4D
LEVIS4D
JAGO88
REPUBLIK365
BOSDEAL88
BOLA168
akunjp
WARTEGBET
EZEBET
88PULSA
KITAB4D
BOSDEAL88
STUDIOBET
MESINKOIN
BIMA88
PPNUSA
ABGBET88
TOP77
BAYAR77
YES77
BBTN4D
BBCA4D
VSLOTS88
MPO800
PAHALA4D
KPI4D
JURAGAN77
QQ188
BOLAPELANGI
C200M
QQ998
GWKTOGEL
MEGABANDAR
COLOWIN
VIP579
SEVEN4D
MPO188
DEWATA88
SURAT4D
SINAR123
LAMBO77
GUDANG4D
AWAN4D
PLANETLIGA
GT88
ROYALSPIN88
MAMAJITU
MITO99
PEDIA4D
WIBU69JP
333HOKI
SIDARMA88
NAGAEMAS99
HOLA88
CAKAR76
KINGTOTO
RATUGAMING
SSI168
PILAR168
ACTOTO
EYANGTOGEL
KAISAR328
SLOT628
KAISAR88
DOTA88
MAXWIN369
ALIBABA99
MM168
SQUAD777
NAGABET88
JAYABOLA
SEMPATIGAME
PANDAJAGO
PIKAT4D
SINGA77
YUYU33
MASTERPLAY99
VICTORY39
NASA4D
PERMATA55
SAKAUSLOT
CK303
MPOTOWER
CIPUTRABET
WINJUDI
DEWI5000
IYA777
MAHIRTOTO
GOSLOT88
TIPTOP4D
RAJA787
JBO680
JOKER188
EPICPLAY88
TRIVABET
KAISAR189
JOKER81
JPSPIN88
MAYORA4D
DJARUMPLAY
OVO88
BAKTI78
WINGSLOT77
ICAFE4D
PDTOTO
JETPLAY88
JETPLAY88
STADIUM4D
RAJAVIP777
ISB388
GASSPOL168
JITU33
ISTANA8899
CERI123
VIPPELANGI99
55WEALTH
LIGAJUARA
RAJAPKV
HMTOTO
PERKASA99
DEWIGG
MASTERKIU
DAFTARJP268
BATENGMERAH
YOGATOTO
GRAZYRICH88
RGO365
TIKI4D
GBOSKY
RANS4D
GRAND4D
GARUDABET77
BOLABESAR
KASIR777
WINPALACE88
SAMUDRBET
JAGO89
IBCBET
SUPER126
BIZZ77GAMES
ASET69
GAMESPOLLS
LOGO303
JETHOKI
FERRARITOTO
SULTAN69
BARUNATOTO
MDSBET
HOBBIQQ
SARANG188
HEPI55
NARUTOBET
ASIABET4D
PRAGMABET
OKEBOS138
HAHA55
VOCAL77
GATOT4D
LANANGBET
BONCEL4D
TUKUL777
BOOKIE7
PAJAKBOLA
5DEWA
WAHIDTOTO
CSOWIN
OMG303
WINLIVE4D
ALADDIN666
LUMIO777
GBOPLAY777
GEBER88
BETWIN89
BIBIT88
BIJITOGEL
BIMOIN88
BINGOSLOT88
BINTANG29
BINTANG4D
BISABET
BOJO88
BOLA99
BOLAKAWAN
BOROBUDURBET
BOSDEAL88
BOSKU123
HOKI138
BOSS177
BOSSKLIK
BP77
GARUDA999
ABO777
MAXBET268
BANDARSBO
UGDEWA
ANAKNAGA
BIGSLOT
FYP138
SKYWIN386
KOBOY789
YYPAUS
LUCKY77
ISTANAIMPIAN4
PEDRO4D
SEMAR123
AKSARA88
VIRGO168
JUALTOTO
KAISAR89
CAPSAWINS
SUKI99
SIARIL
BOSSLOT138
PRAGMATIC777
ARWANA89
DUKUN138
KOI77
SBA99
GOWD
ANAKTOTO
JAKJP
EU9
ZONA66
MURAH138
SULE88
PPNUSA
PENCETAJA
RAFI168
MURAH138_LOGIN
PATEN77
ACETOTO888
CUAN368
KENZO123
DEWAWIN365
KUPONTOTO
MPOTOP88
TOKYO188
SLOT88RESMI
CAPTAIN77
PECINTA4D
PANEN33
TANTAN88
OMEGA138
KUDA77
BLURAYUFR
YANDEXEU
K86SPORT
ASIAKLUB
ION55
OTW78
POOLS303
ALL303
MPOBOS
MEGA118
MAMEN123
MEVIUS88
77ROYAL
DRAGON222
337SPORTS
QQ1221
CAFE69
TKO77
GELEK4D
DOMINO76
PPSNUSA
ANDAHOKI
OASIS88
SOHIB4D
HERMES21
NEON4D
GASWIN
HOLA88
ALEXIS17
Y200M
MPLAY5000
MPOLANGIT
SIHOKI
SULTAN33
SAVAYASLOT
MONTE77
BARDI4D
PSTOTO99
SGO777
MACO4D
TAJIR77
UNOSLOT
BABE168
SULTANJP
KINGS128
KADERSLOT
TOTO911
KUATJP
LUNAS168
JOKER888
GIGASLOT88
GMSLOT88
HOBI188
IBET44
IDWIN
IGCWIN
OVOKER
TEXASPOKER
HOKIVEGAS
POKERBOYA
RGOPOKER
INDOWINBET
HKBPOKER
ROYALPOKER
HKBPOKERQQ
ALFA303
INDODINGDONG
RGOBET
EYANGPOKER
BROVEGAS
GITARTOGEL
GITARPOKER
AHABET
KTP303
MABOSWAY
KBO77
GIGASLOT88
GMSLOT88
HOBI188
IBET44
IDWIN
IGCWIN
DEWIJOKER
DRAGON303
FANTASYSLOT
FORWIN77
GBO007
GBOPLAY138
GBOSLOT
GBOWIN
NAGA168
PBOWIN
UANG77
MVP288
MURAHSLOT
MASHOKI
GITAR100
ERAPLAY88
GOLDENCROWNPOKER
HPPOKER
DNDPOKER
SUPER138
RAKSASA123
MOTORSLOT77
KUDASAKTI168
ERA77
526BET
52TOGEL
76SLOT
LEXISPOKER
LVONLINE
KAPAL4D
KAPAL4D2
MOMOPOKER
K7BOLA
NAGABOLA
TOGELHOK
WAZEPOKER
WARKOPPOKER
PORN VIDEO
https://link.space/@Hikaribet
https://bio.site/Hikaribet
https://heylink.me/Hikaribet39

Strategi Ampuh Menang di Slot Zeus: Panduan Pemula hingga Pro

Slot Zeus Online: Game RTP Tinggi yang Wajib Dicoba Pemain Slot!

Slot Gacor Paling Gacor Terbaik

Review Lengkap Slot Zeus Online: Apakah Game Ini Layak Dimainkan?

Rahasia Menang di Slot Zeus Online: Strategi dan Tips Terbaru 2025

Mitos vs Fakta: Apakah Slot Zeus Benar-benar Menguntungkan?

Keunggulan Slot Zeus Dibandingkan Game Slot Lain, Wajib Tahu!

Fakta Menarik Slot Zeus Online: Fitur Bonus dan Jackpot Besar!

Cara Bermain Slot Zeus Online Agar Maksimal dan Menghasilkan Cuan

Slot Zeus Online: Cara Memanfaatkan Free Spin untuk Maksimal Jackpot!

10 Alasan Kenapa Slot Zeus Online Jadi Favorit Para Pemain Slot

CMBET88
Gamelantogel
CMBET88
didascaliasdelteatrocaminito.com
glenellynrent.com
gypsumboardequipment.com
realseller.org
https://harrysphone.com/upin
gyergyoalfalu.ro/tokek
vipokno.by/gokil
winjospg.com
winjos801.com/
www.logansquarerent.com
internationalfintech.com/bamsz
condowizard.ca
jawatoto889.com
hikaribet3.live
hikaribet1.com
heylink.me/hikaribet
www.nomadsumc.org
condowizard.ca/aromatoto
euro2024gol.com
www.imaracorp.com
daftarsekaibos.com
stuffyoucanuse.org/juragan
Toto Macau 4d
Aromatoto
Lippototo
Mbahtoto
Winjos
152.42.229.23
bandarlotre126.com
heylink.me/sekaipro
www.get-coachoutletsonline.com
wholesalejerseyslord.com
Lippototo
Zientoto
Lippototo
Situs Togel Resmi
Fajartoto
Situs Togel
Toto Macau
Winjos
Winlotre
Aromatoto
design-develop-test.com
winlotre.online
winlotre.xyz
winlotre.us
winlotrebandung.com
winlotrepalu.com
winlotresurabaya.shop
winlotrejakarta.com
winlotresemarang.shop
winlotrebali.shop
winlotreaceh.shop
winlotremakmur.com
Dadu Online
Taruhantoto
a Bandarlotre
bursaliga
lakitoto
aromatoto
Rebahin
untungslot.pages.dev
slotpoupler.pages.dev
rtpliveslot88a.pages.dev
tipsgameslot.pages.dev
pilihslot88.pages.dev
fortuertiger.pages.dev
linkp4d.pages.dev
linkslot88a.pages.dev
slotpgs8.pages.dev
markasjudi.pages.dev
saldo69.pages.dev
slotbenua.pages.dev
saingtoto.pages.dev
markastoto77.pages.dev
jowototo88.pages.dev
sungli78.pages.dev
volatilitas78.pages.dev
bonusbuy12.pages.dev
slotoffiline.pages.dev
dihindari77.pages.dev
rtpdislot1.pages.dev
agtslot77.pages.dev
congtoto15.pages.dev
hongkongtoto7.pages.dev
sinarmas177.pages.dev
hours771.pages.dev
sarana771.pages.dev
kananslot7.pages.dev
balitoto17.pages.dev
jowototo17.pages.dev
aromatotoding.com
unyagh.org
fairparkcounseling.com/gap/
impress-newtex.com/ajax/
SULTAN88
SULTANSLOT
RAJA328
JOIN88+
HOKIBET
GFC88
RusiaSlot88
Tahu69
BONANZA99
Pragmabet
mega55
luxury777
luxury333
borju89
qqgaming
KEDAI168
mega777
nagaslot777
TAKSU787
kkslot777
MAS77TOTO
BANDAR55+
BOS303
Login-HOKI99/
NUSA365
YUHUSLOT
ktp168
GALAXY138