Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations


Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for specific and highly detailed responses. 

It’s a challenge data scientists have struggled to overcome, and now, researchers from Google DeepMind say they have come a step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs’ ability to generate factually accurate responses based on long-form documents. Models are also judged on whether their responses are detailed enough to provide useful, relevant answers to prompts. 

Along with the new benchmark, the researchers have released a FACTS leaderboard to the Kaggle data science community. 

As of this week, Gemini 2.0 Flash topped the leaderboard, with a factuality score of 83.6%. Others in the top 9 include Google’s Gemini 1.0 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. These all ranked above 61.7% in terms of accuracy.

The researchers say the leaderboard will be actively maintained and continually updated to include new models and their different iterations. 

“We believe that this benchmark fills a gap in evaluating a wider variety of model behaviors pertaining to factuality, in comparison to benchmarks that focus on narrower use cases…such as summarization alone,” the researchers write in a technical paper published this week.

Weeding out inaccurate responses

Ensuring factual accuracy in LLM responses is difficult because of modeling (architecture, training and inference) and measuring (evaluation methodologies, data and metrics) factors. Typically, researchers point out, pre-training focuses on predicting the next token given previous tokens. 

“While this objective may teach models salient world knowledge, it does not directly optimize the model towards the various factuality scenarios, instead encouraging the model to generate generally plausible text,” the researchers write. 

To address this, the FACTS dataset incorporates 1,719 examples — 860 public and 859 private — each requiring long-form responses based on context in provided documents. Each example includes: 

  • A system prompt (system_instruction) with general directives and the order to only answer based on provided context;
  • A task (user_request) that includes a specific question to be answered; 
  • A long document (context_document) with necessary information. 
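Put together, a single example can be sketched as a plain record. The three field names come from the benchmark description; the values and the prompt assembly below are illustrative assumptions, not the benchmark's actual data:

```python
# Illustrative sketch of one FACTS Grounding example record.
# Field names (system_instruction, user_request, context_document)
# follow the benchmark description; the values are invented.
example = {
    "system_instruction": (
        "Answer the question using only the information in the "
        "provided document. Do not rely on outside knowledge."
    ),
    "user_request": (
        "Summarize the main reasons the company's Q3 revenue declined."
    ),
    "context_document": "<full text of the annual financial report>",
}

# One plausible way to assemble the pieces into a model prompt.
prompt = "\n\n".join(
    [
        example["system_instruction"],
        example["context_document"],
        example["user_request"],
    ]
)
```

A response to such an example is labeled accurate only if it is comprehensive and every claim can be traced back to `context_document`.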

To succeed and be labeled “accurate,” the model must process the long-form document and create a subsequent long-form response that is both comprehensive and fully attributable to the document. Responses are labeled “inaccurate” if the model’s claims are not directly supported by the document, or are not highly relevant or useful.

For example, a user may ask a model to summarize the main reasons why a company’s revenue decreased in Q3, and provide it with detailed information including a company’s annual financial report discussing quarterly earnings, expenses, planned investments and market analysis. 

If a model then, say, returned: “The company faced challenges in Q3 that impacted its revenue,” it would be deemed inaccurate. 

“The response avoids specifying any reasons, such as market trends, increased competition or operational setbacks, which would likely be in the document,” the researchers point out. “It doesn’t demonstrate an attempt to engage with or extract relevant details.” 

By contrast, if a user prompted, “What are some tips on saving money?” and provided a compilation of categorized money-saving tips for college students, a correct response would be highly detailed: “Utilize free activities on campus, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.” 

DeepMind uses LLMs to judge LLMs

To allow for diverse inputs, researchers included documents of varying lengths, up to 32,000 tokens (or the equivalent of 20,000 words). These cover areas including finance, technology, retail, medicine and law. User requests are also broad, including Q&A generation, requests for summarization and rewriting. 

Each example is judged in two phases. First, responses are evaluated for eligibility: If they don’t satisfy user requests, they are disqualified. Second, responses must be hallucination-free and fully grounded in the documents provided.

These factuality scores are calculated by three different LLM judges — specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet — that determine individual scores based on the percentage of accurate model outputs. Subsequently, the final factuality determination is based on an average of the three judges’ scores.
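The two-phase, multi-judge aggregation can be sketched as follows. Only the eligibility-then-grounding flow and the plain mean over judges come from the article; the judge objects, helper names and data shapes are assumptions:

```python
# Hypothetical sketch of the two-phase scoring described above.
# A "judge" here is any object exposing is_eligible() and is_grounded();
# in the benchmark these roles are played by LLM judges
# (Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet).

def factuality_score(examples, judges):
    """Mean, over judges, of the fraction of responses each judge
    deems both eligible (answers the request) and fully grounded."""
    per_judge = []
    for judge in judges:
        accurate = 0
        for ex in examples:
            # Phase 1: disqualify responses that don't satisfy the request.
            if not judge.is_eligible(ex):
                continue
            # Phase 2: require the response to be fully grounded
            # in the provided document (hallucination-free).
            if judge.is_grounded(ex):
                accurate += 1
        per_judge.append(accurate / len(examples))
    # Final determination: average of the individual judges' scores.
    return sum(per_judge) / len(per_judge)
```

Averaging several judges, rather than trusting one, matters because of the family bias the researchers report next.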

The researchers point out that models are often biased toward members of their own model family — scoring them around 3.23% higher on average — so combining different judges was critical to help ensure responses were indeed factual.

Ultimately, the researchers emphasize that factuality and grounding are key factors to the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems,” they write. 

However, they also concede: “We are mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning.” 


