Beyond benchmarks: How DeepSeek-R1 and o1 perform on real-world tasks


Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More


DeepSeek-R1 has surely created a lot of excitement and concern, especially for OpenAI’s rival model o1. So, we put them to test in a side-by-side comparison on a few simple data analysis and market research tasks. 

To put the models on equal footing, we used Perplexity Pro Search, which now supports both o1 and R1. Our goal was to look beyond benchmarks and see if the models can actually perform ad hoc tasks that require gathering information from the web, picking out the right pieces of data and performing simple tasks that would require substantial manual effort. 

Both models are impressive but make errors when the prompts lack specificity. o1 is slightly better at reasoning tasks but R1’s transparency gives it an edge in cases (and there will be quite a few) where it makes mistakes.

Here is a breakdown of a few of our experiments and the links to the Perplexity pages where you can review the results yourself.

Calculating returns on investments from the web

Our first test gauged whether models could calculate returns on investment (ROI). We considered a scenario where the user has invested $140 in the Magnificent Seven (Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, Tesla) on the first day of every month from January to December 2024. We asked the model to calculate the value of the portfolio at the current date.

To accomplish this task, the model would have to pull Mag 7 price information for the first day of each month, split the monthly investment evenly across the stocks ($20 per stock), sum them up and calculate the portfolio value according to the value of the stocks on the current date.

In this task, both models failed. o1 returned a list of stock prices for January 2024 and January 2025 along with a formula to calculate the portfolio value. However, it failed to calculate the correct values and basically said that there would be no ROI. On the other hand, R1 made the mistake of only investing in January 2024 and calculating the returns for January 2025.

o1’s reasoning trace does not provide enough information

However, what was interesting was the models’ reasoning process. While o1 did not provide much details on how it had reached its results, R1’s reasoning traced showed that it did not have the correct information because Perplexity’s retrieval engine had failed to obtain the monthly data for stock prices (many retrieval-augmented generation applications fail not because of the model lack of abilities but because of bad retrieval). This proved to be an important bit of feedback that led us to the next experiment.

The R1 reasoning trace reveals that it is missing information

Reasoning over file content

We decided to run the same experiment as before, but instead of prompting the model to retrieve the information from the web, we decided to provide it in a text file. For this, we copy-pasted stock monthly data for each stock from Yahoo! Finance into a text file and gave it to the model. The file contained the name of each stock plus the HTML table that contained the price for the first day of each month from January to December 2024 and the last recorded price. The data was not cleaned to reduce the manual effort and test whether the model could pick the right parts from the data.

Again, both models failed to provide the right answer. o1 seemed to have extracted the data from the file, but suggested the calculation be done manually in a tool like Excel. The reasoning trace was very vague and did not contain any useful information to troubleshoot the model. R1 also failed and didn’t provide an answer, but the reasoning trace contained a lot of useful information.

For example, it was clear that the model had correctly parsed the HTML data for each stock and was able to extract the correct information. It had also been able to do the month-by-month calculation of investments, sum them and calculate the final value according to the latest stock price in the table. However, that final value remained in its reasoning chain and failed to make it into the final answer. The model had also been confounded by a row in the Nvidia chart that had marked the company’s 10:1 stock split on June 10, 2024, and ended up miscalculating the final value of the portfolio. 

R1 hid the results in its reasoning trace along with information about where it went wrong

Again, the real differentiator was not the result itself, but the ability to investigate how the model arrived at its response. In this case, R1 provided us with a better experience, allowing us to understand the model’s limitations and how we can reformulate our prompt and format our data to get better results in the future.

Comparing data over the web

Another experiment we carried out required the model to compare the stats of four leading NBA centers and determine which one had the best improvement in field goal percentage (FG%) from the 2022/2023 to the 2023/2024 seasons. This task required the model to do multi-step reasoning over different data points. The catch in the prompt was that it included Victor Wembanyama, who just entered the league as a rookie in 2023.

The retrieval for this prompt was much easier, since player stats are widely reported on the web and are usually included in their Wikipedia and NBA profiles. Both models answered correctly (it’s Giannis in case you were curious), although depending on the sources they used, their figures were a bit different. However, they did not realize that Wemby did not qualify for the comparison and gathered other stats from his time in the European league.

In its answer, R1 provided a better breakdown of the results with a comparison table along with links to the sources it used for its answer. The added context enabled us to correct the prompt. After we modified the prompt specifying that we were looking for FG% from NBA seasons, the model correctly ruled out Wemby from the results.

Adding a simple word to the prompt made all the difference in the result. This is something that a human would implicitly know. Be as specific as you can in your prompt, and try to include information that a human would implicitly assume.

Final verdict

Reasoning models are powerful tools, but still have a ways to go before they can be fully trusted with tasks, especially as other components of large language model (LLM) applications continue to evolve. From our experiments, both o1 and R1 can still make basic mistakes. Despite showing impressive results, they still need a bit of handholding to give accurate results.

Ideally, a reasoning model should be able to explain to the user when it lacks information for the task. Alternatively, the reasoning trace of the model should be able to guide users to better understand mistakes and correct their prompts to increase the accuracy and stability of the model’s responses. In this regard, R1 had the upper hand. Hopefully, future reasoning models, including OpenAI’s upcoming o3 series, will provide users with more visibility and control.



Source link

Share

Latest Updates

Frequently Asked Questions

Related Articles

TikTok returns on Apple, Google app stores as Donald Trump delays ban

TikTok returned on the US app stores of Apple and Google on Thursday,...

Confused Senator Rages That Self-Driving Cars Are Woke

Senator Ted Cruz (R-TX) believes that topics as diverse as solar eclipses and self-driving...

AI’s biggest obstacle? Data reliability. Astronomer’s new platform tackles the challenge

Join our daily and weekly newsletters for the latest updates and exclusive content...
SULTAN88
SULTANSLOT
RAJA328
JOIN88
GFC88
HOKIBET
RUSIASLOT88
TAHU69
BONANZA99
PRAGMABET
MEGA55
LUXURY777
LUXURY333
BORJU89
QQGAMING
KEDAI168
MEGA777
NAGASLOT777
TAKSU787
KKSLOT777
MAS77TOTO
bandar55
BOS303
HOKI99
NUSA365
YUHUSLOT
KTP168
GALAXY138
NEXIA138
PETIR33
BOOM138
MEGA888
CABE888
FOSIL777
turbospin138
KAPAKBET
SUPERJP
sultankoin99
dragon88
raffi888
kenzobet
aladin666
rgo365
ubm4d
GERCEP88
VIVA99
CR777
VOXY88
delman567
intan69
CABE888
RNR303
LOGO303
PEMBURUGACOR
mpo383
cermin4d
bm88
ANGKA79
WOWHOKI
ROKET303
MPOXL
GURITA168
SUPRASLOT
SGCWIN
DESA88
ARWANA388
DAUNEMAS
ALADDIN666
BIOWIN69
SKY77
DOTA88
NAGA138
API5000
y200m
PLAYBOOK88
LUXURY12
A200M
MPO700
KENANGAN4D
cakrabola
PANDAGENDUT
MARVEL77
UG300
HOKI178
MONTE77
JASABOLA
UNTAR4D
LIDO88
MAFIABOLA77
GASPOL189
mpo999
untung138
TW88
JAGUAR33
MPOBOS
SHIO88
VIVO4D
MPOXL
JARISAKTI
BBO303
AONCASH
ANGKER4D
LEVIS4D
JAGO88
REPUBLIK365
BOSDEAL88
BOLA168
akunjp
WARTEGBET
EZEBET
88PULSA
KITAB4D
BOSDEAL88
STUDIOBET
MESINKOIN
BIMA88
PPNUSA
ABGBET88
TOP77
BAYAR77
YES77
BBTN4D
BBCA4D
VSLOTS88
MPO800
PAHALA4D
KPI4D
JURAGAN77
QQ188
BOLAPELANGI
C200M
QQ998
GWKTOGEL
MEGABANDAR
COLOWIN
VIP579
SEVEN4D
MPO188
DEWATA88
SURAT4D
SINAR123
LAMBO77
GUDANG4D
AWAN4D
PLANETLIGA
GT88
ROYALSPIN88
MAMAJITU
MITO99
PEDIA4D
WIBU69JP
333HOKI
SIDARMA88
NAGAEMAS99
HOLA88
CAKAR76
KINGTOTO
RATUGAMING
SSI168
PILAR168
ACTOTO
EYANGTOGEL
KAISAR328
SLOT628
KAISAR88
DOTA88
MAXWIN369
ALIBABA99
MM168
SQUAD777
NAGABET88
JAYABOLA
SEMPATIGAME
PANDAJAGO
PIKAT4D
SINGA77
YUYU33
MASTERPLAY99
VICTORY39
NASA4D
PERMATA55
SAKAUSLOT
CK303
MPOTOWER
CIPUTRABET
WINJUDI
DEWI5000
IYA777
MAHIRTOTO
GOSLOT88
TIPTOP4D
RAJA787
JBO680
JOKER188
EPICPLAY88
TRIVABET
KAISAR189
JOKER81
JPSPIN88
MAYORA4D
DJARUMPLAY
OVO88
BAKTI78
WINGSLOT77
ICAFE4D
PDTOTO
JETPLAY88
PORN VIDEO
https://link.space/@Hikaribet
https://bio.site/Hikaribet
https://heylink.me/Hikaribet39
CMBET88
CMBET88
didascaliasdelteatrocaminito.com
glenellynrent.com
gypsumboardequipment.com
realseller.org
https://harrysphone.com/upin
gyergyoalfalu.ro/tokek
vipokno.by/gokil
winjospg.com
winjos801.com/
www.logansquarerent.com
internationalfintech.com/bamsz
condowizard.ca
jawatoto889.com
hikaribet3.live
hikaribet1.com
heylink.me/hikaribet
www.nomadsumc.org
condowizard.ca/aromatoto
euro2024gol.com
www.imaracorp.com
daftarsekaibos.com
stuffyoucanuse.org/juragan
Toto Macau 4d
Aromatoto
Lippototo
Mbahtoto
Winjos
152.42.229.23
bandarlotre126.com
heylink.me/sekaipro
www.get-coachoutletsonline.com
wholesalejerseyslord.com
Lippototo
Zientoto
Lippototo
Situs Togel Resmi
Fajartoto
Situs Togel
Toto Macau
Winjos
Winlotre
Aromatoto
design-develop-test.com
winlotre.online
winlotre.xyz
winlotre.us
winlotrebandung.com
winlotrepalu.com
winlotresurabaya.shop
winlotrejakarta.com
winlotresemarang.shop
winlotrebali.shop
winlotreaceh.shop
winlotremakmur.com
Dadu Online
Taruhantoto
a Bandarlotre
bursaliga
lakitoto
aromatoto
Rebahin
untungslot.pages.dev
slotpoupler.pages.dev
rtpliveslot88a.pages.dev
tipsgameslot.pages.dev
pilihslot88.pages.dev
fortuertiger.pages.dev
linkp4d.pages.dev
linkslot88a.pages.dev
slotpgs8.pages.dev
markasjudi.pages.dev
saldo69.pages.dev
slotbenua.pages.dev
saingtoto.pages.dev
markastoto77.pages.dev
jowototo88.pages.dev
sungli78.pages.dev
volatilitas78.pages.dev
bonusbuy12.pages.dev
slotoffiline.pages.dev
dihindari77.pages.dev
rtpdislot1.pages.dev
agtslot77.pages.dev
congtoto15.pages.dev
hongkongtoto7.pages.dev
sinarmas177.pages.dev
hours771.pages.dev
sarana771.pages.dev
kananslot7.pages.dev
balitoto17.pages.dev
jowototo17.pages.dev
aromatotoding.com
unyagh.org
fairparkcounseling.com/gap/
impress-newtex.com/ajax/
SULTAN88
SULTANSLOT
RAJA328
JOIN88+
HOKIBET
GFC88
RusiaSlot88
Tahu69
BONANZA99
Pragmabet
mega55
luxury777
luxury333
borju89
qqgaming
KEDAI168
mega777
nagaslot777
TAKSU787
kkslot777
MAS77TOTO
BANDAR55+
BOS303
Login-HOKI99/
NUSA365
YUHUSLOT
ktp168
GALAXY138