Tekmono
AI Benchmarking Controversy Erupts Over Pokémon Game Test

by Tekmono Editorial Team
17/04/2025
in News

The controversy over artificial intelligence benchmarking has reached unexpected territory: a recent claim that Google’s Gemini model outperformed Anthropic’s Claude model in the original Pokémon games has sparked debate over benchmarking methods.

A post on X went viral last week, claiming that Google’s latest Gemini model had surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. According to the post, Gemini had reached Lavender Town on a developer’s Twitch stream, while Claude was still stuck at Mount Moon as of late February. The post read, “Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town,” and was accompanied by a screenshot of the stream, with the remark “119 live views only btw, incredibly underrated stream.”

However, it was later revealed that Gemini had an unfair advantage. Reddit users pointed out that the developer maintaining the Gemini stream had built a custom minimap that helps the model identify “tiles” in the game, such as cuttable trees. This custom minimap reduces the need for Gemini to analyze screenshots before making gameplay decisions, giving it a significant edge.

The use of Pokémon as a benchmark, although semi-serious at best, serves as an instructive example of how different implementations of a benchmark can influence results. The controversy highlights the imperfections of AI benchmarking and how custom implementations can make it challenging to compare models accurately.

This issue is not unique to Pokémon. Anthropic reported two different scores for its Claude 3.7 Sonnet model on the SWE-bench Verified benchmark, which evaluates a model’s coding abilities. Without a “custom scaffold,” Claude 3.7 Sonnet achieved 62.3% accuracy, but with the custom scaffold, the accuracy increased to 70.3%. Similarly, Meta fine-tuned a version of its Llama 4 Maverick model to perform better on the LM Arena benchmark, and the fine-tuned version scored significantly higher than the vanilla version on the same evaluation.
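To make the point concrete, here is a minimal sketch of how a results table could record the setup next to each score, so that numbers are only compared like for like. The `Result` class, the `setup` labels, and both helper functions are hypothetical names chosen for illustration; only the two Claude 3.7 Sonnet figures come from the reporting above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    benchmark: str
    setup: str    # hypothetical label for the harness or scaffold used
    score: float  # percent of tasks solved


def comparable(a: Result, b: Result) -> bool:
    """Two scores are only like-for-like if benchmark and setup both match."""
    return a.benchmark == b.benchmark and a.setup == b.setup


def setup_gap(a: Result, b: Result) -> float:
    """Percentage-point gap between two runs, rounded to one decimal."""
    return round(abs(a.score - b.score), 1)


# The two Claude 3.7 Sonnet figures quoted above.
plain = Result("Claude 3.7 Sonnet", "SWE-bench Verified", "no scaffold", 62.3)
scaffolded = Result("Claude 3.7 Sonnet", "SWE-bench Verified", "custom scaffold", 70.3)

gap = setup_gap(plain, scaffolded)
print(comparable(plain, scaffolded), gap)
```

In this sketch the same model on the same benchmark differs by 8 percentage points purely because of the setup, which is why a headline score without its setup label says little about how two models actually compare.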

Given that AI benchmarks are imperfect measures to begin with, custom and non-standard implementations further complicate the comparison of models. As a result, comparing models is likely to become harder with each new release.

Tekmono is a Linkmedya brand. © 2015.
