Meta’s VP of GenAI denies manipulating Llama 4’s benchmark scores


FILE PHOTO: Meta’s VP of GenAI posted a statement on X denying that the company had manipulated its AI models to perform better on certain benchmarks while hiding their limitations. 
| Photo Credit: Reuters

Meta’s VP of GenAI, Ahmad Al-Dahle, posted a statement on X denying allegations that the company had manipulated its AI models to perform better on certain benchmarks while hiding their limitations. He also addressed complaints that the Llama 4 models didn’t offer the high-quality performance that was promised.

“We’re already hearing lots of great results people are getting with these models. That said, we’re also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in,” he noted.

He added that Meta was still working to fix bugs and that any drop in quality that users were seeing was something they would need to wait out. 

“We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that,” he stated.

A test set is data held back from training and used to measure an AI model’s performance afterwards. Training on a test set would inflate a model’s benchmark scores, making it falsely appear more capable than it actually is.
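The effect of test-set contamination can be sketched with a toy example (purely illustrative; this has nothing to do with Meta’s actual training code). The “model” below simply memorises its training pairs, so it scores perfectly on any question it was trained on and fails on unseen ones:

```python
# Toy illustration of test-set contamination: a "model" that memorises
# its training examples as a lookup table.

def train(examples):
    # "Training" is pure memorisation: store question -> answer pairs.
    return dict(examples)

def evaluate(model, examples):
    # Accuracy over (question, answer) pairs; unseen questions get a
    # wrong default answer.
    correct = sum(model.get(q, "?") == a for q, a in examples)
    return correct / len(examples)

train_set = [("2+2", "4"), ("3+3", "6")]
test_set = [("4+4", "8"), ("5+5", "10")]

clean_model = train(train_set)
contaminated_model = train(train_set + test_set)  # test set leaked into training

print(evaluate(clean_model, test_set))         # 0.0 - memorisation does not generalise
print(evaluate(contaminated_model, test_set))  # 1.0 - inflated score, not real capability
```

The contaminated model’s perfect score says nothing about how it would perform on genuinely new questions, which is exactly why benchmark results from a model trained on the test set are considered meaningless.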

The rumour started after a viral post, purportedly written by a former employee, claimed that the author had quit Meta over the company’s “grey” benchmarking practices.

The viral post was not verified, but sparked questions and concerns among Meta AI users.

At the release, the company claimed that Maverick, its new mid-sized AI model, was more capable than OpenAI’s GPT-4o and just below Google’s Gemini 2.5 Pro, which currently tops the leaderboard. However, as testers began using the model from Saturday, its performance did not match Meta’s claims.

Eventually, AI researchers noticed that Meta’s own research paper stated that the version of Maverick available to the public was different from the one submitted to the performance leaderboard, LMArena. Meta referred to the leaderboard submission as an “experimental chat version” of Maverick that had been “optimised for conversationality.”

A Meta spokesperson later confirmed this, saying the model version sent to LMArena was in fact “Llama-4-Maverick-03-26-Experimental.”

Published – April 09, 2025 02:35 pm IST
