Meta’s VP of GenAI denies manipulating Llama 4’s benchmark scores


FILE PHOTO: Meta’s VP of GenAI posted a statement on X denying that the company had manipulated its AI models to perform better on certain benchmarks while hiding their limitations. 
| Photo Credit: Reuters

Meta’s VP of GenAI, Ahmad Al-Dahle, posted a statement on X denying allegations that the company had manipulated its AI models to perform better on certain benchmarks while hiding their limitations. He also addressed complaints that the Llama 4 models didn’t offer the high-quality performance that was promised.

“We’re already hearing lots of great results people are getting with these models. That said, we’re also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in,” he noted.

He added that Meta was still working to fix bugs and that any drop in quality that users were seeing was something they would need to wait out. 

“We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that,” he stated.

A test set is data held back from training and used to measure an AI model’s performance afterwards. Training on a test set would inflate a model’s benchmark scores, making it falsely appear more capable than it actually is.
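The effect of test-set contamination can be sketched with a toy example (purely illustrative; this has nothing to do with Meta’s actual training code). The “model” below simply memorises its training pairs, so it scores perfectly on any question it was trained on and fails on unseen ones:

```python
# Toy illustration of test-set contamination: a "model" that memorises
# its training examples as a lookup table.

def train(examples):
    # "Training" is pure memorisation: store question -> answer pairs.
    return dict(examples)

def evaluate(model, examples):
    # Accuracy over (question, answer) pairs; unseen questions get a
    # wrong default answer.
    correct = sum(model.get(q, "?") == a for q, a in examples)
    return correct / len(examples)

train_set = [("2+2", "4"), ("3+3", "6")]
test_set = [("4+4", "8"), ("5+5", "10")]

clean_model = train(train_set)
contaminated_model = train(train_set + test_set)  # test set leaked into training

print(evaluate(clean_model, test_set))         # 0.0 - memorisation does not generalise
print(evaluate(contaminated_model, test_set))  # 1.0 - inflated score, not real capability
```

The contaminated model’s perfect score says nothing about how it would perform on genuinely new questions, which is exactly why benchmark results from a model trained on the test set are considered meaningless.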

The rumour started after a viral post, purportedly written by a former employee, claimed that the author had quit Meta over the company’s “grey” benchmarking practices.

The viral post was not verified, but sparked questions and concerns among Meta AI users.

At the release, the company claimed that Maverick, its new mid-sized AI model, was more capable than OpenAI’s GPT-4o and just below Google’s Gemini 2.5 Pro, which currently tops the leaderboard. However, as testers began using the model from Saturday, its performance did not match Meta’s claims.

Eventually, AI researchers noticed that Meta’s own research paper stated that the version of Maverick available to the public was different from the one submitted to the performance leaderboard, LMArena. Meta referred to the leaderboard submission as an “experimental chat version” of Maverick that had been “optimised for conversationality.”

A Meta spokesperson later confirmed this, saying the model version sent to LMArena was in fact “Llama-4-Maverick-03-26-Experimental.”

Published – April 09, 2025 02:35 pm IST
