OpenZeppelin says EVMbench dataset contains training data leaks


Blockchain security company OpenZeppelin says it found methodological flaws and data contamination after testing EVMbench, OpenAI’s new artificial intelligence benchmark for blockchain security.

EVMbench was launched in mid-February in partnership with crypto investment firm Paradigm. It is designed to evaluate how well different artificial intelligence models can detect, patch, and exploit vulnerabilities in smart contracts.

In an X post on Monday, OpenZeppelin said it welcomed the initiative, but recently decided to put EVMbench “through the same scrutiny” it applies to all the protocols it helps secure, including decentralized finance protocols such as Aave, Lido and Uniswap.

Based on its audit, OpenZeppelin said it found two key issues: training data contamination and invalid classifications of several high-severity vulnerabilities.

OpenZeppelin said its review of the dataset identified methodological flaws and invalid vulnerability classifications, including at least four issues rated high severity that are not exploitable in practice.

[Image. Source: OpenZeppelin]

The EVMbench release included an assessment of how well AI agents could theoretically exploit vulnerabilities in smart contracts. Anthropic’s Claude Opus 4.6 took first place, followed by OpenAI’s OC-GPT-5.2 and Google’s Gemini 3 Pro.

The EVMbench test can be revised

Addressing the first issue, data contamination, OpenZeppelin said the most important ability in “AI security is to find new vulnerabilities in code that the model has not seen before.”

However, OpenZeppelin said the AI agents that scored highest on EVMbench had likely been exposed to the benchmark’s vulnerability reports during pretraining.

EVMbench cut off internet access for the AI agents during testing, so they could not simply search for a solution to the problem. However, the benchmark was built on vulnerabilities compiled from 120 audits between 2024 and mid-2025, and the training cutoff for these agents was generally mid-2025.

That created a risk that the AI agents had already seen the answers to the benchmark’s problems during training.

“Although this does not necessarily allow the model to detect the problem immediately, it reduces the quality of the test. The limited size of the data set further narrows the level of evaluation, making this contamination more important,” said OpenZeppelin.

On the second issue, OpenZeppelin said the EVMbench dataset contains significant factual errors, with several vulnerabilities labeled high severity turning out to be invalid.

OpenZeppelin said it identified at least four vulnerabilities that EVMbench classified as high severity but that are not actually exploitable. Even so, EVMbench credited AI agents as correct for finding these purported vulnerabilities.

“These are not extreme subjective differences, they are findings that the described exploit does not work.”

Finally, OpenZeppelin reiterated that AI will have a significant impact on strengthening blockchain security, but emphasized the importance of implementing and testing the technology properly to maximize its potential.

“The question is not whether AI will change smart contract security. The question is whether the data and metrics we use to build and evaluate these tools meet the same standard as the contracts they are designed to protect.”
