From foundation model evaluation to securing Generative AI applications: PRISM Eval's expanding impact on industry standards.
The first quarter of 2025 marks a significant milestone in PRISM Eval's ongoing mission to enable reliable and controllable GenAI deployment at industrial scale. Having established strong methodologies for foundation model evaluation, we're now advancing to address the security challenges of generative AI applications and multi-agent systems—the natural next phase in our vision. This expansion comes at a crucial time as organizations increasingly deploy AI agents like Manus, OpenAI Operator, and products from Adept AI—often without adequate security safeguards.
Spotlight
1. Presenting our LLM Robustness Leaderboard at the AI Action Summit
PRISM Eval took center stage at the Grand Palais during the AI Action Summit in Paris (February 10-11) to unveil our comprehensive LLM Robustness Leaderboard. As part of France's AI Convergence challenges supported by the General Secretariat for Investment (SGPI) through the #France2030 initiative, we evaluated 41 state-of-the-art large language models against prompt injection attacks using our Behavior Elicitation Tool (BET) API.
The leaderboard results revealed critical vulnerabilities in all frontier LLMs with unprecedented granularity, both across attack scenarios (i.e. the unwanted behaviors tested) and across the different classes of prompt injection techniques used to simulate attacks.
Key findings:
- 100% of tested models could be made to display every unwanted behavior tested
- <50 estimated steps were required to break reasoning models
- >400x variation in resistance between the best and worst models
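The "estimated steps" and "variation in resistance" figures can be read together as an expected number of adversarial attempts before a behavior is first elicited. As a minimal illustration only (this is not PRISM Eval's actual BET estimator, which is not described here), assuming each attempt succeeds independently with some per-attempt probability, the expected attempt count follows a geometric distribution:

```python
def expected_steps_to_elicit(p_success: float) -> float:
    """Expected number of independent attack attempts until the first
    success, under a simple geometric model with per-attempt
    probability p_success. Illustrative only; the probabilities below
    are hypothetical, not measured values from the leaderboard."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return 1.0 / p_success

# A model that yields to ~2.5% of injection attempts breaks within
# ~40 attempts on average; one yielding to ~0.00625% of attempts
# needs ~16,000 -- a ~400x difference in resistance.
weak = expected_steps_to_elicit(0.025)
strong = expected_steps_to_elicit(0.0000625)
print(weak, strong, strong / weak)
```

Under this toy model, a 400x gap in per-attempt success probability translates directly into a 400x gap in expected effort to break the model, which is one intuitive way to interpret the resistance spread reported above.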
Our presentation on the mainstage of the Grand Palais highlighted the urgent need for Dynamic, Adversarially Optimized (DAO) evaluation methodologies that go beyond static benchmarks—particularly as more organizations deploy sophisticated LLM-powered agent systems at scale.
During the Scientific Days of the Summit, held at Ecole Polytechnique, the team presented our approach, methodology, and metric in detail. We also joined many side events and conversations, including the AI & Society House hosted by our advisor Rumman Chowdhury, where we joined a panel discussing industry, civil society, and government collaboration on AI safety and societal impact, and AI Safety Connect 2025, hosted by our advisor Cyrus Hodes, where we presented BET and the LLM Robustness Leaderboard and led a live jailbreaking demo.
2. Participating in Singapore's Global AI Assurance Pilot Program
We're proud to announce our participation as a specialized testing partner in Singapore's groundbreaking Global AI Assurance Pilot—a dedicated partnership between industry and AI assurance providers to evaluate GenAI performance and safety.
The Pilot was launched by Singapore's Infocomm Media Development Authority (IMDA) and the AI Verify Foundation on Feb. 11, during the AI Action Summit's Business Days at Station F in Paris.
This pioneering initiative affords us the opportunity to validate our product with diverse organizations from across three continents, spanning financial services, aviation, healthcare, pharma, engineering, and the public sector.
A New Testing Paradigm: What makes this program unique is its focus on testing actual deployed GenAI applications rather than just their underlying foundation models. The emphasis is on how models function within real-world applications: with real data, under real operational constraints, and with real clients.
Standards Development: The technical testing methodologies and frameworks developed through this pilot will directly shape global best practices for AI assurance, creating a common language for what constitutes trustworthy AI applications. These insights will be showcased at Asia Tech x Singapore (27-29 May 2025), positioning PRISM Eval at the forefront of defining standards in the burgeoning AI assurance market.
3. From Foundation Models to Real-Life GenAI Applications: The Next Phase in Our Vision
As the AI landscape rapidly evolves toward agent-based systems, PRISM Eval is advancing to the next phase in our roadmap: addressing the unique security challenges of multi-agent architectures. This progression from foundation models (the backbone of all GenAI systems) to complete agent applications aligns with our core mission of enabling reliable and controllable GenAI deployment at industrial scale.
Scientific Foundation: Our recent contributions include a comprehensive Multi-Agent Risk Report in collaboration with the Cooperative AI Foundation and our AAAI paper "Multi-Agent Security Tax", which examines the unique vulnerabilities introduced in agent interactions.
We've also actively contributed to the development of AILuminate v1.0—the first comprehensive industry-standard benchmark for assessing AI product risk and reliability across 12 hazard categories. This benchmark represents a crucial step toward establishing global standards for AI safety evaluation that will guide model developers, system integrators, and policymakers.
Finally, we have also actively contributed to a seminal paper proposing a framework for third-party flaw disclosure for General-Purpose AI (GPAI), aimed at improving the safety, security, and accountability of GPAI systems.
Other Highlights in Q1 2025
4. PRISM Eval at World Economic Forum and Continued Thought Leadership in AI Safety
Our leadership team participated in key industry events this quarter:
AI House @ World Economic Forum: Tom David participated in a roundtable discussion on AI governance challenges, examining the evolution of evaluation methodologies for increasingly autonomous AI systems. The conversation benefited from diverse perspectives, with notable AI researchers Yoshua Bengio, Dawn Song, and Peter Mattson also contributing their insights.
Technology Innovation Institute AI Seminar: On February 27th, Pierre Peigné delivered a comprehensive presentation on our Behavior Elicitation Tool (BET) and its application in dynamic red teaming of LLMs. The seminar introduced our novel attack success rate metrics and vulnerability mapping methodologies to a technical audience of AI researchers.
Generative AI France Meetup: Our team presented the BET leaderboard and evaluation methodology at the Théatre de L'IA, highlighting specific vulnerabilities discovered across different LLMs.
"Oil & Water: Governing AI as Global Public Good": Nicolas Miailhe (CEO) participated in a pre-AI Action Summit panel organized by Asia Society France and AI Safety Asia, discussing the tension between commercial incentives and safety imperatives in frontier AI development.
5. Global Recognition and Media Coverage
PRISM Eval's expertise in AI assurance and control has been recognized in several prominent media outlets:
- Fortune Magazine featured us in their coverage of the Paris AI Summit and Europe's AI startup ecosystem
- France24's Tech 24 program highlighted our team's demonstration of jailbreaking vulnerabilities in X.AI's Grok 3 model
- Our analysis was featured in Silicon Sands Studio's in-depth examination of AI model leaderboards
- France TV Info featured our robustness analysis in their "Vrai ou Faux" segment
- Le Figaro cited our leaderboard in their comparison of generative AI models
- Korben.info included PRISM Eval in their "11 Promising Projects" from the AI Action Summit
What's Next for PRISM Eval?
Product Developments: Our team is preparing several exciting announcements in the coming weeks related to our core technology offerings. Stay tuned for updates on how we're making our evaluation capabilities more accessible to organizations developing GenAI applications. In the meantime, you can register your interest in the BET API here.
AI Verify Foundation Collaboration: We'll showcase our technical testing methodology for real-life GenAI applications (not just their underlying foundation models) through our participation in the AI Verify Foundation's Global AI Assurance Pilot. This work aims to help shape global AI assurance best practices and will be featured at Asia Tech x Singapore 2025.
Research & Development: The lessons from our leaderboard and multi-agent evaluations are being incorporated into our next generation of testing methodologies, with several publications forthcoming.
Follow PRISM Eval on LinkedIn for the latest updates on our research and product developments.