The impressive capabilities of "large language models" such as ChatGPT, Claude and Gemini have sparked huge interest and promise to have a deep, potentially disruptive impact on many sectors of society. Virtually every organisation's board and top management is pressuring everyone to use so-called "AI", usually without specifying exactly how, and without addressing the deep and still unsolved reliability issues of this technology.
Carnegie Mellon, Duke: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Salesforce: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
Princeton: Towards a science of AI agent reliability
Interactive results dashboard: HAL Reliability Dashboard
Meta's Head of AI Safety Just Made a Mistake That May Cause You a Certain Amount of Alarm
Cybersecurity: Agents of Chaos
Google NotebookLM
The link above provides some examples of ready-to-use public notebooks.
This is a public notebook for the students of my Computer Networks and Principles of Cybersecurity course. The content is in Italian, but you can chat in English. The "Readme" notes describe how to carry out a self-assessment on your own (the notes are in Italian; you can translate them automatically to get an idea).