LLM Model Benchmark Dataset

Study: Stanford’s VeriFact uses AI to verify LLM-generated clinical records

VeriFact analyzes statements in AI-generated clinical text against patients’ EHRs to identify factual errors, achieving 93.2% agreement with clinicians.

SiliconANGLE

MLCommons releases new AILuminate benchmark for measuring AI model safety

MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.

insideHPC

MLCommons Launches LLM Safety Benchmark

Dec. 4, 2024 — MLCommons today released AILuminate, a safety test for large language models. The v1.0 benchmark – which provides a series of safety grades for the most widely used LLMs – is the first ...

VentureBeat

Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations

Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for ...
