type: Post
Created date: Apr 28, 2024 05:39 AM
category: LLM
tags: Machine Learning, Artificial Intelligence
status: Published
Language: Chinese
Chain-of-thought Hub is a work in progress that aims to become a unified platform for evaluating the reasoning capabilities of language models.
It covers a list of complex reasoning tasks:
  • Math (GSM8K)
    • A classic benchmark for measuring chain-of-thought math reasoning. It is not the only metric, but a useful reading of it is "how well does the model do at math while keeping its other general capabilities", which is also very hard to achieve. A minimal evaluation sketch follows this list.
  • Science (MATH)
  • Symbolic (BBH)
  • Knowledge (MMLU)
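To make the GSM8K setting concrete, here is a minimal sketch of how chain-of-thought evaluation is usually run: the model sees a few worked examples that end in "The answer is N", and the last number in its continuation is compared with the gold answer. This is not Chain-of-thought Hub's own code; `query_model` and the worked example are hypothetical placeholders.
```python
import re

# Minimal sketch of GSM8K-style chain-of-thought evaluation.
# `query_model` is a hypothetical stand-in for whatever LLM API you use.

FEW_SHOT_PROMPT = """Q: Tom has 3 boxes with 4 apples each. How many apples does he have?
A: Each box has 4 apples and there are 3 boxes, so 3 * 4 = 12. The answer is 12.

Q: {question}
A:"""


def extract_final_number(text):
    """Take the last number in the text as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def evaluate(dataset, query_model):
    """dataset: iterable of {'question': str, 'answer': str} pairs."""
    correct = 0
    for item in dataset:
        prompt = FEW_SHOT_PROMPT.format(question=item["question"])
        completion = query_model(prompt)  # model writes out its reasoning chain
        prediction = extract_final_number(completion)
        correct += prediction == extract_final_number(item["answer"])
    return correct / len(dataset)
```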
[Figure: Chain-of-thought Hub leaderboard comparing model performance on GSM8K and MMLU]
This graph shows:
  • GPT-4 clearly outperforms all other models on GSM8K and MMLU.
  • Claude is the only model family that is comparable to the GPT series.
  • Smaller models such as FlanT5 11B and LLaMA 7B lag far behind on the leaderboard, which suggests that complex reasoning may be a capability only of large models.
 
More to come
  1. MMLU (Massive Multitask Language Understanding): A large-scale language benchmark that tests model performance across a wide range of subjects and types of knowledge, from professional domains to high school-level topics.
  2. HellaSwag: A benchmark for testing a model's ability to predict the ending of a story or scenario. It challenges the AI's commonsense reasoning and understanding of everyday activities.
  3. ANLI (Adversarial NLI): A stress-test benchmark designed to evaluate the robustness of models against adversarial examples in natural language inference tasks.
  4. GSM-8K: A benchmark focused on grade school math word problems, testing the AI's ability to understand and solve mathematical questions stated in natural language.
  5. MedQA: Involves medical question answering, where the model is tested on its understanding of medical concepts and terminology, often requiring reasoning over multiple pieces of information.
  6. AGIEval: A benchmark built from human-centric standardized exams (such as college entrance and professional qualification tests) that evaluates general reasoning and problem-solving ability.
  7. TriviaQA: A question-answering benchmark in which the model must answer a large collection of trivia questions, drawing on retrieved evidence or its own knowledge.
  8. ARC-C (ARC Challenge) and ARC-E (ARC Easy): These benchmarks from the AI2 Reasoning Challenge test the model's ability to answer more difficult (Challenge) and easier (Easy) grade-school level multiple-choice science questions.
  9. PIQA (Physical Interaction QA): Tests a model's physical commonsense, assessing its understanding of the physical world through questions about everyday physical interactions.
  10. SociQA: Focuses on social commonsense, evaluating how well a model understands social norms and human interactions.
  11. BigBench-Hard: A subset of particularly hard tasks from BIG-bench (Beyond the Imitation Game benchmark), designed to test models on tasks that require advanced multi-step reasoning.
  12. WinoGrande: A dataset designed to test commonsense reasoning, specifically targeting pronoun resolution in Winograd schema-style sentences.
  13. OpenBookQA: Aims to test a model's ability to answer open-ended questions using both reasoning and knowledge that might be found in a typical school textbook.
  14. BoolQ: A question-answering dataset where models have to determine the truth value (True or False) of a question based on a passage.
  15. CommonSenseQA: A test of how well AI systems can answer questions that require commonsense knowledge to resolve.
  16. TruthfulQA: Tests a model's ability to generate truthful and non-misleading answers, focusing on honesty and factual accuracy.
  17. HumanEval: Commonly used to test code generation models, assessing their ability to write functional code from a given prompt; scoring is done by executing the generated code against unit tests (see the pass@k sketch after this list).
  18. MBPP (Mostly Basic Python Problems): Assesses a model's ability to solve basic programming problems, usually in Python.
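The two code benchmarks above (HumanEval and MBPP) are scored by functional correctness rather than text similarity: generated programs are run against unit tests, and the pass@k metric estimates how often at least one of k samples passes. Below is a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper; the per-problem sample counts are made-up illustration values.
```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the HumanEval paper.
    n = total samples generated per problem,
    c = samples that passed all unit tests,
    k = sampling budget being evaluated.
    Returns 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers only: 200 samples per problem,
# with the count of passing samples for four hypothetical problems.
passes_per_problem = [3, 0, 41, 17]
scores = [pass_at_k(n=200, c=c, k=10) for c in passes_per_problem]
print(f"pass@10 = {sum(scores) / len(scores):.3f}")
```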