Model for MMLU quiz

  • Description: MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas, such as law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.


References:

  • https://doi.org/10.48550/arXiv.2009.03300
  • https://huggingface.co/openai-community/gpt2-large
  • https://huggingface.co/meta-llama/Llama-2-7b-hf
  • https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
  • https://huggingface.co/itsliupeng/llama2_7b_zh
  • https://huggingface.co/mistralai/Mistral-7B-v0.1
  • https://huggingface.co/google/flan-t5-small
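A common way to evaluate a causal language model on MMLU's multiple-choice questions is to format each question with its four options and pick the answer letter to which the model assigns the highest likelihood. The sketch below illustrates this style of zero-shot scoring; the checkpoint name, prompt template, and example question are illustrative assumptions, not the exact format used by the ChemNLP run.sh scripts.

```python
# Minimal sketch of zero-shot MMLU scoring with a Hugging Face causal LM.
# The prompt template and example question are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-large"  # any of the benchmarked causal-LM checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LETTERS = ["A", "B", "C", "D"]

def score_choice(question: str, choices: list[str], choice_idx: int) -> float:
    """Sum of log-probabilities of the answer tokens given the formatted question."""
    prompt = question + "\n" + "\n".join(
        f"{l}. {c}" for l, c in zip(LETTERS, choices)
    ) + "\nAnswer:"
    answer = " " + LETTERS[choice_idx]
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position t predict token t+1; score only the answer tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item() for pos in answer_positions
    )

# Hypothetical chemistry-flavored MMLU-style question:
question = "What is the pH of a 0.01 M HCl solution?"
choices = ["1", "2", "7", "12"]
pred = max(range(4), key=lambda i: score_choice(question, choices, i))
print("Predicted:", LETTERS[pred])
```

Accuracy is then the fraction of test questions for which the highest-likelihood letter matches the labeled answer.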

Model benchmarks

| Model name                    | Dataset   | Accuracy | Team name | Dataset size | Date submitted | Notes                    |
|-------------------------------|-----------|----------|-----------|--------------|----------------|--------------------------|
| itsliupeng_llama2_7b_zh       | mmlu_test | 0.4506   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| google_t5_xl                  | mmlu_test | 0.4887   | ChemNLP   | 14042        | 01-30-2023     | CSV, JSON, run.sh, Info  |
| google_flan_t5_small          | mmlu_test | 0.301    | ChemNLP   | 14042        | 01-30-2023     | CSV, JSON, run.sh, Info  |
| meta-llama_Llama-2-7b-chat-hf | mmlu_test | 0.3622   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| google_t5_base                | mmlu_test | 0.342    | ChemNLP   | 14042        | 01-30-2023     | CSV, JSON, run.sh, Info  |
| meta-llama_Llama-2-7b-hf      | mmlu_test | 0.3423   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| mistralai_Mistral-7B-v0.1     | mmlu_test | 0.4876   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| openai-gpt2                   | mmlu_test | 0.2303   | ChemNLP   | 14042        | 01-30-2023     | CSV, JSON, run.sh, Info  |
| gpt2-large                    | mmlu_test | 0.2501   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| gpt2-xl                       | mmlu_test | 0.2485   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
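The dataset size column (14042) corresponds to the full MMLU test split. A quick check, assuming the cais/mmlu dataset identifier on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load the full MMLU test split (the "all" config pools the 57 subject configs).
mmlu = load_dataset("cais/mmlu", "all", split="test")
print(len(mmlu))           # expected: 14042, matching the table above
print(mmlu[0]["subject"])  # each row has question, choices, answer, subject fields
```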