Model for MMLU quiz

  • Description: MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas, such as law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.


References:

  • https://doi.org/10.48550/arXiv.2009.03300
  • https://huggingface.co/openai-community/gpt2-large
  • https://huggingface.co/meta-llama/Llama-2-7b-hf
  • https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
  • https://huggingface.co/itsliupeng/llama2_7b_zh
  • https://huggingface.co/mistralai/Mistral-7B-v0.1
  • https://huggingface.co/google/flan-t5-small
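A common way to evaluate a causal language model on MMLU's multiple-choice questions is to format each question with its four options and pick the answer letter to which the model assigns the highest likelihood. The sketch below illustrates this style of zero-shot scoring; the checkpoint name, prompt template, and example question are illustrative assumptions, not the exact format used by the ChemNLP run.sh scripts.

```python
# Minimal sketch of zero-shot MMLU scoring with a Hugging Face causal LM.
# The prompt template and example question are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-large"  # any of the benchmarked causal-LM checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LETTERS = ["A", "B", "C", "D"]

def score_choice(question: str, choices: list[str], choice_idx: int) -> float:
    """Sum of log-probabilities of the answer tokens given the formatted question."""
    prompt = question + "\n" + "\n".join(
        f"{l}. {c}" for l, c in zip(LETTERS, choices)
    ) + "\nAnswer:"
    answer = " " + LETTERS[choice_idx]
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position t predict token t+1; score only the answer tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item() for pos in answer_positions
    )

# Hypothetical chemistry-flavored MMLU-style question:
question = "What is the pH of a 0.01 M HCl solution?"
choices = ["1", "2", "7", "12"]
pred = max(range(4), key=lambda i: score_choice(question, choices, i))
print("Predicted:", LETTERS[pred])
```

Accuracy is then the fraction of test questions for which the highest-likelihood letter matches the labeled answer.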

Model benchmarks

| Model name                    | Dataset   | Accuracy | Team name | Dataset size | Date submitted | Notes                    |
|-------------------------------|-----------|----------|-----------|--------------|----------------|--------------------------|
| itsliupeng_llama2_7b_zh       | mmlu_test | 0.4506   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| google_t5_xl                  | mmlu_test | 0.4887   | ChemNLP   | 14042        | 01-30-2023     | CSV, JSON, run.sh, Info  |
| google_flan_t5_small          | mmlu_test | 0.301    | ChemNLP   | 14042        | 01-30-2023     | CSV, JSON, run.sh, Info  |
| meta-llama_Llama-2-7b-chat-hf | mmlu_test | 0.3622   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| google_t5_base                | mmlu_test | 0.342    | ChemNLP   | 14042        | 01-30-2023     | CSV, JSON, run.sh, Info  |
| meta-llama_Llama-2-7b-hf      | mmlu_test | 0.3423   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| mistralai_Mistral-7B-v0.1     | mmlu_test | 0.4876   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| openai-gpt2                   | mmlu_test | 0.2303   | ChemNLP   | 14042        | 01-30-2023     | CSV, JSON, run.sh, Info  |
| gpt2-large                    | mmlu_test | 0.2501   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
| gpt2-xl                       | mmlu_test | 0.2485   | ChemNLP   | 14042        | 01-30-2024     | CSV, JSON, run.sh, Info  |
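The dataset size column (14042) corresponds to the full MMLU test split. A quick check, assuming the cais/mmlu dataset identifier on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load the full MMLU test split (the "all" config pools the 57 subject configs).
mmlu = load_dataset("cais/mmlu", "all", split="test")
print(len(mmlu))           # expected: 14042, matching the table above
print(mmlu[0]["subject"])  # each row has question, choices, answer, subject fields
```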