Shusheng Puyu (InternLM) is a multilingual large-scale language model developed by Shanghai Artificial Intelligence Laboratory and SenseTime (equal contribution), together with the Chinese University of Hong Kong, Fudan University, and Shanghai Jiao Tong University.
InternLM is a multilingual base language model with 104B parameters. It is pre-trained with a multi-stage progressive process on a large corpus of 1.6T tokens and then fine-tuned to align with human preferences. A training system named Uniscale-LLM has also been developed for efficient large-scale language model training.
Evaluations on multiple benchmarks show that InternLM achieves state-of-the-art performance in knowledge comprehension, reading comprehension, mathematics, and coding. With such comprehensive capabilities, InternLM has achieved excellent results in comprehensive exams including MMLU, AGIEval, C-Eval and GAOKAO-Bench without resorting to external tools.
On these benchmarks, InternLM not only significantly outperforms open-source models, but also achieves superior performance compared to ChatGPT. Furthermore, InternLM demonstrates excellent understanding of the Chinese language and Chinese culture, which makes it a suitable base model to support Chinese-oriented language applications.
Main results
As the latest large-scale language models begin to exhibit human-level intelligence, exams designed for humans such as China’s college entrance examination, the US SAT, and GRE are considered important means of evaluating language models. In its technical report on GPT-4, OpenAI tested GPT-4 with exams across multiple domains, with exam scores as the key outcome.
The project team tested InternLM against others on four comprehensive exam benchmarks, as follows:
- MMLU: A multi-task benchmark based on various US exams, covering elementary mathematics, physics, chemistry, computer science, US history, law, economics, diplomacy, and more.
- AGIEval: A benchmark developed by Microsoft Research to evaluate the ability of language models through human-oriented exams. It includes 19 task sets drawn from exams in China and the United States, such as China's college entrance examination (gaokao) and bar exam, as well as the US SAT, LSAT, GRE, and GMAT. Of the 19 task sets, 9 are based on the gaokao; these are singled out as an important subset named AGIEval (GK).
- C-Eval: A comprehensive benchmark designed for evaluating Chinese language models, containing nearly 14,000 questions across 52 subjects, covering mathematics, physics, chemistry, biology, history, politics, computer science, and other subjects, as well as professional examinations for civil servants, certified public accountants, lawyers, and doctors.
- GAOKAO-Bench: A comprehensive benchmark based on China's Gaokao, covering all Gaokao subjects and offering different question types, including multiple choice, fill in the blank, and essay questions. For brevity, this benchmark is referred to below simply as the college entrance examination.
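All four benchmarks above are largely scored as multiple-choice exams: the model is shown a question with lettered options and is judged on whether it picks the correct letter. As a rough illustration (the prompt template and helper names below are assumptions for this sketch, not InternLM's actual evaluation harness), such scoring can look like this:

```python
# Hypothetical sketch of MMLU-style multiple-choice scoring.
# The prompt format and function names are illustrative assumptions,
# not taken from the InternLM technical report.

def format_mc_prompt(question, choices):
    """Render a question and its four options as an exam-style prompt."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(predictions, answer_key):
    """Fraction of questions where the predicted letter matches the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

prompt = format_mc_prompt(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Earth", "Mars"],
)
print(prompt)
# A model's predicted letters would be compared against the answer key:
print(accuracy(["B", "A", "C"], ["B", "B", "C"]))  # 2 of 3 correct
```

In practice, harnesses feed the formatted prompt to the model, read off the option letter it assigns the highest likelihood, and report the resulting accuracy per subject and overall.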
MMLU results
AGIEval results
C-Eval results
C-Eval has a real-time leaderboard. Below is a screenshot showing all results (as of June 1, 2023).
GAOKAO-Bench results
For more detailed results, refer to the technical report.