The RedPajama-Data repository contains code for preparing large datasets for training large language models. The repo provides a reproducible data recipe for the RedPajama dataset, with the following token counts:

| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |
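
As a quick sanity check, the per-source counts in the table sum to roughly the advertised 1.2 trillion tokens. A minimal Python sketch of that arithmetic is below; the dictionary simply mirrors the table above and the variable names are illustrative, not part of the repo:

```python
# Per-source token counts (in billions), copied from the table above.
token_counts_billions = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}

# Sum the sources and express the total in trillions.
total_billions = sum(token_counts_billions.values())
print(f"Total: {total_billions} billion tokens (~{total_billions / 1000:.2f} trillion)")
# -> Total: 1210 billion tokens (~1.21 trillion), consistent with the ~1.2T figure.
```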