BigCode is an open scientific collaboration dedicated to the development of large language models. It recently open-sourced a language model called SantaCoder, which has 1.1 billion parameters and can generate code and suggest completions in several programming languages, such as Python, Java, and JavaScript.
According to the official release, SantaCoder was trained on The Stack (v1.1) dataset. Although SantaCoder is relatively small, with 1.1 billion parameters compared to InCoder (6.7 billion) or CodeGen-Multi (2.7 billion), its performance is considerably better than these larger multilingual models. Its parameter count is still far below that of GPT-3 and other very large language models exceeding 100 billion parameters, and its range of supported programming languages is also relatively limited: only Python, Java, and JavaScript.
To protect user privacy and ensure training quality, before training the model BigCode annotated 400 samples and established (and continuously refined) RegEx rules to remove sensitive information such as email addresses, secret keys, and IP addresses from the code in the dataset.
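As an illustration of this kind of rule-based redaction, here is a minimal sketch in Python. The patterns and placeholder tokens are hypothetical simplifications for illustration only, not BigCode's actual rules:

```python
import re

# Hypothetical, simplified patterns; BigCode's actual rules are more
# elaborate and were tuned against 400 annotated samples.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(source: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholder tokens."""
    source = EMAIL_RE.sub("<EMAIL>", source)
    source = IPV4_RE.sub("<IP_ADDRESS>", source)
    return source

print(redact("# maintainer: alice@example.com, staging host 10.0.0.12"))
# -> "# maintainer: <EMAIL>, staging host <IP_ADDRESS>"
```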
To let developers use code generated by SantaCoder with confidence, BigCode launched a Dataset Search tool. With it, developers can trace generated code back to its source, so that if code produced by SantaCoder comes from a particular project, users can comply with that project's licensing requirements.
In addition, BigCode also launched the "Am I in The Stack?" tool, with which developers can check whether repositories under their name are part of the training dataset and can have their open source repositories removed from it.
BigCode currently provides a demo of SantaCoder on the Hugging Face website for anyone to explore and try.
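Beyond the hosted demo, the released checkpoint can also be loaded directly with the transformers library. The sketch below assumes the checkpoint is published on the Hugging Face Hub as bigcode/santacoder and that transformers and PyTorch are installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"  # assumed Hub ID of the released model

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# The checkpoint ships custom modeling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Ask the model to complete a Python function signature.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```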