AutoGPTQ is a large language model quantization toolkit based on the GPTQ algorithm, with an easy-to-use, user-friendly interface.

## Performance Comparison and Inference Speed

The following results were generated by this script. The text-input batch size is 1, the decoding strategy is beam search, and the model is forced to generate 512 tokens. Speed is measured in tokens/s (higher is better). Quantized models are loaded in the way that maximizes inference speed.

| model    | GPU        | num_beams | fp16  | gptq-int4 |
|----------|------------|-----------|-------|-----------|
| llama-7b | 1xA100-40G | 1         | 18.87 | 25.53     |
| llama-7b | 1xA100-40G | 4         | 68.79 | 91.30     |
| mos…     |            |           |       |           |
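A benchmark of this shape can be sketched with AutoGPTQ's `from_quantized` loading API and Hugging Face `generate`. This is a minimal illustration, not the benchmark script referenced above: the checkpoint path is a placeholder, and speed-oriented loading options (for example fused-kernel injection) vary by AutoGPTQ version, so only core arguments are shown.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Hypothetical local directory holding a 4-bit GPTQ checkpoint of llama-7b.
model_dir = "path/to/llama-7b-4bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
# Load the quantized checkpoint onto a single GPU; additional speed-related
# options exist but depend on the installed AutoGPTQ version.
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0")

prompt = "auto-gptq is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# Beam search with min_new_tokens == max_new_tokens forces exactly
# 512 generated tokens, matching the benchmark settings described above.
output_ids = model.generate(
    **inputs,
    num_beams=4,
    min_new_tokens=512,
    max_new_tokens=512,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Setting `min_new_tokens` equal to `max_new_tokens` is what makes the comparison fair across models: every run pays for the same number of decoding steps regardless of when the model would naturally stop.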