QLoRA, proposed by the University of Washington, is an efficient fine-tuning method that reduces memory usage enough to fine-tune a 65B-parameter model on a single 48GB GPU while retaining full 16-bit fine-tuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pre-trained language model into Low-Rank Adapters (LoRA). The project team also released a family of large language models named Guanaco, which outperforms all previously publicly released models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance level while requiring only 24 hours of fine-tuning on a single GPU.
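For readers who want to see what this recipe looks like in practice, here is a minimal sketch using the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint name and LoRA hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal QLoRA-style setup: frozen 4-bit base model + trainable LoRA adapters.
# Model name and LoRA hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # assumed example checkpoint

# 4-bit NF4 quantization with double quantization, as described in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the frozen, 4-bit quantized pre-trained model
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank (LoRA) adapters; only these receive gradient updates
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of all parameters
```

Because gradients flow only into the small adapter weights while the quantized base model stays frozen, the memory footprint is dominated by the 4-bit weights rather than optimizer states for the full model.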
#Efficient #finetuning #QLoRA #quantized #LLM