ExLlama is a more memory-efficient rewrite of the HF Transformers implementation of Llama, for use with quantized weights.
- Designed for use with quantized weights
- Fast and memory-efficient inference (not just attention)
- Maps across multiple devices
- Built-in (multi-)LoRA support
- Companion library of funky sampling functions
Note that this project is in the proof-of-concept & preview stage and major changes may occur.
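To make the feature list concrete, here is a minimal generation sketch modeled on the example scripts in the ExLlama repository. The class names and sampling settings come from those examples, but the model directory is a placeholder, and given the preview status the API may change:

```python
import os, glob

# These modules ship with the ExLlama repository itself (it is not a pip package),
# so this script is meant to run from a checkout of the repo.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/path/to/llama-13b-4bit-128g/"  # placeholder: a 4-bit GPTQ model directory

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)   # read the model's config.json
config.model_path = model_path              # point at the quantized weights

model = ExLlama(config)                     # load the weights onto the GPU
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)                 # key/value cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)

generator.settings.temperature = 0.95       # sampling settings taken from the repo examples
generator.settings.top_p = 0.65

print(generator.generate_simple("Once upon a time,", max_new_tokens=200))
```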
Hardware/Software Requirements
The author develops on an RTX 4090 and an RTX 3070-Ti. The CUDA kernels run on both cards but may not be compatible with older ones.
The author has no older cards to test on, so whether they work is unknown.
It is also unknown whether this works on Windows/WSL.
Dependencies
This list may not be complete (a quick environment check is sketched after it):
- torch — tested on 2.1.0 (nightly) with cu118; may also work on older CUDA versions
- safetensors 0.3.1
- sentencepiece
- ninja
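Since the project does not pin exact minimum versions beyond those noted, a small sanity check like the following (a sketch, not part of the project) can confirm the environment before building the extension:

```python
from importlib.metadata import version

import torch

# Tested by the author on torch 2.1.0 (nightly) with cu118.
print("torch", torch.__version__, "CUDA", torch.version.cuda)
assert torch.cuda.is_available(), "a CUDA build of torch with a visible GPU is required"

# safetensors 0.3.1 is the version noted above; the others are unpinned.
for pkg in ("safetensors", "sentencepiece", "ninja"):
    print(pkg, version(pkg))
```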
Limitations
As of now (work in progress):
- v1 models without groupsize are not supported
- Some models have been encountered with non-standard layouts and data types (such as float32 embedding tables); it will take a while to ensure all possible permutations are supported (one way to spot these is sketched below)
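For checkpoints in safetensors format, the tensor dtypes can be inspected before loading to see whether a model has one of these non-standard layouts. This is a hedged sketch using the safetensors library directly; the file name is a placeholder:

```python
import torch
from safetensors import safe_open

# "model.safetensors" is a placeholder path to the checkpoint being inspected.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        # Flag the non-standard case mentioned above, e.g. a float32 embedding table.
        if t.dtype == torch.float32:
            print("float32 tensor:", name, tuple(t.shape))
```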