ExLlama is a more memory-efficient rewrite of the HF transformers implementation of Llama, for use with quantized weights.

  • Designed for quantized (GPTQ) weights
  • Fast and memory-efficient inference (not just attention)
  • Mapping across multiple devices
  • Built-in (multiple) LoRA support
  • Companion library of funky sampling functions

Note that this project is in the proof-of-concept & preview stage and major changes may occur.
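
To give a sense of the API, here is a minimal generation sketch modeled on the example scripts in the repository. The class names (ExLlama, ExLlamaConfig, ExLlamaCache, ExLlamaTokenizer, ExLlamaGenerator) follow the current examples, the model path is hypothetical, and given the preview status any of this may change:

    import os, glob
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
    from tokenizer import ExLlamaTokenizer
    from generator import ExLlamaGenerator

    # Hypothetical directory holding a quantized Llama model
    model_dir = "/models/llama-13b-4bit-128g"

    config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
    config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

    model = ExLlama(config)                    # loads the quantized weights
    tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
    cache = ExLlamaCache(model)                # attention key/value cache
    generator = ExLlamaGenerator(model, tokenizer, cache)

    print(generator.generate_simple("Once upon a time,", max_new_tokens=100))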

Hardware/Software Requirements

The author develops on an RTX 4090 and an RTX 3070 Ti. Both are recent CUDA-capable NVIDIA cards; the project may not be compatible with older cards.

The author does not have access to older GPUs, so it is unclear whether they would work.

It is also not known whether the project works on Windows/WSL.
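
Since support for older cards is untested, a quick generic PyTorch check (not part of ExLlama) can at least report what the local GPU is:

    import torch

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        major, minor = torch.cuda.get_device_capability(0)
        # For reference: RTX 30-series (Ampere) is 8.6, RTX 40-series (Ada) is 8.9
        print(f"{name}: compute capability {major}.{minor}")
    else:
        print("No CUDA device detected")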

Dependencies

This list may not be complete:

  • torch: tested on 2.1.0 (nightly) with cu118; may also work on older CUDA versions
  • safetensors: 0.3.1
  • sentencepiece
  • ninja

Limitations

As of now (work in progress):

  • v1 models without groupsize are not supported
  • Models with non-standard layouts and data types have been encountered (e.g., float32 embedding tables). It will take a while to ensure that all possible permutations are supported.
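
One way to spot a non-standard checkpoint is to list the tensor names and data types in its safetensors file. A minimal sketch using the safetensors API, with a hypothetical path:

    from safetensors import safe_open

    path = "/models/llama-13b-4bit-128g/model.safetensors"  # hypothetical

    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            # e.g. a float32 embedding table would show up here
            print(f"{name}: {t.dtype} {tuple(t.shape)}")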
