ExLlama is a more memory-efficient rewrite of the HF Transformers implementation of Llama, for use with quantized weights.
- Designed for use with quantized weights
- Fast and memory-efficient inference (not just attention)
- Maps across multiple devices
- Built-in (multi-)LoRA support
- Companion library of funky sampling functions
Note that this project is in the proof-of-concept & preview stage and major changes may occur.
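To make the feature list concrete, here is a minimal generation sketch modeled on the example scripts in the ExLlama repository. The class names and sampling settings come from those examples, but the model directory is a placeholder, and given the preview status the API may change:

```python
import os, glob

# These modules ship with the ExLlama repository itself (it is not a pip package),
# so this script is meant to run from a checkout of the repo.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/path/to/llama-13b-4bit-128g/"  # placeholder: a 4-bit GPTQ model directory

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)   # read the model's config.json
config.model_path = model_path              # point at the quantized weights

model = ExLlama(config)                     # load the weights onto the GPU
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)                 # key/value cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)

generator.settings.temperature = 0.95       # sampling settings taken from the repo examples
generator.settings.top_p = 0.65

print(generator.generate_simple("Once upon a time,", max_new_tokens=200))
```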
Hardware/Software Requirements
The author develops on an RTX 4090 and an RTX 3070-Ti. The CUDA kernels run on both cards but may not be compatible with older ones.
The author has no older cards to test on, so whether they work is unknown.
It is also unknown whether this works on Windows/WSL.
Dependencies
This list may not be complete (a quick environment check is sketched after it):
- torch — tested on 2.1.0 (nightly) with cu118; may also work on older CUDA versions
- safetensors 0.3.1
- sentencepiece
- ninja
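Since the project does not pin exact minimum versions beyond those noted, a small sanity check like the following (a sketch, not part of the project) can confirm the environment before building the extension:

```python
from importlib.metadata import version

import torch

# Tested by the author on torch 2.1.0 (nightly) with cu118.
print("torch", torch.__version__, "CUDA", torch.version.cuda)
assert torch.cuda.is_available(), "a CUDA build of torch with a visible GPU is required"

# safetensors 0.3.1 is the version noted above; the others are unpinned.
for pkg in ("safetensors", "sentencepiece", "ninja"):
    print(pkg, version(pkg))
```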
Limitations
As of now (work in progress):
- v1 models without groupsize are not supported
- Some models have been encountered with non-standard layouts and data types (such as float32 embedding tables); it will take a while to ensure all possible permutations are supported (one way to spot these is sketched below)
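For checkpoints in safetensors format, the tensor dtypes can be inspected before loading to see whether a model has one of these non-standard layouts. This is a hedged sketch using the safetensors library directly; the file name is a placeholder:

```python
import torch
from safetensors import safe_open

# "model.safetensors" is a placeholder path to the checkpoint being inspected.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        # Flag the non-standard case mentioned above, e.g. a float32 embedding table.
        if t.dtype == torch.float32:
            print("float32 tensor:", name, tuple(t.shape))
```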