Complete formula to get LLM VRAM usage

I would like to estimate the GPU memory (VRAM) required to run a hypothetical LLM, taking into account all relevant factors, such as:

  • P: Model parameters (total or MoE active parameters)
  • Q: Quantization bits
  • C: Context length cap (as I understand it, the context can be capped to bound memory, much like limiting the batch size)
  • ATT: Attention implementation (full attention, FlashAttention, ...)
  • Other

I understand that the usual formula found around the web,

Space = ((P × 4 bytes) / (32 / Q)) × overhead

describes part of the picture, but it does not capture the full details.
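For concreteness, here is a minimal Python sketch of that formula, plus a KV-cache term to show how the context length C enters the estimate. The function names, the 1.2 overhead factor, and the layer count, KV-head count, head dimension, and fp16 cache precision in the example are illustrative assumptions, not part of the quoted formula:

    # Minimal sketch of the quoted formula, plus an assumed KV-cache term.
    # All concrete numbers below are illustrative, not measured values.

    def weights_vram_gb(p_params: float, q_bits: int, overhead: float = 1.2) -> float:
        """Weight memory from the quoted formula: ((P * 4 bytes) / (32 / Q)) * overhead.
        Algebraically this reduces to P * Q / 8 bytes, times an empirical overhead."""
        return (p_params * 4) / (32 / q_bits) * overhead / 1024**3

    def kv_cache_vram_gb(n_layers: int, context_len: int, n_kv_heads: int,
                         head_dim: int, bytes_per_elem: int = 2) -> float:
        """Assumed KV-cache term: two tensors (K and V) per layer, each holding
        context_len * n_kv_heads * head_dim elements at bytes_per_elem each."""
        return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem / 1024**3

    # Hypothetical 7B model at 4-bit with an 8192-token context,
    # 32 layers and 8 KV heads of dimension 128 (GQA-style); all made up.
    print(weights_vram_gb(7e9, 4))             # ~3.9 GB
    print(kv_cache_vram_gb(32, 8192, 8, 128))  # 1.0 GB with these numbers

One reason a single closed-form formula is hard to write: an implementation like FlashAttention mainly reduces the temporary activation memory used while computing attention, while the stored KV cache above is unchanged by it.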
