Complete formula to get LLM VRAM usage

I would like to estimate the GPU memory (VRAM) required to run a hypothetical LLM, taking into account all relevant factors, such as:

  • P: Model parameters (total or MoE active parameters)
  • Q: Quantization bits
  • C: Context length cap (as I understand it, the context can be capped to bound memory, much like limiting the batch size)
  • ATT: Attention implementation (full attention, FlashAttention, ...)
  • Other

I understand that the usual formula found around the web,

Space = ((P × 4 bytes) / (32 / Q)) × overhead

describes part of the picture, but it does not capture the full details.
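For concreteness, here is a minimal Python sketch of that formula, plus a KV-cache term to show how the context length C enters the estimate. The function names, the 1.2 overhead factor, and the layer count, KV-head count, head dimension, and fp16 cache precision in the example are illustrative assumptions, not part of the quoted formula:

    # Minimal sketch of the quoted formula, plus an assumed KV-cache term.
    # All concrete numbers below are illustrative, not measured values.

    def weights_vram_gb(p_params: float, q_bits: int, overhead: float = 1.2) -> float:
        """Weight memory from the quoted formula: ((P * 4 bytes) / (32 / Q)) * overhead.
        Algebraically this reduces to P * Q / 8 bytes, times an empirical overhead."""
        return (p_params * 4) / (32 / q_bits) * overhead / 1024**3

    def kv_cache_vram_gb(n_layers: int, context_len: int, n_kv_heads: int,
                         head_dim: int, bytes_per_elem: int = 2) -> float:
        """Assumed KV-cache term: two tensors (K and V) per layer, each holding
        context_len * n_kv_heads * head_dim elements at bytes_per_elem each."""
        return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem / 1024**3

    # Hypothetical 7B model at 4-bit with an 8192-token context,
    # 32 layers and 8 KV heads of dimension 128 (GQA-style); all made up.
    print(weights_vram_gb(7e9, 4))             # ~3.9 GB
    print(kv_cache_vram_gb(32, 8192, 8, 128))  # 1.0 GB with these numbers

One reason a single closed-form formula is hard to write: an implementation like FlashAttention mainly reduces the temporary activation memory used while computing attention, while the stored KV cache above is unchanged by it.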
