Figuring out general specs for running LLM models – Deep-learning

by Ali Hasan
artificial-intelligence deep-learning gpt-3 large-language-model

The Problem:

I have a machine learning model with a certain number of parameters. How much GPU RAM do I need to run the model? If I don’t have enough GPU RAM, can I run the model on CPU-RAM instead? Can I run the model on a combination of GPU-RAM and CPU-RAM?

The Solutions:

Solution 1: Computing GPU RAM Requirement for LLM Models

To estimate the GPU RAM required for running Large Language Models (LLMs), consider the following formula:

total_memory = p * (params + activations)

where:

  • p is the precision in bits (e.g., 32 for float32), so the result is in bits; divide by 8 for bytes
  • params is the number of model parameters
  • activations is the memory occupied by activations during inference

Calculating activations

The activations per layer are given by:

activations_layer = s * b * h * (34 + (5 * a * s) / h)

where:

  • s is the sequence length
  • b is the batch size
  • h is the hidden dimension
  • a is the number of attention heads

The total activations across all layers are:

activations = l * (5/2) * a * b * s^2 + 17 * b * h * s

where l is the number of layers.

Example

For a model with 7 billion parameters, precision of 32, batch size of 1, sequence length of 2048, 32 layers, 32 attention heads, and a hidden dimension of 4096, the estimated GPU RAM requirement is approximately 66 GB.
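
As a rough check, the sketch below plugs the example numbers into the formulas exactly as written above (note that the total-activations expression uses coefficients half those of the per-layer formula) and reproduces the roughly 66 GB figure; the variable names are chosen only for illustration.

# Worked example of the formulas above (plain Python, names chosen for illustration)
p = 32                                   # precision in bits (float32)
params = 7e9                             # 7 billion parameters
s, b, h, a, l = 2048, 1, 4096, 32, 32    # seq length, batch size, hidden dim, heads, layers

activations = l * (5 / 2) * a * b * s**2 + 17 * b * h * s
total_bits = p * (params + activations)
print(total_bits / 8 / 1024**3)          # ~66.6 GiB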

Optimization

Quantization techniques, such as 4-bit quantization, can drastically reduce the memory requirement. For example, at 4 bits per value the estimate for the above model drops to roughly 8 GB.
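
As a sketch, storing the same parameter and activation counts at 4 bits per value (an assumption; the post does not spell out the quantization scheme) gives:

# Same parameter/activation counts as above, stored at 4 bits per value
params = 7e9
activations = 32 * (5 / 2) * 32 * 1 * 2048**2 + 17 * 1 * 4096 * 2048
bits_per_value = 4                       # assumed 4-bit quantization
print((params + activations) * bits_per_value / 8 / 1024**3)   # ~8.3 GiB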

Solution 2:

How much VRAM do you need to run the model?

For inference, models often run in float16 precision, which uses 2 bytes per parameter. To estimate the required VRAM, multiply the number of parameters (in billions) by 2 to get a figure in GB. For example, a 7B-parameter model needs approximately 14 GB of VRAM to run in float16 precision.
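
A one-line sketch of this rule of thumb (the function name is just illustrative):

def vram_gb_float16(params_in_billions):
    # 2 bytes per parameter in float16, ignoring activation and framework overhead
    return params_in_billions * 2

print(vram_gb_float16(7))   # ~14 GB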

Can you run the model on CPU if you have enough RAM?

Yes, it is possible to run an LLM model on CPU if you have sufficient RAM. However, this depends on the specific model and the library you are using. Some layers may not be implemented for CPU, and running the model on CPU will generally be slower than running it on a GPU.
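
As a minimal sketch (assuming the Hugging Face Transformers library and the OpenAssistant StableLM checkpoint used later in this post), loading without a device_map keeps everything on CPU:

# Sketch: run a model entirely on CPU (assumes enough system RAM and that every
# layer used by the model has a CPU implementation)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/stablelm-7b-sft-v7-epoch-3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)   # no device_map: stays on CPU

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))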

Can you run LLM models in mixed mode (GPU-RAM and CPU-RAM)?

Yes, it is possible to run LLM models in mixed mode. Many libraries support running some layers on CPU and others on GPU. For example, the Hugging Face Transformers library supports auto-mapping layers to all available devices. To enable auto-mapping, set the device_map parameter to “auto” when loading the model.


from transformers import AutoModelForCausalLM, AutoTokenizer

# Let the library split the model across the available GPU(s) and CPU
tokenizer = AutoTokenizer.from_pretrained(
    "OpenAssistant/stablelm-7b-sft-v7-epoch-3"
)
model = AutoModelForCausalLM.from_pretrained(
    "OpenAssistant/stablelm-7b-sft-v7-epoch-3",
    device_map="auto",
)
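
After loading this way, the resulting layer-to-device assignment can be inspected (a quick check, assuming the model was dispatched with device_map="auto" as above):

# Shows which device each module ended up on (GPU index, "cpu", or "disk")
print(model.hf_device_map)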

Q&A

How can I figure out the GPU RAM required to run an LLM model given the count of its parameters in billions?

Refer to the paper ‘Reducing Activation Recomputation in Large Transformer Models’ for the formula that estimates the size of a Transformer layer, then use it (as in Solution 1 above) to determine the required RAM.

Can LLM models run on CPU-RAM if sufficient RAM is available?

Yes, but it depends on the model and library used. Some layers may not be implemented for CPU.

Is it possible to run LLM models in mixed GPU-RAM and CPU-RAM?

Yes, many libraries support running some layers on CPU and others on GPU. For example, the Hugging Face Transformers library allows auto-mapping of layers to the available devices.

Video Explanation:

The following video, titled "Sam Altman: OpenAI CEO on GPT-4, ChatGPT, and the Future of AI ...", provides additional insights and in-depth exploration related to the topics discussed in this post.
