
Deploying LLaMA 3.3 with Hugging Face TGI: Performance Analysis on A100 and H100 80GB GPUs

Dec 7, 2024

4 min read



Meta just dropped the new Llama 3.3 model, which brings some key improvements over earlier models. Here is my take on running and operating it with TGI.


Deploying advanced language models like LLaMA 3.3 requires meticulous planning, especially when running inference workloads on high-performance hardware like the NVIDIA A100 and H100 GPUs. In this blog, I’ll walk you through my experience of deploying LLaMA 3.3 using Hugging Face’s Text Generation Inference (TGI) framework, the challenges I encountered, and the performance insights I gathered by monitoring system metrics.


Initial Attempt: Single-GPU Deployment


The memory requirements for a 70B-parameter model like LLaMA 3.3 depend on several factors, including the precision used for the model weights (e.g., FP32, FP16, BF16, or 4-bit quantization) and the GPU architecture.


The model we are deploying is meta-llama/Llama-3.3-70B-Instruct, and per its model card:



In BF16 (16-bit precision), the memory requirements are (see the quick calculator after this list for other precisions):


  • Each parameter requires 2 bytes.

  • For a 70B parameter model: 70 × 10⁹ parameters × 2 bytes/parameter = 140 GB of memory.
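
A quick way to repeat this arithmetic for other precisions is a small back-of-the-envelope calculator. This is a sketch: the bytes-per-parameter values are the standard sizes for each dtype, and the totals cover model weights only, not KV cache, activations, or framework overhead.

    # Back-of-the-envelope weight memory for a dense LLM at different precisions.
    # Weights only -- KV cache, activations, and framework overhead come on top.

    BYTES_PER_PARAM = {
        "fp32": 4.0,
        "fp16/bf16": 2.0,
        "int8": 1.0,
        "int4": 0.5,
    }

    def weight_memory_gb(num_params: float, precision: str) -> float:
        """Approximate weight footprint in GB (10^9 bytes)."""
        return num_params * BYTES_PER_PARAM[precision] / 1e9

    if __name__ == "__main__":
        params = 70e9  # Llama-3.3-70B
        for precision in BYTES_PER_PARAM:
            print(f"{precision:>10}: {weight_memory_gb(params, precision):6.1f} GB")
        # fp16/bf16 works out to ~140 GB, which is why a single 80 GB GPU cannot hold the model.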


Unsurprisingly, my first attempt at deploying LLaMA 3.3 on a single A100-80GB GPU resulted in Out-of-Memory (OOM) errors. With 70B parameters in BF16, the model is simply too large for even an 80GB GPU without techniques like model sharding.


This failure underscored the importance of optimizing the deployment for memory-intensive workloads by distributing the model across multiple GPUs. Thus, I shifted to an 8-GPU HGX node, enabling sharding with TGI.


OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a
total capacity of 79.14 GiB of which 110.75 MiB is free. Process 58852 has 79.02
GiB memory in use. Of the allocated memory 78.46 GiB is allocated by PyTorch,
and 86.41 MiB is reserved by PyTorch but unallocated. If reserved but
unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) rank=0 
Error: ShardCannotStart
2024-12-07T23:34:22.357352Z ERROR text_generation_launcher: Shard 0 failed to start
2024-12-07T23:34:22.357373Z  INFO text_generation_launcher: 
Shutting down shards

Sharded Deployment on 8 GPUs


Using TGI's sharded mode (--num-shard 8), I successfully deployed LLaMA 3.3 across all 8 GPUs on both the A100 and H100 HGX nodes. Model sharding divides the parameters evenly across the GPUs, keeping per-GPU memory usage well within limits while maintaining high throughput.

Deployment Command Example:


    # Launch TGI with the model sharded across all 8 GPUs; container port 80 is mapped to host port 8000.
    docker run -d --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN="<token>" \
    -p 8000:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.3-70B-Instruct \
    --num-shard 8
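
Once the container is up, the server can be smoke-tested against TGI's /generate endpoint on the mapped port (8000 in the command above). The prompt and generation parameters below are purely illustrative; this is a minimal check, assuming the server is reachable on localhost:

    import requests

    # Minimal smoke test against the TGI /generate endpoint mapped to host port 8000 above.
    TGI_URL = "http://localhost:8000/generate"

    payload = {
        "inputs": "Explain model sharding in one sentence.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    }

    response = requests.post(TGI_URL, json=payload, timeout=120)
    response.raise_for_status()
    print(response.json()["generated_text"])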


Monitoring Performance Metrics


Monitoring was key to understanding the resource utilization and overall performance of the inference workload. I used two primary sources for metrics:


  1. TGI's Built-in Metrics Exporter: Captures model-specific inference metrics such as token throughput and latency.

  2. NVIDIA DCGM + Node Exporter: Provides GPU-level and node-level system metrics. (A quick endpoint check for all of these exporters follows below.)
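
Both exporters, as well as TGI itself, expose standard Prometheus text-format endpoints, so a quick way to confirm they are emitting data before wiring them into Prometheus and Grafana is to scrape them directly. This is a sketch; the ports are common defaults (TGI on its mapped API port, the DCGM exporter on 9400, node_exporter on 9100) and may differ in your setup.

    import requests

    # Quick sanity check that each metrics endpoint is up and emitting samples.
    # Ports are assumptions based on common defaults; adjust for your environment.
    ENDPOINTS = {
        "tgi": "http://localhost:8000/metrics",   # TGI serves Prometheus metrics on its API port
        "dcgm": "http://localhost:9400/metrics",  # NVIDIA DCGM exporter default
        "node": "http://localhost:9100/metrics",  # node_exporter default
    }

    for name, url in ENDPOINTS.items():
        try:
            body = requests.get(url, timeout=5).text
            samples = [line for line in body.splitlines() if line and not line.startswith("#")]
            print(f"{name:>5}: {len(samples)} metric samples")
        except requests.RequestException as exc:
            print(f"{name:>5}: unreachable ({exc})")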


Key Metrics Monitored

Metric Source        | Metric                     | Description
TGI Metrics          | tgi_requests_total         | Total requests processed by the model.
TGI Metrics          | tgi_latency_seconds        | Latency per inference request.
TGI Metrics          | tgi_throughput_tokens_sec  | Tokens generated per second.
NVIDIA DCGM Metrics  | nvidia_gpu_utilization     | GPU utilization percentage.
NVIDIA DCGM Metrics  | nvidia_memory_utilization  | Memory utilization per GPU.
Node Exporter        | node_cpu_usage             | CPU usage of the node hosting the GPUs.
Node Exporter        | node_network_bytes         | Network bandwidth during inference.


Testing the Deployment: Challenges and Approach


Performance testing large language models like LLaMA 3.3 can be as challenging as deploying them. These models, with billions of parameters, require fine-tuned environments, optimized hardware configurations, and well-structured scripts to test their capabilities and limitations effectively.

For this deployment, I used the following Python script to test the performance of the model hosted on the TGI server. Below, I’ll describe how I approached testing, the strengths and weaknesses of this method, and what I learned.


https://github.com/denvrdata/examples/blob/main/huggingface-tgi-llama-monitoring/prompt_client.py
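
The full client lives in the repo above; its core pattern is roughly the following. This is a condensed sketch rather than the exact script: the prompts, concurrency level, and generation parameters are illustrative, and the endpoint assumes TGI is exposed on localhost:8000 as in the deployment command.

    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import requests

    TGI_URL = "http://localhost:8000/generate"  # adjust to wherever TGI is exposed

    PROMPTS = [
        "Summarize the benefits of model sharding.",
        "Write a haiku about GPUs.",
        "Explain BF16 precision to a beginner.",
    ]

    def send_prompt(prompt: str) -> float:
        """Send one request and return its end-to-end latency in seconds."""
        payload = {"inputs": prompt, "parameters": {"max_new_tokens": 128, "temperature": 0.7}}
        start = time.perf_counter()
        resp = requests.post(TGI_URL, json=payload, timeout=300)
        resp.raise_for_status()
        return time.perf_counter() - start

    if __name__ == "__main__":
        # Fire the prompts concurrently to mimic several users hitting the server at once.
        with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
            futures = [pool.submit(send_prompt, p) for p in PROMPTS]
            latencies = [f.result() for f in as_completed(futures)]
        print(f"requests: {len(latencies)}, avg latency: {sum(latencies) / len(latencies):.2f}s")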

Strengths of the Test Approach


  • Concurrent Requests Simulation:

    • Using ThreadPoolExecutor allowed me to simulate multiple concurrent requests, mimicking real-world usage scenarios where multiple users interact with the model simultaneously.

    • This was particularly useful in evaluating throughput and latency.

  • Dynamic Input Generation:

    • Generating a diverse set of prompts ensured that the model's performance was tested across a variety of input types.

    • This helped uncover how the model handled different token lengths and complexities.

  • Integration Testing:

    • The script tested the end-to-end pipeline, including the model, inference server (TGI), and the GPU infrastructure, making it a comprehensive performance test.

  • GPU Resource Validation:

    • By incorporating PyTorch’s GPU checks, I ensured that the environment was properly configured for GPU usage.


Weaknesses and Challenges


  • Limited Scalability:

    • While the script supported concurrent requests, its scalability was limited by the local machine’s CPU and memory.

    • For larger-scale testing, a dedicated load-testing tool (e.g., Locust, JMeter) might be more appropriate.

  • Static Parameters:

    • Parameters like max_length, min_length, and temperature were fixed, which could limit the scope of performance insights. Dynamic tuning would provide more comprehensive results.

  • Network Bottlenecks:

    • Running the script locally against a hosted server introduced potential network latency, which could skew the results.

  • Lack of Real-time Metrics:

    • While results were logged, integrating real-time metrics visualization (e.g., Grafana dashboards) would make it easier to correlate system performance with the test load.


Performance Comparison: A100 vs. H100

I compared the performance metrics between the A100-80GB and H100-80GB HGX nodes. Here’s a summary of my findings:


H100 Dashboard Screenshots




A100 Dashboard Screenshots




PromQL Queries for Key Metrics


To visualize and alert on the key metrics, I used the following PromQL queries (a sketch for running them against the Prometheus HTTP API follows the list):


  • Request Latency:

rate(tgi_latency_seconds[1m])

  • Token Throughput:

rate(tgi_throughput_tokens_sec[1m])

  • GPU Utilization:

avg(nvidia_gpu_utilization{job="gpu-metrics-exporter"}) by (gpu)

  • Memory Usage Per GPU:

nvidia_memory_utilization{job="gpu-metrics-exporter"}

  • Power Consumption:

sum(nvidia_power_usage{job="gpu-metrics-exporter"}) by (node)
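
The same queries can also be pulled programmatically through Prometheus's HTTP API, which is handy for quick A100-vs-H100 comparisons outside Grafana. This is a sketch: the Prometheus address is an assumption, and the metric names simply mirror the queries above.

    import requests

    PROMETHEUS_URL = "http://localhost:9090"  # assumed Prometheus address

    QUERIES = {
        "request_latency": 'rate(tgi_latency_seconds[1m])',
        "token_throughput": 'rate(tgi_throughput_tokens_sec[1m])',
        "gpu_utilization": 'avg(nvidia_gpu_utilization{job="gpu-metrics-exporter"}) by (gpu)',
    }

    for name, query in QUERIES.items():
        # Instant query endpoint: /api/v1/query?query=<PromQL expression>
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
        resp.raise_for_status()
        for series in resp.json()["data"]["result"]:
            print(f"{name}: {series['metric']} -> {series['value'][1]}")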



Challenges and Lessons Learned


  • OOM Errors on Single GPUs: Highlighted the necessity of model sharding for deploying LLaMA 3.3.

  • Metric Overheads: Continuously exporting and storing metrics can add overhead; careful aggregation is needed.

  • Scaling Optimization: While 8 GPUs delivered high throughput, tuning batch_size and max_tokens further improved performance.



Conclusion


Deploying LLaMA 3.3 on multi-GPU nodes like A100 and H100 presents challenges but also offers opportunities for optimization. Hugging Face TGI, combined with robust monitoring tools, provides a scalable solution for high-throughput inference.

If you’re working on deploying large language models, ensure you leverage both hardware and software optimizations to balance performance and resource utilization.


