
Deploying Google Gemma-3 on Denvr AI Compute: Tuning Parameters for A100 40GB and 80GB Cards
Here's my experience deploying Google's Gemma-3 12B on NVIDIA A100 40GB and 80GB cards on Denvr AI Compute. I had to tweak parameters to make the model run smoothly on each GPU type. Here's how I got it working on both, using Hugging Face's Text Generation Inference (TGI) in Docker, with practical tips if you're trying this yourself.
Why Denvr and Gemma-3?
Gemma-3 was released on 10 March 2025. Here are some of the model's key features:
Multiple Model Sizes:
Available in four variants: 1B, 4B, 12B, and 27B parameters, allowing developers to choose based on hardware capabilities and use case needs, from lightweight mobile deployment (1B) to powerful workstation inference (27B).
Multimodality:
Supports vision-language inputs (text and images) for the 4B, 12B, and 27B models, using the SigLIP image encoder, while the 1B model is text-only. Outputs remain text-based, enabling tasks like image analysis and captioning.
Extended Context Window:
Offers a 128k-token context window for the 4B, 12B, and 27B models (32k for 1B), capable of processing large inputs like a 200-page book or extensive multi-step tasks, far exceeding earlier Gemma models.
Multilingual Support:
Pre-trained on over 140 languages with optimized performance in 35+ languages, thanks to a new tokenizer, making it highly versatile for global applications.
High Performance on Single Accelerators:
Designed to excel on a single GPU or TPU, outperforming larger models like Llama-405B, DeepSeek-V3, and OpenAI’s o3-mini in preliminary human preference evaluations on the LMArena leaderboard.
Advanced Reasoning and Functionality:
Enhanced capabilities in math, coding, and instruction-following through techniques like distillation, RLHF (Reinforcement Learning from Human Feedback), and RLEF (Reinforcement Learning from Execution Feedback). Supports function calling and structured outputs for task automation.
Quantized Variants:
Includes official quantized versions (e.g., int4), reducing model size and computational demands while maintaining accuracy, ideal for resource-constrained environments.
Optimized Deployment:
Flexible deployment options across Google Cloud (Vertex AI, TPUs), NVIDIA GPUs (via NVIDIA API Catalog), AMD GPUs (ROCm stack), and on-device platforms (Google AI Edge), with seamless integration into tools like Hugging Face Transformers, PyTorch, and JAX (see the short Transformers sketch after this list).
Safety Features:
Paired with ShieldGemma 2, a 4B-parameter image safety classifier that labels content for dangerous, sexually explicit, or violent categories, customizable for responsible AI use.
Community Ecosystem (Gemmaverse):
Builds on a thriving community with over 100 million downloads and 60,000 variants, encouraging fine-tuning and innovation for diverse applications.
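Since the model plugs straight into Hugging Face Transformers, here is a minimal, illustrative sketch of loading a Gemma-3 checkpoint locally. It assumes a recent transformers release with Gemma-3 support, the accelerate package, a GPU with enough free VRAM, and an approved Hugging Face token already configured:
import torch
from transformers import pipeline

# Illustrative only: the 1B variant is text-only, so a plain text-generation
# pipeline is enough and it fits comfortably on a single GPU.
generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",   # gated repo: needs an approved HF token
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

result = generator("Explain why long context windows matter.", max_new_tokens=64)
print(result[0]["generated_text"])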
Sign up for the Denvr AI Console
https://console.cloud.denvrdata.com/

Launch the VM

Choose the fully loaded OS configuration

Wait a few minutes for the VM to be ready

You will need to register on Hugging Face and request access to the model (approval can take a few hours).
Once granted, you should see something like this on your Hugging Face profile:

Click on "deploy" and copy paste the Huggingface docker option

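The docker commands below pass HF_TOKEN into the container, so before launching it is worth confirming that the token can actually read the gated repo. A quick, illustrative check with huggingface_hub (assuming the package is installed and HF_TOKEN is exported in your shell):
import os
from huggingface_hub import HfApi

# Fails with a 401/403 error if the token is missing or access was not granted.
api = HfApi(token=os.environ["HF_TOKEN"])
info = api.model_info("google/gemma-3-12b-it")
print("Access OK:", info.id)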
Starting Point: A100 40GB Card
I began with a single A100 40GB card on Denvr. My first attempt used a basic TGI setup:
docker run -d --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -e HF_TOKEN="$HF_TOKEN" -p 8000:80 ghcr.io/huggingface/text-generation-inference:latest --model-id google/gemma-3-12b-it --max-batch-prefill-tokens 2048
The model loaded, but a simple query like "Tell me a story" crashed it with a CUDA error: an illegal memory access. nvidia-smi showed the 40GB card maxed out; activations and KV caching pushed it over the edge. I needed to slim things down.
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2025-03-12T21:07:46.993754Z ERROR batch{batch_size=1}:prefill:prefill{id=1 size=1}:prefill{id=1 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Unexpected <class 'RuntimeError'>: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2025-03-12T21:07:46.995486Z ERROR text_generation_launcher: Method ClearCache encountered an error.
After some trial and error, I tweaked the parameters to fit:
docker run -d --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN="$HF_TOKEN" \
-p 8000:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id google/gemma-3-12b-it \
--max-batch-prefill-tokens 1024 \
--max-input-tokens 512 \
--max-total-tokens 1024 \
--cuda-memory-fraction 0.8
--max-batch-prefill-tokens 1024: Cut the prefill batch size in half to reduce memory during initial token processing.
--max-input-tokens 512: Limited input to 512 tokens to keep things lean.
--max-total-tokens 1024: Set total tokens (input + output) to 1024, leaving 512 for generation.
--cuda-memory-fraction 0.8: Used 80% of the 40GB (~32GB), leaving headroom.
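Why do these numbers land roughly where they do? A back-of-envelope sketch (the architecture values below are placeholders for illustration, not Gemma-3's exact config) shows how quickly 40GB disappears:
# Rough memory budget for a ~12B-parameter model in bf16 on a 40GB card.
bytes_per_param = 2                       # bf16
weights_gb = 12e9 * bytes_per_param / 1e9
print(f"weights: ~{weights_gb:.0f} GB")   # ~24 GB before serving anything

usable_gb = 40 * 0.8                      # --cuda-memory-fraction 0.8
kv_budget_gb = usable_gb - weights_gb     # left over for KV cache + activations
print(f"left for KV cache and activations: ~{kv_budget_gb:.0f} GB")

# Per-token KV cache = layers * 2 (K and V) * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim = 48, 8, 256   # placeholder values
kv_per_token = layers * 2 * kv_heads * head_dim * bytes_per_param
print(f"~{kv_budget_gb * 1e9 / kv_per_token:,.0f} cacheable tokens under these assumptions")
The exact figures depend on the real layer count and attention layout, but the shape of the calculation explains why tight token limits and a memory-fraction cap keep the 40GB card alive.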
I tested it:
curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"inputs": "Hi there", "parameters": {"max_new_tokens": 50}}'
It worked! Memory hovered around 30GB, and it handled small queries without crashing. For bigger inputs or outputs, though, I’d need to tweak further or accept the limits.
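If you prefer Python over curl, the same smoke test with the requests package looks like this (localhost:8000 assumes you are on the VM or have an SSH tunnel to it):
import requests

# Same request as the curl call above, against TGI's /generate endpoint.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"inputs": "Hi there", "parameters": {"max_new_tokens": 50}},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])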
Scaling Up: A100 80GB Card
Next, I switched to an A100 80GB card on Denvr to see how much more I could push. With 80GB of VRAM, I could relax the constraints. Here’s the setup I landed on:
docker run -d --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN="$HF_TOKEN" \
-p 8000:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id google/gemma-3-12b-it \
--max-batch-prefill-tokens 2048 \
--max-input-tokens 2048 \
--max-total-tokens 4096
--max-batch-prefill-tokens 2048: Doubled the prefill batch size for faster processing.
--max-input-tokens 2048: Allowed longer inputs—four times the 40GB setup.
--max-total-tokens 4096: Set total tokens to 4096, giving 2048 for output.
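With the limits relaxed, here is a quick, illustrative check that longer prompts actually go through. The repeated filler sentence is just a cheap way to build an input of roughly 1,000 to 1,500 tokens, comfortably under the 2048-token cap:
import time
import requests

# Build a long prompt and time the round trip against the /generate endpoint.
long_prompt = "Summarize the following notes:\n" + ("The A100 80GB card has plenty of HBM headroom. " * 100)

start = time.time()
resp = requests.post(
    "http://localhost:8000/generate",
    json={"inputs": long_prompt, "parameters": {"max_new_tokens": 256}},
    timeout=300,
)
resp.raise_for_status()
print(f"{time.time() - start:.1f}s, {len(resp.json()['generated_text'])} characters generated")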
Quickstart Streamlit chatbot
import streamlit as st
import openai

# Default backend URL: TGI's OpenAI-compatible endpoint from the docker run above.
# Change it in the sidebar if the server is running on a remote VM.
DEFAULT_BACKEND_URL = "http://localhost:8000/v1"

# UI for setting the backend URL
st.title("Chatbot using Gemma")
st.sidebar.header("Settings")
backend_url = st.sidebar.text_input("Backend URL", DEFAULT_BACKEND_URL)

# Initialize OpenAI client with the user-defined backend URL
# (the key is read from OPENAI_API_KEY; the backend here does not check it).
client = openai.OpenAI(base_url=backend_url)

# Chat session state
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# User input
if prompt := st.chat_input("Ask something..."):
    # Add user message to session state
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Call the backend API
    try:
        response = client.chat.completions.create(
            model="google/gemma-3-12b-it",  # change this based on available models
            messages=st.session_state.messages,
        )
        reply = response.choices[0].message.content
    except Exception as e:
        reply = f"Error: {e}"

    # Display assistant response
    with st.chat_message("assistant"):
        st.markdown(reply)
    st.session_state.messages.append({"role": "assistant", "content": reply})
Save the code above as chatbot.py, then run the chatbot on your local machine:
export OPENAI_API_KEY=none
pip install streamlit openai
streamlit run chatbot.py
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8502
Network URL: http://192.168.2.129:8502
The chatbot interface will look something like this:


No issues! Memory usage hit ~50-60GB, well within the 80GB limit. It ran smoothly, even with repeated or larger requests.
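The chatbot above waits for the full completion before rendering. TGI also serves an OpenAI-compatible streaming interface, so if you want tokens to appear as they are generated, a small standalone sketch (no Streamlit, assuming the same endpoint as the docker run above) looks roughly like this:
import openai

# Stream tokens from the OpenAI-compatible endpoint and print them as they arrive.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
stream = client.chat.completions.create(
    model="google/gemma-3-12b-it",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()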
What I Learned
40GB Tuning: You can run Gemma-3 12B on a single 40GB card by keeping inputs small (512 tokens) and capping memory usage. It’s tight but doable.
80GB Freedom: The 80GB card lets you crank up to 2048 input tokens and beyond, no sweat.
Multi-GPU Option: An 8-card 40GB node works too; sharding the model across cards gives you at least as much capacity as a single 80GB card.
Key Parameters: Adjust --max-batch-prefill-tokens, --max-input-tokens, --max-total-tokens, and --cuda-memory-fraction based on your GPU and workload.
Tips for You
Monitor with nvidia-smi: Watch VRAM usage to find your limits (see the polling sketch after this list).
Start Small: On 40GB, try 512 tokens first, then scale up if stable.
Use Denvr’s Flexibility: Switching between 40GB and 80GB cards—or adding more—was a breeze on their platform.
Test Incrementally: Bump up token limits gradually and test with curl.
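For the monitoring tip, here is a minimal sketch that polls nvidia-smi from Python while you fire test requests (run it on the GPU VM; stop it with Ctrl-C):
import subprocess
import time

# Print "used, total" VRAM (in MiB) for each GPU every five seconds.
while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out, "MiB")
    time.sleep(5)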
Whether you’re on 40GB or 80GB, it’s all about finding the right parameter mix. Got questions or your own tweaks? Drop me a line—I’d love to hear how it goes for you!