Running Gemma 3 - Multimodal LLMs at Home! Full Tutorial with Installation Guide

πŸš€ Gemma 3 Google AI: Best Local Vision LLM Ever?! (But…)

In this blog post, we’ll be diving into Google’s latest multimodal LLM, Gemma 3, featuring a long context window and vision capabilities. We’ll explore its strengths, weaknesses, and whether it truly lives up to the hype in vision-related tasks. πŸ”₯

πŸ” Gemma 3: A Quick Overview

Gemma 3 is available in multiple sizes, including:

βœ… 1B: No vision capabilities, 32k context window (not our focus here).
βœ… 4B, 12B, 27B: Vision-enabled with a 128k context window.

⚠️ Potential Context Window Error: The Hugging Face page lists 8192 as the context length for the 4B, 12B, and 27B models, which appears to conflict with the advertised 128k window and may be incorrect.

πŸ”— Gemma 3 Collection on Hugging Face
πŸ”— Gemma 3 Quants by Unsloth

In this review, we’ll be focusing on the 27B model. πŸ’ͺ

πŸ› οΈ Setup & Initial Impressions

We followed our Ollama Installation Guide to set up Gemma 3 (27B). Key observations:

  • Speed: ~13.45 tokens/sec
  • VRAM Usage: ~86GB across a quad-GPU setup
  • Optimizations Used: Flash Attention, KV cache (fp6)

πŸ“‰ Text-Based Performance: WIP πŸ˜•

Gemma 3 struggles with traditional LLM tasks like code generation, reasoning, and math. Here’s how it performed:

| Test | Description | Result |
| --- | --- | --- |
| Flappy Bird Clone | Generated code crashed after the first obstacle | ❌ FAIL |
| Armageddon Trolley Problem | Unexpected ethical reasoning | βœ… PASS |
| Word Parsing | Incorrectly counted letters in a sentence | ❌ FAIL |
| Logical Reasoning (Arrays) | Gave a slightly different but acceptable answer | βœ… PASS |
| Number Comparison | Incorrectly stated 420.69 > 4207 | ❌ FAIL |
| Word Analysis | Incorrect vowel count in "peppermint" | ❌ FAIL |
| Scheduling (Reasoning) | Correctly predicted a cat's schedule | βœ… PASS |
| Memorization (Pi to 100 decimals) | Provided inaccurate results | ❌ FAIL |
| SVG Cat Drawing | Created an unrecognizable SVG cat | βœ… PASS |
| Travel Time Calculation | Correctly determined the fastest driver | βœ… PASS |

πŸ”Ή Summary: 4/10 text tests passed. Models like QwQ or DeepSeek 67B significantly outperform Gemma 3 in text tasks. ❌
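For reference, two of the failures above are trivial to verify in plain Python, which makes the misses all the more striking:

```python
# Ground-truth checks for two of the failed text tests.

# Number comparison: 420.69 is NOT greater than 4207.
print(420.69 > 4207)  # False

# Word analysis: "peppermint" contains exactly three vowels (e, e, i).
vowel_count = sum(ch in "aeiou" for ch in "peppermint")
print(vowel_count)  # 3
```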

πŸ”₯ Vision Performance: A Game-Changer! πŸ†

When it comes to vision tasks, Gemma 3 is exceptional. Some highlights:

| Vision Task | Performance | Result |
| --- | --- | --- |
| Meme Explanation | Identified humor, cultural relevance | βœ… PASS |
| Snake Identification | Incorrect species, but emphasized safety | βœ… PASS |
| Product Recognition (Texas Toast) | Identified packaging, pricing, features | βœ… PASS |
| Disassembled GPU Parts | Identified die, VRMs, cooling system | βœ… PASS |
| Server Motherboard | Failed follow-up questions | ❌ FAIL |
| Smart Devices Recognition | Recognized Echo Dot, Samsung devices | βœ… PASS |
| Stack of GPUs | Incorrect GPU count & wattage readout | ❌ FAIL |
| Well Pressure Switch | Misidentified the component | ❌ FAIL |
| Food Identification (Corn Salsa) | Listed correct ingredients & preparation | βœ… PASS |
| Scene Analysis (Cat & Slippers) | Described environment perfectly | βœ… PASS |
| Kilowatt Meter Readout | Incorrect power consumption reading | ❌ FAIL |
| Handwritten Notes | Accurately transcribed notes | βœ… PASS |
| Fiber Drop Meme Interpretation | Understood niche tech humor | βœ… PASS |

πŸ”Ή Summary: 9/13 vision tests passed. Gemma 3 is currently the best multimodal LLM for vision tasks! πŸ†πŸ’‘
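To run these vision tests yourself, Ollama's `/api/chat` endpoint accepts an `images` list of base64-encoded image data alongside the text prompt. The sketch below builds such a payload; the model tag `gemma3:27b` is an assumption carried over from the setup section.

```python
import base64

def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "gemma3:27b") -> dict:
    """Build an Ollama /api/chat payload pairing a text prompt with one image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            # Ollama expects raw image bytes encoded as base64 strings.
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    }

# Usage: read an image and POST the payload to http://localhost:11434/api/chat
# with open("meme.png", "rb") as f:
#     payload = build_vision_request("Explain this meme.", f.read())
```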

πŸ“Œ Key Takeaways

βœ… Vision King! Gemma 3 dominates in image recognition & understanding.
❌ LLM Performance is Underwhelming: Struggles with math, logic, and reasoning.
⚠️ Inconsistencies in Context Length: Could impact text accuracy.

πŸ† The Verdict

Gemma 3 is a double-edged sword:

  • For Vision Tasks: Incredible! Best in class. πŸ‘€
  • For Text-Based Reasoning: Mediocre at best. πŸ˜•

If you’re looking for image understanding, Gemma 3 is unmatched. But if you need an all-around LLM, you might want to explore alternatives.

πŸ’¬ Have you tested Gemma 3?

Drop a comment below with your experiences, setup, and results! Let's build a community of AI enthusiasts! πŸš€


πŸ“œ LM Studio Chat Template (Jinja2 Format)

To use Gemma 3 effectively in LM Studio, add this Jinja-formatted chat template:

{%- set bos_token = "<bos>" -%}
{{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- set first_user_prefix = (messages[0]['content'] + '\n\n') if messages[0]['content'] is string else (messages[0]['content'][0]['text'] + '\n\n') -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate.") }}
    {%- endif -%}
    {%- set role = "model" if message['role'] == 'assistant' else message['role'] -%}
    {{ '<start_of_turn>' + role + '\n' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {{ message['content'] | trim }}
    {%- elif message['content'] is iterable -%}
        {%- for item in message['content'] -%}
            {{ '<start_of_image>' if item['type'] == 'image' else item['text'] | trim }}
        {%- endfor -%}
    {%- else -%}
        {{ raise_exception("Invalid content type") }}
    {%- endif -%}
    {{ '<end_of_turn>\n' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{ '<start_of_turn>model\n' }}
{%- endif -%}

Use this in LM Studio to properly format chat interactions. πŸš€
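As a sanity check, the text-only behavior of this template can be mimicked in a few lines of Python (illustration only: image parts and the role-alternation check are omitted, but the turn markers and system-message folding match the template above):

```python
def format_gemma_chat(messages, add_generation_prompt=True):
    """Mimic the Gemma 3 chat template for plain-text messages."""
    out = "<bos>"
    prefix = ""
    # A leading system message is folded into the first turn.
    if messages and messages[0]["role"] == "system":
        prefix = messages[0]["content"] + "\n\n"
        messages = messages[1:]
    for i, msg in enumerate(messages):
        # The template renames "assistant" to "model".
        role = "model" if msg["role"] == "assistant" else msg["role"]
        out += "<start_of_turn>" + role + "\n"
        if i == 0:
            out += prefix
        out += msg["content"].strip() + "<end_of_turn>\n"
    if add_generation_prompt:
        out += "<start_of_turn>model\n"
    return out
```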