Gemma 3 Google AI: Best Local Vision LLM Ever?! (But…)
In this blog post, we'll be diving into Google's latest multimodal LLM, Gemma 3, featuring a long context window and vision capabilities. We'll explore its strengths, weaknesses, and whether it truly lives up to the hype in vision-related tasks.
Gemma 3: A Quick Overview
Gemma 3 is available in multiple sizes, including:
- 1B: No vision capabilities, 32k context window (not our focus here).
- 4B, 12B, 27B: Vision-enabled with a 128k context window.
⚠️ Potential Context Window Error: The Hugging Face page lists 8192 as the context length for the 4B, 12B, and 27B models, which might be incorrect (a quick way to check for yourself is sketched after the links below).
Gemma 3 Collection on Hugging Face
Gemma 3 Quants by Unsloth
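If you want to check the advertised context length yourself, here's a minimal sketch that reads the model's config.json from Hugging Face. It assumes the limit is exposed as max_position_embeddings (under text_config for the multimodal models) and that you've accepted the gated-model license and logged in with a Hugging Face token.

```python
# Minimal sketch: read the context length straight from the model config.
# Assumes you have accepted the Gemma license on Hugging Face and are
# logged in (e.g. via `huggingface-cli login`), since the repos are gated.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("google/gemma-3-27b-it", "config.json")
with open(path) as f:
    cfg = json.load(f)

# Gemma 3 is multimodal, so the text settings may sit under "text_config".
text_cfg = cfg.get("text_config", cfg)
print(text_cfg.get("max_position_embeddings"))  # expect 131072 if 128k is right
```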
In this review, we'll be focusing on the 27B model.
Setup & Initial Impressions
We followed our Ollama Installation Guide to set up Gemma 3 (27B). Key observations:
- Speed: ~13.45 tokens/sec
- VRAM Usage: ~86GB across a quad-GPU setup
- Optimizations Used: Flash Attention, fp16 KV cache (a quick throughput sketch follows below)
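As a sanity check on the speed figure, here's a minimal sketch of measuring throughput with the ollama Python client. It assumes a local Ollama server with the model pulled as gemma3:27b; non-streaming responses include the token counts and timings needed for the calculation.

```python
# Minimal sketch: measure generation throughput via the ollama Python client.
# Assumes a local Ollama server with gemma3:27b already pulled.
import ollama

resp = ollama.chat(
    model="gemma3:27b",
    messages=[{"role": "user", "content": "Explain flash attention in two sentences."}],
)

# Non-streaming responses report eval_count (generated tokens) and
# eval_duration (nanoseconds), which together give tokens per second.
tokens_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.2f} tokens/sec")
```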
Text-Based Performance
Gemma 3 struggles with traditional LLM tasks like code generation, reasoning, and math. Here's how it performed:
| Test | Description | Result |
|---|---|---|
| Flappy Bird Clone | Generated code crashed after the first obstacle | ❌ FAIL |
| Armageddon Trolley Problem | Unexpected ethical reasoning | ✅ PASS |
| Word Parsing | Incorrectly counted letters in a sentence | ❌ FAIL |
| Logical Reasoning (Arrays) | Answer deviated from the expected one but was acceptable | ✅ PASS |
| Number Comparison | Incorrectly stated 420.69 > 4207 | ❌ FAIL |
| Word Analysis | Incorrect vowel count in "peppermint" | ❌ FAIL |
| Scheduling (Reasoning) | Correctly predicted a cat's schedule | ✅ PASS |
| Memorization (Pi to 100 decimals) | Provided inaccurate digits | ❌ FAIL |
| SVG Cat Drawing | Created an unrecognizable SVG cat | ❌ FAIL |
| Travel Time Calculation | Correctly determined the fastest driver | ✅ PASS |
Summary: 4/10 tests passed. Models like QwQ or DeepSeek 67B significantly outperform Gemma 3 in text tasks.
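Each of these prompts is easy to re-run yourself; for example, here's a minimal sketch of the number-comparison test, assuming the same gemma3:27b Ollama setup as above:

```python
# Minimal sketch: re-run the number-comparison test from the table above.
import ollama

resp = ollama.chat(
    model="gemma3:27b",
    messages=[{
        "role": "user",
        "content": "Which number is larger: 420.69 or 4207? Answer with just the number.",
    }],
)
print(resp["message"]["content"])  # our run incorrectly answered 420.69
```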
Vision Performance: A Game-Changer!
When it comes to vision tasks, Gemma 3 is exceptional. Some highlights:
| Vision Task | Performance | Result |
|---|---|---|
| Meme Explanation | Identified humor and cultural relevance | ✅ PASS |
| Snake Identification | Incorrect species, but emphasized safety | ✅ PASS |
| Product Recognition (Texas Toast) | Identified packaging, pricing, features | ✅ PASS |
| Disassembled GPU Parts | Identified die, VRMs, cooling system | ✅ PASS |
| Server Motherboard | Failed follow-up questions | ❌ FAIL |
| Smart Devices Recognition | Recognized Echo Dot, Samsung devices | ✅ PASS |
| Stack of GPUs | Incorrect GPU count & wattage readout | ❌ FAIL |
| Well Pressure Switch | Misidentified the component | ❌ FAIL |
| Food Identification (Corn Salsa) | Listed correct ingredients & preparation | ✅ PASS |
| Scene Analysis (Cat & Slippers) | Described the environment perfectly | ✅ PASS |
| Kilowatt Meter Readout | Incorrect power-consumption reading | ❌ FAIL |
| Handwritten Notes | Accurately transcribed the notes | ✅ PASS |
| Fiber Drop Meme Interpretation | Understood niche tech humor | ✅ PASS |
Summary: 9/13 vision tests passed. For vision tasks, Gemma 3 is the best local multimodal LLM we've tested.
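Running your own vision tests is just as simple; here's a minimal sketch using the ollama client's images field (the file path is a hypothetical placeholder for your own photo):

```python
# Minimal sketch: send a local image to gemma3:27b for analysis.
import ollama

resp = ollama.chat(
    model="gemma3:27b",
    messages=[{
        "role": "user",
        "content": "What computer component is shown in this photo?",
        "images": ["./gpu_die.jpg"],  # hypothetical placeholder path
    }],
)
print(resp["message"]["content"])
```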
Key Takeaways
✅ Vision King: Gemma 3 dominates image recognition and understanding.
❌ Underwhelming LLM Performance: Struggles with math, logic, and reasoning.
⚠️ Inconsistent Context-Length Documentation: The 8192 figure on Hugging Face conflicts with the advertised 128k window.
The Verdict
Gemma 3 is a double-edged sword:
- For Vision Tasks: Incredible! Best in class.
- For Text-Based Reasoning: Mediocre at best.
If you're looking for image understanding, Gemma 3 is unmatched. But if you need an all-around LLM, you might want to explore alternatives.
Have you tested Gemma 3?
Drop a comment below with your experiences, setup, and results! Let's build a community of AI enthusiasts!
LM Studio Chat Template (Jinja2 Format)
To use Gemma 3 effectively in LM Studio, add this Jinja-formatted chat template:
```jinja
{%- set bos_token = "<bos>" -%}
{{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
{%- set first_user_prefix = (messages[0]['content'] + '\n\n') if messages[0]['content'] is string else (messages[0]['content'][0]['text'] + '\n\n') -%}
{%- set loop_messages = messages[1:] -%}
{%- else -%}
{%- set first_user_prefix = "" -%}
{%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception("Conversation roles must alternate.") }}
{%- endif -%}
{%- set role = "model" if message['role'] == 'assistant' else message['role'] -%}
{{ '<start_of_turn>' + role + '\n' + (first_user_prefix if loop.first else "") }}
{%- if message['content'] is string -%}
{{ message['content'] | trim }}
{%- elif message['content'] is iterable -%}
{%- for item in message['content'] -%}
{{ '<start_of_image>' if item['type'] == 'image' else item['text'] | trim }}
{%- endfor -%}
{%- else -%}
{{ raise_exception("Invalid content type") }}
{%- endif -%}
{{ '<end_of_turn>\n' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{ '<start_of_turn>model\n' }}
{%- endif -%}
```
Use this in LM Studio to properly format chat interactions.
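If you'd like to sanity-check the template before loading it, here's a minimal sketch that renders it with the jinja2 package; gemma3_template.jinja is a placeholder for wherever you saved the template text above.

```python
# Minimal sketch: render the chat template above with jinja2 to verify the
# <start_of_turn>/<end_of_turn> formatting before pasting it into LM Studio.
from jinja2 import Environment


def raise_exception(message):
    # The template calls raise_exception() on malformed conversations.
    raise ValueError(message)


env = Environment()
env.globals["raise_exception"] = raise_exception

with open("gemma3_template.jinja") as f:  # placeholder path to the template text
    template = env.from_string(f.read())

print(template.render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    add_generation_prompt=True,
))
# Expected shape:
# <bos><start_of_turn>user
# You are a helpful assistant.
#
# Hello!<end_of_turn>
# <start_of_turn>model
```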