Running Gemma 3 - Multimodal LLMs at Home! Full Tutorial with Installation Guide

πŸš€ Gemma 3 Google AI: Best Local Vision LLM Ever?! (But…)

In this blog post, we’ll be diving into Google’s latest multimodal LLM, Gemma 3, featuring a long context window and vision capabilities. We’ll explore its strengths, weaknesses, and whether it truly lives up to the hype in vision-related tasks. πŸ”₯

πŸ” Gemma 3: A Quick Overview

Gemma 3 is available in multiple sizes, including:

βœ… 1B: No vision capabilities, 32k context window (not our focus here).
βœ… 4B, 12B, 27B: Vision-enabled with a 128k context window.

⚠️ Potential Context Window Error: The Hugging Face page lists 8192 as the context length for the 4B, 12B, and 27B models, which appears to conflict with the advertised 128k window and may be incorrect.

πŸ”— Gemma 3 Collection on Hugging Face
πŸ”— Gemma 3 Quants by Unsloth

In this review, we’ll be focusing on the 27B model. πŸ’ͺ

πŸ› οΈ Setup & Initial Impressions

We followed our Ollama Installation Guide to set up Gemma 3 (27B). Key observations:

  • Speed: ~13.45 tokens/sec
  • VRAM Usage: ~86GB across a quad-GPU setup
  • Optimizations Used: Flash Attention, KV cache (fp6)

πŸ“‰ Text-Based Performance: WIP πŸ˜•

Gemma 3 struggles with traditional LLM tasks like code generation, reasoning, and math. Here’s how it performed:

| Test | Description | Result |
| --- | --- | --- |
| Flappy Bird Clone | Generated code crashed after the first obstacle | ❌ FAIL |
| Armageddon Trolley Problem | Unexpected ethical reasoning | βœ… PASS |
| Word Parsing | Incorrectly counted letters in a sentence | ❌ FAIL |
| Logical Reasoning (Arrays) | Gave a slightly different but acceptable answer | βœ… PASS |
| Number Comparison | Incorrectly stated 420.69 > 4207 | ❌ FAIL |
| Word Analysis | Incorrect vowel count in "peppermint" | ❌ FAIL |
| Scheduling (Reasoning) | Correctly predicted a cat's schedule | βœ… PASS |
| Memorization (Pi to 100 decimals) | Provided inaccurate results | ❌ FAIL |
| SVG Cat Drawing | Created an unrecognizable SVG cat | βœ… PASS |
| Travel Time Calculation | Correctly determined the fastest driver | βœ… PASS |

πŸ”Ή Summary: 4/10 text tests passed. Models like QwQ or DeepSeek 67B significantly outperform Gemma 3 in text tasks. ❌
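For reference, two of the failures above are trivial to verify in plain Python, which makes the misses all the more striking:

```python
# Ground-truth checks for two of the failed text tests.

# Number comparison: 420.69 is NOT greater than 4207.
print(420.69 > 4207)  # False

# Word analysis: "peppermint" contains exactly three vowels (e, e, i).
vowel_count = sum(ch in "aeiou" for ch in "peppermint")
print(vowel_count)  # 3
```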

πŸ”₯ Vision Performance: A Game-Changer! πŸ†

When it comes to vision tasks, Gemma 3 is exceptional. Some highlights:

| Vision Task | Performance | Result |
| --- | --- | --- |
| Meme Explanation | Identified humor, cultural relevance | βœ… PASS |
| Snake Identification | Incorrect species, but emphasized safety | βœ… PASS |
| Product Recognition (Texas Toast) | Identified packaging, pricing, features | βœ… PASS |
| Disassembled GPU Parts | Identified die, VRMs, cooling system | βœ… PASS |
| Server Motherboard | Failed follow-up questions | ❌ FAIL |
| Smart Devices Recognition | Recognized Echo Dot, Samsung devices | βœ… PASS |
| Stack of GPUs | Incorrect GPU count & wattage readout | ❌ FAIL |
| Well Pressure Switch | Misidentified the component | ❌ FAIL |
| Food Identification (Corn Salsa) | Listed correct ingredients & preparation | βœ… PASS |
| Scene Analysis (Cat & Slippers) | Described environment perfectly | βœ… PASS |
| Kilowatt Meter Readout | Incorrect power consumption reading | ❌ FAIL |
| Handwritten Notes | Accurately transcribed notes | βœ… PASS |
| Fiber Drop Meme Interpretation | Understood niche tech humor | βœ… PASS |

πŸ”Ή Summary: 9/13 vision tests passed. Gemma 3 is currently the best multimodal LLM for vision tasks! πŸ†πŸ’‘
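To run these vision tests yourself, Ollama's `/api/chat` endpoint accepts an `images` list of base64-encoded image data alongside the text prompt. The sketch below builds such a payload; the model tag `gemma3:27b` is an assumption carried over from the setup section.

```python
import base64

def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "gemma3:27b") -> dict:
    """Build an Ollama /api/chat payload pairing a text prompt with one image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            # Ollama expects raw image bytes encoded as base64 strings.
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    }

# Usage: read an image and POST the payload to http://localhost:11434/api/chat
# with open("meme.png", "rb") as f:
#     payload = build_vision_request("Explain this meme.", f.read())
```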

πŸ“Œ Key Takeaways

βœ… Vision King! Gemma 3 dominates in image recognition & understanding.
❌ LLM Performance is Underwhelming: Struggles with math, logic, and reasoning.
⚠️ Inconsistencies in Context Length: Could impact text accuracy.

πŸ† The Verdict

Gemma 3 is a double-edged sword:

  • For Vision Tasks: Incredible! Best in class. πŸ‘€
  • For Text-Based Reasoning: Mediocre at best. πŸ˜•

If you’re looking for image understanding, Gemma 3 is unmatched. But if you need an all-around LLM, you might want to explore alternatives.

πŸ’¬ Have you tested Gemma 3?

Drop a comment below with your experiences, setup, and results! Let's build a community of AI enthusiasts! πŸš€


πŸ“œ LM Studio Chat Template (Jinja2 Format)

To use Gemma 3 effectively in LM Studio, add this Jinja-formatted chat template:

{%- set bos_token = "<bos>" -%}
{{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- set first_user_prefix = (messages[0]['content'] + '\n\n') if messages[0]['content'] is string else (messages[0]['content'][0]['text'] + '\n\n') -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate.") }}
    {%- endif -%}
    {%- set role = "model" if message['role'] == 'assistant' else message['role'] -%}
    {{ '<start_of_turn>' + role + '\n' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {{ message['content'] | trim }}
    {%- elif message['content'] is iterable -%}
        {%- for item in message['content'] -%}
            {{ '<start_of_image>' if item['type'] == 'image' else item['text'] | trim }}
        {%- endfor -%}
    {%- else -%}
        {{ raise_exception("Invalid content type") }}
    {%- endif -%}
    {{ '<end_of_turn>\n' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{ '<start_of_turn>model\n' }}
{%- endif -%}

Use this in LM Studio to properly format chat interactions. πŸš€
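As a sanity check, the text-only behavior of this template can be mimicked in a few lines of Python (illustration only: image parts and the role-alternation check are omitted, but the turn markers and system-message folding match the template above):

```python
def format_gemma_chat(messages, add_generation_prompt=True):
    """Mimic the Gemma 3 chat template for plain-text messages."""
    out = "<bos>"
    prefix = ""
    # A leading system message is folded into the first turn.
    if messages and messages[0]["role"] == "system":
        prefix = messages[0]["content"] + "\n\n"
        messages = messages[1:]
    for i, msg in enumerate(messages):
        # The template renames "assistant" to "model".
        role = "model" if msg["role"] == "assistant" else msg["role"]
        out += "<start_of_turn>" + role + "\n"
        if i == 0:
            out += prefix
        out += msg["content"].strip() + "<end_of_turn>\n"
    if add_generation_prompt:
        out += "<start_of_turn>model\n"
    return out
```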