Research engineer Florian Brand from Prime Intellect has switched to using Gemma 4 E4B (6-bit quantized) as his primary local language model on a Mac with an M4 Max chip, replacing Qwen3 (3.5 4B).

image
image

What Happened

Florian Brand reported switching to Gemma 4 E4B, utilized via LM Studio. The model occupies approximately 7 GB of RAM and demonstrates high performance speeds with response quality comparable to GPT-4o.

Context

Previously, Qwen3 (3.5 4B) was used as the primary solution for local inference. The transition to the Gemma 4 architecture highlights the capability of small language models (SLMs) to run on consumer-grade hardware like the Apple M4 Max.

Why It Matters for the Industry

This case demonstrates the successful displacement of specialized small models by universal open-weights solutions from Google. It confirms the growing efficiency of small-scale architectures and changes the economics of AI deployment, allowing cloud APIs to be replaced by local alternatives without loss of quality.

Why It Matters for Users

For professional users, this means the ability to have a full-fledged work tool running 24/7 locally, without cloud service latency and without the need to transmit data externally, which is critical for privacy and autonomy.

Sources

Author

Look at AI, Editorial Team