The Kaitchup – AI on a Budget

Newsletter (Digital)

The Kaitchup focuses on recent advances in making AI more accessible. With this newsletter, you will learn how to run AI on your computer. Progress is fast: larger models are getting cheaper to use.

Media Outlet details

Scope	National
Language	English
Country	United States of America

Recent Articles

The Kaitchup – AI on a Budget

May 12, 2026 | By Benjamin Marie | The Kaitchup – AI on a Budget

Like I did for Qwen3.5 and Gemma 4, let’s see how well Qwen3.6 27B holds up when quantized into different formats: FP8, INT4, and NVFP4. The goal is to measure Qwen3.6’s robustness to quantization: how much accuracy is affected, what happens to latency and token efficiency, and how fast the model can become after quantization.

The Kaitchup – AI on a Budget

Mar 18, 2026 | By Benjamin Marie | The Kaitchup – AI on a Budget

Inference efficiency has become one of the main ways LLMs distinguish themselves. Two models can look similar on paper: roughly the same parameter count, similar context length, both marketed as fast, yet behave very differently once you actually try to serve them at scale.

Run GLM-4.7 Flash on One GPU: VRAM Math, Quantization Options, and Benchmark Results

Feb 09, 2026 | By Benjamin Marie | The Kaitchup – AI on a Budget

Z.ai’s flagship model, GLM-4.7, is among the strongest open-weight models available, but it’s also massive at 358B parameters. In practice, that means even an NVIDIA B300 (288 GB) can’t hold the full model in memory, even in FP8. A 4-bit version can fit, but inference is still expensive. Z.ai also released a smaller sibling: GLM-4.7 Flash, a 30B-A3B Mixture-of-Experts (MoE) model. It has 30B total parameters, but only ~3B are active per token during inference.
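
The memory claim in this teaser can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration, not the article's own code: it uses the 358B total parameter count given for GLM-4.7 in the Jan 12 entry of this listing, counts weights only (no KV cache, activations, or quantization scales), and treats 1 GB as 1e9 bytes. The function names are hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def fits(num_params: float, bits_per_param: float, gpu_memory_gb: float) -> bool:
    """True if the weights alone fit in GPU memory (ignores all runtime overhead)."""
    return weight_memory_gb(num_params, bits_per_param) <= gpu_memory_gb


TOTAL_PARAMS = 358e9   # GLM-4.7 total parameters, as cited in this listing
GPU_MEMORY_GB = 288    # NVIDIA B300, as cited in this listing

for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    size = weight_memory_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if fits(TOTAL_PARAMS, bits, GPU_MEMORY_GB) else "does not fit"
    print(f"{name}: ~{size:.0f} GB of weights, {verdict} in {GPU_MEMORY_GB} GB")
```

Under these assumptions, FP8 weights need about 358 GB and 4-bit weights about 179 GB, which matches the teaser's statement that only the 4-bit version fits on a single B300. Real deployments need extra headroom beyond the weights.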

4-bit GLM4.7 with a Single B300: High Speed and 100% Accuracy on AIME24

Jan 12, 2026 | By Benjamin Marie | The Kaitchup – AI on a Budget

GLM-4.7 is one of the best open-weight LLMs right now, with strong benchmark results and (just as importantly) consistently positive feedback from people actually using it. So I wanted to run it myself. GLM-4.7 is a sparse Mixture-of-Experts (MoE) model: it has a large total parameter count (358B), but only a subset of experts are active on each forward pass.

LFM2.5 and Falcon H1R-7B: New Hybrid Models with Strong Benchmark Scores

Jan 09, 2026 | By Benjamin Marie | The Kaitchup – AI on a Budget

Hi everyone,

In this edition of The Weekly Kaitchup, we discuss:

LFM2.5: Small, Fast, and Accurate
Falcon H1R-7B: The New Best Model Under 30B Parameters?
Qwen3-VL Embedding and Rerankers

LFM2.5 1.2B scores 14.0 on AIME25 and 38.9 on GPQA-Diamond. For a 1.2B-parameter model, that’s very high.

Nemotron 3 Nano: A Very Fast Model That Doesn't Think Too Much

Jan 05, 2026 | By Benjamin Marie | The Kaitchup – AI on a Budget

Nemotron 3 Nano sits in a category I usually file under “interesting, but probably not worth the time”: hybrid Mamba-Transformer models. I’ve tested enough hybrids to recognize the usual trade-off. They’re a great research direction and often deliver excellent throughput, but accuracy tends to slip in subtle ways that only show up once you run real tasks. And this one adds another layer of complexity: it’s a sparse MoE.

Fine-Tuning RNJ-1 with Unsloth: 4x Faster on a Single GPU

Dec 22, 2025 | By Benjamin Marie | The Kaitchup – AI on a Budget

Fine-tuning LLMs to your own tasks is expensive, and if you do it naively, a big chunk of that compute is simply burned on padding. LLMs train on batches of token sequences, and every sequence in a batch must have the same length. For instance, if you have a batch with four sequences of length 4, 12, 13, and 5,000 tokens, the short ones are padded with dummy tokens up to 5,000.
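
To put a number on the padding waste in that example, here is a minimal sketch of the arithmetic. It is only an illustration of naive pad-to-longest batching, not code from the article, and `padding_stats` is a hypothetical helper name.

```python
def padding_stats(lengths):
    """Return (padding_tokens, padding_fraction) when every sequence
    in the batch is padded to the length of the longest one."""
    max_len = max(lengths)
    total_slots = max_len * len(lengths)   # token positions actually processed
    real_tokens = sum(lengths)             # positions holding real data
    padding_tokens = total_slots - real_tokens
    return padding_tokens, padding_tokens / total_slots


# The batch from the example above: three short sequences and one long one.
lengths = [4, 12, 13, 5000]
pad, frac = padding_stats(lengths)
print(f"{pad} padding tokens, {frac:.1%} of the batch")
# prints: 14971 padding tokens, 74.9% of the batch
```

In this batch, roughly three out of every four token positions are dummy tokens, which is the wasted compute the teaser refers to.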

Ministral 3 vs Others: Accuracy, Token Efficiency, and the Best Model per Budget

Dec 15, 2025 | By Benjamin Marie | The Kaitchup – AI on a Budget

Alongside its new flagship open-weight model, Mistral Large 3, Mistral AI also released a family of smaller models under the Ministral 3 name. The lineup includes base, instruct, and reasoning variants at 3B, 8B, and 14B parameters. All are small enough to run on consumer GPUs. The instruct versions are also natively in FP8, which can improve speed and memory efficiency on RTX 50xx GPUs at similar accuracy. Mistral AI positions these as the best small models.

Shrink LLMs with Vocabulary Reduction: From Gemma 3 270M to 141M

Sep 29, 2025 | By Benjamin Marie | The Kaitchup – AI on a Budget

State-of-the-art language models increasingly rely on large token vocabularies to cover many writing systems. Today, the vocabulary size of popular open models typically sits in the ~128k–150k range. Google’s Gemma 3 goes further with a tokenizer of 262,144 entries, designed to improve coverage for non-Latin scripts. A big vocabulary helps because it shortens sequences for Chinese, Japanese, Korean, and many Indic languages, which tends to boost multilingual performance. But it’s not free.
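
One way to see why a large vocabulary "is not free" is to count the parameters in the token embedding table, which grows linearly with vocabulary size. The sketch below is a rough illustration, not the article's method. The hidden size of 640 is an assumption chosen for illustration (it is not stated in this listing), the 128,000-entry vocabulary stands in for the "~128k" range mentioned above, and the table is counted once, as if input and output embeddings were tied.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one token embedding table: one vector per vocabulary entry."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 640  # ASSUMPTION: illustrative hidden size, not taken from the listing

for label, vocab in [("262,144-entry vocabulary", 262_144),
                     ("128,000-entry vocabulary", 128_000)]:
    params = embedding_params(vocab, HIDDEN_SIZE)
    print(f"{label}: {params / 1e6:.1f}M embedding parameters")
```

For a very small model, an embedding table of this size can account for a large share of the total parameter count, which is why reducing the vocabulary can shrink the model substantially.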

RAG with Qwen3 Embedding and Qwen3 Reranker

Jun 19, 2025 | By Benjamin Marie | The Kaitchup – AI on a Budget

Retrieval-Augmented Generation (RAG) is a powerful paradigm that enhances a large language model (LLM) with a retrieval mechanism, enabling it to access relevant background information, such as documents or passages, before generating a response. At the core of a RAG pipeline, we usually find two components: the embedding model and the reranker. The embedding model transforms text into dense numerical vectors (embeddings), placing semantically similar texts close together in vector space.

The Kaitchup – AI on a Budget Journalists

Marie, Benjamin