Issue 002 May 18, 2026

Taste in the Loop

A Note from the Editors

Thank you all for subscribing to Taste in the Loop! We're thrilled to have over 130 of you join us for our 2nd issue. Enjoy! There's a lot more to come.

— Josh, Donnie, and Ben

You're not ready for minions (part 1) ↗

— Josh Carver (ContextBridge)

Stripe’s blog posts on Minions, their async cloud agents (see part 1 and part 2), caused quite a stir back in February. Now, a lot of folks are building their own Minions. The problem is that most teams aren’t set up to make them work in practice, Stripe included. Getting async cloud agents to move metrics the business cares about, without accidentally burying yourself in slop, is hard.

I spent a lot of time this week thinking about why this is the case. This post enumerates the issues I found, including my thoughts on why you need to start by fixing your tickets.

Here’s a fun passage as a teaser:

Minions are like junior engineers with ADHD, amnesia and jetpacks. The quality vector of your codebase matters because … you want all the little yellow guys pointed in the right direction.

Minions · Cloud Agents

Introducing Composer 2.5 ↗

— Cursor

Cursor released their new Composer 2.5 model this week, which uses Moonshot’s Kimi 2.5 as a base. Cursor claims it benchmarks competitively with Opus 4.7 at roughly 10x cheaper pricing ($0.50/M input and $2.50/M output, vs Opus 4.7’s $5/M and $25/M).
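To make the price gap concrete, here is a minimal sketch that applies the per-million-token prices quoted above to a single workload. The token counts are hypothetical and chosen only for illustration; real usage, caching discounts, and tiered pricing are not modeled.

```python
# Per-million-token list prices in USD, as quoted in the item above.
PRICES = {
    "Composer 2.5": {"input": 0.50, "output": 2.50},
    "Opus 4.7": {"input": 5.00, "output": 25.00},
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of a workload at the model's list prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 200K output tokens.
composer_cost = workload_cost("Composer 2.5", 2_000_000, 200_000)
opus_cost = workload_cost("Opus 4.7", 2_000_000, 200_000)
price_ratio = opus_cost / composer_cost

print(f"Composer 2.5: ${composer_cost:.2f}")
print(f"Opus 4.7: ${opus_cost:.2f}")
print(f"Ratio: {price_ratio:.1f}x")
```

Because both the input and output prices differ by the same factor, the ratio stays the same regardless of the input/output mix.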

While popular benchmarks aren’t indicative of real-world performance, it’s been interesting to see what Cursor has been able to achieve by applying proprietary RL pipelines to open-weight models. Now they’re using SpaceX’s infra to train a new model from scratch. This post also highlights several of the challenges teams face when scaling up RL pipelines, including dealing with sparse reward signals, generating synthetic data, and reward hacking. Overall, a good read.

LLM releases

Which Model Reviews Code Best? ↗

— Factory

A lot of us lean on AI-powered code reviews to help catch things we otherwise might miss. But AI code review agents burn tokens and inference isn’t cheap.

In this article, Factory attempts to answer “which LLM gives the best code review results for the cost?”, where ‘best’ is defined as the number of bugs per dollar spent. While the methodology is flawed¹ ², the results are still interesting: 1) Open-weight models did well here and cost far less than proprietary frontier models. 2) New models aren’t always better. You’d expect GPT-5.5 to outperform here; it did not.

¹ Every time an engineer runs a benchmark thrice because it felt “right”, a statistician spontaneously combusts.

² Every LLM responds differently to the same prompt. So, holding the prompt constant in the benchmark isn’t exactly a fair measure of the model’s actual capabilities.
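The bugs-per-dollar metric is simple enough to sketch. The snippet below is not Factory's code, and every model name and number in it is made up for illustration. It only shows how a cheaper model can rank first on this metric even when it finds fewer bugs in absolute terms.

```python
from dataclasses import dataclass


@dataclass
class ReviewRun:
    """One benchmark run of a code review agent (hypothetical data)."""
    model: str
    bugs_found: int
    cost_usd: float


def bugs_per_dollar(run: ReviewRun) -> float:
    """Bugs per dollar of inference spend."""
    if run.cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return run.bugs_found / run.cost_usd


# Made-up numbers: the frontier model finds more bugs but costs 10x more.
runs = [
    ReviewRun("frontier-model", bugs_found=24, cost_usd=12.00),
    ReviewRun("open-weight-model", bugs_found=18, cost_usd=1.20),
]

ranked = sorted(runs, key=bugs_per_dollar, reverse=True)
for run in ranked:
    print(f"{run.model}: {bugs_per_dollar(run):.1f} bugs per dollar")
```

A ratio like this says nothing about which bugs were missed, so it is worth reading alongside the raw bug counts.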

LLM releases

Learning Software Architecture ↗

— Alex Kladov

In this blog post, Alex Kladov (creator of rust-analyzer) responds to an email asking how to learn software design skills. A good portion of the post has more to do with Conway’s Law and (human) incentive structures than with pure software architecture.

What’s interesting here is that so much of this applies to working with AI agents as well. Agents reward hack and take “shortcuts” all the time. But if we can shape their incentives (goals, reward functions, etc.), that becomes a huge point of leverage for driving better outcomes.

Architecture

From Vibe Coding to Agentic Engineering ↗

— Andrej Karpathy

Andrej Karpathy (co-founder of OpenAI, former head of AI at Tesla, and now founder of Eureka Labs) gives an interview at AI Ascent. Worth a watch as Andrej covers topics such as feeling behind as a coder, how software is changing, moving from vibe coding to “agentic engineering,” and more.

Video Interviews

Harness Curious

If you’ve long suspected that your coding agent harness (Codex, Claude Code, etc.) is a vibe-coded slop cannon but never had any proof, this section is for you. Check out this gem from the Codex system prompt:

You use Three.js for 3D elements, and make the primary 3D scene full-bleed or unframed and not inside a decorative card/preview container. Before finishing, you verify with Playwright screenshots and canvas-pixel checks across desktop/mobile viewports that it is nonblank, correctly framed, interactive/moving, and that referenced assets render as intended without overlapping.

Codex stuffs this into your context window on every API call. Whether you’re debugging a Rails migration or chasing a CSS bug, you’re wasting input tokens on Three.js rendering guidance that never gets used. Multiply by every request, every developer, every day — and the slop cannon takes shape.
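For a rough sense of scale, the multiplication can be sketched as below. Every input is an assumption for illustration: the passage's token count is a guess (not a measured value), the request and team numbers are hypothetical, and prompt caching, which can lower the billed cost of repeated prefixes, is ignored.

```python
def unused_prompt_tokens(
    snippet_tokens: int,
    requests_per_dev_per_day: int,
    developers: int,
    working_days: int,
) -> int:
    """Input tokens spent re-sending a prompt passage that is never used."""
    return snippet_tokens * requests_per_dev_per_day * developers * working_days


# Hypothetical inputs: a ~90-token passage, 200 requests per developer
# per day, a 50-person team, and 20 working days in a month.
monthly_tokens = unused_prompt_tokens(
    snippet_tokens=90,
    requests_per_dev_per_day=200,
    developers=50,
    working_days=20,
)

print(f"{monthly_tokens:,} input tokens per month")
```

The passage quoted above is only one piece of the system prompt, so the figure scales with however much of the rest goes unused on a given task.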

Try Aether — our open-source harness

Clanker Fail of the Week

Each week we feature a 'clanker fail': a time when your coding agent fell flat on its face. This week, Claude called 11 of its own tools in parallel and failed ALL of them.