Why Your Agent Can't See Its Own Mistakes


I've been building an agent that makes 3D models for Minecraft using Blockbench. Blockbench has an MCP server, so the agent can create models, paint textures, rotate parts, and export. The same agent that writes Java plugins for the game also builds the assets that go into it.

The problem comes when the agent needs to look at what it made and decide whether it's any good.

The setup

The workflow goes like this:

Task: "make a sword model"
          |
          v
  +----------------+
  |  Coding Agent  |
  |  (Claude)      |
  +----------------+
          |
          |  MCP calls
          v
  +----------------+
  |  Blockbench    |
  |  3D editor     |
  +----------------+
          |
          |  snapshot
          v
  +----------------+
  |  Visual check  |  <-- who does this?
  +----------------+

The agent talks to Blockbench through the MCP. It creates the model geometry, applies textures, positions everything. Then it takes a snapshot of what it made and needs to evaluate: does this look right? Is the texture clean? Does the shape match what a Minecraft item should look like?

Same-context evaluation is terrible

When I had the same Claude agent do the visual check within the same conversation, it almost always said things looked fine. Even when they clearly didn't. Broken textures, weird proportions, faces that were obviously wrong. It would look at the screenshot and say "looks good" and move on.

My theory is that the agent has too much context about what it intended to make. It knows the steps it took. It knows it placed the right texture on the right face. So when it sees the result, it's biased toward thinking it worked. It's not really looking at the image with fresh eyes. It's confirming what it already believes.

What usually happened:

Agent: builds model
Agent: takes snapshot
Agent: "Looks correct, the texture is applied
and the shape matches the spec."

Me: looks at screenshot
Me: That sword has the texture sideways and
one side is completely black.

Fresh context fixes it

What actually worked was taking the snapshot and sending it to a completely separate agent in a new conversation. No history of what was built or how. Just the image, a short description of what it should be, and a prompt asking it to evaluate the shape, texture, and overall style.

  +----------------+       snapshot       +-------------------+
  |  Coding Agent  | -------------------> |  Visual QA Agent  |
  |  (Claude)      |                      |  (Gemini)         |
  +----------------+                      +-------------------+
          |                                         |
          |  builds the model             checks the result
          |  via Blockbench MCP           with fresh eyes
          |                                         |
          v                                         v
     "make a sword"                    "the texture on the blade
                                        is rotated 90 degrees,
                                        and the handle color
                                        doesn't match the style"

The improvement was immediate and obvious. The fresh agent would catch things the original agent never mentioned. Sideways textures, color mismatches, proportions that didn't fit the game's style.

Gemini is better at this than Claude

I tried this with Claude as the visual QA agent too. It was better than same-context evaluation, but still missed things. Switching to Gemini for the visual QA step made a big difference. Gemini just seems to be more perceptive about visual details. It would catch subtle stuff like a texture being slightly off-palette compared to vanilla Minecraft items, or a model having too many polygons for the art style.

For everything else in the pipeline (writing code, using the MCP, reasoning about the task), Claude is better. But for "look at this image and tell me what's wrong with it," Gemini wins.

How I set it up

The visual QA call is simple. It's a separate API call, not part of the main agent's conversation. I send:

  1. The screenshot from Blockbench
  2. A one-liner about what the model is supposed to be
  3. A short prompt with criteria: check the overall shape, texture alignment, color palette, and whether it matches Minecraft's visual style

That's it. No build history, no conversation context, no knowledge of what tools were used. Just "here's an image, here's what it should be, what's wrong with it?"
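Assembling that request is a small helper. This is a hedged sketch rather than my exact code: `build_qa_request` and the criteria string are illustrative names, and the returned dict is whatever shape your vision-model client actually expects.

```python
import base64
from pathlib import Path

# Illustrative criteria prompt -- mirrors the checklist above.
QA_CRITERIA = (
    "Evaluate this 3D model screenshot. Check the overall shape, "
    "texture alignment, color palette, and whether it matches "
    "Minecraft's visual style. List every problem you see."
)

def build_qa_request(snapshot_path: str, description: str) -> dict:
    """Build a fresh-context payload for the visual QA call.

    Deliberately contains no build history, tool logs, or
    conversation context -- just the image and a one-liner.
    """
    image_b64 = base64.b64encode(Path(snapshot_path).read_bytes()).decode()
    return {
        "image": image_b64,  # the Blockbench snapshot, base64-encoded
        "prompt": f"This is supposed to be: {description}\n\n{QA_CRITERIA}",
    }
```

The point of funneling everything through one function is that it's structurally impossible to leak build context into the QA call: the function only accepts an image path and a one-line description.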

Full loop:

  +--------+  task   +-------+   MCP   +------------+
  | Client | ------> | Agent | ------> | Blockbench |
  +--------+         +-------+         +------------+
                                             |
                                             |  snapshot.png
                                             |
                     +------------+  image   |
                     |  QA Agent  | <--------+
                     |  (Gemini)  |
                     +------------+
                           |
                      pass / fail
                      + feedback
                           |
                           v
                   +---------------+
                   |  Agent fixes  |
                   |  and retries  |
                   +---------------+

If the QA agent says it's fine, we're done. If it finds problems, the feedback goes back to the coding agent, which fixes the model and we go around again.
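The loop itself is just a bounded retry. A minimal sketch, with hypothetical callables standing in for the Blockbench MCP calls (`build`, `snapshot`, `fix`) and the separate Gemini call (`evaluate`):

```python
from typing import Callable

def qa_loop(build: Callable[[], None],
            snapshot: Callable[[], bytes],
            evaluate: Callable[[bytes], tuple[bool, str]],
            fix: Callable[[str], None],
            max_rounds: int = 3) -> bool:
    """Build the model, then run visual QA rounds until it passes.

    evaluate() returns (passed, feedback); on failure the feedback
    string goes back to the coding agent via fix().
    """
    build()
    for _ in range(max_rounds):
        passed, feedback = evaluate(snapshot())
        if passed:
            return True    # QA agent says it's fine -- done
        fix(feedback)      # feedback goes back to the coding agent
    return False           # still failing after max_rounds; escalate
```

Bounding the rounds matters: without a cap, a model the QA agent keeps rejecting (or a flaky evaluation) turns into an infinite build-fix cycle.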

The takeaway

If you're building agents that produce visual output (3D models, UI screenshots, generated images, anything), don't let the same agent judge its own work in the same conversation. It's too biased by its own context. A separate call to a vision model, with minimal context and a clear evaluation prompt, does a much better job. And for that specific task, Gemini is currently the best option I've found.