Comparing GPT-5.4, Opus 4.6, GLM-5.1, Kimi K2.5, MiMo V2 Pro and MiniMax M2.7
April 14, 2026
So listen, I was just trying to eject a drive and I ended up spending the day benchmarking LLMs.
In a hurry? Jump to the results or the conclusion. Otherwise make some tea, relax and read on.
It started with this error:
Usually I just run `lsof /Volumes/Data` and kill whatever is causing the issue (99% of the time it's Spotlight or QuickLook).
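As a sketch, the manual routine looks roughly like this (`eject_helper` is a name I made up, and `/Volumes/Data` is just whatever volume refuses to eject):

```shell
# Find whatever is holding files open on a volume and kill it.
eject_helper() {
  local volume="$1"
  local pids
  pids=$(lsof -t "$volume" 2>/dev/null)   # -t prints only the PIDs
  if [ -n "$pids" ]; then
    echo "killing: $pids"
    kill $pids    # usually mds/mdworker (Spotlight) or quicklookd
  else
    echo "nothing holding $volume open"
  fi
}

# Usage: eject_helper /Volumes/Data
```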
But you know what, I'm at the point where I need to renew my AI coding subscription. After a while on Claude in Cursor and a bit of GPT-5.4 in Codex, I wanted to see what's out there, and especially compare the big two to the open-weights models available through OpenCode Go.
After all, Cursor based their Composer 2 model on Kimi K2.5, so it's probably decent, and I've heard good things about GLM-5.1.
So I decided to compare those models by having each one build a native macOS app to make the above experience a bit nicer.
Simple as that.
When it comes to the planning and building experience, I'll document vibes only, because nothing really stood out there. And that's all that matters these days anyway, doesn't it?
As for code analysis and the final ranking, it's a bit more thorough, don't worry.
I ran all models through OpenCode. Started in planning mode, and answered all questions they asked if any.
They all had pretty similar plans. Slight variations in irrelevant details, but everything was reasonable when it comes to the underlying commands and APIs to build upon.
The only one that stood out was MiMo, which proposed a kernel API instead of shelling out to lsof. Do what you want with that, but I don't really care.
It's a tie for me on this aspect.
OpenCode lets the model run shell commands, so they were all able to compile the app and iterate on build failures on their own.
This means that when the models were done, the code compiled. It's a tie again.
Despite compiling successfully, a few of them crashed as soon as I launched the app.
When you make a web app, most harnesses allow browser use for the LLM to test its own output live. This gives us the same success loop we had with compile-time errors.
But thereās no such thing for native apps yet.
Not something to blame the models themselves for, though.
After fixing those, all apps had one of two outcomes:
For all of them I could have just kept the vibe-coding loop going until it worked. (I did not; I have some real work to do and, as you can tell, I'm spending way too much time on this already.)
But as far as I'm concerned it's more or less a tie. They may not have worked for different reasons, but each would take a comparable amount of effort to fix.
Note: nothing new here. If you can give the LLM a tool to test its work end to end, you spend less time writing "doesn't work, pls fix". Too bad it's not a readily available option here.
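For what it's worth, even a crude smoke test would close part of the loop: launch the freshly built app and check whether it survives its first second. A minimal sketch, assuming a bash harness (`check_alive` is a name I made up; you'd pass it whatever command launches the build):

```shell
# Launch a command in the background and report whether it's still
# alive after one second, i.e. whether it crashed on startup.
check_alive() {
  "$@" >/dev/null 2>&1 &   # run in the background, silencing output
  local pid=$!
  sleep 1                  # give it a second to crash if it's going to
  if kill -0 "$pid" 2>/dev/null; then
    echo "still running"
    kill "$pid" 2>/dev/null
  else
    echo "crashed or exited"
  fi
}

# e.g. check_alive ./build/MyApp.app/Contents/MacOS/MyApp
```

An agent that can run shell commands could call something like this after every build, instead of waiting for me to report the crash.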
Let's assume for a minute that code quality, as subjective as it is, is still relevant.
Then, I have a slight preference for the output of GLM-5.1 and GPT-5.4.
Both are leaner and easier to reason about than the others.
That usually translates to software that's less buggy now, and less buggy later when we add or change things. In my experience this seems to apply to both LLMs and humans.
But I'm sure you can steer all of them to output a clean and maintainable app by prompting a bit more, that is, if you know how to code and care about the code at all.
That was a fun experiment. I asked the LLMs to review each other's work, rank it, and score the cleanliness and technical approach.
I was worried this was gonna be a slop spiral, but the results were not as all over the place as I expected.
GPT-5.4 and Opus 4.6, the two frontier proprietary models, nearly always came up in the top 3, with an edge for GPT.
MiMo V2 Pro also tended to rank quite high, close to the two leaders, even if the consensus wasn't as striking.
The caveat is that it scored on the low side for cleanliness, despite standing out for technical approach.
GLM-5.1 had a pretty consistent 4th place, but it ranked really high on cleanliness specifically.
This means it's a decent choice if you want relatively clean code out of the box and don't mind giving it guidance on the technical approach.
That actually works best for me. I'd rather steer the technical approach if it means I don't have to refine output quality as much.
Kimi K2.5 and MiniMax M2.7 were consistently at the bottom, in that order, which is pretty consistent with their cost.
Here's the raw data. Row is candidate, column is judge. Cleanliness and technical approach are rated out of 10.
**Overall ranking**

| Candidate | Overall | GPT-5.4 | Opus 4.6 | GLM-5.1 | Kimi K2.5 | MiMo V2 Pro | MiniMax M2.7 |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 🥇 | 1 | 2 | 2 | 2 | 1 | 2 |
| Opus 4.6 | 🥈 | 3 | 3 | 3 | 1 | 3 | 1 |
| MiMo V2 Pro | 🥉 | 2 | 1 | 1 | 3 | 4 | 6 |
| GLM-5.1 | 4 | 4 | 4 | 4 | 4 | 2 | 3 |
| Kimi K2.5 | 5 | 5 | 5 | 5 | 5 | 5 | 4 |
| MiniMax M2.7 | 6 | 6 | 6 | 6 | 6 | 6 | 5 |
**Cleanliness (out of 10)**

| Candidate | Average | GPT-5.4 | GLM-5.1 | Kimi K2.5 | MiMo V2 Pro | MiniMax M2.7 |
|---|---|---|---|---|---|---|
| GPT-5.4 | 8.4 | 8.5 | 8 | 9 | 8.5 | 8 |
| GLM-5.1 | 7.8 | 8 | 8 | 8 | 8 | 7 |
| Opus 4.6 | 7.7 | 6.5 | 6 | 9 | 8 | 9 |
| Kimi K2.5 | 6.5 | 6 | 5 | 7 | 7.5 | 7 |
| MiMo V2 Pro | 6.3 | 6.5 | 7 | 7 | 7 | 4 |
| MiniMax M2.7 | 5.2 | 4 | 4 | 6 | 7 | 5 |
**Technical approach (out of 10)**

| Candidate | Average | GPT-5.4 | GLM-5.1 | Kimi K2.5 | MiMo V2 Pro | MiniMax M2.7 |
|---|---|---|---|---|---|---|
| GPT-5.4 | 8.2 | 9 | 8 | 9 | 7 | 8 |
| Opus 4.6 | 7.9 | 8 | 7 | 9 | 7.5 | 8 |
| MiMo V2 Pro | 7.4 | 8.5 | 9 | 8 | 7.5 | 4 |
| GLM-5.1 | 6.4 | 6 | 5 | 7 | 7 | 7 |
| Kimi K2.5 | 5.6 | 4 | 6 | 7 | 6 | 5 |
| MiniMax M2.7 | 5.2 | 3 | 5 | 7 | 6 | 5 |
The most impressive thing for me was that the models didn't seem to be biased by their own performance.
All models gave their own output a score that was either very close to, or lower than, their average score. Even when that average was low.
GPT-5.4 is the only one that ranked itself first, but guess what, the consensus also agreed on that.
I'm gonna keep using GLM-5.1 via OpenCode Go for the rest of the month or until I run out of usage, then switch to GPT-5.4 via Codex Plus to see if it's worth the extra $$$, both in terms of quality and quantity.
Though by then I'll probably have 5 other models to test.
Comparing models is peak procrastination. Any of the top models will do just fine.
And I guess GLM and MiMo are now playing in that league at a fraction of the cost.
If you have a clear idea of what you want, you'll need to steer all of them in one way or another. Just pick one and get back to work.