The bot lost $160, and the audit was the real artifact

The bot lost $160 over 138 settled trades on Kalshi single-degree temperature brackets. That is the headline, and I am leading with it because the temptation in a write-up like this is to bury the loss under process language. The bot lost money. What I did next is the part worth reading.

I treated it as an investigation. The live strategy and credentials are not in the repository. What is in there is the part that turned out to be valuable: the 138-trade dataset, an evaluation framework that runs walk-forward backtests against it, and the written record of three wrong diagnoses before the right one. I committed the trade data so every number is reproducible. If I am going to show people a loss, they should be able to check my math.

The arithmetic that ended the debate

I kept reaching for model comparisons, and the answer was never a model. It was arithmetic. The realized reward-to-risk on these bets, meaning average win divided by average loss, was about 0.51. The average NO win was a few dollars; the average NO loss was roughly double that. At that payout structure the break-even win rate is 1 / (1 + 0.51), or about 66 percent. The bot's actual win rate was 54 percent.

You cannot make money betting NO on near-coin-flip events when the payout demands 66 percent and you hit 54. No probability model fixes that, because the problem is not the model. Single-degree brackets are narrower than the forecast can resolve, and Kalshi prices them efficiently. This was a market-selection problem wearing a model problem's clothes, and I spent real time fooled by the costume.
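For anyone who wants to check the break-even figure, here is a minimal sketch of the arithmetic. It assumes only the two realized numbers quoted above (reward-to-risk of 0.51 and a 54 percent win rate); the function names are mine, chosen for illustration, and are not taken from the repository. Expected value per trade is p*W - (1 - p)*L for average win W and average loss L, and setting that to zero gives p = 1 / (1 + W/L).

```python
def break_even_win_rate(reward_to_risk: float) -> float:
    """Win rate at which expected profit per trade is zero.

    With average win W and average loss L, EV = p*W - (1 - p)*L.
    Setting EV = 0 gives p = L / (W + L) = 1 / (1 + W/L).
    """
    if reward_to_risk <= 0:
        raise ValueError("reward_to_risk must be positive")
    return 1.0 / (1.0 + reward_to_risk)


def expected_value_per_unit_risk(win_rate: float, reward_to_risk: float) -> float:
    """Expected profit per trade, measured in units of the average loss."""
    return win_rate * reward_to_risk - (1.0 - win_rate)


if __name__ == "__main__":
    r = 0.51  # realized reward-to-risk quoted in the text
    p = 0.54  # realized win rate quoted in the text
    print(f"break-even win rate: {break_even_win_rate(r):.1%}")
    print(f"EV per unit of average loss: {expected_value_per_unit_risk(p, r):+.3f}")
```

The second number is negative at these inputs, which is the whole story: every trade gives up a fraction of the average loss in expectation, whatever model picked it.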

How a logging gap cost me three passes

Here is the embarrassing and instructive part. The INSERT INTO trades statement listed its columns explicitly and quietly omitted three diagnostic fields. Every trade row therefore showed a model count of one and a null ensemble probability. Anyone looking at that table, me included, concluded the 31-member ensemble pipeline was dead.

It was not. The ensemble ran correctly on every trade. The columns were simply never written. That single gap sent three consecutive audit passes down wrong roads. Pass one blamed a dead proxy node and shipped a fix for a non-problem. Pass two repeated the dead-ensemble error. Pass three over-corrected to a sigma-floor theory that the data also disproves. I kept the honest record of that oscillation in the repo, because a post-mortem that hides the wrong turns is not a post-mortem. It is marketing.
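The failure mode is easy to reproduce in miniature. The sketch below assumes a SQLite table and uses made-up column names (model_count, ensemble_prob, ensemble_sigma); the real schema and the three omitted fields may well be named differently. It shows an explicit-column INSERT silently falling back to defaults, plus a small check that would have flagged the gap by comparing the INSERT's column list against the table definition.

```python
import sqlite3

# Hypothetical schema: column names are illustrative, not the bot's real ones.
SCHEMA = """
CREATE TABLE trades (
    id INTEGER PRIMARY KEY,
    ticker TEXT NOT NULL,
    side TEXT NOT NULL,
    price_cents INTEGER NOT NULL,
    model_count INTEGER DEFAULT 1,
    ensemble_prob REAL,
    ensemble_sigma REAL
)
"""

# The INSERT names its columns explicitly and leaves out the diagnostics.
INSERT_COLUMNS = ("ticker", "side", "price_cents")


def unwritten_columns(conn, table, insert_columns):
    """Return table columns the INSERT never writes (ignoring the primary key)."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    table_columns = {row[1] for row in rows if row[5] == 0}
    return sorted(table_columns - set(insert_columns))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
placeholders = ", ".join("?" for _ in INSERT_COLUMNS)
conn.execute(
    f"INSERT INTO trades ({', '.join(INSERT_COLUMNS)}) VALUES ({placeholders})",
    ("HYPOTHETICAL-TICKER", "NO", 62),
)

# The row looks like a dead ensemble: default model count, null probability.
row = conn.execute("SELECT model_count, ensemble_prob FROM trades").fetchone()

# The guard lists every column the INSERT silently skips.
missing = unwritten_columns(conn, "trades", INSERT_COLUMNS)
```

Running a check like unwritten_columns at startup, and refusing to trade when it returns anything, turns a silent logging gap into a loud one.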

What I built so it cannot happen again

The lasting output is the evaluation framework. It is standard-library only, reads the committed data directly, and runs a chronological walk-forward comparing two probability models at four edge thresholds, training each on prior trades only. It reports Brier scores, win rates, and per-model profit.
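To make the walk-forward idea concrete, here is a stripped-down, standard-library sketch. It is not the framework in the repository: the Trade fields, the base_rate_model baseline, and the profit rule (one NO contract per bet, fees ignored) are all assumptions for illustration. What it does show is the one property that matters, which is that the model at step i only ever sees trades before step i.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trade:
    # Hypothetical record layout; the committed dataset's fields may differ.
    no_price: float  # price paid for one NO contract, in dollars (0 to 1)
    no_won: bool     # True if the NO side settled in the money


# A model takes the trades seen so far and returns P(NO wins) for the next one.
Model = Callable[[Sequence[Trade]], float]


def base_rate_model(history: Sequence[Trade]) -> float:
    """Predict the historical NO win frequency, or 0.5 with no history."""
    if not history:
        return 0.5
    return sum(t.no_won for t in history) / len(history)


def walk_forward(trades, model, edge_threshold, min_history=1):
    """Score a model chronologically, fitting only on trades before each step."""
    sq_errors, wins, taken, profit = [], 0, 0, 0.0
    for i in range(min_history, len(trades)):
        trade = trades[i]
        p = model(trades[:i])  # prior trades only, never the current one
        sq_errors.append((p - float(trade.no_won)) ** 2)
        if p - trade.no_price >= edge_threshold:  # bet only with enough edge
            taken += 1
            wins += trade.no_won
            # One contract per bet, fees ignored (a simplifying assumption).
            profit += (1.0 - trade.no_price) if trade.no_won else -trade.no_price
    return {
        "brier": sum(sq_errors) / len(sq_errors) if sq_errors else None,
        "trades_taken": taken,
        "win_rate": wins / taken if taken else None,
        "profit": profit,
    }
```

Comparing two models at four thresholds is then a nested loop over (model, threshold) pairs calling walk_forward on the same chronologically sorted list.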

The piece I would build first next time is the gate harness: a pre-committed pass-fail verdict on shadow-mode performance, with the criteria written down before I look at any numbers. The cheapest audit is the one you run before any capital is at risk. I learned that by skipping it and paying $160 for the lesson, which is a fair price for a lesson I will not forget.
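A gate harness can be very small. The sketch below is one possible shape, not the one in the repository: the GateCriteria fields and their default thresholds are placeholders I made up for illustration. The point is that the criteria live in a frozen object committed before the shadow run, and the verdict function only compares results against it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCriteria:
    # Illustrative thresholds; pick and commit real ones before any shadow run.
    min_trades: int = 100
    min_win_rate_margin: float = 0.03  # required margin over break-even
    max_brier: float = 0.25            # what always predicting 0.5 would score


def gate_verdict(criteria, n_trades, win_rate, reward_to_risk, brier):
    """Return (passed, failures) for shadow-mode results against fixed criteria."""
    break_even = 1.0 / (1.0 + reward_to_risk)
    required = break_even + criteria.min_win_rate_margin
    failures = []
    if n_trades < criteria.min_trades:
        failures.append(f"only {n_trades} trades, need {criteria.min_trades}")
    if win_rate < required:
        failures.append(f"win rate {win_rate:.1%} is below required {required:.1%}")
    if brier > criteria.max_brier:
        failures.append(f"Brier {brier:.3f} exceeds {criteria.max_brier:.3f}")
    return (not failures, failures)


# Feeding in the live numbers from this post as if they were shadow results.
# The Brier value here is a stand-in, not a figure from the dataset.
passed, failures = gate_verdict(
    GateCriteria(), n_trades=138, win_rate=0.54, reward_to_risk=0.51, brier=0.25
)
```

With those inputs the gate fails on win rate alone, which is the verdict I would have had for free if I had run it first.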