# Qwen2.5-0.5B Chess LoRA
A LoRA adapter that teaches Qwen2.5-0.5B-Instruct-4bit to solve chess puzzles. Trained in ~1 minute on a Mac using MLX.
Found autonomously by an AI agent running 67 experiments in 2 hours.
## Does it actually work?
We validated on 100 fresh puzzles (zero overlap with training data) across all four combinations of model (base vs. +LoRA) and prompt format (with vs. without legal moves):
| | No legal moves in prompt | With legal moves in prompt |
|---|---|---|
| Base model | 0% exact, 0% legal, 100% invalid | 9% exact, 49% legal, 51% invalid |
| + LoRA | 0% exact, 4% legal, 96% invalid | 12% exact, 96% legal, 4% invalid |
What the LoRA actually learned:
- Format following — reliably picks one move from the provided legal move list (4% invalid vs 51% for base)
- Slightly better move selection — 12% vs 9% exact accuracy when both models see the legal moves
- No chess knowledge — without the legal move list in the prompt, it's just as lost as the base model (0% exact)
The 22% accuracy on the original 59-puzzle test set was inflated by the tiny sample size. On 100 fresh puzzles, the real improvement is modest: 9% to 12% exact.
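For concreteness, the exact/legal/invalid split can be computed in a few lines of Python. In the tables above, "legal" includes exact hits (the legal and invalid columns sum to 100%). This is an illustrative sketch with made-up field names, not the actual evaluation script:

```python
# Illustrative scorer for the exact/legal/invalid metrics above.
# Each record is assumed to hold the model's raw output, the puzzle's
# best move, and the legal-move list in SAN. Field names are made up.

def score(records):
    """Return (exact, legal, invalid) as percentages; legal includes exact."""
    exact = legal = 0
    for rec in records:
        tokens = rec["output"].strip().split()
        move = tokens[0] if tokens else ""
        if move in rec["legal_moves"]:
            legal += 1                      # any legal move counts here
            if move == rec["best_move"]:
                exact += 1                  # ...and exact hits also count here
    n = len(records)
    return 100 * exact / n, 100 * legal / n, 100 * (n - legal) / n

# Example: two puzzles, one solved exactly, one answered with an illegal move
records = [
    {"output": "Nf3", "best_move": "Nf3", "legal_moves": ["Nf3", "e4"]},
    {"output": "Qz9+", "best_move": "e4", "legal_moves": ["Nf3", "e4"]},
]
print(score(records))  # (50.0, 50.0, 50.0)
```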
## Validation by rating band (100 fresh puzzles, with legal moves)
| Rating | Correct | Total | Accuracy |
|---|---|---|---|
| < 1400 | 9 | 45 | 20.0% |
| 1400-1700 | 0 | 19 | 0.0% |
| 1700-2000 | 3 | 16 | 18.8% |
| 2000+ | 0 | 13 | 0.0% |
## How it was made
An AI agent (pi) ran 67 autonomous experiments in a single session, trying different hyperparameters, data curation strategies, model variants, and training techniques. It kept what improved the metric and discarded the rest. The whole process took about 2 hours on an Apple Silicon Mac.
```
0%      baseline (FEN-only prompt)
8.47%   + legal moves in prompt
11.86%  + 512 training steps
13.56%  + SAN notation
15.25%  + rating curation
16.95%  + fewer steps
18.64%  + rating band 1000-1900
22.03%  + LR 6e-5
```
(Accuracy numbers above are on the original 59-puzzle test set used during the search. See validation section above for held-out results.)
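The keep-it-if-the-metric-improves loop behind that progression is simple greedy hill climbing. A minimal sketch, with a toy `evaluate` function standing in for a real train-and-eval run:

```python
# Greedy keep-if-better search loop, as described above.
# evaluate() is a toy stand-in for a real training + evaluation cycle.

def hill_climb(evaluate, candidates, base_config, rounds=3):
    best_cfg, best_score = dict(base_config), evaluate(base_config)
    for _ in range(rounds):                  # sweep the candidate tweaks a few times
        for key, value in candidates:
            trial = {**best_cfg, key: value}
            trial_score = evaluate(trial)
            if trial_score > best_score:     # keep only strict improvements
                best_cfg, best_score = trial, trial_score
    return best_cfg, best_score

def evaluate(cfg):                           # toy metric: prefers lr=6e-5 and SAN notation
    lr_score = {5e-5: 5, 6e-5: 6, 1e-4: 0}[cfg["lr"]]
    return lr_score + (5 if cfg["san"] else 0)

candidates = [("lr", 6e-5), ("lr", 1e-4), ("san", True)]
best, metric = hill_climb(evaluate, candidates, {"lr": 5e-5, "san": False})
print(best, metric)  # {'lr': 6e-05, 'san': True} 11
```

The real agent's candidate space covered data curation, notation, and optimizer choices as well as hyperparameters, but the control flow is the same.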
## Usage

```bash
pip install mlx-lm
```

Generate a move:

```bash
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
  --adapter-path victor/qwen2.5-0.5B-chess-lora \
  --max-tokens 8 \
  --prompt "You are a careful chess tactician. Choose the strongest move for the side to move.
Return only one move from the legal move list in SAN notation like Nf3 or Qh5+.
Side to move: White
FEN: r1bqkb1r/pppppppp/2n2n2/8/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3
Legal moves: Ba6 Bb5 Bc4 Bd3 Be2 Na3 Nc3 Qe2 Ke2 a3 a4 b3 b4 c3 c4 d3 d4 e5 g3 g4 h3 h4 Rg1
Best move:"
```
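The prompt follows a fixed template, so it can be assembled programmatically. A small sketch (the helper name is hypothetical; in practice the legal-move list would come from a chess library):

```python
# Hypothetical helper that assembles the prompt format used above from a
# FEN string and a SAN legal-move list.

def build_prompt(fen, legal_moves):
    side = "White" if fen.split()[1] == "w" else "Black"  # field 2 of a FEN is the side to move
    return (
        "You are a careful chess tactician. Choose the strongest move for the side to move.\n"
        "Return only one move from the legal move list in SAN notation like Nf3 or Qh5+.\n"
        f"Side to move: {side}\n"
        f"FEN: {fen}\n"
        f"Legal moves: {' '.join(legal_moves)}\n"
        "Best move:"
    )

fen = "r1bqkb1r/pppppppp/2n2n2/8/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"
print(build_prompt(fen, ["Ba6", "Bb5", "Bc4"]))
```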
## Training config
| Parameter | Value |
|---|---|
| Base model | mlx-community/Qwen2.5-0.5B-Instruct-4bit |
| Method | LoRA (rank 8, scale 20.0) |
| Layers | Top 8 |
| Steps | 256 |
| Batch size | 2 |
| Learning rate | 6e-5 |
| Max seq length | 384 |
| Optimizer | Adam |
| Prompt masking | Yes |
| Train examples | 384 (Lichess puzzles, rating 1000-1900) |
| Valid examples | 64 (Lichess puzzles, rating 1000-1900) |
| Test examples | 59 (Lichess puzzles, rating 1000-2400) |
| Training time | ~50 seconds on Apple Silicon |
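mlx-lm's LoRA trainer can read these settings from a YAML config file (passed via `-c`). The fragment below is a sketch of the table in that format; key names follow mlx-lm's example LoRA config and may differ between versions, and the `data/` path is a placeholder:

```yaml
# Sketch only: verify key names against your installed mlx-lm version.
# data/ is a placeholder directory containing train.jsonl and valid.jsonl.
model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
train: true
data: data/
batch_size: 2
iters: 256
learning_rate: 6e-5
max_seq_length: 384
num_layers: 8          # apply LoRA to the top 8 transformer layers
lora_parameters:
  rank: 8
  scale: 20.0
```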
## Experiment log
67 experiments were run autonomously. Here are the key milestones:
| # | What changed | Accuracy (59-puzzle test) | Kept? |
|---|---|---|---|
| 1 | Baseline: FEN-only prompt | 0% | yes |
| 4 | Added legal move lists + longer sequences | 8.47% | yes |
| 5 | Doubled training to 512 steps | 11.86% | yes |
| 9 | Switched from UCI to SAN notation | 13.56% | yes |
| 31 | Curated training data to rating 1000-2000 | 15.25% | yes |
| 34 | Narrowed to 1000-1800, reduced to 256 steps | 16.95% | yes |
| 37 | Refined band to 1000-1900 | 18.64% | yes |
| 47 | Bumped learning rate from 5e-5 to 6e-5 | 22.03% | yes |
The remaining 59 experiments tried: ASCII board diagrams, more/fewer LoRA layers, different optimizers (AdamW, Muon), DoRA, Qwen3-0.6B, raw-text training, gradient accumulation, batch size changes, LR schedules, curriculum learning, data mixing, checkpoint selection, and various other tweaks. None beat the configuration above.
## What actually mattered
- Providing legal moves in the prompt (0% -> 8.47%) - The model can't play chess without seeing what moves are available
- SAN notation over UCI (11.86% -> 13.56%) - Shorter, more natural chess language
- Data curation by rating (13.56% -> 18.64%) - Training on easier puzzles (1000-1900) generalized better to the full test range (1000-2400)
- Learning rate tuning (18.64% -> 22.03%) - A small bump from 5e-5 to 6e-5
## What didn't matter
- Model size (Qwen3-0.6B was worse)
- Optimizer choice (Adam, AdamW, Muon all similar)
- LoRA method (DoRA no better than LoRA)
- More training steps (overfitting after 256-512)
- More training data (smaller curated subsets won)
- Fancy LR schedules (constant rate was best)
## Limitations
- The model doesn't know chess. Without legal moves in the prompt, it outputs garbage. It's a format follower, not a chess engine
- The 22% accuracy is on 59 examples; on 100 fresh puzzles it drops to 12%
- Only predicts one move per position (not useful for full games)
- Trained and evaluated on Lichess puzzles only
- The legal move list in the prompt does the heavy lifting for the 96% legal-move rate
## Session data
The full autonomous experiment session logs are available at victor/autoresearch-chess-sessions.
## License
Apache 2.0, same as the base model.