Qwen2.5-0.5B Chess LoRA

A LoRA adapter that teaches Qwen2.5-0.5B-Instruct-4bit to solve chess puzzles. Trained in ~1 minute on a Mac using MLX.

Found autonomously by an AI agent running 67 experiments in 2 hours.

Does it actually work?

We validated on 100 fresh puzzles (zero overlap with training data) across all 4 combinations:

|            | No legal moves in prompt         | With legal moves in prompt       |
|------------|----------------------------------|----------------------------------|
| Base model | 0% exact, 0% legal, 100% invalid | 9% exact, 49% legal, 51% invalid |
| + LoRA     | 0% exact, 4% legal, 96% invalid  | 12% exact, 96% legal, 4% invalid |

What the LoRA actually learned:

  • Format following — reliably picks one move from the provided legal move list (4% invalid vs 51% for base)
  • Slightly better move selection — 12% vs 9% exact accuracy when both models see the legal moves
  • No chess knowledge — without the legal move list in the prompt, it's just as lost as the base model (0% exact)

The 22% accuracy on the original 59-puzzle test set was inflated by the tiny sample size. On 100 fresh puzzles, the real improvement is modest: 9% to 12% exact.
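The exact/legal/invalid breakdown above is straightforward to reproduce from raw model outputs. A minimal sketch (the function names are illustrative, not part of the project's tooling):

```python
def classify(prediction: str, best_move: str, legal_moves: list[str]) -> str:
    """Classify one model answer against the puzzle's solution.

    exact   -> matches the puzzle's best move
    legal   -> in the legal move list, but not the best move
    invalid -> not a legal move at all (or empty output)
    """
    tokens = prediction.strip().split()
    move = tokens[0] if tokens else ""
    if move == best_move:
        return "exact"
    if move in legal_moves:
        return "legal"
    return "invalid"


def summarize(results: list[str]) -> dict[str, float]:
    """Percentage of each outcome over a batch of puzzles."""
    n = len(results)
    return {k: 100 * results.count(k) / n for k in ("exact", "legal", "invalid")}
```

Note that only the first whitespace-separated token is scored, so trailing chatter after the move is ignored rather than penalized.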

Validation by rating band (100 fresh puzzles, with legal moves)

| Rating    | Correct | Total | Accuracy |
|-----------|---------|-------|----------|
| < 1400    | 9       | 45    | 20.0%    |
| 1400-1700 | 0       | 19    | 0.0%     |
| 1700-2000 | 3       | 16    | 18.8%    |
| 2000+     | 0       | 13    | 0.0%     |

How it was made

An AI agent (pi) ran 67 autonomous experiments in a single session, trying different hyperparameters, data curation strategies, model variants, and training techniques. It kept what improved the metric and discarded the rest. The whole process took about 2 hours on an Apple Silicon Mac.

0% -----> 8.47% -----> 11.86% -----> 13.56% -----> 15.25% -----> 16.95% -----> 18.64% -----> 22.03%
baseline   +legal       +steps        +SAN          +rating       +fewer        +band          +LR
           moves        512           notation      curation      steps         1000-1900      6e-5

(Accuracy numbers above are on the original 59-puzzle test set used during the search. See validation section above for held-out results.)

Usage

pip install mlx-lm

# Generate a move
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
  --adapter-path victor/qwen2.5-0.5B-chess-lora \
  --max-tokens 8 \
  --prompt "You are a careful chess tactician. Choose the strongest move for the side to move.
Return only one move from the legal move list in SAN notation like Nf3 or Qh5+.
Side to move: White
FEN: r1bqkb1r/pppppppp/2n2n2/8/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3
Legal moves: Ba6 Bb5 Bc4 Bd3 Be2 Na3 Nc3 Qe2 Ke2 a3 a4 b3 b4 c3 c4 d3 d4 e5 g3 g4 h3 h4 Rg1
Best move:"

Training config

| Parameter      | Value                                    |
|----------------|------------------------------------------|
| Base model     | mlx-community/Qwen2.5-0.5B-Instruct-4bit |
| Method         | LoRA (rank 8, scale 20.0)                |
| Layers         | Top 8                                    |
| Steps          | 256                                      |
| Batch size     | 2                                        |
| Learning rate  | 6e-5                                     |
| Max seq length | 384                                      |
| Optimizer      | Adam                                     |
| Prompt masking | Yes                                      |
| Train examples | 384 (Lichess puzzles, rating 1000-1900)  |
| Valid examples | 64 (Lichess puzzles, rating 1000-1900)   |
| Test examples  | 59 (Lichess puzzles, rating 1000-2400)   |
| Training time  | ~50 seconds on Apple Silicon             |

Experiment log

67 experiments were run autonomously. Here are the key milestones:

| #  | What changed                               | Accuracy (59-puzzle test) | Kept? |
|----|--------------------------------------------|---------------------------|-------|
| 1  | Baseline: FEN-only prompt                  | 0%                        | yes   |
| 4  | Added legal move lists + longer sequences  | 8.47%                     | yes   |
| 5  | Doubled training to 512 steps              | 11.86%                    | yes   |
| 9  | Switched from UCI to SAN notation          | 13.56%                    | yes   |
| 31 | Curated training data to rating 1000-2000  | 15.25%                    | yes   |
| 34 | Narrowed to 1000-1800, reduced to 256 steps| 16.95%                    | yes   |
| 37 | Refined band to 1000-1900                  | 18.64%                    | yes   |
| 47 | Bumped learning rate from 5e-5 to 6e-5     | 22.03%                    | yes   |

The remaining 59 experiments tried: ASCII board diagrams, more/fewer LoRA layers, different optimizers (AdamW, Muon), DoRA, Qwen3-0.6B, raw-text training, gradient accumulation, batch size changes, LR schedules, curriculum learning, data mixing, checkpoint selection, and various other tweaks. None beat the configuration above.

What actually mattered

  1. Providing legal moves in the prompt (0% -> 8.47%) - The model can't play chess without seeing what moves are available
  2. SAN notation over UCI (11.86% -> 13.56%) - Shorter, more natural chess language
  3. Data curation by rating (13.56% -> 18.64%) - Training on easier puzzles (1000-1900) generalized better to the full test range (1000-2400)
  4. Learning rate tuning (18.64% -> 22.03%) - A small bump from 5e-5 to 6e-5

What didn't matter

  • Model size (Qwen3-0.6B was worse)
  • Optimizer choice (Adam, AdamW, Muon all similar)
  • LoRA method (DoRA no better than LoRA)
  • More training steps (overfitting after 256-512)
  • More training data (smaller curated subsets won)
  • Fancy LR schedules (constant rate was best)

Limitations

  • The model doesn't know chess. Without legal moves in the prompt, it outputs garbage. It's a format follower, not a chess engine
  • The 22% accuracy is on 59 examples; on 100 fresh puzzles it drops to 12%
  • Only predicts one move per position (not useful for full games)
  • Trained and evaluated on Lichess puzzles only
  • The legal move list in the prompt does the heavy lifting for the 96% legal-move rate

Session data

The full autonomous experiment session logs are available at victor/autoresearch-chess-sessions.

License

Apache 2.0, same as the base model.
