# Qwen2.5-0.5B Chess LoRA
A LoRA adapter that teaches Qwen2.5-0.5B-Instruct-4bit to solve chess puzzles. Trained in ~1 minute on a Mac using MLX.
Found autonomously by an AI agent running 67 experiments in 2 hours.
## Does it actually work?
We validated on 100 fresh puzzles (zero overlap with training data) across all four combinations of model (base vs. +LoRA) and prompt format (with vs. without legal moves):
| | No legal moves in prompt | With legal moves in prompt |
|---|---|---|
| Base model | 0% exact, 0% legal, 100% invalid | 9% exact, 49% legal, 51% invalid |
| + LoRA | 0% exact, 4% legal, 96% invalid | 12% exact, 96% legal, 4% invalid |
What the LoRA actually learned:
- Format following — reliably picks one move from the provided legal move list (4% invalid vs 51% for base)
- Slightly better move selection — 12% vs 9% exact accuracy when both models see the legal moves
- No chess knowledge — without the legal move list in the prompt, it's just as lost as the base model (0% exact)
The 22% accuracy on the original 59-puzzle test set was inflated by the tiny sample size. On 100 fresh puzzles, the real improvement is modest: 9% to 12% exact.
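For concreteness, the exact/legal/invalid split can be computed in a few lines of Python. In the tables above, "legal" includes exact hits (the legal and invalid columns sum to 100%). This is an illustrative sketch with made-up field names, not the actual evaluation script:

```python
# Illustrative scorer for the exact/legal/invalid metrics above.
# Each record is assumed to hold the model's raw output, the puzzle's
# best move, and the legal-move list in SAN. Field names are made up.

def score(records):
    """Return (exact, legal, invalid) as percentages; legal includes exact."""
    exact = legal = 0
    for rec in records:
        tokens = rec["output"].strip().split()
        move = tokens[0] if tokens else ""
        if move in rec["legal_moves"]:
            legal += 1                      # any legal move counts here
            if move == rec["best_move"]:
                exact += 1                  # ...and exact hits also count here
    n = len(records)
    return 100 * exact / n, 100 * legal / n, 100 * (n - legal) / n

# Example: two puzzles, one solved exactly, one answered with an illegal move
records = [
    {"output": "Nf3", "best_move": "Nf3", "legal_moves": ["Nf3", "e4"]},
    {"output": "Qz9+", "best_move": "e4", "legal_moves": ["Nf3", "e4"]},
]
print(score(records))  # (50.0, 50.0, 50.0)
```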
## Validation by rating band (100 fresh puzzles, with legal moves)
| Rating | Correct | Total | Accuracy |
|---|---|---|---|
| < 1400 | 9 | 45 | 20.0% |
| 1400-1700 | 0 | 19 | 0.0% |
| 1700-2000 | 3 | 16 | 18.8% |
| 2000+ | 0 | 13 | 0.0% |
## How it was made
An AI agent (pi) ran 67 autonomous experiments in a single session, trying different hyperparameters, data curation strategies, model variants, and training techniques. It kept what improved the metric and discarded the rest. The whole process took about 2 hours on an Apple Silicon Mac.
```
0%      baseline (FEN-only prompt)
8.47%   + legal moves in prompt
11.86%  + 512 training steps
13.56%  + SAN notation
15.25%  + rating curation
16.95%  + fewer steps
18.64%  + rating band 1000-1900
22.03%  + LR 6e-5
```
(Accuracy numbers above are on the original 59-puzzle test set used during the search. See validation section above for held-out results.)
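The keep-it-if-the-metric-improves loop behind that progression is simple greedy hill climbing. A minimal sketch, with a toy `evaluate` function standing in for a real train-and-eval run:

```python
# Greedy keep-if-better search loop, as described above.
# evaluate() is a toy stand-in for a real training + evaluation cycle.

def hill_climb(evaluate, candidates, base_config, rounds=3):
    best_cfg, best_score = dict(base_config), evaluate(base_config)
    for _ in range(rounds):                  # sweep the candidate tweaks a few times
        for key, value in candidates:
            trial = {**best_cfg, key: value}
            trial_score = evaluate(trial)
            if trial_score > best_score:     # keep only strict improvements
                best_cfg, best_score = trial, trial_score
    return best_cfg, best_score

def evaluate(cfg):                           # toy metric: prefers lr=6e-5 and SAN notation
    lr_score = {5e-5: 5, 6e-5: 6, 1e-4: 0}[cfg["lr"]]
    return lr_score + (5 if cfg["san"] else 0)

candidates = [("lr", 6e-5), ("lr", 1e-4), ("san", True)]
best, metric = hill_climb(evaluate, candidates, {"lr": 5e-5, "san": False})
print(best, metric)  # {'lr': 6e-05, 'san': True} 11
```

The real agent's candidate space covered data curation, notation, and optimizer choices as well as hyperparameters, but the control flow is the same.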
## Usage

```bash
pip install mlx-lm
```

Generate a move:

```bash
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
  --adapter-path victor/qwen2.5-0.5B-chess-lora \
  --max-tokens 8 \
  --prompt "You are a careful chess tactician. Choose the strongest move for the side to move.
Return only one move from the legal move list in SAN notation like Nf3 or Qh5+.
Side to move: White
FEN: r1bqkb1r/pppppppp/2n2n2/8/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3
Legal moves: Ba6 Bb5 Bc4 Bd3 Be2 Na3 Nc3 Qe2 Ke2 a3 a4 b3 b4 c3 c4 d3 d4 e5 g3 g4 h3 h4 Rg1
Best move:"
```
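The prompt follows a fixed template, so it can be assembled programmatically. A small sketch (the helper name is hypothetical; in practice the legal-move list would come from a chess library):

```python
# Hypothetical helper that assembles the prompt format used above from a
# FEN string and a SAN legal-move list.

def build_prompt(fen, legal_moves):
    side = "White" if fen.split()[1] == "w" else "Black"  # field 2 of a FEN is the side to move
    return (
        "You are a careful chess tactician. Choose the strongest move for the side to move.\n"
        "Return only one move from the legal move list in SAN notation like Nf3 or Qh5+.\n"
        f"Side to move: {side}\n"
        f"FEN: {fen}\n"
        f"Legal moves: {' '.join(legal_moves)}\n"
        "Best move:"
    )

fen = "r1bqkb1r/pppppppp/2n2n2/8/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"
print(build_prompt(fen, ["Ba6", "Bb5", "Bc4"]))
```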
## Training config
| Parameter | Value |
|---|---|
| Base model | mlx-community/Qwen2.5-0.5B-Instruct-4bit |
| Method | LoRA (rank 8, scale 20.0) |
| Layers | Top 8 |
| Steps | 256 |
| Batch size | 2 |
| Learning rate | 6e-5 |
| Max seq length | 384 |
| Optimizer | Adam |
| Prompt masking | Yes |
| Train examples | 384 (Lichess puzzles, rating 1000-1900) |
| Valid examples | 64 (Lichess puzzles, rating 1000-1900) |
| Test examples | 59 (Lichess puzzles, rating 1000-2400) |
| Training time | ~50 seconds on Apple Silicon |
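mlx-lm's LoRA trainer can read these settings from a YAML config file (passed via `-c`). The fragment below is a sketch of the table in that format; key names follow mlx-lm's example LoRA config and may differ between versions, and the `data/` path is a placeholder:

```yaml
# Sketch only: verify key names against your installed mlx-lm version.
# data/ is a placeholder directory containing train.jsonl and valid.jsonl.
model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
train: true
data: data/
batch_size: 2
iters: 256
learning_rate: 6e-5
max_seq_length: 384
num_layers: 8          # apply LoRA to the top 8 transformer layers
lora_parameters:
  rank: 8
  scale: 20.0
```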
## Experiment log
67 experiments were run autonomously. Here are the key milestones:
| # | What changed | Accuracy (59-puzzle test) | Kept? |
|---|---|---|---|
| 1 | Baseline: FEN-only prompt | 0% | yes |
| 4 | Added legal move lists + longer sequences | 8.47% | yes |
| 5 | Doubled training to 512 steps | 11.86% | yes |
| 9 | Switched from UCI to SAN notation | 13.56% | yes |
| 31 | Curated training data to rating 1000-2000 | 15.25% | yes |
| 34 | Narrowed to 1000-1800, reduced to 256 steps | 16.95% | yes |
| 37 | Refined band to 1000-1900 | 18.64% | yes |
| 47 | Bumped learning rate from 5e-5 to 6e-5 | 22.03% | yes |
The remaining 59 experiments tried: ASCII board diagrams, more/fewer LoRA layers, different optimizers (AdamW, Muon), DoRA, Qwen3-0.6B, raw-text training, gradient accumulation, batch size changes, LR schedules, curriculum learning, data mixing, checkpoint selection, and various other tweaks. None beat the configuration above.
## What actually mattered
- Providing legal moves in the prompt (0% -> 8.47%) - The model can't play chess without seeing what moves are available
- SAN notation over UCI (11.86% -> 13.56%) - Shorter, more natural chess language
- Data curation by rating (13.56% -> 18.64%) - Training on easier puzzles (1000-1900) generalized better to the full test range (1000-2400)
- Learning rate tuning (18.64% -> 22.03%) - A small bump from 5e-5 to 6e-5
## What didn't matter
- Model size (Qwen3-0.6B was worse)
- Optimizer choice (Adam, AdamW, Muon all similar)
- LoRA method (DoRA no better than LoRA)
- More training steps (overfitting after 256-512)
- More training data (smaller curated subsets won)
- Fancy LR schedules (constant rate was best)
## Limitations
- The model doesn't know chess. Without legal moves in the prompt, it outputs garbage. It's a format follower, not a chess engine
- The 22% accuracy is on 59 examples; on 100 fresh puzzles it drops to 12%
- Only predicts one move per position (not useful for full games)
- Trained and evaluated on Lichess puzzles only
- The legal move list in the prompt does the heavy lifting for the 96% legal-move rate
## Session data
The full autonomous experiment session logs are available at victor/autoresearch-chess-sessions.
## License
Apache 2.0, same as the base model.