Issue #32 cleanup — drop the last 5 mis-oriented vocab pairs

Two small fixes after the LLM-vision pass:

1. merge_pdf_into_book.py — when the LLM classifies an image as 'hybrid'
   but extracts zero pairs (e.g., a conjugation table whose only English
   text is on the section header that was excluded by the prompt rules),
   respect that decision instead of falling through to the bbox/heuristic
   pipeline. Previously: 1 chapter-2 estar conjugation table generated
   4 bad pairs from the heuristic fallback.

2. fix_vocab.py language_score — recognize Spanish present-perfect
   ('he tenido', 'He andado por este pueblo') as Spanish. The classifier
   was treating the auxiliary 'he'/'has'/'ha' as English subject pronouns,
   producing false-positive mis-orientation flags on 4 chapter-15/20/23
   present-perfect example tables.
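
A minimal sketch of the present-perfect check from item 2. The real
language_score in fix_vocab.py is not shown in this commit view and may
work differently. The helper name looks_like_present_perfect, the word
sets, and the participle regex are all hypothetical:

```python
import re

# Hypothetical sketch, not the actual fix_vocab.py code.
# Present-tense forms of the Spanish auxiliary "haber".
HABER_PRESENT = {"he", "has", "ha", "hemos", "habéis", "han"}

# A few common irregular participles (assumed list, far from complete).
IRREGULAR_PARTICIPLES = {
    "hecho", "dicho", "visto", "puesto", "escrito",
    "abierto", "vuelto", "roto", "muerto",
}

# Regular participles end in -ado / -ido (or -ído, as in "leído").
PARTICIPLE_RE = re.compile(r"(?:ado|ido|ído)$")


def looks_like_present_perfect(text: str) -> bool:
    """Return True if a haber form is directly followed by a participle."""
    words = re.findall(r"[a-záéíóúüñ]+", text.lower())
    for aux, nxt in zip(words, words[1:]):
        if aux in HABER_PRESENT and (
            nxt in IRREGULAR_PARTICIPLES or PARTICIPLE_RE.search(nxt)
        ):
            return True
    return False


print(looks_like_present_perfect("He andado por este pueblo"))
print(looks_like_present_perfect("he runs fast"))
```

A scorer could call a check like this before counting 'he'/'has'/'ha' as
English evidence. It is a heuristic: an English word ending in -ado or
-ido right after 'he' or 'has' would still be misread as Spanish.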

Result: mis-oriented vocab pairs across the book go from 5 → 0.
textbookDataVersion bumped to 14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey T
2026-05-03 18:52:53 -05:00
parent f368c24ad6
commit 05a0cc0d17
5 changed files with 21 additions and 223 deletions
@@ -307,10 +307,13 @@ def main() -> None:
 # Choose pair source. For reference_only (Spanish-only tables)
 # we deliberately produce no cards — the UI will fall back to
-# rendering the flat OCR lines as a reference list.
-if llm_kind == "reference_only":
+# rendering the flat OCR lines as a reference list. Same for
+# hybrid images where the LLM determined no genuine pair rows
+# exist (e.g. estar conjugations with English glosses on the
+# header row only).
+if llm_kind == "reference_only" or (llm_kind == "hybrid" and not llm_pairs):
     cards_for_block = []
-    pair_source = "llm-reference"
+    pair_source = "llm-no-pairs"
 elif llm_pairs:
     cards_for_block = [
         {"front": p["es"], "back": p["en"]}