Issue #32 cleanup — drop the last 5 mis-oriented vocab pairs

Two small fixes after the LLM-vision pass:

1. merge_pdf_into_book.py: when the LLM classifies an image as 'hybrid' but extracts zero pairs (e.g., a conjugation table whose only English text is in the section header, which the prompt rules exclude), respect that decision instead of falling through to the bbox/heuristic pipeline. Previously, one chapter-2 estar conjugation table generated 4 bad pairs from the heuristic fallback.

2. fix_vocab.py language_score: recognize Spanish present-perfect phrases ('he tenido', 'He andado por este pueblo') as Spanish. The classifier was treating the haber auxiliaries 'he'/'has'/'ha' as English words (such as the subject pronoun 'he'), producing false-positive mis-orientation flags on 4 chapter-15/20/23 present-perfect example tables.

Result: mis-oriented vocab pairs across the book go from 5 → 0. textbookDataVersion bumped to 14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
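The two decisions described in this message can be sketched in isolation as follows. This is a minimal illustration, not the code from the commit: the helper names `starts_with_present_perfect` and `should_skip_cards`, the lowercasing step, and the `"vocab"` kind used in the usage note are assumptions made for the example. Only `HABER_FORMS`, the participle endings, and the `reference_only`/`hybrid` condition are taken from the diff.

```python
# Forms of "haber" used as the present-perfect auxiliary (from the diff).
HABER_FORMS = {"he", "has", "ha", "hemos", "habéis", "han"}

# Participle endings: regular -ado/-ido plus irregular -to/-cho
# ("visto", "dicho"). The diff also lists "sto" and "esto", which
# the "to" suffix already covers.
PARTICIPLE_ENDINGS = ("ado", "ido", "to", "cho")


def starts_with_present_perfect(s: str) -> bool:
    """Return True if s opens with a haber form followed by a participle.

    Hypothetical helper; lowercasing the input is an assumption here.
    """
    words = s.lower().split()
    if len(words) < 2:
        return False
    first = words[0].strip(",.;:")
    second = words[1].strip(",.;:")
    # str.endswith accepts a tuple and matches any of the suffixes.
    return first in HABER_FORMS and second.endswith(PARTICIPLE_ENDINGS)


def should_skip_cards(llm_kind: str, llm_pairs: list) -> bool:
    """Return True when the LLM verdict means no cards should be produced.

    Mirrors the condition in the diff: reference-only images, or hybrid
    images for which the LLM extracted zero pairs.
    """
    return llm_kind == "reference_only" or (llm_kind == "hybrid" and not llm_pairs)
```

With these definitions, `starts_with_present_perfect("He andado por este pueblo")` is true and `starts_with_present_perfect("he runs")` is false, while `should_skip_cards("hybrid", [])` is true and a non-empty pair list for a hybrid image is not skipped. Because the check is purely suffix-based, an English phrase such as "has to go" also matches (the second word "to" ends in "to"), so this remains a heuristic rather than a reliable language detector.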
2026-05-03 18:52:53 -05:00
parent f368c24ad6
commit 05a0cc0d17
5 changed files with 21 additions and 223 deletions
@@ -46,6 +46,9 @@ SPANISH_ARTICLES = {"el", "la", "los", "las", "un", "una", "unos", "unas"}
 ENGLISH_STARTERS = {"the", "a", "an", "to", "my", "his", "her", "our", "their"}


+HABER_FORMS = {"he", "has", "ha", "hemos", "habéis", "han"}
+
+
 def language_score(s: str) -> "tuple[int, int]":
    """Return (es_score, en_score) for a string."""
    es = 0
@@ -56,9 +59,17 @@ def language_score(s: str) -> "tuple[int, int]":
    if not words:
        return (es, en)
    first = words[0].strip(",.;:")
-    if first in SPANISH_ARTICLES:
+    second = words[1].strip(",.;:") if len(words) > 1 else ""
+    # Spanish present-perfect ("he tenido", "Ha andado") starts with a haber
+    # form plus a past participle (regular -ado/-ido, irregular -to/-cho).
+    # Check it before the bare-pronoun check so "he" isn't read as English.
+    if first in HABER_FORMS and (
+        second.endswith(("ado", "ido", "to", "cho", "sto", "esto"))
+    ):
+        es += 3
+    elif first in SPANISH_ARTICLES:
        es += 2
-    if first in ENGLISH_STARTERS:
+    elif first in ENGLISH_STARTERS:
        en += 2
    # Spanish-likely endings on later words
    for w in words:
@@ -307,10 +307,13 @@ def main() -> None:

                # Choose pair source. For reference_only (Spanish-only tables)
                # we deliberately produce no cards — the UI will fall back to
-                # rendering the flat OCR lines as a reference list.
-                if llm_kind == "reference_only":
+                # rendering the flat OCR lines as a reference list. Same for
+                # hybrid images where the LLM determined no genuine pair rows
+                # exist (e.g. estar conjugations with English glosses on the
+                # header row only).
+                if llm_kind == "reference_only" or (llm_kind == "hybrid" and not llm_pairs):
                    cards_for_block = []
-                    pair_source = "llm-reference"
+                    pair_source = "llm-no-pairs"
                elif llm_pairs:
                    cards_for_block = [
                        {"front": p["es"], "back": p["en"]}