Issue #32 cleanup — drop the last 5 mis-oriented vocab pairs

Two small fixes after the LLM-vision pass:

1. merge_pdf_into_book.py: when the LLM classifies an image as 'hybrid'
   but extracts zero pairs (e.g., a conjugation table whose only English
   text is on the section header that was excluded by the prompt rules),
   respect that decision instead of falling through to the bbox/heuristic
   pipeline. Previously: 1 chapter-2 estar conjugation table generated
   4 bad pairs from the heuristic fallback.

2. fix_vocab.py language_score: recognize Spanish present-perfect
   ('he tenido', 'He andado por este pueblo') as Spanish. The classifier
   was treating the auxiliary 'he'/'has'/'ha' as English subject pronouns,
   producing false-positive mis-orientation flags on 4 chapter-15/20/23
   present-perfect example tables.

Result: mis-oriented vocab pairs across the book go from 5 → 0.
textbookDataVersion bumped to 14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
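
The merge_pdf_into_book.py change is not shown in the diff below, so the following is only a sketch of the control flow that fix 1 describes. The function name `choose_pairs`, the dict keys `kind` and `pairs`, and the `heuristic_fallback` callable are all hypothetical; the real script may be organised quite differently.

```python
from typing import Callable, Optional


def choose_pairs(llm_result: Optional[dict],
                 heuristic_fallback: Callable[[], list]) -> list:
    """Pick the vocab pairs to keep for one image (hypothetical helper)."""
    # A 'hybrid' classification with zero pairs is a deliberate answer:
    # keep the empty list rather than treating it as "no result".
    if llm_result is not None and llm_result.get("kind") == "hybrid":
        return llm_result.get("pairs", [])
    # Only when there is no usable LLM decision do we run the
    # bbox/heuristic pipeline.
    return heuristic_fallback()
```

The point of the sketch is the condition: it tests for the presence of a decision, not for the truthiness of the pair list, so an empty list no longer triggers the fallback.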
@@ -46,6 +46,9 @@ SPANISH_ARTICLES = {"el", "la", "los", "las", "un", "una", "unos", "unas"}
 ENGLISH_STARTERS = {"the", "a", "an", "to", "my", "his", "her", "our", "their"}
 
 
+HABER_FORMS = {"he", "has", "ha", "hemos", "habéis", "han"}
+
+
 def language_score(s: str) -> "tuple[int, int]":
     """Return (es_score, en_score) for a string."""
     es = 0
@@ -56,9 +59,17 @@ def language_score(s: str) -> "tuple[int, int]":
     if not words:
         return (es, en)
     first = words[0].strip(",.;:")
-    if first in SPANISH_ARTICLES:
+    second = words[1].strip(",.;:") if len(words) > 1 else ""
+    # Spanish present-perfect ("he tenido", "Ha andado") starts with a haber
+    # form followed by an -ado/-ido past participle. Recognise this pattern
+    # before the bare-pronoun check so "he" isn't mistaken for English "he".
+    if first in HABER_FORMS and (
+        second.endswith(("ado", "ido", "to", "cho", "sto", "esto"))
+    ):
+        es += 3
+    elif first in SPANISH_ARTICLES:
         es += 2
-    if first in ENGLISH_STARTERS:
+    elif first in ENGLISH_STARTERS:
         en += 2
     # Spanish-likely endings on later words
     for w in words:
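
To try the new leading-word rule in isolation, here is a minimal sketch that copies the three word sets and the branch order from the diff. The function name `leading_word_score` and the `PARTICIPLE_ENDINGS` constant are introduced here for illustration, and lowercasing plus whitespace splitting is an assumption, since the lines of `language_score` that build `words` fall outside the hunks shown.

```python
SPANISH_ARTICLES = {"el", "la", "los", "las", "un", "una", "unos", "unas"}
ENGLISH_STARTERS = {"the", "a", "an", "to", "my", "his", "her", "our", "their"}
HABER_FORMS = {"he", "has", "ha", "hemos", "habéis", "han"}
# Same suffix tuple as the diff; str.endswith accepts a tuple of suffixes.
PARTICIPLE_ENDINGS = ("ado", "ido", "to", "cho", "sto", "esto")


def leading_word_score(s: str) -> "tuple[int, int]":
    """Return (es, en) using only the leading-word rules from the diff."""
    es = en = 0
    # Assumption: lowercase and split on whitespace; the real tokenisation
    # in fix_vocab.py is not visible in the hunks above.
    words = s.lower().split()
    if not words:
        return (es, en)
    first = words[0].strip(",.;:")
    second = words[1].strip(",.;:") if len(words) > 1 else ""
    # The haber + participle check runs first, so "he tenido" is scored as
    # Spanish before any English-looking rule can fire.
    if first in HABER_FORMS and second.endswith(PARTICIPLE_ENDINGS):
        es += 3
    elif first in SPANISH_ARTICLES:
        es += 2
    elif first in ENGLISH_STARTERS:
        en += 2
    return (es, en)
```

With these rules, "he tenido" and "ha hecho" score as Spanish, while "la casa" and "the house" behave as before. Two properties of the suffix tuple are worth knowing: "sto" and "esto" are already covered by "to", and an English phrase such as "has to leave" also matches, because the second word "to" itself ends in "to".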