Books — capture <li> vocab bullets the extractor was silently dropping

extract_epub.py was walking <p> only, but every "Vocabulario" section in the Olly Richards EPUB lives inside <ul><li>...</li></ul>. That meant the heading made it through but the entries didn't: 680 vocab lines across 24 sections in this book were missing from the bundled JSON.

Audit (text-node owner by closest block ancestor) confirmed <li> is the only silent drop: 5,260 nodes in <p>, 1,960 in <li>, 0 anywhere else. No <h1>-<h6>, tables, or blockquotes in this EPUB at all.

Fix: walk find_all(["p", "li"]) in document order so bullet entries slot in right after their "Vocabulario" / list heading. Re-extracted (2,646 → 3,326 paragraphs), re-translated all 118 jobs in parallel Claude Code subagents.

translate_chapters.py prompt template now tells subagents to keep bilingual `palabra = meaning` lines verbatim, since both sides already coexist on the line.

Bumped bookDataVersion to 2 so refreshBooksDataIfNeeded re-seeds.

Verified in simulator: all 13 chapter row sizes grew (e.g. ch6 18,295→20,951 chars).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 10:10:34 -05:00
parent 09e49bda2c
commit 05a367fdbe
4 changed files with 2436 additions and 1073 deletions
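
For readers without the EPUB at hand, the silent drop and the fix from the commit message can be reproduced on a toy fragment. This is only a sketch, not the code in extract_epub.py: it uses the standard library's xml.etree.ElementTree as a stand-in for the find_all(["p", "li"]) call, and the names extract_paragraphs, local_name, and SAMPLE are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop an XML namespace prefix such as "{http://www.w3.org/1999/xhtml}p".
    return tag.rsplit("}", 1)[-1]


def extract_paragraphs(xhtml, block_tags=("p", "li")):
    """Return the text of every block_tags element, in document order."""
    root = ET.fromstring(xhtml)
    paragraphs = []
    # Element.iter() walks the tree depth-first, i.e. in document order,
    # so <li> entries land right after the heading that precedes them.
    # Caveat: a <p> nested inside an <li> would be emitted twice here.
    for el in root.iter():
        if local_name(el.tag) in block_tags:
            # Join nested inline text (<b>, <i>, ...) and collapse whitespace.
            text = " ".join("".join(el.itertext()).split())
            if text:
                paragraphs.append(text)
    return paragraphs


# Hypothetical fragment shaped like the "Vocabulario" sections described above.
SAMPLE = """<body>
  <p>El hombre es muy <b>alto</b>.</p>
  <p>Vocabulario</p>
  <ul>
    <li>alto = tall</li>
    <li>el dueño = owner</li>
  </ul>
  <p>Fin del capítulo.</p>
</body>"""

old = extract_paragraphs(SAMPLE, block_tags=("p",))  # previous behaviour
new = extract_paragraphs(SAMPLE)                      # behaviour after the fix

print(len(old), len(new))
```

With block_tags=("p",) the "Vocabulario" heading survives but its two entries vanish, which is the failure mode described in the commit message. Adding "li" restores them in place rather than appending them at the end.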
@@ -57,6 +57,10 @@ Notes for translation quality:
  ONLY if it reads more naturally; otherwise keep them as em-dashes.
 - Do NOT add explanatory parentheticals; the in-app dictionary handles
  per-word lookup.
+- Some paragraphs are vocabulary entries shaped like `palabra = meaning`
+  (e.g. `alto = tall`, `el dueño = owner`). Keep these verbatim — both the
+  Spanish word and its English gloss already coexist on the line, and the
+  bilingual reader UI shows the same line in both views.

 Write the output as JSON with shape:
    {{"jobId": "<the jobId from the input>", "paragraphsEN": [...]}}