Books — capture <li> vocab bullets the extractor was silently dropping

extract_epub.py was walking <p> only, but every "Vocabulario" section in the Olly Richards EPUB lives inside <ul><li>...</li></ul>. That meant the heading made it through but the entries didn't: 680 vocab lines across 24 sections in this book were missing from the bundled JSON.

Audit (text-node owner by closest block ancestor) confirmed <li> is the only silent drop: 5,260 nodes in <p>, 1,960 in <li>, 0 anywhere else. No <h1>-<h6>, tables, or blockquotes in this EPUB at all.

Fix: walk find_all(["p", "li"]) in document order so bullet entries slot in right after their "Vocabulario" / list heading. Re-extracted (2,646 → 3,326 paragraphs), re-translated all 118 jobs in parallel Claude Code subagents.

translate_chapters.py prompt template now tells subagents to keep bilingual `palabra = meaning` lines verbatim, since both sides already coexist on the line.

Bumped bookDataVersion to 2 so refreshBooksDataIfNeeded re-seeds. Verified in simulator: all 13 chapter row sizes grew (e.g. ch6 18,295→20,951 chars).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 10:10:34 -05:00
parent 09e49bda2c
commit 05a367fdbe
4 changed files with 2436 additions and 1073 deletions
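The audit mentioned in the commit message (text-node owner by closest block ancestor) is not part of this diff. A minimal sketch of how such a count could be done is below. It uses the standard-library `html.parser` rather than BeautifulSoup so it runs without dependencies; the `BLOCK_TAGS` set, the `BlockOwnerAudit` class, and the sample markup are all assumptions made for illustration, not the script that produced the numbers above.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed list of block-level tags; the real audit's list is not shown.
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "td", "th", "blockquote", "div"}


class BlockOwnerAudit(HTMLParser):
    """Count non-empty text nodes by their closest open block ancestor."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []    # stack of currently open block tags
        self.counts = Counter()  # owner tag -> number of text nodes

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.open_blocks.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open block, if there is one.
        if tag in self.open_blocks:
            while self.open_blocks.pop() != tag:
                pass

    def handle_data(self, data):
        # Inline tags such as <b> are not tracked, so their text is
        # attributed to the enclosing block.
        if data.strip():
            owner = self.open_blocks[-1] if self.open_blocks else "(none)"
            self.counts[owner] += 1


def audit(html):
    parser = BlockOwnerAudit()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical sample shaped like a "Vocabulario" section.
sample = (
    "<body><p>Vocabulario</p>"
    "<ul><li><b>alto</b> = tall</li><li>el dueño = owner</li></ul>"
    "<p>Narrative text.</p></body>"
)
counts = audit(sample)
print(dict(counts))
```

On the sample, two text nodes belong to `<p>` and three to `<li>`. Any tag with a non-zero count that the extractor does not walk would be a silent drop of the kind described above.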
@@ -9,7 +9,7 @@ actor DataLoader {
    static let textbookDataVersion = 14
    static let textbookDataKey = "textbookDataVersion"
-    static let bookDataVersion = 1
+    static let bookDataVersion = 2  // bump: vocab <li> bullets now extracted
    static let bookDataKey = "bookDataVersion"
    /// Quick check: does the DB need seeding or course data refresh?
@@ -141,14 +141,13 @@ def _extract_paragraphs(zf: zipfile.ZipFile, zip_path: str) -> list[str]:
        return []
    soup = BeautifulSoup(html, "lxml")
    paragraphs: list[str] = []
-    for p in soup.find_all("p"):
-        # Drop nav-anchor wrappers that contain no real text.
-        text = _normalise(p.get_text(" ", strip=True))
+    # Walk <p> and <li> in document order so vocab bullets (rendered as
+    # <ul><li>...</li></ul> in this EPUB family) are kept alongside narrative
+    # paragraphs. `<li>` rolls up its inline <b>/<span> children via get_text.
+    for el in soup.find_all(["p", "li"]):
+        text = _normalise(el.get_text(" ", strip=True))
        if not text:
            continue
        # Drop chapter-heading paragraphs that only echo the title — handled
        # separately by the TOC. Heuristic: very short paragraph that's just
        # numbers + the chapter title pattern. Keep everything else.
        paragraphs.append(text)
    return paragraphs
@@ -57,6 +57,10 @@ Notes for translation quality:
  ONLY if it reads more naturally; otherwise keep them as em-dashes.
 - Do NOT add explanatory parentheticals; the in-app dictionary handles
  per-word lookup.
+- Some paragraphs are vocabulary entries shaped like `palabra = meaning`
+  (e.g. `alto = tall`, `el dueño = owner`). Keep these verbatim — both the
+  Spanish word and its English gloss already coexist on the line, and the
+  bilingual reader UI shows the same line in both views.
 Write the output as JSON with shape:
    {{"jobId": "<the jobId from the input>", "paragraphsEN": [...]}}
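The verbatim rule for `palabra = meaning` lines could be spot-checked after translation. The sketch below is one possible check, not something this commit contains: `VOCAB_LINE`, `is_vocab_line`, and `changed_vocab_lines` are hypothetical names, the regex is only a heuristic for the entry shape, and it assumes the source paragraphs and the `paragraphsEN` list line up index by index.

```python
import re

# Heuristic shape of a vocab entry: text, " = ", text. This is an assumption;
# a narrative sentence containing " = " would also match.
VOCAB_LINE = re.compile(r"^\S.*? = .*\S$")


def is_vocab_line(paragraph):
    """Return True if the paragraph looks like `palabra = meaning`."""
    return VOCAB_LINE.match(paragraph) is not None


def changed_vocab_lines(source, translated):
    """Return indexes of vocab entries whose translation differs from the source."""
    return [
        i
        for i, (src, out) in enumerate(zip(source, translated))
        if is_vocab_line(src) and src != out
    ]


# Hypothetical job: the third paragraph was altered instead of kept verbatim.
source = ["Vocabulario", "alto = tall", "el dueño = owner", "Era un día normal."]
translated = ["Vocabulary", "alto = tall", "the owner = owner", "It was a normal day."]
flagged = changed_vocab_lines(source, translated)
print(flagged)
```

An empty result means every line matching the heuristic came back unchanged. A non-empty result lists the paragraph indexes worth re-running or fixing by hand.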