Books — capture <li> vocab bullets the extractor was silently dropping

extract_epub.py was walking <p> only, but every "Vocabulario" section in
the Olly Richards EPUB lives inside <ul><li>...</li></ul>. That meant
the heading made it through but the entries didn't — 680 vocab lines
across 24 sections in this book were missing from the bundled JSON.

Audit (text-node owner by closest block ancestor) confirmed <li> is the
only silent drop: 5,260 nodes in <p>, 1,960 in <li>, 0 anywhere else.
No <h1>-<h6>, tables, or blockquotes in this EPUB at all.
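
For reference, an audit of this kind can be sketched roughly as below. This is
a minimal stand-in that uses the standard-library html.parser rather than the
BeautifulSoup/lxml stack extract_epub.py uses; the names BLOCK_TAGS,
BlockOwnerAudit, and audit_text_owners are hypothetical and not from the repo.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of block-level tags that can "own" a text node (illustrative only).
BLOCK_TAGS = {"p", "li", "div", "blockquote", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Void elements never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockOwnerAudit(HTMLParser):
    """Count non-blank text nodes by their closest block-level ancestor."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []          # currently open tags, outermost first
        self.owners = Counter()  # block tag -> number of text nodes

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not data.strip():
            return
        # Innermost open block tag owns this text node.
        owner = next((t for t in reversed(self.stack) if t in BLOCK_TAGS), "(none)")
        self.owners[owner] += 1


def audit_text_owners(html):
    parser = BlockOwnerAudit()
    parser.feed(html)
    parser.close()
    return parser.owners
```

Any owner other than "p" with a non-zero count is a candidate silent drop for
an extractor that only walks <p>.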

Fix: walk find_all(["p", "li"]) in document order so bullet entries
slot in right after their "Vocabulario" / list heading. Re-extracted
(2,646 → 3,326 paragraphs, +680) and re-translated all 118 jobs with
parallel Claude Code subagents. The translate_chapters.py prompt
template now tells subagents to keep bilingual `palabra = meaning`
lines verbatim, since both sides already coexist on the line.
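
One way to spot-check the verbatim rule after a re-translation is sketched
below. It is not part of this commit: is_vocab_line and altered_vocab_lines are
hypothetical helpers, and the " = " heuristic is an assumption about how vocab
entries are shaped.

```python
def is_vocab_line(text):
    """Heuristic (assumed): 'palabra = meaning' with exactly one ' = ' separator."""
    left, sep, right = text.partition(" = ")
    return bool(sep) and bool(left.strip()) and bool(right.strip()) and "=" not in right


def altered_vocab_lines(paragraphs_es, paragraphs_en):
    """Return indices where a vocab entry was not copied verbatim into the EN output."""
    if len(paragraphs_es) != len(paragraphs_en):
        raise ValueError("paragraph counts differ")
    return [
        i
        for i, (es, en) in enumerate(zip(paragraphs_es, paragraphs_en))
        if is_vocab_line(es) and es != en
    ]
```

An empty result means every line detected as a vocab entry survived
translation unchanged; any returned index points at a job output worth
re-running.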

Bumped bookDataVersion to 2 so refreshBooksDataIfNeeded re-seeds.
Verified in simulator: all 13 chapter row sizes grew (e.g. ch6
18,295→20,951 chars).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey T
2026-05-11 10:10:34 -05:00
parent 09e49bda2c
commit 05a367fdbe
4 changed files with 2436 additions and 1073 deletions
+5 -6
@@ -141,14 +141,13 @@ def _extract_paragraphs(zf: zipfile.ZipFile, zip_path: str) -> list[str]:
         return []
     soup = BeautifulSoup(html, "lxml")
     paragraphs: list[str] = []
-    for p in soup.find_all("p"):
-        # Drop nav-anchor wrappers that contain no real text.
-        text = _normalise(p.get_text(" ", strip=True))
+    # Walk <p> and <li> in document order so vocab bullets (rendered as
+    # <ul><li>...</li></ul> in this EPUB family) are kept alongside narrative
+    # paragraphs. `<li>` rolls up its inline <b>/<span> children via get_text.
+    for el in soup.find_all(["p", "li"]):
+        text = _normalise(el.get_text(" ", strip=True))
         if not text:
             continue
-        # Drop chapter-heading paragraphs that only echo the title — handled
-        # separately by the TOC. Heuristic: very short paragraph that's just
-        # numbers + the chapter title pattern. Keep everything else.
         paragraphs.append(text)
     return paragraphs
@@ -57,6 +57,10 @@ Notes for translation quality:
   ONLY if it reads more naturally; otherwise keep them as em-dashes.
 - Do NOT add explanatory parentheticals; the in-app dictionary handles
   per-word lookup.
+- Some paragraphs are vocabulary entries shaped like `palabra = meaning`
+  (e.g. `alto = tall`, `el dueño = owner`). Keep these verbatim — both the
+  Spanish word and its English gloss already coexist on the line, and the
+  bilingual reader UI shows the same line in both views.
 Write the output as JSON with shape:
 {{"jobId": "<the jobId from the input>", "paragraphsEN": [...]}}