Books — capture <li> vocab bullets the extractor was silently dropping
extract_epub.py was walking <p> only, but every "Vocabulario" section in the Olly Richards EPUB lives inside <ul><li>...</li></ul>. The heading made it through but the entries didn't: 680 vocab lines across 24 sections in this book were missing from the bundled JSON.

Audit (text-node owner by closest block ancestor) confirmed <li> is the only silent drop: 5,260 nodes in <p>, 1,960 in <li>, 0 anywhere else. No <h1>-<h6>, tables, or blockquotes in this EPUB at all.

Fix: walk find_all(["p", "li"]) in document order so bullet entries slot in right after their "Vocabulario" / list heading.

Re-extracted (2,646 → 3,326 paragraphs), re-translated all 118 jobs in parallel Claude Code subagents. The translate_chapters.py prompt template now tells subagents to keep bilingual `palabra = meaning` lines verbatim, since both sides already coexist on the line.

Bumped bookDataVersion to 2 so refreshBooksDataIfNeeded re-seeds. Verified in simulator: all 13 chapter row sizes grew (e.g. ch6 18,295→20,951 chars).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
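Reviewer note: the audit in the message (text-node owner by closest block ancestor) could be reproduced roughly as below. This is a standard-library sketch, not the script used for this commit. The names `OwnerAudit`, `audit`, `BLOCK_TAGS` and `VOID_TAGS` are hypothetical, and the set of tags treated as block-level is an assumption.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed list of block-level tags that can "own" a text node.
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "blockquote", "td", "th", "div"}
# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class OwnerAudit(HTMLParser):
    """Count non-empty text nodes by their closest block-level ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.counts = Counter()  # owner tag -> number of text nodes

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not data.strip():
            return  # whitespace-only nodes are not counted
        owner = next(
            (t for t in reversed(self.stack) if t in BLOCK_TAGS), "(none)"
        )
        self.counts[owner] += 1


def audit(html):
    """Return a Counter mapping owner tag to text-node count."""
    parser = OwnerAudit()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Any owner other than `p` that shows a non-zero count is content a `<p>`-only walk would drop. Text nested as `<li><p>...</p></li>` is attributed to `p` here, which is also the case where walking both tags would emit the same text twice.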
@@ -9,7 +9,7 @@ actor DataLoader {
     static let textbookDataVersion = 14
     static let textbookDataKey = "textbookDataVersion"
 
-    static let bookDataVersion = 1
+    static let bookDataVersion = 2 // bump: vocab <li> bullets now extracted
     static let bookDataKey = "bookDataVersion"
 
     /// Quick check: does the DB need seeding or course data refresh?
+2426 -1066: file diff suppressed because it is too large
@@ -141,14 +141,13 @@ def _extract_paragraphs(zf: zipfile.ZipFile, zip_path: str) -> list[str]:
         return []
     soup = BeautifulSoup(html, "lxml")
     paragraphs: list[str] = []
-    for p in soup.find_all("p"):
-        # Drop nav-anchor wrappers that contain no real text.
-        text = _normalise(p.get_text(" ", strip=True))
+    # Walk <p> and <li> in document order so vocab bullets (rendered as
+    # <ul><li>...</li></ul> in this EPUB family) are kept alongside narrative
+    # paragraphs. `<li>` rolls up its inline <b>/<span> children via get_text.
+    for el in soup.find_all(["p", "li"]):
+        text = _normalise(el.get_text(" ", strip=True))
         if not text:
             continue
-        # Drop chapter-heading paragraphs that only echo the title — handled
-        # separately by the TOC. Heuristic: very short paragraph that's just
-        # numbers + the chapter title pattern. Keep everything else.
         paragraphs.append(text)
     return paragraphs
 
@@ -57,6 +57,10 @@ Notes for translation quality:
   ONLY if it reads more naturally; otherwise keep them as em-dashes.
 - Do NOT add explanatory parentheticals; the in-app dictionary handles
   per-word lookup.
+- Some paragraphs are vocabulary entries shaped like `palabra = meaning`
+  (e.g. `alto = tall`, `el dueño = owner`). Keep these verbatim — both the
+  Spanish word and its English gloss already coexist on the line, and the
+  bilingual reader UI shows the same line in both views.
 
 Write the output as JSON with shape:
 {{"jobId": "<the jobId from the input>", "paragraphsEN": [...]}}
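Reviewer note: the new prompt rule asks subagents to keep `palabra = meaning` lines verbatim, which could be spot-checked after translation with something like the sketch below. It is not part of this commit. The helper names, the regex for what counts as a vocab line, and the assumption that the source paragraphs and `paragraphsEN` are parallel lists of equal length are all hypothetical.

```python
import re

# Assumed shape of a vocab entry: a short term, " = ", then a gloss.
VOCAB_LINE = re.compile(r"^\S.{0,60}? = \S.*$")


def is_vocab_line(text):
    """True if the paragraph looks like `palabra = meaning`."""
    return bool(VOCAB_LINE.match(text))


def changed_vocab_lines(source, translated):
    """Indices where a vocab-shaped source paragraph was not kept verbatim."""
    if len(source) != len(translated):
        raise ValueError("paragraph counts differ")
    return [
        i
        for i, (src, out) in enumerate(zip(source, translated))
        if is_vocab_line(src) and src != out
    ]
```

An empty result means every vocab-shaped paragraph survived unchanged. A non-empty one lists the indices worth re-running or fixing by hand.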