Spanish

admin/Spanish

Fork 0

Commit Graph

Author	SHA1	Message	Date
Trey T	f368c24ad6	Fixes #32 — LLM vision pass for vocab pairs, fixes scrambled English/Spanish The bbox-OCR pipeline mis-paired ~114 vocab tables across the book — the chapter 7 "Other Idioms" image (issue #32) being the most visible. Three failure modes were collapsing the data: 1) classifier blind to subject pronouns ("yo", "I", etc.) 2) right-then-left OCR reads on 2-col tables 3) Y-cluster drift on multi-line cells in 4-col layouts Replaced the entire vocab-extraction tier with a Claude vision pass over all 931 vocab images. Output is keyed by image with three classifications: - pair_table (extract all Spanish↔English pairs) - reference_only (Spanish-only conjugation tables — no pairs, UI shows the flat OCR lines as a reference list instead) - hybrid (some header pairs + reference content beneath; only the genuine pairs become cards) merge_pdf_into_book.py now picks pair source by priority: llm-vision → bounding-box OCR → block-alternation heuristic. Numbers (across the whole book): - mis-oriented tables: 114 → 5 - quarantined cards: 250 → 2 - extracted pairs: 2832 → 4569 textbookDataVersion bumped to 13. Per-batch agent outputs gitignored under Conjuga/Scripts/textbook/paired_vocab_llm/ — only the merged paired_vocab_llm.json (also gitignored) is needed to rebuild. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 18:48:04 -05:00
Trey T	5f90a01314	Render textbook vocab as paired Spanish→English grid Previously the chapter reader showed vocab tables as a flat list of OCR lines — because Vision reads columns top-to-bottom, the Spanish column appeared as one block followed by the English column, making pairings illegible. Now every vocab table renders as a 2-column grid with Spanish on the left and English on the right. Supporting changes: - New ocr_all_vocab.swift: bounding-box OCR over all 931 vocab images, cluster lines into rows by Y-coordinate, split rows by largest X-gap, detect 2- / 3- / 4-column layouts automatically. ~2800 pairs extracted this pass vs ~1100 from the old block-alternation heuristic. - merge_pdf_into_book.py now prefers bounding-box pairs when present, falls back to the heuristic, embeds the resulting pairs as vocab_table.cards in book.json. - DataLoader passes cards through to TextbookBlock on seed. - TextbookChapterView renders cards via SwiftUI Grid (2 cols). - fix_vocab.py quarantine rule relaxed — only mis-pairs where both sides are clearly the same language are removed. "unknown" sides stay (bbox pipeline already oriented them correctly). Textbook card count jumps from 1044 → 3118 active pairs. textbookDataVersion bumped to 9. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:58:41 -05:00
Trey T	63dfc5e41a	Add textbook reader, exercise grading, stem-change toggle, extraction pipeline Major changes: - Textbook UI: chapter list, reader, and interactive exercise view (keyboard + Apple Pencil) surfaced under the Course tab. 30 chapters, 251 exercises. - Stem-change conjugation toggle on Week 4 flashcard decks (E-IE, E-I, O-UE). Uses existing VerbForm + IrregularSpan data to render highlighted present tense conjugations inline. - Deterministic on-device answer grader with partial credit (correct / close for accent-stripped or single-char-typo / wrong). 11 unit tests cover it. - SharedModels: TextbookChapter (local), TextbookExerciseAttempt (cloud- synced), AnswerGrader helpers. Bumped schema. - DataLoader: textbook seeder (version 8) + refresh helpers that preserve LanGo course decks when textbook data is re-seeded. - Local extraction pipeline in Conjuga/Scripts/textbook/ — XHTML chapter parser, answer-key parser, macOS Vision image OCR + PDF page OCR, merger, NSSpellChecker validator, language-aware auto-fixer, and repair pass that re-pairs quarantined vocab rows using bounding-box coordinates. - UI test target (ConjugaUITests) with three tests: end-to-end textbook flow, all-chapters screenshot audit, and stem-change toggle verification. Generated textbook content (textbook_data.json, textbook_vocab.json) and third-party source files are gitignored — re-run Scripts/textbook/run_pipeline.sh locally to regenerate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:12:55 -05:00

Author

SHA1

Message

Date

Trey T

f368c24ad6

Fixes #32 — LLM vision pass for vocab pairs, fixes scrambled English/Spanish

The bbox-OCR pipeline mis-paired ~114 vocab tables across the book — the
chapter 7 "Other Idioms" image (issue #32) being the most visible.
Three failure modes were collapsing the data:
  1) classifier blind to subject pronouns ("yo", "I", etc.)
  2) right-then-left OCR reads on 2-col tables
  3) Y-cluster drift on multi-line cells in 4-col layouts

Replaced the entire vocab-extraction tier with a Claude vision pass over
all 931 vocab images. Output is keyed by image with three classifications:
  - pair_table       (extract all Spanish↔English pairs)
  - reference_only   (Spanish-only conjugation tables — no pairs, UI shows
                      the flat OCR lines as a reference list instead)
  - hybrid           (some header pairs + reference content beneath; only
                      the genuine pairs become cards)

merge_pdf_into_book.py now picks pair source by priority:
  llm-vision → bounding-box OCR → block-alternation heuristic.

Numbers (across the whole book):
  - mis-oriented tables: 114 → 5
  - quarantined cards:   250 → 2
  - extracted pairs:     2832 → 4569

textbookDataVersion bumped to 13. Per-batch agent outputs gitignored
under Conjuga/Scripts/textbook/paired_vocab_llm/ — only the merged
paired_vocab_llm.json (also gitignored) is needed to rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-03 18:48:04 -05:00

Trey T

5f90a01314

Render textbook vocab as paired Spanish→English grid

Previously the chapter reader showed vocab tables as a flat list of OCR
lines — because Vision reads columns top-to-bottom, the Spanish column
appeared as one block followed by the English column, making pairings
illegible.

Now every vocab table renders as a 2-column grid with Spanish on the
left and English on the right. Supporting changes:

- New ocr_all_vocab.swift: bounding-box OCR over all 931 vocab images,
  cluster lines into rows by Y-coordinate, split rows by largest X-gap,
  detect 2- / 3- / 4-column layouts automatically. ~2800 pairs extracted
  this pass vs ~1100 from the old block-alternation heuristic.
- merge_pdf_into_book.py now prefers bounding-box pairs when present,
  falls back to the heuristic, embeds the resulting pairs as
  vocab_table.cards in book.json.
- DataLoader passes cards through to TextbookBlock on seed.
- TextbookChapterView renders cards via SwiftUI Grid (2 cols).
- fix_vocab.py quarantine rule relaxed — only mis-pairs where both
  sides are clearly the same language are removed. "unknown" sides
  stay (bbox pipeline already oriented them correctly).

Textbook card count jumps from 1044 → 3118 active pairs.
textbookDataVersion bumped to 9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-19 15:58:41 -05:00

Trey T

63dfc5e41a

Add textbook reader, exercise grading, stem-change toggle, extraction pipeline

Major changes:
- Textbook UI: chapter list, reader, and interactive exercise view (keyboard
  + Apple Pencil) surfaced under the Course tab. 30 chapters, 251 exercises.
- Stem-change conjugation toggle on Week 4 flashcard decks (E-IE, E-I, O-UE).
  Uses existing VerbForm + IrregularSpan data to render highlighted present
  tense conjugations inline.
- Deterministic on-device answer grader with partial credit (correct / close
  for accent-stripped or single-char-typo / wrong). 11 unit tests cover it.
- SharedModels: TextbookChapter (local), TextbookExerciseAttempt (cloud-
  synced), AnswerGrader helpers. Bumped schema.
- DataLoader: textbook seeder (version 8) + refresh helpers that preserve
  LanGo course decks when textbook data is re-seeded.
- Local extraction pipeline in Conjuga/Scripts/textbook/ — XHTML chapter
  parser, answer-key parser, macOS Vision image OCR + PDF page OCR, merger,
  NSSpellChecker validator, language-aware auto-fixer, and repair pass that
  re-pairs quarantined vocab rows using bounding-box coordinates.
- UI test target (ConjugaUITests) with three tests: end-to-end textbook
  flow, all-chapters screenshot audit, and stem-change toggle verification.

Generated textbook content (textbook_data.json, textbook_vocab.json) and
third-party source files are gitignored — re-run Scripts/textbook/run_pipeline.sh
locally to regenerate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-19 15:12:55 -05:00

3 Commits