f368c24ad6
The bbox-OCR pipeline mis-paired ~114 vocab tables across the book, the chapter 7 "Other Idioms" image (issue #32) being the most visible.

Three failure modes were collapsing the data:
1) classifier blind to subject pronouns ("yo", "I", etc.)
2) right-then-left OCR reads on 2-col tables
3) Y-cluster drift on multi-line cells in 4-col layouts

Replaced the entire vocab-extraction tier with a Claude vision pass over all 931 vocab images. Output is keyed by image with three classifications:
- pair_table (extract all Spanish↔English pairs)
- reference_only (Spanish-only conjugation tables: no pairs; the UI shows the flat OCR lines as a reference list instead)
- hybrid (some header pairs + reference content beneath; only the genuine pairs become cards)

merge_pdf_into_book.py now picks pair source by priority: llm-vision → bounding-box OCR → block-alternation heuristic.

Numbers (across the whole book):
- mis-oriented tables: 114 → 5
- quarantined cards: 250 → 2
- extracted pairs: 2832 → 4569

textbookDataVersion bumped to 13.

Per-batch agent outputs gitignored under Conjuga/Scripts/textbook/paired_vocab_llm/; only the merged paired_vocab_llm.json (also gitignored) is needed to rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
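The priority order named in the commit message (llm-vision, then bounding-box OCR, then the block-alternation heuristic) could look roughly like the sketch below. This is not the code in merge_pdf_into_book.py: the function name `pick_pairs`, the image ids, and the dictionary shapes are all hypothetical, and it assumes that an image present in a higher-priority source wins even when its pair list is empty (as it would be for a reference_only table).

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]               # (spanish, english)
PairSource = Dict[str, List[Pair]]   # image id -> extracted pairs


def pick_pairs(image_id: str,
               sources: List[Tuple[str, PairSource]]) -> Tuple[str, List[Pair]]:
    """Return (source_name, pairs) from the first source that knows image_id.

    `sources` is ordered from highest to lowest priority. An empty pair
    list still counts as an answer (assumption: a reference_only image
    should not fall through to a lower-priority OCR source).
    """
    for name, by_image in sources:
        if image_id in by_image:
            return name, by_image[image_id]
    return "none", []


# Hypothetical sample data; image ids and pairs are illustrative only.
llm_vision: PairSource = {
    "ch07_img03": [("tener ganas de", "to feel like")],
    "ch02_img01": [],  # reference_only: no pairs
}
bbox_ocr: PairSource = {
    "ch07_img03": [("to feel like", "tener ganas de")],  # mis-oriented pair
    "ch05_img09": [("el libro", "the book")],
}
block_alternation: PairSource = {
    "ch09_img02": [("la casa", "the house")],
}

priority = [
    ("llm-vision", llm_vision),
    ("bbox-ocr", bbox_ocr),
    ("block-alternation", block_alternation),
]

for image in ("ch07_img03", "ch05_img09", "ch09_img02", "ch02_img01"):
    print(image, pick_pairs(image, priority))
```

Keeping the sources in one ordered list means a new extraction tier only needs a new entry at the right position, not another branch.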
57 lines
1.1 KiB
Plaintext
# Xcode
build/
DerivedData/
*.xcodeproj/xcuserdata/
*.xcworkspace/xcuserdata/
*.xcodeproj/project.xcworkspace/xcshareddata/IDEWorkspaceChecks.plist
xcuserdata/

# Swift Package Manager
.build/
.swiftpm/
Packages/

# macOS
.DS_Store
*.swp
*~

# CocoaPods
Pods/

# Secrets / env
.env
*.p12
*.mobileprovision

# Archives
*.xcarchive

# Claude
.claude/

# Reference/research docs (not part of the app)
screens/
conjugato/
conjuu-es/

# Video scraping pipeline (kept locally for reruns, not committed)
scrape/
*.webm
*.mp4
*.mkv

# Third-party textbook sources (not redistributable)
*.pdf
*.epub
epub_extract/

# Textbook extraction artifacts: regenerate locally via run_pipeline.sh.
# Scripts are committed; their generated outputs are not.
Conjuga/Scripts/textbook/*.json
Conjuga/Scripts/textbook/review.html
Conjuga/Scripts/textbook/paired_vocab_llm/
# Note: the app-bundle copies (Conjuga/Conjuga/textbook_{data,vocab}.json)
# ARE committed so `xcodebuild` works on a fresh clone without first running
# the pipeline. They're regenerated from the scripts when content changes.
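The closing note says the script outputs are ignored while the app-bundle copies stay committed. The sketch below is a rough way to see why the `Conjuga/Scripts/textbook/*.json` pattern does not catch the bundle copies. It only approximates gitignore matching with `PurePosixPath.match` (per-component globbing, matched from the right), so it is not git's actual algorithm; `git check-ignore -v <path>` gives the authoritative answer. The helper name `is_ignored` is hypothetical.

```python
from pathlib import PurePosixPath

# Pattern copied from the ignore file above.
IGNORE_PATTERN = "Conjuga/Scripts/textbook/*.json"


def is_ignored(path: str) -> bool:
    """Approximate the gitignore pattern with per-component glob matching.

    `*` here does not cross a `/`, which mirrors gitignore behaviour for
    this particular pattern; other gitignore rules are not modelled.
    """
    return PurePosixPath(path).match(IGNORE_PATTERN)


# Generated pipeline output: matched by the pattern.
print(is_ignored("Conjuga/Scripts/textbook/paired_vocab_llm.json"))  # True
# App-bundle copy: different directory, so not matched.
print(is_ignored("Conjuga/Conjuga/textbook_data.json"))              # False
```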