7da98d786c
Add SRS-driven noun and adjective flashcards modeled on the existing verb flashcard flow:

- SharedModels/Lexeme: catalog of non-verb vocab, frequency-ranked, with gender for nouns and optional example sentences. Seeded from a bundled vocab_lexemes.json built by Scripts/vocab/build_lexemes.py, which joins frequency.csv + es-en.data from a pinned doozan/spanish_data commit (CC-BY-SA: hermitdave/FrequencyWords + Wiktionary). 1,449 nouns and 600 adjectives, each with Wiktionary-sourced gender and (where available) an example sentence with English translation.
- LexemeReviewCard + LexemeReviewStore: cloud-synced SM-2 SRS, keyed by partOfSpeech + lexemeId + drillMode so future drill modes can coexist.
- LexemeSessionQueue + LexemePool: parallel to VocabSessionQueue; fresh cards sort by frequency rank.
- LexemeStudyGroup: cloud-synced resumable session per (partOfSpeech, drillMode).
- NounFlashcardPracticeView + AdjectiveFlashcardPracticeView: same flow as VocabFlashcardPracticeView (English prompt → tap to reveal Spanish → Again/Hard/Good/Easy). Nouns reveal with their article (la taza, el problema) so gender is taught alongside meaning, not as a separate quiz. Example sentence shown when present.

CEFR-style level toggles:

- LexemeLevel enum (A1/A2/B1/B2/C1+) derived from frequencyRank with standard Spanish-frequency-dictionary cutoffs (250/500/1000/2000).
- UserProgress.selectedLexemeLevels: cloud-synced multi-select, defaults to A1+A2 on first launch.
- SettingsView gains a "Vocabulary Levels" section with five toggles; the existing "Levels" section is renamed "Verb Levels" for clarity.
- Due SRS cards always surface regardless of toggles. Disabling a level only stops new cards from that band entering the pool.

PracticeView gets "Nouns" and "Adjectives" rows under "Books".

DataLoader: new lexemeDataVersion gate that re-seeds the Lexeme table from vocab_lexemes.json independent of book seeding.

project.yml lists the new JSON resource and the existing book_olly-vol2.json (which the previous build was silently excluding because xcodegen rewrote the project from project.yml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
63 lines
1.8 KiB
Markdown
# Vocab catalog build

`build_lexemes.py` produces `Conjuga/vocab_lexemes.json`, the bundled catalog
of frequency-ranked Spanish nouns and adjectives that powers the Noun /
Adjective flashcard study modes.

## Run

```sh
python3 build_lexemes.py
```

The script downloads `frequency.csv` + `es-en.data` from a pinned commit of
[`doozan/spanish_data`](https://github.com/doozan/spanish_data), caches them
under `.cache/<commit>/`, joins them, and writes the JSON. Re-running is
fast: only the join step runs after the first download.

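The commit-keyed cache described above can be sketched roughly as follows. This is a minimal illustration, not the code in `build_lexemes.py`: the helper names (`cached_path`, `fetch_cached`), the `RAW_BASE` URL layout, and the placeholder commit value are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve

# Placeholder only: the real value is DOOZAN_COMMIT at the top of build_lexemes.py.
DOOZAN_COMMIT = "0000000"
# Assumed URL layout (files served from the repo root); not taken from the script.
RAW_BASE = "https://raw.githubusercontent.com/doozan/spanish_data"


def http_download(url: str, dest: Path) -> None:
    """Default downloader: fetch url and write it to dest."""
    urlretrieve(url, str(dest))


def cached_path(cache_root: Path, commit: str, filename: str) -> Path:
    """Location of a source file inside the per-commit cache directory."""
    return cache_root / commit / filename


def fetch_cached(
    cache_root: Path,
    commit: str,
    filename: str,
    download: Callable[[str, Path], None] = http_download,
) -> Path:
    """Return the cached file, downloading it only when it is missing."""
    dest = cached_path(cache_root, commit, filename)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        download(f"{RAW_BASE}/{commit}/{filename}", dest)
    return dest
```

Because the commit is part of the path, a second call with the same commit returns immediately, while a different commit lands in a new directory and triggers a fresh download.
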
Override defaults:

```sh
python3 build_lexemes.py --max-nouns 3000 --max-adjectives 1000
python3 build_lexemes.py --output /tmp/vocab.json
```

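For orientation, flags like these are typically wired up with `argparse`. The sketch below is hypothetical: only the three flag names come from the commands above, while the default caps and the default output path are illustrative guesses, not the values used by `build_lexemes.py`.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the three override flags shown above."""
    parser = argparse.ArgumentParser(description="Build the bundled vocab catalog.")
    # Default caps are illustrative placeholders, not the script's real defaults.
    parser.add_argument("--max-nouns", type=int, default=1500,
                        help="maximum number of nouns to keep")
    parser.add_argument("--max-adjectives", type=int, default=600,
                        help="maximum number of adjectives to keep")
    # Assumed default location; the real path relative to the script may differ.
    parser.add_argument("--output", type=Path,
                        default=Path("Conjuga/vocab_lexemes.json"),
                        help="where to write the JSON catalog")
    return parser.parse_args(argv)
```

Note that `argparse` exposes `--max-nouns` as `args.max_nouns`, with the hyphen turned into an underscore.
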
## Data sources & attribution

All datasets are CC-licensed; the bundled catalog inherits CC-BY-SA. Credit
in the app's About screen must read:

> Vocabulary data: Wiktionary (CC-BY-SA), OpenSubtitles via FrequencyWords
> (CC-BY-SA 3.0).

- **`frequency.csv`**: derived from
  [hermitdave/FrequencyWords](https://github.com/hermitdave/FrequencyWords)
  (OpenSubtitles corpus), packaged by doozan. License: CC-BY-SA 3.0.
- **`es-en.data`**: Spanish→English Wiktionary export in the
  [`enwiktionary_wordlist`](https://github.com/doozan/enwiktionary_wordlist)
  format. License: CC-BY-SA.

The pinned doozan commit is at the top of `build_lexemes.py`
(`DOOZAN_COMMIT`). Bump it to refresh; the cache key includes the commit, so
a bump triggers a fresh download instead of reusing old data.

## Output shape

```json
[
  {
    "baseForm": "casa",
    "english": "house",
    "partOfSpeech": "noun",
    "gender": "f",
    "frequencyRank": 142,
    "exampleES": "La casa es grande",
    "exampleEN": "The house is big"
  },
  ...
]
```

Sorted by `frequencyRank` ascending so the fresh-card path in `LexemePool`
surfaces the most useful words first.

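A quick sanity check on the generated file might look like the sketch below. It is not part of the build script; `check_catalog` and `REQUIRED_KEYS` are hypothetical names, and the choice of which keys are mandatory (examples optional, `gender` expected on nouns) is an assumption based on the shape shown above.

```python
# Keys assumed to be present on every entry; the example sentences are
# treated as optional because not every word has one.
REQUIRED_KEYS = {"baseForm", "english", "partOfSpeech", "frequencyRank"}


def check_catalog(entries: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means no issues found."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        # Assumption: every noun carries a gender so the app can show its article.
        if entry.get("partOfSpeech") == "noun" and "gender" not in entry:
            problems.append(f"entry {i}: noun without gender")
    # The fresh-card path relies on ascending frequencyRank order.
    ranks = [e["frequencyRank"] for e in entries if "frequencyRank" in e]
    if ranks != sorted(ranks):
        problems.append("entries are not sorted by frequencyRank ascending")
    return problems
```

To run it against a build, load the file with `json.load` and pass the resulting list to `check_catalog`; any returned strings point at entries worth inspecting before bundling.
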