Quick Start Guide
Installation
Install xpfcorpus via pip:
pip install xpfcorpus
For YAML support (optional):
pip install "xpfcorpus[yaml]"
Basic Usage
Simple Transcription
from xpfcorpus import Transcriber
# Create a transcriber for Spanish
es = Transcriber("es")
# Transcribe a word
result = es.transcribe("ejemplo")
print(result) # ['e', 'x', 'e', 'm', 'p', 'l', 'o']
Multi-Script Languages
Some languages have multiple scripts and require explicit script selection:
from xpfcorpus import Transcriber
# Tatar has both Latin and Cyrillic scripts
tt_latin = Transcriber("tt", "latin")
tt_cyrillic = Transcriber("tt", "cyrillic")
# Without explicit script, this raises ScriptRequiredError
# tt = Transcriber("tt") # Error!
BCP-47 Language Codes
The package supports BCP-47 style language codes with script and region components:
from xpfcorpus import Transcriber
# Region codes are stripped
es_es = Transcriber("es-ES") # Same as Transcriber("es")
# Script codes are extracted (ISO 15924 format)
yi = Transcriber("yi-Latn") # Uses Latin script
# Lowercase script names also work
tt = Transcriber("tt-cyrillic") # Uses Cyrillic script
# Complex codes: script extracted, region stripped
zh = Transcriber("zh-Hans-CN") # Uses Hans script
# Explicit script parameter overrides the code
yi = Transcriber("yi-Latn", script="hebrew") # Uses Hebrew, not Latin
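The rules above (region dropped, script extracted, explicit parameter wins) can be sketched in plain Python. This is an illustration only, not the package's actual parser: the helper `split_language_tag` and the `ISO_15924_NAMES` table are hypothetical names, and the table covers just the codes needed for these examples.

```python
# Illustrative sketch only; xpfcorpus's real parsing may differ.
# Hypothetical lookup from ISO 15924 codes to spelled-out script names.
ISO_15924_NAMES = {"latn": "latin", "cyrl": "cyrillic", "hebr": "hebrew"}


def split_language_tag(tag, script=None):
    """Return (language, script) for a BCP-47 style tag."""
    language, *subtags = tag.split("-")
    found = None
    for subtag in subtags:
        if len(subtag) == 2 and subtag.isalpha():
            continue  # two-letter region such as "ES" or "CN": dropped
        if len(subtag) == 3 and subtag.isdigit():
            continue  # numeric region code: dropped
        if found is None and subtag.isalpha():
            # Either an ISO 15924 code ("Latn") or a script name ("cyrillic").
            lowered = subtag.lower()
            found = ISO_15924_NAMES.get(lowered, lowered)
    # An explicit script argument overrides whatever the tag contained.
    chosen = script if script is not None else found
    return language.lower(), chosen


print(split_language_tag("es-ES"))                     # ('es', None)
print(split_language_tag("yi-Latn"))                   # ('yi', 'latin')
print(split_language_tag("tt-cyrillic"))               # ('tt', 'cyrillic')
print(split_language_tag("zh-Hans-CN"))                # ('zh', 'hans')
print(split_language_tag("yi-Latn", script="hebrew"))  # ('yi', 'hebrew')
```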
Verification
By default, rules are verified on load. You can skip verification at load time and run it manually later:
from xpfcorpus import Transcriber
# Skip verification for faster loading
es = Transcriber("es", verify=False)
# Manually verify later
passed, errors = es.verify()
if not passed:
    for error in errors:
        print(error)
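When verifying several languages, it can help to turn the `(passed, errors)` pair into a short report rather than printing every error. The sketch below assumes only what the example above shows: `passed` is truthy on success and `errors` is an iterable of printable items. The helper name `verification_report` and the sample error strings are made up for illustration.

```python
def verification_report(passed, errors, limit=5):
    """Format a (passed, errors) pair, as returned by verify(), as text."""
    if passed:
        return "verification passed"
    errors = list(errors)
    lines = [f"verification failed with {len(errors)} error(s):"]
    # Show at most `limit` errors so a long list stays readable.
    lines += [f"  - {error}" for error in errors[:limit]]
    if len(errors) > limit:
        lines.append(f"  ... and {len(errors) - limit} more")
    return "\n".join(lines)


# Placeholder error values; real ones come from es.verify().
print(verification_report(False, ["first error", "second error", "third error"], limit=2))
```

With a real transcriber this would be called as `verification_report(*es.verify())`.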
Available Languages
List all available languages:
from xpfcorpus import available_languages
langs = available_languages()
print(langs["es"])
# {"scripts": ["latin"], "default": "latin"}
print(langs["tt"])
# {"scripts": ["latin", "cyrillic"], "default": None}
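The entries above carry enough information to decide whether a script must be passed explicitly. A minimal sketch, assuming each entry has exactly the `"scripts"` and `"default"` keys shown; `choose_script` is a hypothetical helper, not part of the package.

```python
def choose_script(entry, preferred=None):
    """Pick a script from one available_languages() entry."""
    scripts = entry["scripts"]
    if preferred is not None:
        if preferred not in scripts:
            raise ValueError(f"script {preferred!r} is not one of {scripts}")
        return preferred
    if entry["default"] is not None:
        return entry["default"]
    # No default: the caller has to choose, as with Tatar.
    raise ValueError(f"explicit script required; choose one of {scripts}")


# Entries copied from the output shown above.
es_entry = {"scripts": ["latin"], "default": "latin"}
tt_entry = {"scripts": ["latin", "cyrillic"], "default": None}

print(choose_script(es_entry))              # latin
print(choose_script(tt_entry, "cyrillic"))  # cyrillic
```

The returned name could then be passed as the second argument to `Transcriber`, as in the multi-script example.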
External Data Sources
Load rules from an external YAML file or from legacy .rules/.verify files:
from xpfcorpus import Transcriber
# From YAML file (requires PyYAML)
custom = Transcriber("custom", yaml_file="my_lang.yaml")
# From legacy .rules/.verify files
legacy = Transcriber(
    "test",
    rules_file="es.rules",
    verify_file="es.verify",
)
Command-Line Interface
Transcribe Words
# Transcribe words from command line
xpfcorpus transcribe es ejemplo hola mundo
# From a file (extracts first word from each line)
xpfcorpus transcribe es -f words.txt
# From stdin
printf 'mundo\nbueno\n' | xpfcorpus transcribe es
cat words.txt | xpfcorpus transcribe es -f -
# Combine command-line words and file
xpfcorpus transcribe es ejemplo hola -f more_words.txt
# JSON output
xpfcorpus transcribe es ejemplo --json
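The file option reads only the first word of each line, which suits word lists that carry extra columns such as frequencies. The Python sketch below shows one plausible reading of that behavior. It assumes whitespace-separated columns and that blank lines are skipped; the CLI's exact handling may differ, and `first_words` is a hypothetical helper.

```python
def first_words(lines):
    """Yield the first whitespace-separated token of each non-blank line."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens[0]


# Sample lines with extra columns and a blank line (made-up data).
sample = ["mundo 1523\n", "\n", "bueno\tadj\n"]
print(list(first_words(sample)))  # ['mundo', 'bueno']
```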
List Languages
# List all available languages
xpfcorpus list
# JSON format
xpfcorpus list --json
Export Language Data
# Export language rules as YAML
xpfcorpus export es
# Save to file
xpfcorpus export es -o spanish.yaml
Verify Language Rules
# Verify a single language
xpfcorpus verify es
# Verify with details
xpfcorpus verify es -v
# Verify all languages
xpfcorpus verify --all
Supported Languages
The package includes 201 languages with 203 language/script combinations.
Languages with multiple scripts:
iu (Inuktitut): latin, syllabics
tt (Tatar): latin, cyrillic
Use xpfcorpus list or available_languages() for the full list.
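The language and combination counts can be recomputed from the mapping returned by `available_languages()`. A short sketch, assuming every entry has a `"scripts"` list as in the earlier output; `summarize_languages` is a hypothetical helper and the sample mapping holds only two entries.

```python
def summarize_languages(langs):
    """Return (language count, language/script count, multi-script languages)."""
    combinations = sum(len(info["scripts"]) for info in langs.values())
    multi_script = {
        code: info["scripts"]
        for code, info in langs.items()
        if len(info["scripts"]) > 1
    }
    return len(langs), combinations, multi_script


# Two-entry sample in the shape shown under Available Languages.
sample = {
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": None},
}
print(summarize_languages(sample))
# (2, 3, {'tt': ['latin', 'cyrillic']})
```

Run on the full mapping, the first two values should match the 201 and 203 figures stated above.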
Citation
If you use this package in your research, please cite the XPF Corpus:
@misc{xpf_corpus,
title={The Cross-linguistic Phonological Frequencies (XPF) Corpus},
author={Cohen Priva, Uriel and Gleason, Emily},
year={2022},
url={https://cohenpr-xpf.github.io/XPF/}
}