Quick Start Guide

Installation

Install xpfcorpus via pip:

pip install xpfcorpus

For YAML support (optional):

pip install xpfcorpus[yaml]

Basic Usage

Simple Transcription

from xpfcorpus import Transcriber

# Create a transcriber for Spanish
es = Transcriber("es")

# Transcribe a word
result = es.transcribe("ejemplo")
print(result)  # ['e', 'x', 'e', 'm', 'p', 'l', 'o']

Multi-Script Languages

Some languages have multiple scripts and require explicit script selection:

from xpfcorpus import Transcriber

# Tatar has both Latin and Cyrillic scripts
tt_latin = Transcriber("tt", "latin")
tt_cyrillic = Transcriber("tt", "cyrillic")

# Without explicit script, this raises ScriptRequiredError
# tt = Transcriber("tt")  # Error!

BCP-47 Language Codes

The package supports BCP-47 style language codes with script and region components:

from xpfcorpus import Transcriber

# Region codes are stripped
es_es = Transcriber("es-ES")  # Same as Transcriber("es")

# Script codes are extracted (ISO 15924 format)
yi = Transcriber("yi-Latn")   # Uses Latin script

# Lowercase script names also work
tt = Transcriber("tt-cyrillic")  # Uses Cyrillic script

# Complex codes: script extracted, region stripped
zh = Transcriber("zh-Hans-CN")  # Uses Hans script

# Explicit script parameter overrides the code
yi = Transcriber("yi-Latn", script="hebrew")  # Uses Hebrew, not Latin

Verification

By default, rules are verified on load. You can skip verification:

from xpfcorpus import Transcriber

# Skip verification for faster loading
es = Transcriber("es", verify=False)

# Manually verify later
passed, errors = es.verify()
if not passed:
    for error in errors:
        print(error)

Available Languages

List all available languages:

from xpfcorpus import available_languages

langs = available_languages()
print(langs["es"])
# {"scripts": ["latin"], "default": "latin"}

print(langs["tt"])
# {"scripts": ["latin", "cyrillic"], "default": None}

External Data Sources

Load from external YAML or legacy files:

from xpfcorpus import Transcriber

# From YAML file (requires PyYAML)
custom = Transcriber("custom", yaml_file="my_lang.yaml")

# From legacy .rules/.verify files
legacy = Transcriber("test",
                    rules_file="es.rules",
                    verify_file="es.verify")

Command-Line Interface

Transcribe Words

# Transcribe words from command line
xpfcorpus transcribe es ejemplo hola mundo

# From a file (extracts first word from each line)
xpfcorpus transcribe es -f words.txt

# From stdin
echo -e "mundo\nbueno" | xpfcorpus transcribe es
cat words.txt | xpfcorpus transcribe es -f -

# Combine command-line words and file
xpfcorpus transcribe es ejemplo hola -f more_words.txt

# JSON output
xpfcorpus transcribe es ejemplo --json

List Languages

# List all available languages
xpfcorpus list

# JSON format
xpfcorpus list --json

Export Language Data

# Export language rules as YAML
xpfcorpus export es

# Save to file
xpfcorpus export es -o spanish.yaml

Verify Language Rules

# Verify a single language
xpfcorpus verify es

# Verify with details
xpfcorpus verify es -v

# Verify all languages
xpfcorpus verify --all

Supported Languages

The package includes 201 languages with 203 language/script combinations.

Languages with multiple scripts:

  • iu (Inuktitut): latin, syllabics

  • tt (Tatar): latin, cyrillic

Use xpfcorpus list or available_languages() for the full list.

Citation

If you use this package in your research, please cite the XPF Corpus:

@misc{xpf_corpus,
  title={The Cross-linguistic Phonological Frequencies (XPF) Corpus},
  author={Cohen Priva, Uriel and Gleason, Emily},
  year={2022},
  url={https://cohenpr-xpf.github.io/XPF/}
}