API Reference

Main API

Transcriber

High-level grapheme-to-phoneme transcriber.

Supports multiple data sources:

- Package repository (default): bundled JSON data
- External YAML file: custom language definitions
- Legacy format: .rules and .verify files

Examples

# Basic usage - language with default script
>>> es = Transcriber("es")
>>> es.transcribe("ejemplo")
['e', 'x', 'e', 'm', 'p', 'l', 'o']

# Explicit script
>>> tt = Transcriber("tt", "cyrillic")

# External YAML file
>>> custom = Transcriber("custom", yaml_file="my_lang.yaml")

# Legacy format
>>> legacy = Transcriber("test", rules_file="es.rules")

Initialize a Transcriber for a language.

Parameters:

language – Language code (e.g., "es", "tt", "aak"). Supports BCP-47 style codes with script and region:
    - "es-ES" (region preserved, treated as variant)
    - "yi-Latn" (script extracted)
    - "tt-cyrillic" (script extracted)
    - "zh-Hans-CN" (script extracted, region preserved)
script – Script to use (e.g., "latin", "cyrillic"). Required for multi-script languages without a default. If provided, overrides any script in the language code.
verify – If True, verify the rules on initialization. Raises VerificationError if verification fails.
yaml_file – Path to an external YAML file (requires PyYAML).
rules_file – Path to a legacy .rules file.
verify_file – Path to a legacy .verify file.

transcribe(word: str) → list[str][source]

Transcribe a word from graphemes to phonemes.

Parameters: word – The word to transcribe.
Returns: List of phoneme strings.

verify() → tuple[bool, list[str]][source]

Verify the loaded rules against the verification data.

Returns: Tuple of (all_passed, list_of_error_messages).
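
The returned tuple can be unpacked directly. The following is a minimal sketch of turning such a result into a readable report. `report_verification` is a hypothetical helper, not part of xpfcorpus, and the sample tuples are made up to match the documented shape.

```python
def report_verification(result: tuple[bool, list[str]]) -> str:
    """Format an (all_passed, error_messages) tuple as a short report."""
    all_passed, errors = result
    if all_passed:
        return "all verification entries passed"
    lines = [f"{len(errors)} verification failure(s):"]
    lines.extend(f"  - {message}" for message in errors)
    return "\n".join(lines)


# Made-up sample values; with xpfcorpus installed the argument would be
# the value returned by Transcriber(...).verify().
print(report_verification((True, [])))
print(report_verification((False, ["hypothetical error message"])))
```

Keeping the formatting separate from the call makes the same helper usable with TranscriptionProcessor.verify(), which is documented to return the same tuple shape.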

property language: str: The language code.

property script: str | None: The script being used.

property variant: str | None

The language variant (region code) if a variant file was loaded.

Returns: Region code (e.g., "ES", "MX") if a variant file exists, otherwise None.

Examples

>>> es = Transcriber("es")
>>> es.variant  # None (base language)
>>>
>>> # If es-ES.json exists:
>>> es_es = Transcriber("es-ES")
>>> es_es.variant  # "ES"
>>>
>>> # If es-ES.json doesn't exist (falls back to es.json):
>>> es_es = Transcriber("es-ES")
>>> es_es.variant  # None

property name: str: The language name.

property family: str: The language family.

property is_compromised: bool: Whether this language has known issues.

__repr__() → str[source]: Return repr(self).

available_languages

xpfcorpus.available_languages() → dict[str, dict][source]

Get all available languages from the package repository.

Returns:

{
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": null},
    …
}

Return type:

Dict mapping language codes to metadata
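
As an illustration of working with this mapping, the sketch below lists the languages that have no default script and therefore need an explicit `script` argument. `languages_requiring_script` is a hypothetical helper, the sample data mirrors the two entries shown above, and it is an assumption that the JSON `null` arrives in Python as `None`.

```python
def languages_requiring_script(index: dict[str, dict]) -> list[str]:
    """Return the codes whose metadata has no default script."""
    return sorted(
        code for code, meta in index.items() if meta.get("default") is None
    )


# Sample data in the documented shape; a real call would pass the result
# of xpfcorpus.available_languages() instead.
sample_index = {
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": None},
}

print(languages_requiring_script(sample_index))  # ['tt']
```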

Exceptions

Custom exceptions for xpfcorpus.

exception xpfcorpus.exceptions.XPFCorpusError[source]

Bases: Exception

Base exception for all xpfcorpus errors.

exception xpfcorpus.exceptions.LanguageNotFoundError(code: str, available: list[str] | None = None)[source]

Bases: XPFCorpusError

Raised when a requested language is not available.

__init__(code: str, available: list[str] | None = None)[source]

exception xpfcorpus.exceptions.ScriptNotFoundError(code: str, script: str, available: list[str])[source]

Bases: XPFCorpusError

Raised when a requested script is not available for a language.

__init__(code: str, script: str, available: list[str])[source]

exception xpfcorpus.exceptions.ScriptRequiredError(code: str, available: list[str])[source]

Bases: XPFCorpusError

Raised when a language requires explicit script selection.

__init__(code: str, available: list[str])[source]

exception xpfcorpus.exceptions.VerificationError(code: str, errors: list[str])[source]

Bases: XPFCorpusError

Raised when language verification fails.

__init__(code: str, errors: list[str])[source]

exception xpfcorpus.exceptions.RulesParseError(path: str, detail: str = '')[source]

Bases: XPFCorpusError

Raised when a rules file cannot be parsed.

__init__(path: str, detail: str = '')[source]
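
Because every exception above derives from XPFCorpusError, a caller can handle specific failures first and fall back to the base class. The sketch below uses local stand-in classes so that it runs without xpfcorpus installed; in real code the names would be imported from xpfcorpus.exceptions. The `describe` function and the two raising helpers are hypothetical.

```python
class XPFCorpusError(Exception):
    """Stand-in for xpfcorpus.exceptions.XPFCorpusError."""


class ScriptRequiredError(XPFCorpusError):
    """Stand-in; the documented class takes (code, available)."""


class RulesParseError(XPFCorpusError):
    """Stand-in; the documented class takes (path, detail)."""


def describe(action) -> str:
    """Run a callable and map xpfcorpus-style failures to a message."""
    try:
        action()
    except ScriptRequiredError:
        # Most specific handler first.
        return "pass an explicit script"
    except XPFCorpusError as exc:
        # Catch-all for every other error in the hierarchy.
        return f"xpfcorpus error: {type(exc).__name__}"
    return "ok"


def needs_script():
    raise ScriptRequiredError("tt")


def bad_rules():
    raise RulesParseError("es.rules")


print(describe(needs_script))  # pass an explicit script
print(describe(bad_rules))     # xpfcorpus error: RulesParseError
print(describe(lambda: None))  # ok
```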

Engine Layer

TranscriptionProcessor

class xpfcorpus.engine.processor.TranscriptionProcessor(rules: RuleSet, missing: str = '@')[source]

Core transcription engine that converts graphemes to phonemes.

This is a pure transcription class with no I/O operations. The algorithm is adapted from XPF Corpus’s translate04.py.

__init__(rules: RuleSet, missing: str = '@')[source]

Initialize the processor with a rule set.

Parameters:

rules – The RuleSet containing all transcription rules.
missing – Character to use for untranscribable graphemes.

transcribe(word: str) → list[str][source]

Transcribe a word from graphemes to phonemes.

Parameters: word – The word to transcribe.
Returns: List of phoneme strings.

verify(entries: list[VerifyEntry], *, stop_on_first: bool = False) → tuple[bool, list[str]][source]

Verify transcription against expected outputs.

Parameters:

entries – List of VerifyEntry objects with word/phonemes pairs.
stop_on_first – If True, stop at the first failure.

Returns:

Tuple of (all_passed, list_of_error_messages).
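
The effect of `stop_on_first` can be sketched with a small stand-alone loop. This is not the library's implementation: `Entry` is a stand-in for VerifyEntry, `verify_entries` and `toy_transcribe` are hypothetical, and the sketch assumes the expected phonemes are stored as a space-separated string, which this reference does not specify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Entry:
    """Stand-in for VerifyEntry (word, phonemes, comment)."""
    word: str
    phonemes: str
    comment: str = ""


def verify_entries(
    transcribe: Callable[[str], list[str]],
    entries: list[Entry],
    *,
    stop_on_first: bool = False,
) -> tuple[bool, list[str]]:
    """Compare transcriptions with expectations and collect mismatches."""
    errors: list[str] = []
    for entry in entries:
        # Assumption: expected phonemes are separated by whitespace.
        expected = entry.phonemes.split()
        actual = transcribe(entry.word)
        if actual != expected:
            errors.append(f"{entry.word}: expected {expected}, got {actual}")
            if stop_on_first:
                break
    return (not errors, errors)


def toy_transcribe(word: str) -> list[str]:
    """Toy transcriber: one phoneme per character."""
    return list(word)


entries = [Entry("ab", "a b"), Entry("cd", "x y"), Entry("ef", "z")]
print(verify_entries(toy_transcribe, entries))
print(verify_entries(toy_transcribe, entries, stop_on_first=True))
```

With `stop_on_first=True` the loop returns after the first mismatch, so the error list holds one message instead of two.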

Data Classes

class xpfcorpus.engine.rules.RuleSet(classes: dict[str, str] = <factory>, pre: dict[str, str] = <factory>, matches: dict[str, str] = <factory>, subs: list[SubRule] = <factory>, ipasubs: list[SubRule] = <factory>, words: dict[str, list[str]] = <factory>)[source]

A complete set of rules for translating a script.

Contains:

- classes: character class definitions for use in other rules
- pre: character-level preprocessing (as a translation table)
- matches: simple character-to-phoneme mappings (no context)
- subs: context-sensitive substitution rules
- ipasubs: post-processing substitution rules on IPA output
- words: whole-word exception mappings

classes: dict[str, str]

pre: dict[str, str]

matches: dict[str, str]

subs: list[SubRule]

ipasubs: list[SubRule]

words: dict[str, list[str]]

get_pre_translation_table() → dict[int, str][source]: Build a str.maketrans table from the pre rules.
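
For readers unfamiliar with translation tables, this is how the built-in `str.maketrans` and `str.translate` behave when given a dict like `pre`. The two rules are made up for illustration and are not taken from any bundled language.

```python
# Made-up preprocessing rules: single characters mapped to replacements.
pre = {"x": "ks", "q": "k"}

# str.maketrans turns single-character keys into their code points.
table = str.maketrans(pre)
print(table)                    # {120: 'ks', 113: 'k'}

# str.translate applies every mapping in one pass over the string.
print("taxi".translate(table))  # taksi
```

A replacement may be longer than one character, but every key passed to `str.maketrans` in dict form must be a single character or an integer code point.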

__init__(classes: dict[str, str] = <factory>, pre: dict[str, str] = <factory>, matches: dict[str, str] = <factory>, subs: list[SubRule] = <factory>, ipasubs: list[SubRule] = <factory>, words: dict[str, list[str]] = <factory>) → None

class xpfcorpus.engine.rules.SubRule(sfrom: str, sto: str, weight: float = 1.0, precede: str = '', follow: str = '', _sfrom_re: Pattern | None = None, _precede_re: Pattern | None = None, _follow_re: Pattern | None = None)[source]

A substitution rule with pattern matching and context.

Wraps a regex-based substitution with optional precede/follow context and a weight for rule prioritization.

sfrom: str

sto: str

weight: float = 1.0

precede: str = ''

follow: str = ''

property sfrom_re: Pattern

property precede_re: Pattern

property follow_re: Pattern

matches(sfrom: str, precede: str, follow: str) → float | None[source]

Check if this rule matches the given context.

Returns the rule weight if matched, None otherwise.
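
To make the idea of a weighted, context-sensitive match concrete, here is a rough stand-alone sketch. It is not the SubRule implementation: `ContextRule` is a hypothetical class, and the anchoring choices (the preceding pattern must match at the end of the left context, the following pattern at the start of the right context) are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextRule:
    """Hypothetical rule: rewrite `sfrom` as `sto` in a given context."""
    sfrom: str
    sto: str
    weight: float = 1.0
    precede: str = ""
    follow: str = ""

    def matches(self, sfrom: str, precede: str, follow: str) -> Optional[float]:
        """Return the weight if the target and both contexts match."""
        if not re.fullmatch(self.sfrom, sfrom):
            return None
        # Assumption: left context is checked at its end.
        if self.precede and not re.search(f"(?:{self.precede})$", precede):
            return None
        # Assumption: right context is checked at its start.
        if self.follow and not re.match(self.follow, follow):
            return None
        return self.weight


# Made-up rule: "c" becomes "s" when followed by "e" or "i".
soft_c = ContextRule("c", "s", weight=2.0, follow="[ei]")
print(soft_c.matches("c", "", "ena"))  # 2.0
print(soft_c.matches("c", "", "asa"))  # None
```

Returning the weight instead of a boolean lets a caller compare several matching rules and keep the one with the highest priority.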

substitute(text: str) → str[source]: Apply this rule’s substitution to the given text.

__init__(sfrom: str, sto: str, weight: float = 1.0, precede: str = '', follow: str = '', _sfrom_re: Pattern | None = None, _precede_re: Pattern | None = None, _follow_re: Pattern | None = None) → None

class xpfcorpus.engine.rules.LanguageData(code: str, name: str = '', family: str = '', macroarea: str = '', compromised: dict | bool | None = None, default_script: str | None = None, scripts: dict[str, ScriptData] = <factory>)[source]

Complete data for a language, including all scripts.

A language may have multiple scripts (e.g., tt-latin, tt-cyrillic). If there’s a default_script, that script is used when no script is explicitly specified.

code: str

name: str = ''

family: str = ''

macroarea: str = ''

compromised: dict | bool | None = None

default_script: str | None = None

scripts: dict[str, ScriptData]

get_script_data(script: str | None = None) → ScriptData[source]

Get the ScriptData for the specified script, or the default.

Raises ValueError if no script specified and no default exists.
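
The fallback described here can be sketched as a small function. `pick_script` is hypothetical and only mirrors the documented behavior for a missing script; what happens for an unknown script name is not documented, so this sketch simply lets the dict lookup raise KeyError.

```python
from typing import Optional


def pick_script(
    scripts: dict,
    default_script: Optional[str],
    script: Optional[str] = None,
):
    """Return the entry for `script`, falling back to `default_script`."""
    chosen = script if script is not None else default_script
    if chosen is None:
        raise ValueError("no script specified and no default script is set")
    return scripts[chosen]


# Placeholder strings stand in for ScriptData objects.
print(pick_script({"latin": "latin data"}, "latin"))  # latin data
```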

__init__(code: str, name: str = '', family: str = '', macroarea: str = '', compromised: dict | bool | None = None, default_script: str | None = None, scripts: dict[str, ScriptData] = <factory>) → None

class xpfcorpus.engine.rules.ScriptData(rules: RuleSet, verify: list[VerifyEntry] = <factory>)[source]

Data for a single script of a language.

rules: RuleSet

verify: list[VerifyEntry]

__init__(rules: RuleSet, verify: list[VerifyEntry] = <factory>) → None

class xpfcorpus.engine.rules.VerifyEntry(word: str, phonemes: str, comment: str = '')[source]

A single verification entry: word and expected phonemes.

word: str

phonemes: str

comment: str = ''

__init__(word: str, phonemes: str, comment: str = '') → None

I/O Layer

PackageRepository

class xpfcorpus.io.repository.PackageRepository[source]

Access bundled language data from the package.

This uses importlib.resources to access JSON files bundled with the package.

classmethod available_languages() → dict[str, dict][source]

Get a dictionary of available languages.

Returns:

{
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": null},
    …
}

Return type:

Dict mapping language codes to metadata

classmethod has_language(code: str) → bool[source]

Check if a language is available.

Checks both the index and the filesystem for language files. This allows variants (e.g., es-ES.json) to exist without being in index.json.

classmethod get_scripts(code: str) → list[str][source]

Get available scripts for a language.

Raises: LanguageNotFoundError – If the language is not available.

classmethod get_default_script(code: str) → str | None[source]

Get the default script for a language.

Returns None if no default is set.

Raises: LanguageNotFoundError – If the language is not available.

classmethod load_language(code: str) → LanguageData[source]

Load language data from the bundled JSON files.

Parameters: code – Language code (e.g., "es", "tt").
Returns: LanguageData object.
Raises: LanguageNotFoundError – If the language is not available.

classmethod export_language_yaml(code: str) → str[source]

Export a language’s data as a YAML string.

Parameters: code – Language code.
Returns: YAML-formatted string.

classmethod clear_cache()[source]: Clear the loaded language cache.

Loaders

class xpfcorpus.io.json_loader.JSONLoader[source]

Load language data from JSON files.

classmethod load(path: Path | str) → LanguageData[source]

Load language data from a JSON file.

Parameters: path – Path to the JSON file.
Returns: LanguageData object.

classmethod load_string(content: str) → LanguageData[source]: Load language data from a JSON string.

classmethod from_dict(data: dict[str, Any]) → LanguageData[source]

Convert a dictionary to LanguageData.

Expected structure:

{
    "metadata": {
        "code": "es",
        "name": "Spanish",
        "family": "…",
        "macroarea": "…",
        "compromised": false,
        "default_script": "latin"
    },
    "scripts": {
        "latin": {
            "verify": […],
            "rules": {…}
        }
    }
}
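
Before handing a dictionary to a loader, it can help to check that it has the outline shown above. The sketch below is a hypothetical pre-check, not part of xpfcorpus. Which keys the loader actually requires is an assumption, and the elided values are left as placeholder strings and empty containers.

```python
import json


def check_shape(data: dict) -> list[str]:
    """Report keys from the documented outline that are absent."""
    problems = []
    for key in ("metadata", "scripts"):
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    for name, script in data.get("scripts", {}).items():
        for key in ("verify", "rules"):
            if key not in script:
                problems.append(f"script {name!r} is missing {key!r}")
    return problems


# Placeholder document following the documented outline.
document = {
    "metadata": {
        "code": "es",
        "name": "Spanish",
        "family": "...",     # elided in the reference
        "macroarea": "...",  # elided in the reference
        "compromised": False,
        "default_script": "latin",
    },
    "scripts": {"latin": {"verify": [], "rules": {}}},
}

print(check_shape(document))  # []
# The structure is plain JSON, so it survives a round trip unchanged.
print(json.loads(json.dumps(document)) == document)  # True
```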

class xpfcorpus.io.yaml_loader.YAMLLoader[source]

Load language data from YAML files.

Requires PyYAML to be installed. Install with:

    pip install xpfcorpus[yaml]

or:

    pip install pyyaml

classmethod load(path: Path | str) → LanguageData[source]

Load language data from a YAML file.

Parameters: path – Path to the YAML file.
Returns: LanguageData object.
Raises: ImportError – If PyYAML is not installed.

classmethod load_string(content: str) → LanguageData[source]: Load language data from a YAML string.

classmethod from_dict(data: dict[str, Any]) → LanguageData[source]

Convert a dictionary to LanguageData.

Uses the same format as JSONLoader and delegates the conversion to it.

class xpfcorpus.io.legacy_loader.LegacyLoader[source]

Load language data from .rules and .verify files.

classmethod load_rules(path: Path | str) → RuleSet[source]

Load rules from a .rules file.

Parameters: path – Path to the .rules file.
Returns: RuleSet object.

classmethod load_verify(path: Path | str) → list[VerifyEntry][source]

Load verification entries from a .verify file.

Parameters: path – Path to the .verify or .verify.csv file.
Returns: List of VerifyEntry objects.

classmethod load_from_files(rules_path: Path | str, verify_path: Path | str | None = None) → ScriptData[source]

Load script data from .rules and .verify files.

Parameters:

rules_path – Path to the .rules file.
verify_path – Path to the .verify file (optional).

Returns:

ScriptData object.

Language Code Parsing

Language code parsing utilities for BCP-47 style codes.

xpfcorpus.io.language_code.normalize_script(script: str) → str[source]

Normalize script name to a standard form.

Handles both ISO 15924 4-letter codes and common names.

Parameters: script – Script name or code (e.g., "Latn", "latin", "cyrillic", "Cyrl").
Returns: Normalized lowercase script name.

Examples

>>> normalize_script("Latn")
'latin'
>>> normalize_script("cyrillic")
'cyrillic'
>>> normalize_script("Syll")
'syllabics'
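
One way to get this behavior is a lookup table with a lowercase fallback. The sketch below is an assumption about the approach, not the library's code: `SCRIPT_ALIASES` is a partial, hypothetical table (the "cyrl" entry in particular is a guess based on the parameter description), and `normalize_script_sketch` is a made-up name.

```python
# Partial, hypothetical alias table keyed by lowercase script code.
SCRIPT_ALIASES = {
    "latn": "latin",
    "cyrl": "cyrillic",
    "syll": "syllabics",
}


def normalize_script_sketch(script: str) -> str:
    """Map known codes to names; lowercase everything else."""
    key = script.strip().lower()
    return SCRIPT_ALIASES.get(key, key)


print(normalize_script_sketch("Latn"))      # latin
print(normalize_script_sketch("cyrillic"))  # cyrillic
print(normalize_script_sketch("Syll"))      # syllabics
```

Codes that are missing from the table fall through in lowercase, which is consistent with the "hans" result documented for parse_language_code("zh-Hans-CN").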

xpfcorpus.io.language_code.parse_language_code(code: str, explicit_script: str | None = None) → Tuple[str, str | None, str | None][source]

Parse a language code with optional script and region components.

Supports BCP-47 style codes like:

- "es" → ("es", None, None)
- "es-ES" → ("es", None, "ES") - region preserved
- "yi-Latn" → ("yi", "latin", None) - script extracted
- "tt-cyrillic" → ("tt", "cyrillic", None) - script extracted
- "zh-Hans-CN" → ("zh", "hans", "CN") - script extracted, region preserved

If an explicit script is provided, it always takes precedence over any script extracted from the code.

Parameters:

code – Language code, possibly with script/region subtags.
explicit_script – Optional explicit script that overrides extracted script.

Returns:

Tuple of (language_code, script_or_none, region_or_none).

Examples

>>> parse_language_code("es")
('es', None, None)
>>> parse_language_code("es-ES")
('es', None, 'ES')
>>> parse_language_code("yi-Latn")
('yi', 'latin', None)
>>> parse_language_code("tt-cyrillic")
('tt', 'cyrillic', None)
>>> parse_language_code("zh-Hans-CN")
('zh', 'hans', 'CN')
>>> parse_language_code("yi-Latn", "hebrew")
('yi', 'hebrew', None)
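
The documented results can be reproduced with a short stand-alone parser. This is a sketch under stated assumptions, not the library's algorithm: `parse_code_sketch` and `_normalize` are hypothetical, and the rule used to tell regions from scripts (two letters or three digits means region, anything else is a script) is a guess that happens to fit the examples above.

```python
from typing import Optional

# Partial, hypothetical alias table (see normalize_script).
_ALIASES = {"latn": "latin", "cyrl": "cyrillic"}


def _normalize(script: str) -> str:
    key = script.lower()
    return _ALIASES.get(key, key)


def parse_code_sketch(
    code: str, explicit_script: Optional[str] = None
) -> tuple[str, Optional[str], Optional[str]]:
    """Split a code into (language, script_or_none, region_or_none)."""
    language, *subtags = code.split("-")
    script: Optional[str] = None
    region: Optional[str] = None
    for tag in subtags:
        # Assumption: "ES" or "419" style subtags are regions.
        is_region = (len(tag) == 2 and tag.isalpha()) or (
            len(tag) == 3 and tag.isdigit()
        )
        if is_region:
            region = tag.upper()
        elif script is None:
            script = _normalize(tag)
    # An explicit script always overrides the one found in the code.
    if explicit_script is not None:
        script = _normalize(explicit_script)
    return (language.lower(), script, region)


print(parse_code_sketch("zh-Hans-CN"))         # ('zh', 'hans', 'CN')
print(parse_code_sketch("yi-Latn", "hebrew"))  # ('yi', 'hebrew', None)
```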