Distance Functions
Grapheme-aware string distance functions that correctly measure edit distances over visual character units rather than raw Unicode code points.
Module: graphemes_plusplus.distance
Functions
levenshtein(s1: str, s2: str) → int
Computes the grapheme-aware Levenshtein distance (minimum edit distance) between two strings.
| Parameter | Type | Description |
|---|---|---|
| `s1` | `str` | First input string |
| `s2` | `str` | Second input string |
| Returns | `int` | Minimum number of grapheme-level insertions, deletions, and substitutions |
>>> from graphemes_plusplus import levenshtein
>>> levenshtein("ஸ்ரீ", "ஸ்ரி")
1
>>> levenshtein("ක්රම", "කම")
1
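Why grapheme-aware? A single visual character can span several Unicode code points, so Levenshtein distance computed over raw code points can report several edits for a change that a reader perceives as one, as with ක්රම → කම, which removes more than one code point. Grapheme-aware Levenshtein counts edits in grapheme units, so ஸ்ரீ → ஸ்ரி and ක්රම → කම are each a single grapheme-level edit.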
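To make the gap between code points and visual units concrete, the following sketch lists the code points behind the two Tamil strings from the example above. It uses only the standard library; `code_points` is a hypothetical helper written for this illustration and is not part of `graphemes_plusplus`.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return each code point of `text` as a 'U+XXXX NAME' string."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# Escapes are used so the exact code points are unambiguous.
sri_long = "\u0bb8\u0bcd\u0bb0\u0bc0"   # ஸ்ரீ
sri_short = "\u0bb8\u0bcd\u0bb0\u0bbf"  # ஸ்ரி

for label, text in (("sri_long", sri_long), ("sri_short", sri_short)):
    # len() on a Python str counts code points, not visual characters.
    print(label, "has", len(text), "code points")
    for line in code_points(text):
        print("   ", line)
```

Each string is stored as four code points (a consonant, a virama, a second consonant, and a vowel sign), which is why any code-point-level measure works in smaller units than the ones a reader sees.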
hamming(s1: str, s2: str) → int
Computes the grapheme-aware Hamming distance between two strings. Both strings must have the same number of graphemes.
| Parameter | Type | Description |
|---|---|---|
| `s1` | `str` | First input string |
| `s2` | `str` | Second input string (must have equal grapheme count) |
| Returns | `int` | Number of positions where graphemes differ |
Equal length required: the Hamming distance is defined only for strings with equal numbers of graphemes. If the grapheme counts differ, `textdistance` handles the mismatch according to its own convention, so the result is not a Hamming distance in the strict sense.
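Callers who prefer an explicit error over a library-defined fallback can check the lengths themselves. The sketch below is one way to do that over lists of grapheme clusters that have already been segmented; `strict_hamming` is a hypothetical helper for illustration, not a function provided by `graphemes_plusplus` or `textdistance`.

```python
from collections.abc import Sequence


def strict_hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Count differing positions; reject inputs of unequal length."""
    if len(a) != len(b):
        raise ValueError(
            f"Hamming distance needs equal lengths, got {len(a)} and {len(b)}"
        )
    return sum(x != y for x, y in zip(a, b))


# Hand-segmented cluster lists: "e\u0301" (e + combining acute) is one cluster.
left = ["a", "e\u0301", "z"]
right = ["a", "e", "z"]
print(strict_hamming(left, right))  # 1: only the middle cluster differs
```

Raising early keeps an accidental length mismatch from being folded silently into the distance value.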
How It Works
Both functions internally:
- Create `Graphemizer` instances for each input string
- Convert to `list[str]` of grapheme clusters
- Pass these lists to `textdistance` algorithms
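The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the module's implementation: `naive_clusters` is a simplified stand-in for `Graphemizer` (it only attaches combining marks and zero-width joiners to the preceding character), and `sequence_levenshtein` is a plain dynamic-programming stand-in for the `textdistance` call.

```python
import unicodedata
from collections.abc import Sequence


def naive_clusters(text: str) -> list[str]:
    """Simplified stand-in for Graphemizer: glue marks and ZWJ/ZWNJ
    onto the preceding character. Real segmentation rules are richer."""
    clusters: list[str] = []
    for ch in text:
        attaches = unicodedata.category(ch).startswith("M") or ch in "\u200c\u200d"
        if clusters and attaches:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def sequence_levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over any two sequences of comparable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def grapheme_levenshtein(s1: str, s2: str) -> int:
    """Segment both strings, then compare the cluster lists."""
    return sequence_levenshtein(naive_clusters(s1), naive_clusters(s2))


a, b = "ne\u0301", "no\u0308"          # "né" vs "nö", written with combining marks
print(sequence_levenshtein(a, b))      # 2: compared code point by code point
print(grapheme_levenshtein(a, b))      # 1: compared cluster by cluster
```

The distance routine never inspects the contents of an item, so the unit of comparison is decided entirely by the segmentation step.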
graph LR
A["Input Strings"] --> B["Graphemizer(s1)"]
A --> C["Graphemizer(s2)"]
B --> D["list[str]"]
C --> E["list[str]"]
D --> F["textdistance algorithm"]
E --> F
F --> G["Distance (int)"]
style A fill:#7c4dff,color:#fff
style G fill:#00bfa5,color:#fff
See Also
- Graphemizer — Creates the grapheme lists used for comparison
- Metrics — Higher-level evaluation metrics built on distance functions