Tamil Scripts
A practical guide to using graphemes++ with Tamil (தமிழ்) text.
Why Grapheme-Aware Processing Matters for Tamil
Tamil has a complex writing system where a single visual character can consist of multiple Unicode code points:
ஸ்ரீ 1 grapheme · 4 Unicode code points
| Representation | Count | Result |
|---|---|---|
| `len("ஸ்ரீ")` | 4 | Unicode code points |
| `len(list("ஸ்ரீ"))` | 4 | Individual characters |
| `len(Graphemizer("ஸ்ரீ"))` | 1 | ✅ Correct grapheme count |
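To see where the count of 4 comes from, the individual code points can be listed with Python's standard `unicodedata` module. This is a small stdlib-only sketch that does not use graphemes++; the variable names are illustrative.

```python
import unicodedata

# ஸ்ரீ written with escapes so the exact code points are unambiguous
text = "\u0BB8\u0BCD\u0BB0\u0BC0"

# Pair each code point with its official Unicode name
code_points = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]
for code_point, name in code_points:
    print(code_point, name)
# U+0BB8 TAMIL LETTER SA
# U+0BCD TAMIL SIGN VIRAMA
# U+0BB0 TAMIL LETTER RA
# U+0BC0 TAMIL VOWEL SIGN II
```

Both `len(text)` and `len(list(text))` count these four entries, which is why neither reflects what a reader perceives as one character.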
Tamil Grapheme Segmentation
Basic Examples
from graphemes_plusplus import Graphemizer
# Simple word
g = Graphemizer("வணக்கம்")
print(g.graphemes)
# ['வ', 'ண', 'க்', 'க', 'ம்']
# Complex conjunct: க்ஷ
g = Graphemizer("அக்ஷரம்")
print(g.graphemes)
# ['அ', 'க்ஷ', 'ர', 'ம்']
# Sri marker: ஸ்ரீ
g = Graphemizer("ஸ்ரீமான்")
print(g.graphemes)
# ['ஸ்ரீ', 'மா', 'ன்']
Special Tamil Conjuncts
graphemes++ handles these Tamil-specific conjuncts that the standard grapheme library splits incorrectly:
| Conjunct | Components | graphemes++ | Standard grapheme |
|---|---|---|---|
| க்ஷ | க் + ஷ | ✅ Single grapheme | ❌ Split into 2 |
| ஸ்ரீ | ஸ் + ரீ | ✅ Single grapheme | ❌ Split into 2 |
| ஶ்ரீ | ஶ் + ரீ | ✅ Single grapheme | ❌ Split into 2 |
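The difference between the two columns can be illustrated with a stdlib-only sketch. It is not the graphemes++ implementation and not the standard library's algorithm: `naive_clusters` only approximates default clustering by attaching combining marks to the preceding base, and `merge_conjuncts` with its `CONJUNCTS` table is a hypothetical post-processing step invented for this example.

```python
import unicodedata

# Hypothetical conjunct table for this sketch: க்ஷ, ஸ்ரீ, ஶ்ரீ
CONJUNCTS = (
    "\u0B95\u0BCD\u0BB7",
    "\u0BB8\u0BCD\u0BB0\u0BC0",
    "\u0BB6\u0BCD\u0BB0\u0BC0",
)

def naive_clusters(text):
    """Attach each combining mark (category Mn/Mc) to the preceding base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def merge_conjuncts(clusters):
    """Join two adjacent clusters when together they start a known conjunct."""
    merged = []
    i = 0
    while i < len(clusters):
        if i + 1 < len(clusters):
            pair = clusters[i] + clusters[i + 1]
            if any(pair.startswith(c) for c in CONJUNCTS):
                merged.append(pair)
                i += 2
                continue
        merged.append(clusters[i])
        i += 1
    return merged

word = "\u0BB8\u0BCD\u0BB0\u0BC0\u0BAE\u0BBE\u0BA9\u0BCD"  # ஸ்ரீமான்
print(len(naive_clusters(word)))                   # 4: ஸ் and ரீ stay separate
print(len(merge_conjuncts(naive_clusters(word))))  # 3: ஸ்ரீ is one unit
```

A lookup table keeps the sketch short; a real segmenter would also need to decide how to treat conjuncts outside the table.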
Tamil Decomposition
Decompose Tamil graphemes into their phonetic components (mei + uyir):
from graphemes_plusplus import decompose, compose
# Decompose: உயிர்மெய் → மெய் + உயிர்
print(decompose("கா")) # க் + ஆ
print(decompose("தி")) # த் + இ
print(decompose("பூ")) # ப் + ஊ
# Compose back
print(compose(decompose("கா"))) # கா
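The mei + uyir split can be approximated with a plain lookup from dependent vowel signs to independent vowels. The sketch below is an assumption about how such a split could work, not the library's code: `sketch_decompose` and `sketch_compose` are hypothetical names, they return and accept a `(mei, uyir)` tuple, and they handle only a single consonant with at most one vowel sign.

```python
import unicodedata

PULLI = "\u0BCD"       # ் marks a pure consonant (mei)
INHERENT_A = "\u0B85"  # அ, the vowel carried by a bare consonant

# Dependent vowel sign -> independent vowel (uyir)
SIGN_TO_VOWEL = {
    "\u0BBE": "\u0B86",  # ா -> ஆ
    "\u0BBF": "\u0B87",  # ி -> இ
    "\u0BC0": "\u0B88",  # ீ -> ஈ
    "\u0BC1": "\u0B89",  # ு -> உ
    "\u0BC2": "\u0B8A",  # ூ -> ஊ
    "\u0BC6": "\u0B8E",  # ெ -> எ
    "\u0BC7": "\u0B8F",  # ே -> ஏ
    "\u0BC8": "\u0B90",  # ை -> ஐ
    "\u0BCA": "\u0B92",  # ொ -> ஒ
    "\u0BCB": "\u0B93",  # ோ -> ஓ
    "\u0BCC": "\u0B94",  # ௌ -> ஔ
}
VOWEL_TO_SIGN = {vowel: sign for sign, vowel in SIGN_TO_VOWEL.items()}

def sketch_decompose(syllable):
    """Split one uyirmei syllable into (mei, uyir)."""
    # NFC folds two-part signs such as ெ + ா into the single sign ொ
    syllable = unicodedata.normalize("NFC", syllable)
    consonant, sign = syllable[0], syllable[1:]
    vowel = INHERENT_A if sign == "" else SIGN_TO_VOWEL[sign]
    return consonant + PULLI, vowel

def sketch_compose(mei, uyir):
    """Rebuild the uyirmei syllable from a (mei, uyir) pair."""
    consonant = mei[0]
    if uyir == INHERENT_A:
        return consonant
    return consonant + VOWEL_TO_SIGN[uyir]

mei, uyir = sketch_decompose("\u0B95\u0BBE")       # கா
print(mei == "\u0B95\u0BCD", uyir == "\u0B86")     # True True (க், ஆ)
print(sketch_compose(mei, uyir) == "\u0B95\u0BBE") # True
```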
The Tamil Phonetic System
graph TD
A["Tamil Letters"] --> B["உயிர் (Vowels)"]
A --> C["மெய் (Consonants)"]
A --> D["உயிர்மெய் (Combined)"]
B --> E["அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ"]
C --> F["க், ங், ச், ஞ், ட், ண், ..."]
D --> G["க் + அ = க, க் + ஆ = கா, ..."]
style A fill:#7c4dff,color:#fff
style D fill:#00bfa5,color:#fff
Tamil Normalization
The Normalizer automatically fixes common Tamil encoding issues:
from graphemes_plusplus.utils import Normalizer
n = Normalizer()
# Fix reversed vowel orders
n.normalize("ாெ") # → ொ
n.normalize("ாே") # → ோ
# Fix the vowel sign ா typed in place of the consonant ர
n.normalize("ா்") # → ர்
n.normalize("ாி") # → ரி
Normalization is automatic: when you use `Graphemizer`, normalization is applied for you. These examples show the `Normalizer` for educational purposes.
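As a rough illustration of why the reversed-vowel repair is a separate step, the sketch below compares Unicode NFC normalization with a simple replacement table. It is a stdlib-only approximation, not the `Normalizer` implementation; `REVERSED_PAIRS` and `sketch_fix_vowel_order` are hypothetical names introduced here.

```python
import unicodedata

# Hypothetical repair table: reversed sign order -> single combined sign
REVERSED_PAIRS = {
    "\u0BBE\u0BC6": "\u0BCA",  # ா + ெ -> ொ
    "\u0BBE\u0BC7": "\u0BCB",  # ா + ே -> ோ
}

def sketch_fix_vowel_order(text):
    """Replace reversed two-part vowel signs with the combined sign."""
    for wrong, right in REVERSED_PAIRS.items():
        text = text.replace(wrong, right)
    return text

correct_order = "\u0B95\u0BC6\u0BBE"  # க + ெ + ா
reversed_order = "\u0B95\u0BBE\u0BC6"  # க + ா + ெ (typed in the wrong order)
combined = "\u0B95\u0BCA"              # கொ with the single sign ொ

# NFC composes the correctly ordered pair but leaves the reversed pair alone
print(unicodedata.normalize("NFC", correct_order) == combined)         # True
print(unicodedata.normalize("NFC", reversed_order) == reversed_order)  # True
print(sketch_fix_vowel_order(reversed_order) == combined)              # True
```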