|
Viewing Issue Simple Details
|
ID | Category | Severity | Reproducibility | Date Submitted | Last Update |
0000719 | [uClibc] Internationalization / Localization | major | always | 02-12-06 23:09 | 02-12-06 23:09 |
|
|
Reporter | rfelker | View Status | public |
Assigned To | uClibc |
Priority | normal | Resolution | open |
Status | assigned |
Product Version | |
|
|
Summary |
0000719: Many non-European letters are classified as non-alphabetic |
|
Description |
uClibc inherits this bug from glibc, which derives the alphabetic property incorrectly. Besides the general categories that cover letters (L*, Nl, etc.), Unicode defines an "Other_Alphabetic" property in http://www.unicode.org/Public/UNIDATA/PropList.txt, which covers combining marks (Mn and Mc) in South Asian scripts that are certainly letters. Arguably all combining marks should be included in class alpha (otherwise decomposed alphabetic strings with accents/diacritics will be nonalphabetic), but the ones in Other_Alphabetic MUST be included.
As a result, most words in most South Asian scripts are classified as nonalphabetic, which is why I consider this bug major. |
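To make the failure mode concrete, here is a minimal sketch in Python (not uClibc or glibc code) that compares a letters-only derivation with one that also honours Other_Alphabetic. The sample word, the helper names, and the hand-copied two-entry Other_Alphabetic set are illustrative assumptions; the general categories come from Python's standard `unicodedata` module.

```python
import unicodedata

# Hand-copied subset of Other_Alphabetic from PropList.txt. Assumption: only
# the two Devanagari vowel signs needed for the sample word are listed here.
OTHER_ALPHABETIC_SAMPLE = {0x093E, 0x093F}  # VOWEL SIGN AA, VOWEL SIGN I


def is_alpha_letters_only(ch):
    """Letters-only derivation: general category L* or Nl."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl"


def is_alpha_with_other_alphabetic(ch):
    """Letters-only derivation plus the Other_Alphabetic code points."""
    return is_alpha_letters_only(ch) or ord(ch) in OTHER_ALPHABETIC_SAMPLE


# Hindi "kitab" (book): KA, VOWEL SIGN I, TA, VOWEL SIGN AA, BA
word = "\u0915\u093f\u0924\u093e\u092c"

letters_only = all(is_alpha_letters_only(c) for c in word)
with_other = all(is_alpha_with_other_alphabetic(c) for c in word)
print(letters_only, with_other)
```

The letters-only check rejects the word because both vowel signs have general category Mc, while the second check accepts it. A real implementation would load the complete property data rather than a two-entry set.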
|
Additional Information |
DerivedCoreProperties.txt from Unicode contains the full list of characters that Unicode considers alphabetic. IMO that list is insufficient, but it is the minimum that must be included.
This bug cannot easily be fixed without processing the Unicode data directly instead of mirroring glibc, unless glibc fixes the bug as well.
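As a rough illustration of what "processing the Unicode data directly" could look like, the sketch below extracts the Alphabetic ranges from lines in the DerivedCoreProperties.txt format. It is only a sketch under stated assumptions: the function names are hypothetical, the sample lines are an abridged excerpt typed in by hand (a real tool would read the downloaded file), and turning the ranges into uClibc's own ctype tables is out of scope.

```python
import re

# Matches data lines of the form "XXXX ; Property" or "XXXX..YYYY ; Property".
# Comment lines and blank lines do not match and are skipped.
LINE_RE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")


def alphabetic_ranges(lines):
    """Return (lo, hi) code point ranges that carry the Alphabetic property."""
    ranges = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(3) == "Alphabetic":
            lo = int(m.group(1), 16)
            hi = int(m.group(2), 16) if m.group(2) else lo
            ranges.append((lo, hi))
    return ranges


def is_alphabetic(cp, ranges):
    """Linear lookup; a real table would use a sorted or compressed form."""
    return any(lo <= cp <= hi for lo, hi in ranges)


# Abridged, hand-typed sample in the file's format (assumption, not the full data).
SAMPLE = [
    "# Derived Property: Alphabetic",
    "",
    "0904..0939    ; Alphabetic # Lo  [54] DEVANAGARI LETTER SHORT A..DEVANAGARI LETTER HA",
    "093E..0940    ; Alphabetic # Mc   [3] DEVANAGARI VOWEL SIGN AA..DEVANAGARI VOWEL SIGN II",
    "0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z",
]

ranges = alphabetic_ranges(SAMPLE)
print(ranges)
print(is_alphabetic(0x093F, ranges))
```

Only lines whose property field is exactly Alphabetic are kept, so the Lowercase entry in the sample is ignored.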
|
|
|
Attached Files |
|
|
|