0000719: Many non-European letters are classified non-alphabetic

BusyBox Bug and Patch Tracking

ID: 0000719
Category: [uClibc] Internationalization / Localization
Severity: major
Reproducibility: always
Date Submitted: 02-12-06 23:09
Last Update: 02-12-06 23:09
Reporter: rfelker
View Status: public
Assigned To: uClibc
Priority: normal
Resolution: open
Status: assigned
Product Version:

Summary

0000719: Many non-European letters are classified non-alphabetic

Description

uClibc inherits this bug from glibc, which derives the alphabetic property incorrectly. In addition to the general categories that cover letters (L*, Nl, etc.), Unicode defines an "Other_Alphabetic" property in http://www.unicode.org/Public/UNIDATA/PropList.txt covering combining marks (Mn and Mc) in South Asian scripts that are certainly letters. Arguably all combining marks should be included in class alpha (otherwise decomposed alphabetic strings with accents/diacritics will be nonalphabetic), but the ones in Other_Alphabetic MUST be included.

This bug results in most words in most South Asian scripts being classified nonalphabetic; thus I consider it major.
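
To make the effect concrete, here is a minimal sketch of what happens to one Devanagari word when alphabetic status is derived from general categories alone. The rule in is_alpha_by_category is a hypothetical stand-in (L* and Nl only), not glibc's or uClibc's actual code, and the helper names are made up for illustration.

```python
import unicodedata


def is_alpha_by_category(ch):
    """Hypothetical rule: alphabetic iff the general category is L* or Nl."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl"


def non_alpha_chars(word):
    """Return the code points of `word` that the rule above rejects."""
    return [f"U+{ord(ch):04X}" for ch in word if not is_alpha_by_category(ch)]


# The word "Hindi" written in Devanagari: HA, VOWEL SIGN I, NA, VIRAMA,
# DA, VOWEL SIGN II.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

print(non_alpha_chars(word))
```

The two vowel signs (category Mc) and the virama (category Mn) are rejected, so a test that requires every character of a word to be alphabetic fails on this ordinary word.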

Additional Information

DerivedCoreProperties.txt from Unicode contains the full list of characters that Unicode considers alphabetic. IMO that list is insufficient, but it is the minimum that must be included.

This bug cannot easily be fixed without processing the Unicode data directly instead of mirroring glibc, unless glibc also fixes the bug on its side.
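
As a rough illustration of what "processing the Unicode data directly" could look like, the sketch below reads lines in the DerivedCoreProperties.txt format and builds a sorted range table for the Alphabetic property. The embedded SAMPLE is an abbreviated stand-in for the real file, and parse_ranges and in_ranges are hypothetical helper names; a real table generator for uClibc would need to emit its own compact format instead.

```python
import bisect

# Abbreviated sample in the format of DerivedCoreProperties.txt
# (an assumption for illustration; the real file is far longer).
SAMPLE = """\
# code point(s) ; property # comment
002B          ; Math # Sm       PLUS SIGN
0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0904..0939    ; Alphabetic # Lo  [54] DEVANAGARI LETTER SHORT A..DEVANAGARI LETTER HA
093E..0940    ; Alphabetic # Mc   [3] DEVANAGARI VOWEL SIGN AA..DEVANAGARI VOWEL SIGN II
"""


def parse_ranges(text, prop="Alphabetic"):
    """Collect sorted (first, last) code point ranges that carry `prop`."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        span, name = (field.strip() for field in data.split(";", 1))
        if name != prop:
            continue
        first, _, last = span.partition("..")
        # A single code point has no "..", so it forms a one-element range.
        ranges.append((int(first, 16), int(last or first, 16)))
    ranges.sort()
    return ranges


def in_ranges(ranges, cp):
    """Binary-search the sorted ranges for code point `cp`."""
    i = bisect.bisect_right(ranges, (cp, float("inf"))) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]


ranges = parse_ranges(SAMPLE)
print(in_ranges(ranges, 0x093F))  # DEVANAGARI VOWEL SIGN I
```

With this sample, U+093F falls inside the 093E..0940 range and is reported as alphabetic, while U+002B is skipped because its line names a different property.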

There are no notes attached to this issue.

Issue History
Date Modified	Username	Field	Change
02-12-06 23:09	rfelker	New Issue
02-12-06 23:09	rfelker	Status	new => assigned
02-12-06 23:09	rfelker	Assigned To	=> uClibc