wiki.techinc.nl/includes/normal
.cvsignore Ignore some Mac-related files 2004-11-14 02:25:44 +00:00
CleanUpTest.php Fix composition bug: completed hangul syllable should not be merged with another following final jamo 2004-11-15 00:59:40 +00:00
Makefile Support using ICU to do most of the heavy lifting in cleanUp() if the extension is loaded. 2004-11-14 05:17:29 +00:00
RandomTest.php Test: feeds random strings to both pure PHP and ICU code paths looking for differences. 2004-11-14 21:40:44 +00:00
README Add UtfNormal::cleanUp() function: strips XML-unsafe characters and illegal UTF-8 sequences, then normalizes to form C. 2004-09-03 05:39:30 +00:00
Utf8Test.php Munge doc comments. Mark as its own package for docs. 2004-10-28 02:56:13 +00:00
UtfNormal.php Fix composition bug: completed hangul syllable should not be merged with another following final jamo 2004-11-15 00:59:40 +00:00
UtfNormalBench.php Add a Russian test file to the benchmark (2-byte characters, using ASCII spacing and punctuation) 2004-11-11 07:05:21 +00:00
UtfNormalData.inc Load form C data on demand; if we are dealing in all-ASCII text we can save some memory and time by not loading it. 2004-10-09 08:08:26 +00:00
UtfNormalDataK.inc Change the way comment are generated so they are compatible with phpdocumentor. Changes already existing files as well. 2004-09-03 22:52:28 +00:00
UtfNormalGenerate.php Munge doc comments. Mark as its own package for docs. 2004-10-28 02:56:13 +00:00
UtfNormalTest.php Don't run the control characters through the invariant test, as they are stripped by cleanUp() for XML safety. 2004-11-06 03:00:29 +00:00
UtfNormalUtil.php Add a utf-8 to hex sequence function for debugging 2004-11-15 00:58:36 +00:00

This directory contains some Unicode normalization routines. These routines
are meant to be reusable in other projects, so I'm not tying them to the
MediaWiki utility functions.

The main function to care about is UtfNormal::toNFC(); this will convert
a given UTF-8 string to Normalization Form C if it's not already such.
The function assumes that the input string is already valid UTF-8; if there
are corrupt characters this may produce erroneous results.

To also check for illegal characters, use UtfNormal::cleanUp(). This will
strip illegal UTF-8 sequences and characters that are illegal in XML, and
if necessary convert to normalization form C.
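
As a rough sketch of how a caller might choose between the two entry points:
the helper names isValidUtf8() and normalizeInput() below are made up for
illustration and are not part of this package. It is assumed here that both
UtfNormal::toNFC() and UtfNormal::cleanUp() take a string and return the
converted string, and that UtfNormal.php has been loaded by the caller. The
validity check uses the PCRE /u modifier, which refuses malformed UTF-8; that
is a stand-in, not the check cleanUp() performs internally.

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
// preg_match() with the /u modifier fails (returns false) on malformed input.
function isValidUtf8(string $s): bool {
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: pick the cheaper call when the input is known-good.
// Assumes UtfNormal::toNFC() and UtfNormal::cleanUp() return the new string.
function normalizeInput(string $s): string {
    if (!class_exists('UtfNormal')) {
        // Library not loaded; hand the string back untouched.
        return $s;
    }
    return isValidUtf8($s)
        ? UtfNormal::toNFC($s)     // input trusted: normalize only
        : UtfNormal::cleanUp($s);  // input suspect: strip bad bytes, then normalize
}
```

Note that a string can be well-formed UTF-8 and still contain characters that
are illegal in XML, so if the result is headed for XML output it is probably
simpler to call cleanUp() unconditionally.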

Performance is kind of stinky in absolute terms, though it should be speedy
on pure ASCII text. ;) On text that can be determined quickly to already be
in NFC it's not too awful but it can quickly get uncomfortably slow,
particularly for Korean text (the hangul decomposition/composition code is
extra slow).
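
For reference, hangul syllables are not stored in the data tables; they are
composed and decomposed arithmetically, as described in the Unicode Standard.
The following is a minimal sketch of that arithmetic working on arrays of code
points. It is not the code in UtfNormal.php (which works on UTF-8 bytes), the
function names are made up for illustration, and it assumes PHP 7 or later.

```php
<?php
// Constants from the Unicode Standard's hangul syllable arithmetic.
const S_BASE  = 0xAC00;  // first precomposed syllable
const L_BASE  = 0x1100;  // first leading consonant (choseong)
const V_BASE  = 0x1161;  // first vowel (jungseong)
const T_BASE  = 0x11A7;  // one before the first trailing consonant (jongseong)
const L_COUNT = 19;
const V_COUNT = 21;
const T_COUNT = 28;
const N_COUNT = V_COUNT * T_COUNT;  // 588
const S_COUNT = L_COUNT * N_COUNT;  // 11172

// Split one precomposed syllable into its jamo; other code points pass through.
function hangulDecompose(int $cp): array {
    $sIndex = $cp - S_BASE;
    if ($sIndex < 0 || $sIndex >= S_COUNT) {
        return [$cp];
    }
    $l = L_BASE + intdiv($sIndex, N_COUNT);
    $v = V_BASE + intdiv($sIndex % N_COUNT, T_COUNT);
    $t = T_BASE + $sIndex % T_COUNT;
    return $t === T_BASE ? [$l, $v] : [$l, $v, $t];
}

// Compose jamo sequences in a list of code points back into syllables.
function hangulCompose(array $cps): array {
    $out = [];
    foreach ($cps as $cp) {
        $last = end($out);
        if ($last !== false) {
            $lIndex = $last - L_BASE;
            $vIndex = $cp - V_BASE;
            if ($lIndex >= 0 && $lIndex < L_COUNT && $vIndex >= 0 && $vIndex < V_COUNT) {
                // leading consonant + vowel -> LV syllable
                $out[count($out) - 1] = S_BASE + ($lIndex * V_COUNT + $vIndex) * T_COUNT;
                continue;
            }
            $sIndex = $last - S_BASE;
            $tIndex = $cp - T_BASE;
            if ($sIndex >= 0 && $sIndex < S_COUNT && $sIndex % T_COUNT === 0
                && $tIndex > 0 && $tIndex < T_COUNT) {
                // LV syllable + trailing consonant -> LVT syllable.
                // A syllable that already has a trailing consonant
                // ($sIndex % T_COUNT !== 0) must not absorb another one.
                $out[count($out) - 1] = $last + $tIndex;
                continue;
            }
        }
        $out[] = $cp;
    }
    return $out;
}

// U+D55C decomposes to U+1112 U+1161 U+11AB, and composes back again.
$jamo = hangulDecompose(0xD55C);
$syllable = hangulCompose($jamo);
```

The arithmetic itself is cheap; the cost in a pure PHP implementation comes
mostly from doing it character by character on UTF-8 byte strings.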


== Regenerating data tables ==

UtfNormalData.inc and UtfNormalDataK.inc are generated from the Unicode
Character Database by the script UtfNormalGenerate.php. On a *nix system,
'make' should fetch the necessary files and regenerate the tables if the
script has been changed or you remove the generated files.


== Testing ==

'make test' will run the conformance test (UtfNormalTest.php), fetching the
data from the net if necessary. If it reports failure, something is
going wrong!