This directory contains some Unicode normalization routines. These routines
are meant to be reusable in other projects, so I'm not tying them to the
MediaWiki utility functions.

The main function to care about is UtfNormal::toNFC(); this will convert
a given UTF-8 string to Normalization Form C if it's not already such.
The function assumes that the input string is already valid UTF-8; if there
are corrupt characters this may produce erroneous results.
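
As a minimal sketch of calling UtfNormal::toNFC(), the snippet below feeds it
a decomposed "e" plus combining acute accent. The include path is an
assumption (UtfNormal.php in the current directory); adjust it to wherever
this directory lives in your project.

```php
<?php
// Assumption: UtfNormal.php is in the current directory.
require_once 'UtfNormal.php';

// "e" followed by U+0301 COMBINING ACUTE ACCENT: a decomposed sequence
// that is valid UTF-8 but not in Normalization Form C.
$decomposed = "e\xCC\x81";

// toNFC() expects valid UTF-8 input and returns the NFC form.
$composed = UtfNormal::toNFC( $decomposed );

// In NFC the pair composes to U+00E9, which is the bytes C3 A9 in UTF-8,
// so this should print "c3a9".
echo bin2hex( $composed ), "\n";
```

Pure ASCII input should come back unchanged, since it is already in NFC.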

To also check for illegal characters, use UtfNormal::cleanUp(). This will
strip illegal UTF-8 sequences and characters that are illegal in XML, and
if necessary convert to normalization form C.
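
A rough sketch of cleanUp() on untrusted input follows. The include path is
again an assumption, and the example deliberately does not assume what the
cleaned string looks like byte for byte; it only checks that the result is
valid UTF-8 afterwards, using PCRE's /u modifier.

```php
<?php
// Assumption: UtfNormal.php is in the current directory.
require_once 'UtfNormal.php';

// "a", then a stray 0xFF byte (never legal in UTF-8), then "b".
$dirty = "a\xFFb";

// cleanUp() handles illegal sequences and normalizes to NFC if needed.
$clean = UtfNormal::cleanUp( $dirty );

// preg_match() with the /u modifier fails on invalid UTF-8 subjects,
// so a result of 1 here means $clean is well-formed.
$isValid = ( preg_match( '//u', $clean ) === 1 );

echo $isValid ? "valid UTF-8\n" : "still invalid\n";
```

Use cleanUp() rather than toNFC() whenever the input comes from outside
and has not been validated yet.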

Performance is kind of stinky in absolute terms, though it should be speedy
on pure ASCII text. ;) On text that can be determined quickly to already be
in NFC it's not too awful but it can quickly get uncomfortably slow,
particularly for Korean text (the hangul decomposition/composition code is
extra slow).


== Regenerating data tables ==

UtfNormalData.inc and UtfNormalDataK.inc are generated from the Unicode
Character Database by the script UtfNormalGenerate.php. On a *nix system
'make' should fetch the necessary files and regenerate them if the scripts
have been changed or you remove them.


== Testing ==

'make test' will run the conformance test (UtfNormalTest.php), fetching the
data from the net if necessary. If it reports failure, something is
going wrong!

You may have to set up PHPUnit first.

$ pear channel-discover pear.phpunit.de
$ pear install phpunit/PHPUnit

== Benchmarks ==

Run 'make bench' to download some sample texts from Wikipedia and run some
cheap benchmarks of some of the functions. Take all numbers with large
grains of salt.


== PHP module extension ==

There's an experimental PHP extension module which wraps the ICU library's
normalization functions. This is *MUCH* faster than doing this work in pure
PHP code. This is in the 'normal' directory in MediaWiki's CVS extensions
module. It is known to work with PHP 4.3.8 and 5.0.2 on Linux/x86 but hasn't
been thoroughly tested on other configurations.

If the php_normal.so module is loaded in php.ini, the normalization functions
will automatically use it. If you can't (or don't want to) load it in php.ini,
you may be able to load it using the dl() function before include()ing or
require()ing UtfNormal.php, and it will be picked up.
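
The dl() approach might look something like the sketch below. The module
file name comes from the paragraph above; the include path and the
function_exists() guard are assumptions added for illustration, since dl()
is not available in every PHP build or SAPI.

```php
<?php
// dl() may be disabled or missing entirely, so check before calling it.
if ( function_exists( 'dl' ) ) {
	// Suppress the warning if the module can't be found or loaded;
	// the pure-PHP routines are used in that case.
	@dl( 'php_normal.so' );
}

// Load the extension first, then the library, so it can pick the module up.
// Assumption: UtfNormal.php is in the current directory.
require_once 'UtfNormal.php';
```

If the module fails to load, nothing else needs to change; the calls to
UtfNormal stay the same and only the speed differs.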