Commit graph

32 commits

Author SHA1 Message Date
Ævar Arnfjörð Bjarmason
a26d5a49d7 * s~\t+$~~ 2006-01-07 13:31:29 +00:00
Ævar Arnfjörð Bjarmason
7bbe971aec * s~ +$~~ 2006-01-07 13:09:30 +00:00
Brion Vibber
af2177edfd Code cleanup: normalize case for intval(), strval(), floatval() calls. 2005-08-16 23:36:16 +00:00
Brion Vibber
727e4d1aab Fix composition bug: completed hangul syllable should not be merged with another following final jamo 2004-11-15 00:59:40 +00:00
Brion Vibber
c6340de5b3 Fix regression in ICU-mode UTF-8 verification: U+FFFF is forbidden 2004-11-14 21:36:43 +00:00
Brion Vibber
e4e75a58a6 Support using ICU to do most of the heavy lifting in cleanUp() if the extension is loaded.
Modestly faster for Roman text (1-2x), 16-20x faster than the PHP looping for already normalized Russian, Japanese, and Korean text.
2004-11-14 05:17:29 +00:00
Brion Vibber
4a4f248655 Fix regression: surrogate half followed by extra tail bytes 2004-11-14 04:27:03 +00:00
Brion Vibber
9535fc035b Fix UTF-8 validation regression: well-formed but forbidden UTF-8 sequence followed by bogus tail bytes 2004-11-14 04:07:28 +00:00
Brion Vibber
dd69eb14f5 Fix UTF-8 validation regression where a bad head byte is followed by ASCII, then a bad tail byte. 2004-11-14 03:48:49 +00:00
Brion Vibber
7bf6095d73 Fix UTF-8 validation bug where some cases didn't get replacement chars inserted correctly 2004-11-14 02:24:44 +00:00
Brion Vibber
eae361e2f0 cleanUp() optimization: speed up Japanese, Korean tests by another 15% by rearranging the loop and avoiding rebuilding the string if there are no illegal characters.
Removed restrictions on U+FDD0 and friends; these do seem to be allowed by XML, though they 'recommend' you avoid them.
2004-11-07 11:28:00 +00:00
Brion Vibber
7434438b98 Don't forget to actually _make_ the replacements for illegal chars. :P 2004-11-06 02:52:25 +00:00
Brion Vibber
51dd271399 Shave off a few more milliseconds from cleanUp() inner loop. 2004-11-05 09:13:02 +00:00
Brion Vibber
97f577163c Shave a few more percentage points off cleanUp() times on Unicode text by building a combined NFC-check hash. 2004-11-05 08:22:56 +00:00
Brion Vibber
0db79dbed6 More incremental optimization on cleanUp():
* when splitting ASCII vs non-ASCII chunks, don't split punctuation and control chars as aggressively; this benefits the Korean test data
* use output buffer and echo; it's _slightly_ faster than string concatenation.
* Separate the surrogate check from the others; many Korean letters fall in the adjacent area with the same head byte, so this gives a small speed boost on Korean text
2004-11-05 04:07:04 +00:00
Brion Vibber
874f8b48c6 cleanUp() optimization: about 1/8 speed boost on Unicode-dominant text (Japanese, Korean test data) 2004-11-05 00:47:03 +00:00
Brion Vibber
9ba6a6c74a cleanUp() optimization: split the string into pure ASCII chunks and chunks which need to be checked byte by byte. Over 5x speedup for German text sample. 2004-11-05 00:26:09 +00:00
Brion Vibber
48cb181bd2 Optimization on cleanUp(): roughly 1/3 speed boost on ASCII-dominant but not ASCII-pure text (e.g. German) 2004-11-04 23:53:44 +00:00
Brion Vibber
5f530ba1f3 Optimize inner loop in cleanUp(): boosts performance on non-ASCII text by about 20%.
Also, trim the XML-illegal control characters from pure ASCII as well as non-ASCII strings.
2004-11-04 11:44:45 +00:00
Brion Vibber
1897c54f2a The pass-by-reference on the string on fastCompose() really slows things down sometimes in PHP4. Taking it out speeds up processing of Japanese text significantly. 2004-10-30 12:35:37 +00:00
Brion Vibber
286dd13042 More inlining; fastCompose() is now twice as fast on hangul chars, which cuts down the NFC() time on Korean text a fair chunk. 2004-10-30 12:06:31 +00:00
Brion Vibber
de3549d9e9 Optimize inner loops a bit. 2004-10-30 06:02:30 +00:00
Brion Vibber
d2e152e6de Munge doc comments. Mark as its own package for docs. 2004-10-28 02:56:13 +00:00
Brion Vibber
6377e82b76 Load form C data on demand; if we are dealing in all-ASCII text we can save some memory and time by not loading it. 2004-10-09 08:08:26 +00:00
Brion Vibber
0824182956 Add support for using ICU to perform normalization, which is much much faster than the PHP code!
Still need to add support for cleanup/verification.
2004-10-07 05:59:10 +00:00
Brion Vibber
f0610d0f67 Doc comments 2004-09-27 02:59:24 +00:00
Brion Vibber
dd195aa594 Some more phpdoc bits 2004-09-04 09:35:01 +00:00
Antoine Musso
ba2afcd9fa Split files and classes in different packages for phpdocumentor. I probably changed some double quotes to single and used the function foo () { scheme 2004-09-03 23:00:01 +00:00
Brion Vibber
9857a47c3f Correction to the \r stripping 2004-09-03 06:44:57 +00:00
Brion Vibber
ed46bd50fe Add UtfNormal::cleanUp() function: strips XML-unsafe characters and illegal UTF-8 sequences, then normalizes to form C. 2004-09-03 05:39:30 +00:00
Brion Vibber
53e71c1702 Split the data arrays for form KC, KD to a separate include file and load it on demand.
These are less likely to be used, so save the memory and parse time...
2004-09-02 07:39:06 +00:00
Brion Vibber
a5cfdf0360 Unicode normalization routines.
See: http://www.unicode.org/reports/tr15/
2004-08-29 10:30:23 +00:00