Ævar Arnfjörð Bjarmason
a26d5a49d7
* s~\t+$~~
2006-01-07 13:31:29 +00:00
Ævar Arnfjörð Bjarmason
7bbe971aec
* s~ +$~~
2006-01-07 13:09:30 +00:00
Brion Vibber
af2177edfd
Code cleanup: normalize case for intval(), strval(), floatval() calls.
2005-08-16 23:36:16 +00:00
Brion Vibber
727e4d1aab
Fix composition bug: completed hangul syllable should not be merged with another following final jamo
2004-11-15 00:59:40 +00:00
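The rule behind this fix can be sketched with the algorithmic Hangul composition from the Unicode standard: a final jamo (T) may only attach to an LV syllable that has no final yet. This is a minimal Python illustration, not the PHP code from the commit; `compose_hangul` and the constant names are hypothetical.

```python
# Constants from the Unicode standard's Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 precomposed syllables


def compose_hangul(text: str) -> str:
    """Compose conjoining jamo into syllables (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if out:
            last = ord(out[-1])
            # Leading consonant + vowel -> LV syllable.
            if (L_BASE <= last < L_BASE + L_COUNT
                    and V_BASE <= cp < V_BASE + V_COUNT):
                l_index = last - L_BASE
                v_index = cp - V_BASE
                out[-1] = chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
                continue
            # LV syllable + final consonant -> LVT syllable.
            # The modulo test is the guard: a syllable that already has a
            # final (s_index % T_COUNT != 0) must not absorb another one.
            s_index = last - S_BASE
            if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
                    and T_BASE < cp < T_BASE + T_COUNT):
                out[-1] = chr(last + (cp - T_BASE))
                continue
        out.append(ch)
    return "".join(out)
```

With this guard, a completed syllable followed by a second final jamo is left as two code points instead of being merged.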
Brion Vibber
c6340de5b3
Fix regression in ICU-mode UTF-8 verification: U+FFFF is forbidden
2004-11-14 21:36:43 +00:00
Brion Vibber
e4e75a58a6
Support using ICU to do most of the heavy lifting in cleanUp() if the extension is loaded.
...
Modestly faster for roman text (1-2x), 16-20x faster than the PHP looping for already normalized Russian, Japanese, and Korean text.
2004-11-14 05:17:29 +00:00
Brion Vibber
4a4f248655
Fix regression: surrogate half followed by extra tail bytes
2004-11-14 04:27:03 +00:00
Brion Vibber
9535fc035b
Fix UTF-8 validation regression: well-formed but forbidden UTF-8 sequence followed by bogus tail bytes
2004-11-14 04:07:28 +00:00
Brion Vibber
dd69eb14f5
Fix UTF-8 validation regression where a bad head byte is followed by ASCII, then a bad tail byte.
2004-11-14 03:48:49 +00:00
Brion Vibber
7bf6095d73
Fix UTF-8 validation bug where some cases didn't get replacement chars inserted correctly
2004-11-14 02:24:44 +00:00
Brion Vibber
eae361e2f0
cleanUp() optimization: speed up Japanese, Korean tests by another 15% by rearranging the loop and avoiding rebuilding the string if there are no illegal characters.
...
Removed restrictions on U+FDD0 and friends; these do seem to be allowed by XML, though they 'recommend' you avoid them.
2004-11-07 11:28:00 +00:00
Brion Vibber
7434438b98
Don't forget to actually _make_ the replacements for illegal chars. :P
2004-11-06 02:52:25 +00:00
Brion Vibber
51dd271399
Shave off a few more milliseconds from cleanUp() inner loop.
2004-11-05 09:13:02 +00:00
Brion Vibber
97f577163c
Shave a few more percentage points from times on cleanUp() on unicode text by building a combined NFC-check hash.
2004-11-05 08:22:56 +00:00
Brion Vibber
0db79dbed6
More incremental optimization on cleanUp():
...
* when splitting ascii vs non-ascii chunks, don't split punctuation and control chars as aggressively; this benefits the Korean test data
* use output buffer and echo; it's _slightly_ faster than string concatenation.
* Separate the surrogate check from the others; many Korean letters fall in the adjacent area with the same head byte, so this gives a small speed boost on Korean text
2004-11-05 04:07:04 +00:00
Brion Vibber
874f8b48c6
cleanUp() optimization: about 1/8 speed boost on unicode-dominant text (Japanese, Korean test data)
2004-11-05 00:47:03 +00:00
Brion Vibber
9ba6a6c74a
cleanUp() optimization: split the string into pure ASCII chunks and chunks which need to be checked byte by byte. Over 5x speedup for German text sample.
2004-11-05 00:26:09 +00:00
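The chunk-splitting idea in this commit can be sketched as follows. This is a Python illustration under assumptions, not the PHP implementation: `split_chunks` is a hypothetical name, and the real code may draw the chunk boundaries differently.

```python
import re

# Alternating runs of pure-ASCII bytes and bytes with the high bit set.
_CHUNKS = re.compile(rb"[\x00-\x7f]+|[\x80-\xff]+")


def split_chunks(data: bytes):
    """Return (chunk, is_ascii) pairs covering the whole input.

    Pure-ASCII chunks can be passed through cheaply; only the other
    chunks need the byte-by-byte UTF-8 check.
    """
    return [(m.group(0), m.group(0)[0] < 0x80) for m in _CHUNKS.finditer(data)]
```

For mostly-ASCII text such as German, most bytes land in ASCII chunks and skip the slow path, which is consistent with the speedup the commit reports.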
Brion Vibber
48cb181bd2
Optimization on cleanUp(): roughly 1/3 speed boost on ASCII-dominant but not pure-ASCII text (e.g. German)
2004-11-04 23:53:44 +00:00
Brion Vibber
5f530ba1f3
Optimize inner loop in cleanUp(): boosts performance on non-ASCII text by about 20%.
...
Also, trim the XML-illegal control characters from pure ASCII as well as non-ASCII strings.
2004-11-04 11:44:45 +00:00
Brion Vibber
1897c54f2a
Passing the string by reference to fastCompose() really slows things down sometimes in PHP4. Taking it out speeds up processing of Japanese text significantly.
2004-10-30 12:35:37 +00:00
Brion Vibber
286dd13042
More inlining; fastCompose() is now twice as fast on hangul chars, which cuts down the NFC() time on Korean text a fair chunk.
2004-10-30 12:06:31 +00:00
Brion Vibber
de3549d9e9
Optimize inner loops a bit.
2004-10-30 06:02:30 +00:00
Brion Vibber
d2e152e6de
Munge doc comments. Mark as its own package for docs.
2004-10-28 02:56:13 +00:00
Brion Vibber
6377e82b76
Load form C data on demand; if we are dealing in all-ASCII text we can save some memory and time by not loading it.
2004-10-09 08:08:26 +00:00
Brion Vibber
0824182956
Add support for using ICU to perform normalization, which is much much faster than the PHP code!
...
Still need to add support for cleanup/verification.
2004-10-07 05:59:10 +00:00
Brion Vibber
f0610d0f67
Doc comments
2004-09-27 02:59:24 +00:00
Brion Vibber
dd195aa594
Some more phpdoc bits
2004-09-04 09:35:01 +00:00
Antoine Musso
ba2afcd9fa
Split files and classes into different packages for phpDocumentor. I probably changed some double quotes to single quotes and used the function foo () { scheme.
2004-09-03 23:00:01 +00:00
Brion Vibber
9857a47c3f
Correction to the \r stripping
2004-09-03 06:44:57 +00:00
Brion Vibber
ed46bd50fe
Add UtfNormal::cleanUp() function: strips XML-unsafe characters and illegal UTF-8 sequences, then normalizes to form C.
2004-09-03 05:39:30 +00:00
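The three steps this commit describes can be sketched in Python as below. This is an illustration only, not UtfNormal::cleanUp() itself: `clean_up` is a hypothetical name, and the set of characters treated as XML-unsafe is an assumption (the C0 controls that XML 1.0 forbids).

```python
import re
import unicodedata

# Assumption: "XML-unsafe" means the control characters XML 1.0 forbids,
# i.e. everything below U+0020 except tab, line feed and carriage return.
_XML_ILLEGAL = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_up(raw: bytes) -> str:
    """Hypothetical stand-in for the cleanup pipeline."""
    # 1. Illegal UTF-8 sequences are replaced with U+FFFD while decoding.
    text = raw.decode("utf-8", errors="replace")
    # 2. XML-unsafe control characters are stripped.
    text = _XML_ILLEGAL.sub("", text)
    # 3. The result is normalized to form C.
    return unicodedata.normalize("NFC", text)
```

For example, `clean_up(b"a\xffb")` keeps the two letters and puts a replacement character where the invalid byte was.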
Brion Vibber
53e71c1702
Split the data arrays for form KC, KD to a separate include file and load it on demand.
...
These are less likely to be used, so save the memory and parse time...
2004-09-02 07:39:06 +00:00
Brion Vibber
a5cfdf0360
Unicode normalization routines.
...
See: http://www.unicode.org/reports/tr15/
2004-08-29 10:30:23 +00:00