I've cautiously moved the regex out of the most used code path.
Every string that passes the regex check is also accepted by
mb_check_encoding. I think the regex was intended as a shortcut
evaluation, but it is no faster than mb_check_encoding, which will often
need to be run anyway.
I think it could just be deleted, but I have limited motivation to
risk introducing a bug to improve performance on old PHP versions and
unusual configurations, so I've moved it to the fallback code path.
Change-Id: Ie9425cc23ba032e5aff42beeb44cbb1146050452
RFC 3629 defines the legal range of characters as U+0000..U+10FFFF
and forbids overlong forms (encodings of a character that use more
bytes than necessary). Let's make StringUtils::isUtf8() match the
specification.
* Changed the maximum value in the pure PHP code path and added a
check for overlong forms.
* Added another check, specific to PHP 5.3's mbstring extension,
for values above U+10FFFF.
* Fixed the mbstring test errors in PHP 5.4 using changes to
StringUtilsTest by Platonides <platonides@gmail.com>.
* Uncommented some other tests that could fail because of the
missing check for overlong forms.
* Added additional tests for extra continuation bytes, overlong
sequences/forms, and values in the UTF-16 surrogate range.
The changes to the function were so extensive that I might as
well say I rewrote it.
Bug: 43679
Change-Id: I56ae496d17ffc3747550e06a72dacab3ac55da61
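
The RFC 3629 rules named in the change above (maximum of U+10FFFF, no
overlong forms, no UTF-16 surrogates) can be sketched as a byte-by-byte
check. This is an illustrative Python sketch, not the PHP code from the
change; the function name is_utf8_strict is hypothetical.

```python
def is_utf8_strict(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 in the RFC 3629 sense."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # Single-byte (ASCII) character.
            i += 1
            continue
        # Pick the sequence length, the smallest code point that
        # legitimately needs that length, and the lead byte's payload bits.
        if 0xC0 <= lead <= 0xDF:
            length, min_cp, cp = 2, 0x80, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, min_cp, cp = 3, 0x800, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            length, min_cp, cp = 4, 0x10000, lead & 0x07
        else:
            # Stray continuation byte or a 5/6-byte lead byte.
            return False
        if i + length > n:
            return False  # truncated sequence
        for cont in data[i + 1:i + length]:
            if cont & 0xC0 != 0x80:
                return False  # not a continuation byte
            cp = (cp << 6) | (cont & 0x3F)
        if cp < min_cp:
            return False  # overlong form
        if cp > 0x10FFFF:
            return False  # above the RFC 3629 maximum
        if 0xD800 <= cp <= 0xDFFF:
            return False  # UTF-16 surrogate range
        i += length
    return True
```

The three checks after the continuation-byte loop correspond to the cases
the added tests cover: overlong forms, values above U+10FFFF, and
surrogates. Extra continuation bytes are caught on the next loop pass,
where they fail to qualify as a lead byte.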
And added/removed spaces around some other tokens,
like +, -, *, /, <, >, =, !
Fixed Windows newline style
Change-Id: I0b9c8c408f3f6bfc0d685a074d7ec468fb848fc8
* Ran spell-checker over code comments in /includes/
* A few spellchecking fixes for wfDebug() calls
Found one very strange (NOOP?) line in Linker.php - see "TODO: BUG?"
Change-Id: Ibb86b51073b980eda9ecce2cf0b8dd33f058adbf
Doxygen expects parameter types to come before the
parameter name in @param tags. Used a quick regex
to switch everything around where possible. This
only fixes cases where the variable type is a
primitive (or a primitive followed by other types).
Other cases will need to be fixed manually.
Change-Id: Ic59fd20856eb0489d70f3469a56ebce0efb3db13
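
A swap of this kind might look roughly like the Python sketch below. The
actual regex used for the change is not recorded here, so the pattern,
the list of primitive type names, and the function name fix_param_order
are all assumptions made for illustration.

```python
import re

# Hypothetical list of primitive type names; the real change may have
# used a different set.
PRIMITIVES = r"(?:string|int|integer|bool|boolean|float|array|mixed|null)"

# Matches "@param $name type" where the type starts with a primitive and
# may be followed by further "|"-separated types.
PARAM_RE = re.compile(
    r"(@param\s+)(\$\w+)(\s+)(" + PRIMITIVES + r"(?:\|[\w\\]+)*)(?=\s|$)"
)


def fix_param_order(line: str) -> str:
    """Rewrite '@param $name type' as '@param type $name'."""
    return PARAM_RE.sub(r"\1\4\3\2", line)


print(fix_param_order(" * @param $name string The user name"))
print(fix_param_order(" * @param $title Title Left for manual fixing"))
```

Lines whose type is a class name do not start with a primitive, so the
sketch leaves them untouched, which matches the manual follow-up
described above.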
The Language class had a code snippet to verify whether a text is valid
UTF-8, but it could not be used from anywhere else. The snippet uses
mb_check_encoding() and falls back to a regex whenever mbstring is not
available.
* Introduce StringUtils::isUtf8(), which is mostly code moved out of the
Language class.
* Enhance regex readability by using an expanded regex (//x)
* Made the regex recognize longer sequences
* Add some unit tests for the mbstring and the native PHP implementations
* An optional second parameter can be passed to isUtf8() to force the
use of our PHP implementation. This is used for unit testing.
Change-Id: I4cf4dfe2eb02f046db1726f4654ba649e01419f2
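
The overall shape described above (a native check, an expanded-regex
fallback, and a flag that forces the fallback for testing) might be
sketched as follows. This is Python for illustration only: bytes.decode()
stands in for mb_check_encoding(), the pattern is the commonly published
well-formed UTF-8 regex rather than the one in this change, and the names
is_utf8 and disable_native are hypothetical.

```python
import re

# Expanded (verbose) regex, the Python counterpart of PCRE's //x flag:
# whitespace is ignored and each alternative carries its own comment.
_UTF8_RE = re.compile(
    rb"""
    (?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlong forms
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, general case
      | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, excluding overlongs
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, general case
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
    )*
    """,
    re.VERBOSE,
)


def is_utf8(data: bytes, disable_native: bool = False) -> bool:
    """Validate UTF-8, optionally forcing the regex fallback."""
    if not disable_native:
        try:
            data.decode("utf-8")  # strict decoding stands in for mbstring
            return True
        except UnicodeDecodeError:
            return False
    return _UTF8_RE.fullmatch(data) is not None
```

A test suite can then run every sample through both paths, for example
is_utf8(sample) and is_utf8(sample, disable_native=True), and require
that they agree.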
Also made file/class documentation more consistent and removed a comment in SpecialPage.php that duplicated one in SpecialPageFactory.php.
Change-Id: I99dd2de7fe461f2fad4e0bd315ebc2899958a90f
* Split link placeholder/replacement handling into a separate object, LinkHolderArray.
* Remove Title objects from LinkCache; they apparently weren't being used at all. Same unconstrained memory usage as the former $parser->mLinkHolders.
* Introduced ExplodeIterator -- a workalike for explode() which doesn't use a significant amount of memory
* Introduced StringUtils::explode() -- select whether to use the simulated or native explode() depending on how many items there are
* Migrated most instances of explode() in Parser.php to StringUtils::explode()
* Renamed some variables in Parser::doBlockLevels()
* In Parser.php: $fname => __METHOD__, Parser => self/__CLASS__, to support Parser_DiffTest more easily
* Doc update in includes/MessageCache.php for r39412
* MW_TITLECACHE_MAX => Title::CACHE_MAX, nicer name, easier to access from another module
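
The ExplodeIterator / StringUtils::explode() idea from the list above can
be sketched as a lazy splitter plus a wrapper that picks between the lazy
and the native version. This is an illustrative Python sketch, not the
PHP implementation; the names explode_iter, explode and LAZY_THRESHOLD,
and the threshold value itself, are assumptions.

```python
from typing import Iterator

# Hypothetical cut-over point between native and lazy splitting.
LAZY_THRESHOLD = 1000


def explode_iter(separator: str, subject: str) -> Iterator[str]:
    """Yield the pieces of subject one at a time instead of building a list."""
    if not separator:
        raise ValueError("separator must not be empty")
    start = 0
    while True:
        end = subject.find(separator, start)
        if end == -1:
            # No more separators: the remainder is the last piece.
            yield subject[start:]
            return
        yield subject[start:end]
        start = end + len(separator)


def explode(separator: str, subject: str) -> Iterator[str]:
    """Use the lazy splitter only when there are many pieces."""
    if not separator:
        raise ValueError("separator must not be empty")
    if subject.count(separator) > LAZY_THRESHOLD:
        return explode_iter(separator, subject)
    return iter(subject.split(separator))


for piece in explode(",", "a,b,,c"):
    print(repr(piece))
```

Both branches return an iterator, so callers can loop over the result
the same way regardless of which one was chosen.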