37 commits

Author SHA1 Message Date
lwelling
7a344588d3 Remove redundant regex from calls to StringUtils::isUtf8()
I've cautiously moved the regex out of the most used code path.
There is no string that will match that regex check that will not also be
passed by mb_check_encoding.  I think the regex was intended as a shortcut
evaluation, but it is no faster than mb_check_encoding which will often
need to be run anyway.

I think it could just be deleted, but I have limited motivation to
risk introducing a bug to improve performance on old PHP versions and
unusual configurations, so I've moved it to the fallback code path.

Change-Id: Ie9425cc23ba032e5aff42beeb44cbb1146050452
2013-09-21 02:59:49 +01:00
jenkins-bot
320a2971e0 Merge "Adapt StringUtils::isUtf8 to the top of Unicode at U+10FFFF" 2013-09-19 16:12:39 +00:00
Kevin Israel
7447669e83 Adapt StringUtils::isUtf8 to the top of Unicode at U+10FFFF
RFC 3629 defines the legal range of characters as U+0000..U+10FFFF
and forbids overlong forms (encodings of a character that use more
bytes than necessary). Let's make StringUtils::isUtf8() match the
specification.

* Changed the maximum value in the pure PHP code path and added a
  check for overlong forms.
* Added another check, specific to PHP 5.3's mbstring extension,
  for values above U+10FFFF.
* Fixed the mbstring test errors in PHP 5.4 using changes to
  StringUtilsTest by Platonides <platonides@gmail.com>.
* Uncommented some other tests that could fail because of the
  missing check for overlong forms.
* Added additional tests for extra continuation bytes, overlong
  sequences/forms, and values in the UTF-16 surrogate range.

The changes to the function were so extensive that I might as
well say I rewrote it.

Bug: 43679
Change-Id: I56ae496d17ffc3747550e06a72dacab3ac55da61
2013-09-18 17:23:15 -04:00
Timo Tijhof
6edb2c8d73 doc: Clean up documentation for StringUtils classes
Change-Id: Ie016c1d1686c9dce7944864e77e6c3bdf001d8c3
2013-09-18 13:46:18 +02:00
umherirrender
6f79eef473 Fixed spacing around parenthesis in includes
Change-Id: Ie8adc00f4ee8ecec4554e584c18d5d2073415397
2013-04-28 15:50:07 +00:00
umherirrender
ef2f507d23 Fixed spacing in files directly in the includes folder
Added spaces before if, foreach
Added some braces for one line statements

Change-Id: Ibb8dd102db045522d12ff939075ba7420d95ab6b
2013-04-21 06:38:49 +00:00
umherirrender
15abcf71ca Added/Removed spaces around string concatenation
And added/removed spaces around some other tokens,
like +, -, *, /, <, >, =, !

Fixed windows newline style

Change-Id: I0b9c8c408f3f6bfc0d685a074d7ec468fb848fc8
2013-04-13 13:36:24 +02:00
Yuri Astrakhan
9506e3d812 Spellchecked /includes directory
* Ran spell-checker over code comments in /includes/
* A few spellchecking fixes for wfDebug() calls

Found one very strange (NOOP?) line in Linker.php - see "TODO: BUG?"

Change-Id: Ibb86b51073b980eda9ecce2cf0b8dd33f058adbf
2013-03-13 03:42:41 -04:00
Tyler Anthony Romeo
4dcc7961df Fixed @param tags to conform with Doxygen format.
Doxygen expects parameter types to come before the
parameter name in @param tags. Used a quick regex
to switch everything around where possible. This
only fixes cases where a primitive variable (or a
primitive followed by other types) is the variable
type. Other cases will need to be fixed manually.

Change-Id: Ic59fd20856eb0489d70f3469a56ebce0efb3db13
2013-03-11 13:15:01 -04:00
Siebrand Mazeland
9b7889b84b Use American English spelling for behavior
Spotted in ipbreason-dropdown by Shirayuki.

Change-Id: I576ed4bc0abe5ab980aaee3fb9f9e4b43087311f
2013-03-04 10:24:57 +01:00
umherirrender
1044b0b8df fix some spacing
Change-Id: I8f976013f33c5818e4402604fe8610aa3f43b0c6
2013-02-04 20:18:33 +00:00
Antoine Musso
750db30d9b abstract utf8 validation fallback
The Language class had a code snippet to verify whether a text is valid
UTF-8, but it could not be used from anywhere else. The snippet uses
mb_check_encoding() and falls back to a regex whenever mbstring is not
available.

* introduce StringUtils::isUtf8() which is mostly code moved out of the
  language class.
* Enhance regex readability by using an expanded regex (//x)
* Made the regex recognize longer sequences
* Add some unit tests to the mbstring and the PHP native implementation
* An optional second parameter can be passed to isUtf8() to force the
  use of our PHP implementation. This is used for unit testing.

Change-Id: I4cf4dfe2eb02f046db1726f4654ba649e01419f2
2012-12-12 11:24:38 +00:00
umherirrender
85d8ee1f87 Remove a bunch of trailing spaces and unneeded newlines
Change-Id: I00f369641320acd7f087427ef031f3ee7efa0997
2012-10-10 20:14:40 +02:00
Antoine Musso
d5737f8f17 update @param @return doc in several files
Change-Id: I0e23227330f90dc4121fd2a313d2e9a33c3c97a7
2012-07-10 17:08:52 +02:00
Alexandre Emsenhuber
bc9d9f1f9c Added missing GPLv2 headers in some places.
Also made file/class documentation more consistent and removed a duplicate comment from SpecialPageFactory.php in SpecialPage.php.

Change-Id: I99dd2de7fe461f2fad4e0bd315ebc2899958a90f
2012-05-10 17:51:44 +02:00
Reedy
3a6211a28c Document StringUtils
Change-Id: I5a99ed602c6bf99473e2deb1c5f38faa98def30e
2012-04-07 20:00:10 +01:00
Antoine Musso
73247df204 Remove backslash from @return types
Ping r111103
2012-02-13 16:35:59 +00:00
Sam Reed
12a9b1d2fb More documentation tweaks and updates 2011-05-21 19:54:24 +00:00
Tim Starling
5eac114e5a (bug 27093, CVE-2011-0047): Fixed CSS injection vulnerability. The StringUtils.php patch is by Roan, the Sanitizer.php patch is by me. 2011-02-01 22:36:43 +00:00
Alexandre Emsenhuber
6e2ecb581f Fixed a doxygen warning 2010-10-02 14:23:26 +00:00
Niklas Laxström
2b042ba2ae Mark the comment as documentation 2010-08-08 06:36:44 +00:00
Alexandre Emsenhuber
55234801e1 Fixed some doxygen warnings 2010-03-29 20:10:29 +00:00
Tim Starling
8b9bedbad7 Revert r61528, r61527, r61526, r61525, r61519, r61515, r61053, r61052 (Parser::doQuotes() rewrite). Lots of issues to discuss, needs more review than I have time to give it pre-1.16. I'll split it out to a branch. 2010-01-27 02:41:22 +00:00
Platonides
11f8b8390c Step 4: Profit!!
Add and use PregSplitIterator instead of a direct preg_split.
Slower, but with an upper bound on memory usage.
2010-01-26 18:58:07 +00:00
Brion Vibber
5b5f7b30b3 Revert r40837, r40839, r40840 (bug 332 - broken UTF-8)
Char-by-char scan of all output will perform very poorly and fails to address the root problem of bad internal treatment of strings.
2008-09-15 17:51:53 +00:00
Fran Rogers
ad5f1acdb3 Fix for bug #332 - all UTF-8 output is now cleaned of invalid forms as defined by RFC 3629. All output from MediaWiki should now be valid UTF-8 in all circumstances. 2008-09-15 00:42:17 +00:00
Fran Rogers
3ad5bfb749 Fix for problems with r39414; LinkHolderArray::replaceInterwiki() was badly broken 2008-08-16 10:13:35 +00:00
Siebrand Mazeland
2dedbbdfa1 Revert r39414. Breaks processing links like [[:wikipedia:nl:User:Siebrand|Dutch language Wikipedia]]. It will add a comment like "<!--IWLINK 0-->" in the HTML output. Happens even if there is one such link on a page. 2008-08-16 09:33:11 +00:00
Tim Starling
c45292ac40 * In the parser: do link existence tests in batches of 1000. Avoids using excessive memory to store Title objects.
* Split link placeholder/replacement handling into a separate object, LinkHolderArray.
* Remove Title objects from LinkCache, they apparently weren't being used at all. Same unconstrained memory usage as the former $parser->mLinkHolders.
* Introduced ExplodeIterator -- a workalike for explode() which doesn't use a significant amount of memory
* Introduced StringUtils::explode() -- select whether to use the simulated or native explode() depending on how many items there are
* Migrated most instances of explode() in Parser.php to StringUtils::explode()
* Renamed some variables in Parser::doBlockLevels()
* In Parser.php: $fname => __METHOD__, Parser => self/__CLASS__, to support Parser_DiffTest more easily
* Doc update in includes/MessageCache.php for r39412
* MW_TITLECACHE_MAX => Title::CACHE_MAX, nicer name, easier to access from another module
2008-08-15 16:35:03 +00:00
Shinjiman
69dbeb97f1 * (bug 14604) Introduced the following features for the LanguageConverter: Multi-tag support, single conversion flag, remove conversion flag on a single page, description flag, variant name, multi-variant fallbacks.
patch by fdcn
* Added zh-mo and zh-my variants for the zh language
2008-06-26 03:00:34 +00:00
Siebrand Mazeland
79d5225c0e * remove end of line whitespace
* remove empty lines at end of file
* remove "?>" where still present
2008-04-14 07:45:50 +00:00
Aryeh Gregor
a15c419b3d Remove ?>'s from files. They're pointless, and just asking for people to mess with the files and add trailing whitespace. (Yes, I looked over every one and reverted those that were bogus. Slash-enter a million times in less worked well enough, although it was a bit mind-numbing.) 2007-06-29 01:19:14 +00:00
Antoine Musso
16558d1dbc Added some comments to our classes. 2007-04-21 12:42:27 +00:00
Nick Jenkins
ae8554c45b Completing code housekeeping stuff for rest of includes/ directory: removing unused local vars, removing unused globals, replacing extract() where simple to do, declaring output arrays before calling preg_match(), and so forth. 2006-11-29 11:43:58 +00:00
Tim Starling
61af76f260 Implementation of delimiterReplace() with a behaviour much closer to that of the model regex. Tested using comparative fuzz testing. The only known difference now is where the start delimiter ends with an initial substring of the end delimiter, e.g. the previously mentioned case of C-style comments. 2006-11-22 07:08:50 +00:00
Tim Starling
674f3561dd profiling 2006-11-21 11:20:04 +00:00
Tim Starling
1d2dc36ac1 Collection of generic string functions and classes 2006-11-21 10:38:07 +00:00