FormatJson: Optimize encode() for supported PHP versions

- Removed the str_replace() call that escaped raw line terminators
  (U+2028 and U+2029) when UTF8_OK is set. PHP 7.1 and later escape
  these by default, even with JSON_UNESCAPED_UNICODE.

  The speedup is small (about 1% in my testing when encoding an API
  siteinfo result taken from enwiki). That is perhaps unsurprising
  given the way str_replace() works[1]. Still, it's better not to spend
  CPU time looking for characters that will not occur.

- Changed the algorithm for the optional spaces-to-tabs conversion when
  pretty printing. Instead of replacing one indent level throughout the
  entire string before replacing the next level, use a regex to replace
  in one pass. This is usually faster now that PHP 7 enables PCRE's JIT
  compiler by default. Without JIT, the regex was often slower.

  The speedup can be large for deeply nested data. For example, in my
  testing the languages/i18n data took about 8% less time to encode as
  tab-indented JSON, whereas the API siteinfo result took about 45% less.
  (This, of course, isn't actually relevant to the API even when pretty
  printed output is requested, because ApiFormatJson uses the default
  indent string of four spaces, which will always be faster unless
  support for tab indentation is added to PHP's json extension.)

- Set options using if statements instead of the ternary operator. This
  is clearer and possibly slightly faster, since it skips the
  assignment when a flag does not need to be set.
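
The first point can be checked with a minimal sketch like the one
below, assuming PHP 7.1 or later. The sample string and variable names
are made up for illustration and are not part of FormatJson.

```php
<?php
// A string containing U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR.
$value = "a\u{2028}b\u{2029}c";

// JSON_UNESCAPED_UNICODE alone: these two characters are still escaped.
$escaped = json_encode( $value, JSON_UNESCAPED_UNICODE );

// Only the separate JSON_UNESCAPED_LINE_TERMINATORS flag leaves them raw.
$raw = json_encode(
	$value,
	JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS
);

// Compare both results against the escaped form.
var_dump( $escaped === '"a\u2028b\u2029c"' );
var_dump( $raw === '"a\u2028b\u2029c"' );
```

Since the escaping already happens inside json_encode(), a later
str_replace() pass over the output has nothing left to find.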

[1]: https://github.com/php/php-src/blob/PHP-8.0.10/ext/standard/string.c#L2969

Change-Id: Iebb1df0264e335a1819956710eeacf6d6b8f1471
Kevin Israel 2021-08-20 08:03:11 -04:00
parent 03329fbc7a
commit 210a34369a
2 changed files with 22 additions and 46 deletions


@@ -76,25 +76,6 @@ class FormatJson {
 	 */
 	public const STRIP_COMMENTS = 0x400;
-	/**
-	 * Characters problematic in JavaScript.
-	 *
-	 * @note These are listed in ECMA-262 (5.1 Ed.), §7.3 Line Terminators along with U+000A (LF)
-	 * and U+000D (CR). However, PHP already escapes LF and CR according to RFC 4627.
-	 */
-	private const BAD_CHARS = [
-		"\u{2028}", // U+2028 LINE SEPARATOR
-		"\u{2029}", // U+2029 PARAGRAPH SEPARATOR
-	];
-	/**
-	 * Escape sequences for characters listed in FormatJson::BAD_CHARS.
-	 */
-	private const BAD_CHARS_ESCAPED = [
-		'\u2028', // U+2028 LINE SEPARATOR
-		'\u2029', // U+2029 PARAGRAPH SEPARATOR
-	];
 	/**
 	 * Returns the JSON representation of a value.
 	 *
@@ -107,42 +88,33 @@ class FormatJson {
 	 *
 	 * @param mixed $value The value to encode. Can be any type except a resource.
 	 * @param string|bool $pretty If a string, add non-significant whitespace to improve
-	 * readability, using that string for indentation. If true, use the default indent
-	 * string (four spaces).
+	 * readability, using that string for indentation (must consist only of whitespace
+	 * characters). If true, use the default indent string (four spaces).
 	 * @param int $escaping Bitfield consisting of _OK class constants
 	 * @return string|false String if successful; false upon failure
 	 */
 	public static function encode( $value, $pretty = false, $escaping = 0 ) {
-		if ( !is_string( $pretty ) ) {
-			$pretty = $pretty ? '    ' : false;
-		}
 		// PHP escapes '/' to prevent breaking out of inline script blocks using '</script>',
 		// which is hardly useful when '<' and '>' are escaped (and inadequate), and such
 		// escaping negatively impacts the human readability of URLs and similar strings.
 		$options = JSON_UNESCAPED_SLASHES;
-		$options |= $pretty !== false ? JSON_PRETTY_PRINT : 0;
-		$options |= ( $escaping & self::UTF8_OK ) ? JSON_UNESCAPED_UNICODE : 0;
-		$options |= ( $escaping & self::XMLMETA_OK ) ? 0 : ( JSON_HEX_TAG | JSON_HEX_AMP );
-		$json = json_encode( $value, $options );
-		if ( $json === false ) {
-			return false;
-		}
-		if ( $pretty !== false && $pretty !== '    ' ) {
-			// Change the four-space indent to a tab indent
-			$json = str_replace( "\n    ", "\n\t", $json );
-			while ( strpos( $json, "\t    " ) !== false ) {
-				$json = str_replace( "\t    ", "\t\t", $json );
-			}
-			if ( $pretty !== "\t" ) {
-				// Change the tab indent to the provided indent
-				$json = str_replace( "\t", $pretty, $json );
-			}
+		if ( $pretty || is_string( $pretty ) ) {
+			$options |= JSON_PRETTY_PRINT;
 		}
 		if ( $escaping & self::UTF8_OK ) {
-			$json = str_replace( self::BAD_CHARS, self::BAD_CHARS_ESCAPED, $json );
+			$options |= JSON_UNESCAPED_UNICODE;
+		}
+		if ( !( $escaping & self::XMLMETA_OK ) ) {
+			$options |= JSON_HEX_TAG | JSON_HEX_AMP;
+		}
+		$json = json_encode( $value, $options );
+		if ( is_string( $pretty ) && $pretty !== '    ' && $json !== false ) {
+			// Change the four-space indent to the provided indent.
+			// The regex matches four spaces either at the start of a line or immediately
+			// after the previous match. $pretty should contain only whitespace characters,
+			// so there should be no need to call StringUtils::escapeRegexReplacement().
+			$json = preg_replace( '/ {4}|.*+\n\K {4}/A', $pretty, $json );
 		}
 		return $json;
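
The one-pass indent conversion can be tried in isolation with a rough
sketch like the following. It assumes PHP 7 or later and reuses the
regex from this change; reindentJson() is a hypothetical helper, not a
FormatJson method, and the sample data is made up.

```php
<?php
/**
 * Hypothetical helper: swap the four-space indent produced by
 * JSON_PRETTY_PRINT for $indent, which should be whitespace only.
 */
function reindentJson( string $json, string $indent ): ?string {
	// The A modifier anchors every match attempt at the current offset.
	// " {4}" matches another indent level right after the previous match.
	// ".*+\n\K {4}" skips the rest of the line, then \K drops the skipped
	// text from the match so only the next line's first indent is replaced.
	return preg_replace( '/ {4}|.*+\n\K {4}/A', $indent, $json );
}

// The string value holds four spaces on purpose; it must not be changed.
$data = [ 'a' => [ 'b' => [ 1 ] ], 's' => '    x' ];
$json = json_encode( $data, JSON_PRETTY_PRINT );

$tabbed = reindentJson( $json, "\t" );
echo $tabbed, "\n";
```

Four spaces inside a string value are left alone because a match can
only begin at the start of a line or directly after the previous
match, and the possessive `.*+` never backtracks into the line to find
one elsewhere.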


@@ -14,6 +14,8 @@ class FormatJsonTest extends MediaWikiUnitTestCase {
 			[ ' ', ' ' ],
 			// One tab
 			[ "\t", "\t" ],
+			// Empty string
+			[ '', '' ],
 		];
 	}
@@ -34,6 +36,7 @@ class FormatJsonTest extends MediaWikiUnitTestCase {
 '"7":["8",{"9":"10"}]',
 // Whitespace clean up doesn't touch strings that look alike
 "{\n\t\"emptyObject\": {\n\t},\n\t\"emptyArray\": [ ]\n}",
+" []",
 ],
 ];
@@ -48,7 +51,8 @@ class FormatJsonTest extends MediaWikiUnitTestCase {
 456
 ],
 "\"7\":[\"8\",{\"9\":\"10\"}]",
-"{\n\t\"emptyObject\": {\n\t},\n\t\"emptyArray\": [ ]\n}"
+"{\n\t\"emptyObject\": {\n\t},\n\t\"emptyArray\": [ ]\n}",
+" []"
 ]
 }';