This allows de-duplication of single-page jobs queued for the
same page by edits to different templates. This is
the same logic that RefreshLinksJob already has.
Also fix a bug in that method in RefreshLinksJob.
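
For illustration only, a minimal sketch of the de-duplication idea
(class and field names are hypothetical, not the actual
HTMLCacheUpdateJob/RefreshLinksJob code): the signature must exclude
trigger-specific parameters so that two single-page jobs for the same
title collapse into one.

    <?php
    class SinglePageJobSketch {
        private $title;
        private $params;
        public function __construct( $title, array $params ) {
            $this->title = $title;
            $this->params = $params;
        }
        public function getDeduplicationInfo() {
            $info = [ 'type' => 'htmlCacheUpdate', 'title' => $this->title, 'params' => $this->params ];
            // Trigger-specific fields (e.g. which template edit queued the job,
            // or when) would make otherwise-identical jobs look distinct.
            unset( $info['params']['rootJobTimestamp'], $info['params']['rootJobSignature'] );
            return $info;
        }
    }
    $a = new SinglePageJobSketch( 'Main_Page', [ 'rootJobTimestamp' => '20170101000000' ] );
    $b = new SinglePageJobSketch( 'Main_Page', [ 'rootJobTimestamp' => '20170202000000' ] );
    var_dump( $a->getDeduplicationInfo() === $b->getDeduplicationInfo() ); // bool(true)
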
Change-Id: I2f79031c945eb3d195f9dbda949077bbc3e67918
Do not count jobs that merely subdivide into smaller jobs as
having any "work items". This makes $wgJobBackoffThrottling less
overzealous when used to limit this type of job.
The main reason to throttle htmlCacheUpdate would be CDN purge
rate limiting. For refreshLinks, it would mostly be replication
lag, though that is already handled for leaf jobs and by
JobRunner itself.
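
A rough sketch of the intent, using hypothetical names rather than
the real Job API: a job that only subdivides reports zero work items,
so a backoff throttle that counts work items ignores it.

    <?php
    class JobSketch {
        private $params;
        public function __construct( array $params ) {
            $this->params = $params;
        }
        public function workItemCount() {
            if ( !empty( $this->params['recursive'] ) ) {
                return 0; // this job only enqueues leaf jobs; no real "work items"
            }
            return isset( $this->params['pages'] ) ? count( $this->params['pages'] ) : 1;
        }
    }
    $division = new JobSketch( [ 'recursive' => true ] ); // just subdivides
    $leaf = new JobSketch( [ 'pages' => [ 1, 2, 3 ] ] );  // does the actual purges/parses
    var_dump( $division->workItemCount(), $leaf->workItemCount() ); // int(0), int(3)
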
Bug: T173710
Change-Id: Ide831b555e51e3111410929a598efb6c0afc0989
I was bored. What? Don't look at me that way.
I mostly targeted mixed tabs and spaces, but others were not spared.
Note that some of the whitespace changes are inside HTML output,
extended regexps or SQL snippets.
Change-Id: Ie206cc946459f6befcfc2d520e35ad3ea3c0f1e0
This lets the runJobs.php $wgCommandLineMode hack be removed.
Some fixes based on unit tests:
* Only call applyTransactionRoundFlags() on master connections
for transaction rounds started from beginMasterChanges()
(see the sketch below).
* Also cleaned up the commitAndWaitForReplication() reset logic.
* Removed deprecated DataUpdate::doUpdate() calls from jobs
since they cannot nest in a transaction round.
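
Purely as an illustration of the first point above (all names are
hypothetical, not the actual LBFactory/Database code): round flags are
applied only to master connections when the round begins, while
replica connections stay in autocommit mode.

    <?php
    class ConnectionSketch {
        public $isMaster;
        public $inTransactionRound = false;
        public function __construct( $isMaster ) {
            $this->isMaster = $isMaster;
        }
    }
    function beginMasterChangesSketch( array $connections ) {
        foreach ( $connections as $conn ) {
            if ( $conn->isMaster ) {
                $conn->inTransactionRound = true; // wrap writes in the round
            }
            // Replica connections are left alone.
        }
    }
    $conns = [ new ConnectionSketch( true ), new ConnectionSketch( false ) ];
    beginMasterChangesSketch( $conns );
    var_dump( $conns[0]->inTransactionRound, $conns[1]->inTransactionRound ); // true, false
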
Change-Id: Ia9b91f539dc11a5c05bdac4bcd99d6615c4dc48d
* Removed the lockAndGetLatest() call, which caused contention
problems. Previously, job #2 could block on job #1 in that method,
then job #1 could yield the row lock to job #2 in
LinksUpdate::acquirePageLock() by committing, and then job #1 could
block on job #2 in updateLinksTimestamp(). This caused timeout
errors. It also has not been fully safe ever since batching and
acquirePageLock() were added.
* Add an outer getScopedLockAndFlush() call to runForTitle() which
avoids this contention (as well as contention with page edits)
but still prevents older jobs from clobbering newer jobs. Edits
can happen concurrently, since they will enqueue a job post-commit
that will block on the lock (see the sketch below).
* Use the same lock in DeleteLinksJob to avoid edit/deletion races.
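
A minimal sketch of the locking order described above, with invented
helper names (the real code uses getScopedLockAndFlush() and a
DB-backed lock, not an in-process array): one outer lock per page is
taken for the whole job run, so two jobs can no longer block on each
other's row locks mid-update.

    <?php
    function runForTitleSketch( array &$locks, $pageKey, callable $doLinksUpdate ) {
        $key = "LinksUpdate:$pageKey";
        if ( isset( $locks[$key] ) ) {
            return false; // another job (or a just-queued edit job) owns the page; retry later
        }
        $locks[$key] = true;
        try {
            $doLinksUpdate(); // parse + write links while holding the outer lock
        } finally {
            unset( $locks[$key] ); // released when the scope ends
        }
        return true;
    }
    $locks = [];
    var_dump( runForTitleSketch( $locks, 'Page_42', function () {
        // links tables updated here
    } ) ); // bool(true)
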
Change-Id: I9e2d1eefd7cbb3d2f333c595361d070527d6f0c5
Just do a single slave lag wait check when branching the base job.
Any remnant/leaf jobs after that do not have to do anything special.
This should also improve de-duplication and reduce commonswiki
errors like "Could not acquire lock on page #42482792" caused by
pages with huge numbers of backlinks.
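
A small sketch of the branching, with invented names: the base job
waits for replication once and then fans out into leaf jobs, which
need no further waiting of their own.

    <?php
    function branchBaseJobSketch( array $backlinkPageIds, $batchSize, callable $waitForReplicas ) {
        $waitForReplicas(); // the single slave lag wait, done once up front
        $leafJobs = [];
        foreach ( array_chunk( $backlinkPageIds, $batchSize ) as $chunk ) {
            $leafJobs[] = [ 'type' => 'refreshLinks', 'pages' => $chunk ];
        }
        return $leafJobs; // leaf jobs just run; nothing special left to do
    }
    $jobs = branchBaseJobSketch( range( 1, 10 ), 3, function () {
        // e.g. wait until replica lag drops below the configured threshold
    } );
    var_dump( count( $jobs ) ); // int(4)
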
Change-Id: I40f9c6e0e905bd8149bb364c33a0642628cb1423
* Make this actually use the cache beyond edge cases
by making the page_touched check less strict. The
final check on the cache timestamp is good enough (sketch below).
* Log metrics to statsd to give visibility.
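
A sketch of the relaxed check, using hypothetical names (the real
logic lives in RefreshLinksJob and differs in detail): the cached
output is reused whenever its cache time already covers the event
that triggered the job, rather than requiring an exact page_touched
match.

    <?php
    function canReuseCachedOutputSketch( $cacheTimeUnix, $triggerTimeUnix, $skewFudge = 5 ) {
        if ( $cacheTimeUnix === null ) {
            return false; // nothing cached; a parse is unavoidable
        }
        // Good enough: the cached rendering is at least as new as the change
        // that caused this job (with a little allowance for clock skew).
        return ( $cacheTimeUnix + $skewFudge ) >= $triggerTimeUnix;
    }
    var_dump( canReuseCachedOutputSketch( time(), time() - 60 ) );   // bool(true): reuse
    var_dump( canReuseCachedOutputSketch( time() - 3600, time() ) ); // bool(false): reparse
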
Change-Id: I14c14846a7b68d079e1a29c6d50e354a3c1926d6
If we already know that the triggeringRevisionId is outdated, fail early
instead of doing all the work of re-parsing that old revision and
preparing all the updates only to fail later at the lockAndGetLatest()
call.
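
A minimal sketch of the early exit, with invented names (not the
actual RefreshLinksJob logic): if the triggering revision is known
not to be the latest one, skip the parse entirely.

    <?php
    function shouldRunForRevisionSketch( $triggeringRevId, $latestRevId ) {
        if ( $triggeringRevId !== null && $triggeringRevId !== $latestRevId ) {
            return false; // a newer edit exists; its own job will refresh the links
        }
        return true;
    }
    var_dump( shouldRunForRevisionSketch( 100, 101 ) ); // bool(false): fail early, no re-parse
    var_dump( shouldRunForRevisionSketch( 101, 101 ) ); // bool(true): proceed
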
Change-Id: Ic70c659899d5d47e74fa17c88ed26b436732ca8a
This is needed if the $useOutdated behavior of ParserCache
is modified per Ibd111bed203dd.
Bug: T133659
Change-Id: I70806dffba8af255d7cdad7663132b58479f63e3
This partially reverts 22476baa85, as the setTriggeringUser()
call that was removed was being used by Echo to determine
which user caused a LinksUpdate to be triggered.
Bug: T121780
Change-Id: I62732032a6b74f17b5ae6a2497fa519f9ff38d4f
* Do not de-duplicate jobs with "masterPos" (sketch below). It
either does not catch anything or is not correct. Previously, it
was the latter, because getDuplicationInfo() ignored the position.
That made the oldest DB position win among "duplicate" jobs,
which is unsafe.
* From graphite, deduplication only applies 0.5-2% of the time for
"refreshLinks", so there should not be much more duplicated
effort. Dynamic and Prioritized refreshLinks jobs remain
de-duplicated on push() and root job de-duplication still applies
as it did before. Also, getLinksTimestamp() is still checked to
avoid excess work.
* Document the class constants.
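
A sketch of the first point, with hypothetical names (the real
Job/getDuplicationInfo() code differs): jobs carrying a master
position simply opt out of de-duplication instead of hiding the
position from the signature.

    <?php
    function getDeduplicationInfoSketch( array $params ) {
        if ( isset( $params['masterPos'] ) ) {
            return null; // not de-duplicable: dropping the position would be unsafe
        }
        ksort( $params ); // stable signature regardless of parameter order
        return [ 'type' => 'refreshLinks', 'params' => $params ];
    }
    var_dump( getDeduplicationInfoSketch( [ 'masterPos' => 'db1-bin.000123/456' ] ) ); // NULL
    var_dump( getDeduplicationInfoSketch( [ 'pages' => [ 42 ] ] ) ); // array with a stable signature
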
Change-Id: Ie9a10aa58f14fa76917501065dfe65083afb985c
* Use READ_LATEST when needed to distinguish slave lag
affecting new pages from page deletions that happened
after the job was pushed. Run-of-the-mill mass backlink
updates still typically use "masterPos" and READ_NORMAL.
* Search for the expected revision (via READ_LATEST)
for jobs triggered by direct page edits. This avoids lag
problems for edits to existing pages.
* Added a CAS-style check to avoid letting jobs clobber
the work of other jobs that saw a newer page version
(see the sketch below).
* Rename and expose WikiPage::lock() method.
* Split out position wait logic to a separate protected
method and made sure it only got called once instead of
per-title (which didn't do anything). Note that there is
normally 1 title per job in any case.
* Add a FIXME about a related race condition.
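
A sketch of the CAS-style check mentioned above, using invented names
and TS_MW-style strings (the real check compares against the page's
links-updated timestamp, cf. getLinksTimestamp()): a job only records
its results if no other job has already recorded results based on a
newer version of the page.

    <?php
    function tryFinishUpdateSketch( $parsedAtTS, $currentLinksTS, callable $writeLinks ) {
        // Compare-and-set: if someone already stored results for a newer
        // version, this (now stale) job must not clobber them.
        if ( $currentLinksTS !== null && $currentLinksTS >= $parsedAtTS ) {
            return false;
        }
        $writeLinks( $parsedAtTS );
        return true;
    }
    $linksTS = '20151101000000'; // when links were last updated (TS_MW format)
    var_dump( tryFinishUpdateSketch( '20151031120000', $linksTS, function () {} ) ); // bool(false): stale
    var_dump( tryFinishUpdateSketch( '20151102120000', $linksTS, function () {} ) ); // bool(true): newer wins
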
Bug: T117332
Change-Id: Ib3fa0fc77040646b9a4e5e4b3dc9ae3c51ac29b3
So extensions like Echo are able to attribute post-edit link updates
to the specific users who triggered them.
Bug: T116485
Change-Id: I083736a174b6bc15e3ce60b2b107c697d0ac13da
* Focus on updating links that would *not* otherwise be updated
by jobs, rather than those that already *will* be updated.
* Place the jobs into a dedicated queue so they don't wait
behind jobs that actually have to parse every time. This
helps avoid queue buildup.
* Make Job::factory() set the command field to match the value
it had when enqueued (sketch below). This makes it easier to have
the same job class used for multiple queues.
* Given the above, remove the RefreshLinksJob 'prioritize' flag.
This worked by overriding getType() so that the job went to a
different queue. It required both the special type *and* the
flag to be set when using JobSpecification; otherwise either
ack() would route to the wrong queue and fail, or the job would
go into the regular queue. This was too messy and error prone.
Cirrus jobs using the same pattern also had ack() failures, for
example.
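
A sketch of the Job::factory() change, with hypothetical classes (not
the real implementation): the factory stores the queue type it was
called with back into the job's command field, so one class can back
multiple queues and still ack() to the right one.

    <?php
    class QueueJobSketch {
        public $command;
        public $params;
        public function __construct( $command, array $params ) {
            $this->command = $command;
            $this->params = $params;
        }
    }
    function jobFactorySketch( $type, array $params, array $classMap ) {
        $class = $classMap[$type];
        $job = new $class( $type, $params );
        $job->command = $type; // keep the enqueued type, not a hard-coded class default
        return $job;
    }
    $map = [
        'refreshLinks' => 'QueueJobSketch',
        'refreshLinksPrioritized' => 'QueueJobSketch', // same class, different queue
    ];
    $job = jobFactorySketch( 'refreshLinksPrioritized', [ 'pages' => [ 42 ] ], $map );
    var_dump( $job->command ); // string(23) "refreshLinksPrioritized"
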
Change-Id: I5941cb62cdafde203fdee7e106894322ba87b48a
The new enqueue method of DeferredUpdates was turning LinksUpdate
updates into jobs. However, RefreshLinksJob was not properly
reconstructing the secondary updates as recursive (when they
were recursive). This meant that when a template changed, the
pages using it were not being updated.
See also related T116001.
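
A minimal sketch of the fix's intent, with invented names (not the
actual DeferredUpdates/RefreshLinksJob code): the recursive flag from
the deferred LinksUpdate has to be carried into the job parameters,
otherwise the backlink fan-out is lost.

    <?php
    function linksUpdateToJobSpecSketch( $pageKey, $recursive ) {
        return [
            'type' => 'refreshLinks',
            'title' => $pageKey,
            'params' => [
                // Without this flag the job refreshes only the page itself and
                // never the pages that transclude it.
                'recursive' => (bool)$recursive,
            ],
        ];
    }
    $spec = linksUpdateToJobSpecSketch( 'Template:Infobox', true );
    var_dump( $spec['params']['recursive'] ); // bool(true)
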
Change-Id: Ia06246efb2034fdfe07232fd8c2334160edbcf02
This is part of a chain that reverts:
e412ff5ecc.
NOTE:
- The feature is disabled by default
- User settings default to hiding changes
- T109707 Touching a file on wikisource adds and
removes it from a category, even when the page
has no changes. See the linked issue, which is
marked as stalled with a possible way forward
for this patch.
@see https://gerrit.wikimedia.org/r/#/c/235467/
Changes since version 1:
- T109604 - Page names in comments are no longer
URL-encoded and no longer contain underscores
- T109638 & T110338 - A reserved username is now used
when we can't determine a username for the change
(we could perhaps set the user and id to be blank
in the RC table, but who knows what this might do)
- T109688 - History links are now disabled in RC
(this could be fine for the introduction and worked
on more in the future)
- Categorization changes are now always patrolled
- Touching on T109672: with this change, emails will never
be sent regarding categorization changes (this
can of course be changed in a follow-up)
- Added $wgRCWatchCategoryMembership, defaulting to true,
for enabling / disabling the feature
- T109700 - for cases where no revision was retrieved
for a category change, the bot flag is set to true.
This means all changes caused by parser functions
& Lua will be marked as bot, as will changes that
can't find their revision due to slave lag.
Bug: T9148
Bug: T109604
Bug: T109638
Bug: T109688
Bug: T109700
Bug: T110338
Bug: T110340
Change-Id: I51c2c1254de862f24a26ef9dbbf027c6c83e9063
'isParserCachedUsed' implies that the parser cache usage has already occurred,
and obscures the true purpose of this method, which is to determine whether or
not the requested page *should* be looked up in the parser cache.
The only usage in extensions is in TextExtracts, which I changed to be both
backward- and forward-compatible in If5d5da8eab13.
Change-Id: I7de67937f0e57b1dffb466319192e4d400b867de
* On Wikipedia, for example, these jobs are a good percentage of
all refreshLinks jobs; skipping the parse step should avoid
runner CPU overhead
* Also fixed a bad TS_MW/TS_UNIX comparison (sketch below)
* Moved the fudge factor to a constant and raised it a bit
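
For reference, the kind of pitfall behind the TS_MW/TS_UNIX fix (the
values below are invented): the two formats are not directly
comparable, so both sides must be normalized first.

    <?php
    // Comparing a TS_MW string against Unix seconds numerically is meaningless:
    // the TS_MW value is always the larger number.
    $tsMw = '20150510123000';    // TS_MW: YYYYMMDDHHMMSS
    $tsUnix = 1431260000;        // TS_UNIX: seconds since the epoch
    var_dump( $tsMw > $tsUnix ); // bool(true) regardless of which time is actually later

    // Normalize to one format (here, Unix seconds) before comparing.
    $mwAsUnix = DateTime::createFromFormat( 'YmdHis', $tsMw, new DateTimeZone( 'UTC' ) )->getTimestamp();
    var_dump( $mwAsUnix > $tsUnix ); // now a real chronological comparison
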
Bug: T98621
Change-Id: Id6d64972739df4b26847e4374f30ddcc7f93b54a
* Use special prioritized refreshLinks jobs instead, which are
triggered when transcluded pages are changed
* Also added a triggerOpportunisticLinksUpdate() method to handle
dynamic transclusions
Bug: T89389
Change-Id: Iea952d4d2e660b7957eafb5f73fc87fab347dbe7
One theory for what's behind bug 46014 is that the vandal submits the
edit, then someone (maybe the vandal) gets into the branch of
Article::view that uses PoolWorkArticleView, then ClueBot comes along
and reverts before the PoolWorkArticleView actually executes. Once that
PoolWorkArticleView actually does execute, it overwrites the parser
cache entry from ClueBot's revert with the one from the old edit.
To detect this sort of thing, let's include the revision id in the
parser cache entry and consider it expired if that doesn't match,
which makes sense to do anyway.
And for good measure, let's have PoolWorkArticleView not save to the
parser cache if !$isCurrent.
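
A sketch of the two guards, with invented names (the real code lives
in the parser cache and PoolWorkArticleView): the cached entry
remembers the revision it was parsed from, a mismatch counts as
expired, and non-current parses are never saved.

    <?php
    class CachedRenderingSketch {
        public $revId;
        public $html;
        public function __construct( $revId, $html ) {
            $this->revId = $revId;
            $this->html = $html;
        }
    }
    function getCachedSketch( $entry, $latestRevId ) {
        if ( $entry === null || $entry->revId !== $latestRevId ) {
            return null; // treat as expired: parsed from an outdated revision
        }
        return $entry->html;
    }
    function maybeSaveSketch( CachedRenderingSketch $entry, $isCurrent, array &$cache ) {
        if ( $isCurrent ) {
            $cache['entry'] = $entry; // never overwrite the cache with a stale parse
        }
    }
    $cache = [];
    maybeSaveSketch( new CachedRenderingSketch( 100, '<p>old edit</p>' ), false, $cache );
    var_dump( isset( $cache['entry'] ) ); // bool(false): the revert's cache entry survives
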
Bug: 46014
Change-Id: Ifcc4d2f67f3b77f990eb2fa45417a25bd6c7b790