Parsoid/Parser Unification/zh
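Currently, we use two separate wikitext parsers in MediaWiki on the Wikimedia cluster (and on some third-party wikis). One is the original core parser (the legacy parser), and the other is Parsoid. As of early 2023, the core parser is used for all desktop and mobile web read views, while Parsoid serves all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), some gadgets, the mobile apps, the Kiwix offline reader, Wikimedia Enterprise, and the Google Knowledge Graph project.
The purpose of this project is to create a unified parser that supports all clients and use cases. This will make our wikis more reliable and consistent for editors, readers, and tools. Having a single codebase for wikitext processing will also make it easier to develop future wikitext features.
This project is driven primarily by the Content Transform Team (formerly the Parsing Team). It also involves the MediaWiki Platform Team, all internal teams that develop Parsoid clients, the Movement Communications team (formerly Community Relations Specialists), the Wikimedia wiki editing communities, and third-party MediaWiki projects, because parser unification will affect all of these teams and projects.
This page contains a project overview; we also maintain a page with the roadmap, milestones, and updates, as well as a list of pages with more technical information.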
Project Goals
Longer Term Goal: Parsoid is the default wikitext engine for MediaWiki and the legacy parser is removed from the codebase
Intermediate Goal: Parsoid replaces the core parser for all wikitext use cases on the Wikimedia cluster.
Why
Why unification?: Maintaining two wikitext engines is resource-intensive, and every new feature would have to be implemented twice.
Why Parsoid?: Parsoid already supports all editing use cases, as well as API use cases that only it can serve (e.g. Enterprise, Kiwix), and work is actively under way for it to support all read use cases. The legacy parser does not support HTML-based editing use cases (e.g. VisualEditor).
How are we testing this change?
- Parser tests: This is how Parsoid has been developed since its inception. We ensure that Parsoid continues to pass parser tests, and where divergence is known, it is recorded after careful review. We have also vastly expanded parser test coverage over the years, and all patches against Parsoid need to pass tests.
- Round-tripping / Integration tests: Before every production deployment, we convert wikitext to HTML and HTML back to wikitext on about 180K pages from about 50 production wikis. This testing mode exists primarily to ensure that our HTML -> wikitext conversion is not broken (which would impact our editing client tools), but it also implicitly flags breakages in our HTML output. However, these tests are not the most reliable way to verify that our HTML output is correct.
- Visual diff tests: Here, we render legacy parser HTML and Parsoid HTML, compare screenshots of the two renderings, and generate a numeric diff score. A typical run involves 25K+ pages from about 20 production wikis. This has been a reliable way to identify breakages and bugs in Parsoid output. As we deploy, we are expanding our testing to a wider range of wikis and improving the tool's ability to distinguish real issues from false positives.
- Parsoid reading and editing clients: Parsoid's output has been used over the years by VisualEditor, Android and iOS mobile apps, Kiwix, and other clients. We have fixed a number of bugs and incompatibilities in Parsoid over the years and continue to fix the various long-tail edge cases as they are discovered and reported.
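The round-trip check in the list above can be sketched as follows. This is a minimal illustration, not the actual test harness: `roundtrip_diff` and the two toy converters are hypothetical names introduced here, and the converters handle bold markup only. A real run would plug in Parsoid's wikitext -> HTML and HTML -> wikitext conversions in their place.

```python
import difflib
import re
from typing import Callable, List


def roundtrip_diff(
    wikitext: str,
    to_html: Callable[[str], str],
    to_wikitext: Callable[[str], str],
) -> List[str]:
    """Convert wikitext -> HTML -> wikitext and return a unified diff.

    An empty list means the page round-tripped without changes.
    """
    html = to_html(wikitext)
    regenerated = to_wikitext(html)
    return list(
        difflib.unified_diff(
            wikitext.splitlines(),
            regenerated.splitlines(),
            fromfile="original",
            tofile="round-tripped",
            lineterm="",
        )
    )


# Toy stand-ins for the real converters (hypothetical; bold markup only).
def toy_to_html(wikitext: str) -> str:
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", wikitext)


def toy_to_wikitext(html: str) -> str:
    return html.replace("<b>", "'''").replace("</b>", "'''")


if __name__ == "__main__":
    diff = roundtrip_diff("Hello '''world'''", toy_to_html, toy_to_wikitext)
    print("clean round trip" if not diff else "\n".join(diff))
```

Passing the converters in as parameters keeps the check independent of how the conversions are performed, so the same function can be pointed at a lossy converter to see what a failing page looks like.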
As our rollout progresses, we will continue to identify other QA and testing methodologies as required to ensure that we can roll out this change in as smooth and non-disruptive a fashion as possible.
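To make the "numeric diff score" from the visual diff tests concrete, here is a minimal sketch. How the actual tool computes its score is not described on this page, so the metric below (the percentage of pixel positions that differ) is an assumption, and `diff_score`, `pixel_at`, and the `tolerance` parameter are hypothetical names. Screenshots are modelled as plain nested lists of RGB tuples so that the example needs no imaging library.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels


def pixel_at(img: Image, x: int, y: int) -> Optional[Pixel]:
    """Return the pixel at (x, y), or None if it lies outside the image."""
    if y < len(img) and x < len(img[y]):
        return img[y][x]
    return None


def diff_score(a: Image, b: Image, tolerance: int = 0) -> float:
    """Percentage (0-100) of pixel positions at which two renderings differ.

    Positions covered by only one image count as differing, so a page that
    renders taller in one parser is penalised. Channel differences up to
    `tolerance` are ignored to absorb anti-aliasing noise.
    """
    height = max(len(a), len(b))
    width = max([len(row) for row in a] + [len(row) for row in b] + [0])
    if height == 0 or width == 0:
        return 0.0
    differing = 0
    for y in range(height):
        for x in range(width):
            pa, pb = pixel_at(a, x, y), pixel_at(b, x, y)
            if pa is None or pb is None:
                if pa is not pb:  # present in exactly one image
                    differing += 1
            elif any(abs(ca - cb) > tolerance for ca, cb in zip(pa, pb)):
                differing += 1
    return 100.0 * differing / (width * height)


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    legacy = [[white, white], [white, white]]
    parsoid = [[white, black], [white, white]]
    print(diff_score(legacy, parsoid))
```

A score of 0 means the two renderings are pixel-identical; a non-zero `tolerance` is one simple way to reduce false positives from minor rendering noise.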
What is our deployment plan / strategy?
At this stage of this project, we have split this work into a number of steps to achieve the intermediate goal.
- ✅ Deploy changes to core that make media structure HTML largely identical to what Parsoid emits. This has its own deployment plan. This change has been live on mediawiki.org and officewiki since September 2021 and we expect to roll this out to all wikis gradually in 2022.
- ✅ Deploy individual user opt-in tools to use Parsoid for read views as part of the ParserMigration extension.
- ✅ Deploy changes to Wikimedia production that let DiscussionTools use Parsoid HTML directly. This lets us iron out bugs in a restricted use case.
- Turn on Parsoid HTML read views on additional wikis incrementally
  - ✅ officewiki
  - ✅ Talk pages on wikitech
  - ✅ wikivoyage (except zhwikivoyage)
  - ✅ Incubator and Dagbani Wikipedia
  - ✅ Most wiktionaries
  - (in progress) remaining wiktionaries (except those using LanguageConverter)
  - (next) low-traffic wikipedias
  - ...
- Continued work to ensure that Parsoid generates the same metadata as the legacy parser (categories, backlink tables, page properties, etc.). This is needed for tighter integration of Parsoid into MediaWiki core and to start replacing the legacy parser in additional wikitext use cases.
- Use of Parsoid to generate user interface messages
- Shipping a long-term support release (planned to be 1.47) with Parsoid as the default wikitext parser out of the box.
Confidence Framework
To validate our road-map evolution and use data-driven decision making for deployments, we have developed a Confidence Framework for Parsoid Read Views. This framework contains the guidelines for how we prioritise features, bugfixes, and deployments.
How does this impact wikis?
The switch to Parsoid-generated HTML should be transparent to most users. Below, we outline some possible impacts on readers, editors, and developers.
Readers
Parsoid models and processes wikitext differently from the legacy parser, and this can sometimes lead to rendering differences in edge cases. Where a wikitext pattern is commonly used, we have tried to support it in Parsoid; where that was not possible, we have either fixed its uses or provided support for fixing them. At this time, we believe that all the rendering differences we expect to run into will be edge cases that can likely be resolved by fixing wikitext, either on individual pages or in templates.
Editors and bot, gadget, skin developers
- Parsoid's HTML for media wikitext is different from what the legacy parser has typically generated. As part of a separate project to use semantic HTML5 output for images, the legacy parser is currently being updated to generate HTML that is close to Parsoid's HTML. We expect to roll this out this year, which might require some skins, gadgets, bots, and template styles to be updated.
- The Cite extension that targets Parsoid relies on CSS rules to localize numbering of references rather than generate localized HTML. This requires editors with appropriate permissions to update MediaWiki:Common.css on their wikis to add suitable CSS rules targeting this HTML.
Extension developers
Parsoid's internal processing model is different from the legacy parser. As a result, extensions may need to be updated. This only impacts extensions that do one or more of the following: (a) operate on wikitext (b) provide handlers for parser hooks (c) call a public method of the legacy parser.
Extensions that process wikitext will definitely need to be updated to work with Parsoid. To date, the vast majority of such extensions have been updated. Since Parsoid continues to use the legacy parser to expand templates and process parser functions, any parser hooks triggered during this processing will continue to run, and extensions that rely on these hooks will keep working. For the rest, we are exploring strategies to minimize the updates that extensions need.
We file Phabricator tasks for all impacted extensions, and will fix whatever extensions we can within our team. If you are an extension developer, we would greatly appreciate any proactive work and prompt code review for patches we might submit.
What kind of support will we provide to impacted editor and developer communities?
The Content Transform Team is driving this project. Our goal is to make the switch to Parsoid as seamless as possible. To that end, we have rolled out changes gradually over the years.
We started with replacing HTML4 Tidy with HTML5 RemexHtml in the 2015 - 2018 timeframe. In 2019, in preparation to integrate Parsoid into MediaWiki core more closely, we ported Parsoid from JS to PHP. This switch went very smoothly. In the 2020 - 2022 timeframe, we started work to unify the media output generated by Parsoid and by core. This has mostly involved making changes to core, but we have occasionally adjusted Parsoid's output based on feedback and other technical considerations. In 2024 we began deploying Parsoid as the default parser for page views on wikivoyage.
Going forward, we will provide support in the following ways:
- Linter rules for any wikitext that needs fixing.
- The vast majority of this work was completed as part of the Tidy -> Remex migration, and we don't expect to introduce a large number of new linter categories for the switch to Parsoid.
- Communication via this page, via tech news updates, and via updates and posts to village pump and other wiki-specific forums.
- Opt-in mechanisms for early adopter users / wikis to test and report problems.
- See the next section for more details!
How can you help / be involved?
Starting November 2023, you can opt in to using the new Parsoid parser for reading articles on Wikipedia. See Help:Extension:ParserMigration for more information!
Other things you can do to help:
- Test your gadgets / user scripts against Parsoid HTML to identify / fix any breakages
- Use the ParserMigration extension to swap easily between legacy and Parsoid HTML
- Parsoid read views will be rolled out first on wikis whose communities have elected to be early adopters; watch this space for more details.
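When testing a gadget or user script against both renderings, it can help to generate the two URLs for the same page programmatically. The sketch below assumes that ParserMigration honours a `useparsoid` query parameter (`1` for Parsoid, `0` for the legacy parser); that parameter name is not stated on this page, so check Help:Extension:ParserMigration before relying on it. `with_parser` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_parser(url: str, use_parsoid: bool) -> str:
    """Return `url` with the (assumed) `useparsoid` query parameter set.

    Any existing `useparsoid` value is replaced; other query parameters
    and the fragment are preserved.
    """
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "useparsoid"
    ]
    query.append(("useparsoid", "1" if use_parsoid else "0"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    page = "https://en.wikipedia.org/wiki/Example"
    print(with_parser(page, use_parsoid=True))
    print(with_parser(page, use_parsoid=False))
```

Loading both URLs side by side makes it easier to tell whether a breakage comes from the Parsoid HTML or from the gadget itself.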
Related documentation
- Parsoid/Parser Unification/Updates Project updates
- Parsoid/Parser Unification/Known Issues lists differences between the Parsoid parser and the legacy parser which a user is likely to notice when using the ParserMigration extension.
- Parsoid/Parser Unification/Instructions for editors
- Parsoid Performance Considerations outlines the performance work needed for Parsoid and the acceptance criteria for deployment
- Historical documents:
- Known differences between Parsoid & core parser output is a more technical page listing specific differences in HTML output, of interest to developers and authors of CSS styles. This page is not as relevant anymore. We are now focused on more fine-grained problems found during visual diff testing.
- The Pixel_Diff_Testing_Stats page tracked progress in measuring (and achieving) rendering parity between Parsoid and core parser output. This page is no longer relevant because we no longer run generic visual diff tests; instead, we run focused visual diff tests on the wikis we are rolling out to and fix the problems we find.
- February 2019 tech talk: The long and winding road to making Parsoid the default MediaWiki parser
- Replacing Tidy: Project page related to replacing Tidy
- Parsing/Notes/Two Systems Problem: This 2016 document explored different options for arriving at a single parser.
- Parsing/Notes/Moving Parsoid Into Core