A Brief History of Wikitext

Hey, I’m aismallard, and I’m an SCP Wiki administrator and Co-Captain of its Technical Team. I’d like to talk a bit about FTML, the parser and render library that bluesoul mentioned in The Story So Far.

I originally created the initial ftml repository on February 6th, 2019 (prior to becoming Junior Staff). I had been editing a draft of mine, and frustrations with Wikidot’s poor editor experience made me wish there was an independent tool for live preview of my work. I decided to start work on a standalone library for parsing and rendering wikitext, which I named after the file extension mentioned in Kate McTiriss’s Proposal. While I recognized it would be a substantial task, it was not an inherently impossible one, and I figured I could get it mostly done in a few months of dedicated work.

I was wrong.

Here, 2¼ years later, I am writing this in a very different environment, with Wikijump underway and the library having significantly changed form a number of times. I’ll briefly walk through the history of this project, and how our understanding of wikitext has evolved.

The very first attempt at a parsing library was to simply reproduce the regular expressions used by Text_Wiki (the PHP library Wikidot uses to process page sources). I quickly realized this was untenable, as nested structures like [[div]] would require too many hacks to work. After all, this library was meant to improve on Text_Wiki: we would need a real parser.

After having thought about the parser structure and looking at options for a bit, I settled on Pest. I wrote up a grammar for capturing various syntactical constructs. Being able to define a PEG-based grammar made the process rather smooth, and adding new constructs was fairly easy. This remained the primary paradigm for the library until June 2020.

[Image: How Wikidot renders improper [[div]] syntax]

One large issue with the well-defined grammar setup is that Wikidot is not well-defined. If you have a syntax error in a language like C, the compiler will simply refuse to proceed further. In Wikidot, however, any malformed structure simply appears as its literal text. A system which rejects an entire page over a minor flaw, even with good error reporting, is simply not going to work with how people use Wikidot, nor with many of the existing pages out there.

Additionally, there were several parser ambiguities I encountered at this point. For instance, dashes in Wikidot’s syntax are extremely overloaded:

  • -- is an em dash
  • --some text-- is strikethrough
  • --- produces an em dash followed by a regular dash (—-), when the user almost certainly wants the next form:
  • ---- (or longer) produces a horizontal rule
  • And [!-- ... --] produces a comment.
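To make the overloading concrete, here is a minimal sketch (my own illustration, not ftml’s actual code) of resolving the dash cases by trying the most specific interpretation first:

```rust
// Hypothetical sketch: disambiguating Wikidot's overloaded dashes by
// preferring the most specific rule. Comments ([!-- ... --]) are
// handled elsewhere and omitted here for brevity.
fn render_dashes(input: &str) -> String {
    // Four or more dashes on their own line: a horizontal rule.
    if input.len() >= 4 && input.chars().all(|c| c == '-') {
        return "<hr>".to_string();
    }
    if let Some(rest) = input.strip_prefix("--") {
        // --text--: strikethrough, if a closing pair exists.
        if let Some(end) = rest.find("--") {
            if end > 0 {
                return format!(
                    "<del>{}</del>{}",
                    &rest[..end],
                    render_dashes(&rest[end + 2..])
                );
            }
        }
        // Otherwise, fall back to an em dash plus the remaining text.
        return format!("—{}", rest);
    }
    input.to_string()
}

fn main() {
    assert_eq!(render_dashes("----"), "<hr>");
    assert_eq!(render_dashes("--gone--"), "<del>gone</del>");
    assert_eq!(render_dashes("--"), "—");
    // The awkward case from the list above: an em dash plus a stray dash.
    assert_eq!(render_dashes("---"), "—-");
    println!("ok");
}
```

Note how `---` lands in the em-dash fallback, producing exactly the `—-` residue described above.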

The initial approach was to handle -- → — in the preprocessor. However, this borked comments and strikethroughs by turning them into —text like this— or [!— like this —]. Then I moved the em dash into the Pest parser, which fixed the comments issue but still broke strikethroughs, as the parser preferred the “simpler” construction. One unfortunate reality is that, even without text fallback, Wikidot grammar is fundamentally ambiguous.

Instead I began working on a different approach, which is how the parser works today. It is a hand-written recursive descent parser which iterates through the tokens and attempts to match them based on a series of rules. The “tokenizer” is in fact Pest, but using a much simpler grammar that extracts terminal tokens only, plus a fallback token to catch any other text. This isn’t very performant, but for now it works. There is an open issue to eventually write a new lexer without Pest’s complexity.

The parser’s rules correspond to various syntactical constructs, like bolded text, bullet lists, titles, and blocks (the name for objects like [[div]] and [[span]]). If a rule for the token fails, it proceeds to the next rule and tries that. For instance, when encountering Token::Dash (--), it first tries to match a strikethrough, and that failing, outputs an em dash.

Once all the rules for the present token have been tried, it emits a parser warning and goes to the fallback rule, which simply interprets the token as raw text. This way it retains the flexibility that Text_Wiki has, but with real structure and without the hackiness of repeated regular expressions. Additionally, this allows us to emit warnings: non-fatal notices informing the user that the parser encountered an issue, without terminating parsing.
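The rule-then-fallback loop can be sketched roughly like this. All of these type and rule names are my own simplifications, not ftml’s real API; the point is only the shape of “try each rule, then warn and fall back to text”:

```rust
// Hypothetical sketch of a recursive-descent pass with rule fallback.
#[derive(Debug, PartialEq)]
enum Token {
    Dash,            // "--"
    Pipe,            // "|", standing in for any token with no rule here
    Text(String),
}

#[derive(Debug, PartialEq)]
enum Element {
    EmDash,
    Strikethrough(String),
    Text(String),
}

fn parse(tokens: &[Token]) -> (Vec<Element>, Vec<String>) {
    let mut out = Vec::new();
    let mut warnings = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        match &tokens[i] {
            Token::Dash => {
                // Rule 1: strikethrough needs Dash, Text, Dash.
                if let (Some(Token::Text(t)), Some(Token::Dash)) =
                    (tokens.get(i + 1), tokens.get(i + 2))
                {
                    out.push(Element::Strikethrough(t.clone()));
                    i += 3;
                } else {
                    // Rule 2: a lone -- becomes an em dash.
                    out.push(Element::EmDash);
                    i += 1;
                }
            }
            Token::Pipe => {
                // No rule matched: emit a non-fatal warning, then fall
                // back to interpreting the token as raw text.
                warnings.push("no rule for '|'; falling back to text".into());
                out.push(Element::Text("|".into()));
                i += 1;
            }
            Token::Text(t) => {
                out.push(Element::Text(t.clone()));
                i += 1;
            }
        }
    }
    (out, warnings)
}

fn main() {
    let (elems, warns) =
        parse(&[Token::Dash, Token::Text("gone".into()), Token::Dash]);
    assert_eq!(elems, vec![Element::Strikethrough("gone".into())]);
    assert!(warns.is_empty());

    let (elems, warns) = parse(&[Token::Pipe]);
    assert_eq!(elems, vec![Element::Text("|".into())]);
    assert_eq!(warns.len(), 1);
}
```

Parsing never aborts: every input produces some element list, possibly accompanied by warnings.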

In addition to several utilities to help parsing common patterns, such as simple formatting containers like italics or underline, there exist extensions to the parser structure for handling blocks. This allows easy and modular capturing of blocks, including argument parsing and interpretation of the body (if it has one). For instance, Wikidot is very inconsistent around which attributes it accepts: [[div]], [[span]], and some others accept id, class, and style, but [[module Join]] only accepts class, and [[a]] accepts a few additional attributes. ftml changed this to accept any HTML attribute which was on a safe list (preventing inline script injection or sandbox escaping, but allowing user flexibility).
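The safe-list idea can be illustrated with a short sketch. The function name and the exact list here are hypothetical, not ftml’s actual implementation; the idea is just that any attribute is accepted as long as it appears on an allowlist:

```rust
// Hypothetical sketch of an attribute allowlist: keep any HTML
// attribute on the safe list (case-insensitively), and drop everything
// else, such as inline event handlers like onclick.
fn filter_attributes<'a>(attrs: &[(&'a str, &'a str)]) -> Vec<(&'a str, &'a str)> {
    const SAFE: &[&str] = &["id", "class", "style", "title", "lang"];
    attrs
        .iter()
        .filter(|(name, _)| SAFE.contains(&name.to_ascii_lowercase().as_str()))
        .copied()
        .collect()
}

fn main() {
    let attrs = [("class", "foo"), ("onclick", "alert(1)"), ("ID", "x")];
    let safe = filter_attributes(&attrs);
    assert_eq!(safe, vec![("class", "foo"), ("ID", "x")]);
}
```

This gives every block the same attribute behavior for free, instead of each block hard-coding its own accepted set.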

This is a massive improvement over Text_Wiki, which handles arguments, contents, and name case-sensitivity inconsistently between blocks, leading to easily avoidable issues like the long-standing inability to nest collapsibles in Wikidot.

Around this time, Monkatraz inquired about a WebAssembly build of ftml for use in Sheaf, Wikijump’s replacement editor. This would permit live previews without expensive back-and-forth between the client and the server (as well as offline support). However, this presented some challenges, such as the fact that parser warnings returned slice indices in UTF-8 (what Rust uses for string encoding), while JavaScript uses UTF-16. Additionally, we found that the performance cost of library logging in a browser environment was notable, so I spent time wrapping logging in a compile-time flag. This way it is possible to produce a build with no logging code at all.
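The index mismatch is easy to see with a small example. A byte offset into a Rust string has to be re-counted in UTF-16 code units before a JavaScript caller can use it to slice a JS string (this is a sketch of the general technique, not ftml’s actual conversion code):

```rust
// Convert a UTF-8 byte offset into the equivalent UTF-16 code-unit
// offset, by summing the UTF-16 lengths of all preceding characters.
fn utf8_offset_to_utf16(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().map(char::len_utf16).sum()
}

fn main() {
    // 'é' is 2 UTF-8 bytes but 1 UTF-16 unit; '—' is 3 bytes but 1 unit.
    let text = "é—x";
    let byte_of_x = text.find('x').unwrap();
    assert_eq!(byte_of_x, 5); // the UTF-8 index a Rust warning would carry
    assert_eq!(utf8_offset_to_utf16(text, byte_of_x), 2); // what JS needs
}
```

Any span crossing non-ASCII text drifts between the two encodings, so warnings must be translated at the WASM boundary.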

Next was integrating ftml into Wikijump’s PHP code. The library is called from PHP using FFI. This means that all of the Rust functionality is wrapped in C ABI-compatible functions, which are built into a dynamic library (libftml.so). PHP is then able to interface with these C functions using its FFI capabilities, which permits native PHP code to invoke the library from the main php-fpm container. This improves performance over a remote service, but increases the complexity of the container build process.
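The C ABI wrapping pattern looks roughly like the following. These function names are invented for illustration and are not ftml’s real FFI surface; the point is the `extern "C"` export with an unmangled symbol, plus a matching free function since the Rust allocator must reclaim its own strings:

```rust
// Hypothetical sketch of exposing Rust functions over the C ABI.
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

#[no_mangle]
pub extern "C" fn ftml_version() -> *mut c_char {
    // Hand ownership of a C string to the caller; it must be returned
    // to us later via ftml_string_free.
    CString::new("0.1.0-sketch").unwrap().into_raw()
}

#[no_mangle]
pub unsafe extern "C" fn ftml_string_free(s: *mut c_char) {
    if !s.is_null() {
        // Reclaim the string with the Rust allocator that created it.
        drop(CString::from_raw(s));
    }
}

fn main() {
    // Exercising the exports from Rust itself, just to show the round trip.
    let ptr = ftml_version();
    let version = unsafe { CStr::from_ptr(ptr) }.to_str().unwrap().to_string();
    assert_eq!(version, "0.1.0-sketch");
    unsafe { ftml_string_free(ptr) };
}
```

On the PHP side, something like `FFI::cdef('char *ftml_version(); void ftml_string_free(char *);', 'libftml.so')` could then load the library and call these symbols directly.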

Once the FFI interface was present, then came the challenge of refactoring Wikidot code to enable abstracted wikitext transformation. In true Wikidot fashion, Text_Wiki has inconsistent expectations across usages. I’ve been able to replace direct invocations of the library with an interface called WikitextBackend. This effectively routes the same calls to differing backends, allowing a Wikijump installation to switch between Text_Wiki and ftml via a feature flag in wikijump.ini. Work on this interface is ongoing, as a few of Wikidot’s rendering call sites aren’t quite accounted for yet, but I hope to have a proposal soon for the shape of ftml’s remote handle (an abstraction around fetching contextual data, like information on other pages).

Now we are able to use either wikitext backend, which will help a lot as ftml continues to develop and gains the features it is missing for parity with mainline Wikidot’s Text_Wiki. Hopefully ftml can hit 1.0 soon and take its place as the definitive parser and renderer!

Author: aismallard
