How to mirror the Russian Wikipedia with Debian and Kiwix

It has been reported that the Russian government has threatened to block access to Wikipedia for documenting narratives that do not agree with its official position.

One of the anti-censorship strategies I've been working on is Kiwix, an offline Wikipedia reader (and plenty of other content too). Kiwix is free and open source software developed by a great community of people that I really enjoy working with.

With threats of censorship, traffic to Kiwix has increased fifty-fold, with users from Russia accounting for 40% of new downloads!

You can download copies of every language of Wikipedia for offline reading and distribution, as well as host your own read-only mirror, which is what I'm going to explain today.

Disclaimer: depending on where you live, it may be illegal or get you in trouble with the authorities to rehost Wikipedia content. Please be aware of your digital and physical safety before proceeding.

With that out of the way, let's get started. You'll need a Debian (or Ubuntu) server with at least 30GB of free disk space. You'll also want to have a webserver like Apache or nginx installed (I'll share the Apache config here).

First, we need to download the latest copy of the Russian Wikipedia.

$ wget 'https://download.kiwix.org/zim/wikipedia/wikipedia_ru_all_maxi_2022-03.zim'

If the download is interrupted or fails, you can use wget -c $url to resume it.

Next let's install kiwix-serve and try it out. If you're using Ubuntu, I strongly recommend enabling our Kiwix PPA first.

$ sudo apt update
$ sudo apt install kiwix-tools
$ kiwix-serve -p 3004 wikipedia_ru_all_maxi_2022-03.zim

At this point you should be able to visit http://yourserver.com:3004/ and see the Russian Wikipedia. Awesome! You can use any available port; I just picked 3004.

Now let's use systemd to daemonize it so it runs in the background. Create /etc/systemd/system/kiwix-ru-wp.service with the following:

[Unit]
Description=Kiwix Russian Wikipedia

[Service]
Type=simple
User=www-data
ExecStart=/usr/bin/kiwix-serve -p 3004 /path/to/wikipedia_ru_all_maxi_2022-03.zim
Restart=always

[Install]
WantedBy=multi-user.target

Now let's start it and enable it at boot:

$ sudo systemctl start kiwix-ru-wp
$ sudo systemctl enable kiwix-ru-wp

Since we want to expose this on the public internet, we should put it behind a more established webserver and configure HTTPS.

Here's the Apache httpd configuration I used:

<VirtualHost *:80>
        ServerName ru-wp.yourserver.com

        ServerAdmin webmaster@localhost
        DocumentRoot /var/www/html

        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined

        <Proxy *>
                Require all granted
        </Proxy>

        ProxyPass / http://127.0.0.1:3004/
        ProxyPassReverse / http://127.0.0.1:3004/
</VirtualHost>

Put that in /etc/apache2/sites-available/kiwix-ru-wp.conf and run:

$ sudo a2ensite kiwix-ru-wp
$ sudo systemctl reload apache2

Finally, I used certbot to enable HTTPS on that subdomain and redirect all HTTP traffic over to HTTPS. This is an interactive process that is well documented so I'm not going to go into it in detail.
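
In case it's useful, on Debian/Ubuntu the certbot step usually boils down to something like this (using certbot's Apache plugin; adjust the domain to your own):

$ sudo apt install certbot python3-certbot-apache
$ sudo certbot --apache -d ru-wp.yourserver.com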

You can see my mirror of the Russian Wikipedia, following these instructions, at https://ru-wp.legoktm.com/. Anyone is welcome to use it or distribute the link, though I am not committing to running it long-term.

This is certainly not a perfect anti-censorship solution: the copy of Wikipedia that Kiwix provides became out of date the moment it was created, and the setup described here requires you to manually update the service when a new copy becomes available next month.
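
When the next dump is published, the update boils down to downloading the new ZIM and pointing the service at it. A rough sketch, assuming the next file keeps the same naming scheme (wikipedia_ru_all_maxi_2022-04.zim):

$ wget 'https://download.kiwix.org/zim/wikipedia/wikipedia_ru_all_maxi_2022-04.zim'
$ sudo sed -i 's/2022-03/2022-04/' /etc/systemd/system/kiwix-ru-wp.service
$ sudo systemctl daemon-reload
$ sudo systemctl restart kiwix-ru-wp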

Finally, if you have some extra bandwidth, you can also help seed this as a torrent.


Building fast Wikipedia bots in Rust

Lately I've been working on mwbot-rs, a framework for writing bots and tools in Rust. My main focus is on Wikipedia-related things, but all the code is generic enough to be used with any modern (1.35+) MediaWiki installation. One specific feature of mwbot-rs I want to highlight today is the way it enables building incredibly fast Wikipedia bots, taking advantage of Rust's "fearless concurrency".

Most Wikipedia bots follow the same general process:

  1. Fetch a list of pages
  2. For each page, process the page metadata/contents, possibly fetching other pages for more information
  3. Combine all the information into the new page contents and save the page (and possibly other pages).

The majority of bots have rather minimal/straightforward processing, so they are typically limited by the speed of I/O, since all fetching and editing happens over network requests to the API.

MediaWiki has a robust homegrown API (known as the "Action API") that predates modern REST, GraphQL, etc., and generally allows fetching information based on how it's organized in the database. It is optimized for bulk lookups; for example, to get basic metadata for two pages (prop=info), you just set &titles=Albert Einstein|Taylor Swift (see the example request below).
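
For instance, a single request along these lines returns the info metadata for both pages at once:

https://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Albert%20Einstein|Taylor%20Swift&format=json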

However, when using this API, the guidelines are to make requests in series, not parallel, or in other words, use a concurrency of 1. Not to worry, there's a newer Wikimedia REST API (aka "RESTBase") that allows for concurrency up to 200 req/s, which makes it a great fit for writing our fearlessly concurrent bots. As a bonus, the REST API provides page content in annotated HTML format, which means it can be interpreted using any HTML parser rather than needing a dedicated wikitext parser, but that's a topic for another blog post.

Let's look at an example of a rather simple Wikipedia bot that I recently ported from Python to concurrent Rust. The bot's task is to create redirects for SCOTUS cases from titles that don't have a period after the "v" between the parties. For example, it would create a redirect from Marbury v Madison to Marbury v. Madison. If someone leaves out the period while searching, they'll still end up at the correct article instead of having to find it in the search results. You can see the full source code on GitLab. I omitted a bit of code to focus on the concurrency aspects.

First, we get a list of all the pages in Category:United States Supreme Court cases.

let mut gen = categorymembers_recursive(
    &bot,
    "Category:United States Supreme Court cases",
);

Under the hood, this uses an API generator, which allows us to get the same prop=info metadata in the same request that fetches the list of pages. This metadata is stored in the Page instance that's yielded by the page generator. Now calls to page.is_redirect() can be answered immediately without needing to make a one-off HTTP request (normally it would be lazy-loaded).

The next part is to spawn a Tokio task for each page.

let mut handles = vec![];
while let Some(page) = gen.recv().await {
    let page = page?;
    let bot = bot.clone();
    handles.push(tokio::spawn(
        async move { handle_page(&bot, page).await },
    ));
}

The mwbot::Bot type keeps all of its data wrapped in Arc<T>, which makes it cheap to clone since we're not actually cloning the underlying data. We keep track of each JoinHandle returned by tokio::spawn() so that, once every task has been spawned, we can await each one and the program only exits after all tasks have finished. We can also access the return value of each task, which in this case is Result<bool>, where the boolean indicates whether a redirect was created, letting us print a closing message saying how many new redirects were created.
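
That final step isn't shown in the snippets here, but it's just a loop over the handles. A minimal sketch, assuming an anyhow-style Result in the calling function so that ?? can flatten both the JoinError and the task's own Result<bool>:

let mut created = 0;
for handle in handles {
    // The first ? handles the JoinError, the second our own Result<bool>.
    if handle.await?? {
        created += 1;
    }
}
println!("Created {} redirect(s)", created);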

Now let's look at what each task does. The next three code blocks make up our handle_page function.

// Should not create if it's a redirect
if page.is_redirect().await? {
    println!("{} is redirect", page.title());
    return Ok(false);
}

First we check that the page we just got is not a redirect itself, as we don't want to create a redirect to another redirect. As mentioned earlier, the page.is_redirect() call does not incur an HTTP request to the API since we already preloaded that information.

let new = page.title().replace(" v. ", " v ");
let newpage = bot.page(&new)?;
// Create if it doesn't exist
let create = match newpage.html().await {
    Ok(_) => false,
    Err(Error::PageDoesNotExist(_)) => true,
    Err(err) => return Err(err.into()),
};
if !create {
    return Ok(false);
}

Now we create a new Page instance that has the same title, but with the period removed. We need to make sure this page doesn't exist before we try to create it. We could use the newpage.exists() function, except it will make an HTTP request to the Action API since the page doesn't have that metadata preloaded. Even worse, the Action API limits us to a concurrency of 1, so any task that has made it this far now loses the concurrency benefit we were hoping for.

So, we'll just cheat a bit by making a request for the page's HTML, served by the REST API that allows for the 200 req/s concurrency. We throw away the actual HTML response, but it's not that wasteful given that in most cases we either get a very small HTML response representing the redirect or we get a 404, indicating the page doesn't exist. Issue #49 proposes using a HEAD request to avoid that waste.

let target = page.title();
// Build the new HTML to be saved
let code = { ... };
println!("Redirecting [[{}]] to [[{}]]", &new, target);
newpage.save(code, ...).await?;
Ok(true)

Finally we build the HTML to be saved and then save the page. The newpage.save() function calls the API with action=edit to save the page, which limits us to a concurrency of 1. That's not actually a problem here: by Wikipedia policy, bots are generally supposed to pause 10 seconds between edits if there is no urgency to them (in contrast to, say, an anti-vandalism bot that wants to revert bad edits as fast as possible). This is mostly to avoid cluttering up the feed of recent changes that humans patrol. So regardless of how fast we can process the few thousand SCOTUS cases, we still have to wait 10 seconds between each edit we want to make.

Despite that forced delay, the concurrency will make most bots faster. If the few redirects we need to create appear at the end of our categorymembers queue, we'd first have to process all the pages that come before them just to save one or two at the end. Now that everything is processed concurrently, the edits will happen pretty quickly, despite being at the end of the queue.

The example we looked through was also quite simple; other bots can have handle_page functions that take longer than 10 seconds because they have to fetch multiple pages, in which case the concurrency really helps. My archiveindexer bot operates on a list of pages: for each page, it fetches all the talk page archives and builds an index of the discussions on them, which can easily end up pulling 5 to 50 pages depending on how controversial the subject is. The original Python version of this code took about 3-4 hours; the new concurrent Rust version finishes in 20 minutes.

The significant flaw in this goal of concurrent bots is that the Action API limits us to a concurrency of 1 request at a time. The cheat we did earlier requires intimate knowledge of how each underlying function works with the APIs, which is not a reasonable expectation, nor is it a good optimization strategy since it could change underneath you.

One of the strategies we are implementing to work around this is to combine compatible Action API requests. Since the Action API does really well at providing bulk lookups, we can intercept multiple similar requests, for example page.exists(), merge all the parameters into one request, send it off and then split the response back up to each original caller. This lets us process multiple tasks' requests while still only sending one request to the server.
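
To make the pattern concrete, here's a rough, self-contained sketch of the idea (the names and the 50-title batch limit are mine, not the actual mwbot-rs internals): each caller sends its lookup plus a oneshot channel, and a single worker batches whatever has piled up into one API request.

use std::collections::HashMap;
use tokio::sync::{mpsc, oneshot};

// One caller's lookup: a title plus a channel to send the answer back on.
struct ExistsRequest {
    title: String,
    respond: oneshot::Sender<bool>,
}

// Stand-in for the bulk Action API call (action=query&titles=A|B|C);
// here it just pretends every title exists.
async fn bulk_exists(titles: &[String]) -> HashMap<String, bool> {
    titles.iter().map(|t| (t.clone(), true)).collect()
}

// Collect lookups that arrive close together and answer them all with a
// single bulk request, so concurrent tasks don't each hit the API alone.
async fn coalesce(mut rx: mpsc::Receiver<ExistsRequest>) {
    while let Some(first) = rx.recv().await {
        let mut batch = vec![first];
        // Grab anything else already queued, up to a per-request title limit.
        while batch.len() < 50 {
            match rx.try_recv() {
                Ok(req) => batch.push(req),
                Err(_) => break,
            }
        }
        let titles: Vec<String> = batch.iter().map(|r| r.title.clone()).collect();
        let results = bulk_exists(&titles).await;
        for req in batch {
            let exists = results.get(&req.title).copied().unwrap_or(false);
            // The caller may have given up and dropped its receiver; ignore errors.
            let _ = req.respond.send(exists);
        }
    }
}

On the caller side, something like page.exists() would then just send an ExistsRequest down the channel and await the oneshot receiver, without knowing or caring whose requests it got batched with.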

This idea of combining requests is currently behind an unstable feature flag as some edge cases are still worked out. Credit for this idea goes to Lucas Werkmeister, who pioneered it in his m3api project.

If this work interests you, the mwbot-rs project is always looking for more contributors; please reach out, either on-wiki, on GitLab, or in the #wikimedia-rust:libera.chat room (Matrix or IRC).


2022 goals

New year, new job, new goals. In no specific order:

  • Move out of my parents' house.
  • Contribute something meaningful to SecureDrop.
  • Contribute something meaningful to MediaWiki.
  • Not get COVID.
  • Continue contributing to Mailman.
  • Continue working on mwbot-rs, while having fun and learning more Rust.
  • Get more stickers (lack of in-person meetups has really been hurting my sticker collecting).
  • Port the rest of my wiki bots to Rust.
  • Make progress on moving wiki.debian.org to MediaWiki.
  • Write at least one piece of recognized content (DYK/GA/FA) for Wikipedia.
  • Travel outside the US (COVID-permitting).
  • Finish in the top half of our Fantasy Football league and Pick 'em pool. I did pretty well in 2020 and really regressed in 2021.
  • Keep track of TV show reviews/ratings. I've been pretty good about tracking movies I watch, but don't yet do the same for TV.

I'm hoping that 2022 will be better than the previous two years; the bar is really, really low.


What it takes to parse MediaWiki page titles...in Rust

In the UseModWiki days, Wikipedia page titles were "CamelCase" and automatically linked (see CamelCase and Wikipedia).

MediaWiki on the other hand uses the famous [[bracketed links]], aka "free links". For most uses, page titles are the primary identifier of a page, whether it's in URLs for external consumption or [[Page title|internal links]]. Consequently, there are quite a few different normalization and validation steps MediaWiki titles go through.

Erutuon and I have been working on a Rust library that parses, validates and normalizes MediaWiki titles: mwtitle. The first 0.1 release was published earlier this week! It aims to replicate all of the PHP logic, but in Rust. This is just a bit harder than it seems...

First, let's understand what a MediaWiki title is. A complete title looks like: interwiki:Namespace:Title#fragment (in modern MediaWiki jargon titles are called "link targets").

The optional interwiki prefix references a title on another wiki. On most wikis, looking at Special:Interwiki shows the list of possible interwiki prefixes.

Namespaces are used to distinguish types of pages, like articles, help pages, templates, categories, and so on. Each namespace has an accompanying "talk" namespace used for discussions related to those pages. Each namespace also has an internal numerical ID, a canonical English form, and if the wiki isn't in English, localized forms. Namespaces can also have aliases, for example "WP:" is an alias for the "Wikipedia:" namespace. The main article namespace (ns #0) is special, because its name is the empty string.

The actual title part goes through various normalization routines and is stored in the database with spaces replaced by underscores.

And finally the fragment is just a URL fragment that points to a section heading or some other anchor on pages.

There are some basic validation steps that MediaWiki does. Titles can't be empty, can't have a relative path (Foo/../Bar), can't start with a colon, can't have magic tilde sequences (~~~, this syntax is used for signatures), and they can't contain illegal characters. This last one is where the fun begins, as MediaWiki actually allows users to configure what characters are allowed in titles:

$wgLegalTitleChars = " %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF+";

This then gets put into a regex like [^$wgLegalTitleChars], which, if it matches, is an illegal character. This works fine if you're in PHP, except we're using Rust! Looking closely, you'll see that / is escaped, because it's used as the delimiter of the PHP regex, except that's an error when using the regex crate. And the byte sequences of \x80-\xFF mean we need to operate on bytes, when we really would be fine with just matching against \u0080-\u00FF.

MediaWiki has some (IMO crazy) code that parses the regex to rewrite it into the unicode escape syntax so it can be used in JavaScript. T297340 tracks making this better and I have a patch outstanding to hopefully make this easier for other people in the future.

Then there's normalization. So what kind of normalization routines does MediaWiki do?

One of the most obvious ones is that the first letter of a page title is uppercase. For example, the article about iPods is actually called "IPod" in the database (it has a display title override). Except of course, for all the cases where this isn't true. Like on Wiktionaries, where the first letter is not forced to uppercase and "iPod" is actually "iPod" in the database.

Seems simple enough, right? Just take the first character, call char.to_uppercase(), and then merge it back with the rest of the characters.
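
In code, that naive approach looks something like this (a sketch, not what mwtitle actually ships, for the reasons below):

fn ucfirst(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        // to_uppercase() can yield more than one char, so chain and collect.
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

// ucfirst("iPod") == "IPod"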

Except...PHP uppercases characters differently, and changes behavior based on the PHP and possibly ICU version in use. Consider the character ᾀ (U+1F80). When run through mb_strtoupper() using PHP 7.2 (3v4l), which is what Wikimedia currently uses, you get ᾈ (U+1F88). In Rust (playground) and in later PHP versions, you get ἈΙ (U+1F08 and U+0399).

For now we're storing a map of these characters inside mwtitle, which is terrible, but I filed a bug for exposing this via the API: T297342.

There's also a whole normalization routine that sanitizes IP addresses, especially IPv6. For example, User talk:::1 normalizes to User talk:0:0:0:0:0:0:0:1.

Finally, adjacent whitespace is normalized down into a single space. But of course, MediaWiki uses its own list of what counts as whitespace, which doesn't exactly match char.is_whitespace().
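
The shape of that routine is roughly the following; the character list here is purely illustrative, the real one follows MediaWiki's definition:

fn collapse_whitespace(s: &str) -> String {
    // Illustrative subset only; MediaWiki defines its own full list
    // (which also treats underscores as spaces).
    const WIKI_WHITESPACE: &[char] = &[' ', '_', '\u{00A0}', '\u{2000}', '\u{3000}'];
    s.split(|c: char| WIKI_WHITESPACE.contains(&c))
        .filter(|part| !part.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

// collapse_whitespace("Foo__ Bar") == "Foo Bar"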

We developed mwtitle by initially doing a line-by-line port of MediaWikiTitleCodec::splitTitleString(), discovering stuff we had messed up or overlooked by copying over its test cases too. Eventually this escalated into writing a PHP extension wrapper, php-mwtitle, which can be plugged into MediaWiki to run MediaWiki's own test suite. After a few fixes, it fully passes everything.

Since I already wrote the integration, I ran some basic benchmarks: the Rust version is about 3-4x faster than MediaWiki's current PHP implementation (see the raw perf measurements). But title parsing isn't a particularly hot path, so switching to the Rust version would probably result in only a ~0.5% speedup overall, based on some rough estimations from looking at flamegraphs. That's not really worth it, considering the social and tooling overhead of introducing a Rust-based PHP extension as an optional MediaWiki dependency.

For now mwtitle is primarily useful for people writing bots and other MediaWiki tools in Rust. Given that a lot of people tend to use Python for these tasks, we could look into using PyO3 to write a Python wrapper.

There's also generally a lot of cool code in mwtitle, including sets and maps that can perform case-insensitive matching without requiring string allocations (nearly all Erutuon's fantastic work!).

Throughout this process, we found a few bugs, mostly by just staring at and analyzing this code over and over, and filed some (like T297340 and T297342, mentioned above) that would make parsing titles outside of PHP easier.

mwtitle is one part of the new mwbot-rs project, where we're building a framework for writing MediaWiki bots and tools in Rust, the wiki way. We're always looking for more contributors; please reach out if you're interested, either on-wiki, on GitLab, or in the #wikimedia-rust:libera.chat room (Matrix or IRC).


Generating Rust types for MediaWiki API responses

I just released version 0.2.0 of the mwapi_responses crate. It automatically generates Rust types based on the query parameters specified for use in MediaWiki API requests. If you're not familiar with the MediaWiki API, I suggest you play around with the API sandbox. The API is highly dynamic, with the user specifying query parameters and values for each property they want returned.

For example, if you wanted a page's categories, you'd use action=query&prop=categories&titles=[...]. If you just wanted basic page metadata you'd use prop=info. For information about revisions, like who made specific edits, you'd use prop=revisions. And so on, for all the different types of metadata. For each property module, you can further filter what properties you want. If under info, you wanted the URL to the page, you'd use inprop=url. If you wanted to know the user who created the revision, you'd use rvprop=user. For the most part, each field in the response can be toggled on or off using various prop parameters. These parameters can be combined, so you can just get the exact data that your use-case needs, nothing extra.

For duck-typed languages like Python, this is pretty convenient. You know what fields you've requested, so that's all you access. But in Rust, it means you either need to type out the entire response struct for each API query you make, or just rely on the dynamic nature of serde_json::Value, which means you're losing out on the fantastic type system that Rust offers.

But what I've been working on in mwapi_responses is a third option: having a Rust macro generate the response structs based on the specified query parameters. Here's an example from the documentation:

use mwapi_responses::prelude::*;
#[query(
    prop="info|revisions",
    inprop="url",
    rvprop="ids"
)]
struct Response;

This expands to roughly:

#[derive(Debug, Clone, serde::Deserialize)]
pub struct Response {
    #[serde(default)]
    pub batchcomplete: bool,
    #[serde(rename = "continue")]
    #[serde(default)]
    pub continue_: HashMap<String, String>,
    pub query: ResponseBody,
}

#[derive(Debug, Clone, serde::Deserialize)]
pub struct ResponseBody {
    pub pages: Vec<ResponseItem>,
}

#[derive(Debug, Clone, serde::Deserialize)]
pub struct ResponseItem {
    pub canonicalurl: String,
    pub contentmodel: String,
    pub editurl: String,
    pub fullurl: String,
    pub lastrevid: Option<u32>,
    pub length: Option<u32>,
    #[serde(default)]
    pub missing: bool,
    #[serde(default)]
    pub new: bool,
    pub ns: i32,
    pub pageid: Option<u32>,
    pub pagelanguage: String,
    pub pagelanguagedir: String,
    pub pagelanguagehtmlcode: String,
    #[serde(default)]
    pub redirect: bool,
    pub title: String,
    pub touched: Option<String>,
    #[serde(default)]
    pub revisions: Vec<ResponseItemrevisions>,
}

#[derive(Debug, Clone, serde::Deserialize)]
pub struct ResponseItemrevisions {
    pub parentid: u32,
    pub revid: u32,
}

It would be a huge pain to have to write that out by hand every time, so having the macro do it is really convenient.

The crate is powered by JSON metadata files for each API module, specifying the response fields and which parameters need to be enabled for them to show up in the output. And there are some, uh, creative methods for representing Rust types in JSON so they can be spit out by the macro. So far I've been writing the JSON files by hand, by testing each parameter manually and then reading the MediaWiki API source code. I suspect it's possible to automatically generate them, but I haven't gotten around to that yet.

Using enums?

So far the goal has been to faithfully represent the API output and directly map it to Rust types. This was my original goal and I think a worthwhile one because it makes it easy to figure out what the macro is doing. It's not really convenient to dump the structs the macro creates (you need a tool like cargo-expand), but if you can see the API output, you know that the macro is generating the exact same thing, but using Rust types.

There's a big downside to this, which is mostly that we're not able to take full advantage of the Rust type system. In the example above, lastrevid, length, pageid and touched are all typed using Option<T>, because if the page is missing, then those fields will be absent. But that means we need to .unwrap() on every page after checking the value of the missing property. It would be much better if we had ResponseItem split into two using an enum, one for missing pages and the other for those that exist.

enum ResponseItem {
    Missing(ResponseItemMissing),
    Exists(ResponseItemExists),
}

This would also be useful for properties like rvprop=user|userid. Currently setting that property results in something like:

pub struct ResponseItemrevisions {
    #[serde(default)]
    pub anon: bool,
    pub user: Option<String>,
    #[serde(default)]
    pub userhidden: bool,
    pub userid: Option<u32>,
}

Again, Option<T> is being used for the case where the user is hidden, and those properties aren't available. Instead we could have something like:

enum RevisionUser {
    Hidden,
    Visible { username: String, id: u32 }   
}

(Note that anon can be figured out by looking at id == 0.) Again, this is much more convenient than the faithful representation of JSON.

I'm currently assuming these kinds of enums can be made to work with serde, or maybe we'll need some layer on top of that. I'm also still not sure whether we want to lose the faithful representation aspect of this.
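
For what it's worth, serde's untagged enums can express this kind of split. A sketch of what a generated type might look like (an idea only, not what mwapi_responses emits today):

use serde::Deserialize;

#[derive(Debug, Clone, Deserialize)]
#[serde(untagged)]
enum RevisionUser {
    // Variants are tried in order: if "user"/"userid" are present, it's visible...
    Visible { user: String, userid: u32 },
    // ...otherwise fall back to the hidden case.
    Hidden { userhidden: bool },
}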

Next steps

The main next step is to get this crate used in some real world projects and see how people end up using it and what the awkward/bad parts are. One part I've found difficult so far is that these types are literally just types: there's no integration with any API library, so it's all up to the user to figure that out. There's also currently no logic to help with continuing queries; I might look into adding some kind of merge() function to help with that in the future.

I have some very very proof-of-concept integration code with my mwbot project, more on that to come in a future blog post.

Contributions are welcome in all forms! For questions/discussion, feel free to join #wikimedia-rust:libera.chat (via Matrix or IRC) or use the project's issue tracker.

To finish on a more personal note, this is easily the most complex Rust code I've written so far. proc-macros are super powerful, but it's super easy to get lost writing code that just writes more code. It feels like it's been through at least 3 or 4 rounds of complex refactoring, each taking advantage of new Rust things I learn, generally making the code better and more robust. The code coverage metrics are off because it's split between two crates; the code is actually 100% covered by integration and unit tests.


Kiwix returns in Debian Bullseye

(This is my belated #newindebianbullseye post.)

The latest version of the Debian distro, 11.0 aka Bullseye, was released last week and, after a long absence, includes Kiwix! Previously, in Debian 10/Buster, we only had the underlying C/C++ libraries available.

If you're not familiar with it, Kiwix is an offline content reader, providing Wikipedia, Gutenberg, TED talks, and more in ZIM (.zim) files that can be downloaded and viewed entirely offline. You can get the entire text of the English Wikipedia in less than 100GB.

apt install kiwix will get you a graphical desktop application that allows you to download and read ZIMs. apt install kiwix-tools installs kiwix-serve (among other tools), which serves ZIM files over HTTP.

Additionally, there are now tools in Debian that allow you to create your own ZIM files: zimwriterfs and the python3-libzim library.

All of this would not have been possible without the support of the Kiwix developers, who made it a priority to properly support Debian. All of the Kiwix repositories have a CI process that builds Debian packages for each pull request and must pass before the change is accepted.

Ubuntu users can take advantage of our primary PPA or the bleeding-edge PPA. For Debian users, my goal is that unstable/sid will have the latest version within a few days of a release, and once it moves into testing, it'll be available in Debian Backports.

It is always a pleasure working with the Kiwix team, who make a point to send stickers and chocolate every year :)