2023 Wikimedia Hackathon recap

I had a wonderful time at the 2023 Wikimedia Hackathon in Athens, Greece, earlier this month. The best part was easily seeing old friends I hadn't seen in person since probably 2018 and getting to hack and chat together. I also met a ton of new friends for the first time, even though we've been working together for multiple years at this point! I very much enjoy the remote, distributed nature of working in Wikimedia Tech, but it's also really nice to meet people in person.

This post is very scattered because that was my experience at the hackathon itself, just constantly running around, bumping into people.

I wrote that I wanted to work on: "mwbot-rs and Rust things, technical governance (open to nerd sniping)". I definitely did my fair share of Rust evangelism and had good discussions regarding technical governance (more on that another time). And some Mastodon evangelism and a bunch of sticker trading.

But before I got into hacking things, I tabulated and published the results of the 2022 Commons Picture of the Year contest, which I think turned out pretty well this year. Of course, the list of things to improve for next year keeps getting longer and longer (again, more on that in a future post).

At some point during conversation, I/we realized that the GWToolset extension was still deployed on Wikimedia Commons despite being, well, basically dead. It hadn't been used in over a year and last rites were administered back in November (literally, you have to look at the photos).

With a thumbs-up from extension-undeploying expert Zabe (and others), I undeployed it! There was a "fun" moment when the venue WiFi dropped so the scap output froze on my terminal, but I knew it successfully went through a few minutes later because of the IRC notification, phew. Anyways, RIP, end of an era.

And then Taavi deployed the RealMe extension, which allows wiki users to verify their Mastodon accounts and vice versa. But we went for dinner immediately after so Taavi wasn't even the first one to announce it, Raymond beat him to it! :-)

I spent a while rebasing a patch, first posted in April 2020, that brings EventStreams output to parity with the IRC feed, and got it merged (you're welcome Faidon ;)).

One of the last things I did before leaving was an interview about MediaWiki in the context of spinning up a new MediaWiki platform team (guess which one I am). At one point the question was "What is the single biggest pain point of working in MediaWiki?" Me: "can I have two?"

Reviewed a bunch of stuff:

Probably the most important patch I wrote at the hackathon was to add MaxSem, Amir (Ladsgroup), TheDJ and Petr Pchelko to the primary MediaWiki authors list on Special:Version. <3

Despite all the wonderful people who were there, it was also very apparent who wasn't. We need more regional hackathons, and after a bit of reassurance from Siebrand and Maarten, it became clear that we have enough Wikimedia Tech folks in New York City already, so uh, stay tuned for details about some future NYC-based hackathon and let me know if you're interested in helping!

Final thanks to the Wikimedia Foundation for giving me a scholarship to attend. I really can't wait until the next time I get to see everyone again.


Six months of Wikis World

I did a lot of new, crazy things in 2022, but by far, the most unplanned and unexpected was running a social media server for my friends.

Somehow it has been six months since Taavi and I launched Wikis World, dubbed "a Mastodon server for wiki enthusiasts".

Given that milestone, it's time for me to come clean: I do not like microblogging. I don't like tweets nor toots nor most forms of character-limited posting. I'm a print journalist by training and mentality (there's a reason this blog has justified text!); I'd so much rather read your long-form blog posts and newsletters. I want all the nuance and detail that people leave out when microblogging.

But this seems to be the best option to beat corporate social media, so here I am, co-running a microblogging server. Not to mention that I'm attempting to co-run accounts for two projects I've basically dedicated the past decade of my life to: @MediaWiki and @Wikipedia.

Anyways, here are some assorted thoughts about running a Mastodon server.

Content moderation

I feel like I have a good amount of "content moderation" experience from being a Wikipedia administrator, and my conclusion is that I don't like it (what a trend, I promise there are actually things I like about this) and, more importantly, that I'm not very good at it. For the first few months I read literally every post on the Wikis World local timeline, analyzing whether it was okay or problematic. This was, unsurprisingly, incredibly unhealthy for me, and once I realized how unhappy I was, I stopped doing it.

Especially once we added Lucas and now AntiComposite as additional moderators, I feel a lot more comfortable skimming the local timeline with the intent of actually seeing what people are posting, not pure moderation.

This is not to eschew proactive moderation (which is still important!), just that my approach was not working for me, and honestly, our members have demonstrated that they don't really need it. Which brings me to...

Community building

I've said in a few places that I wanted Wikis World to grow organically. I never really defined what inorganic growth would be, but my rough idea was that we would build a community around/through Wikis World instead of just importing one from elsewhere. I don't think that ended up happening, but it was a bad goal that was never going to happen anyway. We have slightly under 100 accounts, but it's not like all of us are talking to and with each other. Instead, I feel like I'm in a circle of ~5-15 people, largely Wikimedians active in tech areas, who regularly interact with each other, and half of those people host their accounts elsewhere. Plus the common thread bringing everyone together is wikis, which is already an existing community!

So far I'm pretty happy with how Wikis World has grown. I have a few ideas on how to reduce signup friction and automatically hand out invites, hopefully in the next few months.

The rewarding part

It is incredibly empowering to exist in a social media space that is defined on my own terms (well, mostly). We are one of the few servers that defaults to a free Creative Commons license (hopefully not the only server). We have a culture that promotes free and open content over proprietary stuff. And when I encourage people to join the Fediverse, I know I'm bringing them to a space that respects them as individual human beings and won't deploy unethical dark patterns against them.

To me, that makes it all worth it. The fact that I'm also able to provide a service for my friends and other wiki folks is really just a bonus. Here's to six more months of Wikis World! :-)


Wikimedia Foundation layoffs

The Wikimedia Foundation is currently going through layoffs, reducing headcount by about 5%. I am disappointed that no public announcement has been made; rather, people are finding out through rumor and backchannels.

In February when I asked whether the WMF was planning layoffs at the "Conversation with the Trustees" event (see on YouTube), the response was that the WMF was anticipating a reduced budget, "slower growth", and that more information would be available in April. My understanding is that the fact that ~5% layoffs would happen has been known since at least early March.

Consider the reaction to Mozilla's layoffs from a few years ago; the broader community set up the Mozilla Lifeboat, among other things, to help people who were laid off find new jobs. Who knows if such a thing would happen now given the current economy, but it absolutely won't happen if people don't even know about the layoffs in the first place.

Layoffs also greatly affect the broader Wikimedia volunteer community, whether directly, because staff you were working with are no longer employed at the WMF, or because a project you were contributing to or even depending on now has fewer resources.

I have much more to say about what the ideal size of the WMF is and how this process unfolded, but I'll save that for another time. For now, just thanks to the WMF staff, both current and past.


One year in New York City

The first time I heard the song Welcome to New York, my reaction was something along the lines of "Eh, decent song, except she's wrong. West Coast Best Coast."

I would like to formally apologize to Taylor Swift for doubting her and state for the record that she was absolutely right.

Today is the one year anniversary of me touching down at LGA and dragging two overstuffed suitcases to Brooklyn, woefully unprepared for the cold. My sister came up on the train the next day to help me apartment hunt and mocked me for questioning why she wasn't zipping up her jacket given that it was "literally freezing" outside (narrator: it was not).

It's been a whirlwind of a year, and nothing describes New York City better than "it keeps you guessing / it's ever-changing / it drives you crazy / but you know you wouldn't change anything".

I mostly feel like a "true New Yorker" now, having gotten stuck in the subway (I love it though), walked across the Brooklyn Bridge (also love it), and e-biked up and down all of Manhattan (very much love it).

I met and photographed Chris Smalls, spoke at HOPE 2022 and rode a bunch of trains. I have tried at least 25 different boba tea shops and somehow the best one is literally the one closest to me.

New Yorkers get a bad reputation for being unfriendly, but my experience has been the opposite. From my friends (and family!) who helped me move my stuff here and find an apartment to the Wikimedia NYC folks who made me feel at home right away, everyone was super nice and excited that I moved. (Also couldn't have done it without the cabal, you know who you are.) I'm very grateful and hope I have the opportunity to pay it forward someday.

I still miss California and the West Coast mentality but I miss it less and less every day.


Measuring the length of Wikipedia articles

There was recently a request to generate a report of featured articles on Wikipedia, sorted by length, specifically the "prose size". It's pretty straightforward to get a page's length in terms of the wikitext or even the rendered HTML output, but counting just the prose is more difficult. Here's how the "Readable prose" guideline section defines it:

Readable prose is the main body of the text, excluding material such as footnotes and reference sections ("see also", "external links", bibliography, etc.), diagrams and images, tables and lists, Wikilinks and external URLs, and formatting and mark-up.

Why do Wikipedians care? Articles that are too long just won't be read by people. A little bit further down on that page, there are guidelines on page length: an article over 8,000 words "may need to be divided", over 9,000 words "probably should be divided", and over 15,000 words "almost certainly should be divided"!

Featured articles are supposed to be the best articles Wikipedia has to offer, so if some of them are too long, that's a problem!

The results

The "Featured articles by size" report now updates weekly. As of the Feb. 22 update, the top five articles are:

  1. Elvis Presley: 18,946 words
  2. Ulysses S. Grant: 18,847 words
  3. Douglas MacArthur: 18,632 words
  4. History of Poland (1945–1989): 17,843 words
  5. Manhattan Project: 17,803 words

On the flip side, the five shortest articles are:

  1. Si Ronda: 639 words
  2. William Feiner: 665 words
  3. 2005 Azores subtropical storm: 668 words
  4. Miss Meyers: 680 words
  5. Myriostoma: 682 words

In case you didn't click yet, Si Ronda is a presumed lost 1930 silent film from the Dutch East Indies. Knowing that, it's not too surprising that the article is so short!

When I posted this on Mastodon, Andrew posted charts comparing prose size in bytes vs word count vs wikitext size, showing how much of the wikitext is, well, markup, and not the words shown in the article.

Lookup tool

Creating the report was exactly what had been asked for. But why stop there? Surely people want to be able to look up the prose size of arbitrary articles that they're working to improve. Wikipedia has a few tools that provide this information (specifically the Prosesize gadget and XTools Page History), but unfortunately both implementations suffer from bugs, so I figured creating another one might be useful.

Enter prosesize.toolforge.org. For any article, it'll tell you the prose size in bytes and word count. As a bonus, it highlights exactly which parts of the article are being counted and which aren't. An API is also available if you want to plug this information into something else.

How it works

We grab the annotated HTML (aka "Parsoid HTML") for each wiki page. This format is specially annotated to make it easier to parse structured information out of wiki pages. The parsoid Rust crate makes it trivial to operate on the HTML. So I published a "wikipedia_prosesize" crate that takes the HTML and calculates the statistics.

The code is pretty simple: it's less than 150 lines of Rust.

First, we remove HTML elements that shouldn't be counted. Currently that includes:

  • inline <style> tags
  • the #coordinates element
  • elements with a class of *emplate (this is supposed to match a variety of templates)
  • math blocks, which have typeof="mw:Extension/math"
  • reference numbers (specifically the [1], not the reference itself), which have typeof="mw:Extension/ref"

Then we find all nodes that are top-level text, so blockquotes don't count. In CSS terms, we use the selector section > p. For all of those we add up the length of the text content and count the number of words (by splitting on spaces).
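
Here's a rough sketch of that algorithm in Rust. For illustration it uses the general-purpose kuchiki HTML crate rather than the parsoid and wikipedia_prosesize crates the tool actually uses, and it simplifies a few things (the *emplate class matching is omitted, for example):

use kuchiki::traits::TendrilSink;

// Illustrative re-implementation of the prose-counting steps described above.
// Returns (prose size in bytes, word count).
fn prose_stats(html: &str) -> (usize, usize) {
    let document = kuchiki::parse_html().one(html);

    // Remove elements that shouldn't count: inline style tags, the
    // coordinates element, reference markers and math blocks.
    for css in [
        "style",
        "#coordinates",
        "[typeof=\"mw:Extension/ref\"]",
        "[typeof=\"mw:Extension/math\"]",
    ] {
        // Collect matches first, then detach them from the tree.
        let matches: Vec<_> = document
            .select(css)
            .unwrap()
            .map(|el| el.as_node().clone())
            .collect();
        for node in matches {
            node.detach();
        }
    }

    // Only top-level paragraphs inside sections count as prose.
    let mut bytes = 0;
    let mut words = 0;
    for paragraph in document.select("section > p").unwrap() {
        let text = paragraph.as_node().text_contents();
        bytes += text.len();
        words += text.split_whitespace().count();
    }
    (bytes, words)
}

The real crate handles more cases than this sketch, but the overall shape is the same.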

I mentioned that the other tools have bugs: the Prosesize gadget (source) doesn't discount math blocks, inflating the size of math-related articles, while XTools (source) strips neither <style> tags nor math blocks. XTools also detects references with a regex, \[\d+\], which won't discount footnotes that use e.g. [a] (see the quick demonstration below). I'll be filing bugs against both, suggesting that they use my tool's API to keep the logic centralized in one place. I don't mean to throw shade on these implementations, but I do think it shows why having one centralized implementation would be useful.
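
Here's a quick demonstration of that gap using the regex crate (illustrative only, not XTools' actual code):

use regex::Regex;

fn main() {
    // A numeric-only pattern matches [1]-style reference markers
    // but misses lettered footnotes like [a].
    let re = Regex::new(r"\[\d+\]").unwrap();
    assert!(re.is_match("A sentence with a reference.[1]"));
    assert!(!re.is_match("A sentence with a footnote.[a]"));
    println!("only numeric reference markers are matched");
}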

Source code for the database report and the web tool is available, and contributions are welcome. :-)

Next

I hope people find this interesting and are able to use it for some other analyses. I'd be willing to generate a dataset of prose size for every article on the English Wikipedia using a database dump if people would actually make some use of it.


Upload support in mwbot-rs and the future of mwapi_errors

I landed file upload support in the mwapi (docs) and mwbot (docs) crates yesterday. Uploading files in MediaWiki is kind of complicated: there are multiple state machines to implement, multiple ways to upload files, and different options that come with each.

The mwapi crate contains most of the upload logic but it offers a very simple interface for uploading:

pub async fn upload<P: Into<Params>>(
    &self,
    filename: &str,
    path: PathBuf,
    chunk_size: usize,
    ignore_warnings: bool,
    params: P,
) -> Result<String>

This fits with the rest of the mwapi style of simple functions that try to provide the user with maximum flexibility.
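
To show how this might look from the caller's side, here's a rough usage sketch based on the signature above. The client setup is omitted, and the exact form that converts into Params is a guess on my part (I'm assuming a list of key-value pairs works), so treat it as illustrative rather than copy-pasteable:

use std::path::PathBuf;

// Illustrative only: assumes `client` is an already-constructed mwapi::Client
// and that a list of key-value pairs converts into `Params`.
async fn upload_example(client: &mwapi::Client) -> Result<(), Box<dyn std::error::Error>> {
    let result = client
        .upload(
            "Example.jpg",                      // destination filename on the wiki
            PathBuf::from("/tmp/example.jpg"),  // local file to upload
            5 * 1024 * 1024,                    // upload in 5 MiB chunks (arbitrary choice)
            false,                              // don't ignore upload warnings
            vec![("comment", "Uploading an example image")],
        )
        .await?;
    println!("upload finished: {result}");
    Ok(())
}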

On the other hand, mwbot has a full typed builder with reasonable defaults; I'll just link to the documentation instead of copying it all.

A decent amount of internal refactoring was required so that functions which previously took only key-value parameters can now also accept bytes to be uploaded as multipart/form-data. Currently only uploading from a path on disk is supported; in the future I think we should be able to make it more generic and upload from anything that implements AsyncRead.

Next steps for mwapi

This is the last set of functionality that I had on my initial list for mwapi. After the upload code gets some real-world usage, I'll feel comfortable calling it complete enough for a 1.0 stable release. There is still probably plenty of work to be done (like rest.php support maybe?), but from what I perceive a "low-level" MediaWiki API library should do, I think it's checked the boxes.

Except....

Future of mwapi_errors

It took me a while to get comfortable with error handling in Rust. There are a lot of different errors the MediaWiki API can raise, and they all can happen at the same time or different times! For example, editing a page could fail because of some HTTP-level error, you could be blocked, your edit might have tripped the spam filter, you got an edit conflict, etc. Some errors might be common to any request, some might be specific to a page or the text you're editing, and others might be temporary and totally safe to retry.

So I created one massive error type and the mwapi_errors crate was born, mapping all the various API error codes to the correct Rust type. The mwapi, parsoid, and mwbot crates all use the same mwapi_errors::Error type as their error type, which is super convenient, usually.

The problem is that they all need to use the exact same version of mwapi_errors, otherwise the Error type will be different and cause super confusing compilation errors. So if we need to make a breaking change to any error type, all four crates need to issue semver-breaking releases, even if they didn't use that functionality!

Before mwapi can get a 1.0 stable release, mwapi_errors would need to be stable too. But I am leaning in the direction of splitting up the errors crate and just giving each crate its own Error type, just like all the other crates out there do. And we'll use Into and From to convert around as needed.
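
As a rough illustration of that direction (the type and variant names below are hypothetical, not the actual mwbot-rs code), each crate would define its own error enum and implement From so that ? converts at the crate boundary:

// Hypothetical sketch of the "one error type per crate" approach.

// What a low-level API crate's error might look like:
#[derive(Debug)]
enum ApiError {
    ErrorCode { code: String, info: String },
}

// What a higher-level bot crate's error might look like, wrapping the API error:
#[derive(Debug)]
enum BotError {
    Api(ApiError),
    EditConflict,
}

// A From impl lets `?` convert automatically between the two.
impl From<ApiError> for BotError {
    fn from(err: ApiError) -> Self {
        BotError::Api(err)
    }
}

fn api_request() -> Result<String, ApiError> {
    Err(ApiError::ErrorCode {
        code: "ratelimited".into(),
        info: "Too many requests".into(),
    })
}

fn save_page() -> Result<(), BotError> {
    let _response = api_request()?; // ApiError -> BotError via From
    Ok(())
}

fn main() {
    match save_page() {
        Ok(()) => println!("saved"),
        Err(BotError::EditConflict) => println!("edit conflict, retrying"),
        Err(err) => println!("failed: {err:?}"),
    }
}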