The Wikimedia Foundation is currently going through layoffs, reducing headcount by about 5%. I am disappointed that no public announcement has been made; instead, people are
finding out through rumor and backchannels.
In February when I asked whether the WMF was planning layoffs at the "Conversation with the Trustees" event (see on YouTube), the response
was that the WMF was anticipating a reduced budget and "slower growth", and that more information would be available in April. My understanding is that the fact that ~5% layoffs would happen has been known since at least early March.
Consider the reaction to Mozilla's layoffs from a few years ago; the broader community set up the Mozilla Lifeboat, among other things to help find new jobs for people who were laid off. Who knows
if such a thing would happen now given the current economy, but it absolutely won't happen if people don't even know about the layoffs in the first place.
Layoffs also greatly affect the broader Wikimedia volunteer community, whether it's directly in that staff you were working with are no longer employed at the WMF, or that a project you were contributing to or even depending on now has an uncertain future.
I have much more to say about what the ideal size of the WMF is and how this process unfolded, but I'll save that for another time. For now, just thanks to the WMF staff, both current and past.
The first time I heard the song Welcome to New York, my reaction was something along the lines of "Eh, decent song, except she's wrong. West Coast Best Coast."
I would like to formally apologize to Taylor Swift for doubting her and state for the record that she was absolutely right.
Today is the one year anniversary of me touching down at LGA and dragging two overstuffed suitcases to Brooklyn, woefully unprepared for the cold. My sister came up on the train the next day to help me apartment hunt and mocked me
for questioning why she wasn't zipping up her jacket given that it was "literally freezing" outside (narrator: it was not).
It's been a whirlwind of a year, and nothing describes New York City better than "it keeps you guessing / it's ever-changing / it drives you crazy / but you know you wouldn't change anything".
I mostly feel like a "true New Yorker" now, having gotten stuck in the subway (I love it though), walked across the Brooklyn Bridge (also love it), and e-biked up and down all of Manhattan (very much love it).
I met and photographed Chris Smalls, spoke at HOPE 2022
and rode a bunch of trains. I have tried at least 25 different boba tea shops and somehow the best one is literally the one closest to me.
New Yorkers get a bad reputation for being unfriendly, but my experience has been the opposite. From my friends (and family!) that helped me move my stuff here and find an apartment to the Wikimedia NYC folks who made me feel at home right away,
everyone has been super nice and excited that I moved. (Also couldn't have done it without the cabal, you know who you are.) I'm very grateful and hope I have the opportunity to pay it forward someday.
I still miss California and the West Coast mentality but I miss it less and less every day.
There was recently a request
to generate a report of featured articles on Wikipedia, sorted by length, specifically the "prose size". It's pretty straightforward to get
a page's length in terms of the wikitext or even the rendered HTML output, but counting just the prose is more difficult. Here's how the "Readable prose"
guideline section defines it:
Readable prose is the main body of the text, excluding material such as footnotes and reference sections ("see also", "external links", bibliography, etc.), diagrams and images, tables and lists, Wikilinks and external URLs, and formatting and mark-up.
Why do Wikipedians care? Articles that are too long just won't be read by people. A little bit further down on that page, there are guidelines on page length.
If it's more than 8,000 words it "may need to be divided", more than 9,000 words it "probably should be divided", and more than 15,000 words it "almost certainly should be divided"!
Featured articles are supposed to be the best articles Wikipedia has to offer, so if some of them are too long, that's a problem!
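Those guideline thresholds amount to a small lookup. A hedged sketch in Rust (the function name and exact boundary handling are my own, not part of the guideline):

```rust
// Hypothetical helper mapping the guideline's word-count thresholds to its
// advice. How exactly 8,000/9,000/15,000 words are treated at the boundary
// is my guess; the guideline says "more than".
fn division_advice(words: u32) -> &'static str {
    if words > 15_000 {
        "almost certainly should be divided"
    } else if words > 9_000 {
        "probably should be divided"
    } else if words > 8_000 {
        "may need to be divided"
    } else {
        "length alone does not justify division"
    }
}
```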
The "Featured articles by size" report now updates weekly. As of the Feb. 22 update, the top five articles are:
- Elvis Presley: 18,946 words
- Ulysses S. Grant: 18,847 words
- Douglas MacArthur: 18,632 words
- History of Poland (1945–1989): 17,843 words
- Manhattan Project: 17,803 words
On the flip side, the five shortest articles are:
- Si Ronda: 639 words
- William Feiner: 665 words
- 2005 Azores subtropical storm: 668 words
- Miss Meyers: 680 words
- Myriostoma: 682 words
In case you didn't click yet, Si Ronda is a presumed lost 1930 silent film from the Dutch East Indies. Knowing that, it's not too surprising that the article is so short!
When I posted this on Mastodon, Andrew posted charts comparing prose size in bytes vs word count vs wikitext size, showing how much of the wikitext is, well, markup, and not the words
shown in the article.
Creating the report covered exactly what had been asked for. But why stop there? Surely people want to be able to look up the prose size of arbitrary articles that they're working to improve.
Wikipedia has a few tools that provide this information (specifically the Prosesize gadget and XTools Page History), but unfortunately
both implementations suffer from bugs, so I figured creating another might be useful.
Enter prosesize.toolforge.org. For any article, it'll tell you the prose size in bytes and word count. As a bonus, it highlights exactly which parts of the article are being counted and which
aren't. An API is also available if you want to plug this information into something else.
How it works
We grab the annotated HTML (aka "Parsoid HTML") for each wiki page. This format is specially annotated to make it easier to parse structured information out of wiki pages.
The parsoid Rust crate makes it trivial to operate on the HTML. So I published a "wikipedia_prosesize"
crate that takes the HTML and calculates the statistics.
The code is pretty simple; it's less than 150 lines of Rust.
First, we remove HTML elements that shouldn't be counted. Currently this is:
- elements with a class matching `*emplate` (this is supposed to match a variety of templates)
- math blocks, identified by their Parsoid annotations
- reference numbers (specifically the footnote markers like `[1]`, not the references themselves), also identified by their Parsoid annotations
Then we find all nodes that are top-level text, so blockquotes don't count. In CSS terms, we use the selector
`section > p`. For all of those we add up the length of the text content
and count the number of words (by splitting on spaces).
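Given text already extracted from those top-level paragraph nodes, the counting step itself is tiny. A self-contained sketch (illustrative only; the real `wikipedia_prosesize` crate operates on the Parsoid HTML tree):

```rust
// Hedged sketch: total up bytes and words for already-extracted paragraph
// text. Splits on whitespace, which is close to the post's "splitting on
// spaces". Returns (byte count, word count).
fn prose_stats(paragraphs: &[&str]) -> (usize, usize) {
    let mut bytes = 0;
    let mut words = 0;
    for p in paragraphs {
        bytes += p.len();
        words += p.split_whitespace().count();
    }
    (bytes, words)
}
```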
I mentioned that the other tools have bugs: the Prosesize gadget (source)
doesn't discount math blocks, inflating the size of math-related articles, while XTools (source)
discounts neither `<style>` tags nor math blocks. XTools also detects references with the regex
`\[\d+\]`, which won't discount footnotes that use e.g.
`[a]`. I'll be filing bugs against both, suggesting that they use my tool's API to keep the logic centralized in one place. I don't mean to throw shade on these
implementations, but I do think it shows why having one centralized implementation would be useful.
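To illustrate the footnote-matching gap without pulling in a regex engine, here's a minimal check equivalent to `\[\d+\]` (the function name is mine) showing why alphabetic markers slip through:

```rust
// Matches only numeric footnote markers like "[12]", the same shape the
// regex \[\d+\] accepts. Alphabetic markers like "[a]" are not matched,
// which is exactly the gap described above.
fn is_numeric_footnote(s: &str) -> bool {
    s.len() >= 3
        && s.starts_with('[')
        && s.ends_with(']')
        && s[1..s.len() - 1].chars().all(|c| c.is_ascii_digit())
}
```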
Source code for the database report and the web tool
are both available, and contributions are welcome. :-)
I hope people find this interesting and are able to use it for some other analyses. I'd be willing to generate a dataset of prose size for every article on the English Wikipedia using a database dump if people
would actually make use of it.
I landed file upload support in the
mwapi (docs) and
mwbot (docs) crates yesterday. Uploading files in MediaWiki is kind of complicated: there are multiple state machines to
implement, multiple ways to upload files, and different options that come with each.
The `mwapi` crate contains most of the upload logic but offers a very simple interface for uploading:

```rust
pub async fn upload<P: Into<Params>>(
    // ...parameters elided...
) -> Result<String>
```
This fits with the rest of the
mwapi style of simple functions that try to provide the user with maximum flexibility.
On the other hand,
`mwbot` has a full typed builder with reasonable defaults; I'll just link to the documentation instead of copying it all.
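For flavor, here's what a typed builder with reasonable defaults generally looks like; the type and field names below are hypothetical, not mwbot's actual API:

```rust
// Hypothetical upload builder illustrating the pattern: required data up
// front, optional settings with defaults, chained setters.
#[derive(Debug)]
struct UploadBuilder {
    filename: String,
    comment: Option<String>,
    ignore_warnings: bool, // default: false
}

impl UploadBuilder {
    fn new(filename: &str) -> Self {
        Self {
            filename: filename.to_string(),
            comment: None,
            ignore_warnings: false,
        }
    }

    fn comment(mut self, text: &str) -> Self {
        self.comment = Some(text.to_string());
        self
    }

    fn ignore_warnings(mut self) -> Self {
        self.ignore_warnings = true;
        self
    }
}
```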
A decent amount of internal refactoring was required to make things that previously took key-value parameters now accept key-value parameters plus bytes to be uploaded as
`multipart/form-data`. Currently only uploading
from a path on disk is supported; in the future I think we should be able to make it more generic and upload from anything that implements the appropriate async read trait.
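Making the source generic could look something like this sketch over `std::io::Read` (a real implementation would presumably use an async trait instead; this just shows the shape):

```rust
use std::io::Read;

// Sketch: accept any readable source and buffer its bytes for use as a
// multipart upload body. Purely illustrative; not the mwapi/mwbot API.
fn read_upload_body<R: Read>(mut source: R) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    source.read_to_end(&mut buf)?;
    Ok(buf)
}
```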
Next steps for mwapi
This is the last set of functionality that I had on my initial list for
`mwapi`; after the upload code gets some real-world usage, I'm feeling comfortable calling it complete enough for a 1.0 stable release. There
is still probably plenty of work to be done (like
`rest.php` support, maybe?), but for what I perceive a "low-level" MediaWiki API library should do, I think it's checked the boxes.
Future of mwapi_errors
It took me a while to get comfortable with error handling in Rust. There are a lot of different errors the MediaWiki API can raise, and they all can happen at the same time or different times! For example, editing a page could
fail because of some HTTP-level error, you could be blocked, your edit might have tripped the spam filter, you got an edit conflict, etc. Some errors might be common to any request, some might be specific to a page or the
text you're editing, and others might be temporary and totally safe to retry.
So I created one massive error type, and the
`mwapi_errors` crate was born, mapping all the various API error codes to the correct Rust type. The
`mwbot` crates all use the same `Error` type
as their error type, which is super convenient, usually.
The problem is that they all need to use the exact same version of
`mwapi_errors`, otherwise the `Error` type will be different and cause super confusing compilation errors. So if we need to make a breaking change to any
error type, all 4 crates need to issue semver-breaking releases, even if they didn't use that functionality!
Before `mwapi` can get a 1.0 stable release,
`mwapi_errors` would need to be stable too. But I am leaning in the direction of splitting up the errors crate and just giving each crate its own
`Error` type, just like all the
other crates out there do. And we'll use
`From` to convert between them as needed.
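The direction described, per-crate error types glued together with `From`, looks roughly like this (the type and variant names are made up for illustration):

```rust
// Illustrative per-crate error types: the lower-level crate defines its own
// Error, the higher-level crate wraps it, and From does the conversion so
// the ? operator works across the crate boundary.
#[derive(Debug, PartialEq)]
enum ApiError {
    EditConflict,
    Http(u16),
}

#[derive(Debug, PartialEq)]
enum BotError {
    Api(ApiError),
    PageMissing,
}

impl From<ApiError> for BotError {
    fn from(e: ApiError) -> Self {
        BotError::Api(e)
    }
}

// With From in place, a bot-level function can use ? on API results:
fn save_page() -> Result<(), BotError> {
    let api_result: Result<(), ApiError> = Err(ApiError::EditConflict);
    api_result?;
    Ok(())
}
```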
I was intending to write a pretty different blog post about progress on mwbot-rs but...ugh. The main dependency of the `parsoid` crate,
`kuchiki`, was archived over the weekend.
In reality it's been lightly/un-maintained for a while now, so this is just reflecting reality, but it does feel like a huge setback. Of course, I only have gratitude for Simon Sapin, the
primary author and maintainer, for starting the project in the first place.
kuchiki was a crate that let you manipulate HTML as a tree, with various ways of iterating over and selecting specific DOM nodes.
`parsoid` was really just a wrapper around that, allowing you to get a typed wikilink node
instead of a plain
`<a>` tag node. Each `WikiNode` wrapped a
kuchiki::NodeRef for convenient accessors/mutators, but still allowed you to get at the underlying node via
Deref, so you could manipulate the HTML directly even if the
parsoid crate didn't know about/support something yet.
This is not an emergency by any means;
`kuchiki` is pretty stable, so in the short term we'll be fine, but we do need to find something else and rewrite
`parsoid` on top of it. Filed T327593.
I am mostly disappointed because I have cool things in the pipeline that I wanted to focus on instead. The new
`toolforge-tunnel` CLI is probably ready for a general announcement and was largely worked on by
MilkyDefer. And I also have upload support mostly done; I'm just trying to see if I can avoid
a breaking change in the underlying `mwapi` crate.
In short: ugh.
This past weekend at Wikipedia Day I had a discussion with Enterprisey and some other folks about different
ways edit counters (more on that in a different blog post) could visualize edits, and one of the things that came up was GitHub's scorecard and streaks. Then I saw a post from Jan Ainali
with a SQL query showing the people who had made an edit for every single day of 2022 to Wikidata. That got me thinking, why stop at 1 year? Why not try to find out the longest active editing streak on Wikipedia?
Slight sidebar, I find streaks fascinating. They require a level of commitment, dedication, and a good amount of luck! And unlike sports where if you set a record, it sticks, wikis are constantly changing. If you make an
edit, and months or years later the article gets deleted, your streak is retroactively broken. Streaks have become a part of wiki culture, with initiatives like 100wikidays, where people
commit to creating a new article every day, for 100 days. There's a new initiative called 365 climate edits; I'm sure you can figure out
the concept. Streaks can become unhealthy, so this should all be taken in good fun.
So... I adapted Jan's query to find users who had made at least one edit per day in the past 365 days, and then for each user, went backwards day-by-day to see when they last missed an edit. The results are...unbelievable.
Johnny Au has made at least one edit every day since November 11, 2007! That's 15 years, 2 months, 9 days and counting. Au was profiled in the Toronto Star
in 2015 for his work on the Toronto Blue Jays' page:
Au, 25, has the rare distinction of being the top editor for the Jays’ Wikipedia page. Though anyone can edit Wikipedia, few choose to do it as often, or regularly, as Au.
The edits are logged on the website but hidden from most readers. Au said he doesn’t want or need attention for his work.
“I prefer to be anonymous, doing things under the radar,” he said.
Au spends an average 10 to 14 hours a week ensuring the Blue Jays and other Toronto-focused Wikipedia entries are up to date and error-free. He’s made 492 edits to the Blue Jays page since he started in 2007, putting him squarely in the number one spot for most edits, and far beyond the second-placed editor, who has made 230 edits.
Au usually leaves big edits to other editors. Instead, he usually focuses on small things, like spelling and style errors.
“I’m more of a gatekeeper, doing the maintenance stuff,” he said.
Next, Bruce1ee (unrelated to Bruce Lee) has made at least one edit every day since September 6, 2011. That's 11 years, 4 months, 14 days and counting.
Appropriately featured on their user page is a userbox that says: "This user doesn't sleep much".
It is mind blowing to me the level of consistency you need to edit Wikipedia every day, for this long. There are so many things that could happen to stop you from editing Wikipedia (internet goes out, you go on vacation, etc.) and they
manage to continue editing regardless.
I also ran a variation of the query that only considered edits to articles. The winner there is AnomieBOT, a set of automated processes written and operated by Anomie.
AnomieBOT last took a break from articles on August 6, 2016, and hasn't missed a day since.
You can see the full list of results on-wiki as part of the database reports project: Longest active user editing streaks
and Longest active user article editing streaks. These will update weekly.
Hopefully by now you're wondering what your longest streak is. To go along with this project, I've created a new tool: Wiki streaks. Enter in a username and wiki and see all of your streaks
(minimum 5 days), including your longest and current ones. It pulls all of the days you've edited, live, and then segments them into streaks. The source code (in Rust of course)
is on Wikimedia GitLab, contributions welcome, especially to improve the visualization HTML/CSS/etc.
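The segmentation step can be sketched like this, assuming you already have a sorted, de-duplicated list of the days a user edited (represented as day numbers):

```rust
// Hedged sketch of splitting edit days into streaks: consecutive day
// numbers form one streak; streaks shorter than min_len are dropped
// (the tool uses a minimum of 5 days). Returns (start, end) day pairs.
fn streaks(days: &[i64], min_len: i64) -> Vec<(i64, i64)> {
    let mut out = Vec::new();
    if days.is_empty() {
        return out;
    }
    let (mut start, mut prev) = (days[0], days[0]);
    for &d in &days[1..] {
        if d != prev + 1 {
            // Gap found: close out the current streak.
            if prev - start + 1 >= min_len {
                out.push((start, prev));
            }
            start = d;
        }
        prev = d;
    }
    if prev - start + 1 >= min_len {
        out.push((start, prev));
    }
    out
}
```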
I think there are a lot of interesting stats out there if we keep looking at streaks of Wikipedians. Maybe Wikipedians who've made an edit every week? Every month? It certainly seems reasonable that there are people out there who've
made an edit at least once a month since Wikipedia started.
Of course, edits are just one way to measure contribution to Wikipedia. Logged actions (patrolling, deleting, blocking, etc.) are another, or going through specific processes, like getting articles promoted to "Good article" and
"Featured article" status. For projects like Commons, we coud look at file uploads instead of edits. And then what about historical streaks? I hope this inspires others to think of and look up other types of wiki streaks :-)