Running the ArchiveTeam Warrior under Podman

I'm finally back on an unlimited internet connection, so I've started running the ArchiveTeam Warrior once again.

The Warrior is a software application for archiving websites in a crowdsourced manner, especially when there's a time crunch because a website has announced that it's closing or planning to delete things. Currently the default project is to archive public Telegram channels.

Historically the Warrior was distributed as a VirtualBox appliance, which was a bit annoying to run headlessly and was unnecessarily resource intensive because it required full virtualization. But they now have a containerized version that is pretty trivial to set up.

Relatedly, I've recently been playing with Podman's "Quadlet" functionality, which I really, really like. Instead of needing to create a systemd service to wrap running a container, you can specify what you want to run in a basically systemd-native way:

[Unit]
Description=warrior

[Container]
Image=atdr.meo.ws/archiveteam/warrior-dockerfile

PublishPort=8001:8001

Environment=DOWNLOADER=<your name>
Environment=SELECTED_PROJECT=auto
Environment=CONCURRENT_ITEMS=4

AutoUpdate=registry

[Service]
Restart=on-failure
RestartSec=30
# Extend Timeout to allow time to pull the image
TimeoutStartSec=180

[Install]
# Start by default on boot
WantedBy=multi-user.target default.target

I substituted in my username, dropped this into ~/.config/containers/systemd/warrior.container, and ran systemctl --user daemon-reload followed by systemctl --user start warrior. It immediately started archiving! Visiting localhost:8001 should bring up the web interface.

You can then run systemctl --user cat warrior to see what the generated .service file looks like.

The AutoUpdate=registry line tells podman-auto-update to automatically fetch image updates and restart the running container. You'll likely need to enable and start the timer for this, with systemctl --user enable --now podman-auto-update.timer.

The one thing I haven't figured out yet is gracefully shutting down, which is important to avoid losing unfinished data. I suspect that Restart=always (the setting I started with) is harmful here, since I do want to explicitly shut down in some cases.

P.S. I also have an infrequently updated Free bandwidth wiki page that contains other suggestions for how to use your internet connection for good.

Update (2024-07-14): I changed the restart options to Restart=on-failure and RestartSec=30. This stops the container from restarting immediately after a graceful shutdown, while still restarting it correctly if it starts up before networking is ready.


Basic anti-abuse monitoring for Mastodon

Back in February, Mastodon and the connected Fediverse faced a spam attack caused by unattended instances with an open signup policy. Bots quickly registered accounts and then sent spammy messages that were relayed through the Fediverse.

It was annoying and the normal moderation tool of limiting or blocking entire instances wasn't effective since the attackers were coming from a wide set of places. Since then people have developed shared blocklists that you can subscribe to, but that has its own downsides.

So here's the tool I developed that we used for wikis.world: masto-monitor.

The workflow is straightforward:

  • Poll the federated timeline for all public posts
  • Check them against a manually curated list of patterns
  • If there's a match, report it using the API

This allows us to have an automated process checking all incoming posts while still enabling humans to make any moderation decisions.

The code itself is straightforward enough that it doesn't really merit much explanation. The matching logic is very basic: it just looks for substring matches. I think the approach is worth developing further, allowing people to write more expressive rules/filters that trigger automated reports.
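As a rough illustration of the substring check described above, here is a minimal sketch in Python. It is not the masto-monitor code: the function name `find_matches`, the case-sensitive comparison, and the sample patterns and posts are all assumptions made up for this example, and the polling and reporting steps are left out entirely.

```python
def find_matches(post_text: str, patterns: list[str]) -> list[str]:
    """Return every pattern that occurs as a plain substring of the post."""
    return [pattern for pattern in patterns if pattern in post_text]


# Hypothetical patterns and posts, made up for this example.
patterns = ["spam.example", "buy followers"]
posts = [
    "Check out https://spam.example/deal right now",
    "Just published a new blog post about Podman",
]

for post in posts:
    hits = find_matches(post, patterns)
    if hits:
        # A real tool would file a report through the API at this point,
        # leaving the moderation decision to a human.
        print(f"would report: {post!r} (matched {hits})")
```

Because the check is a plain `in` test, a pattern matches anywhere in the post, including inside URLs. That keeps the rules easy to write, but it is also why more expressive filters would be a natural next step.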

But, I'm not planning to do so myself since we don't currently have a need, so people are welcome to fork it to enhance it.


#wikimedia-rust Matrix to IRC bridge is back

tl;dr: You can now chat in #wikimedia-rust:matrix.org and reach folks on IRC

Nearly a year ago, the official Libera.Chat <-> Matrix bridge was shut down. There's a lot that went wrong in the technical and social operation of the bridge that the Libera.Chat staff have helpfully documented, but from a community management perspective the bridge, when it worked, was fantastic.

But, we now have a bridge back in place! Bridgebot is a deployment of matterbridge, which is primarily used in Wikimedia spaces for bridging IRC and Telegram, but it also speaks Matrix reasonably well.

Thanks to Bryan Davis for starting/maintaining the bridgebot project and Lucas Werkmeister for deploying the change; we now have a bot that relays comments between IRC and Matrix.

12:12:50 <wm-bb> [matrix] <legoktm> ooh, I think the Matrix bridging is working now
12:12:58 <legoktm> o/

And if you look at the view.matrix.org logs, those messages are there too.

So if you're interested in or working on Wikimedia-related things that are in Rust, please join us, in either IRC or Matrix or both :)


Implementing search for my blog in WebAssembly

If you visit my blog (most likely what you're reading now) and have JavaScript enabled, you should see a magnifying glass in the top right, next to the feed icon. Clicking it should open up a search box that lets you perform a very rudimentary full-text search of all of my blog posts.

It's implemented fully client-side using Rust compiled to WebAssembly (WASM); here's all the code I added to implement it.

At a high level, it:

  1. Splits all blog posts into individual words, counts them, and dumps it into a search index that is a JSON blob.
  2. Installs a click handler (using JavaScript) that displays the search bar and lazy-loads the rest of the WASM code and search index.
  3. Installs an input handler (using WASM) that takes the user's input, searches through the index, and returns up to 10 matching articles.
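Step 1 can be sketched roughly as follows. This is written in Python for brevity rather than the Rust the blog actually uses, and the index layout (a list of per-post entries with a `words` count map), the `\w+` tokenization, and the name `build_index` are assumptions for illustration, not the real format.

```python
import json
import re
from collections import Counter


def build_index(posts: list[dict]) -> str:
    """Count lowercase words per post and serialize the result as JSON."""
    entries = []
    for post in posts:
        # Assumed tokenization: runs of letters, digits, and underscores.
        words = re.findall(r"\w+", post["body"].lower())
        entries.append({
            "title": post["title"],
            "words": dict(Counter(words)),
        })
    return json.dumps(entries)


# Hypothetical post, made up for this example.
index = build_index([{"title": "Hello", "body": "Rust and WASM and rust"}])
print(index)
```

Storing counts rather than the full text keeps the index to one entry per distinct word in each post, which is all a simple occurrence-based ranking needs.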

The search algorithm is pretty basic: it gives one point per word occurrence in the blog post, and 5 points if the word is in the title or a tag. Then it sorts by score, and if there's a tie, by most recently published.

There's no stemming or language processing; the only normalization that happens is treating everything as lowercase.
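The ranking described above can be sketched like this, again in Python rather than the blog's Rust. The field names, the ISO date strings, the choice to sum scores across the words of a multi-word query, and awarding the 5-point bonus once per word are all assumptions; the post only specifies the per-word points, the tie-break, and the limit of 10.

```python
def search(query: str, posts: list[dict], limit: int = 10) -> list[str]:
    """Rank posts for a query and return the titles of the best matches."""
    terms = query.lower().split()
    scored = []
    for post in posts:
        title_words = post["title"].lower().split()
        tags = [tag.lower() for tag in post["tags"]]
        score = 0
        for term in terms:
            # One point per occurrence of the word in the post body.
            score += post["words"].get(term, 0)
            # Five points if the word is in the title or is a tag.
            if term in title_words or term in tags:
                score += 5
        if score > 0:
            scored.append((score, post["published"], post["title"]))
    # Highest score first; ties go to the most recently published post.
    # ISO 8601 dates sort correctly as plain strings.
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [title for _, _, title in scored[:limit]]


# Hypothetical index entries, made up for this example.
posts = [
    {"title": "Old Podman notes", "tags": [], "published": "2023-01-01",
     "words": {"podman": 2}},
    {"title": "Quadlet", "tags": ["podman"], "published": "2024-06-01",
     "words": {"podman": 2}},
    {"title": "New thing", "tags": [], "published": "2024-07-01",
     "words": {"podman": 1}},
]
print(search("Podman", posts))
```

In this sample the first two posts tie on score (two occurrences plus the title or tag bonus), so the newer one is listed first, and the post that only mentions the word once comes last.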

I've played with WASM before but this is the first time I've actually deployed something using it. As much as I enjoyed writing it in Rust, the experience left something to be desired. I had to use a separate tool (wasm-bindgen) and load a pre-built JavaScript file first that then let me initialize the WASM code.

The payload is also ...heavy:

  • search.js: 5.53kB (23.63kB before gzip)
  • search_bg.wasm: 53.78kB (122.82kB before gzip)
  • search_index.json: 323.13kB (322.76kB before gzip)

I'm not sure why the index compresses so poorly with Apache; locally it goes down to 100kB. (I had briefly considered using a binary encoding like MessagePack but thought it wouldn't be more efficient than JSON after compression.) And of course, the more I write, the bigger the index gets, so it'll need to be addressed sooner rather than later. I think any pure-JavaScript code would be much smaller than the WASM bundle.
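One way to sanity-check what gzip should achieve on a JSON index is to compress it locally and compare sizes. The sketch below uses a synthetic stand-in index (made up for this example, not the real search_index.json), so the numbers it prints say nothing about the actual file.

```python
import gzip
import json

# Synthetic stand-in for a word-count index; entirely made up.
index = [
    {"title": f"Post {i}", "words": {"podman": i, "rust": i + 1}}
    for i in range(500)
]

raw = json.dumps(index).encode("utf-8")
compressed = gzip.compress(raw)
print(f"{len(raw)} bytes raw, {len(compressed)} bytes gzipped")
```

If the locally compressed size is far below what the server delivers, that points at the server-side compression setup rather than at the data itself.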


Wiki burnout

Yeah, I burned out from wiki things. I flew too close to the sun, tried to take on too many projects and cool ideas and then crashed and let people down. I'm sorry.

I participate in wiki communities because, aside from believing in free knowledge, etc., it's really fun. And then it stopped being fun, so I just...stopped. In some respects it was nice: I spent my time doing a lot of other IRL things (I bought a bike), but I also missed doing wiki things.

And so I'm slowly starting to get back into things. I'll try to get back to everyone...eventually. I'm not really sure what's next in my queue. I'm trying to not jump back into what I was doing previously because that doesn't really solve the burnout problem but also I owe people some work.

I'll be at the "All-Day Hacking Sunday" this weekend in New York City. It should be fun, and I hope to hang out with people there.


Updates to my blog

It's been nearly 10 years since I created this blog and for that entire time it's been using the Pelican static site generator. It's been pretty good, but lately I've wanted to improve and change some things, so I've taken the opportunity to rewrite it mostly from scratch.

b2 is a Rust program that takes an input folder of markdown files and spits out a complete HTML directory of the blog. The templates and theme are taken from my fork of pelican-sober. For the most part nothing has changed; I've just adjusted the CSS and improved some of the HTML output.

b2 is pretty specific to my use case, and I'm not planning to turn it into a general-purpose static site generator, though you're more than welcome to fork it for your own needs.