Speeding up Toolforge tools with Redis

Over the past two weeks I significantly sped up two of my Toolforge tools by using Redis, a key-value database. The two tools, checker and shorturls were slow for different reasons, but now respond instantaneously. Note that I didn't do any proper benchmarking, it's just noticably faster.

If you're not familiar with it already, Toolforge is a shared hosting platform for the Wikimedia community build entirely using free software. A key component is providing web hosting services so developers can build all sorts of tools to help Wikimedians with really whatever they want to do.

Toolforge provides a Redis server (see the documentation) for tools to use for key-value caching, pub/sub, etc. One important security note is that this is a shared service for all Toolforge users to use, so it's especially important to prefix your keys to avoid collisions. Depending on what exactly you're storing, you may want to use a cryptographically-random key prefix, see the security documentation for more details.

Redis on Toolforge is really straightforward to take advantage of for caching, and that's what I want to highlight.

checker#

"checker"

Visit the tool — Source code

checker is a tool that helps Wikisource contributors quickly see the proofread status of pages. The tool was originally written as a Python CGI script and I've since lightly refactored it to use Flask and jinja2 templates.

On each page load, checker would make a database query to get the list of all available wikis, and then an additional query to get information about the selected wiki and an API query to get namespace information. This data is basically static, it would only change whenever a new wiki is created, which is rare.

<+bd808> I think it would be a lot faster with a tiny bit of redis cache mixed in

I used the Flask-Caching library, which provides convenient decorators to cache the results of Python functions. Using that, adding caching was about 10 lines of code.

To set up the library, you'll need to configure the Cache object to use tools-redis.

from flask import Flask
from flask_caching import Cache
app = Flask(__name__)
cache = Cache(
    app,
    config={'CACHE_TYPE': 'redis',
            'CACHE_REDIS_HOST': 'tools-redis',
            'CACHE_KEY_PREFIX': 'tool-checker'}
)

And then use the @cache.memoize() function for whatever needs caching. I set an expiry of a week so that it would pick up any changes in a reasonable time for users.

shorturls#

"shorturls"

Visit the tool — Source code

shorturls is a tool that displays statistics and historical data for the w.wiki URL shortener. It's written in Rust primarily using the rocket.rs framework. It parses dumps, generates JSON data files with counts of the total number of shortened URLs overall and by domain.

On each page load, shorturls generates an SVG chart plotting the historical counts from each dump. To generate the chart, it would need to read every single data file, over 60 as of this week. On Toolforge, the filesystem is using NFS, which allows for files to be shared across all the Toolforge servers, but it's sloooow.

<+bd808> but this circles back to "the more you can avoid reading/writing to the NFS $HOME, the better your tool will run"

So to avoid reading 60+ files on each page view, I cached each data file in Redis. There's still one filesystem call to get the list of data files on disk, but so far that seems to be acceptable.

I used the redis-rs crate combined with rocket's connection pooling. The change was about 40 lines of code. It was a bit more invovled because redis-rs doesn't have any support for key prefixing nor automatic (de)serialization so I had to manually convert to/from JSON.

The data being cached is immutable, but I still set a 30 day expiry on it, just in case I change the format or cache key, I don't want the data to sit around forever in the Redis database.

Conclusion#

Caching mostly static data in Redis is a great way to make your Toolforge tools faster if you are reguarly making SQL queries, API requests or filesystem reads that don't change as often. If you need help or want tips on how to make other Toolforge tools faster, stop by the #wikimedia-cloud IRC channel or ask on the Cloud mailing list. Thanks to Bryan Davis (bd808) for helping me out.



Learning Rust, week 5

I'm skipped writing a post for week 4 and then didn't do any Rust related things for a week, so this is my week 5 update.

The main (published) Rust I've written since my last post is a port of my w.wiki statistics Toolforge tool. It reads through compressed plaintext dumps, parses URLs and aggregates counts per-domain to make a nice table. I used the flate2 crate for decompressing gzipped files and then the std::io::BufRead trait to read a file line-by-line.

It also has a slow-to-load chart that shows the increase in total shortened URLs since the start of the service. After looking through a few different plotting libraries, I ended up using plotters because it could properly chart timescale graphs. I think the graphs created by the charts crate look prettier but it wasn't flexible enough for this dataset. The chart is slow to load on Toolforge because it reads ~60 cache files, needing to hit NFS for each one.

I want to move the cache to redis, but the primary Rust redis library doesn't support having an automatic key prefix so I might end up writing a wrapper to do that.

In the future I want to provide charts for the individual domains and maybe a listing of recently shortened links for each domain, we'll see.

Because of how rocket's template system wants its structs to be serde-serializable, it becomes really straightforward to create a JSON API for every template-based endpoint. I had written a whole library (flask-dataapi) for this in Python, and now it's basically built-in.

I also submitted two OAuth2-related patches to Rust crates:

In terms of documentation, I've spent a decent amount of time improving my Rust on Toolforge wiki page, including some updates that came after debugging with other Rust users on IRC. I think it's in a state that we can link to it from the official Toolforge docs.