The Lego Mirror - Tech

How many Rs are in the word strawberry?

2024-08-10T04:08:12+00:00

There's a viral Mastodon post that demonstrates a supposed deficiency in large language models (LLMs), specifically ChatGPT, because it apparently can't count the number of letters in the English word "strawberry".

> How many R's are in the word strawberry

ChatGPT: There are two "R's" in the word "strawberry."

> Are you sure?

ChatGPT: Yes, I'm sure. There are two "R's" in the word "strawberry."

[...]

> Count the Rs in the word straw and the word berry separately.

ChatGPT: Sure. Let's break it down:

* The word "straw" has 1 "R."
* The word "berry" has 2 "R's."

So, together they have 3 "R's." But in the combined word "strawberry," there are still only 2 "R's."

At first glance, that's a big yikes. (I was able to reproduce the initial result using ChatGPT 4o.)

But I get annoyed by this kind of post for two main reasons: 1) if you ask slightly differently, ChatGPT can easily get this question right, and 2) it's a missed opportunity to explain how LLMs work, specifically with regards to tokenization.

Here's the prompt I used with ChatGPT 4o:

> Write and then run a Python program to determine how many R's are in the word strawberry

ChatGPT: The word "strawberry" contains 3 occurrences of the letter 'R'.

It's not really well known that ChatGPT has a feature to execute Python code in a sandbox and use the result in answers.

The Python code it generated was:

# Python program to count the number of 'R's in the word 'strawberry'

word = "strawberry"
count_Rs = word.lower().count('r')  # Convert the word to lowercase and count 'r'
count_Rs

The code is pretty simple, and I think we'd all agree that this is a place where Python code can quite accurately answer this question. Having the LLM generate the code and execute it nicely sidesteps the underlying issue.

Of course, the problem is you have to prompt it to use Python in the first place, knowing that this is a subject where an LLM without code execution would fail. Which brings me to...

When you provide a prompt for an LLM, it breaks down each word into tokens. OpenAI provides a visualization tool for this; you can see that using the GPT-4 tokenizer, the string "strawberry" is broken down into "str", "aw" and "berry", which get transformed to numerical values of [496, 675, 15717]. Or in other words, the LLM doesn't have access to individual characters, it just sees the tokens.
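To make that concrete, here's a minimal Python sketch of the gap between what the model receives and what counting letters requires. The token pieces and IDs are the ones quoted above; the helper names are made up for illustration, and nothing here calls a real tokenizer.

```python
# Token pieces and IDs for "strawberry" as quoted above (GPT-4 tokenizer).
TOKENS = ["str", "aw", "berry"]
TOKEN_IDS = [496, 675, 15717]


def model_view(token_ids):
    """What the model is handed: a list of integers, with no letters in it."""
    return list(token_ids)


def count_letter(pieces, letter):
    """Counting a letter needs the characters behind the tokens."""
    return "".join(pieces).lower().count(letter.lower())


per_token = {piece: piece.count("r") for piece in TOKENS}

print(model_view(TOKEN_IDS))      # [496, 675, 15717]
print(per_token)                  # {'str': 1, 'aw': 0, 'berry': 2}
print(count_letter(TOKENS, "r"))  # 3
```

The integers alone say nothing about which letters are inside each token, which is why generating and running code gets around the problem.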

In this case, you can try to bypass the tokenizer by using spaces, so each letter is now its own word and tokenized separately:

> How many r characters are in the letters s t r a w b e r r y

ChatGPT: The word "strawberry" contains 3 instances of the letter "r."

Success, though I've had mixed results with this approach in the past.

There are plenty of problems, both technically and ethically, with LLMs and I don't want to be (or even appear to be) an apologist for them. But I think it's important that we all understand how LLMs work to actually tackle and address those problems.

Side quest: creating a "main" tool

2024-07-18T00:57:56+00:00

I like Simon Willison's framing of using large language models (aka LLMs, aka "AI") to enable side quests of things you wouldn't normally do.

Could I have done this without LLM assistance? Yes, but not nearly as quickly. And this was not a task on my critical path for the day—it was a sidequest at best and honestly more of a distraction.

So, yesterday's side quest: writing a tool that checks out the default branch of a Git repository, regardless of what it's named.

Context: most of my work these days happens on GitHub, which involves creating PRs off the main branch, which means I'm frequently going back to it, via git checkout main and then usually a git pull to fast-forward the branch.

But just to make things a little more interesting, the SecureDrop server Git repository's main branch is named develop, which entirely screws with muscle memory and autocomplete. Not to mention all the older projects that still use a master branch.

For a while now I've wanted a tool that just checks out the main branch, regardless of what it's actually named, and optionally pulls it and stashes pending changes.

I asked Claude 3.5 Sonnet for exactly that:

I want a Rust program named "main" that primarily checks out the main branch of a Git repository (or master if it's called that).

I want to invoke it as:

main - just checkout the main branch

main stash - stash changes, then checkout main, then pop the stash

main pull - checkout main and then git pull

main stash pull or main pull stash - stash changes, checkout main, then pull, then pop the stash

It was mostly there, except it hardcoded the main and master branches instead of looking the name up via Git. I asked:

Is there a smarter way to determine the main branch? What if it's called something other than main or master?

And it adjusted to checking git symbolic-ref refs/remotes/origin/HEAD, which I didn't know about.
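For the curious, the lookup can be sketched in a few lines of Python. This is not the Rust code from the tool, just an illustration of the same idea. It assumes the remote is named origin and that refs/remotes/origin/HEAD is set (it usually is after a clone); the function names are hypothetical.

```python
import subprocess

# Assumption: the remote is called "origin".
REMOTE_PREFIX = "refs/remotes/origin/"


def parse_default_branch(symbolic_ref_output):
    """Turn 'refs/remotes/origin/develop' into 'develop'."""
    ref = symbolic_ref_output.strip()
    if not ref.startswith(REMOTE_PREFIX):
        raise ValueError(f"unexpected ref: {ref!r}")
    return ref[len(REMOTE_PREFIX):]


def default_branch(repo_dir="."):
    """Ask Git which branch origin/HEAD points at."""
    result = subprocess.run(
        ["git", "symbolic-ref", "refs/remotes/origin/HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the ref isn't set
    )
    return parse_default_branch(result.stdout)
```

Keeping the parsing separate from the subprocess call means the string handling can be checked without a repository on hand.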

I cleaned up the argument handling a little bit, added --version and published it on Salsa.

It took me about 5-10 minutes for this whole process, which according to xkcd is an efficiency positive (saves 1 second, but I do it ~5 times a day) over 5 years.
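The back-of-the-envelope math, as a quick sketch using the numbers above (365 days a year is my simplification):

```python
SECONDS_SAVED_PER_USE = 1
USES_PER_DAY = 5
YEARS = 5
DAYS_PER_YEAR = 365  # simplification: ignores leap days and days off

total_seconds = SECONDS_SAVED_PER_USE * USES_PER_DAY * DAYS_PER_YEAR * YEARS
total_minutes = total_seconds / 60

print(total_seconds)         # 9125
print(round(total_minutes))  # 152
```

So roughly two and a half hours saved over 5 years, against 5-10 minutes spent.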

It probably would've taken me 2-3x as long without using an LLM, but honestly, I'm not sure I would've ever overcome the laziness to write something so small.

Anyways, so far I haven't really gotten around to writing about my experiences and feelings about LLMs yet, so here's literally the smallest piece of work to kick that off.

Running the ArchiveTeam Warrior under Podman

2024-07-08T03:48:43+00:00

I'm finally back on an unlimited internet connection, so I've started running the ArchiveTeam Warrior once again.

The Warrior is a software application for archiving websites in a crowdsourced manner, especially when there's a time crunch when a website announces that it's closing or planning to delete things. Currently the default project is to archive public Telegram channels.

Historically the Warrior was distributed as a VirtualBox appliance, which was a bit annoying to run headlessly and was unnecessarily resource intensive because it required full virtualization. But they now have a containerized version that is pretty trivial to set up.

Relatedly, I've recently been playing with Podman's "Quadlet" functionality, which I really, really like. Instead of needing to create a systemd service to wrap running a container, you can specify what you want to run in a basically systemd-native way:

[Unit]
Description=warrior

[Container]
Image=atdr.meo.ws/archiveteam/warrior-dockerfile

PublishPort=8001:8001

Environment=DOWNLOADER=<your name>
Environment=SELECTED_PROJECT=auto
Environment=CONCURRENT_ITEMS=4

AutoUpdate=registry

[Service]
Restart=on-failure
RestartSec=30
# Extend Timeout to allow time to pull the image
TimeoutStartSec=180

[Install]
# Start by default on boot
WantedBy=multi-user.target default.target

I substituted in my username and dropped this into ~/.config/containers/systemd/warrior.container, ran systemctl --user daemon-reload and systemctl --user start warrior and it immediately started archiving! Visiting localhost:8001 should bring up the web interface.

You can then run systemctl --user cat warrior to see what the generated .service file looks like.

The AutoUpdate=registry line tells podman-auto-update to automatically fetch image updates and restart the running container. You'll likely need to enable/start the timer for this, with systemctl --user enable podman-auto-update.timer.

The one thing I haven't figured out yet is gracefully shutting down, which is important to avoid losing unfinished data. I suspect the Restart=always is harmful here, since I do want to explicitly shut down in some cases.

P.S. I also have an infrequently updated Free bandwidth wiki page that contains other suggestions for how to use your internet connection for good.

Update (2024-07-14): I changed the restart options to Restart=on-failure and RestartSec=30, which fixes the issue with restarting immediately after a graceful shutdown and correctly restarting if it starts up before networking is ready.

Basic anti-abuse monitoring for Mastodon

2024-06-26T00:12:52+00:00

Back in February, Mastodon and the connected Fediverse faced a spam attack caused by unattended instances with an open signup policy. Bots quickly registered accounts and then sent spammy messages that were relayed through the Fediverse.

It was annoying and the normal moderation tool of limiting or blocking entire instances wasn't effective since the attackers were coming from a wide set of places. Since then people have developed shared blocklists that you can subscribe to, but that has its own downsides.

So here's the tool I developed that we used for wikis.world: masto-monitor.

The workflow is straightforward:

Poll the federated timeline for all public posts
Check them against a manually curated list of patterns
If there's a match, report it using the API

This allows us to have an automated process checking all incoming posts while still enabling humans to make any moderation decisions.

The code itself is straightforward enough that it doesn't really merit much explanation. The matching logic is very basic: it just looks for substring matches. I think the approach is worth developing further, allowing people to write more expressive rules/filters that trigger automated reports.
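As an illustration only (this is not masto-monitor's code), the matching step might look something like this in Python. The function names and the example patterns are hypothetical, and the matching is case-sensitive, which is an assumption on my part.

```python
def find_matches(post_text, patterns):
    """Return every curated pattern that appears as a substring of the post."""
    return [pattern for pattern in patterns if pattern in post_text]


def should_report(post_text, patterns):
    """A post gets reported for human review if any pattern matches."""
    return bool(find_matches(post_text, patterns))


# Hypothetical patterns, standing in for a manually curated list.
patterns = ["buy followers", "spam.example"]

print(should_report("Visit https://spam.example/ now", patterns))  # True
print(should_report("Just a normal post", patterns))               # False
```

The report only flags the post; a human still makes the moderation decision.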

But, I'm not planning to do so myself since we don't currently have a need, so people are welcome to fork it to enhance it.

Implementing search for my blog in WebAssembly

2024-06-08T05:18:24+00:00

If you visit my blog (most likely what you're reading now) and have JavaScript enabled, you should see a magnifying glass in the top right, next to the feed icon. Clicking it should open up a search box that lets you perform a very rudimentary full-text search of all of my blog posts.

It's implemented fully client-side using Rust compiled to WebAssembly (WASM), here's all the code I added to implement it.

At a high level, it:

Splits all blog posts into individual words, counts them, and dumps it into a search index that is a JSON blob.
Installs a click handler (using JavaScript) that displays the search bar and lazy-loads the rest of the WASM code and search index.
Installs an input handler (using WASM) that takes the user's input, searches through the index, and returns up to 10 matching articles.

The search algorithm is pretty basic: it gives one point per word occurrence in the blog post, and 5 points if the word is in the title or a tag. Then it sorts by score and, if there's a tie, by most recently published.

There's no stemming or language processing, the only normalization that happens is treating everything as lowercase.
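Here's a rough Python sketch of that scoring scheme. The real implementation is in Rust and its index layout is different; the entry structure, the function names and the whitespace tokenizer below are all assumptions made for illustration. Publication dates are assumed to be ISO strings so they sort chronologically.

```python
from collections import Counter


def tokenize(text):
    """Only normalization: lowercase, then split on whitespace."""
    return text.lower().split()


def build_entry(title, tags, body, published):
    """One index entry per post (hypothetical layout)."""
    return {
        "title": title,
        "title_words": set(tokenize(title)),
        "tags": {tag.lower() for tag in tags},
        "counts": Counter(tokenize(body)),
        "published": published,
    }


def score(entry, query_words):
    """1 point per occurrence in the body, 5 if the word is in the title or a tag."""
    total = 0
    for word in query_words:
        total += entry["counts"][word]
        if word in entry["title_words"] or word in entry["tags"]:
            total += 5
    return total


def search(index, query, limit=10):
    """Return up to `limit` titles, best score first, newest first on ties."""
    words = tokenize(query)
    scored = [(score(entry, words), entry) for entry in index]
    hits = [(points, entry) for points, entry in scored if points > 0]
    hits.sort(key=lambda hit: (hit[0], hit[1]["published"]), reverse=True)
    return [entry["title"] for _, entry in hits[:limit]]


index = [
    build_entry("Rust notes", ["rust"], "rust rust wasm", "2024-01-01"),
    build_entry("Search", ["wasm"], "rust", "2024-06-08"),
]
print(search(index, "rust"))  # ['Rust notes', 'Search']
```

In the example, "Rust notes" scores 7 (two occurrences plus the title bonus) and "Search" scores 1.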

I've played with WASM before but this is the first time I've actually deployed something using it. As much as I enjoyed writing it in Rust, the experience left something to be desired. I had to use a separate tool (wasm-bindgen) and load a pre-built JavaScript file first that then let me initialize the WASM code.

The payload is also ...heavy:

search.js: 5.53kB (23.63kB before gzip)
search_bg.wasm: 53.78kB (122.82kB before gzip)
search_index.json: 323.13kB (322.76kB before gzip)

I'm not sure why the index compresses so poorly with Apache, locally it goes down to 100kB. (I had briefly considered using a binary encoding like MessagePack but thought it wouldn't be more efficient than JSON after compression.) And of course, the more I write, the bigger the index gets, so it'll need to be addressed sooner rather than later. I think any pure-JavaScript code would be much much smaller than the WASM bundle.

Advent of Code 2022, in Rust

2023-01-04T09:06:06+00:00

There's a yearly programming contest called Advent of Code (AoC). If you haven't heard about it, I'd recommend reading betaveros's post explaining what makes it unique.

This was my third attempt at AoC, previously trying it in 2019 (made it to day 5) and 2021 (day 6). This year I made it to... drumroll ...day 14! I had a good time this year, primarily because a group of friends (read: wiki folks on Mastodon) were doing it every day, so I'd be motivated to be able to compare my solution with their own.

Then on day 15 at midnight I looked at the puzzle and said "nope." and went to sleep.

AoC definitely messed with my sleep schedule being on EST and starting the puzzles at midnight rather than the 9 p.m. back in PST. Once I finished each puzzle, it always took a while to calm down from the rush and by then I'm sleeping at least an hour later than I should've been.

But since I was starting as soon as the puzzle came out on most days, the leaderboard accurately reflects how long it took me on those puzzles:

      --------Part 1---------   --------Part 2---------
Day       Time    Rank  Score       Time    Rank  Score
 14   00:35:44    2411      0   00:40:21    1977      0
 13   00:30:11    1920      0   00:38:08    1735      0
 12   23:09:41   34803      0   23:24:54   33874      0
 11   00:28:01    1435      0   01:01:03    2707      0
 10   00:15:40    2657      0   00:27:38    1841      0
  9   02:34:24   15092      0   02:56:58   11213      0
  8   00:36:38    6896      0       >24h   61768      0
  7   00:34:54    2671      0   00:45:38    2924      0
  6   00:08:31    5046      0   00:10:01    4555      0
  5   00:16:09    1720      0   00:17:34    1375      0
  4   00:08:33    3667      0   00:10:10    2539      0
  3   14:34:00   82418      0   22:00:31   92084      0
  2   14:27:16  100430      0   14:47:19   94770      0
  1   17:13:27  112294      0   17:16:09  107095      0

Day 5 was my best performance, I attribute that to the input format requiring a more-complex-than-usual parser, which I sidestepped by cleaning up the input in my editor first.

I posted a link to each day's solution and some commentary on a Mastodon thread. All of my solutions are available in a Git repo.

Overall I enjoyed doing the challenges in Rust. I feel that a good amount of the puzzles just required basic string/array manipulation, which are faster to do in a dynamically typed language like Python, but there were plenty of times I felt Rust's match statement (which Python now sort of has...) and sum types came in handy. Specifically with Rust's match statement, the compiler will complain if you don't satisfy some branch, which helped when e.g. implementing the rock-paper-scissors state machine.

As far as learning goes, I picked up some CS concepts like Dijkstra's algorithm. I'm not sure I really learned any more Rust, just got more comfortable with the concepts I already knew and likely faster at applying them. For the past few months I feel like I'm now thinking in Rust, rather than thinking in Python and writing it in Rust.

Past puzzles are available indefinitely, so you can do them whenever you want. I don't plan on finishing the rest, I mostly lost the incentive now that it's no longer a daily thing. But I'll probably try again in December and see how far I go :-)

Current thoughts on Mastodon

2022-11-11T06:14:34+00:00

The majority of posts on Mastodon right now are about how to get started, discussions about various features or making fun of the dumpster fire that is the birdsite. This is mostly unavoidable as new people sign up, but I've tried to keep using Mastodon as an actual social network by not posting about "meta" things unless necessary. That said, I have enough thoughts that I should say something, so here it is.

I first wrote about Mastodon in February 2020 in "Twitter's tools help online harassers" (I was probably one of the first people to ever get their Mastodon handle in newsprint!), examining the flaws of Twitter that Mastodon has attempted to fix. I think this framing, "a better Twitter", is a good first introduction but misses the bigger picture.

In 2005, Wikipedia co-founder Jimmy Wales gave a talk in which he outlined 10 areas for us to free.

His final item was "Free the Communities!":

Source

(His slide concludes by promoting Wikicities, later renamed Wikia, later renamed Fandom. Given that communities had to escape Wikia, I'd say that didn't end up freeing them. A topic for another day...)

I've been unable to find a working video of his talk, but the general point is clear: people should be in charge of their communities, not companies. We should dictate the terms of who we include and exclude, what we find acceptable for people to say in our spaces, and most importantly, how we make those decisions. We shouldn't need to report trolls to opaque content moderators who can't make correct decisions because they lack sufficient context; we should just boot them ourselves. It's incredibly empowering to be in communities that have agency to make these decisions for themselves.

(Tangent: this is a good time to plug Mako's 2018 LibrePlanet keynote, "How markets coopted free software’s most powerful weapon", discussing how companies monetized "peer production" features. In this case Twitter is monetizing our posts, thoughts, experiences, commentary, etc., relying on the masses for content and curation.)

Mastodon kind of gets us to running our own communities, though it's far from perfect. I think it's a much better representation of how online communities have historically worked, you have groups of people who are tied together by some common interest (some project, geographic location, etc.) but have open doors so you can easily be in multiple communities at the same time. Shoving everyone into one space... I don't think it really worked out that well.

It will take some time for people to unlearn the bad habits that Twitter continually reinforced. There's some meta discussion happening on how journalists should engage on Mastodon (some instances have already started blocking the new journa.host). I attribute this friction to switching from using social media to drive up engagement to the established culture on Mastodon to actually engage with people! I think it's entirely doable, in the past I reported on Googleville developments, Elsevier negotiations, and a bunch of other things on Mastodon without getting a single complaint.

The hardest part of Mastodon is finding the "right" server (read: community) to join. There's probably a good chance the server doesn't even exist yet! Given that you hit this problem as you try to sign up and don't actually know how anything works yet, the UX is baaaaaad. (No, I don't have any proposals to fix this, I just think it's important to acknowledge that this is a significant hurdle to onboard new people.)

Like most other community-based projects, I expect the UX will improve gradually over time through careful refinement and feedback from a large and diverse group of users. Getting through the poor UX now is merely an investment in the future. Many servers have also been struggling with the rapid increase in people signing up and posting, so some performance/scaling improvements are hopefully in order.

Is Mastodon ready for the masses? Probably not yet, but now is a great time to try.

It's very easy to bash on Twitter these days, but I will certainly miss it once it's gone. I have a lot of fun memories associated with conversations I had with people on the site (the time I got "expelled" from Debian) or cool things that I posted there (being featured in the Computer History Museum)! One of my heroes, Dr. Harry Edwards, once tweeted out an editorial I co-wrote. Oh and the time we reported on a Tesla hitting someone riding a Lime Scooter and it briefly went viral as "the most Silicon Valley thing ever". That said, these are all because of the people, not because of any unique feature of the platform.

Techdirt covered how Twitter previously had a very strong free speech stance, especially when it came to protecting users' anonymity. One of the downsides of having small community-run instances is that they have much less legal infrastructure and protection. How many Mastodon server administrators would have simply given in when faced with state demands for private user data? Or been able to assemble a legal team to put up a winning defense?

I sometimes forget how ingrained Twitter is in our current society and infrastructure. I went to look up the Caltrain timetable yesterday and to get service alerts if a train will be more than 5 minutes late you have to check @CaltrainAlerts on Twitter. Or get updates on whether you should evacuate because of a fire, you check Twitter.

Because of its federated nature, I don't think Mastodon can (currently) replace something that's so dependent on real-time updates. And I doubt most organizations/sites that are currently using Twitter can implement their own website or app or whatever to provide instant notifications in a manner that is as usable as Twitter.

I'm very excited to see where Mastodon goes next. More than the software, I have thrived in free communities for years now and hope even more people can experience the liberation that comes from joining one.

I'm putting my money and time where my mouth is by co-adminning a Mastodon server for wiki enthusiasts. We're growing rather slowly (about 1 new account per day), which I hope will help build a real community instead of just importing one from somewhere else. If you need help, contact me by whatever means we normally use, I'm very happy to help.

Learning Rust, week 5

2020-07-13T06:29:33+00:00

I skipped writing a post for week 4 and then didn't do any Rust-related things for a week, so this is my week 5 update.

The main (published) Rust I've written since my last post is a port of my w.wiki statistics Toolforge tool. It reads through compressed plaintext dumps, parses URLs and aggregates counts per-domain to make a nice table. I used the flate2 crate for decompressing gzipped files and then the std::io::BufRead trait to read a file line-by-line.
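The shape of that pipeline, sketched in Python rather than Rust (so gzip and urllib stand in for flate2 and BufRead). This assumes one URL per line, which may not match the real dump format, and the function names are hypothetical.

```python
import gzip
from collections import Counter
from urllib.parse import urlsplit


def count_domains(lines):
    """Aggregate a per-domain count from an iterable of URL lines."""
    counts = Counter()
    for line in lines:
        url = line.strip()
        if not url:
            continue
        host = urlsplit(url).hostname
        if host:  # lines that don't parse as a URL with a host are skipped
            counts[host] += 1
    return counts


def count_domains_in_dump(path):
    """Stream a gzipped plaintext dump line by line without loading it all."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_domains(handle)
```

Calling count_domains(...).most_common() then gives the rows for the table, largest domain first.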

It also has a slow-to-load chart that shows the increase in total shortened URLs since the start of the service. After looking through a few different plotting libraries, I ended up using plotters because it could properly chart timescale graphs. I think the graphs created by the charts crate look prettier but it wasn't flexible enough for this dataset. The chart is slow to load on Toolforge because it reads ~60 cache files, needing to hit NFS for each one.

I want to move the cache to redis, but the primary Rust redis library doesn't support having an automatic key prefix so I might end up writing a wrapper to do that.
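The wrapper idea is simple enough to sketch. This Python version uses a plain dict as a stand-in for the redis connection, and the class name, method names and the ":" separator are all hypothetical; it only shows the shape of a key-prefixing wrapper, not any real redis client API.

```python
class PrefixedStore:
    """Prepend a fixed prefix to every key before touching the backend."""

    def __init__(self, backend, prefix):
        self._backend = backend  # stand-in: anything dict-like
        self._prefix = prefix

    def _key(self, key):
        return f"{self._prefix}:{key}"

    def set(self, key, value):
        self._backend[self._key(key)] = value

    def get(self, key, default=None):
        return self._backend.get(self._key(key), default)


backend = {}
store = PrefixedStore(backend, "mytool")
store.set("total", 42)
print(backend)             # {'mytool:total': 42}
print(store.get("total"))  # 42
```

Callers never see the prefix, so two tools sharing one backend can't clobber each other's keys as long as their prefixes differ.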

In the future I want to provide charts for the individual domains and maybe a listing of recently shortened links for each domain, we'll see.

Because of how rocket's template system wants its structs to be serde-serializable, it becomes really straightforward to create a JSON API for every template-based endpoint. I had written a whole library (flask-dataapi) for this in Python, and now it's basically built-in.

I also submitted two OAuth2-related patches to Rust crates:

mediawiki - Support authenticating with an OAuth2 access token
rocket_oauth2 - Add Wikimedia as a provider

In terms of documentation, I've spent a decent amount of time improving my Rust on Toolforge wiki page, including some updates that came after debugging with other Rust users on IRC. I think it's in a state that we can link to it from the official Toolforge docs.

Learning Rust, week 3

2020-06-23T09:56:48+00:00

I'm a little behind with this update as it took me a little longer to prepare and launch my project: diff-libraries. I'm planning to write a more detailed post on that later, but it's my most involved Rust project so far. The webserver is powered by Rocket and it uses diesel for the SQL backend. I'm using Tera templates, which feel like Jinja2 templates, but are missing some of the nice Flask integrations like url_for.

I've gotten a pretty good setup going with GitLab-CI now: rust-ci-pipeline (the name and setup are inspired by what Debian has). There are still a few problems with cargo-tarpaulin segfaulting every now and then but I haven't been able to debug it yet.

I published my first real library crate too: eventstreams (docs). It's a wrapper around Wikimedia's new recent changes feed. I think the fact that docs.rs automatically builds documentation for every single library on crates.io, with no extra action needed is a real game changer. Even the auto-generated documentation is super useful, and it makes authors more likely to fill in the documentation knowing that someone will actually read it.

My goal for this past week was to learn async, and I kind of did: I ported subdown3 to be all async. I think I get the basics, but eh, over it for now. The fact that reqwest didn't let me incrementally migrate from its blocking mode to the async version was frustrating, because it meant I had to port the entire codebase over to async before I could even get any of it to run to verify I was heading in the right direction.

Code written/released:

diff-libraries (~430 lines)
eventstreams (~420 lines)
rustc-simple-version (~45 lines)
gerrit-grr 2.1.0 and 2.2.0

Libraries used:

Concepts learned:

async/await (beginner)
std::thread
build.rs
visibility in libraries/modules (pub)

Next week:

I want to have a better understanding of lifetimes, especially with regards to threads.
@janriemer gave me some tips on mocking libraries that I still need to look into (thanks!).

Learning Rust, week 2

2020-06-14T02:49:21+00:00

I think I'm starting to understand why people like Rust so much. The tooling, especially rustup and cargo, is pretty fantastic. The fact that rustfmt (code auto-formatter), clippy (linter) and a test runner are all integrated through cargo is super convenient. I feel like Python used to have that with setuptools/setup.py but over time that's been lost.

This week I ported my Gerrit helper grr to Rust, and wrote a Reddit downloader tool, subdown3, that I originally wrote in Python nearly a decade ago. subdown3 has straightforward command-line options and primarily deals with URL parsing and hitting various APIs. grr is a convenience wrapper around git that just shells out.

I'm also hosting my Rust projects on GitLab, primarily to take advantage of its CI features (which I don't feel like setting up for git.legoktm.com). I've been using cargo-tarpaulin to generate coverage for tests, which has been simple. No extra configuration or anything, you just run it.

One thing I've been struggling with is figuring out how to mock functions. Because grr primarily shells out to git, integration testing isn't that useful, but testing what exactly we're shelling out to is more useful.
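For comparison, this is roughly what that kind of test looks like in Python, where unittest.mock makes it easy. It isn't a Rust answer, just a sketch of the goal: assert on the exact command being shelled out to without running it. The checkout function is a hypothetical stand-in, not something from grr.

```python
import subprocess
from unittest import mock


def checkout(branch):
    """Hypothetical helper that shells out to git."""
    subprocess.run(["git", "checkout", branch], check=True)


def test_checkout_command():
    """Replace subprocess.run with a mock and inspect what it was called with."""
    with mock.patch("subprocess.run") as fake_run:
        checkout("main")
    fake_run.assert_called_once_with(["git", "checkout", "main"], check=True)
```

Git is never executed here; the test passes or fails purely on the arguments that would have been passed to it.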

Code written:

subdown3 (~750 lines)
grr (~425 lines)