Clawed: A Language Model Written in Scratch

Mark Riedl
Nov 12, 2024

This blog post started as a joke. Claude is the name of the large language model from Anthropic. It also sounds like “clawed”, which seems like the perfect name for a language model written in the Scratch programming language.

Scratch is a visual programming language. It would be ridiculous (though not impossible) to write a deep neural network in Scratch. A large language model would be out of the question. But what about a small language model?

The term large language model refers to a language model that is implemented as a deep neural network and is “large” in the sense that it has more than a billion parameters. If you want to know more about deep neural networks and large language models, you can see this Very Gentle Introduction to Large Language Models Without the Hype.

But let’s step back a second. A language model is a very simple concept: it is the probability of a word, given one or more words that came before it. Consider the following sequence of words: “Sailing the deep blue”. What word should come next? There is a high probability of a word like “sea” because the phrase “deep blue sea” occurs a lot in English writing. Less likely would be the word “avocado” — we just don’t see “Sailing the deep blue avocado” occur very often.

Mathematically, we say a language model is:

P(word_n | word_1, word_2, …, word_n-1)

which can be read as “the probability of the n-th word, given a series of words 1 through n-1.”
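For the earlier example, the model assigns one number to P(“sea” | “sailing”, “the”, “deep”, “blue”) and another to P(“avocado” | “sailing”, “the”, “deep”, “blue”), and a good language model assigns a much higher probability to the first.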

There are many ways to learn this probability. One way is to go and start counting word co-occurrences. Another way is to develop a sophisticated neural network that approximates that probability function. Either way, you end up with a language model.

If a language model is just a probability distribution over words given other words, what is to stop us from counting some words in Scratch and learning a language model? Nothing.

But first, there are different types of language models based on how many words you want to be able to keep track of. A very very simple language model is called the unigram language model. “Unigram” means “one word”. The unigram language model is the probability of a word without tracking any previous words. It is simply the frequency of every word in a dataset. There is not much you can do with a unigram word model — it’s too simple.

A slightly less simple type of language model is the bigram language model, the probability of a word given the word that immediately precedes it. “Bigram” means two words: the word you are trying to guess and the one previous word. In a bigram language model, you have a one word clue that helps you guess the next word. This is how early versions of autocomplete work.

I will show how a unigram language model and a bigram language model work in Scratch.

The Training Data

First, we are going to need some training data. Training data is just words. It can be a single document, many documents concatenated together, or many separate documents. For simplicity, we will use the first verse of Lewis Carroll’s The Jabberwocky, further simplified by lowercasing all letters and removing all punctuation.

twas brillig and the slithy toves did gyre and gimble in the wabe all mimsy were the borogoves and the mome raths outgrabe

Let’s create a variable document, and give it the above string.

The training data is a single string, the first verse from Lewis Carroll’s The Jabberwocky.

This is a useful example because it has a bunch of unique words, but also a few common words such as “and” and “the”. This will make the probability computations easy enough to visually inspect and debug if necessary.

Data Structures

We are going to need some data structures. Scratch doesn’t give us a lot of options. Aside from variables, Scratch only provides lists. We will spend a lot of time working around this limitation. For now, we will create five lists:

  • tokens: this list will contain each word in a training document in order. The document string will need to be broken into unique words. When we do this, we refer to each word as a token.
  • vocabulary: this list will contain each unique word (the document contains some repeated words). While the order of the words doesn’t matter, the index number of each word will matter. This allows us to translate back and forth between words as strings and words as numbers. For example, if “twas” is in position 1, then the number 1 and the string “twas” are equivalent.
  • word counts: The number of times each word appears in the training data. If “and” appears three times, and we put “and” in the vocabulary in position 5, then having a “3” in the 5th position of word counts would signify that “and” appears three times.
  • unigram model: The unigram model is the probability of each word appearing in the training data.
  • bigram model: the bigram model is the probability of each possible pair of words (called a bigram) appearing in the training data. For example “and the” appears twice, “twas brillig” appears once, and “brillig twas” appears zero times.

Having made these lists, we should make sure they are all empty at the beginning of the program.

Our five lists are empty.
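For readers who want to follow along outside of Scratch, here is a rough Python sketch of the same setup (the names are illustrative, and note that Scratch lists count positions from 1 while Python lists count from 0):

```python
# The training data: the first verse, lowercased, punctuation removed.
document = ("twas brillig and the slithy toves did gyre and gimble in the wabe "
            "all mimsy were the borogoves and the mome raths outgrabe")

tokens = []         # every word in the document, in order
vocabulary = []     # each unique word; its position is that word's number
word_counts = []    # word_counts[i] = how many times vocabulary[i] appears
unigram_model = []  # unigram_model[i] = probability of vocabulary[i]
bigram_model = []   # flattened N x N table of bigram probabilities
```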

Parsing the Data

Our training data is a string of characters, but we need a list of words (tokens). We are going to have to do some work to break this string of characters into words. We will use the simplifying assumption that words are delimited by spaces.

We will benefit from having a helper function, Found a word. When we find a word, we need to add that word to the token list. If we have never seen that word before, we need to add it to the vocabulary. If we have seen that word before, we need to increment its word count.

We found a word, keep track of it.

Now we need to go find some words. Let’s create a new variable called word. As we work through the document, we will add to this word letter by letter until we get to a space. When we get to a space, we will use our handy helper function above, clear the word, and start over. We will do this until we run out of letters in the training document.

Parsing the training data into words (tokens).

(But there will be one left-over word at the end, because the last word is not followed by a space, so we handle it separately after the loop.)
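In Python terms, the Found a word helper and the parsing loop might look roughly like the sketch below (the same logic, not the actual Scratch blocks):

```python
def found_a_word(word):
    # Record the token, and either add it to the vocabulary or bump its count.
    tokens.append(word)
    if word in vocabulary:
        word_counts[vocabulary.index(word)] += 1
    else:
        vocabulary.append(word)
        word_counts.append(1)

# Walk through the document letter by letter, splitting on spaces.
word = ""
for letter in document:
    if letter == " ":
        found_a_word(word)
        word = ""
    else:
        word = word + letter
found_a_word(word)  # handle the left-over word after the last space
```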

One final touch: we will add two special tokens to the vocabulary.

  • “SOS” will refer to the “start of sequence”.
  • “EOS” will refer to the “end of sequence”. We will also add “EOS” to the end of the token list.

These won’t be particularly useful in the unigram model, but will be very helpful in generating from the bigram model.

Keeping track of the start of sequence and end of sequence.
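Continuing the sketch, the special tokens could be added like so (I give them counts of zero here since they only matter for the bigram model; the Scratch version may handle this slightly differently):

```python
vocabulary.append("SOS")   # start of sequence
word_counts.append(0)
vocabulary.append("EOS")   # end of sequence
word_counts.append(0)
tokens.append("EOS")       # mark the end of the token list
```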

Train and Run the Unigram Model

Now we will train the unigram model and then generate text by sampling from it. We will create separate blocks to do each of these. Sampling from the language model should populate a new variable, generated.

Training the Unigram Model

To train the unigram model, we will simply compute the probability that each word appears in the training data. The probability of a word appearing in the training data is the frequency of its appearance divided by the total number of words that appear. We have already gathered the word counts, so all we need to do is sum up the total number of word appearances and then divide each word count by that sum.

Training the unigram model.
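The equivalent computation in Python is short (a sketch of the idea rather than the Scratch blocks): divide each word count by the total number of word appearances.

```python
# Probability of each word = its count / total number of word appearances.
total = sum(word_counts)
unigram_model = [count / total for count in word_counts]
```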

Now we know the probability of each word. It looks like this:

Word 4 appears 20% of the time. Word 4 is “the” if we look it up in our vocabulary list.

Generating from the Unigram Model

To generate from a unigram language model, we just pick words based on how probable they are. The way we will do this is to produce a random number between 0 and 1. This is our target. We will then iterate through our unigram model, accumulating probability “mass” until we hit our target. This works because high probability words take up more space in the span of 0 to 1 so we are more likely to land on them and pick them.

Generating from the unigram model.

We will collect up all the words and put them in the generated variable, which is the return value.
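Here is the same sampling idea as a Python sketch: draw a random target between 0 and 1, then walk through the unigram probabilities, accumulating mass until the running total passes the target (the count of ten generated words is just an arbitrary choice for illustration).

```python
import random

def sample_unigram():
    # Accumulate probability mass until we pass a random target in [0, 1).
    target = random.random()
    accumulated = 0.0
    for i, probability in enumerate(unigram_model):
        accumulated += probability
        if accumulated >= target:
            return vocabulary[i]
    return vocabulary[-1]  # guard against floating-point round-off

# Build the generated string out of, say, ten sampled words.
generated = " ".join(sample_unigram() for _ in range(10))
```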

Running the model doesn’t result in a particularly satisfying output.

Unigram models aren’t very coherent.

Train and Run the Bigram Model

We should be able to do better with a bigram model. The bigram model allows a single previous word to provide a hint as to what should come next. Once again, sampling from the language model should populate the generated variable.

Training the Bigram Model

The bigram model is the probability of a word once we’ve seen the previous word. The word “the” is likely to follow the word “and”. It is also possible for “gimble” to follow “and”. It is not possible for “brillig” to follow “and” according to our data.

What makes the bigram model tricky is that we need probabilities given two words, but Scratch lists cannot be nested to create matrices like in other programming languages.

The bigram model will be a list with N x N elements, where N is the number of words. Each index will refer to the probability of two words back-to-back. For example, index 1 could be “twas twas”, index 2 could be “twas brillig” and so on.

Let’s set that up by adding N x N zeros to the bigram model list.

Initializing the bigram model.

It’s going to be a pain in the butt to keep things straight in a 1-D array that is essentially a flattened 2-D array. Let’s make some helper functions to help us translate back and forth between pairs of words and indices.

This block will take two words and figure out which index represents that bigram. The bigram index variable is the return value.

This block will take a bigram index and figure out what two words make up the bigram. Word1 and word2 are return values.
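In Python, the flattened table and the two helper blocks might look like the following sketch (using 0-based indices, whereas the Scratch version counts positions from 1):

```python
N = len(vocabulary)            # number of unique words, including SOS and EOS
bigram_model = [0] * (N * N)   # one slot per possible pair of words

def words_to_bigram_index(word1, word2):
    # Row = previous word, column = next word, flattened into one index.
    return vocabulary.index(word1) * N + vocabulary.index(word2)

def bigram_index_to_words(index):
    # The inverse: recover (previous word, next word) from a flat index.
    return vocabulary[index // N], vocabulary[index % N]
```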

The next step in the Compute bigram model block is to compute the bigram counts. This is how many times each possible pair of words appears in the training data. We need to track two words at a time, previous word and word. We will start off previous word as “SOS” because we have to have something. Then we walk through the token list, and for every pair of back-to-back words, we figure out the corresponding index and increment the value there.
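Here is a Python sketch of that counting loop, using the index helper from above:

```python
# Count every back-to-back pair of tokens, starting from "SOS".
previous_word = "SOS"
for word in tokens:
    bigram_model[words_to_bigram_index(previous_word, word)] += 1
    previous_word = word
```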

It would be really helpful if we could see what was going on. Let’s visualize the bigram for each index in the bigram model in a special list called bigrams for debugging. This won’t do anything useful except to let us more easily make sense of the bigram model.

Here is what we have so far:

We can easily see that “and the” appears twice, while “and gimble” appears once. Most possible bigrams like “and slithy” do not appear at all.

Unfortunately, we don’t want bigram counts, we want bigram probabilities. Like before, we convert counts into probabilities, but this time we do it for each previous word separately: we sum the counts of all bigrams that begin with that word and divide each of those counts by the sum, so that the probabilities of every word that can follow a given word add up to 1.
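In Python, that per-row normalization might look like this sketch (rows with no counts are skipped to avoid dividing by zero):

```python
# Convert each row of counts into probabilities that sum to 1.
for i in range(N):
    row_total = sum(bigram_model[i * N : (i + 1) * N])
    if row_total > 0:
        for j in range(N):
            bigram_model[i * N + j] /= row_total
```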

Now we have a proper bigram model containing the probability of each bigram:

The bigram “and the” has a probability of 66.6% and the bigram “and gimble” has a probability of 33.3%.

Generating from the Bigram Model

To generate from a bigram model, we need to sample from all possible words that follow another word. That means we need to start with a word. But what word? How about “SOS” — all sequences should start with that. We set the previous word to “SOS”. Now we can figure out what word is most likely to follow “SOS”. Once we figure out that next word, we will make it the previous word and guess the word after that. We will do that until we either generate the target number of words, or until we generate “EOS”, which tells us we have a complete sequence.

While we haven’t generated an “EOS”, we will sample just like in the unigram situation. We will use our helper words to bigram index to jump to the right part of the bigram list and iterate through all words until we hit our random target. As before, certain bigrams take up more space on the number line between 0 and 1 so we are more likely to stop on high-probability bigrams. We add whatever word we hit to generated, and make the new word the previous word.
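Putting it together, a Python sketch of the whole generation loop (assuming the helpers above and an arbitrary cap on the number of generated words) might look like this:

```python
import random

def sample_next_word(previous_word):
    # Sample from the row of the bigram model that belongs to previous_word.
    target = random.random()
    accumulated = 0.0
    row_start = vocabulary.index(previous_word) * N
    for j in range(N):
        accumulated += bigram_model[row_start + j]
        if accumulated >= target:
            return vocabulary[j]
    return "EOS"  # guard against floating-point round-off

previous_word = "SOS"
generated_words = []
while len(generated_words) < 20:  # arbitrary cap on generation length
    word = sample_next_word(previous_word)
    if word == "EOS":
        break
    generated_words.append(word)
    previous_word = word

generated = " ".join(generated_words)
```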

Now if we run the bigram model, we get output that looks much more plausible:

We might get a long or a short generation. Since the training data is so simple, we can more easily see what is going on. Starting with “SOS”, the only successor we have ever seen is “twas”, followed by “brillig”, followed by “and”. The data says we have a 100% probability of that sequence. Once we generate “and”, we have a 66.6% probability of generating “the” and a 33.3% chance of generating “gimble”. If we generate “the”, we have a 25% chance of generating each of “slithy”, “borogoves”, “mome”, and “wabe”. If we ever generate “mome”, it is 100% followed by “raths”, then “outgrabe”, then “EOS”, at which point we are done generating. But other words give some probability of generating “and” again, so we might go around in circles a few times.

What about Trigrams and Beyond?

What about a trigram model? A trigram model is the probability of a word based on the prior two words. I haven’t implemented a trigram model. It would be much more annoying to deal with a 3-D matrix of probabilities flattened into a 1-D list. But it would look a lot like the bigram model, except a trigram model would require N x N x N entries, where N is the number of unique words. As we increase the size of the n-gram, the amount of data we need to store grows exponentially.

We can think about what a trigram model would do on the first verse of The Jabberwocky. Each word triple appears exactly once in the data. Thus, starting with “SOS”, we would be able to recreate The Jabberwocky perfectly because our data is embarrassingly simple.

A large language model like Claude or ChatGPT can process n-grams where n is in the thousands. Let’s think about the lesson of the trigram on The Jabberwocky. If a language model can compute the probability of a word given a thousand preceding words, then a large language model should theoretically be able to recreate large swaths of its training data. And that is what we see, to some extent, with the largest of the large language models. They have been shown to be able to output entire paragraphs of books that they were trained on.

Another way to think about the scale of large language models is that nearly every question we might ask has likely appeared in some variation in the training data, and the large language model can provide a memorized response. There is some variation because most questions we might ask have probably been asked in a few different ways, putting the model somewhere between our bigram model on The Jabberwocky and a hypothetical trigram model on The Jabberwocky.
