Solve Wordle Using Information Theory

This is a note about how to use information theory to create an optimal strategy for solving “Wordle”.

Initial ideas

For any given word, we can find a distribution for probabilities for each pattern of colors that could be revealed. Example, for “WEARY”, there is some number of words that do not contain any of the letters, some that contain W, some with W and E, some with E at the first position, etc.

We can geenrate the list of all possible outcomes for a given word, and check how many words in the list match the patterns to come up with a probability.

Key A guess is “informative” if it is unlikely, i.e., the more unlikely the pattern, the fewer possibilities remain to be checked. A highly unlikely guess, if correct, significantly cuts down the solution space.

The expected amount of information from the distribution for a certain word is

$$ E[\text{Information}] = \sum_x {p(x) \cdot \text{Something}} $$

This “something” should measure how informative that pattern is. We could try to lower the average number of matches. The fewer the number of matches for a pattern, the more informative that pattern is. Instead we can use a concept from information theory.

What is a “bit”?

An observation that can cut the space of possibilities in half, that is called a bit of information.

Therefore

p(1 bit) = 1/2
p(2 bits) = 1/4
p(3 bits) = 1/8 etc.

Or, for information $I$

$$ \left(\frac{1}{2}\right)^I = p \implies 2^I = \frac{1}{p}\ I = \log_2{\left(\frac{1}{p}\right)} $$

This is a better way of representing probability. You can say that you get 20 bits of information instead of saying that the probability of such a thing occuring is 0.00000095.

Information (or bits) add together the same way probabilities multiply, thanks to the logarithm.

Applying information theory to the problem

$$ E[\text{Information}] = \sum_x {p(x) \cdot \log_2{\left(\frac{1}{p(x)}\right)} } $$

This is the average number of bits you could get from a certain word. The higher the probability of seeing a pattern, the lower the number of bits.

This metric can be used to rank the guess in order of number of expected bits of information.

Note:

**This expected value of information is also called “entropy” ($H$). This was a suggestion by von Neumann to Claude Shannon. While this has some connection to the entropy from thermodynamics, Shannon was dealing purely with probability distribution when he used the term. **

Strategy

First give a guess based on the word that gives the highest expected number of bits of information. Then for the next guess use the more restricted list of words based on the result of the first guess, and so on.

The solver can be run across every single word in the word-list to work out an average score.

A better strategy?

Assign a probability for each word for being in the final list based on frequency of use in the English language instead of a uniform distribution. Grant ranks them by frequency and uses a sigmoid function to assign probabilities. Use this to measure the “remaining uncertainty” of the possible word candidates left. So even if the number of possible words is higher, the uncertainty would be lower thanks to the weighted distribution.

Expected Score

TBD.

References

[1] The mathematically optimal Wordle strategy, 3Blue1Brown