Developing an AI system with Markov chains

Sure-Fire Success

© Lead Image © ct,

© Lead Image © ct,

Article from Issue 205/2017

Markov chains model systems that jump from state to state with predetermined probabilities, but can they help write new columns like this one after learning from previously written articles?

Hard to believe, but true: This edition of my programming column marks its 20-year anniversary. Lately, I've been diving into artificial intelligence topics, a field that is increasingly replacing traditional jobs, but I wonder if a machine could possibly one day replace authors like me? An AI system would certainly have a number of advantages over a human writer, because a robot would no doubt deliver the manuscript on time every time, to the amazement of my editors, who are not fond of my bad habit of watching deadlines swoosh by. Also, I could afford to work less hard and have more time to kick back, relax, and spend my days pursuing my surfing hobby at various Pacific beaches (Figure 1).

Figure 1: Target: The author surfing in Hawaii whilst an AI system writes his column.

Moody like the Weather

An algorithm that writes columns automatically could be implemented as a so-called Markov chain, named after the Russian mathematician Andrei Markov. Markov chains are stochastic processes with different states that change according to predetermined probabilities.

The weather forecasting model shown in Figure 2, for example, says the chance of a sunny day following a "Sun" state is 65 percent. At 30 percent, the weather will change to the "Clouds" state, and at 5 percent, it goes directly to the "Rain" state. The probabilities apply only to the current state; a change to the next state does not require knowledge of any event in the past, so the probability of a transition is very easy to calculate.

Figure 2: Markov diagram for the weather forecast, with the transition matrix in the lower right corner.

In a transition matrix such as that shown in Figure 2, the computer only needs to select the row with the current state, such as the third row for the Clouds state (C), to find out that there is a 25 percent chance of the weather being sunny, a 25 percent chance of rain, and a 50 percent chance of skies remaining cloudy. Equipped with corresponding random generators, the model jumps back and forth between states, and a later analysis of where this Monte Carlo method leads eventually allows conclusions to be drawn about the most likely weather in the future.

Sentences from the Machine

Such "random walks" can predict not only the weather, they can generate sequences of words that sound like they were written by a human author. For this purpose, a program first analyzes the word sequences used in an extensive body of text. It notes which words follow one another in a sentence, and pieces together a new text from learned phrases. For example, it might realize that, in corpus, three given words are sometimes followed by word A and sometimes by word B. When generating new prose, the algorithm will roll an imaginary die each time to determine whether the three initial words are followed by A or B, according to the probabilities in which they occur in the original text.

To start off, the algorithm separates the body of text into words (tokens) and then analyzes it step by step with a rolling window of words. The first (n – 1) words of the window determine the state of the machine at the current time, and the last word in the window is the value the machine assumes during the transition to the next state.

Rolling Window

For example, for a window size of n=3, from the text

Humpty Dumpty sat on a wall,

Humpty Dumpty had a big fall.

it extracts the sequences "Humpty Dumpty sat," "Dumpty sat on," "sat on a," and so on. The first two words (Humpty Dumpty) respectively determine the current state of the machine; the following word ("sat") is the next state value. In the second part of the nursery rhyme, the algorithm stumbles again on the "Humpty Dumpty" state. However, it sees not the word "sat," but rather the word "had" following the current state. When it later creates random text, the algorithm now has two options: It can either follow up with "sat," or it can choose "had," and it will do so according to calculated random values. The result is text that shows slight human qualities because it copies word sequences, but occasionally performs crazy jumps between contexts, which can sound either confusing or funny.

The oldest Programming Snapshot in Linux Pro Magazine dates back to 2003 (although the German edition started in 1997), and Listing 1 [1] reads all raw manuscripts cobbled together of every single issue since then in the file all.txt. The file name is first passed to the generate_from() function in line 68. The statemap_gen function starting in line 37 then splits the file into tokens with token_feed() and passes on the resulting list of the window() function from line 26, which accepts n=4 as the window width. As explained previously, the status of the Markov chain in this case is made up of the three last words; the fourth is part of the result state.

Listing 1

01 #!/usr/bin/python3
02 from nltk.tokenize import word_tokenize
03 from collections import deque
04 import re
05 import random
07 re_special=re.compile("^[,.:]$")
08 re_word=re.compile("^\w+$")
10 def token_feed(file):
11     string  = open(file, 'r').read()
13     for i in word_tokenize(string):
14         if re_special.match(i) or \
15            re_word.match(i):
16             yield(i)
18 def tokens_to_text(tokens):
19     out=""
20     for word in tokens:
21         if(not re_special.match(word)):
22             out += " "
23         out += word
24     return out
26 def window(seq, n):
27     it = iter(seq)
28     win = deque(
29       (next(it, None) for _ in range(n)),
30       maxlen=n)
31     yield win
32     append = win.append
33     for e in it:
34         append(e)
35         yield win
37 def statemap_gen(file):
38     statemap={}
39     tokens=token_feed(file)
40     for state in window(tokens,4):
41         key   = list(state)
42         value = key.pop()
43         key   = tuple(key)
45         if key in statemap:
46             statemap[key].append(value)
47         else:
48             statemap[key] = [value]
49     return statemap
51 def generate_from(file):
52     statemap=statemap_gen(file)
53     key=list(
54       random.choice(list(statemap.keys())))
55     words_new=list(key)
57     for i in range(200):
58         if not tuple(key) in statemap:
59             continue
60         next_token = random.choice(
61           statemap[tuple(key)])
62         words_new.append(next_token)
63         key.append(next_token)
64         key.pop(0)
66     return tokens_to_text(words_new)
68 print(generate_from('all.txt'))

To prepare the tokenizer of the nltk package, it needs to be installed with pip3 install nltk. Moreover, the tokenizer dataset punkt still needs to be downloaded separately in a Python shell (Figure 3). The tokenizer interprets punctuation marks, such as commas or colons (the regular expression in line 7 defines them all), as individual tokens. This approach later increases the readability and entertainment value of the random text.

Figure 3: Downloading data for the NLTK tokenizer.

Figure 4 shows how the algorithm generates a state file of the Markov chain from the nursery rhyme and the window length n=3. Note how the statemap generated offers both "sat" and "had" as a follow-up state for "Humpty Dumpty":

('Humpty', 'Dumpty'): ['sat', 'had']
Figure 4: Dividing text into tokens and creating the Markov model.

The Python code shows some of the spice of the language, such as the yield() operator in the token_feed() function in line 16: So that Python can iterate over the components of an object using a for loop, it must be of type Iterable. These objects do not spill out their elements in one go, but they pass them through one piece at a time with each call to yield().

The advantage of this process is that the script does not need to store all of the data in a construct in memory; rather, it can get away with only calculating the data when needed. Therefore, an object can contain an infinite number of elements; as long as they are never needed all at once, the illusion stays alive despite limited memory.

Python also provides a variety of data structures. Double-ended queues, called deque, as used in the window() function in lines 26-35, can be modified really fast if the code limits itself to adding or removing elements to or from the head or tail of the queue. The only flaw is that the caller of the window() function cannot manipulate the deque object returned, or the algorithm gets confused.

For this reason, line 41 uses key=list(state) to copy the elements of the deque object into a Python list, which the statemap_gen() function can modify to its heart's content. It interprets the first (n – 1) elements as a Markov state, and the last item as a possible next state.

In line 43, key is a Python tuple that can no longer be modified. The dictionary statemap maps such tuples to one or more follow-up state values. In this way, the algorithm can later look up possible follow-up state values quickly and select one randomly with random.choice() in line 60, leading the text on unpredictable paths and imbuing the algorithm with human-like traits.

The result can be seen in Figure 5. However, it demonstrates that the applied Markov chain procedure does not automatically provide exciting new magazine articles. The grammar is still lacking enormously, and the content doesn't really make any sense. This column will probably have to be created manually for the foreseeable future.

Figure 5: The AI system writes the programming column of all programming columns.

The Author

Mike Schilli works as a software engineer in the San Francisco Bay area, California. Each month in his column, which has been running since 1997, he researches practical applications of various programming languages. If you go to, he will gladly answer any questions.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Security Lessons

    Sometimes, even ING, YouTube, The New York Times, and Google get it wrong.

  • Perl: Spotify

    For a monthly fee, the Spotify streaming service beams music onto your desktop or phone. To intensify the groove, Perlmeister Mike Schilli archived his Spotify playlists for eternity using an OAuth-protected web API.

  • Index Search with Lucene

    Even state-of-the-art computers need to use clever methods to process ever-increasing amounts of document data. The open source Lucene framework uses inverted indexing for fast searches of document collections.

  • Free Software Projects

    The free high-end game, Yo Frankie, in which players steer a flying squirrel through a colorful 3D world, is almost finished. KI Research still faces major issues, but FreeHAL, a dialog program, gives users a behind-the-scenes look at the current state of affairs.

  • One-Time Passwords on the Web

    Add security to your website with a one-time password system.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95