# Using Markov Chains to Generate Test Input

One challenge that we've faced at CloudBees is how to verify that our makefile parser correctly emulates GNU Make. We started by generating test cases based on a close reading of the gmake manual. Then we turned to real-world examples: makefiles from dozens of open source projects and from our customers. After several years of this we've accumulated nearly two thousand individual tests of our gmake emulation, and yet we still sometimes find incompatibilities. We're always looking for new ways to test our parser. One idea is to generate random text and use that as a "makefile". Unfortunately, truly random text is almost useless in this regard, because it doesn't look anything like a real makefile. Instead, we can use *Markov chains* to generate random text that is very much like a real makefile. When we first introduced this technique, we uncovered 13 previously unknown incompatibilities — at the time that represented 10 of the total defects reported against the parser! Read on to learn more about Markov chains and how we applied them in practice.

### Markov chains

A Markov chain is simply a sequence of random values in which the next value is in some way dependent on the current value, rather than being completely random. Consider the case of generating random text one letter at a time, from the set of uppercase English letters (A-Z). If the sequence is completely random, then for each character generated, any letter is equally probable. Regardless of what characters you've generated up to this point, you are just as likely to get a *D* as an *X* next. Think of it as if you have a bag of tiles, one for each letter. With truly random text, you pick one tile and write down the letter on that tile. Then you return the tile to the bag and pick again. Lather, rinse, repeat until you've generated as much text as you like. But suppose we said that the probability of the next character is dependent on the last character we generated. For example, in English text if you run across the letter *Q* you can be pretty sure that the next character is going to be *U* . It's almost certainly not going to be *X* , or *Z* , etc. We can build a table that tells us the probabilty that any given letter will be followed by any other letter. For the letter *Q* , the table might look like this:

Letter | Probability of appearing after Q |

A | 1 |

E | 1 |

I | 1 |

O | 1 |

U | 96 |

All others | 0 |

We can do a much better job of generating text that looks like English if we use these tables to guide us. Imagine that instead of one bag of tiles with one tile for each letter, we have one bag for each letter, and we fill each bag with tiles according to the probabilities in our table. For example, the bag for the letter *Q* would contain 96 *U* tiles, and one tile each for *A* , *E* , *I* , and *O* . Each time we want to generate a new letter, we look at the last letter we generated, find the bag of tiles corresponding to that letter, and pick out one tile. After writing down the letter, we return the tile to the bag and repeat the process. The sequence of letters that we generate in this manner is a simple Markov chain. How do you build the probability tables? One way is to generate them from some sample input. If we have a sufficiently large example of text in the target language, we can scan it and count the number of times each letter occurs, and the number of times it is followed by each letter. Note that it is critical to have a large and varied input. If the sample is too small, the resulting probability tables won't accurately represent the target language. The generator will only be able to generate the input text itself.

### Markov chains of order *m*

Of course, we don't have to limit ourselves to considering just the single previous letter. For example, if the previous two characters are *ST* , then the next character is probably a vowel, or maybe an *R* . It's probably not going to be *D* , or *Q* , etc. Just as before, we can make a table that tells us the probability that any given pair of letters will be followed by any other letter. And we can keep going, adding more and more of the preceding letters to our formula. The number of previous characters we use is called the *order* of the chain, so if we use the previous 4 characters, we would say we have a *Markov chain of order 4* . The greater the order of the chain, the "smarter" our generator becomes, because it is considering more context when choosing letters. As you can see, the generated text looks more and more like the target language as you increase the order (although after a some point, you get little additional benefit from further increases):

Order | Result |

0 (purely random) | ehnee.Alr noer ealcra edctn eIi |

1 | Pige foule.ce d futht wrion e mara |

2 | Prookiname arg-tm aread on achivedging |

3 | Yes, and no usinession be |

4 | Project that last it make you first, moderneath. |

### Using Markov chains in testing

In order to use this technique effectively in testing, you need a couple of things besides the generator itself:

**A large sample input to seed the generator**. As noted, the bigger your sample input text, the higher the quality of the probability tables, and therefore the more varied your generated text will be. Since we are trying to generate makefiles, we used several megabytes of makefiles from a variety of open-source projects as the sample text.**An automated evaluation mechanism**. You have to be able to determine quickly and automatically if a given generated file is processed correctly or not. Of course,*correct*can mean many different things here. It might be as simple as "does not cause a program crash". In our case, it means "emake parses this makefile the same way that gmake does", so we use gmake as a*reference implementation*. Note that it doesn't matter if the generated text truly is a completely valid makefile. In fact most of the time it will not be. What matters is those cases is that emake and gmake both report the*same*error.

We wrote a simple shell script to drive the testing process. First, it uses the generator to produce several random makefiles. Then it runs each makefile through both gmake and emake, and compares the results. Any differences are reported for further investigation.

### Conclusion

Verifying the implementation of an emulator for a complex system is hard, especially when the original system has no formal specification. Using randomly generated input is a useful way to extend the breadth of your testing, and Markov chains make it possible to generate even more useful random input. Our original implementation of this technique uncovered several previously unknown defects, and it continues to pay dividends both by uncovering new defects and by providing a confidence measure for our emulation. If you want to play with Markov chains yourself, you can download the source for the generator used in this article . NB: the program has only been compiled and used on Linux; on other platforms your mileage may vary. For more information about Markov chains, I recommend Section 15.3 of the excellent book *Programming Pearls* by Jon Bentley .

If you enjoyed this article about testing techniques, you may enjoy this related article:

### Build/Test Acceleration and Continuous Delivery

Continuous Delivery isn’t continuous if builds and tests take too long to complete. Learn more on how CloudBees Accelerator speeds up builds and tests by up to 20X, improving software time to market, infrastructure utilization and developer productivity.