Refactoring Legacy Code: Part 1 - The Golden Master

Old code. Ugly code. Complicated code. Spaghetti code. Jibberish nonsense. In two words, Legacy Code. This is a series that will help you work and deal with it.

In an ideal world, you would write only new code. You would write it beautiful and perfect. You would never have to revisit your code and you will never have to maintain projects ten years old. In an ideal world...

Unfortunately, we live in a reality that is not ideal. We have to understand, modify and enhance ages-old code. We have to work with legacy code. So what are you waiting for? Let's get our heads into this first tutorial, get the code, understand it a little bit and create a safety net for our future modifications.

Definition of Legacy Code

Legacy code was defined in so many ways it is impossible to find a single, commonly accepted definition for it. The few examples at the start of this tutorial are just the tip of the iceberg. So I won't give you any official definition. Instead, I will quote you my favorite one.

To me, legacy code is simply code without tests. ~ Michael Feathers

Well, that is the first formal definition of the expression legacy code, published by Michael Feathers in his book Working Effectively with Legacy Code. Of course, the industry used the expression for ages, basically for any code that is difficult to change. However this definition has something different to tell. It explains the problem very clearly, so that the solution becomes obvious. "Difficult to change" is so vague. What should we do to make it easy to change? We have no idea! "Code without tests" on the other hand is very concrete. And the answer to our previous question is simple, make code testable and test it. So let's get started.

Getting Our Legacy Code

This series will be based on the exceptional Trivia Game by J.B. Rainsberger designed for Legacy Code Retreat events. It is made to be like real legacy code and to also offer opportunities for a wide variety of refactoring, at a decent level of difficulty.

Check Out the Source Code

The Trivia Game is hosted on GitHub and it is GPLv3 licensed, so you can play around with it freely. We will start this series by checking out the official repository. The code is also attached to this tutorial with all the modifications we will make, so if you get confused at some point, you can take a sneak peek at the end result.

 $ git clone https://github.com/jbrains/trivia.git
Cloning into 'trivia'...
remote: Counting objects: 429, done.
remote: Compressing objects: 100% (262/262), done.
remote: Total 429 (delta 100), reused 419 (delta 93)
Receiving objects: 100% (429/429), 848.33 KiB | 305.00 KiB/s, done.
Resolving deltas: 100% (100/100), done.
Checking connectivity... done.

When you open the trivia directory you will see our code in several programming languages. We will work in PHP, but you are free to choose your favorite one and apply the techniques presented here.

Understanding the Code

By definition, legacy code is difficult to understand, especially if we don't even know what it's supposed to do. So the first step is to run the code and make some kind of reasoning, what it is about.

We have two files in our directory.

1	$ cd php/
2	$ ls -al
3	total 20
4	drwxr-xr-x 2 csaba csaba 4096 Mar 10 21:05 .
5	drwxr-xr-x 26 csaba csaba 4096 Mar 10 21:05 ..
6	-rw-r--r-- 1 csaba csaba 5568 Mar 10 21:05 Game.php
7	-rw-r--r-- 1 csaba csaba 410 Mar 10 21:05 GameRunner.php

GameRunner.php seems to be a good candidate for our attempt to run the code.

$ php ./GameRunner.php
Chet was added
They are player number 1
Pat was added
They are player number 2
Sue was added
They are player number 3
Chet is the current player
They have rolled a 4
Chet's new location is 4
The category is Pop
Pop Question 0
Answer was corrent!!!!
Chet now has 1 Gold Coins.
Pat is the current player
They have rolled a 2
Pat's new location is 2
The category is Sports
Sports Question 0
Answer was corrent!!!!
Pat now has 1 Gold Coins.
Sue is the current player
They have rolled a 1
Sue's new location is 1
The category is Science
Science Question 0
Answer was corrent!!!!
Sue now has 1 Gold Coins.
Chet is the current player
They have rolled a 4

## Some lines removed to keep
## the tutorial at a reasonable size

Answer was corrent!!!!
Sue now has 5 Gold Coins.
Chet is the current player
They have rolled a 3
Chet is getting out of the penalty box
Chet's new location is 11
The category is Rock
Rock Question 5
Answer was correct!!!!
Chet now has 5 Gold Coins.
Pat is the current player
They have rolled a 1
Pat's new location is 10
The category is Sports
Sports Question 1
Answer was corrent!!!!
Pat now has 6 Gold Coins.

OK. Our guess was correct. Our code ran and produced some output. Analyzing this output will help us deduce some basic idea about what the code does.

We know it's a Trivia game. We knew it when we checked out the source code.
Our example has three players: Chet, Pat and Sue.
There is some kind of rolling of a dice or similar concept.
There is a current location for a player. Possibly on some kind of board?
There are various categories from which questions are asked.
Users answer questions.
Correct answers give players gold.
Wrong answers send players to the penalty box.
Players can get out of penalty box, based on some not quite clear logic.
It seems like the user that first reaches six gold coins wins.

Now that is a lot of knowledge. We could figure out most of the basic behavior of the application by just looking at the output. In real life applications, the output may not be text on the screen, but it can be a web page, an error log, a database, a network communication, a dump file and so on. In other cases, the module you need to modify can not be run isolated. If so, you will need to run it through other modules of the bigger application. Just try to add the minimum, to get some reasonable output from your legacy code.

Scanning the Code

Now that we have an idea about what the code outputs, we can start looking at it. We will start with the runner.

The Game Runner

I like to start with running all the code through the formatter of my IDE. This greatly improves readability by making the code's form familiar with what I am used to. So this:

... will become this:

... which is somewhat better. It may not be a huge difference with this small amount of code, but it will be on our next file.

Looking at our GameRunner.php file, we can easily identify some key aspects we observed in the output. We can see the lines that add the users (9-11), that a roll() method is called and a winner is selected. Of course, these are far from the inner secrets of the logic of the game, but at least we could start by identifying key methods that will help us discover the rest of the code.

The Game File

We should do the same formatting on the Game.php file also.

This file is much larger; About 200 lines of code. Most of the methods are appropriately sized, but some of them are quite large and after the formatting, we can see that in two places the code indentation goes beyond four levels. High levels of indentation usually means lots of complex decisions, so for now, we can assume that those points in our code will be more complex and more sensible to change.

The Golden Master

And the thought of change leads us to our lack of tests. The methods we saw in Game.php are quite complex. Don't worry if you don't understand them. At this point, they are a mystery for me also. Legacy code is a mystery that we need to solve and understand. We made our first step to understand it and it is now time for our second one.

So What Is This Golden Master?

When working with legacy code, it is almost impossible to understand it and to write code that will surely exercise all the logical paths through the code. For that kind of testing, we would need to understand the code, but we do not yet. So we need to take another approach.

Instead of trying to figure out what to test, we can test everything, a lot of times, so that we end up with a huge amount of output, about which we can almost certainly assume that it was produced by exercising all parts of our legacy code. It is recommended to run the code at least 10,000 (ten thousand) times. We will write a test to run it twice as much and save the output.

Writing the Golden Master Generator

We can think ahead and start by creating a generator and a test as separate files for future testing, but is it really necessary? We don't know that yet for certain. So why not just start with a basic test file that will run our code once and build our logic up from there.

You will find in the attached code archive, inside the source folder but outside the trivia folder our Test folder. In this folder, we create a file: GoldenMasterTest.php.

class GoldenMasterTest extends PHPUnit_Framework_TestCase {

	function testGenerateOutput() {
		ob_start();
		require_once __DIR__ . '/../trivia/php/GameRunner.php';
		$output = ob_get_contents();
		ob_end_clean();

		var_dump($output);
	}

}

We could do this in many ways. We could, for example, run our code from the console and redirect its output to a file. However, having it in a test that is easily run inside our IDE is an advantage we should not ignore.

The code is quite simple, it buffers the output and puts it into the $output variable. The require_once() will also run all the code inside the the included file. In our var dump we will see some already familiar output.

However on a second run, we can observe something odd:

... the outputs differ. Even though we ran the same code, the output is different. The rolled numbers are different, the players' positions are different.

Seeding the Random Generator

do {

	$aGame->roll(rand(0, 5) + 1);

	if (rand(0, 9) == 7) {
		$notAWinner = $aGame->wrongAnswer();
	} else {
		$notAWinner = $aGame->wasCorrectlyAnswered();
	}

} while ($notAWinner);

By analyzing the essential code from the runner, we can see that it uses a the function rand() to generate random numbers. Our next stop is the official PHP documentation to research this rand() function.

The random number generator is seeded automatically.

The documentation tells us that seeding happens automatically. Now we have another task. We need to find a way to control the seed. The srand() function can help with that. Here is its definition from the documentation.

Seeds the random number generator with seed or with a random value if no seed is given.

It tells us, that if we run this before any call to rand(), we should always end up with the same results.

function testGenerateOutput() {
	ob_start();
	srand(1);
	require_once __DIR__ . '/../trivia/php/GameRunner.php';
	$output = ob_get_contents();
	ob_end_clean();

	var_dump($output);
}

We put srand(1) before our require_once(). Now the output is always the same.

Put the Output in a File

class GoldenMasterTest extends PHPUnit_Framework_TestCase {

	function testGenerateOutput() {
		file_put_contents('/tmp/gm.txt', $this->generateOutput());
		$file_content = file_get_contents('/tmp/gm.txt');
		$this->assertEquals($file_content, $this->generateOutput());
	}

	private function generateOutput() {
		ob_start();
		srand(1);
		require_once __DIR__ . '/../trivia/php/GameRunner.php';
		$output = ob_get_contents();
		ob_end_clean();
		return $output;
	}

}

This change looks reasonable. Right? We extracted the code generation into a method, run it twice, and expected the output to be equal. However they won't be.

The reason is that require_once() will not require the same file twice. The second call to the generateOutput() method will produce an empty string. So, what could we do? What if we simply require()? That should be run each time.

Well, that leads to another problem: "Cannot redeclare echoln()". But where is that coming from? It is right at the beginning of the Game.php file. The reason why this error is occurring is because in GameRunner.php we have include __DIR__ . '/Game.php';, which tries to include the Game file twice, each time when we call the generateOutput() method.

1	include_once __DIR__ . '/Game.php';

Using include_once in GameRunner.php will solve our problem. Yes, we needed to modify GameRunner.php without having tests for it, yet! However, we can be 99% sure that our change will not break the code itself. It is a small and simple enough change to not scare us very much. And most importantly, it makes the tests pass.

Run It Several Times

Now that we have code we can run many times, it's time to generate some output.

function testGenerateOutput() {
	$this->generateMany(20, '/tmp/gm.txt');
	$this->generateMany(20, '/tmp/gm2.txt');
	$file_content_gm = file_get_contents('/tmp/gm.txt');
	$file_content_gm2 = file_get_contents('/tmp/gm2.txt');
	$this->assertEquals($file_content_gm, $file_content_gm2);
}

private function generateMany($times, $fileName) {
	$first = true;
	while ($times) {
		if ($first) {
			file_put_contents($fileName, $this->generateOutput());
			$first = false;
		} else {
			file_put_contents($fileName, $this->generateOutput(), FILE_APPEND);
		}
		$times--;
	}
}

We extracted another method here: generateMany(). It has two parameters. One for the number of times we want to run our generator, the other is a destination file. It will put the generated output in the files. On the first run it empties the files, and for the rest of the iterations, it appends the data. You can look into the file to see the generated output 20 times.

But wait! The same player wins every time? Is that possible?

cat /tmp/gm.txt | grep "has 6 Gold Coins."
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.

Yes! It is possible! It is more than possible. It is a sure thing. We have the same seed for our random function. We play the same game over and over again.

Run It Differently Each Time

We need to play different games, otherwise it is almost certain that only a small part of our legacy code is actually exercised over and over again. The scope of the golden master is to exercise as much as possible. We need to re-seed the random generator each time, but in a controlled way. One option is to use our counter as the seed value.

private function generateMany($times, $fileName) {
	$first = true;
	while ($times) {
		if ($first) {
			file_put_contents($fileName, $this->generateOutput($times));
			$first = false;
		} else {
			file_put_contents($fileName, $this->generateOutput($times), FILE_APPEND);
		}
		$times--;
	}
}

private function generateOutput($seed) {
	ob_start();
	srand($seed);
	require __DIR__ . '/../trivia/php/GameRunner.php';
	$output = ob_get_contents();
	ob_end_clean();
	return $output;
}

This still keeps our test passing, so we are sure we generate the same complete output each time, while the output plays a different game for each iteration.

cat /tmp/gm.txt | grep "has 6 Gold Coins."
Sue now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Pat now has 6 Gold Coins.
Pat now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Sue now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Sue now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Sue now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Pat now has 6 Gold Coins.
Chet now has 6 Gold Coins.
Chet now has 6 Gold Coins.

There are various winners for the game in a random fashion. This looks good.

Getting to 20,000

The first thing you may try is to run our code for 20,000 game iterations.

function testGenerateOutput() {
	$times = 20000;
	$this->generateMany($times, '/tmp/gm.txt');
	$this->generateMany($times, '/tmp/gm2.txt');
	$file_content_gm = file_get_contents('/tmp/gm.txt');
	$file_content_gm2 = file_get_contents('/tmp/gm2.txt');
	$this->assertEquals($file_content_gm, $file_content_gm2);
}

This will almost work. Two 55MB files will be generated.

1	ls -alh /tmp/gm*
2	-rw-r--r-- 1 csaba csaba 55M Mar 14 20:38 /tmp/gm2.txt
3	-rw-r--r-- 1 csaba csaba 55M Mar 14 20:38 /tmp/gm.txt

On the other hand, the test will fail with an insufficient memory error. It doesn't matter how much RAM you have, this will fail. I have 8GB plus a 4GB swap and it fails. The two strings are just too big to be compared in our assertion.

In other words, we generate good files, but PHPUnit cannot compare them. We need a work-around.

1	$this->assertFileEquals('/tmp/gm.txt', '/tmp/gm2.txt');

That seems to be a good candidate, but it still fails. What a shame. We need to research the situation further.

1	$this->assertTrue($file_content_gm == $file_content_gm2);

This however, is working.

It can compare the two strings and fail if they are different. It has however, a small price. It won't be able to exactly tell what is wrong when the strings differ. It will just simply say "Failed asserting that false is true.". But we will deal with that in an upcoming tutorial.

Final Thoughts

We are done for this tutorial. We have learned quite a lot for our first lesson and we are on a good start for our future work. We met the code, we analyzed it in different ways and we mostly understood its essential logic. Then we created a set of tests to ensure it is exercised as much as possible. Yes. The tests are very slow. It takes them 24 seconds on my Core i7 CPU to generate the output twice. Fortunately in our future development, we will keep the gm.txt file untouched and generate another one only once per run. But 12 seconds is still a huge amount of time for such a small code base.

By the time we finish this series, our tests should run in less than a second and test all the code properly. So, stay tuned for our next tutorial when we will tackle problems like magic constants, magic strings and complex conditionals. Thanks for reading.