Testing Data-Intensive Code With Go, Part 5

Overview

This is part five out of five in a tutorial series on testing data-intensive code. In part four, I covered remote data stores, using shared test databases, using production data snapshots, and generating your own test data. In this tutorial, I'll go over fuzz testing, testing your cache, testing data integrity, testing idempotency, and missing data.

Fuzz Testing

The idea of fuzz testing is to overwhelm the system with lots of random input. Instead of trying to think of input that will cover all cases, which can be difficult and/or very labor intensive, you let chance do it for you. It is conceptually similar to random data generation, but the intention here is to generate random or semi-random inputs rather than persistent data.

When Is Fuzz Testing Useful?

Fuzz testing is useful in particular for finding security and performance problems when unexpected inputs cause crashes or memory leaks. But it can also help ensure that all invalid inputs are detected early and are rejected properly by the system.

Consider, for example, input that comes in the form of deeply nested JSON documents (very common in web APIs). Trying to generate manually a comprehensive list of test cases is both error-prone and a lot of work. But fuzz testing is the perfect technique.

Using Fuzz Testing

There are several libraries you can use for fuzz testing. My favorite is gofuzz from Google. Here is a simple example that automatically generates 200 unique objects of a struct with several fields, including a nested struct.

import (
    "fmt"
	"github.com/google/gofuzz"
)

func SimpleFuzzing() {
	type SomeType struct {
		A string
		B string
		C int
		D struct {
			E float64
		}
	}

	f := fuzz.New()
	object := SomeType{}

	uniqueObjects := map[SomeType]int{}

	for i := 0; i < 200; i++ {
		f.Fuzz(&object)
		uniqueObjects[object]++
	}
	fmt.Printf("Got %v unique objects.\n", len(uniqueObjects))
	// Output:
	// Got 200 unique objects.
}

Testing Your Cache

Pretty much every complex system that deals with a lot of data has a cache, or more likely several levels of hierarchical caches. As the saying goes, there are only two difficult things in computer science: naming things, cache invalidation, and off by one errors.

Jokes aside, managing your caching strategy and implementation can complicate your data access but have a tremendous impact on your data access cost and performance. Testing your cache can't be done from the outside because your interface hides where the data comes from, and the cache mechanism is an implementation detail.

Let's see how to test the cache behavior of the Songify hybrid data layer.

Cache Hits and Misses

Caches live and die by their hit/miss performance. The basic functionality of a cache is that if requested data is available in the cache (a hit) then it will be fetched from the cache and not from the primary data store. In the original design of the HybridDataLayer, the cache access was done through private methods.

Go visibility rules make it impossible to call them directly or replace them from another package. To enable cache testing, I'll change those methods to public functions. This is fine because the actual application code operates through the DataLayer interface, which doesn't expose those methods.

The test code, however, will be able to replace these public functions as needed. First, let's add a method to get access to the Redis client, so we can manipulate the cache:

1	func (m HybridDataLayer) GetRedis() redis.Client {
2	return m.redis
3	}

Next I'll change the getSongByUser_DB() methods to a public function variable. Now, in the test, I can replace the GetSongsByUser_DB() variable with a function that keeps track of how many times it was called and then forwards it to the original function. That allows us to verify if a call to GetSongsByUser() fetched the songs from the cache or from the DB.

Let's break it down piece by piece. First, we get the data layer (that also clears the DB and redis), create a user, and add a song. The AddSong() method also populates redis.

func TestGetSongsByUser_Cache(t *testing.T) {
    now := time.Now()
	u := User{Name: "Gigi", 
             Email: "gg@gg.com", 
             RegisteredAt: now, LastLogin: now}
	dl, err := getDataLayer()
	if err != nil {
		t.Error("Failed to create hybrid data layer")
	}

	err = dl.CreateUser(u)
	if err != nil {
		t.Error("Failed to create user")
	}

	lm, err := NewSongManager(u, dl)
	if err != nil {
		t.Error("NewSongManager() returned 'nil'")
	}

	err = lm.AddSong(testSong, nil)
	if err != nil {
		t.Error("AddSong() failed")
	}

This is the cool part. I keep the original function and define a new instrumented function that increments the local callCount variable (it's all in a closure) and calls the original function. Then, I assign the instrumented function to the variable GetSongsByUser_DB. From now on, every call by the hybrid data layer to GetSongsByUser_DB() will go to the instrumented function.

    callCount := 0
	originalFunc := GetSongsByUser_DB
	instrumentedFunc := func(m *HybridDataLayer, 
	                         email string, 
                             songs *[]Song) (err error) {
		callCount += 1
		return originalFunc(m, email, songs)
	}

	GetSongsByUser_DB = instrumentedFunc

At this point, we're ready to actually test the cache operation. First, the test calls the GetSongsByUser() of the SongManager that forwards it to the hybrid data layer. The cache is supposed to be populated for this user we just added. So the expected result is that our instrumented function will not be called, and the callCount will remain at zero.

    _, err = lm.GetSongsByUser(u)
	if err != nil {
		t.Error("GetSongsByUser() failed")
	}

	// Verify the DB wasn't accessed because cache should be
    // populated by AddSong()
	if callCount > 0 {
		t.Error(`GetSongsByUser_DB() called when it
                 shouldn't have`)
	}

The last test case is to ensure that if the user's data is not in the cache, it will be fetched properly from the DB. The test accomplishes it by flushing Redis (clearing all its data) and making another call to GetSongsByUser(). This time, the instrumented function will be called, and the test verifies that the callCount is equal to 1. Finally, the original GetSongsByUser_DB() function is restored.

    // Clear the cache
	dl.GetRedis().FlushDB()

	// Get the songs again, now it's should go to the DB
    // because the cache is empty
	_, err = lm.GetSongsByUser(u)
	if err != nil {
		t.Error("GetSongsByUser() failed")
	}

	// Verify the DB was accessed because the cache is empty
	if callCount != 1 {
		t.Error(`GetSongsByUser_DB() wasn't called once 
                 as it should have`)
	}

	GetSongsByUser_DB = originalFunc
}

Cache Invalidation

Our cache is very basic and doesn't do any invalidation. This works pretty well as long as all songs are added through the AddSong() method that takes care of updating Redis. If we add more operations like removing songs or deleting users then these operations should take care of updating Redis accordingly.

This very simple cache will work even if we have a distributed system where multiple independent machines can run our Songify service—as long as all the instances work with the same DB and Redis instances.

However, if the DB and cache can get out of sync due to maintenance operations or other tools and applications changing our data then we need to come up with an invalidation and refresh policy for the cache. It can be tested using the same techniques—replace target functions or directly access the DB and Redis in your test to verify the state.

LRU Caches

Usually, you can't just let the cache grow infinitely. A common scheme to keep the most useful data in the cache is LRU caches (least recently used). The oldest data gets bumped from the cache when it reaches capacity.

Testing it involves setting the capacity to a relatively small number during the test, exceeding the capacity, and ensuring that the oldest data is not in the cache anymore and accessing it requires DB access.

Testing Your Data Integrity

Your system is only as good as your data integrity. If you have corrupted data or missing data then you're in bad shape. In real-world systems, it's difficult to maintain perfect data integrity. Schema and formats change, data is ingested through channels that might not check for all the constraints, bugs let in bad data, admins attempt manual fixes, backups and restores might be unreliable.

Given this harsh reality, you should test your system's data integrity. Testing data integrity is different than regular automated tests after each code change. The reason is that data can go bad even if the code didn't change. You definitely want to run data integrity checks after code changes that might alter data storage or representation, but also run them periodically.

Testing Constraints

Constraints are the foundation of your data modeling. If you use a relational DB then you can define some constraints at the SQL level and let the DB enforce them. Nullness, length of text fields, uniqueness and 1-N relationships can be defined easily. But SQL can't check all the constraints.

For example, in Desongcious, there is a N-N relationship between users and songs. Each song must be associated with at least one user. There is no good way to enforce this in SQL (well, you can have a foreign key from song to user and have the song point to one of the users associated with it). Another constraint may be that each user may have at most 500 songs. Again, there is no way to represent it in SQL. If you use NoSQL data stores then usually there is even less support for declaring and validating constraints at the data store level.

That leaves you with a couple of options:

Ensure that access to data goes only through vetted interfaces and tools that enforce all the constraints.
Periodically scan your data, hunt constraint violations, and fix them.

Testing Idempotency

Idempotency means that performing the same operation multiple times in a row will have the same effect as performing it once.

For example, setting the variable x to 5 is idempotent. You can set x to 5 one time or a million times. It will still be 5. However, incrementing X by 1 is not idempotent. Every consecutive increment changes its value. Idempotency is a very desirable property in distributed systems with temporary network partitions and recovery protocols that retry sending a message multiple times if there is no immediate response.

If you design idempotency into your data access code, you should test it. This is typically very easy. For each idempotent operation you extend to perform the operation twice or more in a row and verify there are no errors and the state remains the same.

Note that idempotent design may sometimes hide errors. Consider deleting a record from a DB. It is an idempotent operation. After you delete a record, the record doesn't exist in the system anymore, and trying to delete it again will not bring it back. That means that trying to delete a non-existent record is a valid operation. But it might mask the fact that the wrong record key was passed by the caller. If you return an error message then it's not idempotent.

Testing Data Migrations

Data migrations can be very risky operations. Sometimes you run a script over all your data or critical parts of your data and perform some serious surgery. You should be ready with plan B in case something goes wrong (e.g. go back to the original data and figure out what went wrong).

In many cases, data migration can be a slow and costly operation that may require two systems side by side for the duration of the migration. I participated in several data migrations that took several days or even weeks. When facing a massive data migration, it's worth it to invest the time and test the migration itself on a small (but representative) subset of your data and then verify that the newly migrated data is valid and the system can work with it.

Testing Missing Data

Missing data is an interesting problem. Sometimes missing data will violate your data integrity (e.g. a song whose user is missing), and sometimes it's just missing (e.g. someone removes a user and all their songs).

If the missing data causes a data integrity problem then you'll detect it in your data integrity tests. However, if some data is just missing then there is no easy way to detect it. If the data never made it into persistent storage then maybe there is a trace in the logs or other temporary stores.

Depending how much of a risk missing data is, you may write some tests that deliberately remove some data from your system and verify the system behaves as expected.

Conclusion

Testing data-intensive code requires deliberate planning and an understanding of your quality requirements. You can test at several levels of abstraction, and your choices will affect how thorough and comprehensive your tests are, how many aspects of your actual data layer you test, how fast your tests run, and how easy it is to modify your tests when the data layer changes.

There is no single correct answer. You need to find your sweet spot along the spectrum from super comprehensive, slow and labor-intensive tests to fast, lightweight tests.