How to generate realistic test data

Use these simple SQL scripts to produce fictitious (but realistic) data for testing your applications.

By Lynn Evans

One of the most boring tasks that programmers face is creating fictitious data for testing their programs. In most cases, you can't use live data for that purpose, especially if it involves personal information such as names and addresses. So you end up having to invent tens or hundreds of plausible test records - a tedious chore that most of us hate.

In this article, I'll show you some short SQL scripts that will take much of the pain out of creating your test data. To keep things simple, we'll just create a table of abbreviated names and addresses, that is, a table with columns for first name, last name and city only. You can easily extend the technique to generate more complex tables, containing, for example, full postal addresses, phone numbers, email addresses and other similar items.

The script is written in T-SQL, and is intended to be used with Microsoft SQL Server (2005 and above). But it can easily be adapted for other databases, including MySQL, Oracle, Microsoft Access and Microsoft Visual FoxPro.

The steps

In summary, the process involves three steps:

  1. Create several small tables: one for each of the relevant columns (first name, last name and city, in this example); populate these tables with a few rows of realistic data.
  2. Cross-join the tables to produce a much larger table containing every combination of the three columns.
  3. Select a random sample from the larger table.
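The whole three-step pipeline can be sketched in miniature using SQLite via Python's sqlite3 module (a stand-in for SQL Server; the table names, the three-row seed lists and the sample size of 5 are my own choices for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Step 1: small seed tables, one per output column
cur.execute("CREATE TABLE FirstName (FirstName TEXT)")
cur.execute("CREATE TABLE LastName (LastName TEXT)")
cur.execute("CREATE TABLE City (City TEXT)")
cur.executemany("INSERT INTO FirstName VALUES (?)",
                [("Alan",), ("Betty",), ("Charles",)])
cur.executemany("INSERT INTO LastName VALUES (?)",
                [("Baker",), ("Carson",), ("Davies",)])
cur.executemany("INSERT INTO City VALUES (?)",
                [("London",), ("Paris",), ("Tokyo",)])

# Steps 2 and 3: cross join every combination (3 * 3 * 3 = 27 rows),
# shuffle with ORDER BY RANDOM(), keep a sample of 5
cur.execute("""
    CREATE TABLE TestData AS
    SELECT t1.FirstName, t2.LastName, t3.City
    FROM FirstName t1
    CROSS JOIN LastName t2
    CROSS JOIN City t3
    ORDER BY RANDOM()
    LIMIT 5
""")
rows = cur.execute("SELECT COUNT(*) FROM TestData").fetchone()[0]
print(rows)  # 5
```

SQLite has no TOP or NewID(), so LIMIT and ORDER BY RANDOM() play those roles here; the structure of the query is otherwise the same as the T-SQL version developed below.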

Step #1: The initial tables

First, we'll create three small tables, for first name, last name and city respectively, and place a dozen rows in each:

CREATE TABLE #FirstName (FirstName varchar(16))
INSERT INTO #FirstName VALUES ('Alan') 
INSERT INTO #FirstName VALUES ('Betty')
INSERT INTO #FirstName VALUES ('Charles')
INSERT INTO #FirstName VALUES ('David')
INSERT INTO #FirstName VALUES ('Elaine')
INSERT INTO #FirstName VALUES ('Fred')
INSERT INTO #FirstName VALUES ('Jack')
INSERT INTO #FirstName VALUES ('Mike')
INSERT INTO #FirstName VALUES ('Nicola')
INSERT INTO #FirstName VALUES ('Peter')
INSERT INTO #FirstName VALUES ('Susan')
INSERT INTO #FirstName VALUES ('Tom')

CREATE TABLE #LastName (LastName varchar(16))
INSERT INTO #LastName VALUES ('Baker')
INSERT INTO #LastName VALUES ('Carson')
INSERT INTO #LastName VALUES ('Davies')
INSERT INTO #LastName VALUES ('Harrison')
INSERT INTO #LastName VALUES ('Jackson')
INSERT INTO #LastName VALUES ('Macdonald')
INSERT INTO #LastName VALUES ('Nicholson')
INSERT INTO #LastName VALUES ('Richardson')
INSERT INTO #LastName VALUES ('Stewart')
INSERT INTO #LastName VALUES ('Tomlin')
INSERT INTO #LastName VALUES ('Williamson')
INSERT INTO #LastName VALUES ('Young')

CREATE TABLE #City (City varchar(16))
INSERT INTO #City VALUES ('London')
INSERT INTO #City VALUES ('Paris')
INSERT INTO #City VALUES ('Amsterdam')
INSERT INTO #City VALUES ('Barcelona')
INSERT INTO #City VALUES ('Berlin')
INSERT INTO #City VALUES ('New York')
INSERT INTO #City VALUES ('San Francisco')
INSERT INTO #City VALUES ('Sydney')
INSERT INTO #City VALUES ('Tokyo')
INSERT INTO #City VALUES ('Madrid')
INSERT INTO #City VALUES ('Rome')
INSERT INTO #City VALUES ('Athens')

As you can see, the table names (#FirstName, #LastName and #City) are prefixed with a pound sign (#). In SQL Server, this marks them as temporary tables, which reside in the tempdb database. You don't have to worry about cleaning up these temporary tables when you've finished with them; SQL Server drops them automatically when you close the connection.

Step #2: The cross join

Now that we have our initial tables, we can combine them into a single, much larger table:

SELECT t1.FirstName, t2.LastName, t3.City
INTO #All
FROM #FirstName t1
CROSS JOIN #LastName t2
CROSS JOIN #City t3

This will create another temporary table, named #All, containing the results of a cross join between all three of the initial tables. In a cross join (also known as a Cartesian join), each row in one table is matched with every row in the others. #All will therefore contain one row for each combination of first name, last name and city. That's 1,728 rows in all (12 to the power of 3).
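The row count is easy to check outside the database with itertools.product, which computes exactly the same Cartesian product (the three twelve-item lists here are placeholder stand-ins for the seed rows):

```python
from itertools import product

first_names = [f"First{i}" for i in range(12)]  # stand-ins for the 12 seed rows
last_names = [f"Last{i}" for i in range(12)]
cities = [f"City{i}" for i in range(12)]

# Every (first name, last name, city) combination, as in the cross join
combos = list(product(first_names, last_names, cities))
print(len(combos))  # 1728, i.e. 12 ** 3
```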

Step #3: The random selection

In most development projects, a table containing nearly 2,000 rows would be too big for testing purposes. So we'll cut it down to, say, 80 or so rows by taking a random selection from it.

When I was developing this script, my first thought was to randomly select the required rows like this:

SELECT * 
INTO TestData
FROM #All
WHERE Rand() BETWEEN 0.0 AND 0.05

In T-SQL, the Rand() function returns a random number between 0 and 1. By returning rows where that number is between 0.0 and 0.05, I assumed the result set would contain roughly one twentieth of the rows from the original table, which is what I wanted.

But that doesn't work. Because the value returned by Rand() doesn't depend on any of the values in the input table, SQL Server evaluates it once only, before running the rest of the query. It then applies the result of that evaluation to every row. So Rand() will return a single value, and the result set will contain either all the rows from #All or (much more likely) none of them.
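The all-or-nothing behaviour is easy to reproduce outside SQL Server. Drawing one random number and applying it to every row is not at all the same as drawing a fresh number per row (a plain-Python sketch of the two evaluation strategies, not SQL Server itself):

```python
import random

rows = list(range(1728))  # stand-in for the rows of #All

# Rand() evaluated once: a single draw decides the fate of every row
r = random.random()
once = [row for row in rows if 0.0 <= r <= 0.05]
print(len(once) in (0, len(rows)))  # True: we get all the rows or none

# A fresh draw per row gives roughly 5% of the rows, which is what was intended
per_row = [row for row in rows if random.random() <= 0.05]
```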

So I adopted another solution:

SELECT TOP 80 * 
INTO TestData
FROM #All
ORDER BY NewID()

The NewID() function in T-SQL returns a globally unique identifier (GUID): a 128-bit value conventionally written as 32 hexadecimal digits, grouped by hyphens. In SQL Server, GUIDs are used to populate columns that have the uniqueidentifier data type. Strictly speaking, a series of GUIDs is not completely random, but it's close enough for our purposes.

So, what we are doing here is sorting our large table into a near-random sequence, then taking the first 80 rows of that sequence. The resulting rows are saved to a new table, named TestData.
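For a sense of what NewID() produces, Python's uuid module generates the same kind of value (a version 4 UUID here, whereas SQL Server's NewID() uses its own algorithm); stripping the hyphens leaves the 32 hexadecimal digits:

```python
import uuid

guid = uuid.uuid4()
print(guid)  # e.g. 6fa459ea-ee8a-4ca4-894e-db77e160355e

# Remove the four hyphens to count the raw hex digits
hex_digits = str(guid).replace("-", "")
print(len(hex_digits))  # 32
```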

Finally, we can combine steps #2 and #3, which eliminates the need for the #All temporary table:

SELECT TOP 80 t1.FirstName, t2.LastName, t3.City
INTO TestData
FROM #FirstName t1
CROSS JOIN #LastName t2
CROSS JOIN #City t3
ORDER BY NewID()

Alternatives to CROSS JOIN

I said earlier that the technique described in this article can be adapted for databases other than SQL Server. If your database (or SQL dialect) doesn't support the CROSS JOIN syntax, there are a couple of possible alternatives. The first is to perform an inner join, based on a condition that will always be true:

SELECT TOP 80 t1.FirstName, t2.LastName, t3.City
INTO TestData
FROM #FirstName t1
JOIN #LastName t2 ON 1 = 1
JOIN #City t3 ON 1 = 1
ORDER BY NewID()

Alternatively, you can use the "old style" join syntax. Here, you omit the JOIN keyword altogether and simply list all the tables being joined in the FROM clause. By also omitting the WHERE clause, you force the join to be Cartesian:

SELECT TOP 80 t1.FirstName, t2.LastName, t3.City
INTO TestData
FROM #FirstName t1, #LastName t2, #City t3
ORDER BY NewID()
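SQLite accepts both spellings, so it's easy to confirm that the comma-separated FROM list really does produce the same Cartesian result as CROSS JOIN (the single-column tables A, B and C are my own minimal stand-ins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Three twelve-row tables, mirroring the seed tables in the article
for table in ("A", "B", "C"):
    cur.execute(f"CREATE TABLE {table} (v TEXT)")
    cur.executemany(f"INSERT INTO {table} VALUES (?)",
                    [(f"{table}{i}",) for i in range(12)])

# Explicit CROSS JOIN versus the old-style comma join
cross = cur.execute(
    "SELECT COUNT(*) FROM A CROSS JOIN B CROSS JOIN C").fetchone()[0]
comma = cur.execute(
    "SELECT COUNT(*) FROM A, B, C").fetchone()[0]
print(cross, comma)  # 1728 1728
```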

Alternatives to NewID()

Not all dialects of SQL support the NewID() function, but many have an equivalent. In MySQL, for example, GUIDs are referred to as universal unique identifiers, and the function that generates them is UUID(). Check your database documentation to find the appropriate function in your case.

In some databases, my original idea of using the Rand() function might work. You'll have to test that for yourself.

In Microsoft Visual FoxPro (VFP), using Rand() in a WHERE clause does work - but not always reliably. The results are more consistent if you use the function in a HAVING clause. What's more, VFP allows you to use a HAVING clause even if there is no GROUP BY clause. So the following syntax is legal and will work. (Since FoxPro doesn't support the use of the pound sign to indicate a temporary table, I've used VFP cursors instead. The semi-colons here are used as line-continuation characters):

SELECT TOP 80 t1.FirstName, t2.LastName, t3.City ;
INTO CURSOR TestData ;
FROM FirstName t1, LastName t2, City t3 ;
HAVING RAND() BETWEEN 0.0 AND 0.05

Summing up

As I said at the outset, the example I've shown here is a particularly simple one. In practice, your test data will contain more than three columns, and you'll probably want more variation in the results. But the technique is easy to extend. With only a little effort, you can use it to solve a problem that has plagued programmers since the dawn of the profession.

December 2011

