Parsing and validating UK postcodes in SQL Server

We explain the syntax and formatting rules for UK postal codes - and show some T-SQL queries for extracting the codes from your address data and checking they're in the correct format.

By Mike Lewis

If your database contains UK address data, it's especially important that the postcodes are correctly stored. British postcodes do much more than speed the mail. They play a vital role in many applications, especially in fields such as mapping, transportation, marketing and statistics. In this article, I'll explain the format and syntax rules for UK postcodes, and show you some SQL Server code for parsing and validating them.

When it comes to validation, you can't simply check a postcode against a lookup table. Such a table does exist (it's known as the Postcode Address File, or PAF). But it contains over 27 million records, and is subject to strict licensing and copyright controls. You can access it by purchasing specialized software, but that's a costly option.

Figure 1

Figure 1: Anatomy of a postcode.

However, it's relatively easy to check the syntax of a postcode. That won't tell you if the code actually exists, but it will flag up any codes that are not in the correct format.

Formatting rules for UK postcodes

A UK postcode is a string of either six, seven or eight characters (Figure 1). It consists of two parts, separated by a space.

The first part is called the outward code. It usually consists of one or two letters, followed by one or two digits. The letters represent the postcode area, of which there are 124. Most of these consist of two letters: BS for the Bristol area, DG for Dumfries and Galloway, and so on. But a few large cities are represented by a single letter (L for Liverpool, G for Glasgow, etc.).

Still in the outward code, the letters are followed by one or two digits. These identify the district within the area. So BS1 is the central part of Bristol, while G12 covers Glasgow's West End. In the case of London, some of the densely-populated central districts are further divided by means of an extra letter at the end of the outward code. For example, within the SW1 district, SW1A covers Whitehall and the Mall, while SW1E is Victoria Station and nearby streets.

A9 9AAL2 3SW
A99 9AAM16 0RA
AA99 9AAEH12 9DN

Table 1: All valid postcodes
follow one of these patterns.

Compared to the outward code, the inward code - the part after the space - is quite simple. It always consists of exactly one digit followed by exactly two letters. The digit indicates the sector; the letters show the unit. The unit within the sector represents a very small area, typically containing just 12 - 15 addresses.

All this might sound extraordinarily complicated. But it's really not that bad when you consider that there are only six possible patterns for a valid postcode (see Table 1).

Special codes

There are also a few special-purpose codes that don't follow these rules. They include GIR 0AA (the former National Girobank) and SAN TA1 (a special code for Santa Claus). Most British overseas territories also have non-standard codes: PCRN 1ZZ is the Pitcairn Islands, for instance. For most applications, however, these special codes can be ignored.

Validating postcodes in T-SQL

Given a postcode column within a SQL Server database, here's a T-SQL query that will check the validity of the codes, based on the syntax rules discussed above. The query returns all invalid postcodes from the relevant table.

-- Returns invalid postcodes
SELECT postcode FROM Addresses
patindex('[A-Z][0-9] [0-9][A-Z][A-Z]', postcode) = 0 and  
patindex('[A-Z][0-9][0-9] [0-9][A-Z][A-Z]', postcode) = 0 and
patindex('[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]', postcode) = 0 and
patindex('[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]', postcode) = 0 and
patindex('[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]', postcode) = 0 and
patindex('[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]', postcode) = 0

As you can see, the query uses T-SQL's PATINDEX() function. This function searches a string for a sub-string that matches a wildcard pattern. It returns the starting position of the pattern within the string, or zero if the pattern is not found. In this case, we call PATINDEX() six times per row, once for each of the possible postcode patterns.

Extracting postcodes from longer strings

When designing a UK address table, it's usually desirable to store the postcode in a separate column. But you might sometimes need to work with a table in which the postcodes are embedded in another column - typically, the last line of the address.

Here is a T-SQL query that will extract postcodes that are embedded in another column, and place them in a column on their own.

-- Extracts postcode embedded in an address column (Adr)
when patindex('%[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]%', Adr) > 0
then substring(Adr, patindex('%[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]%',Adr),8)
when patindex('%[A-Z][0-9][0-9] [0-9][A-Z][A-Z]%', Adr) > 0
then substring(Adr, patindex('%[A-Z][0-9][0-9] [0-9][A-Z][A-Z]%',Adr),7)
when patindex('%[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]%', Adr) > 0
then substring(Adr, patindex('%[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]%',Adr),7)
when patindex('%[A-Z][0-9] [0-9][A-Z][A-Z]%', Adr) > 0
then substring(Adr, patindex('%[A-Z][0-9] [0-9][A-Z][A-Z]%',Adr),6)
when patindex('%[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]%', Adr) > 0
then substring(Adr, patindex('%[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]%',Adr),8)
when patindex('%[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]%', Adr) > 0
then substring(Adr, patindex('%[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]%',Adr),7) 
end AS Postcode
FROM Addresses

Checking outward codes against a lookup table

I mentioned earlier that the only way to validate a postcode against a lookup table is to use the PAF, which is expensive. That's true of the code as a whole. But it's much more feasible to just check the outward portion of the code. There are around 3,000 outward codes currently in use - compared to more than 27 million full postcodes.

What's more, it's possible to download a list of outward codes. One place where you can do that is the EasyPeasy site, where the list is available free of charge. It comes in the form of a comma-delimited file (handy for opening in Excel) and as a SQL script which generates a table of the codes. Unfortunately for SQL Server folk, the script is in MySQL syntax, but it can be adapted to T-SQL without too much effort.

One problem with the EasyPeasy file is that it doesn't include the 200 or so non-geographic outward postcodes. These are special codes that have been allocated to very large organisations. They conform to the same syntax rules as other codes, but are independent of any particular geographic area. They include, for example, the VAT Central Unit at BX5, and British Gas at GU95.

Because there are relatively few of these non-geographic codes, it would be possible to add them to the lookup table manually. You can find an up-to-date list here.

Once you have the table in place, it's easy to use it in your validation. The following query will return the rows in an address table for which the outward part of the postcode is not present in the lookup table:

-- Given a lookup table of all valid outward codes, this
-- will return the invalid postcodes from an addreses table
SELECT postcode FROM Addresses
WHERE substring(postcode, 1, charindex(' ', postcode)) 
  NOT IN (SELECT outcode FROM lookup)

If you have a particularly large address table, it might be worth splitting the validation routine into two steps. In other words, you would check postcodes against the lookup table only where they have first passed the syntax check.

Going further

Parsing and validation apart, there are many useful ways of putting British postcodes to work. One potentially worthwhile application is to use them to calculate the distance between two addresses. This in turn would allow you to run queries that will find addresses of a certain type within so-many miles or kilometers of a given point. To see how to do that, take a look at my article, Programmatically calculate distances between UK postcodes.

December 2011

Please note: The information given on this site has been carefully checked and is believed to be correct, but no legal liability can be accepted for its use. Do not use code, components or techniques unless you are satisfied that they will work correctly with your sites or applications.

If you found this article useful, please tell your friends: