Padding Fields for Fixed-Position Data Formats

2015-04-22

Fixed-position data formats will seemingly be with us forever. Despite the relative ease of parsing CSV (or other delimited formats), or even XML, many data exchanges require a fixed-position input. Characters 1-10 are X, characters 11-15 are Y and if the source data is fewer than 5 characters, we have to left-pad with a filler character, etc. When you’re accustomed to working with data that says what it means and means what it says, having to add “extra fluff” like left-padding your integers with a half-dozen zeroes can be a hassle.

I received a draft of a stored procedure recently which had to do exactly this. The intent is for the procedure to output the data almost entirely formatted as required, one record per line in the output file, and dump the result set to a file on disk. As it was given to me, the procedure was peppered with CASE statements like this (only more complex) in the SELECT clause:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


-- Method 1
select
 case len(cast(logid as varchar))
   when 9 then '0' + cast(logid as varchar)
   when 8 then '00' + cast(logid as varchar)
   when 7 then '000' + cast(logid as varchar)
   when 6 then '0000' + cast(logid as varchar)
   when 5 then '00000' + cast(logid as varchar)
   when 4 then '000000' + cast(logid as varchar)
   when 3 then '0000000' + cast(logid as varchar)
   when 2 then '00000000' + cast(logid as varchar)
   when 1 then '000000000' + cast(logid as varchar)
   when 0 then '0000000000' + cast(logid as varchar)
 end as logid
 ,logtext
 from cachedb.dbo.logs; 

It’s perfectly valid, it works, and there’s nothing inherently wrong with it. But I find it a bit tough to read, and it could become trouble if the format changes later, as additional (or fewer) cases will have to be accounted for. Fortunately, the day I received this procedure was right around the day I learned about the REPLICATE() T-SQL function. Maybe we can make this simpler:

1
2
3
4
5


 
select
  replicate('0',10-len(cast(logid as varchar))) + cast(logid as varchar) as logid
  ,logtext
from cachedb.dbo.logs; 

Not bad. But it leaves us with a magic number and similar to the previous example, if the file format changes we have to seek out these magic numbers and fix them. This is easily remedied by defining these field lengths at the beginning of the procedure, so that they’re all in one place if anything needs to change.

1
2
3
4
5
6


-- Method 2
declare @paddedlength int = 10;
select
 replicate('0',@paddedlength-len(cast(logid as varchar))) + cast(logid as varchar) as logid
 ,logtext
from cachedb.dbo.logs; 

Yet another approach would be to pad out the value beyond what we need, then trim the resulting string back to the required length. Again, we have to be careful to not leave ourselves with magic numbers; the solution is the same as when using REPLICATE():

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


-- Method 3
select
  right('0000000000' + cast(logid as varchar), 10) as logid
  ,logtext
from cachedb.dbo.logs;
-- Or, with more flexibility/fewer magic numbers
-- Method 4
declare @paddedlength int = 10;
select
  right(replicate('0',@paddedlength) + cast(logid as varchar), @paddedlength) as logid
  ,logtext
from cachedb.dbo.logs; 

All four methods yield the same results, as far as the data itself is concerned. But what about performance? For a table with 523,732 records, execution times were:

2,000ms CPU time, 261,785ms elapsed
2,265ms CPU time, 294,399ms elapsed
2,000ms CPU time, 297,593ms elapsed
2,078ms CPU time, 302,045ms elapsed

Each method had an identical execution plan, so I’m probably going to opt for the code that’s more readable and maintainable - method 2 or 4.

As with any tuning, be sure to test with your own data & queries.