Thursday, September 06, 2007

An app that I've been maintaining for a few years was due for an upgrade from SQL 2000 to SQL 2005.  One of the features of the app is the ability to collect surveys.  Surveys, being what they are, have many types of answers, and the original solution to this was to create answer columns of a generic type VARCHAR.  (OK, ok, don't kill me on this one.  The deal was done on this architecture decision years ago, so there's no point rehashing it now; nor can it for reasons I won't get into here.)

The client came back after the original design and wanted to start performing aggregations and reports on the answers supplied, i.e. averages, standard deviations, etc.  That was all fine and good until we started to realize that some of the responses weren't expected.  E.g., How many times do you see the doctor per year?  was answered "five", "5", "4-6", and "n/a." 

An early decision was made to drop the undecipherable responses and only count true numbers, ignoring non-numeric answers via a WHERE clause, and casting the numeric representation as a decimal, integer, or some other real numeric object; this is known as an "unsafe expression."  This solution worked like a champ for 3 years... until we migrated to SQL2005.

Here's a good example of what broke:

 1: create table dbo.aaaTestBreak ( 
 2: myColumn varchar(25) 
 3: ) 
 4: insert aaaTestBreak select '25' 
 5: insert aaaTestBreak select '-10' 
 6: insert aaaTestBreak select '14.7' 
 7: insert aaaTestBreak select 'matt birmingham' 
 8:  
 9: select 
 10: avg( cast(myColumn as real)) as AverageValue 
 11: from 
 12: aaaTestBreak 
 13: where 
 14: myColumn not like '%[a-z]%' 
 15: and myColumn > 0
 16:  
 17: drop table aaaTestBreak

In SQL 2000, the WHERE clause excluding the row 'matt birmingham' was excluded, allowing the line "myColumn > 0" and the CAST to work exactly how I expected.

SQL 2005 now will rearrange the computation in order to reduce redundant calculations and match indexes earlier in the calculation.  This is great except for the case above.  In my case, the "myColumn > 0" and CAST calls are no longer guaranteed to run in any particular order.

With a suggestion from Lorin Thwaits, I created some helper functions that will not only resolve the problem, but increase readability in my code.  You have to love those two-for-one solutions.

Here's one of the new functions called VarCharToReal:

 1: CREATE FUNCTION dbo.VarCharToReal
 2: (
 3: @VarCharValue as varchar(255)
 4: )
 5: RETURNS real
 6: AS
 7: BEGIN
 8: -- Declare the return variable here
 9: DECLARE @Result real
 10:  
 11: SELECT
 12: @Result = 
 13: CASE 
 14: WHEN @VarCharValue like '%[a-z]%' THEN NULL 
 15: WHEN @VarCharValue like '%[/\|<>,`~!@#$%^&*(){};:_+=]%' THEN NULL
 16: WHEN @VarCharValue like '%[[]%' THEN NULL 
 17: WHEN @VarCharValue like '%]%' THEN NULL 
 18: WHEN charindex('-',ltrim(@VarCharValue),0) > 1 THEN NULL 
 19: WHEN charindex('.',ltrim(rtrim(@VarCharValue)),1+charindex('.',ltrim(rtrim(@VarCharValue)))) <> 0 THEN NULL 
 20: ELSE 
 21: cast(@VarCharValue as real)
 22: END
 23:  
 24: -- Return the result of the function
 25: RETURN @Result
 26:  
 27: END

and here's what the call would look like for the top example:

 

 1: select 
 2: avg(dbo.VarCharToReal(myColumn)) as AverageValue 
 3: from 
 4: aaaTestBreak 
 5: where 
 6: dbo.VarCharToReal(myColumn) > 0

Using the function allows me to stop worrying about non-numeric values, and get to the real business of testing for validity and performing the aggregation.