An app that I've been maintaining for a few years was due for an upgrade from SQL 2000 to SQL 2005. One of the features of the app is the ability to collect surveys. Surveys, being what they are, have many types of answers, and the original solution to this was to create answer columns of a generic type VARCHAR. (OK, ok, don't kill me on this one. The deal was done on this architecture decision years ago, so there's no point rehashing it now; nor can it for reasons I won't get into here.)
The client came back after the original design and wanted to start performing aggregations and reports on the answers supplied, i.e. averages, standard deviations, etc. That was all fine and good until we started to realize that some of the responses weren't expected. E.g., How many times do you see the doctor per year? was answered "five", "5", "4-6", and "n/a."
An early decision was made to drop the undecipherable responses and only count true numbers, ignoring non-numeric answers via a WHERE clause, and casting the numeric representation as a decimal, integer, or some other real numeric object; this is known as an "unsafe expression." This solution worked like a champ for 3 years... until we migrated to SQL2005.
Here's a good example of what broke:
1: create table dbo.aaaTestBreak (
2: myColumn varchar(25)
3: )
4: insert aaaTestBreak select '25'
5: insert aaaTestBreak select '-10'
6: insert aaaTestBreak select '14.7'
7: insert aaaTestBreak select 'matt birmingham'
8:
9: select
10: avg( cast(myColumn as real)) as AverageValue
11: from
12: aaaTestBreak
13: where
14: myColumn not like '%[a-z]%'
15: and myColumn > 0
16:
17: drop table aaaTestBreak
In SQL 2000, the WHERE clause excluding the row 'matt birmingham' was excluded, allowing the line "myColumn > 0" and the CAST to work exactly how I expected.
SQL 2005 now will rearrange the computation in order to reduce redundant calculations and match indexes earlier in the calculation. This is great except for the case above. In my case, the "myColumn > 0" and CAST calls are no longer guaranteed to run in any particular order.
With a suggestion from Lorin Thwaits, I created some helper functions that will not only resolve the problem, but increase readability in my code. You have to love those two-for-one solutions.
Here's one of the new functions called VarCharToReal:
1: CREATE FUNCTION dbo.VarCharToReal
2: (
3: @VarCharValue as varchar(255)
4: )
5: RETURNS real
6: AS
7: BEGIN
8: -- Declare the return variable here
9: DECLARE @Result real
10:
11: SELECT
12: @Result =
13: CASE
14: WHEN @VarCharValue like '%[a-z]%' THEN NULL
15: WHEN @VarCharValue like '%[/\|<>,`~!@#$%^&*(){};:_+=]%' THEN NULL
16: WHEN @VarCharValue like '%[[]%' THEN NULL
17: WHEN @VarCharValue like '%]%' THEN NULL
18: WHEN charindex('-',ltrim(@VarCharValue),0) > 1 THEN NULL
19: WHEN charindex('.',ltrim(rtrim(@VarCharValue)),1+charindex('.',ltrim(rtrim(@VarCharValue)))) <> 0 THEN NULL
20: ELSE
21: cast(@VarCharValue as real)
22: END
23:
24: -- Return the result of the function
25: RETURN @Result
26:
27: END
and here's what the call would look like for the top example:
1: select
2: avg(dbo.VarCharToReal(myColumn)) as AverageValue
3: from
4: aaaTestBreak
5: where
6: dbo.VarCharToReal(myColumn) > 0
Using the function allows me to stop worrying about non-numeric values, and get to the real business of testing for validity and performing the aggregation.