Category Archives: Functions

Get Precise Age

It seems a simple thing: determine the age of a person. Given his/her birth date, and the current date, the maths are quite obvious and a human does the calculation with ease. However… it is surprising that software doesn’t often ship with native functions adequate to the task. Want to see what I mean? ūüôā

Let’s start with the basic DATEDIFF function, using year¬†as the datepart argument.

-- declare and initialize variables
DECLARE @birthdate DATE = '2000-07-31'
    , @currentdate DATE = '2016-01-01';
-- use datediff to calculate the difference
SELECT @birthdate AS 'birth'
    , @currentdate AS 'today'
    , DATEDIFF(YEAR, @birthdate, @currentdate) AS 'ageWithDateDiff';

Obviously this result does not satisfy the use case for which we employed the function. DATEDIFF returns the difference between the year values only, without any consideration of the month or day of either dates. The calculation indicates our person is 16 years old when, in fact, the person is 15 and has not yet reached his/her 16th birthday.

In response to this result I’ve often seen coded what I consider a brute-force approach to precision — maintaining the conceptual trajectory but applying it at a more granular level. Specifically, the DATEDIFF function is still employed but used with days instead of years, the result is divided by 365 (days per year) to get a fractional year value, and the result is rounded down to the nearest integer to get the age.

This approach works for the previous example:

-- declare and initialize variables
DECLARE @birthdate DATE = '2000-07-31'
    , @currentdate DATE = '2016-01-01';
-- use days to calculate the difference
SELECT @birthdate AS 'birth'
    , @currentdate AS 'today'
    , FLOOR(DATEDIFF(DAY, @birthdate, @currentdate) / 365.0) AS 'ageWith365Days';

So far so good! But the approach does not work in every case, for example:

-- declare and initialize variables
DECLARE @birthdate DATE = '2000-02-29'
    , @currentdate DATE = '2016-02-28';
-- use days to calculate the difference
SELECT @birthdate AS 'birth'
    , @currentdate AS 'today'
    , FLOOR(DATEDIFF(DAY, @birthdate, @currentdate) / 365.0) AS 'ageWith365Days';

As can be seen above, the “age calculation using days per year” method fails when a leap day is in play. But brute-force hasn’t given up just yet: let’s use 365.25 days per year to account for the leap day!

-- declare and initialize variables
DECLARE @birthdate DATE = '2000-02-29'
    , @currentdate DATE = '2016-02-28';
-- use average days per year to calculate the difference
SELECT @birthdate AS 'birth'
    , @currentdate AS 'today'
    , FLOOR(DATEDIFF(DAY, @birthdate, @currentdate) / 365.25) AS 'ageWithLeapDays';

At this point it looks like the challenge¬†is solved. However… there is still a problem; for example:

-- declare and initialize variables
DECLARE @birthdate DATE = '2001-02-28'
    , @currentdate DATE = '2016-02-28';
-- use average days per year to calculate the difference
SELECT @birthdate AS 'birth'
    , @currentdate AS 'today'
    , FLOOR(DATEDIFF(DAY, @birthdate, @currentdate) / 365.25) AS 'ageWithLeapDays';

The “365.25” days method calculates that the person is 14. In this case, however, the current date is actually the person’s 15th birthday. In other words, the calculation fails.

Is there anything else for the brute-force approach to try? Instead… let’s attempt to code the underlying rules for calculating a person’s age (or any event anniversary).

Subtracting the years is actually a very good place to start because, if the “current” month-day is on-or-after the “birth” (event) month-day, then the raw year calculation is correct. And if the “birth” month-day is before the “current” month-day, then we subtract one from the raw year calculation.

As it turns out, this comparison of month-days is quite easy; we can simply use the¬†DATEDIFF function with the DAYOFYEAR datepart argument! Of course, we still need to accommodate leap days. If we don’t consider leap days, then we will¬†find that 59th day of the year is “normally” 01-Mar, but it’s 29-Feb on a leap year.

The good news is that accounting for leap days¬†is not as complicated as it might at first seem. We can still use DAYOFYEAR as the DATEDIFF argument, but decrement the result when dealing with a leap year on either the event or current date. (Note that we don’t need to worry about it if both event and current dates are in a leap year or neither are in a leap year — in those cases, the years have the same number of days so a straight-compare works correctly.)

CREATE FUNCTION dbo.ufnGetYearDiff (@start DATE, @end DATE)
RETURNS SMALLINT
AS
BEGIN
    DECLARE @return INT
        , @s SMALLINT = DATEPART(DAYOFYEAR, @start)
        , @e SMALLINT = DATEPART(DAYOFYEAR, @end);
    -- if necessary, decrement the start date dayofyear
    IF YEAR(@start) % 4 = 0 AND YEAR(@end) % 4 != 0 AND @s > 59 SET @s -= 1;
    -- if necessary, decrement the end date dayofyear
    IF YEAR(@start) % 4 != 0 AND YEAR(@end) % 4 = 0 AND @e > 59 SET @e -= 1;
    -- decrement the year diff if the end is earlier in the year than the start
    SET @return = DATEDIFF(YEAR, @start, @end)
        - CASE WHEN @e < @s THEN 1 ELSE 0 END;
    RETURN @return;
END
GO

Let’s test it out and see if we have solved the problem.

-- declare and initialize variables
DECLARE @birthdate DATE = '2001-02-28'
    , @currentdate DATE = '2016-02-28';
-- use the custom function to calculate the age
SELECT @birthdate AS 'birth'
    , @currentdate AS 'today'
    , dbo.ufnGetYearDiff(@birthdate, @currentdate) AS 'ageFn';

Success! Does anyone have a more elegant way to solve this challenge?

PS: Before anyone yells at me, the BOL documentation for DATEDIFF very clearly states what the function does: it returns the number of specified boundaries crossed between the startdate and enddate. Nonetheless, the function doesn’t do what the name implies (or, more precisely, what many people assume).

Descriptive Statistics – Median

Yesterday we lamented the absence of two of the basic descriptive statistics aggregate functions: mode and median. We then demonstrated how to build your own mode function. Today we’ll show how to create a median function.

As with our mode function, we’ll pass our record set of values to the function as XML with a node named “value”. The function will return a decimal value; however, the return is easily modified to meet your needs.

-- Set the database context
USE [master];
GO

-- If the function already exists, drop it
IF EXISTS (
    SELECT *
    FROM sys.objects
    WHERE
        [object_id] = OBJECT_ID(N'ufnMedian')
        AND [type] IN ('FN', 'IF', 'TF', 'FS', 'FT'))
DROP FUNCTION ufnMedian;
GO

-- Configure environmental settings
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO

-- Create the function
CREATE FUNCTION ufnMedian (
    @xml XML -- XML containing a single node named "value"
    )
RETURNS DECIMAL(16,4) -- decimal return; redefine as needed
AS
BEGIN
    -- Declare variables
    DECLARE
        @decReturn DECIMAL(16,4)
        , @n INT = (SELECT @xml.value('count(/row/value)','int'))
        , @x INT
    ;
    -- Check to see if there is an even or odd number of records
    IF @n % 2 = 0 -- Even; average the two middle values
    BEGIN
        -- Find the index of the lower middle value
        SET @x = @n / 2.0;
        -- Select the average of the lower and next highest values
        SELECT @decReturn = AVG(value)
        FROM (
                SELECT
                    'value' = tbl.col.value('value[1]','float')
                    , 'n' = ROW_NUMBER()
                        OVER(ORDER BY tbl.col.value('value[1]','float'))
                FROM @xml.nodes('row') tbl(col)
            ) AS dTbl
        WHERE n = @x OR n = @x + 1;
    END
    ELSE -- Odd; return the middle value
    BEGIN
        -- Find the index of the middle value
        SET @x = CEILING(@n / 2.0 );
        -- Select the middle value
        SELECT @decReturn = value
        FROM (
                SELECT
                    'value' = tbl.col.value('value[1]','float')
                    , 'n' = ROW_NUMBER()
                        OVER(ORDER BY tbl.col.value('value[1]','float'))
                FROM @xml.nodes('row') tbl(col)
            ) AS dTbl
        WHERE n = @x;
    END
    -- Return the result of the function
    RETURN @decReturn;
END
GO

To call our median function, simply convert your values to XML and then pass it to the function. For example:

-- Pass the XML to the function
SELECT [master].dbo.ufnMedian(dTbl.x)
FROM (
        SELECT value
        FROM tbl
        FOR XML PATH
    ) AS dTbl(x)
;

Note that the function assumes that the XML node is named “value”; thus you should alias your input values as such when you convert to XML.

Can anyone suggest a more elegant way to calculate median in TSQL? Please send in your comments!

Descriptive Statistics – Mode

For those who have attempted rudimentary analysis through TSQL, I am confident that you found a conspicuous absence in the built-in aggregate functions. Specifically, where are the mode and median functions? The mean, summation, and count functions exist, and even standard deviation and variance are there, in both sample and population flavors. But where are median and mode?!

It would be great to be able to define the two missing statistical aggregates as TSQL aggregate functions; unfortunately, that is not permitted. Only Common Language Runtime (CLR) assemblies are allowed to be a function of the aggregate type. However, you can create a scalar value TSQL function that will perform the calculation with roughly the same usage.

Carrying on the XML theme from recent posts, our mode function will accept an array of values as XML. ¬†We’ll then shred it and aggregate it, returning a string result. Why a string? Because mode can return multiple values when there is a tie. After all, in a uniform distribution, each value is equally probable!

Behold, the mode function in TSQL:

-- Set the database context
USE [master];
GO

-- If the function already exists, drop it
IF EXISTS (
    SELECT *
    FROM sys.objects
    WHERE
        [object_id] = OBJECT_ID(N'ufnMode')
        AND [type] IN ('FN', 'IF', 'TF', 'FS', 'FT'))
DROP FUNCTION ufnMode;
GO

-- Configure environmental settings
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO

-- Create the function
CREATE FUNCTION ufnMode (
    @xml XML -- XML containing a single node named "value"
    )
RETURNS VARCHAR(200) -- string return because mode could be multiple values
AS
BEGIN
    -- Declare variables
    DECLARE @strReturn VARCHAR(200);
    -- Calculate the mode
    SELECT @strReturn = (
            -- Concatenate the value(s) to a string
            SELECT CONVERT(NVARCHAR(20), dTbl2.value) + '; ' AS [text()]
            FROM (
                    -- Find the most common value(s)
                    SELECT TOP 1 WITH TIES dTbl.value
                    FROM (
                            -- Shred the XML to a table
                            SELECT 'value' =
                                tbl.col.value('value[1]','float')
                            FROM @xml.nodes('row') tbl(col)
                        ) AS dTbl
                    GROUP BY dTbl.value
                    ORDER BY COUNT(*) DESC
                ) AS dTbl2
            ORDER BY dTbl2.value
            FOR XML PATH('')
        )
    ;
    -- Remove the trailing value delimiter
    SET @strReturn = LEFT(@strReturn, LEN(@strReturn) - 1);
    -- Return the result of the function
    RETURN @strReturn;
END
GO

To call the function, simply convert your values to XML and then pass it to the function. For example:

-- Pass the XML to the function
SELECT [master].dbo.ufnMode(dTbl.x)
FROM (
        SELECT value
        FROM tbl
        FOR XML PATH
    ) AS dTbl(x)
;

Note that the function assumes that the XML node is named “value”; thus you should alias your input values as such when you convert to XML. Alternately, you could extend the function to accept any name for the node.

Can anyone suggest a more elegant way to calculate mode in TSQL? Please send in your comments!

And what about the median function? Stay tuned for the next post! ūüėČ

Convert Rows to Column (Part II)

In a previous post we demonstrated how to use the FOR XML clause to concatenate multiple rows of one data field into a single row of one data field. While that technique is quite handy, what about when there are multiple fields to be concatenated? In other words, what if the native data has a one-to-many relationship but we need to return a one-to-one record set without losing any of the fields or data?

One possible solution is to simply extend the single-field example and create an additional CROSS APPLY block for each additional field. While not an elegant solution, it does work. Before we explore a more modular solution, let’s refresh our memories on what this first solution would look like.

Before we dig in, let’s create some objects for demonstration purposes.

-- Set the database context
USE [db];
GO

-- If the demo table already exists, drop it
IF OBJECT_ID(N'tbl', N'U') IS NOT NULL
    DROP TABLE tbl;
GO

-- Create the demo table
CREATE TABLE tbl (
    ID INT IDENTITY(1,1) NOT NULL
    , FK INT NOT NULL
    , Value VARCHAR(20) NOT NULL
    , CONSTRAINT PK_tbl PRIMARY KEY CLUSTERED (ID)
    )
;
GO

-- Populate the demo table
INSERT INTO tbl (FK, Value)
VALUES
    (1, 'One')
    , (2, 'Two')
    , (2, 'Three')
;
GO

We’ll also make use of our numbers function to simulate the “one” side of our “one-to-many” data set. The number data will join to the foreign key (FK) we created in the demo table, which serves as our “many” side. Wait — you don’t have a “numbers” function? Get one here.

Okay, so with our environment ready, let’s demonstrate concatenation of multiple rows across multiple fields using FOR XML and CROSS APPLY.

-- First approach; effective but neither scalable nor elegant
SELECT
    nums.n
    , dTbl.ID
    , dTbl2.Value
FROM
    [master].dbo.utfNumbers(3) AS nums
    -- Join to concatenate the first field (ID)
    CROSS APPLY (
        SELECT CONVERT(VARCHAR(20), ID) + '; ' AS [text()]
        FROM tbl
        WHERE tbl.FK = nums.n
        FOR XML PATH('')
    ) AS dTbl(ID)
    -- Join to concatenate the second field (Value)
    CROSS APPLY (
        SELECT Value + '; ' AS [text()]
        FROM tbl
        WHERE tbl.FK = nums.n
        FOR XML PATH('')
    ) AS dTbl2(Value)
;

We are able to return the results we expect; however, there are two major issues. One, every additional field to return requires us to add an additional CROSS APPLY block; and two, every additional CROSS APPLY block causes an additional query/join to the data. If we could use just one CROSS APPLY block for all the columns, we could save both ourselves and the database engine some work.

And, of course, we can do just that. Again we will use the capabilities of XML to do the job. This time, rather than returning the concatenated result of a single field in the CROSS APPLY, we will return the entire record set (all rows and all columns) to the outer query in the form of a single XML value. With the XML value at the top query, we can then parse out the individual fields (nodes) in the outer SELECT statement.

Yes, that was a lot of words. But, I trust that they will make more sense once you see the example.

-- Second approach; a function makes it scalable
SELECT
    nums.n
    , 'IDs' = [master].dbo.ufnXMLConcat(dTbl.x, 'ID')
    , 'Values' = [master].dbo.ufnXMLConcat(dTbl.x, 'Value')
FROM
    [master].dbo.utfNumbers(3) AS nums
    -- Join to all of the child data, returned as XML
    CROSS APPLY (
        SELECT ID, Value
        FROM tbl
        WHERE tbl.FK = nums.n
        FOR XML PATH
    ) AS dTbl(x)
;

Obviously, this syntax is preferable. It is not only easier to author but it also requires fewer data touches for the SQL engine to resolve.

But wait… what is that ufnXMLConcat function? Why it is right here, of course! ūüėČ

-- Set the database context
USE [master];
GO

-- If the function already exists, drop it
IF EXISTS (
    SELECT *
    FROM sys.objects
    WHERE
        [object_id] = OBJECT_ID(N'ufnXMLConcat')
        AND [type] IN ('FN', 'IF', 'TF', 'FS', 'FT'))
DROP FUNCTION ufnXMLConcat;
GO

-- Configure environmental settings
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO

-- Create the function
CREATE FUNCTION ufnXMLConcat (
    @xml XML
    , @strNode VARCHAR(50)
    )
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- Declare variables
    DECLARE
        @strReturn VARCHAR(MAX) = ''
        , @strXML VARCHAR(MAX) = CONVERT(VARCHAR(MAX), @xml)
    ;
    -- Return default if XML is null
    IF @strXML IS NULL RETURN @strReturn;
    -- Parse the XML string for the node
    WHILE CHARINDEX('<' + @strNode + '>', @strXML) != 0
    BEGIN
        -- Extract and concatenate the node value
        SET @strReturn += SUBSTRING(
                @strXML
                , CHARINDEX('<' + @strNode + '>', @strXML)
                    + LEN('<' + @strNode + '>')
                , CHARINDEX('</' + @strNode + '>', @strXML)
                    - CHARINDEX('<' + @strNode + '>', @strXML)
                    - LEN('<' + @strNode + '>')
                )
            + '; '
        ;
        -- Remove the node from the XML string
        SET @strXML = SUBSTRING(
                @strXML
                , CHARINDEX('</' + @strNode + '>', @strXML)
                    + LEN('</' + @strNode + '>')
                , LEN(@strXML)
                    - CHARINDEX('</' + @strNode + '>', @strXML)
                    - LEN('</' + @strNode + '>')
            )
        ;
    END
    -- Remove the trailing value delimiter
    SET @strReturn = LEFT(@strReturn, LEN(@strReturn) - 1);
    -- Return the result of the function
    RETURN @strReturn;
END
GO

As you can see, by accepting an XML value and a string node name as input, simple string functions take care of the rest. There are, of course, even more elegant ways to parse XML; but more complex programming is not necessarily permitted within SQL functions.

Although the second example provides a more elegant solution, it should be noted that even this solution does not scale perfectly. For example, if the outer query returns millions of rows and/or the inner query contains many rows or many columns, there could definitely be performance problems. After all, executing functions within a SELECT statement is asking for trouble when the record count gets high.

Does anyone have a more elegant way of handling this business case? If so, please share via a comment.

Special thanks to my colleague, Bryan Clayton, for his assistance in defining the business case and working through alternate methods to parse XML. Look forward to some of those alternatives in future posts!

As ever, don’t forget to clean up!

-- Drop the demo table
IF OBJECT_ID(N'tbl', N'U') IS NOT NULL
    DROP TABLE tbl;

Split Delimited String

Whether you like it or not, if you develop applications, at some point data will be collected as a delimited string and you’ll need to parse it within the database. While a database developer might prefer to work with sets or arrays, delimited strings are a fact of life; deal with it.

There are few good ways of taking a delimited string and breaking it up into discrete values, but, depending on the version of SQL you are using, some are better than others.¬†If you’re running SQL 2005 or later, then you are in luck: your version supports the XML object type and you can make use of one of the most elegant ways of splitting a delimited string.

This method has been around for a while, but, nevertheless, it is one of my favorite utility functions and invariably finds its way into the model or master database on any server I have occasion to develop on.

-- Create a user-defined, table-valued function
CREATE FUNCTION utfSplitString (
    @nstrValues NVARCHAR(MAX) -- the delimited string of values
    , @strDelimiter CHAR(1) -- the delimiting character
    )
RETURNS @tblReturn TABLE (
    Value NVARCHAR(MAX)
    )
AS
BEGIN

    -- Convert the values to XML, replacing the delimiter with a node break
    DECLARE @x XML = CONVERT(
        XML
        , '<root><n>'
            + REPLACE(@nstrValues, @strDelimiter, '</n><n>')
            + '</n></root>'
        )
    ;

    -- Parse the XML into the return table using native XML methods
    INSERT INTO @tblReturn (Value)
    SELECT x.v.value('.', 'VARCHAR(MAX)')
    FROM @x.nodes('/root/n') x(v)
    ;

    -- Return the table records
    RETURN;

END

With the function created, you can now use it to break a delimited string into values. Simply pass in the string and the delimiter and off you go. As follows:

-- Use the function to break a string into values
SELECT Value
FROM utfSplitString(N'Hello,World,!', ',')
;

Does anyone have a more elegant way to split a delimited string? If so, please share it as a comment.

utfNumbers

A helpful function borrowed from Itzik Ben-Gan; a table of numbers comes in very handy, especially when it takes virtually no disk space.

-- Create a table-valued function
CREATE FUNCTION
 dbo.utfNumbers (
      @n AS BIGINT
)
RETURNS TABLE
AS
RETURN
WITH
      L0 AS (SELECT 1 AS c UNION ALL SELECT 1)
      , L1 AS (SELECT 1 AS c FROM L0 AS a, L0 AS b)
      , L2 AS (SELECT 1 AS c FROM L1 AS a, L1 AS b)
      , L3 AS (SELECT 1 AS c FROM L2 AS a, L2 AS b)
      , L4 AS (SELECT 1 AS c FROM L3 AS a, L3 AS b)
      , L5 AS (SELECT 1 AS c FROM L4 AS a, L4 AS b)
      , Nums AS (
            SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n
            FROM L5)
      SELECT TOP(@n) n FROM Nums ORDER BY n
;

To use the function, simply select ‘n’ from the function as though it was a regular table, passing in as a parameter the total number of numbers you want returned. The example below will return 100 rows containing one column of values from one to 100.

-- Display content of the function
SELECT
n
FROM dbo.utfNumbers(100)
;