Tag Archives: analysis

Descriptive Statistics – Median

Yesterday we lamented the absence of two of the basic descriptive statistics aggregate functions: mode and median. We then demonstrated how to build your own mode function. Today we’ll show how to create a median function.

As with our mode function, we’ll pass our record set of values to the function as XML with a node named “value”. The function will return a decimal value; however, the return is easily modified to meet your needs.

-- Set the database context
USE [master];
GO

-- If the function already exists, drop it
IF EXISTS (
    SELECT *
    FROM sys.objects
    WHERE
        [object_id] = OBJECT_ID(N'ufnMedian')
        AND [type] IN ('FN', 'IF', 'TF', 'FS', 'FT'))
DROP FUNCTION ufnMedian;
GO

-- Configure environmental settings
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO

-- Create the function
CREATE FUNCTION ufnMedian (
    @xml XML -- XML containing a single node named "value"
    )
RETURNS DECIMAL(16,4) -- decimal return; redefine as needed
AS
BEGIN
    -- Declare variables
    DECLARE
        @decReturn DECIMAL(16,4)
        , @n INT = (SELECT @xml.value('count(/row/value)','int'))
        , @x INT
    ;
    -- Check to see if there is an even or odd number of records
    IF @n % 2 = 0 -- Even; average the two middle values
    BEGIN
        -- Find the index of the lower middle value
        SET @x = @n / 2.0;
        -- Select the average of the lower and next highest values
        SELECT @decReturn = AVG(value)
        FROM (
                SELECT
                    'value' = tbl.col.value('value[1]','float')
                    , 'n' = ROW_NUMBER()
                        OVER(ORDER BY tbl.col.value('value[1]','float'))
                FROM @xml.nodes('row') tbl(col)
            ) AS dTbl
        WHERE n = @x OR n = @x + 1;
    END
    ELSE -- Odd; return the middle value
    BEGIN
        -- Find the index of the middle value
        SET @x = CEILING(@n / 2.0 );
        -- Select the middle value
        SELECT @decReturn = value
        FROM (
                SELECT
                    'value' = tbl.col.value('value[1]','float')
                    , 'n' = ROW_NUMBER()
                        OVER(ORDER BY tbl.col.value('value[1]','float'))
                FROM @xml.nodes('row') tbl(col)
            ) AS dTbl
        WHERE n = @x;
    END
    -- Return the result of the function
    RETURN @decReturn;
END
GO

To call our median function, simply convert your values to XML and then pass it to the function. For example:

-- Pass the XML to the function
SELECT [master].dbo.ufnMedian(dTbl.x)
FROM (
        SELECT value
        FROM tbl
        FOR XML PATH
    ) AS dTbl(x)
;

Note that the function assumes that the XML node is named “value”; thus you should alias your input values as such when you convert to XML.

Can anyone suggest a more elegant way to calculate median in TSQL? Please send in your comments!

Descriptive Statistics – Mode

For those who have attempted rudimentary analysis through TSQL, I am confident that you found a conspicuous absence in the built-in aggregate functions. Specifically, where are the mode and median functions? The mean, summation, and count functions exist, and even standard deviation and variance are there, in both sample and population flavors. But where are median and mode?!

It would be great to be able to define the two missing statistical aggregates as TSQL aggregate functions; unfortunately, that is not permitted. Only Common Language Runtime (CLR) assemblies are allowed to be a function of the aggregate type. However, you can create a scalar value TSQL function that will perform the calculation with roughly the same usage.

Carrying on the XML theme from recent posts, our mode function will accept an array of values as XML.  We’ll then shred it and aggregate it, returning a string result. Why a string? Because mode can return multiple values when there is a tie. After all, in a uniform distribution, each value is equally probable!

Behold, the mode function in TSQL:

-- Set the database context
USE [master];
GO

-- If the function already exists, drop it
IF EXISTS (
    SELECT *
    FROM sys.objects
    WHERE
        [object_id] = OBJECT_ID(N'ufnMode')
        AND [type] IN ('FN', 'IF', 'TF', 'FS', 'FT'))
DROP FUNCTION ufnMode;
GO

-- Configure environmental settings
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO

-- Create the function
CREATE FUNCTION ufnMode (
    @xml XML -- XML containing a single node named "value"
    )
RETURNS VARCHAR(200) -- string return because mode could be multiple values
AS
BEGIN
    -- Declare variables
    DECLARE @strReturn VARCHAR(200);
    -- Calculate the mode
    SELECT @strReturn = (
            -- Concatenate the value(s) to a string
            SELECT CONVERT(NVARCHAR(20), dTbl2.value) + '; ' AS [text()]
            FROM (
                    -- Find the most common value(s)
                    SELECT TOP 1 WITH TIES dTbl.value
                    FROM (
                            -- Shred the XML to a table
                            SELECT 'value' =
                                tbl.col.value('value[1]','float')
                            FROM @xml.nodes('row') tbl(col)
                        ) AS dTbl
                    GROUP BY dTbl.value
                    ORDER BY COUNT(*) DESC
                ) AS dTbl2
            ORDER BY dTbl2.value
            FOR XML PATH('')
        )
    ;
    -- Remove the trailing value delimiter
    SET @strReturn = LEFT(@strReturn, LEN(@strReturn) - 1);
    -- Return the result of the function
    RETURN @strReturn;
END
GO

To call the function, simply convert your values to XML and then pass it to the function. For example:

-- Pass the XML to the function
SELECT [master].dbo.ufnMode(dTbl.x)
FROM (
        SELECT value
        FROM tbl
        FOR XML PATH
    ) AS dTbl(x)
;

Note that the function assumes that the XML node is named “value”; thus you should alias your input values as such when you convert to XML. Alternately, you could extend the function to accept any name for the node.

Can anyone suggest a more elegant way to calculate mode in TSQL? Please send in your comments!

And what about the median function? Stay tuned for the next post! 😉