The many faces of DISTINCT in Postgres (2017) by rsecora

Share This Article

Sed ut perspiciatis unde.

I started my programming career as an Oracle DBA. It took a few years but eventually I got fed up with the corporate world and I went about doing my own thing.

When I no longer had the comfy cushion of Oracle enterprise edition I discovered PostgreSQL. After I gotten over not having proper partitions and MERGE statement (aka UPSERT), I found some nice unique features in PostgreSQL. Oddly enough, a lot of them contained the word DISTINCT.

DISTINCT

I created a simple Employee table with name, department and salary using mock data from this site:

haki=# d employee

    Column   |         Type          | Modifiers
------------+-----------------------+-----------
    id         | integer               | not null
    name       | character varying(30) |
    department | character varying(30) |
    salary     | integer               |

haki=# select * from employee limit 5;

    id |      name      |      department      | salary
----+----------------+----------------------+--------
    1 | Carl Frazier   | Engineering          |   3052
    2 | Richard Fox    | Product Management   |  13449
    3 | Carolyn Carter | Engineering          |   8366
    4 | Benjamin Brown | Business Development |   7386
    5 | Diana Fisher   | Services             |  10419

What is DISTINCT?

SELECT DISTINCT eliminates duplicate rows from the result.

The simplest use of distinct is, for example, to get a unique list of
departments:

haki=# SELECT DISTINCT department FROM employee;

        department
--------------------------
    Services
    Support
    Training
    Accounting
    Business Development
    Marketing
    Product Management
    Human Resources
    Engineering
    Sales
    Research and Development
    Legal

(easy CS students, I know it’s not normalized…)

We can do the same thing with group by

SELECT department FROM employee GROUP BY department;

But we are talking about DISTINCT.

DISTINCT ON

A classic job interview question is finding the employee with the highest salary in each department.

This is what they teach in the university:

haki=# SELECT
    *
FROM
    employee
WHERE
    (department, salary) IN (
        SELECT
            department,
            MAX(salary)
        FROM
            employee
        GROUP BY
            department
    )
ORDER BY
    department;

 id |       name       |        department        | salary
----+------------------+--------------------------+--------
 30 | Sara Roberts     | Accounting               |  13845
  4 | Benjamin Brown   | Business Development     |   7386
  3 | Carolyn Carter   | Engineering              |   8366
 20 | Janet Hall       | Human Resources          |   2826
 14 | Chris Phillips   | Legal                    |   3706
 10 | James Cunningham | Legal                    |   3706
 11 | Richard Bradley  | Marketing                |  11272
  2 | Richard Fox      | Product Management       |  13449
 25 | Evelyn Rodriguez | Research and Development |  10628
 17 | Benjamin Carter  | Sales                    |   6197
 24 | Jessica Elliott  | Services                 |  14542
  7 | Bonnie Robertson | Support                  |  12674
  8 | Jean Bailey      | Training                 |  13230

Legal has two employees with the same high salary. Depending on the use case, this query can get pretty nasty.

If you graduated a while back, you already know a few things about databases and you heard about analytic and window functions, you might do this:

WITH ranked_employees AS (
    SELECT
        ROW_NUMBER() OVER (
           PARTITION BY department ORDER BY salary DESC
        ) AS rn,
        *
    FROM
        employee
)
SELECT
    *
FROM
    ranked_employees
WHERE
    rn = 1
ORDER BY
    department;

The result is the same without the duplicates:

 rn | id |       name       |        department        | salary
----+----+------------------+--------------------------+--------
  1 | 30 | Sara Roberts     | Accounting               |  13845
  1 |  4 | Benjamin Brown   | Business Development     |   7386
  1 |  3 | Carolyn Carter   | Engineering              |   8366
  1 | 20 | Janet Hall       | Human Resources          |   2826
  1 | 14 | Chris Phillips   | Legal                    |   3706
  1 | 11 | Richard Bradley  | Marketing                |  11272
...

Up until now, this is what I would have done.

Now for the real treat, PostgreSQL has a special nonstandard clause to find the first row in a group:

SELECT DISTINCT ON (department)
    *
FROM
    employee
ORDER BY
    department,
    salary DESC;

This is wild! Why nobody told me this is possible?

The docs

The many faces of DISTINCT in Postgres (2017) by rsecora

The many faces of DISTINCT in Postgres (2017) by rsecora

Share This Article

Newsletter

DISTINCT

What is DISTINCT?

DISTINCT ON

HackTech

Leave a comment Cancel reply

Editor's Choice

The many faces of DISTINCT in Postgres (2017) by rsecora

The many faces of DISTINCT in Postgres (2017) by rsecora

Share This Article

Newsletter

Leave a comment Cancel reply

Editor's Choice

Sign Up to Our Newsletter