HYDROLIX BLOG

Ponderings, insights and industry updates

Handling Multiple Groupings in a Single Pass

June 20, 2022

Author: Federico Rodriguez |

Tags: , ,

The Hydrolix platform has been specifically designed using a number of characteristics that optimize storage for VERY large, high cardinality datasets all while using a distributed storage medium (such as S3, Google Cloud Storage etc). The ability to store more data for longer means that some traditional assumptions made when considering OLTP databases (Postgres, MySQL) can lead to sub-optimal query execution and there can be significantly faster and more effective ways to retrieve information.

As an example GROUP BY s are often used in building queries within OLTP databases with multiple executed to filter and aggregate down a set of rows to the data required. This can be a good ploy with smaller datasets, however there are ways that can be employed within the Hydrolix platform to optimize these queries, so they are executed in a single pass, making them much faster and significantly more efficient. To illustrate this the following tutorial is provided:

The Dataset

We’ll start by building a Temporary Table to build a sample dataset we can play around with.

In this tutorial we will:

  • Build a query on this sample table that groups elements in two different ways, and delivers a final report.
  • Tune this query for significant speed and memory usage gains when run on Hydrolix.

Feel free to run these queries yourself on your own Hydrolix cluster (I recommend using the clickhouse-client, quick installation instructions here).

NOTE: The Temporary table will be removed as soon as your session is completed.

CREATE TEMPORARY TABLE page_hits (
    timestamp DateTime,
    country Nullable(String),
    user_agent Nullable(String),
    session_id Nullable(String),
    isp Nullable(String),
    region Nullable(String)
);
INSERT INTO page_hits VALUES ('2022-06-16 20:34:29','US','...Chrome/65.0.3325.181...','123','Verizon','Region-East'),('2022-06-16 20:34:30','US','...Chrome/65.0.3325.181...','123','Verizon','Region-East'),('2022-06-16 20:34:31','US','...Chrome/65.0.3325.181...','123','Verizon','Region-East'),('2022-06-16 20:34:32','US','...Chrome/65.0.3325.181...','234','Verizon','Region-East'),('2022-06-16 20:34:33','US','...Safari/537.36...','345','AT&T','Region-East'),('2022-06-16 20:34:29','FR','...Chrome/65.0.3325.181...','890','AT&T','Region-West')('2022-06-16 20:34:34','FR','...Firefox/54.0...','456','Spectrum','Region-West'),('2022-06-16 20:34:32','FR','...Firefox/54.0...','456','Spectrum','Region-West'),('2022-06-16 20:34:34','UK','...Safari/537.36...','567','Spectrum','Region-West'), ('2022-06-16 20:35:31','US','...Chrome/65.0.3325.181...','231','Verizon','Region-East'),('2022-06-16 20:35:32','US','...Chrome/65.0.3325.181...','214','Verizon', 'Region-East'),('2022-06-16 20:35:33','US','...Safari/537.36...','341','AT&T','Region-West'),('2022-06-16 20:35:34','FR','...Firefox/54.0...','461','Spectrum','Region-West'),('2022-06-16 20:34:29','US','...Chrome/65.0.3325.181...','012','Spectrum','Region-West'),('2022-06-16 20:35:34','UK','...Safari/537.36...','157','Xfinity','Region-West'),('2022-06-16 20:35:34','US','...Safari/537.36...','177','Xfinity','Region-West'),('2022-06-16 20:35:40','LB','...Safari/537.36...','159','UncommonISP','Region-West'), ('2022-06-16 20:35:34','CH','...Safari/537.36...','159','VeryUncommonISP','Region-West'),  ('2022-06-16 20:35:34','CH','...Safari/537.36...','159','VeryUncommonISP','Region-West');

Each event in this table is a page hit to website. Every hit has a certain session id, ISP (Internet Service Provider), user agent, country and availability region associated with it.

TimestampCountryuser_agentsession_idispregion
2022-06-16 20:34:29US…Chrome/65.0.3325.181…123VerizonEast
2022-06-16 20:34:30US…Chrome/65.0.3325.181…123VerizonEast
2022-06-16 20:34:31US…Chrome/65.0.3325.181…123VerizonEast
2022-06-16 20:34:32US…Chrome/65.0.3325.181…234VerizonEast
2022-06-16 20:34:33US…Safari/537.36…345AT&TEast
2022-06-16 20:34:29FR…Chrome/65.0.3325.181…890AT&TWest
2022-06-16 20:34:34FR…Firefox/54.0…456SpectrumWest
2022-06-16 20:34:32FR…Firefox/54.0…456SpectrumWest
2022-06-16 20:34:34UK…Safari/537.36…567SpectrumWest
2022-06-16 20:35:31US…Chrome/65.0.3325.181…231VerizonEast
2022-06-16 20:35:32US…Chrome/65.0.3325.181…214VerizonEast
2022-06-16 20:35:33US…Safari/537.36…341AT&TWest
2022-06-16 20:35:34FR…Firefox/54.0…461SpectrumWest
2022-06-16 20:34:29US…Chrome/65.0.3325.181…012SpectrumWest
2022-06-16 20:35:34UK…Safari/537.36…157XfinityWest
2022-06-16 20:35:34US…Safari/537.36…177XfinityWest
2022-06-16 20:35:40LB…Safari/537.36…159UncommonISPWest
2022-06-16 20:35:34CH…Safari/537.36…159VeryUncommonISPWest
2022-06-16 20:35:34CH…Safari/537.36…159VeryUncommonISPWest

The problem

Imagine you’ve been asked to come up with the following report:

  • For every ISP, report the percentage of page hits from unique session ids that occur outside of the US
  • Report the popularity (% page hits) of the browsers Safari, Chrome, and Firefox, per ISP, availability region.
  • Only return results for ISPs that contain more than 1 unique session

Notice from the table that:

  1. Multiple hits can come from the same session (session is not unique).
  2. Page hits can exist in one of two availability regions East, or West.
  3. UncommonISP and VeryUncommonISP are the only two ISPs with a single unique session id, and should be filtered out from our final result set.

The First Approach

We will try to build a pipeline, with a separate query for each part of the report.

Get list of ISPs w/ more than 1 unique session

SELECT isp
FROM page_hits
GROUP BY isp
HAVING uniq(session_id) > 1

Get Percentage of Unique, Outside-US Sessions Per-ISP

SELECT
    arraySum(non_us_uniq_sessions) / total_uniq_per_isp AS outside_us_hits_pct,
    isp
FROM
(
    SELECT
        sum(per_country_uniq) AS total_uniq_per_isp,
        groupArray(country) AS countries,
        groupArray(per_country_uniq) AS uniq_sessions,
        arrayFilter((s,c) -> c != 'US', uniq_sessions, countries) AS non_us_uniq_sessions,
        isp
    FROM
    (
        SELECT
            uniq(session_id) AS per_country_uniq,
            country,
            isp
        FROM page_hits
        GROUP BY
            isp,
            country
    )
    GROUP BY isp
)
outside_us_hits_pctisp
1VeryUncommonISP
0Verizon
0.5Xfinity
1UncommonISP
0.3333333333333AT&T
0.75Spectrum

This query does the following:

  1. Starting at the innermost query, get unique sessions for every country and isp
  2. Group countries and unique-sessions-per-country into arrays, for each isp
  3. Use the multi-argument version of arrayFilter to filter one array by another. Every time an element in the countries array is != US, the corresponding unique session count is omitted from the uniq_session array.
  4. In the outermost query, sum up the unique sessions array values to get total for non-us countries, and divide by all unique for ISP to find percentage.

Get Browser Popularity by ISP, Region

SELECT 
    countIf(position(user_agent, 'Chrome') != 0) / count() as chrome_pct,
    countIf(position(user_agent, 'Safari') != 0) / count() as safari_pct,
    countIf(position(user_agent, 'Firefox') != 0) / count() as firefox_pct,
    isp,
    region
FROM page_hits
GROUP BY isp, region
chrome_pctsafari_pctfirefox_pctispregion
0.50.50AT&TWest
010XfinityWest
0.20.20.6SpectrumWest
100VerizonEast
010UncommonISPWest
010AT&TEast
010VeryUncommonISPWest

Uses countIf function to only count sessions for an ISP and region that contain specific browser strings in their user_agent

Putting it together using CTE

WITH 
    filtered_isps AS (
        SELECT isp
        FROM page_hits
        GROUP BY isp
        HAVING uniq(session_id) > 1
    )
    , outside_us_hits AS (
        SELECT
            arraySum(non_us_uniq_sessions) / total_uniq_per_isp AS outside_us_hits_pct,
            isp
        FROM
        (
            SELECT
                sum(per_country_uniq) AS total_uniq_per_isp,
                groupArray(country) AS countries,
                groupArray(per_country_uniq) AS uniq_sessions,
                arrayFilter((s,c) -> c != 'US', uniq_sessions, countries) AS non_us_uniq_sessions,
                isp
            FROM
            (
                SELECT
                    uniq(session_id) AS per_country_uniq,
                    country,
                    isp
                FROM page_hits
                GROUP BY
                    isp,
                    country
            )
            GROUP BY isp
        ) 
    )
    , browser_popularity AS (
        SELECT 
            countIf(position(user_agent, 'Chrome') != 0) / count() as chrome_pct,
            countIf(position(user_agent, 'Safari') != 0) / count() as safari_pct,
            countIf(position(user_agent, 'Firefox') != 0) / count() as firefox_pct,
            isp,
            region
        FROM page_hits
        GROUP BY isp, region
    )
SELECT 
    chrome_pct,
    safari_pct,
    firefox_pct,
    isp,
    region,
    outside_us_hits_pct
FROM (
    SELECT * FROM (
        SELECT * FROM outside_us_hits
        INNER JOIN (select * from browser_popularity)
        USING (isp)
    ) INNER JOIN (select isp from filtered_isps)
    USING (isp)
)
chrome_pctsafari_pctfirefox_pctispregionoutside_us_hits_pct
100VerizonEast0
010XfinityWest0.5
0.50.50AT&TWest0.333333333
010AT&TEast0.333333333
0.20.20.6SpectrumWest0.75

The Hydrolix-Friendly Approach

We will rewrite that above query in a way that

  • Avoids JOINs
  • Avoids multiple trips to grab data
  • Applies functions on columns

Percentage of Unique, Non-US Sessions Per-ISP Using uniqIf

SELECT
    uniqIf(session_id, country != 'US') / uniq(session_id) AS outside_us_hits_pct,
    isp
FROM page_hits
GROUP BY isp

Here we’ve rewritten the previous 23-line version of this query to just a few lines, using very handy ClickHouse feature: Aggregation Combinators

Many of you may have used countIf, but might not know that any ClickHouse aggregation function can include an -If suffix to only perform the function if the passed-in predicate is true.

We are thus able to both filter and aggregate the column session_id, in a single function call. Notice that this cannot be done by adding WHERE country != 'US' — we then lose the ability to grab all unique sessions for isp in the same pass.

tip:

Whenever you find yourself using a ClickHouse aggregation function inside a subquery, ask yourself if adding -If would get rid of complexity, and perhaps the subquery altogether.

Browser Popularity + Percent-Outside-US Queries in the Same Pass with ARRAY JOIN

WITH
    outside_us_hits AS (
        SELECT
            uniqIf(session_id, country != 'US') / uniq(session_id) AS outside_us_hits_pct,
            groupArray(user_agent) as user_agents,
            groupArray(region) as regions,
            isp
        FROM page_hits
        GROUP BY isp
        HAVING uniq(session_id) > 1
    )
SELECT 
    any(outside_us_hits_pct) as outside_us_hits_pct,
    countIf(position(user_agent, 'Chrome') != 0) / count() as chrome_pct,
    countIf(position(user_agent, 'Safari') != 0) / count() as safari_pct,
    countIf(position(user_agent, 'Firefox') != 0) / count() as firefox_pct,
    isp,
    region
FROM outside_us_hits
ARRAY JOIN
    user_agents as user_agent,
    regions as region
GROUP BY isp, region

In some respects, I find ARRAY JOIN unfortunately named. While JOIN (INNER, OUTER, etc) should be avoided in ClickHouse, ARRAY JOIN should be sought out!

Previously, two separate queries that performed aggregations on different GROUP BYs were presented in the same dataset using INNER JOIN. This is a very common pattern, however in Hydrolix, it can slow down queries.

Instead, the Percent-Outside-US query is run as it was before, but now keeps track of other columns (user_agent, and region) that we’ll need in the outer query.

The ARRAY JOIN “plops” the user_agents and regions arrays as new columns in the table, right next to the others. Because the table is columnar this is a trivial operation.ARRAY JOIN does not need to keep track of any key that is the same in both tables, and is an entirely different operation (and much faster) than the INNER JOIN used to combine the result set in the previous query.

The query above is 28% faster than the first approach

Further Improvements

While groupArray can offer significant improvements to a query, it can also blow up memory. The user_agents included in the page_hits table are surrounded by ... to denote they can be of arbitrary length. Because groupArray holds the array of values it has grouped in memory for some period of time (until it is ARRAY JOINed back into the table) this could cause Out-of-Memory issues in certain scenarios.

Here’s nice trick to use as workaround:
Notice we are only using the user_agent string in boolean expressions in the outer query. We do not care about its entire contents, just whether or not it contains a certain string.

groupArray accepts expressions as well as columns. Instead of holding an entire user_agents String Array in memory, we could hold an array of the boolean expressions we need. Because the array of booleans will take up many order of magnitudes less memory than the array of potentially-long strings, we can save a significant amount of memory this way.
So:

groupArray(user_agent) as user_agents,

Turns into:

groupArray(position(user_agent, 'Chrome') != 0) as chrome_user_agents,
groupArray(position(user_agent, 'Safari') != 0) as safari_user_agents,
groupArray(position(user_agent, 'Firefox') != 0) as firefox_user_agents,

in the inner query.

In practice, with long user_agent strings the following query:

WITH
    outside_us_hits AS (
        SELECT
            uniqIf(session_id, country != 'US') / uniq(session_id) AS outside_us_hits_pct,
            groupArray(position(user_agent, 'Chrome') != 0) as chrome_user_agents,
            groupArray(position(user_agent, 'Safari') != 0) as safari_user_agents,
            groupArray(position(user_agent, 'Firefox') != 0) as firefox_user_agents,
            groupArray(region) as regions,
            isp
        FROM page_hits
        GROUP BY isp
        HAVING uniq(session_id) > 1
    )
SELECT
    any(outside_us_hits_pct) as outside_us_hits_pct,
    countIf(chrome_user_agent) / count() as chrome_pct,
    countIf(safari_user_agent) / count() as safari_pct,
    countIf(firefox_user_agent) / count() as firefox_pct,
    isp,
    region
FROM outside_us_hits
ARRAY JOIN
    chrome_user_agents as chrome_user_agent,
    safari_user_agents as safari_user_agent,
    firefox_user_agents as firefox_user_agent,
    regions as region
GROUP BY isp, region

Took 150x less memory than using groupArray(user_agent)!

Conclusion

It is common and in some databases beneficial to think in terms of calculating results of a large query separately and then joining them together to deliver the final result set.

When using Hydrolix and ClickHouse, you should always be on the lookout for ways of combining multiple passes into one. The tools we used to do this were:

  • Aggregation Function Combinators (-If)
  • groupArray and ARRAY JOIN
  • Using expressions inside of groupArray for memory saving

Share Now