HYDROLIX BLOG

Ponderings, insights and industry updates

Managing your noisy neighbour

November 22, 2021

Author: David Sztykman |

Tags: , ,

Nobody likes a noisy neighbour.
For most of the time, we are unable to pick our data neighbours and we rarely have a choice on who they are and what impact they will have on us.

Indeed, when a business provides access to a centrally located data source, everyone wants to use it their way, to answer their business questions on their timeline.
Putting up with waiting your turn and ensuring you do not run your query when “Big Data Bobby” runs theirs, becomes the norm.

Let’s take an example with logs and the dev-sec-ops model, in this model we often have the same data being used by multiple teams with multiple workloads.
If someone does an expensive query because they need to look at trend for the last year, 6 months etc it has an impact across everyone else using it.

Poorly written queries compound the problem impacting everyone and after a few bad incidents most companies start to realise that a central, one data platform to rule them all solution, doesn’t fit their needs. User access restrictions, capacity band-aids and circuit breakers all become part of the “management” of the system.

These solutions in general mean limiting access to fewer users, limiting the query to smaller data, for shorter periods of time etc…

Finally, after trying all the hacks and band-aids they finally give up and start building a new infrastructure, indexing the data into multiple tools, multiple infrastructures, across multiple locations to ensure high performance for each one.

Hydrolix to the rescue

Hydrolix has the advantages of decoupling the storage from the compute, which means we can store data for an unlimited period while also creating dedicated compute search pools for each team! There’s no need to duplicate data, the same data can be used across teams, at the same time on their own dedicated resources

Let’s take the example of a company running their dashboard on Superset.

They have multiple teams using superset: their SOC doing investigation and their Sales Engineering also using it for customer trend analysis and POC.

To avoid stepping on each other toes we’ll create 2 separates query pool pointing to the same data sources:

Create new query pool

In the latest release of Hydrolix we have added a custom SETTINGS in SQL to specify the query pool your query is going to leverage:

SELECT count(timestamp) FROM sample.cts
WHERE timestamp >= toString(yesterday())
SETTINGS hdx_query_pool_name='soc'

This query will run in the newly created pool named soc.

Automatic pool selection in Superset

Superset is very flexible and easily customisable. In this example we’ll modify the query generated from the user to add the SETTINGS and the proper hdx_query_pool_name to use.

On superset we can create a new role and attach users to this role:

In the superset configuration we can list the different role of the users and automatically modify the SQL query to add the proper settings, this is done via the SQL_QUERY_MUTATOR function:

import re
def SQL_QUERY_MUTATOR(sql, username, security_manager, database):
    ##find current user and roles
    user = security_manager.find_user(username=username)
    user_roles = security_manager.get_user_roles(user)
    for user_role in user_roles:
        if user_role.name == "soc":
            pool = "soc"
        elif user_role.name == "sales":
            pool = "sales"
        else:
            pool = "default"
    ##regex to search if users is already using SETTINGS in his query        
    settings_search = re.search(r'.*(SETTINGS\s+hdx_.*)\s+LIMIT', sql, flags=re.I|re.M)
    to_ret = ""
    ##if user is already using SETTINGS, split the query and the current SETTINGS in new variable
    ##and add a comma after the current SETTINGS to be able to add another settings which will be
    ##the query pool name extracted from the group the users belongs to
    if settings_search:
        settings = settings_search.group(1)
        query = re.sub(settings, '', sql, flags=re.I|re.M)
        to_ret += f"{query} {settings}, "
    else:
        to_ret += f"{sql} SETTINGS "
    to_ret += f"hdx_query_pool_name='{pool}'"
    return to_ret

This block of code is looking at the role the user has, and based on the role override the SQL Query to add the settings to route the request to the proper query pool.

This is done completely dynamically and is transparent for the users, but each team is now completely isolated and has the performance they decide based on the number and type of server in the pool!

Between autoscaling and spot instances you can control your costs and limit each pool in a very flexible way!

No more problematic neighbours!
Now everyone in the neighbourhood can have access to the Data.

Share Now