HYDROLIX BLOG

Ponderings, insights and industry updates

Hydrolix from Zero to Ingest

April 6, 2021

Author: David Sztykman | PM - Integrations and Ecosystem

I joined Hydrolix just three days ago, so I wanted to share how I went from not knowing the product to using it very quickly.

Step 1: Understand the product

It may seem like basic advice, but before using anything I think it’s important to understand what the product actually does.

So I’ll do a summary for you:

• Hydrolix is a cloud data platform optimized for append-only data such as logs, security events, traces, any data with a timestamp really. 

• The data is highly compressed and stored on cloud storage. We segment the data using the timestamp as the primary key, so each segment has a start and end date.

• What makes Hydrolix different is that every part of the search engine is split into separate services. This design allows us to scale ingest, search and merge independently.

• Everything is deployed in your own VPC so the data never leaves your zone.

• As the data is stored on cloud storage, we don’t need to manage replication or high availability ourselves, since the storage layer provides both by design.
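To make the point about time-based segments more concrete, here is a minimal sketch in Go of why a start and end date per segment is useful: a query over a time range only needs to read the segments that overlap it. This is only an illustration under my own assumptions, not Hydrolix’s actual implementation; the segment type, the prune function and the segment names are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// segment is a hypothetical stand-in for one stored chunk of data.
type segment struct {
	name       string
	start, end time.Time // inclusive time range covered by the segment
}

// overlaps reports whether the segment's range intersects [from, to].
func (s segment) overlaps(from, to time.Time) bool {
	return !s.end.Before(from) && !s.start.After(to)
}

// prune returns only the segments a query over [from, to] has to read.
func prune(segs []segment, from, to time.Time) []segment {
	var out []segment
	for _, s := range segs {
		if s.overlaps(from, to) {
			out = append(out, s)
		}
	}
	return out
}

// day builds a UTC midnight timestamp for the given date.
func day(y int, m time.Month, d int) time.Time {
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

func main() {
	segs := []segment{
		{"seg-jan", day(2021, 1, 1), day(2021, 1, 31)},
		{"seg-feb", day(2021, 2, 1), day(2021, 2, 28)},
		{"seg-mar", day(2021, 3, 1), day(2021, 3, 31)},
	}
	// A query for mid-February only touches one of the three segments.
	for _, s := range prune(segs, day(2021, 2, 10), day(2021, 2, 20)) {
		fmt.Println(s.name)
	}
}
```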

Now that we’ve got the basics out of the way, let’s dive into a deployment.

Step 2: Deploy your first environment

You can deploy Hydrolix into your AWS VPC very quickly using our CLI tool hdxctl.

hdxctl is pretty much our control tower for Hydrolix: it generates license requests and creates the deployment using CloudFormation.

Before doing anything, you’ll need to download and install hdxctl on a t2.micro Ubuntu server. You’ll find the latest version in our documentation here:
https://docs.hydrolix.io/guide/setup/download

wget -O hdxctl https://hdx-infrastructure.s3.amazonaws.com/hdxctl-v2.7.7 && chmod +x hdxctl

To deploy a Hydrolix cluster you’ll need a license. You can generate the license request directly from the CLI using the following command:

hdxctl get-license --admin-email email@organization.org --organization "Organization" --account-id 1234567890 --host hostorg --region us-east-2 --full-name "David Sztykman"

Let’s break down this license request:

admin-email: the email of the administrator who’ll receive the license for the cluster

organization: the name of the company that will use the cluster

account-id: your AWS account ID, which you can find in the upper right corner of your AWS console

host: the hostname generated by Hydrolix to access your cluster. In my example this will be hostorg.hydrolix.live

region: the AWS region associated with this license request

full-name: the name of the user associated with this license

Once you generate your license request, you’ll receive an email with your client ID, which is required to create a cluster.

The client ID looks like hdxcli-xxxx

To create a cluster you’ll again use the hdxctl command line:

hdxctl --region us-east-2 create-cluster --wait hdxcli-xxxxxxxx

Let’s break down this create-cluster request:

region: the AWS region linked to the license you received

hdxcli-xxxxxxxx: the Hydrolix client ID received via email after the license request

Cluster creation might take up to 20 minutes.

Once the cluster is created, you can view it using:

hdxctl clusters
CLIENT_ID        CLUSTER_ID    CREATED    HOST    STATUS    WHO    REGION
---------------  ------------  ---------  ------  --------  -----  ---------
hdxcli-xxxxxxxx  hdx-yyyyyyyy  -                                   us-east-2

If the status doesn’t show as created, add the --sync parameter, which gives you a response like:

hdxctl clusters --sync
CLIENT_ID        CLUSTER_ID    CREATED              HOST                    STATUS           WHO    REGION
---------------  ------------  -------------------  ----------------------  ---------------  -----  ---------
hdxcli-xxxxxxxx  hdx-yyyyyyyy  2021-04-02 08:14:08  hostorg.hydrolix.live.  CREATE_COMPLETE  david  us-east-2

If you go to your AWS console and filter EC2 by your cluster ID, you’ll see a few instances running different services. I’ll provide more explanation on the different services in a separate blog post.

You’ll also see a new S3 bucket named after your cluster ID. This is where we store the data, along with some configuration files.

We’re almost ready to go! The last thing we need to do is enable access from our IP into the cluster.

As a former Akamai employee, I always use http://whatismyip.akamai.com/ to get my public IP, but you can use whatever you want.

We’ll add our IP to the allowed list using hdxctl. The parameter expects a subnet, so for a single IP add a /32:

hdxctl update hdxcli-xxxxxxxx hdx-yyyyyyyy --ip-allowlist "90.91.214.123/32"

If you need to add multiple IPs, you can repeat the --ip-allowlist parameter with a different IP each time.
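If you manage more than a couple of addresses, it can help to generate those parameters rather than type them. Here is a small Go sketch, under my own assumptions, that turns a list of IPs or subnets into repeated --ip-allowlist arguments, adding /32 to single IPv4 addresses. The helper names toCIDR and allowlistArgs are mine and not part of hdxctl, and the 10.0.0.0/24 subnet is just a made-up second entry.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// toCIDR validates entry and returns it in CIDR notation. A single IP gets
// the full-length prefix appended (/32 for IPv4, /128 for IPv6).
func toCIDR(entry string) (string, error) {
	if strings.Contains(entry, "/") {
		p, err := netip.ParsePrefix(entry)
		if err != nil {
			return "", err
		}
		return p.String(), nil
	}
	addr, err := netip.ParseAddr(entry)
	if err != nil {
		return "", err
	}
	return netip.PrefixFrom(addr, addr.BitLen()).String(), nil
}

// allowlistArgs repeats the --ip-allowlist parameter once per entry.
func allowlistArgs(entries []string) ([]string, error) {
	var args []string
	for _, e := range entries {
		cidr, err := toCIDR(e)
		if err != nil {
			return nil, fmt.Errorf("invalid allowlist entry %q: %w", e, err)
		}
		args = append(args, "--ip-allowlist", cidr)
	}
	return args, nil
}

func main() {
	// The second entry is a hypothetical office subnet.
	args, err := allowlistArgs([]string{"90.91.214.123", "10.0.0.0/24"})
	if err != nil {
		fmt.Println(err)
		return
	}
	// Print the command to run; nothing is executed here.
	fmt.Println("hdxctl update hdxcli-xxxxxxxx hdx-yyyyyyyy", strings.Join(args, " "))
}
```

The program only prints the hdxctl command, so you can review it before running it yourself.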

Step 3: Log in, create project and table

Now that the cluster is up and running and we have access to it, you should have received an email with a link to connect to our user interface.

Click on the link to create a new password for your user.

After that you should have access to an interface which looks like this:

In this example we are in the “sample_project”. If you want, you can create a new project, or you can create a new table in sample_project.

For now I’ll use the API to better understand the workflow.
I’m using Visual Studio Code with the plugin REST Client:

We now have a new project called newhire, and in that project we have a table called csv.

Now the fun part begins. We have the following CSV file that we want to ingest:


So from that CSV file we need to create the right transformation to ingest it.

As I mentioned earlier, Hydrolix uses the timestamp as the primary key, so in this example we’ll use the close_date column.

A sample date here is: 01-14-11

We need to find the proper format in Go to render that timestamp properly so we can ingest the data.

Fortunately, there are great resources on the internet such as Go Date time builder.

We know the first field is the month in two-digit (00) format, the second is the day and the last is the year.

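So, using the date time builder, the format to recognize the timestamp is something like 01-02-06. Go layouts are written using the reference date January 2, 2006, so 01 stands for the month, 02 for the day and 06 for the two-digit year.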
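If you want to double-check a layout before putting it in a transform, a few lines of Go are enough. The sketch below assumes the MM-DD-YY reading described above and parses the sample value 01-14-11 with the standard time package. The function name parseCloseDate is just mine for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// closeDateLayout is the Go layout for MM-DD-YY. Go layouts are written as
// the reference time (January 2, 2006): 01 is the month, 02 the day and
// 06 the two-digit year.
const closeDateLayout = "01-02-06"

// parseCloseDate parses a value taken from the close_date column.
func parseCloseDate(s string) (time.Time, error) {
	return time.Parse(closeDateLayout, s)
}

func main() {
	ts, err := parseCloseDate("01-14-11")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(ts.Format("2006-01-02")) // prints 2011-01-14
}
```

If the layout were wrong, for example with day and month swapped, time.Parse would return an error for this sample because 14 is not a valid month.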

Now that we know the format for the timestamp, we can create our transform by following the documentation.

Here, the transformation specifies that the format is CSV. We also specify the name of each column and how we want to treat its data, and finally that we want to skip the first line, as it contains the column names rather than data.

The name of this transformation is csv_new_hire_transform.
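To get a feel for what such a transform describes, here is a rough local equivalent in Go: read the CSV, skip the header line, and give each column a name and a type. This is not the Hydrolix transform format, only a sketch of the same idea. Apart from close_date, the column layout is an assumption of mine (a hypothetical name column followed by close_date).

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
	"time"
)

// row is a hypothetical typed view of one CSV line.
type row struct {
	Name      string    // assumed first column
	CloseDate time.Time // close_date column, used as the timestamp
}

// parseRows skips the header line, then maps each record to a typed row.
func parseRows(data string) ([]row, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	var rows []row
	for i, rec := range records[1:] { // records[0] holds the column names
		if len(rec) < 2 {
			return nil, fmt.Errorf("data line %d: expected at least 2 columns", i+1)
		}
		ts, err := time.Parse("01-02-06", rec[1])
		if err != nil {
			return nil, fmt.Errorf("data line %d: %w", i+1, err)
		}
		rows = append(rows, row{Name: rec[0], CloseDate: ts})
	}
	return rows, nil
}

func main() {
	// Made-up sample data; only the close_date format comes from the post.
	data := "name,close_date\nexample,01-14-11\n"
	rows, err := parseRows(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range rows {
		fmt.Println(r.Name, r.CloseDate.Format("2006-01-02"))
	}
}
```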

Step 4: Ingesting and viewing our data

Ingestion can be done via several mechanisms: the ingest API over HTTPS, batch ingest reading from an S3 bucket, or subscribing to a Kafka topic.

For this example we’ll use the ingest API over HTTPS.

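# --data-binary sends the file as-is; plain -d would strip the newlines between CSV rows
curl -s \
     -H 'content-type: text/csv' \
     -H 'x-hdx-table: newhire.csv' \
     -H 'x-hdx-transform: csv_new_hire_transform' \
     https://hostorg.hydrolix.live/ingest/event -X POST --data-binary @data.csv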

This will generate a POST request specifying the project and table we’ll use via the header x-hdx-table.

We also specify the transformation we want to apply via the header x-hdx-transform.

The body of the post will be our CSV file.
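If you would rather send events from code than from curl, the same request can be built with Go’s standard net/http package. The sketch below assumes the same host, table and transform names used in this post, and the helper name newIngestRequest is mine. It only builds and prints the request (a dry run), using a made-up two-line CSV in place of data.csv.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// newIngestRequest builds the same POST request as the curl command:
// the table and transform are passed as headers, the CSV as the body.
func newIngestRequest(host, table, transform string, body io.Reader) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, "https://"+host+"/ingest/event", body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("content-type", "text/csv")
	req.Header.Set("x-hdx-table", table)
	req.Header.Set("x-hdx-transform", transform)
	return req, nil
}

func main() {
	// Hypothetical sample standing in for the contents of data.csv.
	body := strings.NewReader("name,close_date\nexample,01-14-11\n")
	req, err := newIngestRequest("hostorg.hydrolix.live", "newhire.csv", "csv_new_hire_transform", body)
	if err != nil {
		fmt.Println("build error:", err)
		return
	}
	// Dry run: print the request instead of sending it.
	fmt.Println(req.Method, req.URL)
	for name, values := range req.Header {
		fmt.Println(name+":", strings.Join(values, ", "))
	}
}
```

To actually send it, you would pass the request to http.DefaultClient.Do and check the response status. Go prints the header names in canonical capitalization, which is fine because HTTP header names are case-insensitive.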

Now that the data is ingested we can start querying it!

Hydrolix is compatible with the ClickHouse SQL engine, which means that you can use any ClickHouse driver or data visualization tool.

By default we have a Grafana instance deployed in our cluster, but for my example I’ll keep using Visual Studio Code.

To use SQL from VS Code you need to install the ClickHouse plugin.

Once installed you’ll add a new connection which will use the following information:

Once you have added your connection, you can select SQL as the language mode. This lets you write SQL queries with autocompletion for advanced ClickHouse functions and run them against your Hydrolix deployment.

We now have a fully functional example from cluster creation to ingest and query, using Visual Studio Code.

Don’t forget to bring down your environment to reduce your spending:

hdxctl scale --off hdxcli-xxxxxxxx hdx-yyyyyyyy

If you want to resume your cluster you can just use:

hdxctl scale --minimal hdxcli-xxxxxxxx hdx-yyyyyyyy

That’s the beauty of separating all the roles and storing the data in cheap cloud storage!

If you want to try it out for yourself (and for free), get your license here. We’ll be more than happy to help you.
