Automatically Index CloudFront Logs

May 12, 2021

Author: David Sztykman

Hydrolix can automatically import batch data from AWS S3.

In this blog post we’ll see how to use this feature to index CloudFront logs into Hydrolix.

Setting up access to the S3 bucket

By default, a newly deployed Hydrolix cluster has access to a single S3 bucket: the one where it stores its data.

The batch peer needs access to your S3 bucket to be able to pull the logs. We’ll use the hdxctl command to update the bucket allowlist (--bucket-allowlist) and include the S3 repository, for example:

hdxctl update hdxcli-xxxxx hdx-yyyyy --bucket-allowlist "your_s3_repo"

If your S3 bucket is encrypted, you also need to specify the ARN of the KMS key used to decrypt it, using --batch-bucket-kms-arn:

hdxctl update hdxcli-xxxxx hdx-yyyyy --batch-bucket-kms-arn "arn"

Creating your project and table

Let’s create a new project and a new table to index the data.
For this example, we create a new project called aws and within that project we create a new table called cloudfront.

The table has specific AutoIngest settings, where we specify the S3 bucket pattern that incoming files must match to be ingested.

In my example the logs are sent to my bucket docs-access-logs, in the folder cf.

I’m using a regex so that only files with the gz extension are ingested.
For more details you can check our documentation.
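
To make the matching concrete, here is a minimal sketch of the kind of filter described above, written in Python. The pattern itself and the sample object keys are assumptions made for illustration; the exact pattern syntax that AutoIngest expects is described in the Hydrolix documentation.

```python
import re

# Hypothetical pattern: object keys under the cf/ folder that end in .gz.
GZ_UNDER_CF = re.compile(r"^cf/.*\.gz$")


def is_ingestable(key: str) -> bool:
    """Return True when an S3 object key matches the pattern."""
    return GZ_UNDER_CF.match(key) is not None


# Made-up object keys, for illustration only.
keys = [
    "cf/EXAMPLE123.2021-05-12-10.abcdef12.gz",
    "cf/readme.txt",
    "other/EXAMPLE123.2021-05-12-10.abcdef12.gz",
]
matching = [key for key in keys if is_ingestable(key)]
print(matching)
```

Only the first key matches: the second has the wrong extension and the third sits outside the cf folder.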

Creating the transform

The CloudFront log format is defined here. As you can see, the date and the time are split into 2 separate fields, so in our transform we create a virtual field that is the concatenation of the 2 fields.

You can find more information on scripted field support here.
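
As an illustration of what that virtual field computes, here is a small Python sketch. It is not the Hydrolix scripted-field syntax; the function name and the sample line are hypothetical, and the sample is shortened to four tab-separated columns. It relies on CloudFront standard logs putting the date (YYYY-MM-DD) and the time (hh:mm:ss, UTC) in the first two columns.

```python
from datetime import datetime, timezone


def combine_date_time(date_field: str, time_field: str) -> datetime:
    """Join the separate date and time columns into one UTC timestamp."""
    combined = f"{date_field} {time_field}"
    parsed = datetime.strptime(combined, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


# Made-up, shortened log line: date, time, edge location, bytes sent.
sample_line = "2021-05-12\t10:15:42\tLAX1\t2390"
date_field, time_field = sample_line.split("\t")[:2]
timestamp = combine_date_time(date_field, time_field)
print(timestamp.isoformat())  # 2021-05-12T10:15:42+00:00
```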

Because the data is compressed with gzip, the transform also needs to declare gzip as the compression.

Finally, CloudFront log files start with 2 comment lines that we need to ignore.
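
The effect of those two settings (gzip compression and 2 header lines to skip) can be sketched in Python as follows. This only mimics the behaviour outside Hydrolix; the function name and the sample content are made up, and real CloudFront log lines have many more columns.

```python
import gzip
import io
from itertools import islice


def iter_records(raw: bytes, skip_lines: int = 2):
    """Decompress a gzip payload and yield tab-separated records,
    ignoring the first `skip_lines` lines (the comment header)."""
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as handle:
        for line in islice(handle, skip_lines, None):
            yield line.rstrip("\n").split("\t")


# Made-up file content: two comment lines followed by one record.
sample = gzip.compress(
    b"#Version: 1.0\n"
    b"#Fields: date time x-edge-location sc-bytes\n"
    b"2021-05-12\t10:15:42\tLAX1\t2390\n"
)
records = list(iter_records(sample))
print(records)
```

Skipping a fixed number of lines mirrors the description above; a parser that ignores every line starting with # would be an alternative design.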

The full example is available in our GitLab repository. If you need details on how to use it, you should read the post regarding VSCode.

On the Hydrolix side, everything is now ready to ingest the CloudFront logs.

What remains is an S3 bucket notification, delivered through SQS, to let the batch peer know that new files are available to pick up.

Configure S3 notification

Now that everything is set up on Hydrolix, the last remaining part is to configure notifications from your S3 bucket to SQS.

Log in to your AWS console, click on the S3 bucket you are using to store your CloudFront logs, then click on Properties.

Scroll down to Notifications and create a new notification.

In the notification tab, specify the path where the CloudFront data is stored and the file extension gz.

For Event Types, select All object create events.

Scroll down to the destination, select SQS queue, and choose the queue called hdxcli-YYYY-autoingest.
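
The same notification can also be set up programmatically. The sketch below builds the configuration that the S3 notification API expects and, optionally, applies it with boto3. It is a sketch under assumptions: the queue ARN is a placeholder (region and account ID are not filled in), the cf/ prefix comes from the example folder above, and the helper names are hypothetical. Note that this API call replaces the bucket's existing notification configuration.

```python
def build_notification_config(queue_arn: str,
                              prefix: str = "cf/",
                              suffix: str = "gz") -> dict:
    """Build an S3 notification configuration that sends every
    object-created event under `prefix` ending in `suffix` to an SQS queue."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": prefix},
                            {"Name": "suffix", "Value": suffix},
                        ]
                    }
                },
            }
        ]
    }


def apply_notification(bucket: str, config: dict) -> None:
    """Apply the configuration to a bucket (requires boto3 and AWS credentials)."""
    import boto3  # third-party dependency, imported only when actually used

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=config
    )


# Placeholder ARN: REGION and ACCOUNT_ID must be replaced with real values.
config = build_notification_config(
    "arn:aws:sqs:REGION:ACCOUNT_ID:hdxcli-YYYY-autoingest"
)
# apply_notification("docs-access-logs", config)  # uncomment to apply
```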

And that’s it!

Hydrolix now ingests new CloudFront logs as soon as it is notified via SQS!
