Hydrolix can automatically import batch data from AWS S3.
In this blog post we'll see how to use this feature to index CloudFront logs into Hydrolix.
Setting up access to the S3 bucket
When we deploy a new Hydrolix cluster, by default it only has access to a single S3 bucket: the one where it stores its own data.
The batch peer needs access to your S3 bucket to be able to pull the logs, so we use the hdxctl command to update the bucket-allowlist and include the source bucket, for example:
hdxctl update hdxcli-xxxxx hdx-yyyyy --bucket-allowlist "your_s3_repo"
If your S3 bucket is encrypted, you also need to specify the KMS key ARN used to decrypt it with --batch-bucket-kms-arn:
hdxctl update hdxcli-xxxxx hdx-yyyyy --batch-bucket-kms-arn "arn"
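If your setup needs both options, they can be passed together. This is a sketch, assuming hdxctl accepts both flags in a single invocation; the cluster IDs, bucket name, and KMS key ARN are placeholders to replace with your own values:
hdxctl update hdxcli-xxxxx hdx-yyyyy \
  --bucket-allowlist "your_s3_repo" \
  --batch-bucket-kms-arn "arn:aws:kms:us-east-1:111122223333:key/your-key-id"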
Creating your project and table
Let's create a new project and a new table to index the data.
For this example, we create a new project called aws and, within that project, a new table called cloudfront.
The table will have specific AutoIngest settings where we specify the S3 bucket pattern to match when ingesting data.
In my example the logs are delivered to my bucket docs-access-logs under the folder cf.
I'm using a regex to specify that I only want files with the gz extension.
For more details you can check our documentation.
Project and table creation code
### Global variables to replace with your own values
@base_url = https://host.hydrolix.live/config/v1/
@username = username@email.com
@password = mysuperpassword
### See our documentation for the proper S3 bucket pattern: https://docs.hydrolix.io/guide/alt_ingest/file-streams#file-streams-aka-auto-ingest
@s3bucket = ^s3://docs-access-logs/cf/*.gz
@projectname = aws
@tablename = cloudfront
### Log in to retrieve the access token and organization UUID
# @name login
POST {{base_url}}login
Content-Type: application/json

{
"username": "{{username}}",
"password": "{{password}}"
}
### Parse the login response body to store the access token and organization ID
@access_token = {{login.response.body.auth_token.access_token}}
@org_id = {{login.response.body.orgs[0].uuid}}
### Create a new project named aws
# @name new_project
POST {{base_url}}orgs/{{org_id}}/projects/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
"name": "{{projectname}}",
"org": "{{org_id}}"
}
### Parse the project ID from the response
@projectid = {{new_project.response.body.uuid}}
### Create a new table named {{tablename}} in the {{projectname}} project, auto-ingesting CloudFront logs via SQS notifications
# @name new_table
POST {{base_url}}orgs/{{org_id}}/projects/{{projectid}}/tables/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
"name": "{{tablename}}",
"project": "{{projectid}}",
"description": "Auto Ingest Cloudfront logs from S3 Bucket",
"settings": {
"autoingest": {
"enabled": true,
"pattern": "{{s3bucket}}",
"max_rows_per_partition": 1000000,
"max_minutes_per_partition": 15,
"input_aggregation": 4
}
}
}
### Parse the table ID from the response
@tableid = {{new_table.response.body.uuid}}
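If you want to double-check the table configuration before moving on, you can read it back with the same variables. This is a minimal sketch, assuming the config API supports GET on the table resource path used above:
### Optional: fetch the table we just created to verify its auto-ingest settings
# @name get_table
GET {{base_url}}orgs/{{org_id}}/projects/{{projectid}}/tables/{{tableid}}
Authorization: Bearer {{access_token}}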
Creating the transform
The CloudFront log format is defined here; as you can see, the date and the hour are split into two separate fields, so in our transform we create a virtual field that concatenates the two.
You can find more information on scripted field support here.
As the data is compressed using gzip, we also need to specify gzip as the compression in the transform.
Finally, CloudFront log files start with two comment lines that we need to skip.
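For reference, this is what the top of a CloudFront standard log file looks like; the two comment lines below follow the field list in the AWS documentation, and the data rows after them are tab-separated values in the same order as the positions used in the transform:
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end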
Full transform for CloudFront logs
### Create a transform for the CSV format and upload it to our table
# @name new_transform
POST {{base_url}}orgs/{{org_id}}/projects/{{projectid}}/tables/{{tableid}}/transforms/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
"name": "aws_cloudfront_transform",
"type": "csv",
"table": "{{tableid}}",
"settings": {
"is_default": true,
"compression": "gzip",
"output_columns": [
{
"name": "timestamp",
"position": 0,
"datatype": {
"type": "datetime",
"script": "new Date(row['date'] +' '+ row['hour'])",
"format": "2006-01-02 15:04:05",
"virtual": true,
"primary": true
}
},
{
"name": "date",
"position": 0,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "hour",
"position": 1,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "x-edge-location",
"position": 2,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "bytes_server_to_client",
"position": 3,
"datatype": {
"type": "uint64"
}
},
{
"name": "clientip",
"position": 4,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "method",
"position": 5,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "host",
"position": 6,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "url",
"position": 7,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "status",
"position": 8,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "referer",
"position": 9,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "user-agent",
"position": 10,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "query-string",
"position": 11,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "cookie",
"position": 12,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "x-edge-result-type",
"position": 13,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "x-edge-request-id",
"position": 14,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "x-host-header",
"position": 15,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "protocol",
"position": 16,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "bytes_client_to_server",
"position": 17,
"datatype": {
"type": "uint64"
}
},
{
"name": "time-taken",
"position": 18,
"datatype": {
"type": "uint64"
}
},
{
"name": "x-forwarded-for",
"position": 19,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "ssl-protocol",
"position": 20,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "ssl-cipher",
"position": 21,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "x-edge-response-result-type",
"position": 22,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "protocol-version",
"position": 23,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "fle-status",
"position": 24,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "fle-encrypted-fields",
"position": 25,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "client_port",
"position": 26,
"datatype": {
"type": "uint64"
}
},
{
"name": "ttfb",
"position": 27,
"datatype": {
"type": "uint64"
}
},
{
"name": "x-edge-detailed-result-type",
"position": 28,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "content-type-response",
"position": 29,
"datatype": {
"type": "string",
"index": true
}
},
{
"name": "content-length-response",
"position": 30,
"datatype": {
"type": "uint64"
}
},
{
"name": "content-range-response-start",
"position": 31,
"datatype": {
"type": "uint64"
}
},
{
"name": "content-range-response-end",
"position": 32,
"datatype": {
"type": "uint64"
}
}
],
"format_details": {
"skip_head": 2,
"delimiter": "\t"
}
}
}
The full example is available in our GitLab repository; if you need details on how to use it, you should read our post about VS Code.
On the Hydrolix side, everything is now ready to ingest the CloudFront logs.
Now we need to set up notifications from the S3 bucket via SQS to let the batch peer know that new files are available to pick up.
Configuring S3 notifications
Now that everything is set up on the Hydrolix side, the last remaining step is to configure event notifications from your S3 bucket to SQS (a scripted alternative using the AWS CLI is shown after the console steps).
Log into your AWS console, open the S3 bucket where your CloudFront logs are stored, and click on Properties:
Scroll down to Event notifications and create a new notification:
In the notification form, specify the prefix where the CloudFront logs are stored (cf/) and the suffix gz.
Under Event types, select All object create events.
Scroll down to Destination, select SQS queue, and choose the queue called hdxcli-YYYY-autoingest:
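If you prefer to script this step rather than click through the console, the same notification can be created with the AWS CLI. This is a minimal sketch, assuming the cf/ prefix and gz suffix from this example; the region and account ID in the queue ARN are placeholders to replace with your own:
aws s3api put-bucket-notification-configuration \
  --bucket docs-access-logs \
  --notification-configuration '{
    "QueueConfigurations": [
      {
        "QueueArn": "arn:aws:sqs:us-east-1:111122223333:hdxcli-YYYY-autoingest",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
          "Key": {
            "FilterRules": [
              { "Name": "prefix", "Value": "cf/" },
              { "Name": "suffix", "Value": ".gz" }
            ]
          }
        }
      }
    ]
  }'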
And that's it!
Hydrolix now ingests new CloudFront logs as soon as it is notified via SQS!