
Index and enrich Akamai logs in realtime

June 7, 2022

Author: David Sztykman


Akamai’s new DataStream 2 feature streams logs in real time to an HTTPS endpoint.

In this blog post we’ll see how to set up Akamai and Hydrolix to receive, parse and index logs in real time.

Set up Akamai

To set up Akamai DataStream, you can follow the Akamai documentation.

Create a new DataStream:

  • Destination: select Custom HTTPS
  • Name: enter Hydrolix
  • Endpoint URL: https://hydrolixhostname/ingest/event?table=project.table&transform=akamai_transform
  • Authentication: select none
  • Go to Custom Header and select Content-Type: application/json
  • Check the box Send compressed data
  • In the Datasets page, pick JSON logs for the log format
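
As a rough illustration of how the endpoint URL in the list above is put together, the following Python sketch builds it from its parts. The hostname, project, table and transform names are the placeholders used in this post, and the helper name build_ingest_url is hypothetical; it is not part of any Akamai or Hydrolix tooling.

```python
from urllib.parse import urlencode


def build_ingest_url(hostname: str, project: str, table: str, transform: str) -> str:
    """Build the URL to paste into the Endpoint URL field.

    Hypothetical helper: it only reproduces the URL shape shown in this post.
    """
    # The target table is addressed as "<project>.<table>".
    query = urlencode({"table": f"{project}.{table}", "transform": transform})
    return f"https://{hostname}/ingest/event?{query}"


# Placeholder values taken from the post; replace them with your own.
url = build_ingest_url("hydrolixhostname", "project", "table", "akamai_transform")
print(url)
```

Generating the URL this way mostly helps when several streams point at different tables or transforms, since only the arguments change.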

The summary should look like this:

Once the DataStream configuration is deployed, you can update your regular delivery configuration to deliver logs using DataStream 2.

In Property Manager, select the DataStream2 behavior, then specify the Stream Name configured earlier and the sampling rate to apply:


Akamai sends the following data by default:

Some of those fields are easily parsed and indexed, but some require more advanced treatment, in particular the breadcrumbs.

In this example the breadcrumbs data is URL-encoded, so we need to decode it and then parse the information it contains.
The value we are working with is: //BC/[a=,c=g,k=0,l=1]

The breadcrumbs documentation describes several pieces of information we can extract:

  • a: component IP
  • c=g: the component is the edge ghost
  • k=0: request end time in ms
  • l=1: turnaround time in ms
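
To make the structure of a breadcrumbs value concrete, here is a minimal Python sketch that splits an already decoded value into one dictionary per [...] block. The function name parse_breadcrumbs is hypothetical, and the sketch assumes every block is a comma-separated list of key=value pairs, as in the sample value above.

```python
import re


def parse_breadcrumbs(decoded: str) -> list:
    """Split a decoded breadcrumbs string into one dict per [...] block."""
    blocks = []
    # Each block is the text between "[" and the next "]".
    for body in re.findall(r"\[([^\]]*)\]", decoded):
        fields = {}
        for pair in body.split(","):
            if not pair:
                continue  # skip empty segments such as "[]"
            key, _, value = pair.partition("=")
            fields[key] = value
        blocks.append(fields)
    return blocks


# Sample value from this post; the component IP after "a=" is empty in it.
crumbs = parse_breadcrumbs("//BC/[a=,c=g,k=0,l=1]")
print(crumbs)
```

All values stay strings here; converting k and l to integers would be a natural next step once the fields you care about are known.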

Breadcrumbs can include several more fields.

Being able to extract this information at indexing time is very valuable: it’s a one-time process, so the expensive extraction is done only once.

At Hydrolix we have added a new feature that lets users write SQL statements which run at indexing time.

We can write a SQL statement to extract the Edge_IP:

Let’s decompose this statement:

  • decodeURLComponent(breadcrumbs) returns the breadcrumbs in URL-decoded form.
  • extract(decoded_breadcrumbs, '(\[[^[]*c=g[^]]*\])') is a regex that extracts the block between [ and ] which contains c=g (c=g means edge).
  • Finally, extract(from_previous_extract, 'a=([^,\]]+)') is a regex that extracts the value after a=.
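
The three steps above can be mirrored outside SQL to check the regexes against sample data. The following Python sketch is only an approximation of the SQL behaviour, not the statement used in this post: the helper extract assumes the SQL function returns the first capture group or an empty string when nothing matches, the brackets inside the character classes are escaped for Python, and the sample value (including both IP addresses and the c=o block) is invented for illustration.

```python
import re
from urllib.parse import unquote


def extract(haystack: str, pattern: str) -> str:
    """Return the first capture group of the first match, or "" if none."""
    match = re.search(pattern, haystack)
    return match.group(1) if match else ""


def edge_ip(breadcrumbs: str) -> str:
    # Step 1: URL-decode the raw breadcrumbs value.
    decoded = unquote(breadcrumbs)
    # Step 2: keep only the [...] block that contains c=g (the edge).
    edge_block = extract(decoded, r"(\[[^\[]*c=g[^\]]*\])")
    # Step 3: take the value after a= inside that block.
    return extract(edge_block, r"a=([^,\]]+)")


# Hypothetical sample: documentation-only IP addresses, two blocks,
# and only the brackets URL-encoded to keep it readable.
sample = "//BC/%5Ba=198.51.100.7,c=o,k=5,l=9%5D,%5Ba=192.0.2.10,c=g,k=0,l=1%5D"
print(edge_ip(sample))
```

With the sample value from this post, where nothing follows a=, the last regex finds no match and the result is an empty string.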

To summarise, this select extracts the value after a= from the [ ] block that contains c=g, and puts it into a new column called Edge_IP.

In this case we are creating a new column based on the extracted value of the incoming data.

A few other columns need some transformation: the UA and the referer are both URL-encoded.
We can override those in the SQL statement:

This example overrides the current value, replacing it with the URL-decoded value.
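
For readers who want to see the effect of that override on a single record, here is a small Python sketch of the same idea. It assumes the fields are keyed as UA and referer; the function name decode_fields and the record values are hypothetical.

```python
from urllib.parse import unquote


def decode_fields(record: dict, fields=("UA", "referer")) -> dict:
    """Return a copy of record with the listed fields URL-decoded."""
    decoded = dict(record)  # leave the caller's record untouched
    for name in fields:
        if name in decoded:
            decoded[name] = unquote(decoded[name])
    return decoded


# Hypothetical log record; the values are invented for illustration.
record = {
    "UA": "Mozilla%2F5.0%20(X11%3B%20Linux%20x86_64)",
    "referer": "https%3A%2F%2Fexample.com%2Fwatch%3Fid%3D42",
}
clean = decode_fields(record)
print(clean)
```

The decoded value replaces the encoded one under the same key, which matches the override described above: no new column is created.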

One critical aspect is the latency of ingestion, which we can calculate dynamically using:

This calculates the difference in seconds between when the user requested the content and when we receive the data and create a partition.
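
The arithmetic behind that latency value is a plain subtraction of two timestamps. A minimal Python sketch, assuming both times are available as Unix epoch seconds (the function name and the sample timestamps are hypothetical):

```python
def ingest_latency_seconds(request_epoch: float, receive_epoch: float) -> float:
    """Seconds between the client request and the moment the log is received."""
    return receive_epoch - request_epoch


# Hypothetical timestamps in Unix epoch seconds, invented for illustration.
requested_at = 1654600000.0
received_at = 1654600012.5
latency = ingest_latency_seconds(requested_at, received_at)
print(latency)
```

A value that keeps growing over time would suggest that log delivery or ingestion is falling behind the traffic.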

The full SQL statement we are using including all the breadcrumbs is the following:

We end up with the following information indexed into Hydrolix:

Full example

To use this example you can refer to the VSCode blog example:

Akamai Transform example

As you can see in the transform, some new columns are generated by the SQL transform:

The source for latency is from_input_field: sql_transform, which is where the data is generated.

We can add many more columns generated from the Akamai raw data: we can extract, enrich, and go even further with custom functions and dictionaries!
Let’s see it in action with CMCD (Common Media Client Data).
