Ponderings, insights and industry updates

Analyzing Fastly CDN transaction logs with Hydrolix

February 24, 2021

Author: Tobin Sears | Sales Engineering, Hydrolix


In this post, we will show you how to analyze the Fastly CDN logs you have streamed to Hydrolix.

This post is part of a series showing how to use Hydrolix and an open source dashboard to maximize your Fastly CDN observability quickly, cheaply, and in your own VPC. Before following these quick instructions for analyzing Fastly transaction logs using Hydrolix, check out how to configure Fastly to stream logs into Hydrolix with an HTTPS endpoint and how to configure the Hydrolix streaming intake for those logs. Once you are querying the logs, you can also set up a dashboard to visualize your queries. You can also always refer to our Hydrolix documentation.

This short tutorial covers a few example queries that are designed to tell us a bit more about the Fastly log data that we are collecting. These are example queries to give you a sense for observing your Fastly CDN infrastructure with Hydrolix. You can certainly write your own Clickhouse queries better suited to your own needs, and we encourage you to reach out to us if you need any help.

The data that we will be querying in these examples is stored in the fastly.xlogs table.

Defining the dataset

Let’s start with understanding a bit more about the data that we are working with. The extended log format adds quite a few more columns so visual confirmation of the table structure is always useful. We can employ a simple DESCRIBE command to take a closer look at the table definition.

DESCRIBE fastly.xlogs

Returns the table definition:

cache_status					String					
client_ip					String					
content_type					String					
geo_city					String					
geo_continent_code				String					
geo_country_code				String					
geo_datacenter					String					
geo_region					String					
h2_stream_id					String					
host						String					
is_cacheable					Nullable(UInt8)					
is_h2						Nullable(UInt8)					
is_h2_push					Nullable(UInt8)					
is_ipv6						Nullable(UInt8)					
is_tls						Nullable(UInt8)					
origin_host					String					
protocol					String					
req_body_size					Nullable(UInt32)					
req_header_size					Nullable(UInt32)					
request						String					
request_accept_charset				String					
request_accept_content				String					
request_accept_encoding				String					
request_accept_language				String					
request_cache_control				String					
request_connection				String					
request_dnt					String					
request_forwarded				String					
request_referer					String					
request_user_agent				String					
request_via					String					
request_x_forwarded_for				String					
request_x_requested_with			String					
resp_body_size					Nullable(UInt32)					
resp_header_size				Nullable(UInt32)					
response_age					Nullable(UInt32)					
response_cache_control				String					
response_expires				String					
response_last_modified				String					
response_tsv					String					
service_id					String					
service_version					String					
socket_cwnd					Nullable(UInt32)					
socket_nexthop					String					
socket_ploss					Nullable(Float64)					
socket_tcpi_delta_retrans			Nullable(UInt32)					
socket_tcpi_last_data_sent			Nullable(UInt32)					
socket_tcpi_rcv_mss				Nullable(UInt32)					
socket_tcpi_rcv_rtt				Nullable(UInt32)					
socket_tcpi_rcv_space				Nullable(UInt32)					
socket_tcpi_rtt					Nullable(UInt32)					
socket_tcpi_rttvar				Nullable(UInt32)					
socket_tcpi_snd_mss				Nullable(UInt32)					
socket_tcpi_total_retrans			Nullable(UInt32)					
status						String					
time_elapsed					Nullable(UInt32)					
time_end					Nullable(DateTime)					
time_start					DateTime					
tls_client_cipher				String					
tls_client_cipher_sha				String					
tls_client_protocol				String					
tls_client_servername				String					
tls_client_tlsexts_sha				String					
url						String		

For this example, we recently moved to an extended log format for all of our Fastly real-time streaming logs. The following query will help to establish exactly when that cutover took place as well as providing us with the most recent record and overall row count. min() and max() aggregate functions against our selected datetimecolumn, time_start, will provide us with the earliest and latest records for our data.

SELECT min(time_start), max(time_start), count() 
FROM fastly.xlogs 
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now())

Find the top N requested urls per week

The url column represents page requests which can contain query string parameters and references to static assets.

We can clean up the urls at query time using cutQueryString(url) to strip any query string params and also filter out the static assets using match(string, pattern) to exclude unwanted results. Finally, we can group byStartOfWeek(time_start) and topK(N)(column) function to calculate top values. A sub-query will do the filtering, cleaning, and time grouping.

SELECT week, topK(10)(path) 
   SELECT toStartOfWeek(time_start) AS week, cutQueryString(url) AS path 
   FROM fastly.xlogs  
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   AND  match(url,'^.*\.(css$|js$|json$|jpg$|js$|png$|svg$|xml$)') != 1 
   GROUP BY week, path) 
GROUP BY week 

Find the top 3 slowest urls per host

While it’s certainly useful to know what the top requested urls for the week are, data becomes significantly more actionable with the application of context. What would happen if our top requested urls were also our slowest? It goes without saying that they probably wouldn’t be the top N urls for long. As such, we may also want to keep an eye out for potentially problematic urls by monitoring response times in the event that we can proactively identify issues and resolve them before they become a problem.

This query is a bit different in that it looks at the full urls that have the highest time_elapsed values (in milliseconds) since the start of the week. In this case, it’s useful to have full visibility into the entire url since we are looking to identify areas of concern from a latency perspective; hence, we won’t be using cutQueryString(url) but instead using concat(host, url) to gives us the full url path.

If you didn’t notice, we are using a relatively unknown clause, LIMIT BY, which is unique to Clickhouse. It helps us limit the results returned in the subquery to 3 urls per host and is a great way to gain greater control of your data. The LIMIT n BY expression clause selects the first n rows for each distinct value of expressions. The key for LIMIT BY can contain any number of expressions.

SELECT week, concat(host, url) as full_url, (time_elapsed/1000) as time_elapsed_msec 
   SELECT toStartOfWeek(time_start) AS week, host, url, time_elapsed 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   GROUP BY week, host, url, time_elapsed 
   LIMIT 3 BY host) 
GROUP BY week, full_url, time_elapsed_msec 
ORDER BY time_elapsed_msec

Identify the top error causing urls using HTTP 4xx/5xx response codes

Keeping with the theme of leveraging the data to better understand the health of our website, HTTP response codes offer an option for monitoring client and server-side errors. HTTP 4xx status codes typically represent client-side errors and HTTP 5xx status codes are reserved for cases where the server has encountered an error and is incapable of performing the request. In this example, instead of setting the timeframe toStartOfWeek(), we use interval 7 day with now() to provide a trailing time window and match(status, ‘^[45]\d\d’) to pick up any status string result that starts with ‘4’ or ‘5’.

SELECT time_start, status, host, url, count(), cache_status 
FROM fastly.xlogs 
WHERE (time_start >= now() - interval 7 day) and match(status, '^[45]\d\d') 
GROUP BY time_start, host, url, status, cache_status 
ORDER by count() DESC 

Find out what % of requests go to docs.hydrolix.io

Our website hosts both our documentation (docs.hydrolix.io) and live query API (try.hydrolix.io) that executes these SQL examples. Previously, we found our topN urls but what percentage of that traffic is directed at our docs site?

The host column represents which domain the traffic is targetting. It’s a simple task at query time to compare a foreground data set to the full data set. We will use the countIf(column, condition) which is a combinator aggregation function on count(). The condition will be based on startsWith(string, condition). This allows us to conditionally count a foreground subset of the data. The time will be grouped on toStartOfDay(datetime). As this returns a time component by default, we format the output using formatDateTime(datetime, format) to remove the time.

formatDateTime(toStartOfDay(time_start), '%D') as day, 
countIf(startsWith(host,'docs')) / count() * 100 as %percent_docs 
FROM fastly.xlogs  
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 

Top 10 weekly visitor locations

Now that we have a pretty good handle on our top urls and are on the lookout for potential issues that may impact the user experience, it begs the question where are all these visitors coming from?

Everyone likes a good geolocation-based stat so let’s use a simple query to find out. Note that there are some null values present in the dataset so we’ve added city != (null) to avoid returning results that don’t provide us with a city.

SELECT toStartOfWeek(time_start) as week, 
geo_city AS city, geo_country_code AS country, count() as count 
FROM fastly.xlogs 
WHERE ((time_start BETWEEN '2021-01-01 00:00:00' AND now()) AND city != '(null)') 
GROUP BY week, city, country 

While the query itself is simple, this may serve as a starting point for better understanding who your clients are, where they are coming from, and ultimately using the additional insight to ensure you are best serving your audience. Alternatively, if you are more of a security-minded individual, you could combine a geolocation-based query with uniq(client_ip)request_refererrequest_user_agentrequest_x_forwarded_for and look for non-standard request types as an example.

Find the impact of response time vs cache status

The time_elapsed column provides an indicator to the responsiveness of our website. In theory, this should be impacted by the cache_status.

It’s worth mentioning that you could use the quantiles(l1, l2 ..)(column) function to quickly grab multiple percentile quantiles on a single scan but we’ve chosen to use separate quantileExact(level)(expr) functions for higher resolution and better clarity in the results formatting.

toDate(time_start) AS day, host, cache_status, 
(quantileExact(0.50)(time_elapsed)/1000) AS 50_percentile, 
(quantileExact(0.75)(time_elapsed)/1000) AS 75_percentile,
(quantileExact(0.99)(time_elapsed)/1000) AS 99_percentile 
FROM fastly.xlogs 
WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
GROUP BY day, host, cache_status 
ORDER BY day, cache_status 

As expected, a cache MISS will have a big impact on response times; although, the worst response times seem to occur with a cache ERROR. Hey, that sounds like a great opportunity to possibly modify the top N query from above with the addition of a cache_status filter on cache_status ERROR!

Monitor the cache error rate

Based on what we just discovered above with cache ERROR, a high-level metric that helps to monitor overall cache error rate sounds like it would be of value. Similar to the approach we used in the query to determine what percent of visitors are visiting docs.hydrolix.io versus try.hydrolix.io, we leverage countIf(cache_status = 'ERROR') to count and increment only when cache_status = 'ERROR'.That value is then divided by the total number of requests to produce a percentage rate.

toStartOfMonth(day) AS month, 
sum(error) AS error, 
sum(total) AS sumtotal, 
round((error/sumtotal)*100, 2) AS error_rate 
   SELECT toDate(time_start) AS day, 
   count(request) AS total, 
   countIf(cache_status = 'ERROR') AS error 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN '2021-01-01 00:00:00' AND now()) 
   GROUP BY day 
   ORDER BY day ASC) 
GROUP BY month 
ORDER BY month

Monthly HTTP response code metrics

For our final example, we produce an extremely useful suite of monthly metrics based on HTTP response code status. countIf is used in conjunction with (match(haystack, pattern) to produce selective and running totals based on response codes in the subquery. We subsequently take those totals, summarize them using sum(), and produce monthly totals and rates.

toStartOfMonth(day) AS month, 
sum(reqtotal) AS total_requests, 
sum(1XX_2XX) AS success_total, 
round((success_total/total_requests)*100, 2) AS success_rate, 
sum(3XX) AS redirect_total, 
round((redirect_total/total_requests)*100, 2) AS redirect_rate, 
sum(4XX_5XX) AS error_total, 
round((error_total/total_requests)*100, 2) AS error_rate 
   SELECT toDate(time_start) AS day, 
   count(request) AS reqtotal, 
   countIf(match(status, '^[12]\d\d')) AS 1XX_2XX, 
   countIf(match(status, '^3\d\d')) AS 3XX, 
    countIf(match(status, '^[45]\d\d')) AS 4XX_5XX 
   FROM fastly.xlogs 
   WHERE (time_start BETWEEN day AND now()) 
   GROUP BY day 
   ORDER BY day ASC) 
GROUP BY month 
ORDER BY month

While all of this data is undoubtedly useful in the right context, deriving value from metrics-oriented datasets can be difficult without the right visualization strategy. Hydrolix is quieried through Clickhouse, so we automatically support any visualization tool that Clickhouse supports. For the next example, we’ll show the Grafana dashboards we use for Fastly CDN observability.

Share Now