HYDROLIX BLOG

Ponderings, insights and industry updates

Big Data with small bill

June 24, 2021

Author: David Sztykman |

Tags: , , ,

A month ago I started indexing Certificate transparency logs into Hydrolix.
I’ve already created a blog post about this that you can view here.

This one is more about the performance and TCO of Hydrolix for this dataset.

But first what does a single certificate entry look like :

[
  {
    "data.cert_index": "1036303569",
    "data.cert_link": "http://ct.googleapis.com/logs/xenon2021/ct/v1/get-entries?start=1036303569&end=1036303569",
    "data.leaf_cert.all_domains": [
      "lakovacitechnologie.cz",
      "www.lakovacitechnologie.cz"
    ],
    "data.leaf_cert.extensions.authorityInfoAccess": "CA Issuers - URI:http://r3.i.lencr.org/\nOCSP - URI:http://r3.o.lencr.org\n",
    "data.leaf_cert.extensions.authorityKeyIdentifier": "keyid:14:2E:B3:17:B7:58:56:CB:AE:50:09:40:E6:1F:AF:9D:8B:14:C2:C6\n",
    "data.leaf_cert.extensions.basicConstraints": "CA:FALSE",
    "data.leaf_cert.extensions.certificatePolicies": "Policy: 1.3.6.1.4.1.44947.1.1.1\n  CPS: http://cps.letsencrypt.org",
    "data.leaf_cert.extensions.ctlSignedCertificateTimestamp": "",
    "data.leaf_cert.extensions.extendedKeyUsage": "TLS Web server authentication, TLS Web client authentication",
    "data.leaf_cert.extensions.keyUsage": "Digital Signature, Key Encipherment",
    "data.leaf_cert.extensions.subjectAltName": "DNS:www.lakovacitechnologie.cz, DNS:lakovacitechnologie.cz",
    "data.leaf_cert.extensions.subjectKeyIdentifier": "D0:64:F8:BE:92:AA:5A:1B:78:E2:3B:43:21:2B:D0:DC:74:39:FC:CD",
    "data.leaf_cert.fingerprint": "CC:19:23:02:27:8E:90:C2:A1:8D:23:00:82:30:D9:FF:13:49:BA:48",
    "data.leaf_cert.issuer.C": "US",
    "data.leaf_cert.issuer.CN": "R3",
    "data.leaf_cert.issuer.L": "",
    "data.leaf_cert.issuer.O": "Let's Encrypt",
    "data.leaf_cert.issuer.OU": "",
    "data.leaf_cert.issuer.ST": "",
    "data.leaf_cert.issuer.aggregated": "/C=US/CN=R3/O=Let's Encrypt",
    "data.leaf_cert.issuer.emailAddress": "",
    "data.leaf_cert.not_after": "2021-09-22 08:03:28",
    "data.leaf_cert.not_before": "2021-06-24 08:03:29",
    "data.leaf_cert.serial_number": "322B4EE5F85BB09C283C30E55E8BBC90523",
    "data.leaf_cert.signature_algorithm": "sha256, rsa",
    "data.leaf_cert.subject.C": "",
    "data.leaf_cert.subject.CN": "lakovacitechnologie.cz",
    "data.leaf_cert.subject.L": "",
    "data.leaf_cert.subject.O": "",
    "data.leaf_cert.subject.OU": "",
    "data.leaf_cert.subject.ST": "",
    "data.leaf_cert.subject.aggregated": "/CN=lakovacitechnologie.cz",
    "data.leaf_cert.subject.emailAddress": "",
    "data.seen": 1624525483,
    "data.source.name": "Google 'Xenon2021' log",
    "data.source.url": "ct.googleapis.com/logs/xenon2021/",
    "data.update_type": "PrecertLogEntry",
    "message_type": "certificate_update",
    "timestamp": "2021-06-24 09:04:43"
  }
]

For the last 30 days we have ingested:

  • 262,026,996 logs entries
  • raw data sent to Hydrolix is 497,664GB

One of the advantage of Hydrolix is our compression mechanism; currently the whole dataset with every field index is 37,712GB which is more than 92% compression!

Hydrolix is storing the data into s3 bucket, so storing this whole dataset is about 1$ per month (we are purely talking about storage cost here).

Query Performance

Hydrolix is stateless, you can spin up different server type and number to query the data.

For this example I have been using 3 – c5n.9XLarge servers from AWS.

Let’s start with a query looking for *kafka* in my array data.leaf_cert.all_domains.

SELECT arrayFilter(x -> x LIKE '%kafka%', data.leaf_cert.all_domains) as result
FROM sample.cts
WHERE notEmpty(result)

678822 rows in set. Elapsed: 1.195 sec. Processed 261.88 million rows, 21.03 GB (219.10 million rows/s., 17.59 GB/s.)

Here the response is 678822 rows and it took 1.195 sec to send the response.
Hydrolix has processed 261.88 million rows which are 21.03GB
We have been processing data at 219.10 million rows/s., 17.59 GB/s

The first question is why if the whole dataset is 38GB did we only processed 21.03GB?
Because Hydrolix uses a column storage, this means we retrieve only the column we need to execute the query.
This column here is data.leaf_cert.all_domains.

Another example this time using predicate pushdown on top of the current query I also added another filter data.update_type = "PrecertLogEntry"

SELECT arrayFilter(x -> (x LIKE '%kafka%'), data.leaf_cert.all_domains) AS result
FROM sample.cts
WHERE notEmpty(result) AND (`data.update_type` IN ('PrecertLogEntry'))

678822 rows in set. Elapsed: 0.713 sec. Processed 156.85 million rows, 16.39 GB (220.05 million rows/s., 22.98 GB/s.)

As you can see we are processing only 16.39GB this time, because we use the filter data.update_type = PrecertLogEntry, so we are only fetching rows in the column data.leaf_cert.all_domains when this condition is true. This reduces even more the amount of data we transfer.

Let’s do another query this time let’s count the number of certificates group by certificate authority issuer:

SELECT data.leaf_cert.issuer.O, COUNT()
FROM sample.cts
GROUP BY data.leaf_cert.issuer.O
ORDER BY COUNT()

252 rows in set. Elapsed: 1.137 sec. Processed 262.34 million rows, 5.97 GB (230.66 million rows/s., 5.25 GB/s.)

Here the response is 252 rows and it took 1.137 sec to send the response.
Hydrolix has processed 262.34 million rows which are 5.97GB
We have been processing data at 230.66 million rows/s., 5.25 GB/s

Again we only fetch the column we need to send the response.

By default Hydrolix uses timestamp as primary key to index the data.

Getting data from yesterday vs 3 weeks ago is the same process, we fetch the proper partition from s3.

In this example I can move into my 30d of data as quickly as if we have one year or more.

You can actually play with this dashboard and see from yourself here

Hydrolix differentiates its approach by leveraging cloud storage and fetch only the data needed to do the query.
With our compression algorithm we have blazing fast performance as you saw earlier.

By retrieving the data quickly we can avoid storing the data locally; this is the network performance we achieve for this example:

Share Now