0x9B4B Free Telemetry

Some time ago I wrote about leveraging the GCP free tier for log collection. I also started using it for telemetry, though I never updated the post with those details.

Since Google refactored their Monitoring away from that crappy StackDriver interface, it’s actually quite nice, so might as well write the setup down (up? ☁️).

Even though GCP Monitoring seems oriented towards metrics from GCP services, it also allows you to create logs-based metrics.
These basically let you define rules to extract data points from logs into metrics that can be monitored and alerted on.

First things first: building on that previous post about GCP logging, and assuming you have the setup described there to ship logs to GCP/stackdriver, you can use fluentbit inputs such as mem and cpu to generate the logs that will carry the data points:

~# /opt/td-agent-bit/bin/td-agent-bit -i mem -o stdout
Fluent Bit v1.5.6

[2020/11/24 01:41:24] [ info] [engine] started (pid=21309)
[2020/11/24 01:41:24] [ info] [storage] version=1.0.5, initializing...
[2020/11/24 01:41:24] [ info] [storage] in-memory
[2020/11/24 01:41:24] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/11/24 01:41:24] [ info] [sp] stream processor started
[0] mem.0: [1606182084.325065322, {"Mem.total"=>948084, "Mem.used"=>772048, "Mem.free"=>176036, "Swap.total"=>102396, "Swap.used"=>6608, "Swap.free"=>95788}]
[1] mem.0: [1606182085.325055771, {"Mem.total"=>948084, "Mem.used"=>772088, "Mem.free"=>175996, "Swap.total"=>102396, "Swap.used"=>6608, "Swap.free"=>95788}]
[2] mem.0: [1606182086.325105700, {"Mem.total"=>948084, "Mem.used"=>772904, "Mem.free"=>175180, "Swap.total"=>102396, "Swap.used"=>6608, "Swap.free"=>95788}]

The configuration file /etc/td-agent-bit/td-agent-bit.conf would then be updated with something like:

...

[INPUT]
    Name cpu
    Tag cpu
    Interval_Sec 10

[INPUT]
    Name mem
    Tag mem
    Interval_Sec 10

...

[OUTPUT]
    Name stackdriver
    google_service_credentials /etc/gcreds.json
    Match *
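After updating the config, restart the service so the new inputs and the stackdriver output kick in (assuming the systemd unit shipped with the td-agent-bit package):

~# systemctl restart td-agent-bit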

Logs will then show up in the GCP Logs Viewer.
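To double-check the entries are arriving, something along these lines in the Logs Viewer query box should surface them. The structured fields from fluentbit end up under jsonPayload; the resource type and log name depend on how the stackdriver output is configured, so treat this as a sketch rather than a copy-paste query:

resource.type="global"
jsonPayload."Mem.used">0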

As an extra step I've created a fluentbit math filter: the mem input outputs raw absolute values, GCP does not allow any transformation of the data, and visualization across multiple hostnames is friendlier with percentages than with absolute values.

...
[INPUT]
    Name mem
    Tag mem
    Interval_Sec 10

[FILTER]
    Name math
    Match mem
    Operation div
    Field Mem.used
    Field Mem.total
    Output_field Mem.usage
...

This [FILTER] entry will add Mem.usage to the output (which is Mem.used / Mem.total).
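If you'd rather not maintain a custom plugin, the stock lua filter could do the same division. A rough sketch, where the script path and function name are just my own picks:

[FILTER]
    Name   lua
    Match  mem
    script /etc/td-agent-bit/mem_usage.lua
    call   add_usage

And /etc/td-agent-bit/mem_usage.lua:

-- add Mem.usage = Mem.used / Mem.total to every mem record
function add_usage(tag, timestamp, record)
    if record["Mem.total"] ~= nil and record["Mem.total"] > 0 then
        record["Mem.usage"] = (record["Mem.used"] or 0) / record["Mem.total"]
    end
    -- 1 means the record was modified; timestamp is passed through unchanged
    return 1, timestamp, record
end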

In the Logs Viewer you can then click Actions, Create Metric and set it up.

Type Counter will just count the number of log lines matching the given query, while Distribution will extract an actual value from each line (Distribution being the right choice here).
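For what it's worth, a counter metric can also be created from the CLI; a sketch along these lines should work (the metric name and filter are placeholders, adjust them to whatever your logs actually contain):

~# gcloud logging metrics create mem_records \
     --description="count of mem records shipped by fluentbit" \
     --log-filter='resource.type="global" AND jsonPayload."Mem.usage">0'

Distribution metrics additionally need a value extractor (and possibly a bucket layout), so for those the UI flow above is the simpler route.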

Once the metric is set up it can be added to a dashboard and/or used as an alert source.

Note

These metrics count towards both the logs and the metrics ingestion limits, which at the moment are 10GB and 150MB per month respectively.

I have some exclusion filters on the logs (for frequent lines that no one cares about). With about 10 hosts shipping logs here (around 30 docker services) I still hardly go above 3GB of log ingestion.
Metric ingestion, on the other hand, does get to 150MB quickly once you start adding a few metrics…
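For reference, an exclusion filter is just another Logging query; something like the following (an entirely made-up example) keeps health-check chatter from eating the quota:

severity<=DEBUG OR textPayload:"GET /health"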