0x474E (In)fluent(a)bit

I’ve moved log collection and telemetry out of GCP Stackdriver (mentioned before) to an on-prem InfluxDB (running on a Raspberry Pi 4 with 4GB of RAM).

FluentBit was the agent of choice at the time, but Telegraf now seems like a good candidate too.

Why

fluentd has a lot of plugins, but its crappy Ruby codebase eats way too many resources for me to use it (why would anyone?).
Luckily, the same guys gave it an extreme makeover in C and called it fluentbit. Sadly, its community hasn’t grown that much, so there aren’t many plugins available out of the box. Making new ones is quite easy, though: I’ve built this docker image to make building plugins easier and have already made a few of my own.

As Telegraf is written in Go (worse than C but way better than Ruby) and has almost as many plugins available as fluentd, it seemed worth measuring how much worse than fluentbit it is resource-wise, so I can weigh it properly.

How

TL;DR

  • set up telegraf and fluentbit to collect the same metrics: memory, CPU and disk IO
  • set up a second telegraf to monitor those two agents (with procstat)
  • compare the baseline telemetry data to make sure both agents report the same thing
  • compare the telemetry data of the collector processes to weigh telegraf’s resource-hogging

Test Details

Test hardware: Raspberry Pi 2

All telemetry data goes to an InfluxDB 2.0.4 instance.

  • download and extract telegraf
curl -LO https://dl.influxdata.com/telegraf/releases/telegraf-1.17.2_linux_armhf.tar.gz
tar xzf telegraf-1.17.2_linux_armhf.tar.gz
  • create test buckets

    • test_telegraf
    • test_fluentbit
    • test_results
  • install fluentbit

curl https://packages.fluentbit.io/fluentbit.key | sudo apt-key add -
echo "deb https://packages.fluentbit.io/debian/buster buster main" | sudo tee /etc/apt/sources.list.d/fluentbit.list
sudo apt update
sudo apt install td-agent-bit
  • telegraf configuration
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false
[[outputs.influxdb_v2]]
  urls = ["https://my.influxdb"]
  token = "<TOKEN>"
  organization = "myorg"
  bucket = "test_telegraf"
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
[[inputs.diskio]]
[[inputs.mem]]
  • start telegraf
telegraf-1.17.2/usr/bin/telegraf --config telegraf.conf
  • fluentbit configuration
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    info
    HTTP_Server  Off
    Plugins_File /etc/td-agent-bit/plugins.conf
[INPUT]
    Name cpu
    Tag met.cpu
[INPUT]
    Name mem
    Tag met.mem
[INPUT]
    Name disk
    Tag met.disk
[FILTER]
    Name record_modifier
    Match *
    Record hostname ${HOSTNAME}
[OUTPUT]
    Name influxdb_v2
    Match met.*
    Host my.influxdb
    Port 443
    tls on
    tls.verify off
    org myorg
    bucket test_fluentbit
    http_token <TOKEN>
    Tag_Keys hostname
  • start fluentbit
/opt/td-agent-bit/bin/td-agent-bit -c fluentbit.conf
  • configuration to monitor agents (using telegraf)
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false
[[outputs.influxdb_v2]]
  urls = ["https://my.influxdb"]
  token = "<TOKEN>"
  organization = "myorg"
  bucket = "test_results"
[[inputs.procstat]]
  pattern = "/opt/td-agent-bit/bin/td-agent-bit -c fluentbit.conf"
[[inputs.procstat]]
  pattern = "telegraf-1.17.2/usr/bin/telegraf --config telegraf.conf"
  • start second telegraf (agent monitor)
telegraf-1.17.2/usr/bin/telegraf --config telemon.conf
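
A quick sanity check on the two procstat patterns above: Telegraf's documentation describes `pattern` as the argument given to `pgrep -f`, so it is matched as a regular expression against the full command line. The sketch below approximates that with Python's `re` module (an assumption, since pgrep's regex dialect is not identical) to show that each pattern selects exactly one of the three processes and that the monitoring telegraf does not count itself. The helper name `matching_commands` is made up for this illustration.

```python
import re

# Patterns copied from the two procstat inputs above.
PATTERNS = {
    "fluentbit": "/opt/td-agent-bit/bin/td-agent-bit -c fluentbit.conf",
    "telegraf": "telegraf-1.17.2/usr/bin/telegraf --config telegraf.conf",
}

# Command lines of the three processes started in this test.
COMMAND_LINES = [
    "/opt/td-agent-bit/bin/td-agent-bit -c fluentbit.conf",
    "telegraf-1.17.2/usr/bin/telegraf --config telegraf.conf",
    "telegraf-1.17.2/usr/bin/telegraf --config telemon.conf",
]


def matching_commands(pattern, command_lines):
    """Return the command lines that contain a regex match for pattern.

    This approximates `pgrep -f`; it does not call pgrep itself.
    """
    regex = re.compile(pattern)
    return [cmd for cmd in command_lines if regex.search(cmd)]


if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        print(name, "->", matching_commands(pattern, COMMAND_LINES))
```

Because the config file name is part of each pattern, the agent monitor (started with telemon.conf) is left out of the telegraf measurement.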

Results

Left this running for over a week and these were the results:

Both lines (from fluentbit and telegraf) basically match, so the two agents produced practically the same telemetry data. The exception is disk IO, where they use different units and I didn’t bother to find out how to convert 🏝️
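
One plausible reason for the disk IO mismatch, offered as an assumption rather than something verified in this test, is that one agent reports a cumulative counter (Telegraf's diskio fields such as read_bytes are documented as counters) while the other reports the amount per collection interval. If that is the case, the conversion is just the difference between consecutive samples. A minimal sketch, with a hypothetical helper name and made-up sample values:

```python
def counter_to_deltas(samples):
    """Convert cumulative counter samples into per-interval increases.

    A negative difference is treated as a counter reset (for example after
    a reboot), in which case the new sample value is used as the delta.
    """
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        diff = current - previous
        deltas.append(diff if diff >= 0 else current)
    return deltas


if __name__ == "__main__":
    # Made-up cumulative byte counts, one sample per collection interval.
    cumulative_read_bytes = [4096, 8192, 8192, 20480]
    print(counter_to_deltas(cumulative_read_bytes))
```

The two agents also flush at different intervals in this setup, so the per-interval values would still need to be aggregated over the same time window before comparing them.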

And the results that matter: fluentbit process telemetry versus telegraf

CPU

fluentbit (blue line) sits at 0.4% and telegraf between 0.6% and 0.7%.
We can also see the frequent ups and downs that probably come with any garbage-collected language, versus the flat(ter) line of fluentbit and its fine-tuned malloc and free timings.

MEM

fluentbit (blue line) uses around 10MB of memory (about 1% on this RPi 2) and telegraf uses between 20MB and 40MB.
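
To put the CPU and memory figures above in relative terms, here is a small sketch of the arithmetic. It assumes the RPi 2's full 1GB of RAM is usable (in practice some is reserved, so the percentages are slightly optimistic); the function names are made up for this illustration.

```python
def ratio(value, baseline):
    """How many times larger value is than baseline."""
    return value / baseline


def percent_of(part, total):
    """part expressed as a percentage of total."""
    return 100.0 * part / total


if __name__ == "__main__":
    total_ram_mb = 1024  # assumption: RPi 2 with 1GB of RAM, all usable

    # CPU: fluentbit at 0.4%, telegraf between 0.6% and 0.7%.
    print("cpu ratio: %.2fx to %.2fx" % (ratio(0.6, 0.4), ratio(0.7, 0.4)))

    # Memory: fluentbit around 10MB, telegraf between 20MB and 40MB.
    print("mem ratio: %.0fx to %.0fx" % (ratio(20, 10), ratio(40, 10)))
    print("fluentbit mem: %.1f%%" % percent_of(10, total_ram_mb))
    print("telegraf mem: %.1f%% to %.1f%%" % (
        percent_of(20, total_ram_mb), percent_of(40, total_ram_mb)))
```

In other words, telegraf costs roughly 1.5 to 1.75 times the CPU and 2 to 4 times the memory of fluentbit, which on this board still amounts to only about 2% to 4% of RAM.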

Disk IO

fluentbit read count (blue line) increases much faster than telegraf’s (red line), yet write count lines are basically the same. Not entirely sure what this means though.

Conclusion

Fluentbit has a long way to go in plugin availability and an even longer way for Windows targets… So if I had Windows machines, or needed a specific plugin and could not build it myself, I think telegraf’s resource usage wouldn’t be an issue.

But as I already have fluentbit set up for building my own plugins whenever I need one beyond the core ones, I’ll keep fluentbit for the smaller footprint.