Home network monitoring

Background 

First of all, I have to say that I'm fairly into design, although I don't design things myself, nor do I know how to. I like things that are well and beautifully designed, and I'd say I have an eye for design, or at least I try. That's why functionality and user experience are not the only things that count for me.

For quite some time I was looking for a solution to visualize the data I collect from my smart home and IT systems at home. In the beginning I was using Metabase, a business intelligence tool that connects to a variety of database types like Postgres, MongoDB or MySQL. I know it from work, where we use it for network (GSM, UMTS, LTE) intelligence. In my home setup, I was running a simple instance of Metabase with Postgres as the database, which worked great. Nevertheless, there were some features I was missing, and the user experience of Metabase was sometimes not that great. Besides that, I was looking for a more lightweight solution that makes collecting data from different sources easy; I didn't want to write a lot of custom code. Before one of my colleagues told me about Telegraf, InfluxDB and Grafana, I played around with Elasticsearch, Logstash and Kibana, aka the ELK stack, for some days, but somehow the ELK stack and I didn't become friends. Luckily, my colleague pointed me to Telegraf, InfluxDB and Grafana. That's basically the journey that led me to these beautiful and easy to use tools. Let's stop the babbling and get to the solution that is currently in service and covers all of my needs.

Telegraf 

Telegraf is a plugin-driven data collector with a variety of input sources and output targets; besides that, it can also aggregate and process data. Personally, I'm using the following input plugins (a minimal configuration sketch follows the list):

  • CPU – to get the usage
  • Disk – to see how much space is left
  • Docker – to monitor my containers
  • Logparser Grok – to parse nginx logs
  • Mem – to see how much memory is occupied
  • Net – to get the current network throughput
  • Ping – to run ping tests
  • System – to check the uptime of a host
  • SNMP – to query metrics from my Ubiquiti Unifi AC HD Access Point
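
To give you an idea of how little configuration most of these need, here is a minimal sketch of the input section of a telegraf.conf. The plugin names are the real ones, but the option values are illustrative assumptions rather than my exact production settings:

# Input section sketch – illustrative values, not my exact setup
[[inputs.cpu]]
  ## report per-core and total CPU usage
  percpu = true
  totalcpu = true

[[inputs.disk]]
  ## skip pseudo filesystems when reporting free space
  ignore_fs = ["tmpfs", "devtmpfs"]

## mem, net and system work out of the box without options
[[inputs.mem]]
[[inputs.net]]
[[inputs.system]]

[[inputs.docker]]
  ## talk to the local Docker daemon
  endpoint = "unix:///var/run/docker.sock"

[[inputs.ping]]
  ## hypothetical ping target
  urls = ["8.8.8.8"]
  count = 3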

As I'm storing all of my data in InfluxDB, I have only one output plugin (a minimal config sketch follows):

  • InfluxDB
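
And the matching output section; again a hedged sketch where the URL, database name and credentials are placeholders:

# Output section sketch – placeholder URL and credentials
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "telegraf_data"
  username = "telegraf"
  password = "secret"

With that in place, every input plugin above writes its metrics into the telegraf_data database.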

A lot of plugins don't need additional configuration and are directly usable, but there are also some which need quite a bit of it. An example is SNMP: to get it working with the Unifi AC HD, you need to download the MIBs; a detailed guide for that can be found on the official Ubiquiti forum. The logparser grok plugin for parsing Nginx logs also needs some configuration:

# # Stream and parse log file(s).
[[inputs.logparser]]
# ## Log files to parse.
# ## These accept standard unix glob matching rules, but with the addition of
# ## ** as a "super asterisk". ie:
# ## /var/log/**.log -> recursively find all .log files in /var/log
# ## /var/log/*/*.log -> find all .log files with a parent dir in /var/log
# ## /var/log/apache.log -> only tail the apache log file
files = ["/dockerdata/nginx/prod/log/proxy.log"]
#
# ## Read files that currently exist from the beginning. Files that are created
# ## while telegraf is running (and that match the "files" globs) will always
# ## be read from the beginning.
from_beginning = true
#
# ## Method used to watch for file updates. Can be either "inotify" or "poll".
# # watch_method = "inotify"
#
# ## Parse logstash-style "grok" patterns:
# ## Telegraf built-in parsing patterns: https://goo.gl/dkay10
[inputs.logparser.grok]
# ## This is a list of patterns to check the given log file(s) for.
# ## Note that adding patterns here increases processing time. The most
# ## efficient configuration is to have one pattern per logparser.
# ## Other common built-in patterns are:
# ##   %{COMMON_LOG_FORMAT} (plain apache & nginx access logs)
# ##   %{COMBINED_LOG_FORMAT} (access logs + referrer & agent)
patterns = ["%{COMBINED_LOG_FORMAT}"]
#
# ## Name of the outputted measurement name.
# measurement = "apache_access_log"
#
# ## Full path(s) to custom pattern files.
# custom_pattern_files = []
#
# ## Custom patterns can also be defined here. Put one pattern per line.
# custom_patterns = '''
#
# ## Timezone allows you to provide an override for timestamps that
# ## don't already include an offset
# ## e.g. 04/06/2016 12:41:45 data one two 5.43µs
# ##
# ## Default: "" which renders UTC
# ## Options are as follows:
# ##   1. Local             -- interpret based on machine localtime
# ##   2. "Canada/Eastern"  -- Unix TZ values like those found in https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
# ##   3. UTC               -- or blank/unspecified, will return timestamp in UTC
# timezone = "Canada/Eastern"
# '''

To install Telegraf you can follow the official guide; a short sketch for Debian-based systems follows below.
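
A hedged sketch, assuming you have already added the InfluxData package repository as described in the guide:

# assumes the InfluxData apt repository is already configured
sudo apt-get update
sudo apt-get install telegraf
sudo systemctl enable --now telegraf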

As you can see, Telegraf is easy to configure and lets you collect data in a breeze. It's also well documented on GitHub, and if you open an issue or ask a question, help is just around the corner.

InfluxDB

InfluxDB is a time series database that was mainly made to store metrics and events. It offers an HTTP REST API, which can be used with or without authentication; I prefer to have authentication enabled. A request to InfluxDB can look like this:

curl -k -G -u 'user:password' 'https://host.domain/query?pretty=true' --data-urlencode "db=default_cell_check" --data-urlencode 'q=SELECT "percent_packet_loss" FROM "default_lte_stick_1_gprs_swisscom_ch_ping", "default_lte_stick_2_gprs_swisscom_ch_ping" GROUP BY "url" ORDER BY time DESC LIMIT 1'

As you can see in this example, the data can be queried in a SQL-like fashion, which I think is really cool. However, at home I'm not using such raw queries; I always connect through Grafana to query data. The above example is from work.

If you're running an InfluxDB instance on your host, you can connect to it directly by simply typing influx into the shell. But because I'm running InfluxDB on a different port than 8086 and with authentication, for me it's influx -username user -password "password" -port 443
Below you can find some examples of how to operate InfluxDB:


> show users
user admin
---- -----
user1 true
user2 false

> show databases
name: databases
name
----
_internal
telegraf_data

> use telegraf_data
Using database telegraf_data

> show measurements
name: measurements
name
----
cpu
disk
diskio
docker

Maybe you're asking yourself now "What's a measurement?" Well, that's roughly the equivalent of a table in a standard SQL database.

> select * from system limit 1
name: system
time host load1 load15 load5 n_cpus n_users uptime uptime_format
---- ---- ----- ------ ----- ------ ------- ------ -------------
1521662490000000000 srv-ubn-services-1 0.02 0 0.04 3 2 290956 3 days, 8:49

> grant read on telegraf_data to user1
>

> grant all on telegraf_data to user1
>
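
By the way, the user has to exist before you can grant anything to it; creating one is a one-liner in the influx shell (name and password here are placeholders):

> create user "user1" with password 'secret'
>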

 

Of course there are a lot more commands available for detailed configuration; to have a look at them, head over to the official documentation.

As my home network is not a production system, I haven't set up InfluxDB in a high-availability and scalable manner; to achieve that, I would need to set up a second InfluxDB instance on a second host. Maybe I'll add that some day 😉 In InfluxDB you can also define really detailed retention policies to configure how long and at which granularity data should be stored. I don't use this currently.
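
Should I ever need one, a retention policy is a single InfluxQL statement. A hedged sketch with a made-up name and duration, which would keep writes to telegraf_data for 30 days by default:

> create retention policy "one_month" on "telegraf_data" duration 30d replication 1 default
>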

Grafana

To visualize and dive into my data, I'm using Grafana, a mighty web-based time series visualization tool that brings different database types and plugins together under one roof. It's really powerful and makes exploring your data so much easier. To name some supported database types:

  • Postgres
  • MySQL
  • InfluxDB

And some plugins:

  • NTOPNG
  • worldPing (allows you to ping URLs / IPs from globally distributed probes)
  • Cloudflare Grafana App

For InfluxDB, you don't even have to write the queries by hand; you can build them in a graphical way. Let's have a look at how I query the CPU usage of my Docker containers:

[Screenshot: Grafana query builder for Docker container CPU usage]

It's nothing different than:

SELECT distinct("usage_percent") FROM "docker_container_cpu" WHERE $timeFilter GROUP BY time($__interval), "container_name" fill(null)

$timeFilter and $__interval are Grafana macros that get replaced with the dashboard's current time range and an automatically chosen group-by interval.

The above query is part of the Docker Containers dashboard. I'm running almost every application in Docker:

[Screenshot: Docker Containers dashboard]

As I run everything virtualized on ESXi, I also have a dashboard to monitor my two ESXi nodes; unfortunately, one is currently down due to a lightning strike. To query and store the ESXi metrics in InfluxDB, you can use this program.

[Screenshot: ESXi dashboard]

But as a network engineer, my favorite dashboard is of course the network dashboard.

[Screenshot: Network dashboard]

To get the WAN interface data (volume, distribution of applications, traffic per application and overall traffic), I set up the NTOPNG plugin, which connects to my Pfsense firewall to query the data. The downside is that you cannot write your own queries against it, but InfluxDB support for NTOPNG is on the roadmap. The UNIFI data comes from InfluxDB, where it was inserted by Telegraf after being queried from the UNIFI AP via SNMP. The internet speed results also come from InfluxDB, inserted by Home Assistant and the following plugin.
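
Assuming the plugin in question is Home Assistant's speedtest sensor, the Home Assistant side boils down to two snippets in configuration.yaml; host, database and credentials below are placeholders, so check the plugin's documentation for the exact options:

# configuration.yaml sketch – placeholder host and credentials
influxdb:
  host: 192.168.1.10
  database: home_assistant
  username: hass
  password: secret

sensor:
  - platform: speedtest
    monitored_conditions:
      - ping
      - download
      - upload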

 

I have some additional dashboards, but for security reasons I cannot share those with you.

 

You can find the Grafana JSON templates for all of the dashboards displayed here in my GitHub repo.

 

Coming Up

Let me know in the comments or on Twitter if you would like to see some plugin configuration in more detail, or anything else.