Building the greatest monitoring stack with Prometheus
Perhaps not the “greatest” monitoring stack, but this should give you a fantastic (in my opinion) baseline to be well on your way to a useful monitoring stack using Prometheus, Grafana and a bunch of exporters!
I'll upload my current docker-compose file to GitHub, though this will very much be a living file and will be iterated on. I'll update this post as and when that happens.
As a little preface, I'll go through my needs and requirements. I love self-hosting as much as possible. I also like to provide small services to the public, should they want them. I have a pretty extensive home network, along with a collection of hosted dedicated servers and VPSs providing a bunch of services. Historically I have run my monitoring from inside my network, but I'm trying to thin down the number of machines I'm running at home because of the UK's energy crisis and ridiculous costs. So for now my monitoring solution is on a Hetzner VPS.
I wanted to make this as "production" ready as possible: not too noisy, but still ensuring I get alerted when I actually need to action something. That balance seems surprisingly difficult to get right.
My current monitoring stack consists of the following containers:
- Prometheus
- Traefik
- Alertmanager
- Grafana
- Blackbox-Exporter
- node-exporter
- watchtower
- PushProx – though this is deployed separately from the monitoring stack, inside my home network.
Traefik
Let's start with our entry point into the monitoring stack. This is one of the first times I've used Traefik in anger; I usually go for my tried and tested Nginx reverse proxy. But as I'm trying to push everything into docker-compose, I figured I should do it properly.
traefik:
  image: traefik:v2.9
  container_name: traefik
  restart: unless-stopped
  networks:
    - monitoring
  ports:
    - "80:80"
    - "443:443"
    - "8080:8080"
  command:
    - "--api.insecure=true"
    - "--providers.docker"
    - "--metrics.prometheus=true"
    - "--entrypoints.web.address=:80"
    - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
    - "--entrypoints.web.http.redirections.entrypoint.scheme=https"
    - "--entrypoints.websecure.address=:443"
    - "--certificatesresolvers.route53.acme.dnschallenge=true"
    - "--certificatesresolvers.route53.acme.dnschallenge.provider=route53"
    #- "--certificatesresolvers.route53.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
    - "--certificatesresolvers.route53.acme.email=email_address"
    - "--certificatesresolvers.route53.acme.storage=/letsencrypt/acme.json"
    - "--accesslog=true"
    - "--accesslog.filepath=/var/log/traefik/access.log"
    - "--accesslog.format=json"
  volumes:
    - "./traefik:/var/log/traefik/"
    - "./letsencrypt:/letsencrypt"
    - "/var/run/docker.sock:/var/run/docker.sock"
  environment:
    - TZ=Europe/London
    - AWS_ACCESS_KEY_ID=${TRAEFIK_AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${TRAEFIK_AWS_SECRET_ACCESS_KEY}
    - AWS_REGION=${TRAEFIK_AWS_REGION}
    - AWS_HOSTED_ZONE_ID=${TRAEFIK_AWS_HOSTED_ZONE_ID}
  labels:
    - "com.centurylinklabs.watchtower.enable=true"
A couple of things to go over in this:
- I am using Route53 for DNS.
- I have enabled the API – you don't need to do this.
- The ACME staging CA server is commented out. I'd suggest uncommenting this until you are happy things are working, so you don't hit Let's Encrypt's production rate limits while testing.
- You will need to replace the email address placeholder with your own.
- This is listening on both 80 and 443, but it also redirects requests from 80 to 443 to ensure everything is served over HTTPS.
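The `${TRAEFIK_AWS_*}` references above are substituted by docker-compose from the environment; the simplest way to supply them is a `.env` file sitting next to the compose file. A sketch with placeholder values (swap in your own credentials):

```
TRAEFIK_AWS_ACCESS_KEY_ID=your-access-key-id
TRAEFIK_AWS_SECRET_ACCESS_KEY=your-secret-access-key
TRAEFIK_AWS_REGION=eu-west-1
TRAEFIK_AWS_HOSTED_ZONE_ID=your-hosted-zone-id
```

Keep this file out of version control, since it holds the keys Traefik uses to answer the Route53 DNS challenge.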
Prometheus
The core to our monitoring stack:
prometheus:
  image: prom/prometheus:latest
  container_name: prometheus
  restart: unless-stopped
  volumes:
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
    - prometheus_data:/prometheus
  command:
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--storage.tsdb.path=/prometheus"
    - "--web.console.libraries=/etc/prometheus/console_libraries"
    - "--web.console.templates=/etc/prometheus/consoles"
    - "--web.enable-lifecycle"
  networks:
    - monitoring
  labels:
    - "traefik.http.routers.prometheus.rule=Host(`prom.domain.name`)"
    - "traefik.http.routers.prometheus.tls=true"
    - "traefik.http.routers.prometheus.tls.certresolver=route53"
    - "com.centurylinklabs.watchtower.enable=true"
This is somewhat simple. The only really interesting section is the labels, as these are what link the container to Traefik and tell Traefik to get me a Let's Encrypt cert.
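For completeness, here's a minimal sketch of the `prometheus.yml` that gets mounted into the container. The job names are my own choices, and the targets assume the containers resolve by name on the shared `monitoring` network with their default ports; Traefik's metrics are exposed on :8080 because `--metrics.prometheus=true` is set above:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alert rules mounted alongside this file in the compose config
rule_files:
  - /etc/prometheus/alert.rules.yml

# Point Prometheus at the Alertmanager container (default port 9093)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: traefik
    static_configs:
      - targets: ["traefik:8080"]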
Alertmanager
We need to actually be alerted when things fail, so that's next.
alertmanager:
  image: prom/alertmanager:latest
  container_name: alertmanager
  restart: unless-stopped
  volumes:
    - ./alertmanager:/config
    - alertmanager_data:/data
  command:
    - "--config.file=/config/alertmanager.yml"
    - "--web.external-url=https://alertmanager.domain.name"
  networks:
    - monitoring
  labels:
    - "traefik.http.routers.alertmanager.rule=Host(`alertmanager.domain.name`)"
    - "traefik.http.routers.alertmanager.tls=true"
    - "traefik.http.routers.alertmanager.tls.certresolver=route53"
    - "com.centurylinklabs.watchtower.enable=true"
Again, not really much to go through. One of the more important flags, though, is --web.external-url=. This makes alerts come through with a link to the proper Alertmanager hostname rather than the container's internal address.
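A minimal sketch of the `alertmanager.yml` that gets mounted above. The routing timings are reasonable defaults, and the webhook URL is a placeholder; swap in whichever receiver you actually use (email, Slack, PagerDuty, etc.):

```yaml
route:
  receiver: default
  group_by: ["alertname"]
  group_wait: 30s      # wait to batch alerts that fire together
  group_interval: 5m   # gap before sending about new alerts in a group
  repeat_interval: 4h  # how often to re-notify on a still-firing alert

receivers:
  - name: default
    webhook_configs:
      - url: https://example.com/alert-hook  # placeholder endpoint
```

Tuning `group_wait`/`repeat_interval` is a big part of keeping the stack from being too noisy.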
Grafana
The beautiful way to display our data!
grafana:
  image: grafana/grafana:latest
  container_name: grafana
  restart: unless-stopped
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards
    - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources
  environment:
    - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER}
    - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
    - GF_USERS_ALLOW_SIGN_UP=false
    - GF_INSTALL_PLUGINS=grafana-piechart-panel
  networks:
    - monitoring
  labels:
    - "traefik.http.routers.grafana.rule=Host(`grafana.domain.name`)"
    - "traefik.http.routers.grafana.tls=true"
    - "traefik.http.routers.grafana.tls.certresolver=route53"
    - "com.centurylinklabs.watchtower.enable=true"
Not too much to go through here. It's wise to get up to speed on how Grafana installs plugins when running in Docker; the GF_INSTALL_PLUGINS variable above is an example of that.
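The datasources directory mounted above lets Grafana provision its Prometheus connection automatically instead of clicking through the UI. A minimal sketch of a file like `grafana/provisioning/datasources/prometheus.yml` (the filename is my choice; the URL assumes the Prometheus container on the shared network):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                   # Grafana backend proxies the queries
    url: http://prometheus:9090     # container name on the monitoring network
    isDefault: true
```

With this in place, a fresh Grafana container comes up already wired to Prometheus.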
Blackbox Exporter
I use this to monitor HTTP responses from my webservers along with checking how long their certs have left.
blackbox-exporter:
  image: prom/blackbox-exporter:latest
  container_name: blackbox-exporter
  networks:
    - monitoring
  dns: 8.8.8.8
  restart: unless-stopped
  volumes:
    - ./blackbox-exporter:/config
  command: --config.file=/config/blackbox.yml
  labels:
    com.centurylinklabs.watchtower.enable: "true"
There's nothing really special about this config, to be honest. Pretty boilerplate.
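The interesting part lives in the mounted `blackbox.yml` and in how Prometheus scrapes it: the exporter probes whatever target Prometheus passes it, so the scrape job needs the standard relabelling dance from the blackbox exporter docs. A sketch of both, with `https://example.com` as a placeholder target:

```yaml
# blackbox.yml – probe module definition
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []  # empty defaults to 2xx
      preferred_ip_protocol: ip4

# prometheus.yml scrape job for the exporter
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com  # sites to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # pass the URL as ?target=
      - source_labels: [__param_target]
        target_label: instance         # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115  # actually scrape the exporter
```

The `probe_ssl_earliest_cert_expiry` metric this produces is what I alert on for certificate expiry.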
Node Exporter
This is what will provide your OS-level metrics. It should probably be the first exporter you run on your machines.
node-exporter:
  image: prom/node-exporter:latest
  container_name: node-exporter
  restart: unless-stopped
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - "--path.procfs=/host/proc"
    - "--path.rootfs=/rootfs"
    - "--path.sysfs=/host/sys"
    - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
  networks:
    - monitoring
  labels:
    com.centurylinklabs.watchtower.enable: "true"
Just like the blackbox exporter, this is a pretty self-explanatory one.
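Its metrics feed the `alert.rules.yml` file mounted into Prometheus earlier. As an illustrative sketch (the alert name and 10% threshold are my own choices), a low-disk rule over node-exporter's filesystem metrics might look like:

```yaml
groups:
  - name: node
    rules:
      - alert: HostOutOfDiskSpace
        # fraction of filesystem space still available, as a percentage
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m   # must stay low for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
```

The `for: 5m` clause is one of the easiest wins for cutting alert noise: brief blips never page you.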
Watchtower
The final container is just Watchtower, which will keep your other containers up to date!
watchtower:
  image: containrrr/watchtower:latest
  environment:
    - WATCHTOWER_LABEL_ENABLE=true
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  labels:
    com.centurylinklabs.watchtower.enable: "true"
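All of the services above reference a shared `monitoring` network and a handful of named volumes, which need top-level declarations in the same docker-compose file. A minimal sketch of that footer:

```yaml
networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:
```

Using named volumes here means metrics, silences and dashboards survive container recreation, which matters when Watchtower is replacing containers automatically.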
Summary
That about sums up my monitoring stack. I find it a great baseline that brings in a lot of genuinely useful metrics. In a future post I'll go over some of the dashboards used to bring this data to life!