Building the greatest monitoring stack with Prometheus

Perhaps not the “greatest” monitoring stack, but this should give you a fantastic (in my opinion) baseline to be well on your way to a useful monitoring stack using Prometheus, Grafana and a bunch of exporters!

I’ll upload my current docker-compose file to GitHub, though it will very much be a living file that gets iterated on. I’ll update this post as and when that happens.

As a little preface, I’ll go through my needs and requirements. I love self-hosting as much as possible. I also like to provide small services to the public, should they want them. I have a pretty extensive home network, along with a collection of hosted dedicated servers and VPSs providing a bunch of services. Historically I have run my monitoring from inside my network, but I’m trying to cut down on the number of machines I’m running at home because of the UK’s energy crisis and ridiculous costs. So for now my monitoring solution is on a Hetzner VPS.

I wanted to make this as “production” ready as possible: not too noisy, but ensuring I get alerted when I actually need to action something. That turns out to be pretty difficult to do.

My current monitoring stack consists of the following containers:

  • Prometheus
  • Traefik
  • Alertmanager
  • Grafana
  • Blackbox-Exporter
  • node-exporter
  • watchtower
  • PushProx – though this is deployed off the monitoring stack, inside my home network.

Traefik

Let’s start with the entry point into our monitoring stack. This is one of the first times I’ve used Traefik in anger; I usually go for my tried and tested Nginx reverse proxy. But as I’m trying to push everything into docker-compose, I figured I should do it properly.

      traefik:
        image: traefik:v2.9
        container_name: traefik
        restart: unless-stopped
        networks:
          - monitoring
        ports:
          - "80:80"
          - "443:443"
          - "8080:8080"
        command:
          - "--api.insecure=true"
          - "--providers.docker"
          - "--metrics.prometheus=true"
          - "--entrypoints.web.address=:80"
          - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
          - "--entrypoints.web.http.redirections.entrypoint.scheme=https"
          - "--entrypoints.websecure.address=:443"
          - "--certificatesresolvers.route53.acme.dnschallenge=true"
          - "--certificatesresolvers.route53.acme.dnschallenge.provider=route53"
          #- "--certificatesresolvers.route53.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
          - "--certificatesresolvers.route53.acme.email=email_address"
          - "--certificatesresolvers.route53.acme.storage=/letsencrypt/acme.json"
          - "--accesslog=true"
          - "--accesslog.filepath=/var/log/traefik/access.log"
          - "--accesslog.format=json"
        volumes:
          - "./traefik:/var/log/traefik/"
          - "./letsencrypt:/letsencrypt"
          - "/var/run/docker.sock:/var/run/docker.sock"
        environment:
          - TZ=Europe/London
          - AWS_ACCESS_KEY_ID=${TRAEFIK_AWS_ACCESS_KEY_ID}
          - AWS_SECRET_ACCESS_KEY=${TRAEFIK_AWS_SECRET_ACCESS_KEY}
          - AWS_REGION=${TRAEFIK_AWS_REGION}
          - AWS_HOSTED_ZONE_ID=${TRAEFIK_AWS_HOSTED_ZONE_ID}
        labels:
          - "com.centurylinklabs.watchtower.enable=true"

A couple of things to go over in this:

  • I am using Route53 for DNS, so the ACME DNS challenge uses the route53 provider.
  • I have enabled the API in insecure mode, which serves the Traefik dashboard unauthenticated on port 8080. You don’t need to do this.
  • The ACME staging CA server is commented out. I’d suggest uncommenting it until you are happy things are working, so you don’t run into the production rate limits.
  • You will need to replace email_address with your own email address.
  • This is listening on both 80 and 443, but requests on 80 are redirected to 443 to ensure everything is served over HTTPS.
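
If you keep the API enabled but would rather not expose the dashboard publicly, one option is to leave port 8080 unpublished. This is a minimal sketch of the ports section under that assumption; it is a variation, not what the file above does:

```yaml
traefik:
  ports:
    - "80:80"    # HTTP, redirected to HTTPS by the entrypoint settings
    - "443:443"  # HTTPS
    # 8080 is deliberately not published here. The insecure API, the
    # dashboard and the Prometheus metrics endpoint stay reachable only
    # from containers on the "monitoring" network (e.g. traefik:8080).
```

Prometheus sits on the same Docker network, so it should still be able to scrape Traefik without the port being published on the host.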

Prometheus

The core of our monitoring stack:

      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        restart: unless-stopped
        volumes:
          - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
          - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
          - prometheus_data:/prometheus
        command:
          - "--config.file=/etc/prometheus/prometheus.yml"
          - "--storage.tsdb.path=/prometheus"
          - "--web.console.libraries=/etc/prometheus/console_libraries"
          - "--web.console.templates=/etc/prometheus/consoles"
          - "--web.enable-lifecycle"
        networks:
          - monitoring
        labels:
          - "traefik.http.routers.prometheus.rule=Host(`prom.domain.name`)"
          - "traefik.http.routers.prometheus.tls=true"
          - "traefik.http.routers.prometheus.tls.certresolver=route53"
          - "com.centurylinklabs.watchtower.enable=true"

This is fairly simple. The only really interesting section is the labels, as these are what link the container to Traefik and tell Traefik to get me a Let’s Encrypt cert.
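
The service mounts ./prometheus/prometheus.yml, which isn’t shown in this post. As a rough sketch only, assuming the container names used here and each container’s default port (9090, 9100, 8080 and 9093), it could look something like this. The intervals are illustrative assumptions:

```yaml
# prometheus/prometheus.yml (assumed content, not the real file)
global:
  scrape_interval: 15s      # assumed value
  evaluation_interval: 15s  # assumed value

# Rule file path matches the volume mount in the compose service.
rule_files:
  - /etc/prometheus/alert.rules.yml

# Send firing alerts to the Alertmanager container.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]

  # Traefik serves /metrics on its internal "traefik" entrypoint (8080)
  # when --metrics.prometheus=true is set.
  - job_name: traefik
    static_configs:
      - targets: ["traefik:8080"]
```

Because --web.enable-lifecycle is set, changes to this file can be picked up with an HTTP POST to /-/reload rather than restarting the container.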

Alertmanager

We need to actually be alerted when things fail, so that’s next.

      alertmanager:
        image: prom/alertmanager:latest
        container_name: alertmanager
        restart: unless-stopped
        volumes:
          - ./alertmanager:/config
          - alertmanager_data:/data
        command:
          - "--config.file=/config/alertmanager.yml"
          - "--web.external-url=https://alertmanager.domain.name"
        networks:
          - monitoring
        labels:
          - "traefik.http.routers.alertmanager.rule=Host(`alertmanager.domain.name`)"
          - "traefik.http.routers.alertmanager.tls=true"
          - "traefik.http.routers.alertmanager.tls.certresolver=route53"
          - "com.centurylinklabs.watchtower.enable=true"

Again, not really much to go through, though one of the more important settings is the --web.external-url flag. It makes alerts come through with a link to the proper Alertmanager hostname.
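
The /config/alertmanager.yml file itself isn’t shown here. A minimal sketch of what one could look like, assuming a single webhook receiver; the receiver name, the URL and all of the timings are placeholders rather than my real settings:

```yaml
# alertmanager/alertmanager.yml (assumed content)
route:
  receiver: default
  # Group related alerts into one notification to keep the noise down.
  group_by: ["alertname", "instance"]
  group_wait: 30s       # wait before sending the first notification
  group_interval: 5m    # wait before sending updates for a group
  repeat_interval: 4h   # re-notify if the alert is still firing

receivers:
  - name: default
    webhook_configs:
      # Hypothetical endpoint; swap in your own receiver.
      - url: "http://example.invalid/alert-hook"
        send_resolved: true
```

The grouping and repeat settings are the main knobs for the “not too noisy” goal: longer intervals mean fewer notifications, at the cost of slower updates.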

Grafana

The beautiful way to display our data!

      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        restart: unless-stopped
        volumes:
          - grafana_data:/var/lib/grafana
          - ./grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards
          - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources
        environment:
          - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER}
          - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
          - GF_USERS_ALLOW_SIGN_UP=false
          - GF_INSTALL_PLUGINS=grafana-piechart-panel
        networks:
          - monitoring
        labels:
          - "traefik.http.routers.grafana.rule=Host(`grafana.domain.name`)"
          - "traefik.http.routers.grafana.tls=true"
          - "traefik.http.routers.grafana.tls.certresolver=route53"
          - "com.centurylinklabs.watchtower.enable=true"

Not too much to go through here. It’s wise to get up to speed on how Grafana installs plugins when running in a Docker container: GF_INSTALL_PLUGINS takes a comma-separated list of plugin IDs, and the pie chart panel above is an example.
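
The provisioning/datasources directory mounted above is where Grafana looks for data source definitions on startup. A minimal sketch of a file in there, assuming Prometheus is reachable as prometheus:9090 on the monitoring network; the file name and data source name are my choice for illustration:

```yaml
# grafana/provisioning/datasources/prometheus.yml (hypothetical file name)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend makes the requests
    url: http://prometheus:9090   # container name on the Docker network
    isDefault: true
```

With this in place the data source should exist as soon as the container starts, so dashboards can be provisioned against it without any clicking around in the UI.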

Blackbox Exporter

I use this to monitor HTTP responses from my webservers along with checking how long their certs have left.

      blackbox-exporter:
        image: prom/blackbox-exporter:latest
        container_name: blackbox-exporter
        networks:
          - monitoring
        dns: 8.8.8.8
        restart: unless-stopped
        volumes:
          - ./blackbox-exporter:/config
        command: --config.file=/config/blackbox.yml
        labels:
          com.centurylinklabs.watchtower.enable: "true"

There’s nothing really special about this config, to be honest. Pretty boilerplate.
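
Neither the mounted blackbox.yml nor the matching Prometheus job is shown in this post. As a sketch under assumptions (a single http_2xx module, the exporter’s default port 9115, and a placeholder target URL), the two pieces could look like this:

```yaml
# blackbox-exporter/blackbox.yml (assumed content)
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4
---
# Fragment for the scrape_configs list in prometheus/prometheus.yml
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://example.com   # hypothetical site to probe
  relabel_configs:
    # Pass the listed target to the exporter as ?target=...
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the probed URL as the instance label.
    - source_labels: [__param_target]
      target_label: instance
    # Actually scrape the exporter, not the target itself.
    - target_label: __address__
      replacement: blackbox-exporter:9115
```

The relabelling is the part that usually trips people up: Prometheus talks to the exporter, and the exporter talks to the site. For HTTPS targets the probe also reports probe_ssl_earliest_cert_expiry, which is what makes the certificate check possible.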

Node Exporter

This is what provides your OS-level metrics. It should probably be the first exporter you run on your machines.

      node-exporter:
        image: prom/node-exporter:latest
        container_name: node-exporter
        restart: unless-stopped
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
        command:
          - "--path.procfs=/host/proc"
          - "--path.rootfs=/rootfs"
          - "--path.sysfs=/host/sys"
          - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
        networks:
          - monitoring
        labels:
          com.centurylinklabs.watchtower.enable: "true"

Just like blackbox exporter, this is a pretty self-explanatory exporter.
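
To give an idea of how these metrics turn into alerts, here is a sketch of what the alert.rules.yml mounted into Prometheus could contain. The rule names, thresholds and durations are assumptions for illustration, not my actual rules:

```yaml
# prometheus/alert.rules.yml (assumed content)
groups:
  - name: baseline
    rules:
      # Any scrape target that has been unreachable for 5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} ({{ $labels.job }}) is down"

      # Less than 10% free space on a real filesystem (node-exporter).
      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }} at {{ $labels.mountpoint }}"

      # Certificate expires within 14 days (blackbox-exporter).
      - alert: CertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires in under 14 days"
```

The "for" durations are what keep this from being noisy: a condition has to hold for that long before the alert fires, so short blips don’t page anyone.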

Watchtower

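The final container is just Watchtower, which will keep your other containers up to date! With WATCHTOWER_LABEL_ENABLE=true it only touches containers carrying the com.centurylinklabs.watchtower.enable=true label, which is why that label appears on every service above.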

    services:
      watchtower:
        image: containrrr/watchtower:latest
        environment:
          - WATCHTOWER_LABEL_ENABLE=true
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        labels:
          com.centurylinklabs.watchtower.enable: "true"
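
The service snippets in this post reference a monitoring network and three named volumes, which need top-level definitions in the same compose file. Those definitions aren’t shown above, so this is an assumed minimal version:

```yaml
# Top-level keys, alongside "services:" in docker-compose.yml
networks:
  monitoring:
    driver: bridge   # assumed; the default driver for a single host

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:
```

Without these, docker-compose will refuse to start the services that refer to them.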

Summary

This about sums up my monitoring stack. I find it is a great baseline, and it brings in a lot of important metrics which are actually useful. In the next post I’ll go over some of the dashboards used to bring this data to life!