Skip to main content

network health monitoring

Network Health Monitoring & Automation

Date: 2026-01-07
Purpose: Automated network monitoring, health checks, and alerting


GitLab CI Health Check Pipeline

Location: .gitlab/ci/network-health.yml

network-health-check: stage: health tags: - local script: # Critical checks - | echo "=== Network Health Check ===" echo "Date: $(date)" # Cloudflare Tunnel echo "Checking Cloudflare Tunnel..." if curl -fsS https://mesh.bluefly.internal/health >/dev/null 2>&1; then echo " Cloudflare Tunnel: OK" else echo " Cloudflare Tunnel: FAILED" exit 1 fi # Tailscale Devices echo "Checking Tailscale devices..." tailscale ping -c 2 bluefly-m4.tailcf98b3.ts.net || exit 1 echo " BlueFly Mac: Online" # Service Health echo "Checking agent-mesh service..." curl -fsS http://bluefly-m4.tailcf98b3.ts.net:3005/health || exit 1 echo " Agent-Mesh Service: OK" # DNS Resolution echo "Checking DNS resolution..." dig +short mesh.bluefly.internal @1.1.1.1 || exit 1 echo " DNS Resolution: OK" echo "All health checks passed" only: - schedules schedule: "*/15 * * * *" # Every 15 minutes when: on_success

Local Health Check Script

Location: scripts/network-health-check.sh

#!/bin/bash # Network Health Check Script FAILURES=0 check_service() { local name=$1 local url=$2 if curl -fsS "$url" >/dev/null 2>&1; then echo " $name: OK" return 0 else echo " $name: FAILED" FAILURES=$((FAILURES + 1)) return 1 fi } check_tailscale() { local hostname=$1 if tailscale ping -c 2 "$hostname" >/dev/null 2>&1; then echo " Tailscale $hostname: Online" return 0 else echo " Tailscale $hostname: Offline" FAILURES=$((FAILURES + 1)) return 1 fi } echo "=== Network Health Check ===" echo "Date: $(date)" echo "" check_service "Cloudflare Tunnel" "https://mesh.bluefly.internal/health" check_service "Local Service" "http://localhost:3005/health" check_tailscale "bluefly-m4.tailcf98b3.ts.net" check_tailscale "glinet-router.tailcf98b3.ts.net" echo "" if [ $FAILURES -eq 0 ]; then echo " All checks passed" exit 0 else echo " $FAILURES check(s) failed" exit 1 fi

Monitoring Metrics

Key Metrics to Track:

  • Cloudflare tunnel uptime
  • Tailscale device online status
  • Service response times
  • DNS resolution success rate
  • Network latency

Alert Thresholds:

  • Service downtime > 1 minute
  • Device offline > 15 minutes
  • Response time > 1 second
  • DNS resolution failure