Stop Over-Engineering: A 100-line bash script that saved my servers



This content originally appeared on DEV Community and was authored by Sandro 🦖☄

We’ve all been there. Your website goes down at 3 AM. MySQL crashed. NGINX stopped responding. And you’re scrambling to SSH into the server while your phone buzzes with angry customer emails.

Then someone suggests: “You should use Prometheus + Grafana + Alertmanager + PagerDuty!”

Sure. Or… hear me out… you could just use a 100-line bash script that checks your sites every minute and restarts services automatically when they fail.

The Problem with Enterprise Monitoring

Don’t get me wrong – tools like Datadog, New Relic, and Prometheus are amazing. But they’re also:

  • 🎯 Overkill for small projects
  • 💰 Expensive for startups
  • 🧩 Complex to set up and maintain
  • 🐌 Slow to deploy (days/weeks of configuration)
  • 📚 Steep to learn (new query languages, new dashboards)

Meanwhile, your website is still down.

Enter: The 100-Line Solution

What if monitoring could be this simple?

# 1. Add your websites
echo "https://example.com" >> sites.txt

# 2. Install
sudo ./install.sh

# 3. Done. Seriously.

That’s it. Every minute, your server now:

  1. ✅ Checks if your websites respond
  2. 🔍 Detects if services are overwhelmed (not just down!)
  3. 🔧 Automatically restarts MySQL, NGINX, or Apache
  4. 📝 Logs only failures (no disk space waste)
  5. 🔄 Tracks failure counts to avoid false positives
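
As a sketch, that per-minute pass over sites.txt could look like this (the function names and exact curl flags here are my assumptions, not necessarily the script's actual internals):

```shell
# Minimal sketch of the per-minute check pass (names are illustrative).
SITES_FILE="${SITES_FILE:-sites.txt}"
TIMEOUT="${TIMEOUT:-10}"

# Print the HTTP status code for a URL; curl prints "000" on connect failure.
check_site() {
    curl -s -o /dev/null --max-time "$TIMEOUT" -w '%{http_code}' "$1"
}

check_all_sites() {
    local url code
    while read -r url; do
        [[ -z "$url" || "$url" == \#* ]] && continue  # skip blanks/comments
        code=$(check_site "$url")
        if [[ "$code" != "200" ]]; then
            echo "FAILURE: $url - HTTP $code"
        fi
    done < "$SITES_FILE"
}

# cron runs this once a minute; one pass checks every site
if [[ -f "$SITES_FILE" ]]; then check_all_sites; fi
```

A real version would likely accept any 2xx/3xx status rather than only 200; this sketch keeps the comparison simple.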

How It Works (The Smart Part)

Most monitoring tools just check if a service is “running.” That’s not enough.

Here’s what makes this script intelligent:

1. Load-Based Detection

# Don't just check if MySQL is running...
# Check if it's actually RESPONSIVE
check_mysql_health() {
    # Try to ping MySQL (quietly, with a hard timeout)
    if timeout 3 mysqladmin ping >/dev/null 2>&1; then
        # It's alive! But is it overwhelmed?
        current_connections=$(mysqladmin status | grep -oP 'Threads: \K\d+')

        if [[ "$current_connections" -gt 150 ]]; then
            # Too many connections - restart before it crashes
            return 1
        fi
        return 0  # Alive and responsive
    fi
    return 1  # Ping failed - MySQL is down or hung
}

Your site can be down even when services show as “running” – when they’re overloaded with traffic or locked up processing queries.

2. Advanced Health Checks

# NGINX example: Test config + connectivity + load
check_nginx_health() {
    # 1. Validate config before trying to use it
    nginx -t 2>/dev/null || return 1

    # 2. Can it accept connections?
    timeout 2 bash -c "echo > /dev/tcp/localhost/80" || return 1

    # 3. Is it drowning in connections? (needs NGINX's stub_status page enabled)
    active_conn=$(curl -s http://localhost/nginx_status | grep -oP 'Active connections: \K\d+')
    [[ -n "$active_conn" && "$active_conn" -gt 1000 ]] && return 1

    return 0  # All good!
}

3. Smart Recovery Logic

# Only restart after 3 consecutive failures (avoid false positives)
if [[ "$current_failures" -ge 3 ]]; then
    # Restart services in order: Database first, then web server
    for service in "${SERVICES[@]}"; do
        systemctl restart "$service"
    done
fi
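
Since cron starts a fresh process every minute, the failure counter has to survive between runs. One way to do that is a small state file per site (the state directory and file naming here are my assumptions, not the script's actual layout):

```shell
# Sketch: persist per-site failure counts across cron runs.
# STATE_DIR and the flat-file format are illustrative assumptions.
STATE_DIR="${STATE_DIR:-/tmp/site-monitor-state}"
mkdir -p "$STATE_DIR"

# Map a URL to a filesystem-safe state file name.
state_file() {
    echo "$STATE_DIR/$(echo "$1" | tr -c 'a-zA-Z0-9' '_').count"
}

get_failures() {
    local f; f=$(state_file "$1")
    [[ -f "$f" ]] && cat "$f" || echo 0
}

record_failure() {
    local f; f=$(state_file "$1")
    echo $(( $(get_failures "$1") + 1 )) > "$f"
}

reset_failures() {
    rm -f "$(state_file "$1")"
}
```

The threshold check then reads `if (( $(get_failures "$url") >= 3 ))` before triggering recovery, and a successful check calls `reset_failures` so the count only grows on consecutive failures.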

Real-World Example

Let’s say your e-commerce site suddenly gets featured on Reddit (congrats! 🎉). Traffic spikes 10x:

Traditional Monitoring:

  • 📊 Dashboards show high CPU/memory
  • 🚨 Alerts fire
  • 👨‍💻 You get paged
  • ⏰ You wake up, investigate, manually restart services
  • 💸 Lost sales during downtime

This Script:

  • 🔍 Detects MySQL has 200 active connections (threshold: 150)
  • 🤖 Automatically restarts MySQL in 3 seconds
  • 📝 Logs: "MySQL OVERLOADED (200 connections) - restarted"
  • 😴 You stay asleep
  • 💰 Sales continue

Installation (Seriously, It’s This Easy)

# 1. Clone the repo
git clone https://github.com/sgumz/site-monitor.git
cd site-monitor

# 2. Add your websites
cat > sites.txt << EOF
https://example.com
https://api.example.com
https://www.example.com
EOF

# 3. Optional: Customize thresholds
vim config.conf  # Adjust MySQL/NGINX/Apache thresholds

# 4. Install (creates cron job, sets up logging)
sudo ./install.sh

# 5. Watch it work
sudo tail -f /var/log/site-monitor/monitor.log

Output:

[2025-10-20 14:23:45] FAILURE: https://example.com - HTTP 000 (1/3 failures)
[2025-10-20 14:24:45] FAILURE: https://example.com - HTTP 000 (2/3 failures)
[2025-10-20 14:25:45] FAILURE: https://example.com - HTTP 000 (3/3 failures)
[2025-10-20 14:25:46] RECOVERY: Starting recovery for https://example.com
[2025-10-20 14:25:47] RECOVERY: MySQL OVERLOADED (187 connections) - restarted
[2025-10-20 14:25:49] RECOVERY: NGINX responsive - no action needed
[2025-10-20 14:25:50] RECOVERY: Recovery completed
[2025-10-20 14:26:45] SUCCESS: https://example.com back online (HTTP 200)
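
A log helper along these lines would produce that format (the helper name and log path are assumptions on my part, matching the output shown above):

```shell
# Sketch: timestamped logger matching the output format above.
LOG_FILE="${LOG_FILE:-/var/log/site-monitor/monitor.log}"
LOG_SUCCESS="${LOG_SUCCESS:-false}"

log() {
    local level="$1"; shift
    # Skip SUCCESS entries unless explicitly enabled (saves disk space).
    [[ "$level" == "SUCCESS" && "$LOG_SUCCESS" != "true" ]] && return 0
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $level: $*" >> "$LOG_FILE"
}
```

Called as `log FAILURE "$url - HTTP $code ($fails/3 failures)"`, it appends one line per event and stays silent on success by default.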

Configuration Options

Everything is configurable in config.conf:

# HTTP Settings
TIMEOUT=10                    # Request timeout
FAILURE_THRESHOLD=3           # Failures before recovery

# Services to manage (in order)
SERVICES=("mysql" "nginx")    # Or: ("mysql" "apache2")

# Load Thresholds
MYSQL_MAX_CONNECTIONS=150     # Restart if connections exceed this
NGINX_MAX_CONNECTIONS=1000    # Restart if connections exceed this
APACHE_MAX_WORKERS=150        # Restart if busy workers exceed this

# Logging
LOG_SUCCESS=false             # Only log failures (save disk space)
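
One simple way to combine built-in defaults with config.conf overrides is to set the defaults first, then source the file on top of them (a sketch; the loader name and CONFIG_FILE variable are my assumptions):

```shell
# Sketch: defaults first, then let config.conf override them.
load_config() {
    TIMEOUT=10
    FAILURE_THRESHOLD=3
    SERVICES=("mysql" "nginx")
    MYSQL_MAX_CONNECTIONS=150
    NGINX_MAX_CONNECTIONS=1000
    LOG_SUCCESS=false

    local conf="${CONFIG_FILE:-config.conf}"
    # Anything assigned in the file simply replaces the default above.
    if [[ -f "$conf" ]]; then
        source "$conf"
    fi
    return 0
}
load_config
```

Because config.conf is plain bash assignments, sourcing it needs no parser at all; a missing file just means "run with defaults".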

When to Use This vs. Enterprise Tools

Use This Simple Script When:

  • 🎯 You have < 50 websites to monitor
  • 💰 You’re on a budget (it’s free!)
  • ⚡ You need it deployed TODAY
  • 🔧 You manage your own Ubuntu servers
  • 🎓 You want to understand what’s happening (no black box)

Use Enterprise Tools When:

  • 📊 You need fancy dashboards and metrics
  • 🌍 You have distributed microservices
  • 👥 You have a dedicated DevOps team
  • 💼 You need compliance/audit trails
  • 🔗 You need integration with 50+ other tools

Performance & Resource Usage

This script is incredibly lightweight:

  • CPU: Near zero (runs for ~1 second per minute)
  • Memory: ~5MB
  • Disk: <1MB logs per month (with default settings)
  • Network: One HTTP GET per site per minute

Compare that to running Prometheus + Grafana (hundreds of MB of RAM).

Production-Ready Features

Don’t let the simplicity fool you – this runs in production:

✅ State Tracking: Counts consecutive failures per site
✅ Log Rotation: Yearly rotation via logrotate
✅ Error Handling: Graceful failures, timeout protection
✅ No Dependencies: Just bash + curl + systemctl (already on Ubuntu)
✅ Tested: Works on Ubuntu 22.04 LTS
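
The yearly rotation could be a standard logrotate rule such as this one (the exact directives install.sh writes are an assumption; these are stock logrotate options):

```
/var/log/site-monitor/monitor.log {
    yearly
    rotate 3
    compress
    missingok
    notifempty
}
```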

Advanced Use Cases

Multi-Server Deployment

Deploy to multiple servers with different site lists:

# Server 1: Monitor frontend sites
echo "https://app.example.com" > sites.txt

# Server 2: Monitor API endpoints
echo "https://api.example.com" > sites.txt

# Server 3: Monitor admin tools
echo "https://admin.example.com" > sites.txt

Custom Services

Not just MySQL/NGINX! Add any systemd service:

# Add Redis, PHP-FPM, whatever you need
SERVICES=("mysql" "nginx" "redis-server" "php8.1-fpm")

Integration with Existing Tools

Still want Slack notifications? Just add a webhook:

# In monitor.sh, add inside the recovery logic:
curl -X POST "YOUR_SLACK_WEBHOOK" \
  -d "{\"text\":\"🚨 $url is down! Auto-recovering...\"}"
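
To keep that optional, it could be wrapped in a small helper that only fires when a webhook is configured (the `SLACK_WEBHOOK` variable and `notify` name are illustrative, not part of the script):

```shell
# Sketch: optional Slack notification -- a no-op unless SLACK_WEBHOOK is set.
notify() {
    [[ -z "${SLACK_WEBHOOK:-}" ]] && return 0
    curl -s -X POST "$SLACK_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d "{\"text\":\"$1\"}" > /dev/null
}
```

Then recovery code can call `notify "🚨 $url is down! Auto-recovering..."` unconditionally, and installs without a webhook lose nothing.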

The Philosophy: Simple > Complex

This project follows the Unix philosophy:

  • Do one thing well
  • Use plain text for data
  • Build small, composable tools

Your monitoring doesn’t need to be fancy. It needs to:

  1. Detect failures ✅
  2. Fix them automatically ✅
  3. Tell you what happened ✅

Mission accomplished in 100 lines of bash.

Try It Yourself

The code is open source (MIT License):

🔗 GitHub: https://github.com/sgumz/site-monitor

Installation takes 2 minutes. Give it a try!

Closing Thoughts

Sometimes the best solution isn’t the one with the most features – it’s the one that solves your problem today without creating new ones.

Could this bash script replace Datadog for a Fortune 500 company? No.

Could it save your small SaaS business from 3 AM wake-up calls? Absolutely.

What’s your take? Do you prefer simple scripts or enterprise monitoring? Any horror stories about over-engineered solutions? Drop a comment below! 👇

