This entry is really here to remind me what went wrong with Munin, so that the next time it happens I am likely to remember. I use Munin to monitor my web servers. I run Munin nodes on the web servers, and the Munin master is another computer that just serves up the Munin web pages and runs some rsync commands via cron jobs. This system had been working faultlessly for two years, and then on September 26, 2012 it stopped updating the graphs. Yesterday I finally got around to figuring out why it didn’t work any longer.
First thing I did was take a look at the Munin logs on the clients, which are in the /var/log/munin directory. These logs were showing a last modified date of September 26, the same date the Munin graphs stopped updating. I opened up the last log in nano and there was exactly nothing of help in there. So I tried restarting the munin-node process with:
I sat there watching the log files for 5 minutes hoping they’d update. Sadly they didn’t. So I started up a new PuTTY session with the machine running the Munin master and took a look at the Munin logs in /var/log/munin/. They all had the September 26 modified date too. I had a quick look through the logs and couldn’t see anything in there either. Next step was to force a manual update of the Munin master and see what happened there. I did this by changing to the munin user with:
su - munin --shell=/bin/bash
And then running munin-update, which collects all the data from the Munin clients.
When I ran this command, the update, which should be hundreds of steps, was just four steps, with the last one complaining about there being no free disk space! Eureka!
Next step was to check the disk usage on the server. I listed the usage of each directory in human-readable format, sorted by size, with:
du -sh * | sort -h
This indicated that a directory that contained database backups was using up almost all the disk space on the server. A quick removal of older files from this location freed up a lot of space. I forced a munin update again and hey presto everything started working.
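A minimal sketch of that kind of clean-up, for next time. The function name list_old_files, the path /path/to/backups and the 30-day cut-off are all assumptions for illustration, not what was actually run on the server. It only lists candidates, so the output can be checked before anything is deleted.

```shell
#!/bin/sh
# list_old_files DIR DAYS (hypothetical helper)
# Prints regular files under DIR that were last modified more than DAYS days ago.
list_old_files() {
    find "$1" -type f -mtime +"$2" -print
}

# Example (placeholder path and age, adjust before use):
#   list_old_files /path/to/backups 30
```

Once the list looks right, the same find expression with -exec rm {} + in place of -print would remove the files.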
All of this would have been easily solved if I actually had Munin monitoring the server that acts as my Munin master, but of course I haven’t done that. Stupid me. I’ll put that on the to-do list, but for now at least the problem is solved and should not recur for several months.
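Until that to-do item gets done, a small disk-space check run from cron would be a stopgap. This is only a sketch: the function names disk_usage_percent and check_disk and the 90% limit are made up for illustration, and it assumes the POSIX output format of df -P, where the fifth column of the second line is the capacity used.

```shell
#!/bin/sh
# disk_usage_percent PATH (hypothetical helper)
# Prints the percentage used on the filesystem holding PATH, without the % sign.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH LIMIT (hypothetical helper)
# Returns 0 if usage is below LIMIT percent, 1 with a warning if it is at or
# above LIMIT, and 2 if the usage could not be read.
check_disk() {
    path=$1
    limit=$2
    used=$(disk_usage_percent "$path")
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $path is ${used}% full (limit ${limit}%)"
        return 1
    fi
    return 0
}

# Example (assumed threshold):
#   check_disk / 90
```

Cron mails any output to the job's owner by default, so the warning line is enough to raise the alarm when the script is run from a crontab entry.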