Exercise for the Reader

June 20, 2009

Server Monitoring with Pretty Pictures

Filed under: Uncategorized — Seth Porter @ 1:40 pm

(Here are the pretty pictures. The actual discussion is after the break.) (And okay, maybe they aren’t that pretty, but I’d say they’re more aesthetic than you could reasonably expect for monitoring health stats on a Linux server. Further defensive parentheticals will be reserved for later in the post.)



For a while now I’ve been meaning to post something about Calliope, my (Debian) Linux server at home. I’ve got a nice draft somewhere about all the useful services I’ve got running, why I chose the packages I did, and so forth. At the end of the day, though, the basic take-away is “pick some services you want to run, install them, and Google till you get them to work”. (I’ll probably post it someday, though perhaps not after giving it such a heart-warming buildup.)

Anyway, the more I think about it, the more I realize that the most distinctive part of Calliope isn’t the actual services she provides, but rather the tools I’ve assembled for keeping an eye on her. This isn’t like a desktop box, where I’m sitting at the console or the machine is turned off. Instead, most of my interaction is indirect — listening to music on the PS3, or looking at files shared through Windows networking or Apache, or even just acquiring a network address through DHCP. The common thread in all these cases is that I’ll notice abrupt failure, but I’m not directly logged in to see notifications or status messages.

One traditional solution has always been e-mail. I send myself nightly notifications of backup success, mostly because it gives me a warm fuzzy feeling to know that my Subversion repository is safely in at least two places (and on a recent trip, I found it a surprisingly reassuring touch of home), but fundamentally I’m not prepared to page through long logs or status reports on my phone (where I read most of my e-mail). In fact, I’m not prepared to do that even on my desktop or netbook, unless I already know there’s a problem (and remember, the only readily visible symptoms of a problem are “slow” or “failed to connect”).

So to summarize, I’ve got an uninvolved admin staff (myself), who wants things to Just Work, and doesn’t want to have to explain to his wife that the internet is down right now because he was in the middle of a project when he got bored with it. Fortunately, I’ve got Debian-stable as a pretty damn rock solid baseline to build from, so most problems will be the result of misconfiguration, user error, or mechanical failure. (The last is also a challenge, since the box is out in the hallway with the cats’ litter box — not a lot of foot traffic to notice things like fan failure or hard drive Squeaks of Doom.) Oh, and because I don’t do Linux server admin for a day job, I’m not necessarily going to be able to distinguish between dire-sounding-but-routine conditions and actual symptoms, since I lack a good intuitive baseline.

Anyway, enough exploration of the problem space. (Well, never enough, but at some point requirements analysis turns into plain old griping.) I actually have a solution that’s working pretty well for me so far, and is probably the biggest difference between this box and previous Windows servers I’ve set up.

Dashboard Health Charts

At the beginning of the post, I showed the charts I see when I turn on my computer in the morning to check weather and news; I’ll repeat them here. These are abbreviated versions of the full dataset, just enough to give me a quick picture of anything that’s going drastically wrong.



(In real usage, these are a 2×2 grid, but that breaks the margins here.) From the top, we see percentage of free space on various hard drive volumes, CPU utilization, memory usage, and system temperatures, all over the last 28 hours.

These particular charts show a pretty boring day. You can see the CPU spike at midnight from cron jobs, and another on Friday evening when I was actually using the machine. I can’t remember off-hand the reason for the circa 6 am spike, but it’s not unexpected. Temperatures are drifting a little higher than usual, but that’s echoing the outdoor temps (the lower grey line on that chart).

Some technically inclined readers may be thinking that 10% utilization is awfully high for an idle box; what’s not shown on this chart is that the CPU freq is governing itself down to 300 MHz when it’s idle. Sometime I need to get some empirical evidence on whether this is a better trade (for power saving) than keeping the freq up and having the box strictly idle more often. A better chart would probably multiply the utilization percentage by the CPU frequency scaling; maybe I’ll try that one of these days, but I suspect that the current setup has much the same effect as a logarithmic chart — pulling the “10% at 300 MHz” and “100% at 2.4 GHz” samples closer to the middle, which makes for finer relative detail, even as it destroys absolute comparability.
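
In case that’s too hand-wavy, here’s roughly what I mean as a quick Python sketch (written for this post, not lifted from the actual collection scripts). The /proc/stat and cpufreq paths are the standard Linux ones, but the 2.4 GHz ceiling is just this particular CPU’s number.

```python
#!/usr/bin/env python3
"""Sketch only: frequency-weighted CPU utilization, so that "10% busy at
300 MHz" and "100% busy at 2.4 GHz" become comparable amounts of work."""

import time

def cpu_busy_fraction(interval=1.0):
    """Fraction of non-idle time over `interval` seconds, from two snapshots
    of the aggregate 'cpu' line in /proc/stat."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return fields[3] + fields[4], sum(fields)   # idle + iowait, total
    idle1, total1 = snapshot()
    time.sleep(interval)
    idle2, total2 = snapshot()
    total = total2 - total1
    return 1.0 - (idle2 - idle1) / total if total else 0.0

def current_freq_khz(cpu=0):
    """Current governor-selected clock for one core, in kHz."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq"
    with open(path) as f:
        return int(f.read())

if __name__ == "__main__":
    busy = cpu_busy_fraction()
    freq = current_freq_khz()
    max_freq = 2_400_000  # kHz; this box's nominal top speed (an assumption)
    # "Effective" utilization: busy fraction scaled by how fast the clock ran.
    print(f"busy {busy:.1%} at {freq / 1000:.0f} MHz "
          f"=> ~{busy * freq / max_freq:.1%} of peak capacity")
```

If the governor is doing its job, that last number should stay roughly flat whether the box is loafing at 300 MHz or ramped up for a cron job.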

Anomalies

In summary, in the time it takes me to decide if I’m checking weather or CNN, I’ve already got a quick gut sense that Calliope is basically healthy and happy. For comparison, here’s a (somewhat simulated) chart of what I saw on 6/8, or rather what I would have seen if I hadn’t been just back from a trip and catching up on my jet lag. Imagine that the bars have the same 5-minute-ish resolution seen above; I’ll discuss the progressive degradation later. (Because I’m working quickly, this is also a full chart with legend, rather than the abbreviated version I see on the daily dashboard.)

CPU Utilization - Prolonged Spike

Anyway, here we see an approximately four-hour spike in CPU utilization, at a time when I certainly wasn’t doing anything interactively. For a cross check, here’s the temperature chart for the same period:

Temperature Spike

(I’m somewhat tipping my hand about being able to drill into this information, what with the legends and all, but pretend you’re just seeing the dashboard version.) We see a significant spike in CPU temperature over the same period, as well as a small rise in overall system temperatures. I’ll refrain from posting the drive freespace and memory charts, since they didn’t really show any correlated activity. (Which in itself says something: this probably wasn’t a massive spam run or anything, since I would expect to see a lot of memory usage if a user process was taking up that kind of CPU time for four hours.)

The first time I saw this pattern, I was a little panicked. There was some good news, though:

  • I knew this wasn’t routine, since I’d been seeing these charts for weeks and knew what to expect
  • I had good data on the start and end of the spike, letting me go straight to the right part of the log files
  • I had some idea of what hadn’t happened, based on the lack of signal in the disk and memory charts

To cut to the chase, it turns out that as configured on Debian, mdraid (the Linux RAID subsystem) does a full resync of all RAID-ed drives on the first Sunday of every month. (Basically, this is reading the data and parity from the three drives involved and making sure that they actually agree; sort of a sanity check that what’s ending up on disk is the same data that we’re sending.) Once I figured that out, I decided that I approved, and now I expect that spike every month. (In fact, I’d be concerned if it weren’t there.) Which is exactly the point — having an easily interpreted set of charts in my face every day, along with longitudinal data to provide a baseline, lets me get on with my life with some expectation that the server is doing the same.
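
(If you’d rather confirm this sort of thing directly than infer it from a CPU chart, /proc/mdstat reports an in-progress check or resync while it’s running. Here’s a little sketch of pulling out the progress lines; again, this is something I wrote for illustration, not the tooling that actually runs on Calliope.)

```python
#!/usr/bin/env python3
"""Sketch only: report any md arrays currently running a check/resync/recovery,
based on the progress lines the kernel prints in /proc/mdstat."""

import re

def mdstat_activity(path="/proc/mdstat"):
    """Return (array, action, percent) tuples for arrays with a check,
    resync, or recovery in progress."""
    active = []
    current = None
    with open(path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)    # remember which array this stanza is
                continue
            m = re.search(r"\[=*>?\.*\]\s+(check|resync|recovery)\s*=\s*([\d.]+)%", line)
            if m and current:
                active.append((current, m.group(1), float(m.group(2))))
    return active

if __name__ == "__main__":
    for array, action, pct in mdstat_activity():
        print(f"{array}: {action} in progress, {pct:.1f}% done")
    # Seeing "check" here on the first Sunday of the month is expected on a
    # stock Debian install, not alarming.
```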

Historical Baseline

As I say, I see these charts most days, which means I’m accumulating a baseline behind my eyes. However, Calliope also records longitudinal data for posterity. (As you saw above in the “RAID spike” charts, the temporal resolution progressively degrades; as promised, I’ll eventually talk about this.) The disk space charts are pretty boring over time; they mostly serve as a record of kernel upgrades and significant data transfers (like getting our wedding videos onto the RAID drive). CPU utilization is a little more interesting:

CPU Utilization - Past Month

The periodic nature of background nightly and weekly tasks is clearly visible, as is the periodicity of my own usage. You’ll note that peak usage is over the theoretical 200% cap (100% per CPU). This is because the chart takes the max value of each utilization type individually over the aggregation time span, then stacks those maxima. It’s not technically correct, but it serves its purpose. I could plot averages or minima, either overlaid on the same chart or as separate charts, but then I’d probably have to use lines instead of the pretty bars (which I think help for the quick visual read).
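
If the over-the-cap thing sounds like hand-waving, here’s a toy illustration with made-up numbers: user and system time rarely peak in the same sample, but each series’ own peak survives the aggregation, so the stacked total can exceed what the hardware could ever do at one instant.

```python
#!/usr/bin/env python3
"""Toy illustration (invented numbers, not the charting tool itself) of why
stacking per-series maxima can exceed the physical 200% cap."""

# Five-minute samples over one aggregation window, as percentages of 2 CPUs.
user   = [ 20, 180,  30,  15,  10]
system = [150,  10,  20,  25,  30]

stacked_max = max(user) + max(system)                    # what the chart plots
max_stacked = max(u + s for u, s in zip(user, system))   # what really happened

print(f"sum of per-series maxima : {stacked_max}%")   # 330% -- over the cap
print(f"maximum of stacked totals: {max_stacked}%")   # 190% -- physically sane
```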

For prettiness (at least if you’re into that sort of thing), it’s hard to beat the memory chart. There’s not much drama here, but that’s the good news: a long term growth in the red region would indicate a persistent memory leak, and I’d have to build some system to kill and restart the offending processes (or just reboot periodically).

Memory Usage - Past 4 Months

Basically what you see in the memory chart is the reboot frequency, when everything resets to zero. Beyond that, the dark green is free-free, while the bright green is disk cache. So mostly you can see how quickly the system ends up reading four gig from disk that the kernel decides is worth caching. That white line down the middle is the result of missing samples; as I recall that’s when I shut the machine down to move it into the hallway, and incidentally hook it up to an uninterruptible power supply.
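
For reference, the numbers behind that split come straight out of /proc/meminfo; a sketch like this (not the actual collector, and the “used” arithmetic is the simple-minded version) gives the same free / cache / everything-else breakdown.

```python
#!/usr/bin/env python3
"""Sketch only: split memory into truly-free, page cache, and the rest,
read from /proc/meminfo (values are reported in kB)."""

def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])
    return info

if __name__ == "__main__":
    m = meminfo()
    total  = m["MemTotal"]
    free   = m["MemFree"]                       # the "free-free" dark green
    cached = m["Cached"] + m.get("Buffers", 0)  # the bright green disk cache
    used   = total - free - cached              # everything actually claimed
    for name, kb in [("used", used), ("cache", cached), ("free", free)]:
        print(f"{name:>5}: {kb / 1024:8.0f} MB ({kb / total:5.1%})")
```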

For the last of the “core” charts, here are two longer-term temperature graphs (past week and past month):

Temperatures - 1 week

Temperatures - 1 month

There’s a nice periodicity here. The grey ambient line is really outside temperature (I don’t remember if it’s at PIT or Allegheny County Airport), so it’s not local temp but it’s a decent proxy for general climate trends. Someday I’ll get an in-house temperature sensor (probably 1-wire, that stuff sounds pretty neat — I didn’t even know I had gardening automation needs!), but for the moment this is good enough. I think it would be possible to calculate the thermal “half-life” (I don’t know the formal term; I mean the notionally constant time for the difference between interior and exterior temperatures to be halved) of our house, or at least that hallway, based on the lateral displacement between exterior temps and their reflection in the case-internal temperatures. Maybe someday I’ll get around to it, and maybe offset the two data streams to align them. (Or get really fancy and strip out the correlation term, leaving normalized case internal temps… but that’s probably a bad idea because the magic smoke doesn’t care about whether it’s weather-caused or activity-caused.)
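
For what it’s worth, the lag estimate isn’t much code. Here’s a back-of-the-envelope sketch with entirely fabricated temperature data: normalize both series, slide one against the other, and keep the offset with the best correlation.

```python
#!/usr/bin/env python3
"""Sketch only: estimate how far the case-internal temperature lags the
outdoor temperature, by sliding one series against the other."""

import numpy as np

def estimate_lag(outdoor, indoor, max_lag):
    """Return the lag (in samples) at which `indoor` best correlates with
    `outdoor`; positive means indoor trails outdoor by that many samples."""
    outdoor = (outdoor - outdoor.mean()) / outdoor.std()
    indoor = (indoor - indoor.mean()) / indoor.std()
    best_lag, best_corr = 0, -np.inf
    for lag in range(0, max_lag + 1):
        if lag == 0:
            corr = np.mean(outdoor * indoor)
        else:
            corr = np.mean(outdoor[:-lag] * indoor[lag:])
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr

if __name__ == "__main__":
    # Fake data: hourly samples, indoor temperature echoing outdoor ~3h later.
    hours = np.arange(24 * 7)
    outdoor = 20 + 8 * np.sin(2 * np.pi * hours / 24)
    indoor = 26 + 2 * np.sin(2 * np.pi * (hours - 3) / 24)
    lag, corr = estimate_lag(outdoor, indoor, max_lag=12)
    print(f"indoor trails outdoor by ~{lag} hours (corr {corr:.2f})")
```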

Other Datasets

So far I’ve talked a lot about those four charts. There’s a good reason; I chose them because they should reveal several classes of resource exhaustion problems, and serve as pretty good proxies for overall system activity. In fact, as a friend at work pointed out, the temperature data is probably a pretty good stealth- / spy-ware detection scheme: lots of software will try to hide from system logs, but at some level you can’t do anything with a compromised box without generating some heat (and it would be a peculiarly paranoid piece of software that spent a lot of effort on spoofing the CPU thermo-sensors).

In any event, those are the charts I look at every day, and they tell me a lot. However, they are not the only datasets I’m gathering. A few examples:

DNS Queries and Results - 1 Week

This chart shows DNS (Domain Name System — mapping names like “www.cnn.com” to a numeric address) activity and success or failure. I mostly keep this one around as a monitor not on Calliope itself, but rather on the rest of the network. Many types of malware will at some point have to do some name resolution, and some will result in an absolute flood of DNS queries. Likewise, there’s a chance that this would detect someone connecting to my wireless network and piggybacking on my internet connection. Right now all looks quiet; you can see pretty clearly that I do most of my computer work in the middle of the evening.
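
For completeness, here’s the sort of check this chart makes easy. The log format (a leading ISO timestamp) and the thresholds are invented for the example, not whatever the DNS server actually logs, but the idea is simply to flag any minute that’s wildly above the usual baseline.

```python
#!/usr/bin/env python3
"""Sketch only: flag minutes with an implausible number of DNS queries.
The log format and thresholds are invented for this illustration."""

import statistics
from collections import Counter
from datetime import datetime

def queries_per_minute(lines, ts_format="%Y-%m-%dT%H:%M:%S"):
    """Bucket query-log lines by minute, keyed on the leading timestamp."""
    counts = Counter()
    for line in lines:
        ts = datetime.strptime(line.split()[0], ts_format)
        counts[ts.replace(second=0)] += 1
    return counts

def flag_spikes(counts, factor=10, floor=50):
    """Report minutes whose count exceeds `factor` times the median minute
    (with an absolute floor so a quiet network doesn't false-alarm)."""
    if not counts:
        return []
    baseline = statistics.median(counts.values())
    return [(minute, n) for minute, n in sorted(counts.items())
            if n > max(factor * baseline, floor)]

if __name__ == "__main__":
    # Fake evening: a handful of lookups per minute, then one very chatty minute.
    sample = ["2009-06-08T22:%02d:00 query example.org" % m
              for m in range(60) for _ in range(5)]
    sample += ["2009-06-08T23:00:%02d query host%d.example" % (i % 60, i)
               for i in range(600)]
    for minute, n in flag_spikes(queries_per_minute(sample)):
        print("%s: %d queries (suspicious)" % (minute.strftime("%H:%M"), n))
```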

I’ve also got similar charts for disk and network adapter activity, but they’re remarkably uninteresting — with at most two users connecting, and very little cache contention, they only register above threshold values if I’m streaming video or the like. The data gathering tool I use to collect ambient temperature information is hitting weather.com, and incidentally collects wind speed, humidity and barometric pressure; vaguely interesting, but nothing that’s not already available on the web. At various times I’ve gathered stats on Apache utilization, but again the load is mostly too low to even show up on the default scales (“queries per second” is a little ambitious when you hit a site maybe 50 times a day).

There is one important set of health stats which I record, even though I don’t look at them too often: the reported motherboard voltages of the various power supply lines. This is a great example of why gathering longitudinal data is so important; honestly, I couldn’t tell you what level of deviation I’d expect on the 1.5V line (and when I look at instantaneous stats on my desktop using SpeedFan, all I can say is “yeah, that’s kind of near 1.5V…”). However, with a baseline from when the computer was known to be working, I can pretty easily spot changes in the trend or disturbing variations. Here are three such charts. First, the latest 28 hour window:

Motherboard Voltages - 28 hours

To get everything on the same chart, I’ve roughly normalized the voltages (by subtracting off the nominal voltage). Some of these are so stably off-center that I suspect my correction terms may be wrong (for example, is Vcore really 1.5 volts, or is it perhaps supposed to be 1.525V?), but I’m really just trying to get them all in the same area so I can compare deltas. (Similarly, one could argue that I should be showing percentage deviation rather than absolute, but that would require me to actually pin down what the design-spec voltages really are.)
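
The normalization itself is trivial; in sketch form (with nominal values that are my guesses, which is exactly the uncertainty I just mentioned) it’s one subtraction per rail:

```python
#!/usr/bin/env python3
"""Sketch only: center each voltage rail on its (assumed) nominal value so
all rails can share one axis and only the deviations are compared."""

# Readings as they might come back from the hardware sensors (volts).
readings = {"Vcore": 1.488, "+3.3V": 3.312, "+5V": 5.08, "+12V": 11.978}

# Nominal design voltages; Vcore in particular may really be 1.525 V,
# which is exactly the kind of correction-term doubt described above.
nominal = {"Vcore": 1.5, "+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0}

for rail, volts in readings.items():
    delta = volts - nominal[rail]
    print(f"{rail:>6}: {volts:6.3f} V  (deviation {delta:+.3f} V, "
          f"{delta / nominal[rail]:+.2%} of nominal)")
```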

My primary take-away from this chart is that the jitter is really quite small in absolute terms; the 12V line is noisiest over time (probably partly due to varying fan loads), and even then it’s only drifting in a range of 0.022V.

To see the correlation with the other data, and see if these values change under load, let’s examine that same 28 hour interval with the RAID sync that we looked at above:

Motherboard Voltages - RAID Sync time period

Wow, that’s a nice correlation with the temperature and CPU load charts. The 12V line drops most in absolute terms, probably driving the fans, but we can see that everyone took a hit. Not out of spec, I don’t think, but enough to make me wonder if in the long term I might want a little more power supply now that I’ve jammed this box full of disk drives.

For the final voltage chart, let’s look at all data available (four months worth):

Motherboard Voltages - 4 month period

Okay, there’s a pretty clear change in the middle of this one. Fortunately, I think I know the reason; recall the Memory Utilization chart for the same period, when I mentioned that I’d moved the box and also put it on a UPS? By my guess, this is a pretty graphic depiction of the benefits of active voltage regulation rather than running off raw wall power. (I think it also moved to a different circuit, so it’s possible that this is really showing the benefits of not being on the same circuit as the microwave oven. I guess I could have controlled for that, testing with the UPS on the same circuit first, but we really wanted the server out of the living room.) Again we see the benefits of historical data: I couldn’t have told you what an acceptable or expected level of jitter was on the +5V line, but I can surely see that it has stabilized a great deal.

Coming Attractions

I’m realizing that this post has gotten a lot longer than I originally thought, so I’m going to break my promise. I won’t discuss the implementation of this (although there are some strong hints in the watermarking on the charts); instead I’ll use that as incentive to post sooner than six months from now. I hope this has been somewhat interesting, and perhaps suggested some ways to get more value out of self-monitoring stats than taking a quick look at SpeedFan and saying “Yup, that’s a reported voltage all right!”

2 Comments »

  1. […] Exercise for the Reader Seemingly trivial problems in practical software development « Server Monitoring with Pretty Pictures […]

    Pingback by The Ugly Truth Behind Pretty Pictures « Exercise for the Reader — June 27, 2009 @ 12:06 am

  2. […] few months ago, I hit the milestone of a full year’s worth of data collected by my server monitoring tools. I started a writeup at the time, which I’m finally revisiting in hopes of actually posting […]

    Pingback by Server Monitoring, Revisited « Exercise for the Reader — July 5, 2010 @ 12:36 pm

