Exercise for the Reader

July 5, 2010

Server Monitoring, Revisited

Filed under: Computing Infrastructure — Seth Porter @ 12:35 pm

In which we belatedly celebrate a year’s worth of baseline data; entertain an aside into intrusion detection via passive measurements; resolve some of the Mystery of the 12V Deviations; and conclude with daydreams of better sensors and monitoring beyond the box itself.

A Year of Data

A few months ago, I hit the milestone of a full year’s worth of data collected by my server monitoring tools. I started a writeup at the time, which I’m finally revisiting in hopes of actually posting it. (I’ve also learned that I seem to be capturing exactly a year of historical data, which is unfortunate; I’d perhaps naively assumed that the powers-of-two backoff in storage frequency would give me an infinite time window, albeit eventually degrading to a single sample for the oldest time slice.)
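A minimal sketch of why the window ends up finite, assuming RRDtool-style round-robin archives (an assumption; the monitoring tool is not named above). Each archive holds a fixed number of rows, so the oldest data visible is bounded by the coarsest archive rather than degrading forever. The step size and archive layout below are hypothetical, chosen only to illustrate the arithmetic, and `archive_windows` is an illustrative helper, not part of any tool.

```python
def archive_windows(step_seconds, archives):
    """Return the time span, in seconds, covered by each archive.

    `archives` is a list of (steps_per_row, rows) pairs: each row
    consolidates `steps_per_row` base samples, and the archive keeps
    only the most recent `rows` rows.
    """
    return [step_seconds * steps * rows for steps, rows in archives]


# Hypothetical layout: 5-minute base step, progressively coarser archives.
STEP = 300
LAYOUT = [(1, 576), (6, 672), (24, 744), (288, 365)]

windows = archive_windows(STEP, LAYOUT)
for (steps, rows), span in zip(LAYOUT, windows):
    print(f"{steps:4d} steps/row x {rows:4d} rows -> {span / 86400:.1f} days")

# The overall history is whatever the longest archive covers.
longest_days = max(windows) / 86400
```

With this made-up layout the coarsest archive keeps one row per day for 365 rows, so nothing older than a year survives, no matter how long the collector runs.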

First, a quick overview; here is a year’s worth of smoothed temperature data (the average-per-interval is plotted, whereas my usual “is something wrong” daily charts are plotting the maximum in a given interval).

Smoothed 1 Year Temperatures

The temperature curve is about what you’d predict for a computer that lives in a poorly-insulated room in Pittsburgh. I lost my outside temperature correlation data (RRDWeather gave up on me a while ago – maybe weather.com changed their policies, or just their web service API, and I never really did anything beyond sending an e-mail to the developer). However, I’m quite confident that they’re well aligned; anyone who lived here this year can pick out, say, that late warm spell we had at the end of November. The one thing that concerns me a little is the decreasing separation between the board and the drive temps; that makes me wonder if I’ve obstructed airflows a little (leading to more indirect heating of the drives). Still and all, there are no major deviations here, and a year later I’m basically in the same temp bracket that I was last July, which is somewhat soothing (and the whole point of establishing this baseline data).

One of the ongoing puzzles from this server-monitoring data has been the voltage data. It has a variety of odd characteristics, including daily and seasonal cycles, as well as a fair amount of apparent noise. I had thought they’d quieted down a lot when I installed an uninterruptible power supply (with supply-voltage smoothing). I still think that’s true for most of the lines, particularly the 3.3V and 1.5V rails, which are actually providing power to the chips. However, there’s still a significant variation on the 5V and 12V rails. Looking at this data for the past month, you can clearly see a daily cycle on the 12V rail through Weeks 24 and 25:

Voltages June-July 2010

In the chart here, you can also see a “reboot discontinuity” at the beginning of week 23; from memory, I think this was a kernel update.

An Aside Into Passive Monitoring

The occasional almost-vertical drops in voltage (noticeable in this chart at the reboot and the end of week 26) are strongly correlated with system usage (for instance, the CPU utilization chart); a co-worker has suggested that this would be an interesting approach to intrusion detection. I might get more into this in a later post, but history suggests otherwise, so please tolerate a brief excursion to look at another set of charts, in this case the history of the DNS daemon running on the server. First, the last 28 hours of data:

DNS Data - Last 28 Hours

I haven’t spent as much time exploring the reporting mechanism for this chart, so things don’t add up as nicely as some of the others, but the intent here is that the thin blue line records the number of requests, which is then partitioned into success or failure cases. I’m not sure if the open gap (between the explained results and the number of queries) is due to failures other than NXDomain (non-existent domain), or if this is simply an artifact of differing time scales and reporting methods. I’m also not sure about the sampling time frame – this is obviously “queries per time period”, but without knowing the granularity, the absolute scale (area under the curve / total queries per hour) is meaningless. However, the basic view of local network activity is pretty compelling, as long as interpreted strictly in relative terms.

For a moment of background: Calliope, the server, performs both DHCP and DNS duties. Except for a couple of local hosts, DNS is simply forwarded and cached from Verizon’s DNS provider. The interesting bit here is that my local server provides DNS with zero “time to live”, so clients will immediately ask again if they need to resolve the same host repeatedly. This means that the record of DNS activity on the server is a reasonably accurate record of internet usage across all hosts in my network, even the uninstrumented ones like the PlayStation etc.

In this chart, you can see a midnight spike, probably primarily cron jobs and the like. More interestingly, you can see my flurry of surfing in late afternoon, trying to figure out whether the T was a viable way to see the city fireworks, followed by several dark hours when we were off watching things go boom. Then this morning you can see machines coming on for the first time (checking for updates, etc), a period of quiet coffee drinking (lately I’m reading “The Worst Journey in the World”, rather than surfing, over coffee), and then a steady rumble of activity as I fired up my main box and started working on this post.

The weekly and monthly charts have similar stories to tell, but not much additional interest; overall activity tends to peak on the weekends, and during the week there’s a definite shift to activity later in the day (when I get home from work), but this is all predictable (though useful from an intrusion-detection point of view). Before returning to the earlier question of voltage excursions, I’ll just post one more DNS chart, this time showing the past year’s activity:

DNS Data - Past Year

For the most part, the patterns have disappeared into noise, though there’s still a definite periodicity; I wonder if my cycle of amusements (computer vs books vs television, and so forth) is really that predictable? There’s also a spike around Christmas, from shopping and travel planning, and an interesting nadir in March, which I think I can explain: this was a period of high intensity at work, and just before I started working on a consulting side-project (which explains the above-average usage for the following months). In any event, I find this data kind of interesting, though the real fun would be in correlating it against a diary or other activity measures to see if it’s really a valid proxy for how much time I spend on the computer.
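The intrusion-detection angle mentioned earlier could be sketched roughly as follows: build a per-hour-of-week baseline from historical query counts, then flag any hour that strays too far from it. This is only an illustration under stated assumptions. The function names, the `(hour_of_week, query_count)` sample shape, and the three-sigma threshold are all hypothetical, and nothing like this is actually running on the server.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(samples):
    """Group (hour_of_week, query_count) samples into {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, count in samples:
        by_hour[hour].append(count)
    return {hour: (mean(counts), pstdev(counts)) for hour, counts in by_hour.items()}


def is_anomalous(baseline, hour, count, threshold=3.0):
    """True if `count` is more than `threshold` standard deviations from
    the historical mean for that hour-of-week slot."""
    if hour not in baseline:
        return True  # no history for this slot: worth a look
    mu, sigma = baseline[hour]
    if sigma == 0:
        return count != mu  # perfectly flat history: any change stands out
    return abs(count - mu) / sigma > threshold


# Made-up history for one slot (hour 3 of the week).
history = [(3, 10), (3, 12), (3, 11), (3, 9)]
baseline = build_baseline(history)
```

Keying the baseline on hour-of-week rather than hour-of-day is a deliberate choice here, since weekend and weekday usage differ; a real version would also need far more history than four samples per slot.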

Back to the Voltage Fluctuation

Anyway, returning to the voltage dilemma, we saw a month’s worth of data previously; here is the data for the full year:

Motherboard Voltages: Past Year

It’s not the easiest chart to read, I admit. In all cases the plots are deviations from “nominal voltages”. Most of the traces are really rock solid, varying only a few millivolts if that. However, there’s clearly something going on with the 5V and 12V lines: they vary together, and they have a clear seasonal component. That pattern from October and November is particularly interesting, if you remember that both those months had sharp cold spikes followed by surprisingly mild weather.

I’ve probably given away the game by now, but I’ll walk through it anyway. Since statistics are pretty tricky on these datasets (the data points are actually averages over varying intervals, and I suspect that does evil things to the normal distribution assumption), I decided to satisfy my curiosity with a visual correlation instead. To do this, I plotted a key temperature curve (reported motherboard temperature) on the same graph as the 5V and 12V deviation plots, then inverted the scale on the temperature (so valleys are the highest temperatures) and iteratively adjusted constant and scale until they lay in the same general region. (I probably could have also just inverted temp and thrown it onto a secondary axis, but this was easier.) As a result, the vertical units for the voltage lines are millivolts of deviation from their spec voltages, and for the temperature they’re completely synthetic units. Note that the chart is kind of mis-titled: for the reasons noted above (“I’m not good at stats”), this is not a direct plot of correlation, but simply a superposition of the two datasets (suitably massaged). We already know that the 12V line and the 5V line are pretty well aligned, so look for how well the green line tracks the pattern established by the voltages:

Temperature / Voltage Analysis - Past Year

Looking at this chart, as has been clearly foreshadowed, we have a strong “correlation” over the course of the year. My explanation for the basic pattern here is the system fans. There are both 5V and 12V fans in the system. Since the server is relatively underutilized (only accessible by me and my wife, and typically for light duty), the usage of these fans is the dominant change in power usage over time.

There is clearly some overshoot in both directions, in the warmest days of summer and the coldest days of winter. I invoke two explanations for this. In summer, these are NOT independent measures – the fans are working to cool the motherboard. I would infer that temperatures were actually higher in August than July, but the fans kicked in to keep the motherboard temperature roughly constant across those two months.

This rationale doesn’t seem to work to explain the winter overshoots, since the fans only work to push the temperatures in one direction. There are two scenarios here. One is that we over-compensate with heating when it’s coldest outside; this generates a noisy graph on a yearly scale (since the temperatures are varying widely over the course of the day, and the fan load correspondingly). A second explanation would be that the fans are essentially idle through the winter, and as a result they cease to be the dominant driving force, particularly on the 12V line. This allows the daily noise of varying utilization to show through, instead of being drowned in the all-day-long load of the fans.

Unfortunately, I lack the data to disambiguate these last two scenarios. Most of the fans are controlled by an off-board fan controller (sitting in a drive bay, but not providing instrumentation data back to the host computer). The BIOS-controlled fans should report their speed, but I’ve never been able to get good data out of them; I suspect that at heart the problem is that this motherboard came from a pre-packaged HP “media computer”, and they may have saved a few pennies by cutting out some of the monitoring tool support. At some point I’ll upgrade my dev box, and transplant its rather more enthusiast motherboard to the server. That’ll be a setup chore to be sure, but perhaps I’ll be able to get some direct information. In the meantime, having such a long baseline allows me to be pretty confident in my indirect conclusions. (Note that this is the whole point of this post: if I point-sampled this data at any time throughout the year, I really wouldn’t be able to draw any conclusions at all. However, a year’s worth of data tells a pretty compelling story, even without formal stats to back it up.)
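For reference, the temperature overlay described earlier (invert the temperature, then adjust constant and scale until it sits on the voltage traces) can be approximated in a few lines. Note the swapped-in technique: the sketch uses an ordinary least-squares fit to pick the constant and scale, where the chart above was adjusted by hand. The function names and sample numbers are hypothetical. A negative fitted scale corresponds to the "inverted" temperature axis.

```python
def fit_overlay(temps, volts):
    """Least-squares fit of volts ~= offset + scale * temp.

    Returns (offset, scale). A negative scale means the voltage deviation
    falls as temperature rises, i.e. the temperature axis is inverted.
    """
    n = len(temps)
    if n != len(volts) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    t_mean = sum(temps) / n
    v_mean = sum(volts) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(temps, volts))
    var = sum((t - t_mean) ** 2 for t in temps)
    if var == 0:
        raise ValueError("temperature series is constant")
    scale = cov / var
    offset = v_mean - scale * t_mean
    return offset, scale


def overlay(temps, offset, scale):
    """Map temperatures into the voltage chart's (synthetic) units."""
    return [offset + scale * t for t in temps]


# Made-up data: deviation in millivolts drops 4 mV per degree.
temps = [30, 35, 40, 45]
volts = [100 - 4 * t for t in temps]
offset, scale = fit_overlay(temps, volts)
```

The caveat from above still applies: because the stored points are averages over varying intervals, the fitted numbers are only good for lining up curves on a chart, not for quoting as statistics.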

Sampling Outside the Box: 1-Wire and Friends

I have some longer-range thoughts or daydreams, time and money permitting (primarily time). I spent quite a while looking for a fan controller which would report its sensor readings to the host. Apparently they exist, but the only examples I was able to find that really fit my needs were these beasts from Matrix Orbital. While they are extremely impressive in their own way, I really don’t need an OLED screen displaying weather etc on my server box, when the whole point is that I don’t see or hear the server anyway. However, some of the tech specs on this monster mentioned that it can gather and report data from any “1-wire” sensor source. That set me off on a whole other trail.

It turns out that “1-wire” is a Dallas Semiconductor (now Maxim) physical and logical protocol for simple, low-power sensor networks. It looks to be primarily designed for industrial applications, but there is a hobbyist community as well (particularly for weather stations, it seems, although I found a fascinating story of a buried moisture sensor driving an automated sprinkler system – now that’s gardening! [Update: Okay, I think the hobby-built CO2 sensor beats the automated sprinklers; no link because I noticed belatedly that it’s deep in an “Advanced Marijuana Cultivation” forum – I’m still bemused by how Google ends up cross-cutting conversations!]). There are USB “host adapters” available, as well as a pretty wide range of sensor units (and raw adapters which simply sample a voltage and report it) – the net result seems to be that you can monitor pretty much anything you want, if you obey the topology and run length constraints and don’t mind writing some software to interrogate and record.
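As a taste of the "software to interrogate and record" part, here is a minimal sketch of parsing one temperature reading. It assumes a DS18B20-style sensor exposed through the Linux kernel's w1 subsystem, where the `w1_therm` driver publishes a two-line `w1_slave` text file per sensor under `/sys/bus/w1/devices/`. That setup is an assumption, not something tested here; other stacks such as owfs present readings differently. `parse_w1_slave` is a hypothetical helper name, and the sample text is illustrative.

```python
def parse_w1_slave(text):
    """Parse the two-line text of a w1_therm `w1_slave` file.

    Returns the temperature in degrees Celsius, or None if the CRC
    check failed or the reading could not be found.
    """
    lines = text.strip().splitlines()
    if len(lines) < 2:
        return None
    # The first line ends in YES when the sensor's CRC check passed.
    if not lines[0].strip().endswith("YES"):
        return None
    # The second line ends in t=<millidegrees Celsius>.
    marker = lines[1].rfind("t=")
    if marker == -1:
        return None
    return int(lines[1][marker + 2:]) / 1000.0


# Illustrative file contents, not captured from real hardware.
SAMPLE = (
    "72 01 4b 46 7f ff 0e 10 57 : crc=57 YES\n"
    "72 01 4b 46 7f ff 0e 10 57 t=23125\n"
)
reading = parse_w1_slave(SAMPLE)
```

Returning None on a failed CRC, rather than raising, makes it easy for a polling loop to skip a bad sample and try again on the next pass.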

So my new dream is to expand my coverage significantly. This project started as a way to monitor the health of my server (and particularly to spot dead fans) without being able to see or hear it. However, there’s no reason I can’t use the same infrastructure to monitor our living space as a whole. I’d love to track ambient temperatures in various parts of the house, to see where the insulation is failing (and prove out my intuition that it really is cold in the dead space under my computer desk). Weather metrics are less near and dear to my heart, but it could be amusing to have a windspeed sensor or track rainfall. Moisture sensors in the basement could give an approximation of Home Heartbeat, without the vendor lock-in and high prices. I haven’t found a good AC sensor yet, but it seems like it would be possible to do some approximation of the “Smart Grid” idea (realtime electricity-usage tracking) without having to wait on Duquesne Light to figure it out. Even further down the road (and after building some trust in the system), it could be combined with an X-10 network for real closed-loop environmental control – far better than simply putting everything on a timer, or “home automation” which requires a user at a web browser to actually do anything. Before I dive into anything that big (and potentially annoying if done wrong), I’ll probably start with simpler closed-loop systems. For example, there’s no reason why my nightly cron tasks can’t wait for system idle, instead of running at midnight – that’s when I get some of my best work done! Some small-scale experiments with this sort of feedback loop would teach me a lot about how to control yoyoing and all the other pitfalls of feedback-based machine control, without annoying my wife.
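A rough sketch of the "wait for system idle" idea, assuming a Unix host where the 1-minute load average is a good enough stand-in for "idle" (that proxy, the threshold, and the function name are all assumptions for illustration). To dampen the yoyoing problem, it requires several consecutive quiet readings before proceeding, so a momentary lull does not trigger the job.

```python
import os
import time


def wait_for_idle(threshold=0.5, required_quiet_checks=3, poll_seconds=60,
                  read_load=lambda: os.getloadavg()[0], sleep=time.sleep):
    """Block until the load stays below `threshold` for
    `required_quiet_checks` consecutive polls.

    `read_load` and `sleep` are injectable so the loop can be exercised
    without touching the real system or actually waiting.
    """
    quiet = 0
    while quiet < required_quiet_checks:
        if read_load() < threshold:
            quiet += 1
        else:
            quiet = 0  # any busy reading resets the streak
        if quiet < required_quiet_checks:
            sleep(poll_seconds)


# Dry run with scripted load readings and a fake sleep that records calls.
loads = iter([2.0, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1])
sleeps = []
wait_for_idle(read_load=lambda: next(loads), sleep=sleeps.append)
```

A nightly job would call `wait_for_idle()` with the defaults and then start its real work. A production version would probably also want an upper bound on how long it is willing to wait, so the job still runs on a day when the machine never goes quiet.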
