IT Horror Stories

Posted by Nagios on October 30, 2018

We’ve all been there. That moment when your heart sinks, human error meets technology, and your worst nightmare comes true. Some Nagios employees share past IT stories that still haunt them to this day.

Spooky Servers

One Saturday night, while on call, I was 30 minutes away from work when commercial power failed. Luckily, the diesel generator kicked in, but the automatic transfer switch failed!

With only 19 minutes of UPS power, the entire data center suffered an ungraceful hard power loss before I could get there. The result: hundreds of downed servers, dozens of ESXi hosts, and custom databases with startup scripts that required intimate knowledge to run. Needless to say, I was the only tech who responded, and it took a very long 30+ hours to bring the entire HQ infrastructure back up. My eyes bled!

We only lost one ESXi server, which luckily turned out to host a WebSphere test environment. It was at this point that we realized some of the backup files were corrupted, so we had to rebuild the environment from scratch. Yay, more fun! I was very thankful that it was only a test machine.

And this happened twice! The next time, the generator didn’t kick in, but at least the transfer switch worked, and another tech responded along with me. That time, it was only about 15 hours of overtime.

Spooky!

-Nagios Support Analyst

Eerie Event Processing

I was new to the development team and eager to prove myself. I had already been working in the codebase for a few years, so I was comfortable jumping in and making changes.

Always on the lookout for performance enhancements, I came up with a way to decrease the software’s memory footprint when handling certain events. This was a massive improvement for large systems!

The old code processed each event individually, as it occurred. The new and improved code queued events until either a certain number had accumulated or a certain amount of time had passed, and only then processed them as a batch.
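The original code isn’t shown in the post, but the pattern described (flush when a count threshold or a time threshold is hit, whichever comes first) is a common one. A minimal sketch in Python, with made-up threshold values:

    import time

    class EventBatcher:
        """Queue events and flush them as a batch once either a count
        threshold or a time threshold is reached (values here are made up)."""

        def __init__(self, process_batch, max_events=100, max_age_seconds=5.0):
            self.process_batch = process_batch  # callback that handles a list of events
            self.max_events = max_events
            self.max_age_seconds = max_age_seconds
            self.queue = []
            self.oldest = None  # timestamp of the first queued event

        def add(self, event):
            if self.oldest is None:
                self.oldest = time.monotonic()
            self.queue.append(event)
            self._maybe_flush()

        def _maybe_flush(self):
            too_many = len(self.queue) >= self.max_events
            too_old = (self.oldest is not None
                       and time.monotonic() - self.oldest >= self.max_age_seconds)
            if too_many or too_old:
                batch, self.queue, self.oldest = self.queue, [], None
                self.process_batch(batch)

Note that this sketch only checks the clock when a new event arrives; a production version would also flush from a timer so a quiet system can’t strand queued events.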

I tested, then the development team tested, and then the QA department tested.

However, when we released the software, we started getting an unusually large number of bug reports, all for the same thing: events weren’t being processed!

As it turned out, certain international characters were breaking the event queueing process. None of our test data had included them, which explains why we didn’t catch the bug during QA.

Thankfully, we identified the issue almost instantly and were able to issue a patch that same day.

-Nagios Product Development Manager

Menacing Manuals

Back in the early ’90s, my first job was as a systems administrator. On my first day on the job, a forklift dropped off a 3’x3’x3’ crate in front of my desk. My boss stopped by and said, “I need you to have this set up for production by Friday.” I replied, “What is it?”

He went on to tell me it was an Ascend Max TNT remote access concentrator: 960 modems plus networking hardware, running a proprietary operating system. But not to worry, because inside the crate was the manual, which should contain all the information I needed to configure it. (This was before you could Google the answer to almost anything.)

He was right about the crate containing the manual; it contained not one but four 300+ page manuals with detailed instructions on every use case other than what we wanted to use it for.

One similar manual can still be found today: https://downloads.avaya.com/elmodocs2/definity/def_r10_new/max/0678_002.pdf

Luckily, after many long hours, I was able to figure out how to configure it.

-Nagios DevOps Engineer

Dreadful Data Center

In 2011, I inherited a data center and had half a day to learn as much as possible about the multi-site infrastructure during the IT director’s last day of work. It seemed fairly straightforward at face value: 33 million users per month, 35,000 mailboxes, a few petabytes of used storage, and so on. Most of the gear was up to the task, but it was all fairly dated and needed to be overhauled.

Back then, if you didn’t configure storage properly, Windows could only see up to 2TB of space (the ceiling imposed by MBR partitioning’s 32-bit sector addressing). We apparently had our entire flat-file mail store located on one of those drives, with no clear path to migrate it off any time soon. The drive was 90% full and climbing. Because I was putting out many fires per hour, I had not yet configured Nagios to monitor it, and no other performance monitoring or alerting was in place.
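For what it’s worth, the check that would have caught this is simple. Below is a minimal sketch of a Nagios-style disk plugin in Python; the mount point and thresholds are invented for illustration, and in practice the stock check_disk plugin does this and more:

    #!/usr/bin/env python3
    """Minimal Nagios-style disk usage check (illustrative only).
    Plugins report state via exit code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN."""
    import shutil
    import sys

    MOUNT = "/var/mail"      # hypothetical mount point for the mail store
    WARN, CRIT = 80.0, 90.0  # percent-used thresholds (made-up values)

    try:
        usage = shutil.disk_usage(MOUNT)
    except OSError as exc:
        print(f"DISK UNKNOWN - cannot stat {MOUNT}: {exc}")
        sys.exit(3)

    pct_used = 100.0 * usage.used / usage.total
    if pct_used >= CRIT:
        status, code = "CRITICAL", 2
    elif pct_used >= WARN:
        status, code = "WARNING", 1
    else:
        status, code = "OK", 0

    print(f"DISK {status} - {MOUNT} is {pct_used:.1f}% full")
    sys.exit(code)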

One night, the remaining 10% suddenly closed the gap, and the call centers lit up. The mail server software at the time had no ability to span multiple drives, and migrating the data off the maxed-out drive was projected to take months given the hardware’s physical performance limits. If we tried to move the data any faster than one mailbox every 10 minutes, the system would become unstable and take down other services (web servers, SQL servers, and so on) in a domino effect. So that’s what we ended up having to do. One mailbox at a time…
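Stripped of the platform specifics, the remedy was nothing more sophisticated than a throttled loop. A sketch of the idea; migrate_mailbox is a hypothetical stand-in for whatever move operation the mail platform actually exposed:

    import time

    def migrate_all(mailboxes, migrate_mailbox, delay_seconds=600):
        """Move one mailbox at a time, pausing between moves so the
        migration never outruns the storage's physical limits."""
        for mailbox in mailboxes:
            migrate_mailbox(mailbox)   # hypothetical stand-in for the platform's move operation
            time.sleep(delay_seconds)  # 600 s = one mailbox every 10 minutes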

Part of me wonders if they are still migrating that data…

-Nagios Operations Specialist


Nagios is there to alleviate all your IT horrors! Download your free, fully-loaded 30-day trial of all the Nagios products here.
