Detecting when you need to system rescue
System Rescue

Kurt provides some tips and recommends some tools to help you detect signs of network intrusion and data corruption.
System rescue – it's definitely an important topic with lots of considerations. Do you go with "bare-metal restore" or just back up the data and all the configs? What about your database? Do you snapshot it, or replicate it and keep a transaction log? What about all the new NoSQL things? More to the point, how do you know when you need to do a system rescue?
Sometimes it's pretty obvious, like when some water spilled onto one of my machines; I stared in horror as the machine made a loud "pop" and the power supply killed the motherboard and then itself. Luckily, I didn't lose any data. Sometimes, however, it's not so clear when you have to do a system rescue. For example, if you find a corrupted file on your system, do you have other corrupted files? Short of opening them all and checking them, you don't know whether you have just one bad file or a completely corrupted filesystem.
File Integrity to the Rescue
Such problems have plagued administrators, well, since computers have had read/write data storage. The good news is that several mature tools can help you address the problems of managing files and ensuring that they are not modified or corrupted. Certain strategies are also helpful when designing and architecting systems to make things more robust. Ultimately, the goal is to prevent data corruption or improper modification as much as possible – by using file permissions, robust filesystems with journaling, and so on. Then, you need to ensure that you can detect file corruption and improper modification and, finally, restore things to a known good state. The two main tools for these tasks are Open Source Tripwire [1] and AIDE [2]. Neither has undergone major changes for a few years, mostly because they are fairly feature complete.
[...]
Buy this article as PDF
(incl. VAT)