Disaster Recovery Techniques!


Linux is generally resistant to filesystem corruption because most of its filesystems support journaling (a technique that records filesystem modifications in a log before writing them to the filesystem itself). Filesystems can still become corrupted through software bugs or hardware failures.

Let's discuss recovering a Linux system from a boot failure.

If a server does not boot, it is failing to get through one of the boot stages. The cause of the failure depends on the stage at which the boot stops.

There are 6 boot stages:

      1. LILO (Linux Loader) or GRUB

      2. Loading the kernel

      3. Mounting the Disks

      4. Startup Scripts

      5. Runlevel Scripts

      6. Providing a Login Prompt


1. LILO (Linux Loader) or GRUB

LILO is the older Linux boot loader, and GRUB (GRand Unified Bootloader) is its more advanced successor.

LILO Conf -  /etc/lilo.conf

GRUB Conf -  /boot/grub/grub.conf
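When working from a rescue disk it is not always obvious which loader a system uses. A minimal sketch that lists whichever config files exist under a given root directory. The function name and the /mnt/sysimage example path are made up for illustration, and the two grub.cfg paths are GRUB 2 locations that are not mentioned in this post.

```shell
# List boot loader config files that exist under a root directory.
# $1 = root of the system to inspect (defaults to /). From a rescue disk
# this would be wherever the broken system is mounted, e.g. /mnt/sysimage
# (example path only).
find_bootloader_conf() {
    root=${1:-/}
    for conf in etc/lilo.conf boot/grub/grub.conf \
                boot/grub/grub.cfg boot/grub2/grub.cfg; do  # last two: GRUB 2 (assumption)
        if [ -f "${root%/}/$conf" ]; then
            echo "${root%/}/$conf"
        fi
    done
}
```

Running `find_bootloader_conf /` on a healthy system shows which of these files are present.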

If there are no issues with the hardware and the MBR (Master Boot Record) is loading, the boot loader should run as well. In case of a failure the boot loader returns an error code, and each error code corresponds to a specific cause.

The most common problem with the LILO loader is an "L1" error. A common reason for this error is that another operating system has been installed on top of Linux and has overwritten the MBR. This can be resolved by wiping out the broken MBR.

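--- Code: ---# From a DOS/Windows boot disk (this is not a Linux command):
fdisk /mbr

# From a Linux rescue disk: zero only the 446 bytes of boot code.
# bs=512 would also erase the partition table stored in bytes 446-509.
dd if=/dev/zero of=/dev/hda bs=446 count=1
--- End code ---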

With the help of a boot disk, we can then reinstall the boot loader into the MBR.
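Before wiping or reinstalling anything, it is worth keeping a copy of the first sector, since it holds the partition table as well as the boot code. A rough sketch, with function names of my own choosing. It works on a disk device or on an image file.

```shell
# Save the first 512-byte sector (boot code + partition table) to a file.
# $1 = disk or image, $2 = backup file
backup_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# Print the two signature bytes at offset 510 as hex.
# A valid MBR ends in "55aa".
mbr_signature() {
    dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Example (as root, from a rescue disk):
#   backup_mbr /dev/hda /root/mbr.bak
#   mbr_signature /root/mbr.bak
# To put the saved sector back:
#   dd if=/root/mbr.bak of=/dev/hda bs=512 count=1
```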

Another probable cause of a boot loader failure is an error in its configuration file. This can be fixed by correcting the configuration and reloading the loader (for LILO, by rerunning the 'lilo' command after editing /etc/lilo.conf).

2. Loading the Kernel

The boot loader hands control to the kernel image listed in its configuration file. If the kernel image is corrupted, the error message varies, and in such cases we can use a boot disk to load a functioning kernel. The exact issue (whether it is hardware or kernel related) can be traced through the log files (/var/log/dmesg).

3. Mounting the Disks

Once the kernel has loaded, the partitions listed in /etc/fstab are mounted. If one of them fails to mount, the system boots into single-user mode. In such cases we can edit /etc/fstab and comment out the failing mount line to work around the issue temporarily.
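Commenting out the line by hand is fine, but here is a small sketch that does it for a given mount point and prints the result, so the change can be reviewed before it replaces the real file. The function name and the /data mount point are only examples.

```shell
# Print an fstab with the entry for one mount point commented out.
# $1 = mount point, $2 = fstab file. The input file is not modified.
comment_out_mount() {
    awk -v mp="$1" '
        $0 !~ /^[ \t]*#/ && $2 == mp { print "#" $0; next }  # matching live entry
        { print }                                             # everything else as-is
    ' "$2"
}

# Example (single-user mode; the root filesystem may first need
# "mount -o remount,rw /"):
#   cp /etc/fstab /etc/fstab.bak
#   comment_out_mount /data /etc/fstab.bak > /etc/fstab
```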

To fix the actual errors, we can use the 'fsck' utility after unmounting the file system.

If the superblock has been corrupted, we can restore it from one of the backup superblocks with the 'fsck' utility. Linux ext2/3 filesystems store copies of the superblock at several backup locations, so it is possible to get data back from a corrupted partition. The location of the backup superblock depends on the filesystem's blocksize.

For ext2 filesystems with 1k blocksizes, a backup superblock can be found at block 8193; for filesystems with 2k blocksizes, at block 16384; and for 4k blocksizes, at block 32768.

To use one of these superblocks, run

--- Code: ---#fsck -b [blocknumber] /dev/[harddrive]
--- End code ---
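The blocksize figures above can be wrapped in a tiny helper so the number does not have to be remembered. This is only a sketch of that lookup (the function name is made up), and it covers just the three blocksizes listed in this post.

```shell
# Map an ext2/ext3 blocksize in bytes to its first backup superblock.
backup_superblock() {
    case "$1" in
        1024) echo 8193 ;;
        2048) echo 16384 ;;
        4096) echo 32768 ;;
        *)    echo "unsupported blocksize: $1" >&2; return 1 ;;
    esac
}

# Example, for a filesystem known to use 4k blocks:
#   fsck -b "$(backup_superblock 4096)" /dev/[harddrive]
```

If the blocksize is not known, 'mke2fs -n' on the device prints where the backup superblocks would be without writing anything, provided it is run with the same parameters the filesystem was created with.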

In some cases, where accessing particular files results in I/O errors, the cause is usually hardware. We can confirm this by checking the output of the 'dmesg' command for 'I/O' messages.

--- Code: ---#dmesg | grep 'I/O'
--- End code ---
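The same check can be run against a saved log such as /var/log/dmesg, which helps when the system had to be booted from a rescue disk. A minimal sketch, with a function name of my own:

```shell
# Print kernel log lines that mention I/O.
# $1 = optional log file; without it, the live kernel buffer is read.
io_errors() {
    if [ -n "$1" ]; then
        grep 'I/O' "$1"
    else
        dmesg | grep 'I/O'
    fi
}

# Example:
#   io_errors /var/log/dmesg | wc -l    # how many lines matched
```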

Next, run a block scan of the hard drive with the 'badblocks' command:

--- Code: ---#badblocks -sv /dev/sdc
--- End code ---

Then run the repair tool on the filesystem (it must be unmounted first):

--- Code: ---#e2fsck -v -y /dev/sdc
--- End code ---

For ext2 and ext3 filesystems, the 'e2fsck' tool with the '-y' option answers "yes" to every prompt and so takes all available measures to correct the filesystem, which is usually what's needed if the filesystem is corrupted enough to require manual repair.
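After a '-y' run it helps to know what actually happened. e2fsck reports this through its exit status, which is a sum of flag values documented in its man page. A small sketch that decodes it (the function name is made up):

```shell
# Decode an e2fsck exit status into readable lines.
explain_e2fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then echo "no errors"; return 0; fi
    if [ $(( status & 1 ))   -ne 0 ]; then echo "filesystem errors corrected"; fi
    if [ $(( status & 2 ))   -ne 0 ]; then echo "system should be rebooted"; fi
    if [ $(( status & 4 ))   -ne 0 ]; then echo "filesystem errors left uncorrected"; fi
    if [ $(( status & 8 ))   -ne 0 ]; then echo "operational error"; fi
    if [ $(( status & 16 ))  -ne 0 ]; then echo "usage or syntax error"; fi
    if [ $(( status & 32 ))  -ne 0 ]; then echo "cancelled by user request"; fi
    if [ $(( status & 128 )) -ne 0 ]; then echo "shared library error"; fi
    return 0
}

# Example:
#   e2fsck -v -y /dev/sdc; explain_e2fsck_status $?
```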

Drive Failure

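In case of a hard drive failure, we can copy the contents of the failing drive to another device, either as an image file or onto a new partition.

Use the 'dd' tool:

With this command and its options, we can copy the drive while skipping read errors. 'noerror' keeps dd going after a read error, and 'sync' pads each unreadable block with zeros so the rest of the data stays at the right offset.

--- Code: ---#dd if=/dev/sdc1 of=/dev/sdb1 bs=4k conv=noerror,sync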
--- End code ---
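The image-file variant looks almost the same. A minimal sketch, assuming there is enough free space for the image. The function name and the /mnt/backup path are examples only.

```shell
# Copy a partition (or any file) to an image, skipping unreadable blocks.
# $1 = source, $2 = image file to create
image_drive() {
    dd if="$1" of="$2" bs=4k conv=noerror,sync
}

# Example:
#   image_drive /dev/sdc1 /mnt/backup/sdc1.img
# The image can then be checked and mounted read-only through a loop device:
#   e2fsck -y /mnt/backup/sdc1.img
#   mount -o loop,ro /mnt/backup/sdc1.img /mnt/recovered
```

Because of 'sync', the image size is rounded up to a multiple of the 4k block size.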

Use the 'ddrescue' tool:

This tool gives better results and also keeps a logfile that records each of the bad blocks found. Using that logfile, we can later retry only the bad blocks. The '-f' option is needed because the output is a partition rather than a regular file.

--- Code: ---#ddrescue -f -n /dev/sdc1 /dev/sdb2 logfile
--- End code ---

The '-n' pass above copies only the good blocks. We can then retry each bad block three times and hopefully rescue some more data ('-f' is required when writing to a partition):

--- Code: ---#ddrescue -f -r3 /dev/sdc1 /dev/sdb2 logfile
--- End code ---
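The two passes can be chained so that the retry pass only starts if the first one finished. A rough sketch with a made-up function name. Setting DRY_RUN (an ad hoc variable, not a ddrescue feature) prints the commands instead of running them, which is a safe way to double check source and destination first. '-f' is included because ddrescue refuses to overwrite a device without it.

```shell
# Run the quick pass, then retry the bad blocks three times.
# $1 = failing partition, $2 = destination partition, $3 = logfile
rescue_drive() {
    src=$1 dst=$2 log=$3
    run=${DRY_RUN:+echo}    # "echo" when DRY_RUN is set, empty otherwise
    $run ddrescue -f -n "$src" "$dst" "$log" &&
    $run ddrescue -f -r3 "$src" "$dst" "$log"
}

# Example:
#   DRY_RUN=1 rescue_drive /dev/sdc1 /dev/sdb2 logfile   # print only
#   rescue_drive /dev/sdc1 /dev/sdb2 logfile             # really run it
```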

Once we have copied as much data as possible to the new drive, we can run a filesystem repair command on the new drive to correct the errors caused by the unreadable blocks, and then mount the new filesystem as usual to access the files.

Missing or Corrupted Partition Table

The best option is to boot from live media and run 'testdisk':

--- Code: ---#testdisk /dev/sdc
--- End code ---

This tool scans the drive for Filesystems, determining their start points and sizes, and then builds a new partition table to match.

4. Startup Scripts

In addition to the actual mounting, the startup scripts detect hardware, configure networking, set the hostname, start the clocks, set up port mappings and declare console settings. If one of these scripts fails, we can fix it either by finding the hardware related to the script and fixing the hardware issue, or by disabling that specific startup script.

5. Runlevel Scripts

After the startup scripts, the runlevel scripts start according to the configuration set in "/etc/inittab". If any daemon fails to start, turn it off temporarily so the system can boot safely.
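On a SysV-style system this comes down to two steps: find the default runlevel in /etc/inittab, then stop the failing service from starting in that runlevel. A minimal sketch with function names of my own. The rc directory location differs between distributions (/etc/rc3.d or /etc/rc.d/rc3.d, for example), and renaming the S link to a K link is just one simple way to do it. On Red Hat style systems 'chkconfig [service] off' does the same job.

```shell
# Print the default runlevel from an inittab ("id:3:initdefault:" -> 3).
# $1 = inittab file (defaults to /etc/inittab)
default_runlevel() {
    awk -F: '$1 !~ /^#/ && $3 == "initdefault" { print $2; exit }' "${1:-/etc/inittab}"
}

# Stop a service from starting in one runlevel by renaming its
# S (start) link to a K (kill) link.
# $1 = rc directory, $2 = service name
disable_in_runlevel() {
    rcdir=$1 service=$2
    for link in "$rcdir"/S[0-9][0-9]"$service"; do
        [ -e "$link" ] || [ -L "$link" ] || continue   # no match: nothing to do
        name=${link##*/}
        mv "$link" "$rcdir/K${name#S}"
    done
}

# Example (httpd is just a sample service name):
#   disable_in_runlevel "/etc/rc$(default_runlevel).d" httpd
```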

6. Providing a Login Prompt

Most virtualized servers come without a GUI, so the system should let users log in through a shell.

This is a very basic outline of a recovery procedure. There are many more advanced ways to recover disks and files effectively.

Thank you for reading!  :)

