Linux is generally resistant to filesystem corruption because most of its filesystems support journaling (a technique that records filesystem modifications in a log before writing them to the filesystem itself). Filesystems can still become corrupted through software bugs or hardware failures.
Let's discuss recovering a Linux system from a boot failure. If a server is not booting, it is failing to get through one of the boot stages, and several different problems can cause that.
There are six boot stages:
1. LILO (Linux Loader) or GRUB
2. Loading the kernel
3. Mounting the Disks
4. Startup Scripts
5. Runlevel Scripts
6. Providing a Login Prompt
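1. LILO/GRUB
LILO is the older Linux loader; GRUB (GRand Unified Bootloader) is a newer, more capable boot loader that has largely replaced it.
LILO Conf - /etc/lilo.conf
GRUB Conf - /boot/grub/grub.conf (GRUB Legacy; GRUB 2 uses /boot/grub/grub.cfg or /boot/grub2/grub.cfg)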
If the hardware is healthy and the MBR (Master Boot Record) is being read, the boot loader should run. When it fails, the boot loader prints an error code, and each error code has a corresponding cause.
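The most common LILO problem is an "L1" error. A common reason for it is that another operating system has been installed after Linux and has overwritten the boot loader in the MBR. This can be resolved by clearing the broken boot code from the MBR, either from a DOS boot disk:
fdisk /mbr
or from a Linux rescue environment, zeroing only the first 446 bytes of the sector so that the partition table stored after them is preserved (bs=512 would erase the partition table as well):
dd if=/dev/zero of=/dev/hda bs=446 count=1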
With the help of a boot disk, we can then reinstall the boot loader into the MBR.
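Because the MBR sector also holds the partition table, it may be worth saving a copy of it before clearing anything. The following is a minimal sketch rather than a definitive procedure; the function names backup_mbr and wipe_boot_code and the backup path are hypothetical, and the target can be a disk such as /dev/hda or a disk image file.

```shell
# Hypothetical helpers (names are illustrative, not part of any standard tool).
# $1 may be a disk device such as /dev/hda or a disk image file.

backup_mbr() {
    # Save the whole first sector: 446 bytes of boot code,
    # a 64-byte partition table and the 2-byte signature.
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

wipe_boot_code() {
    # Zero only the boot code. conv=notrunc stops dd from truncating
    # the target when it is a regular image file.
    dd if=/dev/zero of="$1" bs=446 count=1 conv=notrunc 2>/dev/null
}

# Example (as root, from a rescue environment):
#   backup_mbr /dev/hda /root/hda-mbr.bin
#   wipe_boot_code /dev/hda
```

If something goes wrong, the saved sector can be written back with dd using bs=512 count=1 and the backup file as input.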
Another likely cause of a boot loader failure is an error in its configuration file. This can be fixed by correcting the configuration and reinstalling the loader (for LILO, by rerunning /sbin/lilo).
2. Loading the Kernel
The boot loader hands control to the kernel image named in its configuration file. If that image is corrupted, the error message varies; in such cases we can use a boot disk to load a working kernel. Whether the problem lies in the hardware or in the kernel can be traced through the log files (/var/log/dmesg).
3. Mounting the Disks
Once the kernel has loaded, the partitions listed in /etc/fstab are mounted. If one of them fails to mount, the system drops to single-user mode; there we can edit /etc/fstab and comment out the failing mount line as a temporary fix.
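Commenting out the mount line can be sketched as a small filter. This is only an illustration: the function name comment_out_mount is hypothetical, and it prints a modified copy instead of editing /etc/fstab in place so that the result can be reviewed first.

```shell
# Hypothetical helper: print a copy of an fstab file with the entry for
# one mount point commented out. The original file is not modified.
comment_out_mount() {
    # $1 = path to the fstab file, $2 = mount point to disable
    awk -v mp="$2" '
        /^[[:space:]]*#/ { print; next }        # keep existing comments
        $2 == mp         { print "#" $0; next } # disable the matching entry
                         { print }              # pass everything else through
    ' "$1"
}

# Example:
#   comment_out_mount /etc/fstab /data > /tmp/fstab.new
```

Once the output looks right, it can be copied over /etc/fstab, keeping the old file as a backup.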
To fix the underlying errors, run the 'fsck' utility on the filesystem while it is unmounted.
If the superblock has been corrupted, 'fsck' can work from one of the backup superblocks instead. ext2/3 filesystems keep copies of the superblock at several locations, so data can often still be recovered from a damaged partition. The location of the backup superblocks depends on the filesystem's block size.
For ext2 filesystems with 1k block sizes, a backup superblock can be found at block 8193; with 2k block sizes, at block 16384; and with 4k block sizes, at block 32768.
To use one of these superblocks, run
#fsck -b [blocknumber] /dev/[harddrive]
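The block numbers above follow a simple rule, sketched here under the assumption that the filesystem was created with the mke2fs default of 8 * blocksize blocks per group. The function name first_backup_superblock is hypothetical.

```shell
# Hypothetical helper: first backup superblock for a given block size,
# assuming the mke2fs default of 8 * blocksize blocks per group.
first_backup_superblock() {
    blocksize=$1
    blocks_per_group=$((blocksize * 8))
    # With 1k blocks the first data block is block 1, otherwise block 0.
    if [ "$blocksize" -eq 1024 ]; then
        first_data_block=1
    else
        first_data_block=0
    fi
    echo $((first_data_block + blocks_per_group))
}

# Example:
#   fsck -b "$(first_backup_superblock 4096)" /dev/[harddrive]
```

For a filesystem created with non-default settings the locations may differ; where the filesystem is still readable, 'dumpe2fs' run on the device lists the backup superblocks it actually has.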
When accessing particular files results in I/O errors, the cause is usually hardware. We can confirm this by checking the output of the 'dmesg' command for 'I/O' messages.
eg : #dmesg | grep 'I/O'
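To see which drive the errors come from, the matching lines can be grouped by device. This is a rough sketch: the function name io_errors_by_device is hypothetical, and it assumes kernel messages of the form "I/O error, dev sdc, sector 12345", whose wording varies between kernel versions.

```shell
# Hypothetical helper: count kernel I/O error lines per device.
# Reads kernel log text on standard input. Assumes messages that contain
# "I/O error, dev NAME, ..."; other message formats are ignored.
io_errors_by_device() {
    sed -n 's/.*I\/O error, dev \([^,]*\),.*/\1/p' | sort | uniq -c
}

# Example:
#   dmesg | io_errors_by_device
```

Each output line is a count followed by a device name, so a single drive with a large count is the first candidate for the block scan.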
If the errors point to hardware, first run a block scan of the hard drive with the 'badblocks' command:
#badblocks -sv /dev/sdc
Then run a repair tool on the affected filesystem while it is unmounted. For ext2 and ext3 filesystems, the 'e2fsck' tool with the '-y' option answers yes to every repair prompt and so takes all available measures to correct the filesystem, which is usually what is needed when it is corrupted enough to require manual repair:
#e2fsck -v -y /dev/sdc1
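After a repair run it helps to know what the tool actually did. The exit status of fsck/e2fsck is a sum of bit values listed in the e2fsck manual page; the sketch below decodes it. The function name describe_fsck_status is hypothetical.

```shell
# Hypothetical helper: decode the exit status of fsck/e2fsck, which is a
# sum of the bit values documented in the e2fsck manual page.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return
    fi
    result=""
    for entry in \
        "1:errors corrected" \
        "2:errors corrected, reboot needed" \
        "4:errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:cancelled by user" \
        "128:shared library error"
    do
        bit=${entry%%:*}    # number before the first colon
        text=${entry#*:}    # description after it
        if [ $((status & bit)) -ne 0 ]; then
            result="${result:+$result; }$text"
        fi
    done
    echo "$result"
}

# Example (run right after e2fsck, while $? still holds its status):
#   describe_fsck_status $?
```

A status that includes 4 means some damage remains, so the filesystem should not be trusted until a further pass comes back clean.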
Drive Failure
==============
In case of a hard drive failure, we can copy the contents of the failing drive to another device, either as an image file or onto a new partition.
Use the 'dd' tool:
With 'conv=noerror,sync', dd continues past read errors and pads each unreadable block with zeros, so the copy keeps the same layout as the original.
eg : #dd if=/dev/sdc1 of=/dev/sdb1 bs=4k conv=noerror,sync
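The effect of the two conv flags can be tried safely on ordinary files before pointing dd at a failing drive. The wrapper below is a sketch of the same invocation; the name clone_skipping_errors is hypothetical.

```shell
# Hypothetical wrapper around the dd invocation shown above.
# $1 = source (failing partition or image file), $2 = destination.
clone_skipping_errors() {
    # noerror: keep going after a read error.
    # sync:    pad every short or failed read with zeros up to bs,
    #          so later data stays at the same offset in the copy.
    dd if="$1" of="$2" bs=4096 conv=noerror,sync 2>/dev/null
}

# Example:
#   clone_skipping_errors /dev/sdc1 /dev/sdb1
```

One side effect of 'sync' is that a source whose size is not a multiple of the block size produces a copy padded with zeros up to the next multiple; a 5000-byte file, for instance, becomes an 8192-byte copy.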
Use the 'ddrescue' tool:
This tool gives better results, and it keeps a log file that records each bad block it finds. With that log file, later runs can retry just the bad blocks.
eg : #ddrescue -n /dev/sdc1 /dev/sdb2 logfile
This first pass copies only the blocks that read cleanly. A second pass then retries each bad block three times, hopefully rescuing some more data:
#ddrescue -r3 /dev/sdc1 /dev/sdb2 logfile
Once as much data as possible has been copied to the new drive, we can run a filesystem repair command on the new drive to correct the errors caused by the unreadable blocks, and then mount the new filesystem as usual to access the files.
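To judge how much data is still missing after the ddrescue passes, the log file can be summarised. The sketch below assumes the layout GNU ddrescue uses for this file: '#' comment lines, one current-status line, then rows of "position size status" in hexadecimal, where a status of '+' marks a rescued area. The function name unrecovered_bytes is hypothetical.

```shell
# Hypothetical helper: total bytes not yet rescued according to a
# ddrescue log file. Assumes GNU ddrescue's layout: '#' comment lines,
# one current-status line, then "position size status" rows in hex,
# where a status of '+' marks a rescued area.
unrecovered_bytes() {
    total=0
    while read -r pos size status rest; do
        case "$pos" in
            ''|'#'*) continue ;;          # blank line or comment
        esac
        case "$status" in
            # untried, untrimmed, unscraped or bad areas all count
            '?'|'*'|'/'|'-') total=$((total + $size)) ;;
        esac
    done < "$1"
    echo "$total"
}

# Example:
#   unrecovered_bytes logfile
```

If the total stops shrinking between retry passes, further retries are unlikely to recover more.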
Missing or Corrupted Partition Table
=====================================
The best option is to boot from live media and run 'testdisk':
#testdisk /dev/sdc
This tool scans the drive for filesystems, determines their start points and sizes, and then builds a new partition table to match.
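A quick way to tell whether the partition table sector has been overwritten is to look for the two-byte boot signature (0x55 0xAA) at the end of the first sector. This is only a rough check under that assumption, and the function name has_mbr_signature is hypothetical.

```shell
# Hypothetical helper: check whether the first sector of a disk or image
# ends with the 0x55 0xAA boot signature of an MBR partition table.
has_mbr_signature() {
    # Read bytes 510 and 511 and print them as lowercase hex.
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Example:
#   has_mbr_signature /dev/sdc || echo "no MBR signature found"
```

A missing signature suggests the sector was wiped or overwritten; a present one does not prove that the partition entries themselves are intact.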
4. Startup Scripts
In addition to the actual mounting, the startup scripts detect hardware, configure networking, set the hostname, start the clocks, set up portmaps and apply console settings. If one of these scripts fails, we can either find the hardware the script depends on and fix it, or disable that specific startup script.
5. Runlevel Scripts
After the startup scripts, the runlevel scripts start according to the configuration in "/etc/inittab". If any daemon fails to start, turn it off temporarily so that the system can boot safely.
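To know which set of runlevel scripts is involved, the default runlevel can be read from the inittab file. The sketch below relies on the inittab entry format id:runlevels:action:process; the function name default_runlevel is hypothetical.

```shell
# Hypothetical helper: print the default runlevel from an inittab file.
# Entries have the form id:runlevels:action:process; the default is the
# runlevels field of the entry whose action is "initdefault".
default_runlevel() {
    awk -F: '!/^[[:space:]]*#/ && $3 == "initdefault" { print $2; exit }' "${1:-/etc/inittab}"
}

# Example:
#   default_runlevel            # reads /etc/inittab
#   default_runlevel /mnt/sysimage/etc/inittab
```

The second form is useful from a rescue environment, where the broken system's files are mounted under another directory (the path shown is only an example).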
6. Providing a Login Prompt
Most virtualized servers come without a GUI, so the system should allow users to log in through a shell.
This is only a basic outline of the recovery process. There are many more advanced ways to recover disks and files effectively.
Thank you for reading!