Ops Monkey: Journalling Filesystem in Linux

The ext3 filesystem adds journalling as an important enhancement over ext2: the filesystem maintains a journal in which it logs changes.

How is the journalling feature helpful?

After an unclean shutdown caused by a power failure or system crash, every mounted ext2 partition must be checked for consistency with the e2fsck program. This delays system boot significantly, especially for large partitions containing a large number of files. During the check, any data on those partitions is unreachable.

If fsck needs to be run on a live system, the partitions must first be remounted read-only. When a filesystem is remounted read-only, all pending metadata updates and data writes are forced to disk before the remount completes. This leaves the filesystem in a consistent state, so that fsck -n can be run against it.
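
Before running fsck -n on a live system it is worth confirming that the remount really took effect. The following is a minimal sketch, assuming the mount table is supplied in /proc/mounts format (device, mount point, type, options). The function name is_mounted_ro is made up for this example.

```shell
# is_mounted_ro MOUNTPOINT
# Reads a mount table in /proc/mounts format on stdin and succeeds (exit 0)
# only if MOUNTPOINT is listed and its option list contains "ro".
# The function name is hypothetical, chosen for this example.
is_mounted_ro() {
    awk -v mp="$1" '
        $2 == mp {
            # If a mount point is listed more than once, the last entry wins.
            found = 1
            ro = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") ro = 1
        }
        END { exit ((found && ro) ? 0 : 1) }
    '
}
```

On a real system this might be chained as: mount -o remount,ro /mnt/data && is_mounted_ro /mnt/data < /proc/mounts && fsck -n /dev/sdb1, where /mnt/data and /dev/sdb1 are placeholder names.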

The journalling feature provided by the ext3 file system means that this sort of file system check is no longer necessary after an unclean system shutdown.

The time to recover an ext3 file system after an unclean system shutdown does not depend on the size of the file system or the number of files; rather, it depends on the size of the journal used to maintain consistency. The default journal size takes about a second to recover, depending on the speed of the hardware, because recovery only replays the journal instead of scanning the whole file system.

Is fsck still necessary when a filesystem has journalling?

Yes. In extreme cases such as hard drive failures, a file system consistency check (fsck) is still necessary.

How does journalling work?

There are three modes of journalling:

1) ordered (only the metadata is journalled; this is the default)
2) writeback (only the metadata is journalled, with no guarantee on the order of data writes relative to metadata commits)
3) journal (both data and metadata are journalled)

Ordered mode:

mount -o data=ordered

In ordered mode, the data blocks related to a metadata change are written to the disk before the metadata is committed to the journal. This ensures that every metadata change recorded in the journal reflects writes that have actually been made to the disk.

Ordered mode is the default journal mode used in most systems.

Writeback mode:

mount -o data=writeback

This is the fastest mode. In this mode, metadata may be committed to the journal before the data blocks related to the metadata change are written to the disk. As a result, files may contain stale data after a crash.

Journal mode:

mount -o data=journal

In this mode, both the metadata and the data blocks related to the metadata change are journalled.

Here, a copy of the modified data blocks is first written to the journal. Then the modified data blocks are written to the filesystem. Once the I/O transfer to the filesystem completes (the data is committed to the filesystem), the copies of the blocks in the journal are discarded.

This is the slowest mode of journalling, because more total disk I/O is done: every data block is written twice, once to the journal and once to the filesystem. However, writing to the journal first merges many small writes scattered around the disk into efficient linear I/O, which helps avoid expensive seeks for small, random writes.
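
To see which of the three modes a mounted filesystem was given, one option is to look for the data= option in the mount table. The following is a minimal sketch, assuming input in /proc/mounts format; the function name data_mode_of is made up for this example. Some kernels do not list data= when the default mode applies, so the word "default" printed below only means "no data= option was listed".

```shell
# data_mode_of MOUNTPOINT
# Reads a mount table in /proc/mounts format on stdin and prints the value of
# the data= mount option for MOUNTPOINT. Prints "default" when no data= option
# is listed, and prints nothing when MOUNTPOINT is not in the table.
# The function name is hypothetical, chosen for this example.
data_mode_of() {
    awk -v mp="$1" '
        $2 == mp {
            mode = "default"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^data=/) mode = substr(opts[i], 6)
        }
        END { if (mode != "") print mode }
    '
}
```

A typical call would be data_mode_of /var < /proc/mounts, with /var standing in for whatever mount point is of interest.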

How to check whether a filesystem has journalling enabled?

# dumpe2fs /dev/sda2 | grep -i has_journal
dumpe2fs 1.41.12 (17-May-2010)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

# debugfs -R features /dev/sda2
debugfs 1.41.12 (17-May-2010)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
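
In a script it is handier to get a yes/no answer than to read the feature list by eye. The following is a minimal sketch that parses the "Filesystem features:" line shown above and matches the feature as a whole word; the function name has_feature is made up for this example.

```shell
# has_feature FEATURE
# Reads dumpe2fs (or "debugfs -R features") output on stdin and succeeds
# (exit 0) only if FEATURE appears as a whole word on the
# "Filesystem features:" line.
# The function name is hypothetical, chosen for this example.
has_feature() {
    awk -v want="$1" '
        /^Filesystem features:/ {
            # Fields 1 and 2 are the label; features start at field 3.
            for (i = 3; i <= NF; i++)
                if ($i == want) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example: dumpe2fs -h /dev/sda2 2>/dev/null | has_feature has_journal && echo "journal present". The -h option limits dumpe2fs to the superblock information, which is all this check needs.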

How to find the size of the journal?

# dumpe2fs /dev/sda2 | grep -Ei '(journal|size)'
dumpe2fs 1.41.12 (17-May-2010)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Block size: 4096
Fragment size: 4096
Flex block group size: 16
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 128M
Journal length: 32768
Journal sequence: 0x0002cb40
Journal start: 1
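
The reported journal size can be cross-checked from two other lines of the same output: 32768 journal blocks of 4096 bytes each make 134217728 bytes, which is 128 MiB. The following is a minimal sketch of that arithmetic, assuming the field names shown in the output above ("Block size" and "Journal length", with the length counted in filesystem blocks); other dumpe2fs versions may label these fields differently. The function name journal_size_mib is made up for this example.

```shell
# journal_size_mib
# Reads dumpe2fs output on stdin and prints the journal size in MiB, computed
# as "Journal length" (assumed to be in filesystem blocks) times "Block size"
# (in bytes). Fails with exit status 1 if either field is missing.
# The function name is hypothetical, chosen for this example.
journal_size_mib() {
    awk -F': *' '
        /^Block size:/     { bs = $2 }
        /^Journal length:/ { len = $2 }
        END {
            if (bs == "" || len == "") exit 1
            print (bs * len) / (1024 * 1024)
        }
    '
}
```

For example: dumpe2fs -h /dev/sda2 2>/dev/null | journal_size_mib.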

How to improve journal performance?

Journal performance can be improved by placing the journal on a separate device. The external journal partition should be located on a device with similar or better performance characteristics than the device that contains the file system. Important points to note here are:

1) the external journal will use the entire partition
2) the journal partition must be created with the same block size as the filesystem it is journalling

Let us see how to create an external journal partition. Say the filesystem lives on the partition /dev/sda1 and we want to put its journal on /dev/sdb1, a partition on a second device.

1) Get the filesystem block size of partition /dev/sda1

# dumpe2fs /dev/sda1 | grep -Ei '(journal|size)'

2) Unmount the filesystem on /dev/sda1 and remove the internal journal

Make sure a filesystem is cleanly unmounted before altering it with the tune2fs or e2fsck utilities.

# tune2fs -O ^has_journal /dev/sda1

3) Create an external journal device partition

# mke2fs -O journal_dev -b <block-size> /dev/sdb1

4) Update the /dev/sda1 filesystem superblock to use the external journal /dev/sdb1

# tune2fs -j -J device=/dev/sdb1 /dev/sda1
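
Because these commands rewrite filesystem metadata, it can help to review the exact sequence before running any of it. The following is a minimal sketch of a dry run that only prints the commands for steps 2 to 4 and executes nothing. The function name external_journal_plan is made up for this example, and the umount line is an assumption filling in the "unmount" part of step 2, for which no command is given above.

```shell
# external_journal_plan FS_DEV JOURNAL_DEV BLOCK_SIZE
# Prints, without running them, the commands for moving the journal of FS_DEV
# onto JOURNAL_DEV. BLOCK_SIZE is the filesystem block size found in step 1.
# The function name is hypothetical, chosen for this example.
external_journal_plan() {
    fs_dev=$1
    journal_dev=$2
    block_size=$3
    if [ -z "$fs_dev" ] || [ -z "$journal_dev" ] || [ -z "$block_size" ]; then
        echo "usage: external_journal_plan FS_DEV JOURNAL_DEV BLOCK_SIZE" >&2
        return 1
    fi
    # Step 2: unmount (assumed command), then drop the internal journal.
    printf 'umount %s\n' "$fs_dev"
    printf 'tune2fs -O ^has_journal %s\n' "$fs_dev"
    # Step 3: format the journal partition with the matching block size.
    printf 'mke2fs -O journal_dev -b %s %s\n' "$block_size" "$journal_dev"
    # Step 4: attach the external journal to the filesystem.
    printf 'tune2fs -j -J device=%s %s\n' "$journal_dev" "$fs_dev"
}
```

Calling external_journal_plan /dev/sda1 /dev/sdb1 4096 prints the four commands in order, so they can be checked and then run by hand as root.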

Sunday, February 24, 2013