Saturday, January 26, 2013

Disks, Tuning Sequential Disk Access, IO Scheduling


Disks are electro-mechanical devices.

Disks are managed by disk controllers. The disk controllers are connected to the processor (CPU) through a bus. Most PC-based systems use the PCI (Peripheral Component Interconnect) bus to connect peripheral devices such as hard disks and sound cards to the processor. Technically there are other buses as well. For example, the Universal Serial Bus (USB) is a way of connecting devices like cameras, scanners and printers to your computer. It uses a thin cable to connect to the devices, and many devices can share that cable simultaneously. FireWire is another bus, used today mostly for video cameras and external hard drives.

lspci is a command on Unix-like operating systems that prints detailed information about all PCI buses and devices in the system. It is based on a common portable library, libpci, which offers access to the PCI configuration space on a variety of operating systems.

$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 20)
00:04.0 SCSI storage controller: Qumranet, Inc. Virtio block device
00:05.0 SCSI storage controller: Qumranet, Inc. Virtio block device
00:06.0 SCSI storage controller: Qumranet, Inc. Virtio block device
00:07.0 RAM memory: Qumranet, Inc. Virtio memory balloon

lsusb is a similar command for USB buses and devices. To make use of all the features of this program, you need a Linux kernel that supports the /proc/bus/usb interface (e.g., Linux kernel 2.3.15 or newer).

# lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 002: ID 093a:2500 Pixart Imaging, Inc. USB Optical Mouse


Hard Disk Types


For the computer to interface with a peripheral device such as a hard disk, we have the following hard disk interface (or physical block device interface) standards:
  • IDE(ATA)
  • SATA
  • SCSI
  • SAS
  • USB
  • iSCSI+GigE
  • Fibre Channel
Why are serial devices replacing parallel devices?

Both SCSI and IDE (ATA) use a parallel interface.

A parallel interface is a channel capable of transferring data in parallel mode, that is, transmitting multiple bits simultaneously. However, parallel technologies became inefficient at higher bandwidths. Why? Although sending bits faster is desirable, the signals on all the wires must stay in step, and keeping them synchronized becomes harder as connections get faster and longer. Serial technologies, which send bits one after another, therefore handle high bandwidth better. A serial bus design is much simpler, sending 1 bit at a time over a single wire at much higher rates than parallel. By combining multiple serial data paths, even faster speeds can be realized that dramatically exceed the capabilities of traditional parallel buses.

So serial devices are replacing parallel devices, and we now have interfaces such as

  • SAS - Serial Attached SCSI
  • SATA - Serial ATA
in common use.

IO Subsystem



The I/O subsystem is a series of processes responsible for moving blocks of data between disk and memory.

In general, each I/O task performed by either the kernel or a user process is one of the following:


  • Read - Reading a block of data from disk to memory
  • Write - Writing a block of data from memory to disk


Read or write requests are transformed into block device requests that go into a queue. The I/O subsystem then batches (merges) similar requests that arrive within a specific time window and processes them all at once. Block device requests are merged when all of the following hold (the merge activity can be observed with iostat, as sketched after this list):


  1. They are the same type of operation (read or write).
  2. They belong to the same block device (i.e. they read from, or write to, the same block device).
  3. Each block device has a maximum number of sectors allowed per request; the merged request must not exceed this limit.
  4. The requests to be merged immediately follow or precede each other.
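
The merging described above can be observed with the iostat tool from the sysstat package (assuming it is installed): in its extended output, the rrqm/s and wrqm/s columns show how many read and write requests were merged into queued requests per second.

# iostat -x 2

The -x flag prints the extended per-device statistics and the trailing 2 repeats the report every two seconds.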

Which has higher priority - read or write operations?

Read requests are crucial to system performance because a process blocks until its read request is serviced. This latency directly affects the user's perception of how long a process takes to finish.

Write requests, on the other hand, are serviced in batches by the pdflush kernel threads. Since write requests do not block the issuing process (unlike read requests), they are usually given lower priority than read requests.
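
When pdflush wakes up and begins writing dirty pages back to disk is governed by the vm.dirty_* sysctls, and the kernel threads themselves can be seen in the process list. A quick sketch (the thresholds are percentages of total memory and vary between systems):

# sysctl vm.dirty_background_ratio vm.dirty_ratio
# ps -ef | grep [p]dflush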

Tuning sequential read access


Read/Write requests can be either sequential or random. The speed of sequential requests is most directly affected by the transfer speed of a disk drive. Random requests, on the other hand, are most directly affected by disk drive seek time.

Sequential read requests can take advantage of read-aheads.

While reading files, the kernel tries to take advantage of sequential disk access. Read-ahead is based on the assumption that an application reading a block of data, block A, will most likely also read the blocks adjacent to it: blocks B, C, D and so on. So the kernel reads blocks B, C and D ahead of the application and caches those pages in memory. By doing this,

  1. the kernel is able to serve the application's requests for data more quickly, and
  2. the load on the disk controller is reduced.

This results in improved response time.

But read-ahead does not make sense if the application reads random blocks of data or repeatedly re-reads the same block. The read-ahead algorithm is therefore designed to turn itself off when it detects such access patterns.

The read-ahead window controls how much data the kernel will prefetch when performing file I/O. Since the 2.6 kernel, read-ahead is managed by two internally calculated values: the current window and the ahead window. While an application is reading pages from the current window, the kernel performs I/O (reads from disk) into the ahead window. Once the application finishes reading the current window, the ahead window becomes the current window. If the application keeps reading from the new current window, the size of the next ahead window is increased by two pages; if it does not, the ahead window is gradually shrunk.

By default, how much data does the Linux kernel read ahead from a block device?

There are two ways to find out how much data the kernel reads ahead from a block device:

1) # cat /sys/block/sda/queue/read_ahead_kb
128

2) The blockdev command reports read-ahead in sectors, where a sector is always counted as 512 bytes in the 2.6 kernel:

# blockdev --getra /dev/sda
256

So the read-ahead is 256 sectors. The size is therefore 256 * 512 = 131072 bytes, which converts to 131072 / 1024 = 128 KB.
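
The same conversion can be done directly in the shell, using the value blockdev reports:

# echo $(( $(blockdev --getra /dev/sda) * 512 / 1024 ))
128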

# blockdev --report /dev/sda
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0    160041885696   /dev/sda

How to change the read-ahead value?

Say I want to change the read-ahead value from the default 128 KB to 2 MB. Since the blockdev command works in sectors:

512 bytes = 1 sector
2 MB (2*1024*1024 bytes) = 4096 sectors

# blockdev --setra 4096 /dev/sda

Check that the read-ahead value has changed by running the following commands:

# blockdev --getra /dev/sda
4096

# cat /sys/block/sda/queue/read_ahead_kb
2048

# blockdev --report /dev/sda
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw  4096   512  4096          0    160041885696   /dev/sda

To make the setting persist across reboots, add this command to /etc/rc.local.

This is equivalent to setting /sys/block/sda/queue/read_ahead_kb, except that the unit is different: blockdev works in sectors, while read_ahead_kb works in kilobytes.

But here is a catch: setting the read-ahead to a value larger than max_sectors_kb (/sys/block/sda/queue/max_sectors_kb) has no effect; the minimum of the two values is used.

In our system,

# cat /sys/block/sda/queue/max_sectors_kb
512
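
A rough way to check whether a larger read-ahead actually helps a sequential workload is to time a large read with the page cache dropped between runs. A sketch using a 1 GB scratch file (the file path and sizes are only placeholders):

# dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024

# blockdev --setra 256 /dev/sda
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/tmp/bigfile of=/dev/null bs=1M

# blockdev --setra 4096 /dev/sda
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/tmp/bigfile of=/dev/null bs=1M

Compare the throughput dd reports for the two runs; writing 3 to /proc/sys/vm/drop_caches frees the page cache so the reads are not served from memory.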

I/O Scheduling Algorithms


When scheduling I/O requests, the kernel needs to balance the following goals:
  1. Keep the disk access pattern as sequential as possible.
  2. Ensure that all processes receive I/O in a timely fashion, avoiding I/O starvation.
To manage I/O requests, there are various I/O scheduler algorithms, also called elevators, suited to different workloads. Scheduler algorithms are called "elevators" because they operate in the same manner as real building elevators. A building elevator does not visit floors in the order the requests to go up or down were issued; instead, it moves in one direction at a time, serving as many requests as it can until it reaches the highest or lowest requested floor, and then does the same in the opposite direction.


Choosing the best suited I/O elevator not only depends on the workload, but on the hardware, too. Single ATA disk systems, SSDs, RAID arrays, or network storage systems, for example, each require different tuning strategies.

The default I/O scheduler is chosen as a kernel compile-time option. However, the I/O scheduler can be changed on the fly per block device. This makes it possible to use, for example, one algorithm for the device hosting the system partition and another for the device hosting a database.

What are the different IO scheduling algorithms?

  • deadline
  • anticipatory
  • noop
  • cfq (Completely Fair Queuing)

By default the CFQ (Completely Fair Queuing) scheduler is used.

[root@dhcppc5 ~]#  grep -i "cfq" /boot/config-*
/boot/config-2.6.32-042stab068.8:CONFIG_IOSCHED_CFQ=y
/boot/config-2.6.32-042stab068.8:CONFIG_CFQ_GROUP_IOSCHED=y
/boot/config-2.6.32-042stab068.8:CONFIG_DEFAULT_CFQ=y
/boot/config-2.6.32-042stab068.8:CONFIG_DEFAULT_IOSCHED="cfq"

[root@dhcppc5 ~]# cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]

The algorithm shown in square brackets is the scheduler currently in use for that device.

To change the elevator for a specific device in the running system, run the following command:

echo <SCHEDULER> > /sys/block/<DEVICE>/queue/scheduler

E.g.:
echo deadline > /sys/block/sda/queue/scheduler
echo anticipatory > /sys/block/sda/queue/scheduler
echo noop > /sys/block/sda/queue/scheduler
echo cfq > /sys/block/sda/queue/scheduler

Deadline Scheduler

DEADLINE is a latency-oriented I/O scheduler. Each I/O request is assigned a deadline. Normally, requests are stored in queues (read and write) sorted by sector number. The DEADLINE algorithm maintains two additional queues (read and write) in which the requests are sorted by deadline. As long as no request has timed out, the "sector" queues are used. If timeouts occur, requests from the "deadline" queues are served until there are no more expired requests. Generally, the algorithm prefers reads over writes.
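
When deadline is the active elevator for a device, its behaviour can be tuned through the files under /sys/block/<DEVICE>/queue/iosched/. A sketch of inspecting the most relevant ones (the *_expire values are in milliseconds):

# cat /sys/block/sda/queue/iosched/read_expire
# cat /sys/block/sda/queue/iosched/write_expire
# cat /sys/block/sda/queue/iosched/writes_starved

read_expire and write_expire are the deadlines assigned to read and write requests, and writes_starved controls how many read batches can be served before a write batch.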

Anticipatory Scheduler

After servicing an I/O request, the anticipatory I/O scheduler waits a short time to see whether another request arrives for a disk block near the block of data that was just accessed.
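
The length of that wait is exposed as a per-device tunable (in milliseconds) while anticipatory is the selected elevator:

# cat /sys/block/sda/queue/iosched/antic_expire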

Noop Scheduler

The noop scheduler does the least work: it simply queues the I/O requests and passes them down to the disk as they arrive.

CFQ Scheduler

CFQ is a fairness-oriented scheduler. The I/O bandwidth is divided equally among all processes that are doing I/O. The algorithm assigns each thread a time slice in which it is allowed to submit I/O to the disk, so each thread gets a fair share of I/O throughput. It also allows I/O priorities to be assigned to tasks, which are taken into account during scheduling decisions.
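
Those per-task I/O priorities can be set with the ionice utility (part of util-linux) while CFQ is the active scheduler. A quick sketch (the script path and PID below are placeholders):

# ionice -c 3 /usr/local/bin/backup.sh
# ionice -c 2 -n 7 -p 1234

Class 3 (idle) only gets disk time when no other process wants it; class 2 (best-effort) takes a priority level from 0 (highest) to 7 (lowest); -p changes the priority of an already running process.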
