Ops Monkey: April 2012

Saturday, April 28, 2012

Traceroute

traceroute reports the path that packets take between two addresses, say from host A to host B. It sends packets that pass through a series of routers/gateways. Normal network transactions, like a request for a web page, do not report the path they take from host A to host B. traceroute, on the other hand, triggers a response from every router along the way. It does this by using the time to live (TTL) field of the IP header to elicit an ICMP TIME_EXCEEDED response from each machine on the path. If successful, it records the IP address of the machine and the time at which the response was received.

How does traceroute work on UNIX?

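Host A sends UDP probe packets, three per hop, toward host B. By default the destination port starts at 33434 and is incremented with each probe. The packets returned in response, however, are ICMP packets. The type of the ICMP response tells host A what happened to the probe: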

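ICMP Time Exceeded (ICMP type 11) - The responding host is not the destination. It is a router on the path where the TTL of the probe expired.
ICMP Destination Unreachable (ICMP type 3) - If it comes from host B with code 3 (port unreachable), the probe reached the destination, since nothing is listening on the probe port. With other codes (network or host unreachable), the responding host doesn't know how to get to the destination IP address in the traceroute packets.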
The trace proceeds as follows:

Host A sends three UDP packets with a TTL value of 1 to host B.
The computer/router on which the messages die because the time to live expired (somewhere between host A and host B ) sends back ICMP Time Exceeded (ICMP Type '11') responses. These messages indicate to host A that the traceroute messages have not yet reached the destination host B.
Host A receives those Time Exceeded messages, notes the time they arrived, compares that to the time the UDP packet was sent and shows the results of that round trip on the screen.
Host A increments the TTL in the IP Header by one, then repeats the previous steps (creates 3 UDP packets, sets the Time to Live to the next highest number, starts a timer, transmits the packets, waits for a response). This process is repeated until the packets reach the destination computer (host B) which host A is tracing the route to.
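When the destination computer (host B) receives the packets, no process is listening on the probe port, so it sends back an ICMP Destination Unreachable message with code "port unreachable" (ICMP type 3, code 3) and the traceroute program stops. (Variants that send ICMP Echo Request probes instead of UDP, such as traceroute -I, stop when they receive an ICMP Echo Reply, ICMP type 0.)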
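The TTL loop can be sketched as a small simulation. This is only an illustration, not the real traceroute code: no packets are sent, the path is a hypothetical list of addresses, and it assumes the UDP-probe variant in which the destination answers with ICMP Destination Unreachable (type 3). The function names are made up for this example.

```python
ICMP_DEST_UNREACHABLE = 3   # sent by the destination (port unreachable)
ICMP_TIME_EXCEEDED = 11     # sent by a router where the TTL expired


def send_probe(path, ttl):
    """Simulate one probe along `path` (non-empty list, last entry = destination).

    Each hop decrements the TTL by one; the hop where it reaches zero answers.
    Returns (responder_address, icmp_type).
    """
    hop_index = min(ttl, len(path)) - 1
    responder = path[hop_index]
    if hop_index == len(path) - 1:
        # The probe reached the destination itself.
        return responder, ICMP_DEST_UNREACHABLE
    return responder, ICMP_TIME_EXCEEDED


def trace(path, max_ttl=30):
    """Raise the TTL one hop at a time until the destination answers."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder, icmp_type = send_probe(path, ttl)
        hops.append((ttl, responder, icmp_type))
        if icmp_type == ICMP_DEST_UNREACHABLE:
            break   # destination reached, stop tracing
    return hops


if __name__ == "__main__":
    # Hypothetical path: two routers, then host B.
    for hop in trace(["10.0.0.1", "10.0.1.1", "192.0.2.9"]):
        print(hop)
```

A real implementation also needs raw sockets (and usually root privileges) to read the ICMP replies, and it times each of the three probes per hop.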

Friday, April 27, 2012

Web Operations, Linux Websites of Interest

Web Operations Guru John Allspaw's website - http://www.kitchensoap.com/
http://highscalability.com/
O'Reilly Radar Web Operations Blog - http://radar.oreilly.com/operations/
Linux HowTos - http://librenix.com
Linux command line tips - http://commandlinefu.com

Thursday, April 12, 2012

Virtual Address Space and Virtual Memory

Each process has its own Virtual Address Space, which is mapped to physical memory (RAM) by the operating system.

A virtual address does not represent the actual physical location of an object in memory; instead, the system maintains a page table for each process, which is an internal data structure used to translate virtual addresses into their corresponding physical addresses in RAM. Each time a process references an address, the system translates the virtual address to a physical address.
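The translation step can be sketched like this. It is a simplified model, assuming 4096-byte pages and a single-level page table held in a dictionary; real hardware uses multi-level tables and a TLB. The names `PAGE_SIZE` and `translate` are made up for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate(page_table, vaddr):
    """Translate a virtual address using a per-process page table.

    `page_table` maps virtual page numbers to physical frame numbers.
    """
    page_number, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table.get(page_number)
    if frame is None:
        # No mapping: a real system would raise a page fault here.
        raise LookupError("page fault at 0x%08x" % vaddr)
    return frame * PAGE_SIZE + offset


if __name__ == "__main__":
    # Hypothetical mapping: virtual page 0x08048 lives in physical frame 0x1a2b3.
    table = {0x08048: 0x1A2B3}
    print(hex(translate(table, 0x08048123)))
```

The offset within the page is carried over unchanged; only the page number is replaced by a frame number.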

Each time an application is run on an operating system (OS), the OS creates a new process and a new Virtual Address Space for this process. The virtual address space for each process is private and cannot be accessed by other processes unless it is shared. Hence, a process's memory is Virtual Address Space memory and not physical memory.

The virtual address space on a 32-bit system is 4 gigabytes (GB) in size: 2^32 bytes, with addresses running from 0 to 2^32 - 1. So when a new process is run on a 32-bit OS, it gets its own 4 GB virtual address space. Only the parts that are actually used are backed by physical memory.

The virtual memory of a process is the RAM used by the process plus the disk space used by the process. Swap space used by a process is thus part of the virtual memory of the process.

The kernel (on the x86 architecture, in the default configuration) splits the 4-GB virtual address space between user-space and the kernel space. The 4 GB virtual address space of a process is split in two parts: 3 GB and 1 GB. The lower 3 GB of the process virtual address space is accessible as the user-space virtual addresses and the upper 1 GB space is reserved for the kernel virtual addresses. This is true for all processes.
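The arithmetic behind the 3 GB / 1 GB split can be checked with a few lines. This is a sketch, assuming the default 32-bit x86 layout in which kernel addresses start at 0xC0000000; the constant and function names are chosen for this example.

```python
ADDRESS_BITS = 32
SPACE_SIZE = 2 ** ADDRESS_BITS   # 4 GB of virtual addresses
KERNEL_BASE = 0xC0000000         # assumed start of kernel space (3 GB mark)
GB = 1024 ** 3


def region(vaddr):
    """Say whether a 32-bit virtual address is in user or kernel space."""
    if not 0 <= vaddr < SPACE_SIZE:
        raise ValueError("not a 32-bit address")
    return "kernel" if vaddr >= KERNEL_BASE else "user"


if __name__ == "__main__":
    print("total :", SPACE_SIZE // GB, "GB")
    print("user  :", KERNEL_BASE // GB, "GB")
    print("kernel:", (SPACE_SIZE - KERNEL_BASE) // GB, "GB")
    print(region(0x08048000), region(0xC0100000))
```

Every address in a /proc/<pid>/maps listing for a user process on such a system falls below the kernel base.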

The virtual memory or memory map of a process includes at least the following areas:

Text - An area for program's executable code - Set of instructions from the compiler-generated executable file.
An area for the initialized data of the program (data) - Global variables, constants, static variables from the program.
An area for the uninitialized data of the program (BSS) - Uninitialized variables. These are not part of the executable file and their initial value is set to zeros.
Three areas (text, data, BSS) for each shared library
The stack area - Used by the program for variables and storage. It grows and shrinks in size depending on what routines are called and what their stack space requirements are.
Hole - This is the address space that is unallocated and unused. It does not tie up physical memory. For most processes, this is the largest portion of the virtual memory of the process.

The areas of a process can be found in /proc/<pid>/maps:

# cat /proc/1/maps
0012e000-0012f000 r-xp 00000000 00:00 0 [vdso]
00a51000-00a5c000 r-xp 00000000 fd:01 1353 /lib/libnss_files-2.14.so
00a5c000-00a5d000 r--p 0000a000 fd:01 1353 /lib/libnss_files-2.14.so
00a5d000-00a5e000 rw-p 0000b000 fd:01 1353 /lib/libnss_files-2.14.so
08048000-08108000 r-xp 00000000 fd:01 2254 /bin/systemd
08108000-0810d000 rw-p 000bf000 fd:01 2254 /bin/systemd
08b67000-096ab000 rw-p 00000000 00:00 0 [heap]
4108f000-4109c000 r-xp 00000000 fd:01 1313 /lib/libpam.so.0.83.1
4109c000-4109d000 rw-p 0000c000 fd:01 1313 /lib/libpam.so.0.83.1
4aa94000-4aab1000 r-xp 00000000 fd:01 1325 /lib/ld-2.14.so
4aab1000-4aab2000 r--p 0001d000 fd:01 1325 /lib/ld-2.14.so
4aab2000-4aab3000 rw-p 0001e000 fd:01 1325 /lib/ld-2.14.so
4aab5000-4ac3a000 r-xp 00000000 fd:01 1327 /lib/libc-2.14.so
4ac3a000-4ac3b000 ---p 00185000 fd:01 1327 /lib/libc-2.14.so
4ac3b000-4ac3d000 r--p 00185000 fd:01 1327 /lib/libc-2.14.so
4ac3d000-4ac3e000 rw-p 00187000 fd:01 1327 /lib/libc-2.14.so
4ac3e000-4ac41000 rw-p 00000000 00:00 0
4ac43000-4ac59000 r-xp 00000000 fd:01 1336 /lib/libpthread-2.14.so
4ac59000-4ac5a000 r--p 00015000 fd:01 1336 /lib/libpthread-2.14.so
4ac5a000-4ac5b000 rw-p 00016000 fd:01 1336 /lib/libpthread-2.14.so
4ac5b000-4ac5d000 rw-p 00000000 00:00 0
4ac5f000-4ac62000 r-xp 00000000 fd:01 1360 /lib/libdl-2.14.so
4ac62000-4ac63000 r--p 00002000 fd:01 1360 /lib/libdl-2.14.so
4ac63000-4ac64000 rw-p 00003000 fd:01 1360 /lib/libdl-2.14.so
4ac66000-4ac6d000 r-xp 00000000 fd:01 1346 /lib/librt-2.14.so

(Address range in virtual memory) (Permissions) (Offset) (Device#) (Inode) (File name)

Address range in virtual memory - The beginning and ending virtual addresses for this memory area.
Permissions - A bit mask with the memory area's read, write, and execute permissions. This field describes what the process is allowed to do with pages belonging to the area. The last character in the field is either p for "private" or s for "shared."
Offset - Where the memory area begins in the file that it is mapped to. An offset of 0 means that the beginning of the memory area corresponds to the beginning of the file.
Device # - It consists of the major and minor numbers of the device holding the file that has been mapped. Confusingly, for device mappings, the major and minor numbers refer to the disk partition holding the device special file that was opened by the user, and not the device itself.
Inode - The inode number of the mapped file.
Filename - The name of the file (usually an executable image) that has been mapped.
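The six fields can be pulled apart with a short parser. This is a minimal sketch based only on the field layout described above; `MapEntry` and `parse_maps_line` are names invented for the example.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    start: int     # beginning virtual address
    end: int       # ending virtual address
    perms: str     # e.g. "r-xp"
    offset: int    # offset into the mapped file
    device: str    # "major:minor"
    inode: int     # inode number of the mapped file (0 if none)
    path: str      # file name, or "" for anonymous areas


def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps into a MapEntry."""
    fields = line.split(None, 5)
    addr, perms, offset, device, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in addr.split("-"))
    return MapEntry(start, end, perms, int(offset, 16), device, int(inode), path)


if __name__ == "__main__":
    entry = parse_maps_line(
        "08048000-08108000 r-xp 00000000 fd:01 2254 /bin/systemd")
    print(entry.path, entry.perms, (entry.end - entry.start) // 1024, "KB")
```

On a Linux system the same function can be applied to each line of `open("/proc/self/maps")` to list the areas of the running Python process.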

Wednesday, April 11, 2012

What is a filesystem?

A physical disk can be partitioned into several filesystems. Partitioning a disk into several file systems makes it easier for administrators to manage data.

The kernel deals with filesystems at a logical level rather than with physical disks, treating each one as a logical device identified by a logical device number. The conversion between logical device (file system) addresses and physical device (disk) addresses is done by the disk driver.

A file system consists of a sequence of logical blocks, each containing 512, 1024, 2048 or any multiple of 512 bytes. The size of a logical block is homogeneous within a file system but may vary between different file systems created on a disk. Using large blocks increases the effective data transfer rate between disk and memory, as the kernel can transfer more data per disk operation. But at the same time, using large logical blocks causes the effective storage capacity to drop, because the unused tail of the last block of each file is wasted.
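The block size trade-off is easy to see with a little arithmetic. The sketch below is illustrative only; the function name and the sample file size are made up.

```python
def blocks_and_slack(file_size, block_size):
    """Return (blocks needed, bytes wasted in the last block) for a file."""
    blocks = -(-file_size // block_size)        # ceiling division
    slack = blocks * block_size - file_size
    return blocks, slack


if __name__ == "__main__":
    # A hypothetical 1000-byte file stored with different block sizes.
    for block_size in (512, 1024, 2048, 4096):
        blocks, slack = blocks_and_slack(1000, block_size)
        print(block_size, "byte blocks:", blocks, "block(s),", slack, "bytes wasted")
```

Larger blocks mean fewer disk operations per file but more wasted space, especially when the file system holds many small files.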

A filesystem has the following structure:

1) Boot Block:

This occupies the beginning of a file system, typically the first sector, and may contain bootstrap code, which is used for booting (initializing the operating system). Although only one boot block is needed to boot the system, every filesystem has a (possibly empty) boot block.

2) Super Block:

It describes the state of file system, like, how large it is, how many files it can store, where to find free space on the file system, etc.

3) Inode list:

It is a list of inodes in the file system. The kernel references the inodes by index into the inode list. One inode is the root inode of the filesystem: it is the inode by which the directory structure of the file system is accessible after execution of the mount system call.

4) Data Block:

The inode list is followed by data blocks. The data block contains the actual file data and administrative data. An allocated data block can belong to one and only one file in the file system.

Layout of ext2/ext3 filesystem

The ext2/ext3 file system organizes a formatted disk partition into a series of block groups that have identical structure. The block groups help in reducing file fragmentation. When the kernel allocates data blocks for a file, it tries to allocate blocks from the same block group. Also, each block group contains its own inode table, because keeping an inode table close to the data blocks reduces seek time.

What does each block group contain?

superblock - block group 0 contains the primary superblock; other block groups contain backup copies. When the sparse_super feature is enabled, backups are kept only in group 1 and in groups whose number is a power of 3, 5 or 7.

group descriptors - information about other structures in the block group.

data block bitmap - one bit per block, set if the block is in use and unset if it is free. The bitmap occupies a single block, which limits the overall size of a block group. E.g. with a 4 KB (4096 bytes) block size, the size of a block group is limited to 4096*8 = 32768 blocks (128 MB).

inode bitmap - set/unset bits for each inode in-use/free.

inode table - Space for the inodes themselves. Each inode is 128 bytes in size in the original ext2 layout (newer filesystems may use larger inodes), so 8 inodes fit in each KB (1024/128) of the inode table.
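The size limits in this list follow directly from the block size. Here is a small sketch of the arithmetic, assuming a one-block bitmap and 128-byte inodes as described above; the function name is made up.

```python
def block_group_geometry(block_size, inode_size=128):
    """Return (blocks per group, bytes per group, inodes per table block).

    Assumes the block bitmap occupies exactly one block, so it can
    track block_size * 8 blocks (one bit per block).
    """
    blocks_per_group = block_size * 8
    group_bytes = blocks_per_group * block_size
    inodes_per_block = block_size // inode_size
    return blocks_per_group, group_bytes, inodes_per_block


if __name__ == "__main__":
    for block_size in (1024, 2048, 4096):
        blocks, size, inodes = block_group_geometry(block_size)
        print(block_size, "->", blocks, "blocks,",
              size // (1024 * 1024), "MB per group,",
              inodes, "inodes per inode-table block")
```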

What does each block in a block group contain?

Each block in a block group contains one of the following pieces of information:

A copy of the filesystem's superblock
A copy of the group of block group descriptors
A data block bitmap
An inode bitmap
A table of inodes
A chunk of data that belongs to a file

If a block does not contain any meaningful information, it is said to be free.

Filesystem SuperBlock

The superblock describes the state of the file system: how large it is, how many files it can store, where to find free space on the file system, etc.

The super block consists of the following fields:

size of the file system
number of free blocks in the file system
a list of free blocks available on the file system
index of the next free block available in the free block list
the size of inode list
the number of free inodes in the filesystem
a list of free inodes in the file system
the index of the next free inode in the free inode list
lock fields for free blocks and free inode lists
a flag indicating that the super block has been modified

We usually create a filesystem on every partition we make on the hard disk. Linux maintains multiple redundant copies of the superblock in every file system. In emergency situations, the backup copies can be used to restore a damaged primary superblock.

For example, to list the superblock locations for the partition /dev/sda1:

# dumpe2fs /dev/sda1 | grep -i superblock
dumpe2fs 1.41.14 (22-Dec-2010)
Primary superblock at 1, Group descriptors at 2-3
Backup superblock at 8193, Group descriptors at 8194-8195
Backup superblock at 24577, Group descriptors at 24578-24579
Backup superblock at 40961, Group descriptors at 40962-40963
Backup superblock at 57345, Group descriptors at 57346-57347
Backup superblock at 73729, Group descriptors at 73730-73731
Backup superblock at 204801, Group descriptors at 204802-204803
Backup superblock at 221185, Group descriptors at 221186-221187
Backup superblock at 401409, Group descriptors at 401410-401411
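The block numbers in this output are not random. They are consistent with a 1 KB block size (8192 blocks per group, first data block 1) and the sparse_super layout, where backups live in group 1 and in groups whose number is a power of 3, 5 or 7. The sketch below reproduces them; the group count of 50 is an assumption, since the output above does not show it, and the function names are made up.

```python
def backup_groups(num_groups):
    """Block groups holding backup superblocks under sparse_super."""
    groups = set()
    if num_groups > 1:
        groups.add(1)
    for base in (3, 5, 7):
        power = base
        while power < num_groups:
            groups.add(power)
            power *= base
    return sorted(groups)


def backup_superblock_blocks(num_groups, blocks_per_group, first_data_block):
    """Block number of each backup superblock."""
    return [group * blocks_per_group + first_data_block
            for group in backup_groups(num_groups)]


if __name__ == "__main__":
    # Assumed geometry: 1 KB blocks, 8192 blocks per group, 50 groups.
    print(backup_superblock_blocks(50, 8192, 1))
```

With this geometry the function returns the same eight block numbers that dumpe2fs lists as backup superblocks.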

Inode

Inode means index node.

An inode is the internal representation of a file. We identify a file by its name, while the kernel identifies the file by its inode number. Every file has one inode, but it may have several names, all of which map to the same inode. Each name is called a link. When a process refers to a file by name, the kernel retrieves the inode for that file. When a new file is created, the kernel assigns it an unused inode.

Inodes exist in a static form on disk, and the kernel reads them into an in-core (in-memory) inode to manipulate them.

Disk inodes consist of the following fields:

File owner identifier. Ownership is divided between an individual owner and a "group" owner and defines the set of users who have access rights to a file. The superuser has access rights to all files in the system.
File type. Files may be of type regular, directory, character or block special, or FIFO (pipes).
File access permissions.
File access times - when the file was last modified, when it was last accessed, and when the inode was last modified.
Number of links to the file, representing the number of names for the file.
File data distribution on the physical disk. Although users treat data in a file as a logical stream of bytes, the kernel saves the data in discontiguous disk blocks. The inode identifies the disk blocks that contain the file data.
File size
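Most of these fields can be inspected from user space with stat. The sketch below creates a temporary file with two names (hard links) and shows that both names lead to the same inode. It assumes a filesystem that supports hard links; the function and key names are made up for the example.

```python
import os
import stat
import tempfile


def inode_summary(path):
    """Collect the inode fields that os.stat exposes for a path."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "owner": (st.st_uid, st.st_gid),          # user and group owner
        "is_regular": stat.S_ISREG(st.st_mode),   # file type
        "permissions": stat.S_IMODE(st.st_mode),  # access permissions
        "links": st.st_nlink,                     # number of names
        "size": st.st_size,
    }


def demo_hard_link():
    """Create one file with two names and summarize both names."""
    with tempfile.TemporaryDirectory() as directory:
        first = os.path.join(directory, "first")
        second = os.path.join(directory, "second")
        with open(first, "w") as handle:
            handle.write("hello")
        os.link(first, second)   # second name for the same inode
        return inode_summary(first), inode_summary(second)


if __name__ == "__main__":
    a, b = demo_hard_link()
    print(a)
    print("same inode:", a["inode"] == b["inode"])
```

On the command line, `ls -i` and `stat` show the same information.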

Distinction between writing the contents of an inode to disk and writing the contents of a file to disk

The contents of a file change only when it is written. The contents of an inode change when the contents of the file change or when its owner, permission or link settings change. Changing the contents of a file automatically implies a change to the inode, but changing the inode does not imply that the contents of the file change.
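The distinction can be observed with chmod, which changes the inode but not the file data. This is a small sketch using a temporary file; the function name and the fixed timestamp are arbitrary choices for the example.

```python
import os
import stat
import tempfile


def chmod_changes_inode_only():
    """Change permissions and report what did and did not change."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "file")
        with open(path, "w") as handle:
            handle.write("data")
        os.chmod(path, 0o644)
        # Pin the modification time to a known, old value.
        os.utime(path, (1_000_000_000, 1_000_000_000))
        before = os.stat(path)
        os.chmod(path, 0o600)          # inode change only
        after = os.stat(path)
        with open(path) as handle:
            content = handle.read()
        return {
            "content": content,
            "mode_before": stat.S_IMODE(before.st_mode),
            "mode_after": stat.S_IMODE(after.st_mode),
            "mtime_unchanged": before.st_mtime == after.st_mtime,
            "ctime_not_earlier": after.st_ctime >= before.st_ctime,
        }


if __name__ == "__main__":
    print(chmod_changes_inode_only())
```

The modification time (file data) stays the same, while the permissions stored in the inode change.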

What does the in-core (memory) copy of the inode contain?

The in-core copy has the following details in addition to the fields of the disk inode:

Status of in-core inode, indicating if

inode is locked
process is waiting for inode to get unlocked
the in-core representation of file differs from disk copy as a result of change to file data.

The logical device number of the filesystem that contains the file
The inode number. Since inodes are stored in a linear array on disk, the kernel identifies the number of a disk inode by its position in the array. The disk inode does not need this field.
Pointers to other in-core inodes
A reference count, indicating the number of instances of that file that are active (such as when opened). An inode is active when a process allocates it, such as by opening a file. An inode is on the free list only if the reference count is 0, meaning that the kernel can reallocate the in-core inode to another disk inode. The free list of inodes thus serves as a cache of inactive inodes. If a process attempts to access a file whose inode is not in the in-core inode pool, the kernel allocates an in-core inode from the free list for its use.
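The interplay of reference counts and the free list can be sketched as a tiny cache. This is a toy model, not kernel code: locking and disk I/O are left out, and the class and method names (`InodeCache`, `iget`, `iput`) are illustrative.

```python
from collections import OrderedDict


class InCoreInode:
    def __init__(self, dev, ino):
        self.dev = dev          # logical device number
        self.ino = ino          # inode number
        self.ref_count = 0      # active references


class InodeCache:
    def __init__(self, size):
        self.size = size
        self.table = {}                  # (dev, ino) -> InCoreInode
        self.free_list = OrderedDict()   # inactive inodes, oldest first

    def iget(self, dev, ino):
        """Return the in-core inode for (dev, ino), loading it if needed."""
        key = (dev, ino)
        inode = self.table.get(key)
        if inode is None:
            if len(self.table) >= self.size:
                if not self.free_list:
                    raise RuntimeError("no free in-core inodes")
                # Reuse the oldest inactive inode for a new disk inode.
                old_key, _ = self.free_list.popitem(last=False)
                del self.table[old_key]
            inode = InCoreInode(dev, ino)   # stands in for a disk read
            self.table[key] = inode
        else:
            # Cache hit: an inactive inode leaves the free list.
            self.free_list.pop(key, None)
        inode.ref_count += 1
        return inode

    def iput(self, inode):
        """Drop a reference; inactive inodes go onto the free list."""
        inode.ref_count -= 1
        if inode.ref_count == 0:
            self.free_list[(inode.dev, inode.ino)] = inode


if __name__ == "__main__":
    cache = InodeCache(size=2)
    a = cache.iget(1, 10)
    cache.iput(a)
    print(cache.iget(1, 10) is a)   # found again via the free list
```

An inode with a reference count of 0 stays cached until its slot is needed, so reopening a recently closed file avoids a disk read.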