Wednesday, May 30, 2012

Friday, May 18, 2012

Add/Remove more swap space in Linux


In Linux, swap space can be a disk partition or a file. Unlike a dedicated swap partition, a swap file can be resized on the fly and is easier to remove. Note that Btrfs did not support swap files at the time of writing; support was added later, in Linux 5.0, with restrictions.

Let us see how to create multiple swap files (not swap partitions) in this article.

To check the existing swap space in the system:
# swapon -s
Filename                                Type            Size    Used    Priority
/dev/sda3                               partition       5119992 0       -1

Let us create multiple swap file spaces, each of size around 128MB.

The dd command is a convenient way to do this. The dd parameters used here are:
if = input file
of = output file
bs = block size (here, 1000 bytes)
count = number of blocks to copy. The output file size is bs multiplied by count, so 1000 bytes x 128000 blocks gives 128,000,000 bytes (128 MB).

First create container files
# dd if=/dev/zero of=/opt/swp/swp1 bs=1000 count=128000
# dd if=/dev/zero of=/opt/swp/swp2 bs=1000 count=128000
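
As a quick check of the arithmetic, the file size is simply bs multiplied by count. The helper below is only a sketch; the name dd_size is made up for this example and is not part of dd. It prints the size in bytes and in 4096-byte pages:

```shell
# Hypothetical helper: size of a dd output file, given bs (bytes) and count.
dd_size() {
    bs=$1
    count=$2
    bytes=$((bs * count))
    # 4096-byte pages, the page size that file(1) reports for the swap files here
    pages=$((bytes / 4096))
    echo "$bytes bytes, $pages pages"
}

dd_size 1000 128000    # prints: 128000000 bytes, 31250 pages
```

For a file of exactly 128 MiB, bs=1024 count=131072 could be used instead.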

Let us see what kind of file we have created
# file swp2 swp1
swp2: data
swp1: data

Format as swap
# mkswap /opt/swp/swp1
# mkswap /opt/swp/swp2

Let us check the file type again
# file swp1 swp2
swp1: Linux/i386 swap file (new style) 1 (4K pages) size 31249 pages
swp2: Linux/i386 swap file (new style) 1 (4K pages) size 31249 pages

Adding the newly created swap spaces to startup:


Currently just a single swap partition exists in /etc/fstab

# grep swap /etc/fstab
UUID=fa0d3137-3f0d-48f6-80bb-abed07c79ee0 swap                    swap    defaults        0 0

Replace the entry for the swap partition with these three lines in /etc/fstab
UUID=fa0d3137-3f0d-48f6-80bb-abed07c79ee0 swap                    swap    defaults        0 0
/opt/swp/swp1 swap swap defaults 0 0
/opt/swp/swp2 swap swap defaults 0 0

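In this case no priority is specified, so the kernel assigns each swap area a successively lower negative priority in the order it is activated (-1, -2, -3), and the areas are used one after another. Pages are allocated on a round-robin basis only between swap areas that share the same priority.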

To assign an explicit priority to each swap space, write the entries in /etc/fstab as follows
UUID=fa0d3137-3f0d-48f6-80bb-abed07c79ee0 swap                    swap    sw,pri=3        0 0
/opt/swp/swp1 swap swap sw,pri=2 0 0
/opt/swp/swp2 swap swap sw,pri=1 0 0

This configuration prompts the kernel to use the swap space with the highest priority first.
Here the highest priority is 3, so the swap partition is used first. A user-assigned priority can range from 0 (lowest) to 32767 (highest). When that swap space fills up, the kernel starts using /opt/swp/swp1, and then /opt/swp/swp2 after that.
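
To see the order in which the kernel would work through the configured swap spaces, the pri= options can be read back from the file. This is a rough sketch: swap_order is a made-up name, and printing -1 for entries without a pri= option is just a placeholder I chose for sorting, since the kernel assigns the real negative values itself.

```shell
# Hypothetical helper: list the swap entries of an fstab-style file,
# highest pri= value first.
swap_order() {
    awk '$1 !~ /^#/ && $3 == "swap" {
        pri = -1                              # placeholder when no pri= is given
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^pri=/) pri = substr(opts[i], 5)
        print pri, $1
    }' "$1" | sort -k1,1nr
}
```

Running swap_order /etc/fstab with the entries above should list the UUID entry first, then swp1, then swp2.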

The entries in /etc/fstab take effect at the next boot. To enable the newly created swap spaces immediately (without rebooting):
# swapon /opt/swp/swp1
# swapon /opt/swp/swp2

Check the status of swap
# swapon -s
Filename                                Type            Size    Used    Priority
/dev/sda3                               partition       5119992 0       -1
/opt/swp/swp1                           file            124992  0       -2
/opt/swp/swp2                           file            124992  0       -3

After rebooting, the entries in /etc/fstab take effect. Since I have given a different priority to each swap space in /etc/fstab, the output after reboot is:
# swapon -s
Filename                                Type            Size    Used    Priority
/dev/sda3                               partition       5119992 0       3
/opt/swp/swp1                           file            124992  0       2
/opt/swp/swp2                           file            124992  0       1

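Remove swap file:

# swapoff /opt/swp/swp2
# rm /opt/swp/swp2

Also remove the corresponding line from /etc/fstab, so that the system does not try to activate the missing file at the next boot.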

MiB, GiB, MB, GB

MiB stands for MebiByte
MB stands for MegaByte

1 MiB ~ 1.05 MB

GiB stands for GibiByte
GB stands for GigaByte

1 GiB ~ 1.074 GB


1 kilobyte(kB) = 10^3 bytes
1 megabyte(MB) = 10^6 bytes
1 gigabyte(GB) = 10^9 bytes
1 terabyte(TB) = 10^12 bytes
1 petabyte(PB) = 10^15 bytes
1 exabyte(EB) = 10^18 bytes
1 zettabyte(ZB) = 10^21 bytes
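
The binary units are powers of two (1 MiB = 2^20 bytes, 1 GiB = 2^30 bytes), so the approximate ratios above can be reproduced with a little awk. The function names below are made up for this example:

```shell
# Hypothetical helpers: convert binary units (MiB, GiB) to decimal units (MB, GB).
mib_to_mb() {
    awk -v n="$1" 'BEGIN { printf "%.3f\n", n * 1024 * 1024 / 1000000 }'
}
gib_to_gb() {
    awk -v n="$1" 'BEGIN { printf "%.3f\n", n * 1024 * 1024 * 1024 / 1000000000 }'
}

mib_to_mb 1    # prints 1.049
gib_to_gb 1    # prints 1.074
```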

Thursday, May 17, 2012

Swapping and Paging


To execute a process, at least part of it must reside in primary memory (RAM). The CPU cannot execute a process that exists entirely in secondary memory.

There are 2 memory management policies

  1. Swapping
  2. Paging

Swapping:

Swapping refers to writing an entire process, not just part of it, from memory to disk, or moving an entire process from the swap device into main memory for execution. The process size must be less than or equal to the available main memory. Swapping is easier to implement, but a swapping system handles memory less flexibly than a paging system.

Paging:
Paging refers to writing portions of a process, termed pages, from memory to disk, or moving only the required pages from the swap device into main memory. The process size is not limited by the available main memory.

  • It provides greater flexibility in mapping the virtual address space into the physical memory of the machine.
  • It allows more processes to fit in main memory simultaneously.
  • It allows a process to be larger than the available physical memory.


Wednesday, May 16, 2012

Swap Space, thrashing

Swap space is a part of secondary memory (hard disk) that is used as an extension of RAM, so that the effective size of usable memory grows correspondingly. Virtual memory is thus the combination of RAM and swap space.

In Linux, a swap space can be a partition or a file.

So what is special about swap space compared to other parts of hard disk?

Swap space is part of the hard disk, but no filesystem is written on it. Two things differentiate swap space from the filesystems that make up the rest of the disk:
  1. Space allocation scheme
  2. Data structures that catalog free space
Space allocation scheme -> For files, the kernel usually allocates space one block at a time. This reduces fragmentation and hence the amount of unallocatable space in the filesystem. In a swap space, however, the kernel allocates space in groups of contiguous blocks. Since speed is critical and the system can do I/O faster in one multiblock operation than in several single-block operations, the kernel allocates contiguous space on the swap device without regard for fragmentation.

Free space data structure -> For filesystems, the kernel maintains the free space in a linked list of free blocks, accessible from the filesystem super block. For a swap space, the kernel maintains the free space in an in-core table called a map. A map is an array where each entry consists of the address of an allocatable resource and the number of resource units available there.
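
To make the map idea concrete, here is a small first-fit allocator over such a table. It is only an illustration: the map is kept in a text file with one "address units" entry per line, which is my own representation and not the kernel's in-core layout, and map_alloc is a made-up name.

```shell
# Sketch of first-fit allocation from a free-space map.
# Usage: map_alloc MAPFILE UNITS  -> prints the allocated address, updates MAPFILE
map_alloc() {
    mapfile=$1
    want=$2
    # First fit: the first entry with at least $want units available.
    addr=$(awk -v want="$want" '$2 >= want { print $1; exit }' "$mapfile")
    [ -n "$addr" ] || return 1          # no entry is large enough
    # Shrink the chosen entry from the front; drop it if it becomes empty.
    awk -v want="$want" -v addr="$addr" '
        $1 == addr && !used {
            used = 1
            if ($2 > want) print $1 + want, $2 - want
            next
        }
        { print }
    ' "$mapfile" > "$mapfile.tmp" && mv "$mapfile.tmp" "$mapfile"
    echo "$addr"
}
```

Starting from a map with the single entry "1 10000", allocating 100 units returns address 1 and leaves "101 9900" in the map; allocating another 50 returns 101 and leaves "151 9850".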

Are multiple swap spaces allowed?

Multiple swap devices are allowed. When several swap devices are available, the kernel chooses among them in a round-robin scheme, provided the chosen device contains enough contiguous space. Administrators can create and remove swap devices dynamically.

In the Linux 2.6 kernel, a maximum of 32 swap areas can exist in a system (see man mkswap).

If there are multiple disks, set up a swap partition on each disk and give them all the same priority with the pri option. The kswapd daemon will round-robin across the partitions, improving performance.

Do zombie processes get swapped out?

Zombie processes are not swapped out, because they do not use any physical memory.

Recommended size for swap space

Red Hat's recommendation for the size of swap space is

Amount of RAM in the System     Recommended Amount of Swap Space
  • 4GB of RAM or less              a minimum of 2GB of swap space
  • 4GB to 16GB of RAM              a minimum of 4GB of swap space
  • 16GB to 64GB of RAM             a minimum of 8GB of swap space
  • 64GB to 256GB of RAM            a minimum of 16GB of swap space
  • 256GB to 512GB of RAM           a minimum of 32GB of swap space
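
The table can be turned into a small lookup. This is only a sketch: the function name is made up, RAM is taken as a whole number of GB, and since the table does not say which row a boundary value such as 16GB belongs to, I have assumed each upper bound is inclusive.

```shell
# Hypothetical helper: minimum recommended swap (GB) for a given amount of RAM (GB),
# following the table above. Upper bounds are treated as inclusive (an assumption).
recommended_swap_gb() {
    ram_gb=$1
    if   [ "$ram_gb" -le 4 ];   then echo 2
    elif [ "$ram_gb" -le 16 ];  then echo 4
    elif [ "$ram_gb" -le 64 ];  then echo 8
    elif [ "$ram_gb" -le 256 ]; then echo 16
    else                             echo 32   # the table stops at 512GB of RAM
    fi
}

recommended_swap_gb 8     # prints 4
```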

Rule of thumb for swap space

The 2.2 kernel rule of 2x swap is dead. The recommended rule of thumb is as follows
  • Batch Servers       :  4X RAM
  • Database Server    :  <= 1GiB RAM
  • Application Server :  0.5X RAM
  • RAM 1-2 GiB       :  1.5X RAM
  • RAM 2-8 GiB       :  Same size as RAM
  • RAM  > 8 GiB      :  0.75X RAM

What factors decide the size of swap space?
  1. Full core dumps. If there isn't enough swap to handle a full core dump, you might not be able to diagnose certain system panics. Some operating systems use your swap space to dump their core when the system panics. You (or the OS's developers) can use that core dump to diagnose why. 
  2. Core-dump metadata. There is also sometimes a small amount of metadata that goes along with the core dump. Adding an extra 1M to the swap size will cover this.
  3. Preparing in advance for RAM upgrades. Systems without maximum RAM installed may be upgraded in the future. Set up swap to be ready for this.

What is swapping?

The unmapping of page frames from an active process is called swapping.

- Swap-out : page frames are unmapped and placed in page slots on a swap device.
- Swap-in  : page frames are read in from page slots on a swap device and mapped into the process address space.

Is using swap space bad? Thrashing

Using swap space is not actually bad. 

When pages are written to disk, the event is called a page-out, and when pages are returned to physical memory, the event is called a page-in. A page fault occurs when the kernel needs a page, finds it doesn't exist in physical memory because it has been paged-out, and re-reads it in from disk. 

Page-ins are common, normal and are not a cause for concern. For example, when an application first starts up, its executable image and data are paged-in. This is normal behavior.

Page-outs, however, can be a sign of trouble. When the kernel detects that memory is running low, it attempts to free up memory by paging out. Though this may happen briefly from time to time, if page-outs are plentiful and constant, the kernel can reach a point where it spends more time managing paging activity than running applications, and system performance suffers. This woeful state is referred to as thrashing: the system spends more time moving pages into and out of process working sets than doing useful work. Processes keep referencing pages that are not in memory, and so spend more time waiting for I/O than getting work done.

So the problem is not the use of swap space itself, but intense paging activity.

How to find intense paging activity - Thrashing?

The vmstat command reports virtual memory statistics. With this tool we can observe page-ins and page-outs as they happen.
The most important columns in vmstat output for determining paging activity are free, si and so. The free column shows the amount of free memory, si shows the amount of memory swapped in from disk per second (page-ins), and so shows the amount of memory swapped out to disk per second (page-outs). If the so column stays at zero, there is little paging activity. However, if the so column shows nonzero values and the free column keeps fluctuating, there is not enough physical memory and the kernel is paging out. The processes that use the most memory can then be identified with top and ps.
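
Watching the so column by eye gets tedious, so here is a rough filter that prints only the samples where the kernel is paging out. The name paging_out is made up; the column positions are looked up from the vmstat header line rather than hard-coded.

```shell
# Hypothetical helper: read vmstat output on stdin and print the samples with so > 0.
paging_out() {
    awk '
        $1 == "r" {                       # header line carrying the column names
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("so" in col) && $1 ~ /^[0-9]+$/ && $(col["so"]) > 0 {
            print "free=" $(col["free"]) " si=" $(col["si"]) " so=" $(col["so"])
        }
    '
}
```

It would be used as vmstat 2 | paging_out. If nothing is printed, no page-outs were seen in the sampled intervals.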

Displaying Swap space details in Linux

There are many ways to display the swap space details in Linux

# swapon -s
Filename                                Type            Size    Used    Priority
/dev/sda3                               partition       5119992 0       -1

# cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/sda3                               partition       5119992 0       -1

# free -k | grep Swap | awk '{ print $1,$2 }'
Swap: 5119992

# dmesg | grep -i "swap"
Adding 5119992k swap on /dev/sda3.  Priority:-1 extents:1 across:5119992k

How much memory and swap space does each process use? smaps


Standard tools like top or ps do not show how much memory and swap space each process is consuming. With the smaps interface, introduced in kernel 2.6.14, it is possible to get the exact amount of memory and swap space used by a process. It is found at /proc/<pid>/smaps

The blog post at http://northernmost.org/blog/find-out-what-is-using-your-swap/ has a bash script that prints all running processes and their swap usage


#!/bin/bash
# Get current swap usage for all running processes
# Erik Ljungstrom 27/05/2011
SUM=0
OVERALL=0

# One /proc/<pid> directory per running process
for DIR in /proc/[0-9]*
do
         PID=${DIR#/proc/}
         PROGNAME=$(ps -p "$PID" -o comm --no-headers)

         # Match only "Swap:" lines; newer kernels also report "SwapPss:"
         for SWAP in $(grep '^Swap:' "$DIR/smaps" 2>/dev/null | awk '{ print $2 }')
         do
                SUM=$((SUM + SWAP))
         done

         echo "PID=$PID - Swap used: $SUM - ($PROGNAME )"
         OVERALL=$((OVERALL + SUM))
         SUM=0
done
echo "Overall swap used: $OVERALL"

Run this script as root user.


To find the process with most swap used, just run the script like so:
$ ./getswap.sh | sort -n -k 5

To leave out processes that are not using swap at all
$ ./getswap.sh | grep -v "Swap used: 0" | sort -n -k 5
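
For a single process, the per-mapping Swap values in smaps can also be added up in one awk pass. This is a minimal sketch with a made-up function name; the values are in kB, as in the smaps file itself.

```shell
# Hypothetical helper: total swap (kB) recorded in one smaps-format file.
# Only "Swap:" lines are summed, so "SwapPss:" lines on newer kernels are ignored.
smaps_swap_kb() {
    awk '$1 == "Swap:" { sum += $2 } END { print sum + 0 }' "$1"
}
```

For example, smaps_swap_kb /proc/self/smaps prints the swap used by the current shell. On kernels that provide it, the VmSwap line in /proc/<pid>/status gives the same kind of per-process total directly.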


What pages get swapped?


Basically there are 4 types of memory pages

  • Kernel pages - Pages holding the program contents of the kernel itself. Fixed in memory and are never moved
  • Program pages - Pages storing the contents of programs and libraries. These are read-only, so no updates to disk are needed.
  • File-backed pages - Pages storing the contents of files on disk. If this page has been changed in memory it will eventually need to be written out to disk to synchronize the changes
  • Anonymous pages - Pages not backed by anything on disk. When a program requests memory to be allocated to perform computations or record information, the information resides in anonymous pages

The pages that get swapped are
  • Inactive Pages
  • Anonymous Pages

Tuning Swappiness

The following sysctls help in tuning the virtual memory

1) vm.swappiness

Swapping inactive pages means searching for inactive pages and unmapping them, so it consumes more CPU and disk resources than writing anonymous pages to disk.
To swap out inactive pages (memory pages with no active references), the kernel has no option but to walk the entire memory space, which is quite expensive with large memory sizes.

To make the kernel prefer swapping anonymous pages over inactive pages, vm.swappiness is used. Increasing swappiness tells the kernel to swap out anonymous pages (memory pages that are not backed by files, e.g. process stacks, buffers, etc.), which is a much cheaper operation because the kernel can look at the page tables to determine where these pages are in memory.

The kernel will prefer to swap anonymous pages when:

% of memory mapped in page tables + swappiness >= 100

The default value of vm.swappiness is 60.

# sysctl vm.swappiness
vm.swappiness = 60


The higher the vm.swappiness value, the more the system will swap. At 100, the condition above is always met, so the kernel always prefers to swap out anonymous pages.
A high swappiness value means that the kernel will be more apt to unmap mapped pages. A low swappiness value means the opposite: the kernel will be less apt to unmap mapped pages.
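
The rule quoted above is easy to play with in the shell. The function below is only a sketch of that rule as stated in this article, not the kernel's actual reclaim code, and prefers_swap is a made-up name. Both arguments are whole numbers.

```shell
# Sketch of the rule above: anonymous pages are preferred for swap-out
# when mapped_percent + swappiness >= 100.
prefers_swap() {
    mapped_percent=$1
    swappiness=$2
    [ $((mapped_percent + swappiness)) -ge 100 ]
}

# With the default swappiness of 60, the threshold is 40% mapped memory.
if prefers_swap 40 60; then
    echo "swap anonymous pages"
else
    echo "reclaim cache first"
fi
```

The value itself can be changed at runtime with sysctl -w vm.swappiness=<value>, and made persistent by adding a vm.swappiness line to /etc/sysctl.conf.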

2) vm.swap_token_timeout

This controls how long a process is protected from paging when the system is thrashing. It is measured in seconds.

Buffer and Cache in free and vmstat commands


Buffer - Buffers are a type of cache associated with block devices (such as /dev/sda), and deal with caching of filesystem metadata. When a process needs to access data from a file, the kernel brings the data into main memory, where the process can examine it, alter it, and request that it be saved in the filesystem again. When the kernel brings the data into memory, it also needs to bring in the auxiliary data associated with it from the disk. The auxiliary data is, for instance,

  • the super block of the file system that describes the free space available on the file system
  • the inode which describes the layout of the file

Caching of filesystem metadata is done in the buffers. That is, the buffers remember what is in directories and what the file permissions are, and keep track of what memory is being written from or read to for a particular block device.

Cache - Cache is also known as the page cache. It holds only the contents of files. When a file is read from disk or network, the contents are stored in the page cache. The cache also holds recently used (but currently unused) memory pages, in case they are needed again. When extra physical memory is not in use, the kernel attempts to put it to work as a cache. The cache stores recently accessed disk data (pages) in memory; if the same data is needed again, it can be quickly retrieved from the cache, improving performance.

Both buffers and cache can be freed very quickly if an application needs the memory. Both help the kernel avoid reading information from disk storage as much as possible.

Major and Minor Page Faults

When a process starts, the kernel searches the CPU caches and then physical memory (RAM). If the data does not exist in either, the kernel issues a major page fault (MPF). An MPF is a request to the disk subsystem to retrieve pages from disk and buffer them in RAM. Once memory pages are mapped into the buffer cache, the kernel will attempt to use these pages, resulting in a minor page fault (MnPF). An MnPF saves the kernel time by reusing a page already in memory instead of reading it from disk again.

In the following example, the time command is used to demonstrate how many MPFs and MnPFs occurred when an application started. The first time evolution starts, there are many MPFs:

$ /usr/bin/time -v evolution

Major (requiring I/O) page faults: 43
Minor (reclaiming a frame) page faults: 1485

The second time evolution starts, the kernel does not issue any MPFs because the
application is in memory already:

$ /usr/bin/time -v evolution

Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1528

The filesystem cache is used by the kernel to minimize MPFs and maximize
MnPFs.

$ cat /proc/meminfo
MemTotal:        3078328 kB
MemFree:         2382324 kB
Buffers:           67444 kB
Cached:           256276 kB

The system has a total of 3 GB (MemTotal) of RAM available on it. There is
currently 2.3GB of RAM "free" (MemFree),  67 MB RAM that is allocated to disk
write operations (Buffers), and 256 MB of pages read from disk in RAM
(Cached).
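
To pull just these four fields out of /proc/meminfo, a short awk filter is enough. This is a sketch with a made-up function name. Note that the "kB" values in /proc/meminfo are in units of 1024 bytes, so dividing by 1024 gives MiB, which comes out slightly lower than the rounded figures quoted above.

```shell
# Hypothetical helper: print selected /proc/meminfo fields in MiB (kB / 1024).
# An alternative file can be given as the first argument.
meminfo_mb() {
    awk '$1 ~ /^(MemTotal|MemFree|Buffers|Cached):$/ {
        printf "%s %d MiB\n", $1, $2 / 1024
    }' "${1:-/proc/meminfo}"
}
```

Called without arguments, meminfo_mb reads the live /proc/meminfo.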

Monitoring memory usage (especially buffers and cache) using the free, vmstat and sar commands

For example, while downloading a file, we can see the buffer and cache utilisation going up.
This can be observed using the free, vmstat and sar commands as follows

Before download

$ free -m
             total       used       free     shared    buffers     cached
Mem:          2996       1593       1402          0         89       1165
-/+ buffers/cache:        338       2657
Swap:         4999          0       4999

$ vmstat 2
procs -----------memory----------       ---swap-- -----io---- --system--   -----cpu-----
 r  b   swpd   free      buff     cache      si   so    bi    bo     in   cs       us sy   id wa st
 0  0      0 1435932  91856 1193196    0    0     0     0      13   26      0  0  100  0  0
 0  0      0 1435932  91856 1193196    0    0     0     0      16   30      0  0  100  0  0
 1  0      0 1435932  91856 1193196    0    0     0     0      20   30      0  0  100  0  0
 0  0      0 1435932  91856 1193196    0    0     0   138     20  32      0  0  100  0  0
 0  0      0 1435932  91856 1193196    0    0     0     0      14   33      0  0  100  0  0

# sar -r 2
Linux 2.6.32-042stab057.1 (dhcppc5)     10/08/2012      _x86_64_        (2 CPU)

12:07:51 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
12:07:53 PM   1432116   1635832           53.32     92024       1193432    561516      6.86
12:07:55 PM   1432116   1635832           53.32     92024       1193432    561516      6.86
12:07:57 PM   1432116   1635832           53.32     92024       1193432    561516      6.86
12:07:59 PM   1432116   1635832           53.32     92024       1193432    561516      6.86

During download

# vmstat 2
procs -----------memory----------       ---swap-- -----io----   --system--      -----cpu-----
 r  b   swpd   free      buff     cache      si   so    bi    bo      in   cs          us sy id wa st
 2  0      0 1391876  92552 1229088    0    0     3    11      31   60         0  0 100  0  0
 1  0      0 1388148  92560 1232688    0    0     0     6    4033 8093       6  3 91  1  0
 0  0      0 1384684  92560 1236272    0    0     0     0    4007 8073       6  3 91  0  0
 0  0      0 1380964  92568 1239888    0    0     0     6    3992 8021       6  3 91  0  0
 0  0      0 1377120  92568 1243456    0    0     0    12   4005 8049       6  4 91  0  0
 0  0      0 1373524  92568 1247008    0    0     0     0    3990 7984       6  3 90  0  0
 0  0      0 1369928  92576 1250432    0    0     0     6    3833 7579       6  4 90  0  0
 3  0      0 1365960  92576 1253952    0    0     0 28880  4070 7955      7  5 88  0  0
 0  0      0 1362356  92584 1257520    0    0     0    10    4067 8111      7  4 89  1  0
 1  0      0 1358520  92584 1261120    0    0     0  4472   4023 8020      7  4 90  0  0
 1  0      0 1354800  92584 1264688    0    0     0     0     4041 8074      6  4 90  0  0
 1  0      0 1351204  92592 1268288    0    0     0    10    4056 8062       6  4 89  1  0
 1  0      0 1347484  92592 1271840    0    0     0     0     3985 8004      5  3 92  0  0
 0  0      0 1343764  92600 1275424    0    0     0     6     3993 8072      5  4 91  0  0

The following can be observed
  • cache and buffer counters increasing
  • interrupts (in) and context switches (cs) counters increasing
  • disk activity seen in the bo counter under io

$ free -m
                   total       used       free     shared    buffers     cached
Mem:          2996       1722       1273          0         90       1285
-/+ buffers/cache:        346       2649
Swap:         4999          0       4999

# sar -r 2
Linux 2.6.32-042stab057.1 (dhcppc5)     10/08/2012      _x86_64_        (2 CPU)

12:35:55 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
12:35:57 PM   1415808   1652140          53.85       92536      1206036    565620      6.91
12:35:59 PM   1412088   1655860          53.97       92536      1209736    565620      6.91
12:36:01 PM   1408492   1659456          54.09       92544      1213132    565620      6.91
12:36:03 PM   1404780   1663168          54.21       92544      1216720    565620      6.91
12:36:05 PM   1400936   1667012          54.34       92552      1220396    565620      6.91

It can be observed that kbbuffers and kbcached counters are increasing

After download

# vmstat 2
procs -----------memory----------       ---swap-- -----io----   --system--       -----cpu-----
 r  b   swpd   free   buff  cache          si   so    bi    bo       in     cs        us sy id  wa st
 4  0      0  21540  88432 2565180    0    0     0     0       4034 8084      6  4 90  0  0
 0  0      0  22276  86700 2565996    0    0     0     0       4052 8066      7  4 90  0  0
 0  0      0  21036  86452 2567468    0    0     0     6       4005 7982      7  4 89  0  0
 0  0      0  21288  86456 2567392    0    0   142 26904  2145 4081      4  2 92  2  0
 0  0      0  21296  86464 2567420    0    0     0    16        30   47  0      0 100   0  0
 2  0      0  21300  86464 2567424    0    0     0     0         23   36  0      0 100   0  0
 0  0      0  21424  86464 2567424    0    0     0     0         28   38  0      0 100   0  0
 0  0      0  21424  86464 2567424    0    0     0     0         23   35  0     0  100   0  0
 0  0      0  21432  86464 2567424    0    0     0     0         20   33  0      0 100   0  0


The following can be observed

  •  The cache and buff counter values remaining the same
  • Decrease in interrupts(in) and context switches(cs)

$ free -m
                  total       used       free     shared    buffers     cached
Mem:          2996       2947         48          0         80       2482
-/+ buffers/cache:        384       2611
Swap:         4999          0       4999


No increase in kbbuffers and kbcached counters observed.

How to free up buffer and cache in memory

Pages in memory are classified as follows
  • Free : Pages in main memory that have not yet been allocated to any process.
  • Dirty : When a file is written to, the new data is stored in the page cache before being written back to disk or the network. Likewise, on request by a process, a page is copied from disk into memory and modified. A page holding new or modified data that has not yet been written back is called "dirty".
  • Clean : A page in memory that has not been modified since being read from disk, or whose contents have already been written to disk.
  • Active : A page that is in use by a process. Both clean and dirty pages may be in use by a process. Clean and dirty pages not in use by a process are termed inactive clean and inactive dirty pages.

Reclaim dirty buffers and pages

The sync command can be used to flush dirty buffers and cache so that they can be reclaimed. The sync() system call tells the kernel to write all the dirty buffers in the buffer cache and all dirty pages in the page cache to disk. There is another system call named fsync(), which writes all dirty buffers and dirty pages associated with a specific file to disk.

Reclaim clean pages

To drop the clean page cache and buffers (dentries and inodes) from memory, the usual command is
    echo 3 > /proc/sys/vm/drop_caches

To free clean pages from the page cache alone:
     echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes alone:
      echo 2 > /proc/sys/vm/drop_caches

Hence, to write out dirty pages and then free the clean pages in the page cache and buffers, run the following as root
      sync && echo 3 > /proc/sys/vm/drop_caches

                           (OR)

      sync
      sysctl  -w vm.drop_caches=3
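
If this is done often, the two steps can be wrapped in a small function that also rejects invalid values. This is only a sketch: drop_caches as a function name is my own, and the optional second argument exists purely so that the function can be tried against an ordinary file instead of the real /proc entry.

```shell
# Hypothetical wrapper around the commands above. By default it writes to
# /proc/sys/vm/drop_caches, which requires root.
drop_caches() {
    mode=$1
    target=${2:-/proc/sys/vm/drop_caches}
    case $mode in
        1|2|3) ;;
        *) echo "usage: drop_caches 1|2|3 [file]" >&2; return 2 ;;
    esac
    sync                        # write dirty pages out first so they become clean
    echo "$mode" > "$target"
}
```

As root, drop_caches 3 would then do the same as the sync && echo 3 line above.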

Here is an illustration

Before freeing up buffer and pagecache in memory
# free -m
                   total       used       free     shared    buffers     cached
Mem:          2996       2943         52          0         81       2479
-/+ buffers/cache:        382       2613
Swap:         4999          0       4999

Run the command to free up buffer and page cache in memory
   sync && echo 3 > /proc/sys/vm/drop_caches

After freeing up buffer and pagecache in memory by reclaiming dirty and clean pages in memory
# free -m
                   total       used       free     shared    buffers     cached
Mem:          2996        344       2651          0          0         83
-/+ buffers/cache:        260       2735
Swap:         4999          0       4999