Monday, February 25, 2013

RAID Levels and why RAID10 is better than RAID01

RAID - Redundant Array of Independent Disks

For an overview of the different RAID levels, refer to the following links:

http://www.fatmin.com/2010/04/raid-levels-explained-and-simplified.html

http://www.buzzle.com/articles/raid-levels-explained.html

Why RAID10 is better than RAID01?

RAID10 offers better fault tolerance (redundancy) than RAID01.

This is illustrated at the link below:

http://www.thegeekstuff.com/2011/10/raid10-vs-raid01/



Sunday, February 24, 2013

How to set/change the default storage engine in MySQL


The storage engines are responsible for the storage and retrieval of all data stored in MySQL.

Storage engines available in MySQL include:
1) MyISAM (used by default)
2) InnoDB (supports transactions, row-level locking, and foreign keys)
3) MEMORY/Heap (hash based, stored in memory, useful for temporary tables)

The ENGINES table in the information_schema database shows which storage engine is the default.

MyISAM is the default storage engine.

The default storage engine can be changed. Once the default storage engine type is changed, any new table created will use the new storage engine. However, tables created earlier will continue to use their old storage engine.

To change the default to the InnoDB storage engine:

1) Add the following line to the MySQL configuration file /etc/my.cnf, under the [mysqld] section

[mysqld]
default-storage-engine=InnoDB 

2) Restart mysql
   # service mysqld restart

3) Check that the default storage engine is now set to InnoDB, from the ENGINES table in the information_schema database

mysql> use information_schema;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from ENGINES\G
*************************** 1. row ***************************
      ENGINE: MRG_MYISAM
     SUPPORT: YES
     COMMENT: Collection of identical MyISAM tables
TRANSACTIONS: NO
          XA: NO
  SAVEPOINTS: NO
*************************** 2. row ***************************
      ENGINE: CSV
     SUPPORT: YES
     COMMENT: CSV storage engine
TRANSACTIONS: NO
          XA: NO
  SAVEPOINTS: NO
*************************** 3. row ***************************
      ENGINE: MyISAM
     SUPPORT: YES
     COMMENT: Default engine as of MySQL 3.23 with great performance
TRANSACTIONS: NO
          XA: NO
  SAVEPOINTS: NO
*************************** 4. row ***************************
      ENGINE: InnoDB
     SUPPORT: DEFAULT
     COMMENT: Supports transactions, row-level locking, and foreign keys
TRANSACTIONS: YES
          XA: YES
  SAVEPOINTS: YES
*************************** 5. row ***************************
      ENGINE: MEMORY
     SUPPORT: YES
     COMMENT: Hash based, stored in memory, useful for temporary tables
TRANSACTIONS: NO
          XA: NO
  SAVEPOINTS: NO
5 rows in set (0.00 sec)
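Rather than scanning the full ENGINES listing, the same check can be done with a single query (a sketch, assuming you have access to information_schema):

```sql
-- The engine whose SUPPORT value is DEFAULT is the current default
SELECT ENGINE FROM information_schema.ENGINES WHERE SUPPORT = 'DEFAULT';
```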


How to alter the storage engine set for a table?

Tables can be created with different storage engines; all tables in a database need not use the same engine. The storage engine for a table can also be altered after creation.


To convert a table from one storage engine to another, use an ALTER TABLE statement that indicates the new engine

ALTER TABLE <tablename> ENGINE = <storage engine name>;

Eg:

Say I wish to change the storage engine for the table dg_cities in the database sampledata.

1) First, let me identify the storage engine currently used by the table dg_cities in the database sampledata

mysql> select TABLE_NAME,TABLE_SCHEMA, ENGINE from information_schema.tables where TABLE_NAME like '%dg_cities%'\G
*************************** 1. row ***************************
  TABLE_NAME: dg_cities
TABLE_SCHEMA: sampledata
      ENGINE: MyISAM
1 row in set (0.00 sec)

2) Altering the storage engine for the table dg_cities in the database sampledata, from MyISAM to InnoDB

mysql> use sampledata;

mysql> ALTER TABLE dg_cities ENGINE = InnoDB;
Query OK, 1056 rows affected (0.17 sec)
Records: 1056  Duplicates: 0  Warnings: 0

mysql> select TABLE_NAME,TABLE_SCHEMA, ENGINE from information_schema.tables where TABLE_NAME like '%dg_cities%'\G
*************************** 1. row ***************************
  TABLE_NAME: dg_cities
TABLE_SCHEMA: sampledata
      ENGINE: InnoDB
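To convert every MyISAM table in a schema at once, the ALTER statements can be generated from information_schema (a sketch; sampledata is the example schema used above, and the generated statements should be reviewed before being executed):

```sql
-- Generate one ALTER TABLE statement per MyISAM table in the schema
SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` ENGINE = InnoDB;')
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'sampledata'
  AND ENGINE = 'MyISAM';
```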

Find the default storage engine used in MySQL


The default storage engine used by MySQL can be found from the information_schema database.

1) Log in to the mysql prompt

[root@dhcppc3 ~]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.1.67-log Source distribution

Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| sampledata         |
| test               |
+--------------------+
4 rows in set (0.02 sec)

2) Switch to information_schema database

mysql> use information_schema;

3) In the SUPPORT column, look for the value DEFAULT; this identifies the default storage engine.
As per the result below, MyISAM is the default storage engine.


mysql> select * from ENGINES\G
*************************** 1. row ***************************
      ENGINE: MRG_MYISAM
     SUPPORT: YES
     COMMENT: Collection of identical MyISAM tables
TRANSACTIONS: NO
          XA: NO
  SAVEPOINTS: NO
*************************** 2. row ***************************
      ENGINE: CSV
     SUPPORT: YES
     COMMENT: CSV storage engine
TRANSACTIONS: NO
          XA: NO
  SAVEPOINTS: NO
*************************** 3. row ***************************
      ENGINE: MyISAM
     SUPPORT: DEFAULT
     COMMENT: Default engine as of MySQL 3.23 with great performance
TRANSACTIONS: NO
          XA: NO
  SAVEPOINTS: NO
*************************** 4. row ***************************
      ENGINE: InnoDB
     SUPPORT: YES
     COMMENT: Supports transactions, row-level locking, and foreign keys
TRANSACTIONS: YES
          XA: YES
  SAVEPOINTS: YES
*************************** 5. row ***************************
      ENGINE: MEMORY
     SUPPORT: YES
     COMMENT: Hash based, stored in memory, useful for temporary tables
TRANSACTIONS: NO
          XA: NO
  SAVEPOINTS: NO
5 rows in set (0.00 sec)

How to find the storage engine set for a table?

Tables can be created with different storage engines; all tables in a database need not use the same engine, and the storage engine for a table can be altered after creation.

The storage engine set for a table can be found using the TABLES table in the INFORMATION_SCHEMA database.

mysql> select TABLE_NAME,TABLE_SCHEMA, ENGINE from tables where TABLE_NAME like '%dg_cities%'\G
*************************** 1. row ***************************
  TABLE_NAME: dg_cities
TABLE_SCHEMA: sampledata
      ENGINE: MyISAM
1 row in set (0.00 sec)

  
So here, the table dg_cities in the database sampledata uses the MyISAM storage engine.



SOFT state and HARD state check intervals in Nagios

If a host or service is in a HARD state, it will be actively checked at intervals equal to the check_interval option in Nagios.

If it is in a SOFT state, it will be checked at intervals equal to the retry_interval option in Nagios.
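As a sketch, a service definition using these options might look like the following (the host, service and command names are made up for illustration):

```
define service {
    host_name             web01
    service_description   HTTP
    check_command         check_http
    max_check_attempts    3     ; retries before a SOFT state becomes HARD
    check_interval        5     ; minutes between checks in a HARD state
    retry_interval        1     ; minutes between checks in a SOFT state
}
```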

What is system load average?


The system load average is reported by commands such as w, top and uptime.

The load average is a measure of the number of processes that are either in the TASK_RUNNING state or in the TASK_UNINTERRUPTIBLE state.

TASK_RUNNING processes are on the run queue and may or may not currently hold the CPU.
TASK_UNINTERRUPTIBLE processes are waiting for I/O.

The load averages are time-damped averages taken over periods of 1, 5 and 15 minutes. The sar utility (sar -q) can also be used to obtain the system load averages.

To find the number of processes in TASK_RUNNING state


$ grep "procs_running" /proc/stat
procs_running 11

$ vmstat 2

Check the first column, "r". This gives the number of processes in the TASK_RUNNING state.


To find the number of processes in TASK_UNINTERRUPTIBLE state

$ grep "procs_blocked" /proc/stat
procs_blocked 1

$ vmstat 2

Check the second column, "b". This gives the number of processes in the blocked (TASK_UNINTERRUPTIBLE) state.
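The two checks above can be combined into a small script that reports both inputs to the load average (Linux-specific, since it reads /proc/stat):

```shell
# Report runnable and blocked task counts from /proc/stat
running=$(awk '/^procs_running/ {print $2}' /proc/stat)
blocked=$(awk '/^procs_blocked/ {print $2}' /proc/stat)
echo "TASK_RUNNING: $running  TASK_UNINTERRUPTIBLE: $blocked"
```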

/proc/locks


The /proc/locks file lists the file locks currently held by the kernel. Each line has 8 fields.

$ cat /proc/locks
1: POSIX  ADVISORY  WRITE 229903 fd:132:639017 0 EOF
2: POSIX  ADVISORY  WRITE 555736 fd:132:1016088 0 EOF
3: FLOCK  ADVISORY  WRITE 522736 fd:132:1016016 0 EOF
4: FLOCK  ADVISORY  WRITE 522700 fd:132:1015984 0 EOF
5: POSIX  ADVISORY  WRITE 520967 fd:132:1015829 0 EOF
6: POSIX  ADVISORY  WRITE 463519 fd:00:2523477 0 EOF
7: POSIX  ADVISORY  WRITE 47253 fd:00:2523816 0 EOF
8: FLOCK  ADVISORY  WRITE 47223 fd:00:2523810 0 EOF
9: FLOCK  ADVISORY  WRITE 47202 fd:00:2523424 0 EOF
10: POSIX  ADVISORY  WRITE 10573 fd:00:2523558 0 EOF
11: POSIX  ADVISORY  WRITE 10566 fd:00:2523507 0 EOF

Column 1 : Each lock has its own line which starts with a unique number
Column 2 : This column refers to the class of lock used: FLOCK signifies the older-style UNIX file locks from a flock system call, and POSIX represents the newer POSIX locks from the fcntl/lockf calls.
Column 3 : This has two values  ADVISORY or MANDATORY. ADVISORY means that the lock does not prevent other people from accessing the data; it only prevents other attempts to lock it. MANDATORY means that no other access to the data is permitted while the lock is held.
Column 4 : This reveals whether the lock is allowing the holder READ or WRITE access to the file
Column 5 : This shows the ID of the process holding the lock.
Column 6 : This column shows the ID of the file being locked, in the format of MAJOR-DEVICE:MINOR-DEVICE:INODE-NUMBER.
Column 7, 8 : The last two columns show the start and end of the file's locked region.
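A lock entry can be produced on demand with the flock(1) utility, which takes a flock()-style lock on a file descriptor (a sketch; while the subshell holds the lock, it appears among the FLOCK ADVISORY WRITE entries):

```shell
# Hold an exclusive flock() lock on fd 9 and look for it in /proc/locks
lockfile=$(mktemp)
(
    flock -x 9                 # acquire the lock on fd 9
    grep 'FLOCK' /proc/locks   # our lock appears among the FLOCK entries
) 9>"$lockfile"
rm -f "$lockfile"
```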

How to enable MANDATORY locking on a file?

To enable MANDATORY locking on a file, set the SGID bit on the file while clearing the group execute bit. (The filesystem must also be mounted with the mand option for mandatory locks to be enforced.)

# mount -o remount,mand /mountpoint
# chmod g+s-x /path/to/file

Open File Descriptor Limits in Linux


When we speak of the open files limit, we actually mean the file descriptor limit.

When someone asks how many open files there are, we usually just take the output of "lsof | wc -l". But this is not accurate: lsof also lists entries, such as memory-mapped files, that do not consume file descriptors.

So why should we be bothered about open file descriptor limits?
     Sometimes a Java application server such as Tomcat may leak file descriptors, i.e. fail to close file handles after use. This may force us to raise the file descriptor (file handle) limit.

Open file descriptor limits are available

  1. For the whole system
  2. Per process
  3. Per user

System-wide Open File Descriptor (handle) Limit

In Linux, there is a limit set in the kernel on how many open file descriptors are allowed on the system.

# cat /proc/sys/fs/file-max
6491402

On this system, 6,491,402 open file descriptors are permitted.

How to increase the file descriptor limit when the maximum number of open files has been reached?

This can be achieved 

a) Using sysctl. To make the change permanent, add the entry to /etc/sysctl.conf:

 Make the following entry in /etc/sysctl.conf:
           fs.file-max = 104854

Then run the command:
# sysctl -p

          (Or)

b) To change the limit immediately (the change is lost at reboot), run:

 # echo "104854" > /proc/sys/fs/file-max

How many open file descriptors are in actual use in the system?

To find the number of file descriptors that are currently used in the system, run the following

# cat /proc/sys/fs/file-nr
510     0       6491402 

Column 1 - 510 - the number of allocated file handles
Column 2 - 0 - the number of allocated but unused file handles
Column 3 - 6491402 - the maximum number of file handles (same as /proc/sys/fs/file-max)

No. of file descriptors currently in use on the system = Column 1 - Column 2 = 510 - 0 = 510
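The subtraction can be done directly with awk, reading the same /proc file:

```shell
# in use = allocated (column 1) minus free (column 2)
awk '{print "file handles in use:", $1 - $2}' /proc/sys/fs/file-nr
```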

Process Level Open file descriptors(handles) limit

Linux also limits the number of file descriptors that any one process may open.

To find the default limit per process,

grep -i "max open files" /proc/<PID>/limits

Eg: # grep -i "max open files" /proc/893046/limits
Max open files            1024                 1024                 files

So this process can open 1024 file descriptors

To find the number of file descriptors (handles) currently opened by a process:

ls /proc/<PID>/fd | wc -l

(With ls -l, subtract one from the count for the "total" line in its output.)

Eg: # ls -l /proc/893046/fd
total 0
lrwx------ 1 root root 64 Feb 17 08:03 0 -> /dev/null
lrwx------ 1 root root 64 Feb 17 08:03 1 -> /dev/null
lrwx------ 1 root root 64 Feb 17 08:03 10 -> socket:[8924389]
l-wx------ 1 root root 64 Feb 17 08:03 2 -> pipe:[8924390]
lr-x------ 1 root root 64 Feb 17 08:03 3 -> eventpoll:[8924395]
lr-x------ 1 root root 64 Feb 17 08:03 4 -> pipe:[8924397]
l-wx------ 1 root root 64 Feb 17 08:03 5 -> pipe:[8924397]
lr-x------ 1 root root 64 Feb 17 08:03 6 -> pipe:[8924398]
l-wx------ 1 root root 64 Feb 17 08:03 7 -> pipe:[8924398]
lrwx------ 1 root root 64 Feb 17 08:03 8 -> socket:[8924387]
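To count the descriptors rather than list them, a plain ls (without -l, which adds a "total" line) can be piped to wc; here the current shell's own PID ($$) is used as an example:

```shell
# Number of descriptors currently open in this shell
ls /proc/$$/fd | wc -l
```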

User Level Open File descriptor Limit

To find the user level file descriptor limit

 # su - <username>

Show the soft and hard limit using

ulimit -Sn
ulimit -Hn

Eg :
$ ulimit -Hn
1024

$ ulimit -Sn
1024

How to set the open file descriptor limit at the user level?

A user-specific file descriptor limit can be set by editing the /etc/security/limits.conf file. The change takes effect at the user's next login.

Say for user, apache, set the following in /etc/security/limits.conf

apache soft nofile 2048
apache hard nofile 4096

Verify the change as follows (the new limits apply from the next login):

# su - apache
$ ulimit -Hn
$ ulimit -Sn

Tuning ext3/ext4 file system performance


1) Consider disabling access time updates

mount -o noatime

Each time a file is read or written, its access timestamp is updated, which causes a metadata write for that file; even pure reads thus generate writes. On a busy file system, this can cause a large number of writes. On a journalled filesystem, the access timestamps must be written to the on-disk journal as well.

2) Mount ext3/ext4 filesystem with longer period between journal commits

mount -o commit=15

- The default is 5 seconds.
- A longer interval may improve performance but results in more lost data if the system crashes.
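Both options can be made persistent in /etc/fstab; the device and mount point in this line are assumptions for illustration:

```
/dev/sda2   /data   ext4   defaults,noatime,commit=15   0 2
```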

3) Using Directory Indexing

This feature improves file access in large directories or directories containing many files by using hashed binary trees to store the directory information. It's perfectly safe to use.

Note : Make sure any filesystems are cleanly unmounted before altering them with the tune2fs or e2fsck utilities

      a) Unmount the filesystem

      b) tune2fs -O dir_index /dev/sda2

This will only take effect with directories created on that filesystem after tune2fs is run

     c) In order to apply this to currently existing directories, we must run the e2fsck utility to optimize and reindex the directories on the filesystem

                  e2fsck -D -f /dev/sda2

Journalling Filesystem in Linux


The ext3 filesystem provides journalling as an important enhancement over ext2: the filesystem maintains a journal in which changes are logged.

How is the journalling feature in filesystem helpful?

After an unclean shutdown (power failure or system crash), each mounted ext2 partition must be checked for consistency using the e2fsck program. This delays system boot significantly, especially for large partitions containing many files, and during this time any data on the partitions is unreachable.

If fsck needs to be run on a live system, the partitions must be remounted read-only. When a filesystem is remounted read-only, all pending metadata updates (and writes) are forced to disk prior to the remount. This ensures the filesystem is in a consistent state, and it is then possible to run fsck -n.

The journalling feature provided by the ext3 file system means that this sort of file system check is no longer necessary after an unclean system shutdown.
The time to recover an ext3 file system after an unclean system shutdown does not depend on the size of the file system or the number of files; rather, it depends on the size of the journal used to maintain consistency. The default journal size takes about a second to recover, depending on the speed of the hardware. So journalling has made running fsck after an unclean unmount unnecessary.

Is fsck still necessary when a filesystem has journalling?

Yes. In extreme cases such as hard drive failure, a file system consistency check (fsck) is very much necessary.


How does journalling work?

There are three journalling modes:

1) ordered (only metadata is journalled; the default)
2) writeback (only metadata is journalled, with no ordering guarantee between data writes and journal commits)
3) journal (both data and metadata are journalled)

Ordered mode :

mount -o data=ordered

In ordered mode, the data blocks related to a metadata change are written to disk before the metadata is committed to the journal. This ensures that every metadata change recorded in the journal reflects writes that have actually been made to the disk.

Ordered mode is the default journal mode used in most systems.

Writeback mode:

mount -o data=writeback

This is the fastest mode. Here, metadata may be committed to the journal even before the data blocks related to the metadata change are written to the disk. As a result, after a crash files may contain stale data.

Journal mode:

mount -o data=journal

In this mode, both metadata and the data blocks related to the metadata change are journalled.
A copy of the modified data blocks is first written to the journal; then the modified data blocks are written to the filesystem. Once the I/O transfer to the filesystem completes (the data is committed to the filesystem), the copies of the blocks in the journal are discarded.

This is the slowest journalling mode, since the most total disk I/O is done. However, it merges many small writes into efficient linear journal I/O, which helps avoid expensive seeks for small, random writes.


How to check whether a filesystem has journalling enabled?

# dumpe2fs /dev/sda2 | grep -i has_journal
dumpe2fs 1.41.12 (17-May-2010)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

# debugfs -R features /dev/sda2
debugfs 1.41.12 (17-May-2010)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize


How to find the size of the journal?

# dumpe2fs /dev/sda2 | egrep -i '(journal|size)'
dumpe2fs 1.41.12 (17-May-2010)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Block size:               4096
Fragment size:            4096
Flex block group size:    16
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x0002cb40
Journal start:            1


How to improve journal performance?

Journal performance can be improved by placing the journal on a separate device. The external journal partition should be located on a device with similar or better performance characteristics than the device containing the filesystem. Important points to note here are

1) external journal will use the entire partition
2) journal partition must be created with same block size as that used by the filesystem it is journalling

Let us see how to go about creating an external journal partition. Say, for filesystems in partitions on the device /dev/sda, we want to create an external journal partition on the device /dev/sdb.

1) Get the filesystem block size of partition /dev/sda1

# dumpe2fs /dev/sda1 | egrep -i '(journal|size)'

2) Unmount the filesystem /dev/sda1 and remove its internal journal

Make sure any filesystems are cleanly unmounted before altering them with the tune2fs or e2fsck utilities

 # tune2fs -O ^has_journal /dev/sda1

3) Create an external journal device partition

# mke2fs -O journal_dev -b <block-size> /dev/sdb1

4) Update the /dev/sda1 filesystem superblock to use the external journal /dev/sdb1

# tune2fs -j -J device=/dev/sdb1 /dev/sda1
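The four steps above can be sketched as a small script. This is destructive and must be run with the filesystem unmounted; the device names are this guide's examples:

```shell
FS_DEV=/dev/sda1       # filesystem whose journal is being externalized
JOURNAL_DEV=/dev/sdb1  # dedicated journal partition

# 1) Read the filesystem block size from the superblock
bs=$(dumpe2fs -h "$FS_DEV" 2>/dev/null | awk -F: '/^Block size/ {gsub(/ /, "", $2); print $2}')

# 2) Remove the internal journal (filesystem must be unmounted)
umount "$FS_DEV"
tune2fs -O ^has_journal "$FS_DEV"

# 3) Format the journal partition with the same block size
mke2fs -O journal_dev -b "$bs" "$JOURNAL_DEV"

# 4) Point the filesystem at the external journal
tune2fs -j -J device="$JOURNAL_DEV" "$FS_DEV"
```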


Populate MySQL database with sample data

http://www.generatedata.com provides a db_install.sql file, which can be used for creating sample data tables.


1) wget http://www.generatedata.com/files/data_generator2_1.zip

2) unzip data_generator2_1.zip

3) cd data_generator2_1/install

4) Create a database, called "sampledata", from mysql shell
     mysql> create database sampledata;

5) Edit the db_install.sql file under data_generator2_1/install/ -
   Add the following line at the top of the file db_install.sql:
        use sampledata;

6) $ mysql -u root -p < db_install.sql

7) Login to mysql shell
   mysql> use sampledata;
   mysql> show tables;
+--------------------------+
| Tables_in_sampledata     |
+--------------------------+
| dg_cities                |
| dg_counties              |
| dg_countries             |
| dg_first_names           |
| dg_forms                 |
| dg_provinces             |
| dg_provinces_netherlands |
| dg_states                |
| dg_surnames              |
| dg_user_accounts         |
+--------------------------+
10 rows in set (0.00 sec)
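Step 5 can be scripted instead of editing the file by hand, using sed to prepend the USE statement. This sketch operates on a scratch copy so it is safe to try; run it against the real db_install.sql in place:

```shell
# Demonstrated on a scratch copy; the fallback stand-in file is only for illustration
cp data_generator2_1/install/db_install.sql /tmp/db_install.sql 2>/dev/null \
  || echo 'CREATE TABLE dg_cities (id INT);' > /tmp/db_install.sql
sed -i '1i use sampledata;' /tmp/db_install.sql
head -n 1 /tmp/db_install.sql   # first line is now: use sampledata;
```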

MySQL Replication HowTo

How does MySQL replication work?


MySQL replication basically consists of three steps:

  1. The master records changes to data in its binary log.
  2. The slave copies the changes recorded in the master's binlog to its relay log.
  3. The slave then replays the change-set recorded in its relay log, applying these changes to its own data.

On Master:

When a transaction is sent to a MySQL server running with binary logging enabled, the transaction is executed normally, and just prior to completing the transaction the server records the change serially in the binary log. Once this is done, it tells the storage engine that the transaction is ready to be committed.

On Slave:

The slave runs two threads to handle its role as a slave.

First, the slave runs an I/O thread that opens a standard client connection to the master (using the replication user account) and issues the non-SQL BINLOG DUMP command. This causes the master to start a Binlog dump thread and allows the slave to read entries from the master's binary log. The slave compares each entry with the latest entry it already has on disk. If it finds no new entries, it sleeps and waits for the master to signal a new event; if there are new entries, it records them in the relay log on the slave.

A second thread, the SQL slave thread, reads from the relay log and replays the queries, which completes the process that started with a query being sent to the master and ends with the query being committed to the slave server database.

A diagrammatic illustration that explains this can be found at http://blog.secaserver.com/2011/06/the-best-way-to-setup-mysq-replication/

 Section 1 : Master MySQL server setup for replication

We are doing a fresh installation of mysql server and configuring it as a master server here

1) Install MySQL server

# yum install mysql-server
# yum install mysql

Now start the mysql server

# service mysqld start

The mysql server reads the configuration file /etc/my.cnf and initializes the mysql database, populating the directories
  • /var/lib/mysql
  • /var/log/mysql

2) Configure the server as master by editing the mysql configuration file /etc/my.cnf with the following entries under the [mysqld] section

[mysqld]
binlog-format=mixed
log_bin=/var/log/mysql/mysql-bin.log
max_binlog_size=200M
expire_logs_days=45
server-id=100
innodb_flush_log_at_trx_commit=1
sync_binlog=1

Let me explain each of the above entries and their relevance

server-id : 
Set a unique ID for the server. By default, server-id is set to 0; if server-id is omitted or set to 0 in the my.cnf file, the master refuses connections from all slaves.

log-bin :
The master must be told to keep a binary log; log_bin specifies the exact path of the binary log file.

max_binlog_size :
Binary log files will keep growing and to manage their sizes, this parameter is helpful. Once the size limit is reached, a new binary log file is created. The default maximum size for a binary log is 1G. 

expire_logs_days :
Over time, there may be too many log files, so it is good practice to delete old binary logs automatically. With our setting, all binary log files older than 45 days are deleted automatically. Note that if the slave server is down for 45 days or more, replication cannot resume and must be rebuilt from the beginning. The default value for this parameter is 0, which means no automatic removal.

sync_binlog, innodb_flush_log_at_trx_commit :
For the greatest possible durability and consistency in a replication setup using InnoDB with transactions, set innodb_flush_log_at_trx_commit=1 and sync_binlog=1 in the master my.cnf file. However, this will have an impact on performance.

sync_binlog - A value of 1 is the safest choice because in the event of a crash we lose at most one statement or transaction from the binary log. However, it is also the slowest choice (unless the disk has a battery-backed cache, which makes synchronization very fast).

If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit. When the value is 1 (the default), the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file. When the value is 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues. The default value of 1 is the value required for ACID compliance.

binlog-format :
MySQL supports two forms of replication: statement-based and row-based replication. There is also a hybrid mode (mixed).

To check which replication mode a MySQL server is running in, inspect the value of the binlog_format system variable:

mysql> show variables like 'binlog_format%';
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| binlog_format | STATEMENT |
+---------------+-----------+
1 row in set (0.00 sec)

We are setting the replication mode as mixed
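The format can also be switched at runtime without a restart (requires the SUPER privilege; new sessions pick up the global value):

```sql
SET GLOBAL binlog_format = 'MIXED';
```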

3) To make the changes done in /etc/my.cnf effective, restart the mysql server
         # service mysqld restart

4) Each slave mysql server that connects to a master in a replication setup (in order to read the binary log on the master) must have an account it can use to log in. This account must be granted the dedicated REPLICATION SLAVE privilege.

 a) Login to the master mysql server
           # mysql -u root -p

 b) master mysql > CREATE USER 'repl_user'@'%' IDENTIFIED BY 'mydba';
     master mysql > GRANT REPLICATION SLAVE ON *.* TO 'repl_user'@'%';

So we have created a user repl_user with the password mydba on the master server. The slave server will use this account to log in to the master.

c) To update the permissions table, flush the privileges:
    master mysql > FLUSH PRIVILEGES;

d) In the mysql database, the user table should now contain an entry like this:

mysql> select user,host from user;
+-----------+-------------+
| user      | host        |
+-----------+-------------+
| repl_user | %           |
| root      | 127.0.0.1   |
|           | localhost   |
| root      | localhost   |
|           | mastermysql |
| root      | mastermysql |
+-----------+-------------+

Section 2 : Slave MySQL server setup for replication

We are doing a fresh installation of mysql server and configuring it as a slave server here

1) Install MySQL server

# yum install mysql-server
# yum install mysql

Now start the mysql server

# service mysqld start

The mysql server reads the configuration file /etc/my.cnf and initializes the mysql database, populating the directories
  • /var/lib/mysql
  • /var/log/mysql

2) Stop the mysql server
    # service mysqld stop

3) Edit the [mysqld] section in /etc/my.cnf file to configure as slave server

[mysqld]
server-id=102
master-host=<master host ip>
master-connect-retry=60
master-user=<repl_user_name>
master-password=<password>
relay-log=/var/lib/mysql/slave-relay.log
relay-log-index=/var/lib/mysql/slave-relay-log.index

# In case we are enabling binary logging on the slave (to use the slave as a master server later)
log_bin=/var/log/mysql/mysql-bin.log 
log-slave-updates = 1
read-only = 1

Let me explain the significance of each of the entries

server-id - gives the slave its unique ID
master-host - tells the slave the IP address of the master mysql server
master-connect-retry - the interval, in seconds, between connection retry attempts
master-user - the user created on the master server, through which the slave connects for replication
master-password - the password of the replication user mentioned above

This is how a slave server's /etc/my.cnf file will look:

#Slave entries
server-id=102
master-host=192.168.1.33
master-connect-retry=60
master-user=repl_user
master-password=mydba
relay-log = /var/lib/mysql/slave-relay.log
relay-log-index = /var/lib/mysql/slave-relay-log.index

Sometime in the future, we may want to use the slave server as a master server. In that case, we should enable binary logging.

log-bin - path of the binary log
log-slave-updates - makes the slave log the replicated events to its own binary log (use this if binary logging is enabled)
read-only - prevents data changes on the slave, because a slave is usually used only for reads

4) Do not start the mysql server for now

Having now set up the MySQL master and slave servers, we need to start the replication process.

Section 3 : Setting up the Replication between Master and Slave

On Master

1) Connect to MySQL Master server
     # mysql -u root -p

2) First lock the tables in master so that no write activity happens
      master mysql> FLUSH TABLES WITH READ LOCK;

3) Leave this shell open and start another mysql client shell, i.e. another session. In this mysql shell, run:
    master mysql> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.000003 |      418 |              |                  |
+------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

Write down the values of the File and Position columns (here: mysql-bin.000003 and 418). You will need them later when setting up the slave to start replication.

4) On OS command prompt, get a data snapshot:

   # mysqldump --all-databases --lock-all-tables >dbdump.db   # I used this
   An alternative is to use the --master-data option, which automatically appends the CHANGE MASTER TO statement required on the slave to start the replication process.
   
  # mysqldump --all-databases --master-data -u root -p --hex-blob > dbdump.db

mysqldump: A utility that produces SQL statements from a MySQL database. In effect it means, "take my database and produce a file of SQL statements which, if executed, would build an identical database"

--all-databases: Backs up all databases on the master.
--hex-blob: If you have BLOBs (binary objects) in your database, this option ensures they are stored in a way that allows them to be imported correctly (that is, stored as hexadecimal characters).  Blobs are typically images, audio or other multimedia objects, though sometimes binary executable code is stored as a blob.

Transfer this database dump to the slave server.

5) Release the lock and allow the master to receive updates again
   master mysql> UNLOCK TABLES;

On Slave

1) Start the MySQL server with the --skip-slave-start option so that replication does not start
     # mysqld_safe --skip-slave-start

2) From another operating system command prompt, import the master database dump
     # mysql -u root -p < dbdump.db

3) Stop MySQL server on slave

4) Possible problem
         When we imported the master database dump on the slave server, we imported the system databases from the master too. So the slave server's mysql root password is lost - it has been overwritten with the master server's root password.

In case we intend to use a separate root password for the slave mysql server, do the following

    a) Start MySQL so that it will not ask for a password. Also, make sure it does not start replication:
            # mysqld_safe --skip-slave-start --skip-grant-tables

    b) Then connect with mysql -u root from the OS command prompt and update the root password as follows:
      # mysql -u root
      slave mysql> use mysql;
      slave mysql> update user set password=PASSWORD('new-password') WHERE User = 'root';
      slave mysql> flush privileges;
   c) Stop MySQL Server that you started with skipping slave start and grant tables.

5) Start the MySQL server, again skipping slave start
           # mysqld_safe --skip-slave-start
6) Set the master server's log coordinates on the slave. The slave has to start replication from these log coordinates.
  
slave mysql > CHANGE MASTER TO MASTER_HOST='192.168.1.33', MASTER_USER='repl_user', MASTER_PASSWORD='mydba', MASTER_LOG_FILE='mysql-bin.000003', MASTER_LOG_POS=418; 
This is how you tell the slave how to connect to the master in order to replicate. Note the log coordinates - these are the values obtained in step 3 under "On Master" above.
7) Stop the mysql server started with --skip-slave-start in step 5.

8) Start the slave  mysql server normally
      # service mysqld start

So now master-slave replication has been set up and initiated successfully.

Checking if Master-Slave Replication is working fine

Having started the slave MySQL node, you can log in and issue some commands to make sure that Slave is running OK.

1) On slave mysql prompt, give the following command:

slave mysql> show processlist;
+----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
| Id | User        | Host      | db   | Command | Time | State                                                                 | Info             |
+----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
|  1 | system user |           | NULL | Connect |   43 | Waiting for master to send event                                      | NULL             |
|  2 | system user |           | NULL | Connect |  892 | Has read all relay log; waiting for the slave I/O thread to update it | NULL             |
|  4 | root        | localhost | NULL | Query   |    0 | NULL                                                                  | show processlist |
+----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
3 rows in set (0.00 sec)

2) On slave mysql prompt, give the following command
slave mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.33
                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000003
          Read_Master_Log_Pos: 489
               Relay_Log_File: slave-relay.000003
                Relay_Log_Pos: 322
        Relay_Master_Log_File: mysql-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 489
              Relay_Log_Space: 473
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:

 This will display the current replication status on the slave. Check the *_Errno and *_Error fields for errors.
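This check is easy to script by parsing the \G output for the two Running flags and the error counter. A minimal sketch, fed here with a trimmed sample of the output above (on a live slave, pipe in `mysql -e 'SHOW SLAVE STATUS\G'` instead):

```shell
# check_slave: returns 0 if both replication threads run and no error is recorded
check_slave() {
    awk -F': ' '
        /Slave_IO_Running/  { io  = $2 }
        /Slave_SQL_Running/ { sql = $2 }
        /Last_Errno/        { err = $2 }
        END { exit !(io == "Yes" && sql == "Yes" && err == 0) }
    '
}

# Trimmed sample of the SHOW SLAVE STATUS\G output
sample='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0'

if printf '%s\n' "$sample" | check_slave; then
    echo "slave healthy"
else
    echo "slave has a problem"
fi
```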
3) On slave mysql prompt, give the following command
slave mysql> show status like 'Slave%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Slave_open_temp_tables     | 0     |
| Slave_retried_transactions | 0     |
| Slave_running              | ON    |
+----------------------------+-------+
3 rows in set (0.00 sec)

Check that the value of Slave_running is ON.

Some general queries

1) How to know how late a slave is compared to the master? In other words, how do I know the date of the last statement replicated by the slave?

Run the following command in mysql shell prompt of SLAVE server

slave mysql> SHOW SLAVE STATUS;

Check the "Seconds_Behind_Master" column in the output from SHOW SLAVE STATUS.
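The value can also be pulled out programmatically; a small sketch parsing the \G output (sample inlined here; on a live slave, pipe in `mysql -e 'SHOW SLAVE STATUS\G'` instead):

```shell
# Extract the replication lag in seconds from SHOW SLAVE STATUS\G output
sample='        Seconds_Behind_Master: 0
                Last_IO_Errno: 0'

lag=$(printf '%s\n' "$sample" | awk -F': ' '/Seconds_Behind_Master/ {print $2}')
echo "slave is $lag seconds behind the master"
```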


2) How to force the master to block updates until the slave catches up?

a) On the master mysql server, execute these statements:

    master mysql> FLUSH TABLES WITH READ LOCK;
    master mysql> SHOW MASTER STATUS;

    Record the replication coordinates (the current binary log file name and position) from the output of the SHOW statement.

b) On the slave mysql server, issue the following statement, where the arguments to the MASTER_POS_WAIT() function are the replication coordinate values obtained in the previous step:

    slave mysql> SELECT MASTER_POS_WAIT('log_name', log_pos);

    The SELECT statement blocks until the slave reaches the specified log file and position. At that point, the slave is in synchrony with the master and the statement returns.

c) On the master, issue the following statement to enable the master to begin processing updates again:

    master mysql> UNLOCK TABLES;
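The statement for step (b) can be built from the coordinates recorded in step (a); the sketch below only constructs the SQL text (the coordinates are the illustrative values from earlier, and the mysql invocation in the comment is how you might run it on the slave):

```shell
# Build the MASTER_POS_WAIT() statement for the slave from the master's
# binary log coordinates recorded by SHOW MASTER STATUS
build_pos_wait() {
    log_file=$1
    log_pos=$2
    printf "SELECT MASTER_POS_WAIT('%s', %s);\n" "$log_file" "$log_pos"
}

# Using the example coordinates; on the slave you might run:
#   mysql -u root -p -e "$(build_pos_wait mysql-bin.000003 418)"
build_pos_wait mysql-bin.000003 418
```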

3) How to start mysql replication process all over again in a fresh manner?


While learning to set up mysql replication, we may want to start all over again from scratch by dropping the mysql databases and tables.

On both the master and slave servers, stop the mysqld service first
# service mysqld stop

On both the master and slave servers, remove the files in the following folders

#rm /var/lib/mysql/*
#rm /var/log/mysql/*

Ensure that [mysqld] section in mysql configuration file, /etc/my.cnf has just the following entries in both the  master and slave servers

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

Now start the mysql service in both master and slave server

# service mysqld start

This will initialize the mysql database and repopulate the directories /var/lib/mysql/ and /var/log/mysql/.
Then proceed as in Section 2.

Friday, February 22, 2013

SSH taking too long to connect

When I tried to SSH to a machine, it was taking too long to connect.

So, from the client, I tried the ssh -v option to check where the connection was hanging, and found the following

# ssh -v 192.168.1.33
.......
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password
debug1: Next authentication method: gssapi-keyex
debug1: No valid Key exchange context
debug1: Next authentication method: gssapi-with-mic
debug1: An invalid name was supplied
Cannot determine realm for numeric host address

debug1: An invalid name was supplied
Cannot determine realm for numeric host address

debug1: An invalid name was supplied
debug1: An invalid name was supplied


So GSSAPI authentication was causing the delay in getting the login prompt.

How to avoid the GSSAPI authentication delay?

1) Disable GSSAPI authentication when using SSH or SCP command as follows

         ssh -o GSSAPIAuthentication=no 192.168.1.33


2) Disable GSSAPI authentication in ssh client configuration file. Add the line

        GSSAPIAuthentication  no

in /etc/ssh/ssh_config file (global ssh client configuration file) (OR)
in /home/<user>/.ssh/config file (Per user configuration file)
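For example, to turn it off only for a particular address range, a per-user ~/.ssh/config entry might look like this sketch (the Host pattern is illustrative):

```
Host 192.168.1.*
    GSSAPIAuthentication no
```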


3) Add the IP address of the server we are connecting to in the /etc/hosts file of the client machine, as follows

        192.168.1.33  dhcppc1
  

Thursday, February 21, 2013

Network bonding in Linux

Network bonding means combining two or more network interface cards into a single bonded interface. This helps with high availability as well as with increasing the available bandwidth. Note, however, that bonding multiple network cards together does not automatically give you both doubled bandwidth and high availability when a link goes down - that depends on the bonding mode. The network cards are bonded as slaves to the logical bond interface.

Ethernet channel Bonding modes fall under three categories

  • Modes that require switch support
  • Generic modes which do not require switch support
  • Modes that provide fail-over only
Modes that require switch support

The switch has to support 802.3ad; only then is it possible to aggregate the bandwidth of all the physical NICs.

balance-rr (Mode 0) -
   Packets are transmitted in round-robin fashion without hashing. Say there are two NIC cards bonded: if two packets arrive at the bonded interface, the first packet is sent to the first slave and the second packet to the second slave. If a third packet arrives, it is sent to the first slave again, and so on. Thus, this mode provides true load balancing.

802.3ad (Mode 4) -
   This mode is the official standard for link aggregation. It creates aggregation groups that share the same speed and duplex settings. This mode requires a switch that supports IEEE 802.3ad dynamic link aggregation.

balance-xor (Mode 2) -
    In this mode the traffic is hashed: the source MAC address is XOR'ed with the destination MAC address. This selects the same slave for each destination MAC address, providing load balancing and fault tolerance. Traffic is thus balanced according to the receiver on the other end.
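To see how the XOR hash picks a slave, here is a toy calculation (a simplification of the driver's default layer-2 policy; the MAC bytes are made up for illustration):

```shell
# Toy illustration of balance-xor slave selection:
# slave index = (last byte of source MAC XOR last byte of destination MAC) mod slave_count
src_mac_last=0x7e    # e.g. a source MAC ending in :7e
dst_mac_last=0x22    # e.g. a destination MAC ending in :22
slave_count=2

slave_index=$(( (src_mac_last ^ dst_mac_last) % slave_count ))
echo "traffic to this destination uses slave $slave_index"
```

Because the hash depends only on the MAC pair, all traffic to a given destination always leaves through the same slave.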

 Note : The modes requiring switch support can be run back-to-back with crossover cables between two servers as well. This is especially useful, for example, when using DRBD to replicate two partitions.

Generic bonding modes

These modes require NIC cards that support changing the MAC address on the fly. Say there are two NIC cards bonded: the NICs will constantly swap their MAC addresses to trick the other end (be it a switch or another connected host) into believing that it is sending traffic to the same network card. If the switch sends a packet to the first slave, it will try to send the second packet to the same card again. Upon receiving the first packet, the first slave swaps its MAC address with the second slave. Since the switch identifies the NIC cards by MAC address, it sends the second packet to the second slave, thinking it is the card to which it sent the previous packet.

broadcast (mode 3) -
   It simply broadcasts all traffic out of all interfaces. This mode is the least used.

balance-tlb (mode 5) -
     This is called transmit load balancing. Outgoing traffic is load balanced, but incoming uses only a single interface. The driver will change the MAC address on the NIC when sending, but incoming always remains the same. The outgoing traffic is distributed according to the current load and queue on each slave interface (computed relative to speed on each slave). Incoming traffic is received by the current slave. If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave.

balance-alb (mode 6) -
     This is called adaptive load balancing. Both sending and receiving frames are load balanced using the changing-MAC-address trick. The bonding driver intercepts the ARP replies sent by the server on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond.

High Availability (failover-only)

Let us examine the failover part of NIC bonding. If both NICs of a bond are connected to the same switch, then we are down whenever that switch goes down or is rebooted for a firmware upgrade.

So the best approach is to connect the NICs in a bond to different switches. This is possible only for the generic bonding modes which do not require switch support; for the modes that require switch support, it is not possible on most devices.

active-backup (mode 1) -
   This mode places one of the interfaces into a backup state and makes it active if the other, active interface goes down. The bond's MAC address is externally visible on only one port, to avoid confusing the switch.

So there are seven types of ethernet bonding.  

Kernel module for bonding must be loaded for this bonding purpose.

The most commonly used bonding modes are
1) balance-rr or 0
2) active-backup or 1
3) balance-xor or 2

Configuring  Ethernet bonding interface in CentOS 

Requirements for ethernet bonding are
  • eth0 - The first network card
  • eth1 - The second NIC
  • bond0 - The bonding device created by the entry in /etc/modules.conf (for an older 2.4 kernel) or in /etc/modprobe.d/bonding.conf(for a 2.6 kernel)
Note : The bond will always take the MAC address of the eth0 NIC card

In /etc/modprobe.d/bonding.conf file, add the following line

       alias bond0 bonding
  • The alias line will associate the bond0 network interface with the bonding module
In CentOS, the script /etc/init.d/network will bring up the network interfaces, eth0 and eth1, by default. So we will add following entries in /etc/rc.local to set up the bonding
  • bring these interfaces, eth0 and eth1, down first
  • clear the existing route associated with eth0
  • enable bonding by specifying the bonding mode (0..6) using the modprobe command
  • associate the bonding interface, bond0, with MAC address of the eth0 NIC card
  • add an IP address for the bond0 interface
  • bring up the bonding interface, bond0
  • add the NICs, eth0 and eth1, as slaves to the bonding interface, bond0, using ifenslave command
  • Bring up the interfaces, eth0 and eth1
  • Set route for bond0 devices
Let us illustrate this by setting up failover bonding - mode 1

Bonding to introduce failover - mode 1: (HA)

Add the following lines in /etc/rc.local file

ip link set dev eth0 down
ip link set dev eth1 down
ip route del 0/0 via 192.168.1.1 dev eth0
modprobe bonding mode=1 miimon=100 downdelay=200 updelay=200
ip link set dev bond0 addr 00:80:c8:e7:ab:5c
ip addr add 192.168.1.33/24 dev bond0
ip link set dev bond0 up
ifenslave bond0 eth0 eth1
ip link set dev eth0 up
ip link set dev eth1 up
ip route add 0/0 via 192.168.1.1 dev bond0 


 #To add an entry for ip address and hostname in /etc/hosts,
ip addr | grep global | awk -F'/' '{ print $1 }' | awk '{ print $2,"\t","\t","dhcpcc5","dhcpcc5" }' | awk '{system("sed "'NR'"i\""$0"\" -i /etc/hosts")}'
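That one-liner is dense; a simpler alternative is a small helper that appends an address/hostname pair to a hosts file, skipping duplicates (the hostname and the temporary file are illustrative - point it at /etc/hosts on a real system):

```shell
# Append an "ip<TAB>hostname" entry to a hosts file, unless it is already there
add_host_entry() {
    ip=$1; name=$2; file=$3
    grep -q "[[:space:]]$name\$" "$file" 2>/dev/null || printf '%s\t%s\n' "$ip" "$name" >> "$file"
}

# Demonstrate against a temporary file instead of the real /etc/hosts
hosts_file=$(mktemp)
add_host_entry 192.168.1.33 dhcpcc5 "$hosts_file"
add_host_entry 192.168.1.33 dhcpcc5 "$hosts_file"   # second call is a no-op
cat "$hosts_file"
rm -f "$hosts_file"
```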

Reboot the machine

Once the machine is up, check the following

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200

Slave Interface: eth0
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:1c:c0:3e:4b:7e
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:19:5b:6b:53:22
Slave queue ID: 0

So the bonding mode is, fault tolerance, and the interface eth1 is the active slave behind the bond0 interface.
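The active slave is easy to extract with a one-liner; a sketch fed with a trimmed sample of the /proc output above (on a live system, read /proc/net/bonding/bond0 directly):

```shell
# Pull the "Currently Active Slave" field out of the bonding status
sample='Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up'

active=$(printf '%s\n' "$sample" | awk -F': ' '/Currently Active Slave/ {print $2}')
echo "active slave is $active"
```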

# ifconfig
bond0     Link encap:Ethernet  HWaddr 00:1C:C0:3E:4B:7E 
          inet addr:192.168.1.33  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::21c:c0ff:fe3e:4b7e/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:9758 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7774 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:5579137 (5.3 MiB)  TX bytes:2026631 (1.9 MiB)

eth0      Link encap:Ethernet  HWaddr 00:1C:C0:3E:4B:7E 
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:1320 errors:0 dropped:0 overruns:0 frame:0
          TX packets:23 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:104657 (102.2 KiB)  TX bytes:3815 (3.7 KiB)
          Interrupt:29 Base address:0xe000

eth1      Link encap:Ethernet  HWaddr 00:1C:C0:3E:4B:7E 
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:8438 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7751 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5474480 (5.2 MiB)  TX bytes:2022816 (1.9 MiB)
          Interrupt:17 Base address:0x2000

lo        Link encap:Local Loopback 
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:210 errors:0 dropped:0 overruns:0 frame:0
          TX packets:210 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:23293 (22.7 KiB)  TX bytes:23293 (22.7 KiB)

venet0    Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 
          inet6 addr: fe80::1/128 Scope:Link
          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
          RX packets:660 errors:0 dropped:0 overruns:0 frame:0
          TX packets:834 errors:0 dropped:3 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:88089 (86.0 KiB)  TX bytes:75317 (73.5 KiB)

# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.2.101   0.0.0.0         255.255.255.255 UH    0      0        0 venet0
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 bond0
0.0.0.0         192.168.1.1     0.0.0.0         UG    0      0        0 bond0


# ip route show
192.168.2.101 dev venet0  scope link 
192.168.1.0/24 dev bond0  proto kernel  scope link  src 192.168.1.33
default via 192.168.1.1 dev bond0

To verify whether the failover bonding works:

  Bring down eth1 and check the "Currently Active Slave" field in /proc/net/bonding/bond0.

  Do a continuous ping to the bond0 IP address from a different machine and bring down the active slave interface, eth1. The ping should not break.