Optimizing the Linux Kernel for a Hadoop Environment

Coming from an infrastructure background as a system administrator, where I learned how to harden services like web servers, mail servers, etc., I was curious to find out how to further fine-tune the kernel to improve the performance of Hadoop filesystems, and that is what led me to write this. Of course, this is only one part of performance tuning, alongside the hardware, how it was set up, single-node versus cluster setup, network setup, etc. This post only tackles how to further fine-tune the Linux kernel.

1. Swappiness – Hadoop is largely built on Java, which mostly consumes memory rather than reading from and writing to files on disk. Swappiness is a kernel parameter that controls when and how aggressively the kernel uses the swap file. From what I understand, we need to avoid out-of-memory conditions while keeping Hadoop's data in RAM. Swappiness is set to 60 by default and can be configured from 0 to 100; setting it to 100 maximizes the use of swapping and causes a huge performance hit. We want to set swappiness to 0, so that the kernel only swaps when it is almost out of memory.

echo 0 > /proc/sys/vm/swappiness

To configure it so that the setting is applied automatically after a reboot:

echo "vm.swappiness = 0" >> /etc/sysctl.conf
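To apply everything in /etc/sysctl.conf right away instead of waiting for a reboot, and to confirm the new value, the following should work (run as root):

sysctl -p
cat /proc/sys/vm/swappiness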

2. Tuning the dirty_ratio parameter. Dirty pages are pages in memory that have been modified but not yet written back to disk; vm.dirty_ratio controls what percentage of total memory may be filled with dirty pages before the processes doing the writes are forced to flush them to disk.

echo 'vm.dirty_ratio=10' >> /etc/sysctl.conf
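As with swappiness, the entry above only takes effect after sysctl -p or a reboot; to change the value on the running kernel immediately:

sysctl -w vm.dirty_ratio=10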

3. The fs.file-max parameter increases the number of open file handles the kernel can allocate system-wide. By default, 1024 is the number of open files a user can use, and in some cases this produces the Java error "java.io.FileNotFoundException: (Too many open files)". You can either raise the per-process limit, e.g.

ulimit -n 4096

or raise the system-wide limit as a kernel parameter:

echo 'fs.file-max = 943718' >> /etc/sysctl.conf
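To check how close the system is to its limits, and to make a higher per-user limit survive new logins, an entry in /etc/security/limits.conf is the usual approach; the hdfs user name and the value below are only examples, adjust them to your cluster:

cat /proc/sys/fs/file-nr   # allocated handles, free handles, system-wide maximum
ulimit -n                  # per-process limit of the current shell

# example /etc/security/limits.conf entry
hdfs    -    nofile    32768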

4. net.core.netdev_max_backlog increases the length of the processor input queue, i.e. the maximum number of packets queued on the receive side when a network interface hands packets to the kernel faster than it can process them.

echo 'net.core.netdev_max_backlog = 30000' >> /etc/sysctl.conf
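To check whether the input queue is actually overflowing before and after the change, look at /proc/net/softnet_stat; each line is one CPU and the second column is a (hexadecimal) count of packets dropped because the backlog queue was full:

cat /proc/net/softnet_stat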

5. At some point, most people agree on EXT4 as the filesystem of choice for a Hadoop environment. One performance tweak they use is the noatime option in /etc/fstab, which improves performance by no longer updating the access time on every read, reducing unnecessary write I/O. A typical entry looks like this (the device and mount point are only placeholders):

/dev/sdb1   /data/hdfs   ext4   defaults,noatime   0 0
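noatime can also be applied to an already mounted filesystem without a reboot; again, the mount point is just an example:

mount -o remount,noatime /data/hdfs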

6. Network parameter tuning carries some risk, such as connectivity loss, and you need to be careful when tuning anything network-related since it is also quite hard to debug. That said, raising the socket listen backlog below may improve the performance of connections between the master and slave nodes.

echo 'net.core.somaxconn=1024' >> /etc/sysctl.conf
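To confirm the value and see what backlog listening sockets actually get, ss can help; for a listening socket it reports the current accept-queue length in Recv-Q and the configured backlog in Send-Q:

sysctl net.core.somaxconn
ss -ltn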
