NP04 DAQ Computing Optimization

Under Consideration

  • vm.dirty_background_bytes=0x1400000 # 20 MB
  • vm.dirty_bytes=0x10000000 # 256 MB
  • Turn off Page Table Isolation, apparently via a kernel boot parameter in GRUB: nopti or pti=off (see the sketch below)
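
A minimal sketch of how these might be applied, assuming a sysctl drop-in file and the stock CentOS 7 GRUB workflow (file names are illustrative; none of this has been applied yet):

# /etc/sysctl.d/90-np04-writeback.conf (illustrative name)
# decimal equivalents of the hex values listed above;
# note that setting the *_bytes values overrides the *_ratio values currently in use
vm.dirty_background_bytes = 20971520   # 0x1400000 = 20 MB
vm.dirty_bytes = 268435456             # 0x10000000 = 256 MB

# Page Table Isolation: append nopti (or pti=off) to GRUB_CMDLINE_LINUX in
# /etc/default/grub, then regenerate the config (BIOS path shown) and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg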

Turn Off Swap (Done)

  • swapoff -a
  • Removed the swap entry from /etc/fstab (see the sketch below).
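
For reference, the steps amount to the following (the device name in the fstab entry is illustrative):

swapoff -a
# remove or comment out the swap entry in /etc/fstab, e.g.:
# /dev/mapper/centos-swap  swap  swap  defaults  0 0
free   # the Swap line should now read all zeros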

np04-srv-001

[root@np04-srv-001 ~]# free
              total        used        free      shared  buff/cache   available
Mem:      131742684     2644324    71160708       76184    57937652   128086408
Swap:             0           0           0

np04-srv-002

[root@np04-srv-002 ~]# free
              total        used        free      shared  buff/cache   available
Mem:      131742684     1455064     4537820       58880   125749800   129419328
Swap:             0           0           0

Identical Kernels (Done)

Set kernel to 3.10.0-693.21.1.el7.x86_64 on all DAQ computers.

  • Use yum to install the kernel
  • Select the kernel with grub2-set-default (see the example below)
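
A sketch of the commands, assuming the stock CentOS 7 GRUB2 setup (the menu entry title is illustrative and should be taken from the local grub.cfg):

yum install kernel-3.10.0-693.21.1.el7.x86_64
# list the available menu entries and pick the matching title
awk -F\' '/^menuentry/ {print $2}' /boot/grub2/grub.cfg
grub2-set-default 'CentOS Linux (3.10.0-693.21.1.el7.x86_64) 7 (Core)'
reboot
uname -r   # should report 3.10.0-693.21.1.el7.x86_64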

np04-srv-001

[root@np04-srv-001 ~]# uname -a
Linux np04-srv-001 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

np04-srv-002

[root@np04-srv-002 ~]# uname -a
Linux np04-srv-002 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Disk Write Speeds

These adjustments apply to np04-srv-001 and np04-srv-002, which at the time were the only two data disk servers installed at EHN1.

Adjustments requested by Giovanna (December 2017)

Email received on Friday, December 8, 2017, at 11:00.

Dear Geoff,
Thanks to Wainer and the testing session we just had, I think that we can now prepare the "final" configuration of the storage servers.
Please find enclosed a summary of our tests.

1) the oflag=dsync option of dd artificially lowers the performance, forcing a synchronisation that is not really needed. Therefore, just changing the way we did measurements (conv=fsync) already brought the performance up by 25% (400 MB/s -> 500 MB/s)

2) We increased progressively the number of threads in /sys/block/md*/md/group_thread_cnt and found a reasonable plateau at 4.
~> for i in  /sys/block/md*/md/group_thread_cnt; do echo 4 > $i;done

3) We reduced the dirty_background_ratio and dirty_ratio, in order to reduce RAM utilisation (this may be tuned once we know better what will run on those nodes)
~> echo 1 > /proc/sys/vm/dirty_background_ratio
~> echo 2 > /proc/sys/vm/dirty_ratio
 
4) We set the read_ahead to 65536
~> for i in `seq 0 3`; do blockdev --setra 65536 /dev/md$i ; done

5) we increased the min/max sync speeds:
~> echo 50000 > /proc/sys/dev/raid/speed_limit_min
~> echo 5000000 > /proc/sys/dev/raid/speed_limit_max

As a result we get ~2.5 GB/s sequential write performance which is only affected partially by reading (we get ~4GB/s sequential read perf).

Last thing to decide is if we bother using RAID6 or we are happy with RAID5. 
RAID 5 will give us a small performance increase (5-10%) as well as disk space increase (8%), while RAID 6 allows us to not lose data even if a disk breaks while we are recovering from a disk failure. Since disks don’t die like flies and our data are meant to be short-lived anyway, I would tend to go for RAID5.

The configuration has 4 independent devices with 12 disks each (one of which declared as spare, such that rebuilding will start without human intervention).

Just for reference, the creation command can be (for 1 device):
~> mdadm --create --verbose /dev/md0 --level=5 --raid-devices=11 /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak --spare-devices=1 /dev/sdal
~> mkfs.xfs /dev/md0
~> mount /dev/md0 /data0

In /etc/mdadm.conf we should specify an email address to get notified of any failures and make sure that the mdmonitor.service is running correctly.
Example:
[root@np04-srv-002 ~]# cat /etc/mdadm.conf
MAILADDR giovanna.lehmann@cern.ch

Ciao
Giovanna

Implementation

The configuration was applied via Ansible, the configuration management tool we are using.

Settings via sysctl

- sysctl:
    name: vm.dirty_background_ratio
    value: 1
    sysctl_set: yes
    state: present
- sysctl:
    name: vm.dirty_ratio
    value: 2
    sysctl_set: yes
    state: present
- sysctl:
    name: dev.raid.speed_limit_min
    value: 50000
    sysctl_set: yes
    state: present
- sysctl:
    name: dev.raid.speed_limit_max
    value: 500000
    sysctl_set: yes
    state: present
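
Once the play has run, the four values can be spot-checked in one go:

sysctl vm.dirty_background_ratio vm.dirty_ratio dev.raid.speed_limit_min dev.raid.speed_limit_max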

Settings via udev

SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md0", ATTR{md/group_thread_cnt}="4"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md1", ATTR{md/group_thread_cnt}="4"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md2", ATTR{md/group_thread_cnt}="4"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md3", ATTR{md/group_thread_cnt}="4"

SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md0", ATTR{bdi/read_ahead_kb}="65536"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md1", ATTR{bdi/read_ahead_kb}="65536"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md2", ATTR{bdi/read_ahead_kb}="65536"
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md3", ATTR{bdi/read_ahead_kb}="65536"

Status

np04-srv-001

[root@np04-srv-001 ~]# ~dsavage/np04online/bin/np04-dump-disk.sh
np04-srv-001
/sys/block/md0/md/group_thread_cnt = 4
/sys/block/md1/md/group_thread_cnt = 4
/sys/block/md2/md/group_thread_cnt = 4
/sys/block/md3/md/group_thread_cnt = 4
/proc/sys/vm/dirty_background_ratio = 1
/proc/sys/vm/dirty_ratio = 2
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw 131072   512  4096          0  60010408181760   /dev/md0
rw 131072   512  4096          0  60010408181760   /dev/md1
rw 131072   512  4096          0  60010408181760   /dev/md2
rw 131072   512  4096          0  60010408181760   /dev/md3
/proc/sys/dev/raid/speed_limit_min = 50000
/proc/sys/dev/raid/speed_limit_max = 500000

np04-srv-002

[root@np04-srv-002 ~]#  ~dsavage/np04online/bin/np04-dump-disk.sh
np04-srv-002
/sys/block/md0/md/group_thread_cnt = 4
/sys/block/md1/md/group_thread_cnt = 4
/sys/block/md2/md/group_thread_cnt = 4
/sys/block/md3/md/group_thread_cnt = 4
/proc/sys/vm/dirty_background_ratio = 1
/proc/sys/vm/dirty_ratio = 2
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw 131072   512  4096          0  60010408181760   /dev/md0
rw 131072   512  4096          0  60010408181760   /dev/md1
rw 131072   512  4096          0  60010408181760   /dev/md2
rw 131072   512  4096          0  60010408181760   /dev/md3
/proc/sys/dev/raid/speed_limit_min = 50000
/proc/sys/dev/raid/speed_limit_max = 500000

-- DavidGeoffreySavage - 2018-06-27
