ISU Tier 3 Management
Introduction
This page contains instructions/information for the Tier3 setup at Iowa State University (ISU). We've attempted to write the guide at a beginner level so that newer students can take over the position with relative ease.
New managers of the ISU T3 should join the e-group atlas-adc-tier3-managers@cern.ch. Through this group you can ask questions of other T3 managers and get information on when certain systems need to be updated. Another helpful contact for us has been Doug Benjamin at ANL. ISU T3 users will find all the information they need at the ISU T3 Users page.
Another useful resource is the T3 Setup TWiki
here. Unfortunately this TWiki is filled with outdated links and information, so don't be surprised to find differences between our T3 and the instructions - the upgrade from SL5 to SL6 came with many changes that weren't backward compatible.
Finally, new admins should add their preferred email address to the /etc/aliases file on the interactive nodes under 'root' (see the example shown below). The list is comma separated; any messages for the root user (errors, certificate expiration warnings, and warnings when users try to sudo without permission) will be sent to every address on it.
...
# At the bottom of /etc/aliases
# Person who should get root's mail
#root: marc
root: mdwerner,pluthd@gmail.com
Be sure to separate each email address with a single comma (even a single space will prevent the system from recognizing the address). After this file has been updated you must additionally run the newaliases command to push your changes into the system.
newaliases
To test that it has worked properly, open a new terminal (not as root) and try to issue a sudo command. It should respond with "<username> is not in the sudoers file. This incident will be reported." and immediately send an email.
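For example (newuser is a hypothetical account without sudo rights):
sudo ls
newuser is not in the sudoers file. This incident will be reported.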
We are attempting to create detailed documentation for the setup, configuration, and testing of all services on the T3. It is located in the GitLab group https://gitlab.cern.ch/ISUT3Management . New administrators should request access to the repository.
TODO List
There are many tasks which require a fairly large time commitment (and learning curve) and which must eventually be done by the T3 admin(s) :
- Upgrade head1 to SL6 and reinstall all software
- Reallocate unused space (previously used by virtual machines)
- Synchronize all repositories
- Create administrator scripts to configure/test each system
- Automatically set up the ATLAS environment by default for all users
- Repair all firewalls
- Update installed version of ROOT (currently running v5)
- Set up EOS
- Set up the Condor Manager on the second head node so that jobs will submit faster and will not cause subsequent commands to hang
- Set up user emails for all user accounts (i.e. the system should send mail to each user's corresponding ISU email address)
SSL Certificate
The interactive nodes use self-signed SSL certificates to encrypt communication with users through a web browser (much the way that ssh keys are used to encrypt communication via SSH). Because the certificates are self-signed (and not issued by a certificate authority) they do not protect users from man-in-the-middle attacks.
Renewal
SSL certificates (even self-signed ones) need to be renewed from time to time. When this happens the system will send an email to the root user account (which admins can sign up for using the /etc/aliases file as outlined in the Introduction).
Simply use the following command
genkey --renew hep-int1.physics.iastate.edu
This command will open an interface through which you will have to select the level of encryption (just choose the recommended options) and the location to put the public and private keys.
The process of generating the keys takes several minutes. Afterwards you must restart the apache service
/etc/init.d/httpd restart
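To check when the current certificate expires without renewing it, openssl can read the expiry date directly (the certificate path here is an assumption based on genkey's usual output location on SL6 - adjust it to wherever the keys were actually placed):
openssl x509 -noout -enddate -in /etc/pki/tls/certs/hep-int1.physics.iastate.edu.crt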
LDAP
LDAP (Lightweight Directory Access Protocol) is the service which contains all of the usernames and passwords for the HEP group members with accounts, and it is used by all of the other machines to authenticate those users (as such, if the LDAP service goes down, nobody will be able to log in to the machines). The service currently runs on head2 (192.168.1.4); it previously ran on a virtual machine at 192.168.1.30, so if you see any legacy code that is the reason for the difference.
Starting/Stopping LDAP
To start and stop the LDAP service on any machine, simply use the command
service slapd <start/stop/restart/status>
On the head node the LDAP authentication service is called nslcd. If you cannot log in to the head node then it is likely that this service is either not running or having problems.
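A quick first check when logins fail is therefore the status of both daemons (nslcd on the node you are trying to reach, slapd on head2):
service nslcd status
service slapd status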
Misc Commands
Backing up and importing LDAP
The steps for LDAP migration and reinstallation are listed below.
Steps before the OS upgrade (on the SL5 machine)
- stop slapd process and make a backup of database (see the script ldap_database_backup.sh)
- copy the database away from the machine
- copy away the old configuration file /etc/openldap/slapd.conf
- copy away the cacerts from /etc/openldap/cacerts/*
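For reference, a minimal sketch of what the backup step amounts to (an assumption based on the script's name - the actual ldap_database_backup.sh in the gitlab repository is authoritative):
# stop the server so the database is not being written, then dump it to LDIF
service slapd stop
slapcat -v -l ldap_backup.ldif
gzip ldap_backup.ldif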
Steps after the OS upgrade
- copy over the old ldap database and the old configuration file, then unpack and import them
gunzip ldap_backup.ldif.gz
/usr/sbin/slapadd -v -f old_slapd.conf -l ldap_backup.ldif
see the Red Hat guide for more steps : https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Migration_Planning_Guide/sect-Migration_Guide-Security_Authentication-LDAP.html
- Make sure slapd is not running
service slapd stop
- clean out the default area
rm -rf /etc/openldap/slapd.d/*
- move the old configuration into the correct place
slaptest -f /root/old_slapd.conf -F /etc/openldap/slapd.d
- set ownership and protections
chown -R ldap:ldap /etc/openldap/slapd.d
chmod -R 000 /etc/openldap/slapd.d
chmod -R u+rwX /etc/openldap/slapd.d
chown -R ldap:ldap /etc/openldap/cacerts/*.pem
Steps to make this warning message go away :
bdb_db_open: warning - no DB_CONFIG file found in directory /var/lib/ldap: (2).
Expect poor performance for suffix "dc=physics,dc=sunysb,dc=edu".
bdb_monitor_db_open: monitoring disabled; configure monitor database to enable
- copy the example DB_CONFIG file
cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG
- Run the index command to fill the db config file (the warning may still be present at this point)
slapindex -v -F /etc/openldap/slapd.d/
- set the ownership of the DB_CONFIG file
chown ldap:ldap /var/lib/ldap/DB_CONFIG
- rerun the slapindex command, then set the ownership of the database files
chown ldap:ldap /var/lib/ldap/*
User forgot password
In the event that a user forgets their T3 password, the admin can overwrite it for them. (You should tell them to set a new password for themselves immediately, for the sake of security.)
- ldappasswd -H ldap://head2 -x -D "cn=root,dc=isuhep,dc=lan" -W -S "uid=username,ou=people,dc=isuhep,dc=lan"
This command will ask you twice for the new password, and then for the root (bind) password, after which the user's password will be reset.
Initial Setup
If you ever need to set up LDAP authentication on another machine, the following links will be helpful. We use nss-pam-ldapd which can be installed easily with yum.
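A minimal sketch of that setup, assuming our LDAP server head2 and the dc=isuhep,dc=lan base DN used elsewhere on this page:
yum install nss-pam-ldapd
authconfig --enableldap --enableldapauth --ldapserver=ldap://head2 --ldapbasedn="dc=isuhep,dc=lan" --update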
Giving sudo permission
To add a user to the list of sudoers, log in as root and edit the /etc/sudoers file (ideally with visudo, which checks the syntax before saving). The line which grants a user full access looks like
mdwerner ALL=(ALL) ALL
IPTABLES
iptables is a Linux firewall program, used to block/allow connections from specified IP addresses or ports. Certain programs (such as condor) require certain ports to be open on machines, so rules need to be added to the iptables configuration accordingly.
The NetworkManagement git repository contains a copy of the iptables configuration files for each of the machines (named accordingly).
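As an illustration, opening the default condor port (9618, from the Used Ports list below) and saving the rule so it survives a reboot looks like this on SL6:
iptables -A INPUT -p tcp --dport 9618 -j ACCEPT
service iptables save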
SQUID
SQUID is a caching web proxy; we use it primarily to cache CVMFS data. Our SQUID server is located at 192.168.1.61, and that is the only service we use that machine for.
CVMFS
The following links may help the new user understand the CVMFS system.
CVMFS (CERN Virtual Machine File System) allows our T3 to access the CERN file system which contains all of the software we typically use (panda, root, AnalysisBase, athena, etc.). If you see errors saying that any of these things do not exist then it is likely that there is some issue with CVMFS or the SQUID. In our experience simply restarting the service (as explained below) on all of the workers is sufficient to fix the issue.
Restarting CVMFS
CVMFS repositories are mounted on demand by the autofs service. As such CVMFS can be stopped/started/restarted with the command
service autofs <stop/start/restart/status>
Sometimes (for reasons I don't understand) the above command is insufficient. In those cases the following commands tend to succeed
cvmfs_config chksetup
cvmfs_config reload
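To verify that the repositories can actually be mounted afterwards, cvmfs_config also provides a probe command:
cvmfs_config probe atlas.cern.ch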
Misc Errors
If you get the following error message
/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/utilities/validator.sh: line 369: echo: write error: No space left on device
it is likely due to the /tmp directory having run out of space. Clear out the directory and try again.
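To confirm the diagnosis and clean up, something like the following should work (the 7-day cutoff is an arbitrary choice - be careful not to delete files that running jobs still need):
df -h /tmp
find /tmp -type f -mtime +7 -delete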
HTCondor
The condor system allows us to run up to 154 jobs in parallel (14 on each worker node + 14 on hep-int2). It is very common when many jobs are being submitted at once for the condor service to output a message such as
-- Failed to fetch ads from: <192.168.1.1:9230?addrs=192.168.1.1-9230> : hep-int1.physics.iastate.edu
SECMAN:2007:Failed to end classad message.
These messages will stop once all jobs have been added to the queue. Until that time any condor commands (such as condor_q) will just output this message over and over.
A good talk about how condor decides priority was given here. Another helpful link for management information is here. Finally, the extremely long (but informative) HTCondor manual is here.
The condor system runs several daemons
- condor_master controls all of the other daemons
- condor_startd is responsible for executing jobs
- condor_schedd is responsible for submitting jobs
- condor_collector is responsible for gathering information from the machines (runs only on head2)
- condor_negotiator assigns jobs to machines (runs only on head2)
The most useful debug info for startup issues is located in /var/spool/condor/StartLog.slotX on each machine. This log file contains the list of all config files the machine has read in (and the order in which they are read).
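For example, to look at the most recent entries for the first slot on a worker:
tail -n 50 /var/spool/condor/StartLog.slot1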
To turn off the condor system use condor_off. If no condor daemons are running then you will need to start condor_master before the condor_on command will be available.
Commands
- To view the status of the condor system use
condor_status
- To restart the condor system use
condor_restart
(This command must be run as root on the head node, head2)
Config Files
There are several config files for condor. Local versions are kept in /etc/condor/, however for quickly changing condor configs across workers you should edit only those in /export/share/condor-etc/ (note that this directory is read-only outside of the NFS server, so you'll have to ssh into nfs1 in order to edit the files).
Once a config file has been changed you must run condor_reconfig -all to propagate the changes to all machines (this command needs to be run on head2).
The config file can be edited to change the number of "slots" available on each machine. This may be necessary if some jobs require large amounts of memory. By default condor allows 1 slot for each processor (set to 15 for each of our workers). However, with 24GB of RAM on each worker, if any job requires 2GB or more the number of slots needs to be reduced (memory failures tend to not produce any error messages - they simply kill the jobs).
To do this, simply add the following lines to the /export/share/condor-etc/condor.config.worker.local file
# SLOT_TYPE_1 defines a slot which can use half of the memory on the machine,
# for jobs with a high memory requirement (other options include cpus=2)
SLOT_TYPE_1 = mem=50%
NUM_SLOTS_TYPE_1 = 2
# The total number of slots that should be advertised by this machine
NUM_SLOTS = 2
You will also see in the config files a list of which daemons the condor_master should start and monitor.
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
If you ever have to change the configuration, make sure that the correct daemons are being run on the correct machines (as described above).
The START expression determines when the machine is willing to start a job. If START = FALSE then no jobs will be accepted by that machine.
The following settings are for a machine which always runs jobs, with no suspension or continuations (this corresponds to our setup at ISU)
START = True
RANK =
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
DNS
The local DNS (Domain Name System) server is head2. The DNS is responsible for directing traffic (and confirming that connections are being formed with the correct machines). If you find that you are unable to access any of the machines OR that logging into them is taking a long time, there may be an issue with the DNS. The DNS server must be listed as a nameserver in the /etc/resolv.conf file on each machine.
To start/restart the DNS service (again, remember to do this on head2)
service dnsmasq <start/stop/restart/status>
When attempting to diagnose DNS issues it is often useful to monitor the logfile
tail -f /var/log/dnsmasq
For more verbose entries into the logfile, edit the /etc/dnsmasq.conf file, uncommenting the line log-queries.
By default the DNS uses the local machine's /etc/hosts file as well as /etc/resolv.conf.
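To test that the DNS is answering, query it directly (the name and server address here are taken from the machine list below):
nslookup worker1.isuhep.lan 192.168.1.4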
SSH
The SSH service (sshd) provides remote login to each of the machines.
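It can be controlled like the other services:
service sshd <start/stop/restart/status>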
NFS
The Network File System (NFS) contains all of our home directories (under /export/home). This network storage is automatically mounted on every machine. However if, for whatever reason, you need to access the host machine directly, it is at 192.168.1.5 and has the nickname 'nfs1'.
The NFS has two partitions - a 130GB volume in RAID-1 configuration called "system" and a 20 TB volume in RAID-5 configuration named "data" (hardware RAID).
To view information about the RAID volumes it is necessary to use a Dell OpenManage tool
omreport storage vdisk
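To check how full the mounted volumes are from any machine, df works as usual:
df -h /export/home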
Clearing up space
Larger files need to be stored on the xrootd system, but oftentimes we need to have enough free space on the nfs for downloads. In such a case you should encourage users to remove any files they no longer need or transfer them to the xrootd system for longer-term storage. The following commands may be useful in identifying the worst offenders :
- du --max-depth 0 -h * | grep '[0-9\.]\+G'
This command checks the size of each local file/directory and displays them only if the size is in GB (obviously switch to a T if you believe there are any TB offenders). It can take a while to run but the obvious offenders will be identified.
Another useful command for just checking the size of files and directories is
du -sh <directory>
The -s option summarizes the result (rather than recursing down the whole directory structure).
XRootD
The following links may help the new user understand the XRootD system. As of Aug 5, 2016 the version of XRootD used on the ISU T3 is v4.2.3 (this is the latest version available for SL5).
XRootD is a system for storing large datasets across several different machines. At ISU our workers have a combined storage space of ~50 TB which is only accessible via xrootd. All interactions with the xrootd system should occur through the head node (head2). The logs for this system are located in /var/log/xrootd/.
The xrootd system is mounted under /mnt/xrootd/ on all machines, and it can be accessed through this mount as though it were a single volume (this is a convenient alternative to using the xrootd commands outlined below). There is a single file in the /local/xrootd/a/ directory on each machine with the name of that machine. This makes it very easy to check via the mount that each machine is visible to the xrootd system.
- xrdcp [-r] <file> root://head2//local/xrootd/a/<username>/<directory>
- Note that the double slashes seem to be important for the command to work as intended. The [-r] option will recurse through the directory given and copy all contents.
- xrdfs root://head2 <command> /local/xrootd/a/<username>/<directory>
- The <command> can be ls or mkdir, however the system will not recognize wild cards.
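As a concrete (illustrative) example - the username and directory names here are made up - creating a directory, copying a file in, and listing it:
xrdfs root://head2 mkdir /local/xrootd/a/newuser/mydata
xrdcp myfile.root root://head2//local/xrootd/a/newuser/mydata/myfile.root
xrdfs root://head2 ls /local/xrootd/a/newuser/mydata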
Known Problems
error: error running non-shared postrotate script for /var/log/xrootd/cmsd.log of '/var/log/xrootd/*/*.log /var/log/xrootd/*.log
- This problem was solved by X
In the logfile /var/log/xrootd/cmsd.log
"Copr. 2007 Stanford University/SLAC cmsd.
++++++ anon@hep-head2.physics.iastate.edu phase 1 initialization started.
=====> all.export /local/xrootd/a stage
=====> all.manager head2.isuhep.lan:3121
=====> all.role manager
171004 16:27:42 514781 Config: port for this manager not specified.
------ anon@hep-head2.physics.iastate.edu phase 1 manager initialization failed.
171004 16:27:42 514781 XrdConfig: LCL port 27983 wsz=87380 (87380)
171004 16:27:42 514781 XrdProtocol: getting protocol object cmsd
171004 16:27:42 514781 XrdProtocol: Protocol cmsd could not be loaded
------ cmsd anon@hep-head2.physics.iastate.edu:-1 initialization failed."
- This problem was caused by the hostname of head2 being set improperly. To see the current hostname of the machine, simply type hostname. To change the hostname, use the command /bin/hostname head2.isuhep.lan (or similar) and restart the cmsd and xrootd services. To make this change permanent you must also add the line HOSTNAME="head2.isuhep.lan" to the /etc/sysconfig/network file.
Scientific Linux (OS)
Scientific Linux is the Operating System (OS) of choice for CERN (and the ISU T3). It is based on Red Hat Enterprise Linux (you will often see the version tags el5/el6, etc.). To determine which OS is running on a machine, use the command
uname -a
Example output :
Linux hep-int1.physics.iastate.edu 2.6.32-504.1.3.el6.x86_64 #1 SMP Tue Nov 11 14:19:04 CST 2014 x86_64 x86_64 x86_64 GNU/Linux
Note that from this output you can also see the exact version of the OS/Linux Kernel used (el6 is Scientific Linux 6).
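Alternatively, the distribution name and version can be read directly:
cat /etc/redhat-release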
Upgrading to SL7
Upgrading from SL5 to SL6 or SL6 to SL7 is not simple. In order to do this you must create an installation USB (a sketch is shown below) and install the OS from scratch (our optical drives don't seem to work, so a CD/DVD isn't an option). Before performing such an upgrade you should make note of all services running on the machine (hopefully all of them are outlined here) - they will need to be reinstalled and reconfigured once the installation is complete.
- Note that in order to boot from USB you first must enter the BIOS and change the "USB Flash Drive Emulation Type" from "Auto" to "Hard Drive". If not set then you will likely get some complaint about "isolinux.bin" being missing or corrupt.
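A sketch of writing the installation image to a USB stick from one of the linux machines (the ISO filename is illustrative, and /dev/sdX must be replaced with the actual device of the stick - check with lsblk first, since dd will destroy whatever is on that device):
dd if=SL-7-x86_64-DVD.iso of=/dev/sdX bs=4M
sync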
Programs
- Expect (yum install expect) : This program has some useful functionality for bash scripts which help in automation [spawn, expect, send]
- SSH (sshd) : This is installed by default?
- Condor (yum install condor?)
- XRootD (yum install xrootd? && yum install cmsd?)
- OpenLDAP (yum install openldap?) : Read these instructions for configuring a machine to authenticate using OpenLDAP here
List of Machines & Ports
The following is a list of all the machines which make up the ISU T3, as well as a short description of their functions. If a machine is visible from outside of ISU then its public name is provided in parentheses.
- hep-int1 (hep-int1.physics.iastate.edu) & hep-int2 (hep-int2.physics.iastate.edu) : These "interactive" nodes are the main machines for doing work on the ISU T3. They have the local IP addresses 192.168.1.1 and 192.168.1.2 respectively.
- head1 & head2 : These nodes serve as the head and backup head (respectively) of the distributed computing systems (xrootd/condor). They have the local IP addresses 192.168.1.3 and 192.168.1.4 respectively.
- worker1-9 (worker1.isuhep.lan) : These machines are the worker nodes for the distributed computing systems. They have the local IP addresses 192.168.1.6-14
- squid : This machine is a SQUID proxy server (used by CVMFS, as described above). It has the local IP address 192.168.1.61.
Used Ports
- Condor : 9618 (default) [I've also seen 9619 and 9620 (TCP) mentioned, so I've allowed them through the firewall too?]
- Xrootd : 1094
Common Ports
- SSH : 22
- HTTP/HTTPS : 80 / 443
- DNS : 53
- SNMP : 161/162
- LDAP : 389
Network Tools
The following programs are useful tools in Linux for working with a network. Network configuration can be found in the /etc/sysconfig/network-scripts directory (there is a separate file for each network interface).
Routing
Many problems with the T3 end up being due to some network configuration issue: a network switch bugging out, a gateway being incorrectly set (common if the computers have rebooted and a previous fix wasn't made permanent), or something like that.
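When you suspect a routing problem, start by inspecting the routing table, and re-add the default gateway if it is missing (the gateway address below is a placeholder - use the correct one for our subnet):
ip route
route add default gw 192.168.1.254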
Past Issues
- On Nov 28th, 2017 the machines could still access the internet but couldn't ping each other or log in. The machines would periodically print a message "NIC Copper Link is Down". This problem was fixed by simply power-cycling the network switches, however if this happens again it should be considered a sign that the network switch needs to be replaced.
Checking if a machine is up (ping)
To check if a machine is up you can simply use the command
ping <machine_address>
Checking Running Programs
To get a list of the programs currently running on the machine run
ps aux
. To reduce the size of this output it is likely a good idea to pipe it to grep and search for the particular processes you have in mind, i.e.
ps aux | grep xrootd
which will return something like
/usr/bin/cmsd -k 7 -l /var/log/xrootd/cmsd.log -c /etc/xrootd/xrootd-clustered.cfg -b -s /var/run/xrootd/cmsd-default.pid -n default
xrootd 25680 0.0 0.0 111128 3156 ? Sl 13:59 0:00
/usr/bin/xrootd -k 7 -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-clustered.cfg -b -s /var/run/xrootd/xrootd-default.pid -n default
Checking Running Services
To get a list of the status of all services, simply use the command
service --status-all
This will output a list of all services offered on the machine, as well as information about the state of the service (on, off, failure).
ATLAS Event Display
There is an ATLAS event display just outside Jim's office. It is run by a small computer running Windows 7. The viewer opens automatically when the computer starts up, so most of the time any issues can be solved by simply restarting the machine. For some (unknown) reason, the computer occasionally dumps its memory.
Test Results as of Sep 2nd, 2016
- Used Windows Memory Diagnostics Tool (mdsched.exe) : No memory errors detected.
- This program restarts the computer and checks the RAM. To access the results of the test after the computer restarts, go to Event Viewer -> Application Services -> Microsoft -> Windows and look for "Memory".
- Ran (sfc /scannow) : No integrity violations detected
- This program validates the Windows system files and replaces/reverts any missing/broken ones if it can.
- If this program can't repair the system files the easiest solution is to simply reinstall the OS.
- Ran (chkdsk /f /r) : Volume is clean.
- This command checks the hard drives for any bad sectors (which could be responsible for system files being corrupted).
-- MichaelDavidWerner - 2016-08-05