Additional Information:
(plus links to other documents)
As of 15-Jul-97, please look in the AFS Cookbook
<NAME="STATUS">
AFS status checking (all servers)
- On any terminal, run /afs/.cern.ch/project/afs/etc/afsconf.pl -c
This will time out on the failing server, or at least show a drastically increased response time
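For illustration only (the exact output format of afsconf.pl is an assumption here), a healthy run lists each server with a short response time, while a failing one hangs:
> /afs/.cern.ch/project/afs/etc/afsconf.pl -c
afs20: ok (0.1 s)
afs31: ok (0.1 s)
afs42: ... (no answer - suspect this server)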
Status checking (individual server)
Log in as ops. Always log in from the server console or the console manager; do not rely on telnet sessions.
Issue the status command. This
- checks the fileserver processes on the AFS servers.
- checks the volume server processes on the AFS servers.
- checks the free disk space on the user (not project) home directories
- Another way to detect whether there is or has been a problem is to issue, from any AFS account:
bos status afsXX -long
e.g. bos status afs31 -long
Instance fs, (type is fs) currently running normally.
Auxiliary status is: file server running.
Process last started at Wed Jul 31 22:25:32 1996 (5 proc starts)
Last exit at Wed Jul 31 22:25:32 1996
Last error exit at Wed Jul 31 22:19:21 1996, by file, due to signal 11
Command 1 is '/usr/afs/bin/fileserver -m 2 -L -nojumbo'
Command 2 is '/usr/afs/bin/volserver -nojumbo'
Command 3 is '/usr/afs/bin/salvager'
Look at the last error exit time - this will tell you when the fs process last failed due to an error.
If the output of bos status afsXX -long includes...
- has core file - indicates the process failed, but this failure will not prevent the server from being restarted.
- salvaging file system - a file system salvage operation is in progress; when it finishes, the fs process will be restarted.
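For reference, these states show up in the output roughly as follows (fragments reconstructed from the bos status format shown above, not captured from a live server):
Instance fs, (type is fs) has core file, currently running normally.
Auxiliary status is: salvaging file system.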
Machine blocked/hung out of hours
Basic trouble-shooting
- run /afs/.cern.ch/project/afs/etc/afsconf.pl -c (it will block on the failing server)
- bos status afsNN (e.g. afs42);
this might indicate that the machine is actually salvaging, in which case wait 30-40 minutes for the salvage to complete
- check for no_contact alarms; try to ping suspected server
- are there any network problems? AFS needs the network to work properly
- check the console
- check the machine and its disks
- if no obvious solution is in sight and the problem looks serious (user complaints), then reboot the machine (see below)
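The first checks can be bundled into a small script; this is a minimal sketch (the script name and server list are illustrative assumptions, and note that Solaris ping does not support -c):
#!/bin/sh
# afs_night_check.sh - hypothetical wrapper for the basic out-of-hours checks
for srv in afs20 afs31 afs42; do            # illustrative server list
  echo "== $srv =="
  ping -c 1 $srv >/dev/null 2>&1 || echo "$srv: no ping reply (no_contact?)"
  bos status $srv | head -3                 # "salvaging" here means: wait 30-40 minutes
done
# run the blocking check last, since it hangs on a failing server
/afs/.cern.ch/project/afs/etc/afsconf.pl -c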
Restarting the AFS file server process
Whenever an AFS file server seems to be blocked
- users complain that a certain file is not accessible and that file is on afsXX,
- /afs/cern.ch/project/afs/etc/afs_checkservers stops just before afsXX, or shows an access time for afsXX that is greater than 20'000'000 (the unit is microseconds, i.e. 20 seconds),
- and there is no opportunity to inform the AFS team,
you can restart the AFS fileserver process using the command: /afs/cern.ch/project/afs/etc/fs_restart afsXX
where afsXX is afs32, afsdb1, or any other AFS file server. Run this command from the "console" account on a normal machine. You could run it from the file server console itself, but then you would have to "klog.krb console" first. This procedure only restarts the file server process. It is quick (much faster than a complete reboot), relatively non-intrusive in that applications usually just block briefly rather than fail, and it collects the major error logs. Always try it before a reboot; in case of a serious OS problem, however, a reboot might still help where a file server restart does not.
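A typical invocation from the "console" account looks like this (the server name is just an example):
[lxplus055] ~ > /afs/cern.ch/project/afs/etc/fs_restart afs32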
Rebooting AFS servers
- Always try a "soft" reboot first, i.e. use the afs_rundown command (see below)
- If this fails, reboot from the console. On Sun servers, connect via the console manager, send the break sequence (<CONTROL-B>), wait for the 'ok' prompt, then type sync. On Linux, hit Ctrl-Alt-F1, then Ctrl-Alt-Del.
- some machines have a reset button
- If this fails: power the machine off, wait 1 minute, power it on again.
Soft reboot of AFS servers (all except afs7X)
- First try to contact afs.support (see contact details below).
- If all the above checks have been run AND there are user complaints, try to reboot the system using the afs_rundown command from the afs console account:
~/afs_rundown -host afs44 -reboot
Rebooting AFS LINUX scratch servers (afs6X)
- These instructions are for the AFS6x machines only!
- They are real scratch servers; the difference is that the files/volumes are not backed up.
- In case of problems outside working hours (user complaints, alarms):
- soft reboot the server, i.e. ~/afs_rundown -host afs6x -reboot
- if this fails, try a reset
- if this fails, leave the machine down, mail afs.support@cern.ch and call the afs.support team at a reasonable hour
Hardware problems - calling the manufacturer
IBM servers
Currently running: afs90-afs97 with a DS4300 RAID array controller. The equipment is covered by a 24x7, 4-hour intervention time contract! In case of H/W trouble, call IBM at 0800-55-5454. You need the machine model and serial number: both are on a small black label at the front of the machine.
SUN
The equipment is covered by a 24x7, 4-hour intervention time contract! In case of trouble, call Sun; you need the serial number of the device.
Transtec disk servers: issue an ITCM ticket
How to look at the error log on a SUN
- Log in as ops on the AFS server which has problems.
- more /var/adm/messages
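To avoid paging through the whole file, recent problems can be filtered first; a sketch (the search pattern is only an illustration):
afs34> tail -200 /var/adm/messages | egrep -i 'error|warning|panic'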
Alarms
Offline volume(s) in partition(s): /vicepzz
During working hours, please log the problem and send a mail to afs.support. Outside normal hours apply the following procedure.
The above message indicates that one or more volumes have gone offline, i.e. become unavailable. Depending on the volume(s), this could be a minor or a major problem.
From the afs console account, type
vos listvol afsxx zz | grep -v "On-line"
(where xx identifies the AFS server concerned and zz is the partition)
One or more volumes will be listed as offline. If the problem occurs during the day, mail AFS support.
Salvage the volume if these conditions are met:
1) there is an "offline" alarm for this volume, and
2) the volume name has a prefix of p., user., s., or q. - see the types of AFS volumes below, and
3) there are complaints from the user(s), and/or
4) other SURE alarms indicate AFS-related service problems.
To salvage a volume
1) make sure the volume name does not end in .backup or .readonly, those cannot be salvaged!
2) log in as 'console' on an AFS machine
3) type /afs/cern.ch/project/afs/etc/salvage 'volume-name'
4) salvaging a volume blocks all accesses for the duration of the salvage, which can take several minutes, so please be patient (do not hit ctrl-c!)
5) vos exam 'volume-name' will indicate whether the volume is offline/online
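A sketch of the whole sequence, using the offline volume from the worked example below (size and timing are illustrative):
[lxplus055] ~ > /afs/cern.ch/project/afs/etc/salvage user.johnh
(blocks until the salvage completes - be patient)
[lxplus055] ~ > vos exam user.johnh | head -1
user.johnh                        537229091 RW    1204367 K  On-line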
Types of AFS volumes
1) user volumes (e.g. user.tim) normally belong to one user and affect only one user.
2) project volumes (e.g. p.lsf) which belong to experiments, projects and services, and are shared by multiple users/services.
3) scratch volumes (e.g. q.alsoft.scratch.0) - general scratch volumes
4) personal scratch volumes (e.g. s.cms.slehti.0) scratch volumes belonging to a specific user.
5) system volumes (e.g. sys.sun4x_56.34a.12) - architecture specific.
6) other volumes (root.afs) used by the AFS cell - do NOT salvage these volumes.
Example
[lxplus055] ~ > vos partinfo afs20
Free space on partition /vicepa: 154903706 K blocks out of total 171743818
Free space on partition /vicepb: 49699179 K blocks out of total 88606368
Free space on partition /vicepda: 40807435 K blocks out of total 49906772
[lxplus055] ~ > vos listvol afs20 b | grep -v "On-line"
Total number of volumes on server afs20 partition /vicepb: 2642
**** Could not attach volume 537229091 ****
Total volumes onLine 2641 ; Total volumes offLine 1 ; Total busy 0
[lxplus055] ~ > vos exam 537229091
**** Could not attach volume 537229091 ****
RWrite: 537229091 Backup: 537229093
number of sites -> 1
server afs20.cern.ch partition /vicepb RW Site
[lxplus055] ~ > vos listvldb 537229091
user.johnh
RWrite: 537229091 Backup: 537229093
number of sites -> 1
server afs20.cern.ch partition /vicepb RW Site
[lxplus055] ~ > vos exam user.johnh
**** Could not attach volume 537229091 ****
RWrite: 537229091 Backup: 537229093
number of sites -> 1
server afs20.cern.ch partition /vicepb RW Site
False no_contact alarms following a "network" problem
If a no_contact alarm persists even though the server is reachable, then on that server, type:
- ps -ef | grep monitor
- kill the pid for the process "/usr/local/bin/perl ./monitor" and the monitor will restart by itself.
- e.g.
afs34> ps -ef | grep monitor
ops 27392 1 0 16:56:36 ? 0:00 /usr/local/bin/perl ./monitor
afs34> kill 27392
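On Solaris (and recent Linux) the two steps can be combined with pkill, which matches against the full command line; a one-line alternative, assuming pkill is available on the server:
afs34> pkill -f './monitor'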
Contacts and Support:
Arne Wiebalck can be contacted during working hours.
Rainer Toebbicke can be contacted on his CERN mobile phone:
Week Days (08:00 - 17:30), Week Evenings (17:30 - 22:00), Weekend Days (08:00 - 22:00)
Bernard Antoine can be contacted:
Week Days (08:00 - 17:30), Week Evenings (17:30 - 22:00), Weekend Days (08:00 - 22:00)