Operations Help Guide
AFS_SERVERS

Contents


Service Description:

AFS is a network-distributed file system, supported by Transarc, consisting of AFS server machines located in the computer centre and AFS clients, which are UNIX/Linux workstations or PCs running the AFS client software, located all over CERN. When a user logs in to an account on LXPLUS, the user's files are located on one of the AFS servers. If the same user runs a batch job on LXBATCH, the same files are served from that same AFS server. AFS is a critical service at CERN and data should be available around the clock, throughout the year.

Logging in

To perform checks etc., please use ssh from a UNIX session.

ssh -l ops afsxx (ssh is a "secure telnet" - passwords are encrypted)

Please note that not all AFS servers monitored by CNSURE are in production.

  • Production servers can be found by executing the script /afs/cern.ch/project/afs/etc/afsconf.pl -s on the AFS console account (just type afs_config on the AFS console account); see the sketch after this list.
  • afs60-64, afs20-afs27, afs90-97 are in the "AFS area" of the machine room
  • afs30-35, afs54-57 are down in the cellar
  • Scratch servers: afs60-64 are scratch servers running Linux. They are less critical than the main AFS file servers - see the special instructions below. The volumes on the scratch servers are NEVER backed up.
  • Backup servers: many servers run backups overnight. afs15 controls backups and does not contain any production volumes. Backups end up in TSM and in CASTOR.
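
A quick way to check that all production servers are reachable is a small loop like the sketch below. This is only a sketch: it assumes afsconf.pl -s prints one server name per line and that a Linux-style ping is available; adjust the parsing if the real output format differs.

     # sketch: ping every production AFS server reported by afsconf.pl -s
     # assumes one server name per line of output
     for srv in $(/afs/cern.ch/project/afs/etc/afsconf.pl -s); do
         if ping -c 1 -w 5 "$srv" >/dev/null 2>&1; then
             echo "$srv: reachable"
         else
             echo "$srv: NO CONTACT"
         fi
     done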


Startup Procedures:

afs_startup

After a power cut, not all machines will come back automatically. The ideal sequence is as follows:
  1. Turn on all SUN & IBM RAIDs.
  2. Turn on afsdb1, afsdb2, afsdb3, in any order, in quick succession.
  3. Wait 5 minutes or until one of afsdb1,2,3 is 'alive' (see the sketch after this list).
  4. Turn on afsmisc2 & afsmisc1.
  5. Turn on afsdb4 & afsdb5 in building 613.
  6. Turn on all other AFS servers, in any order.
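
The wait in step 3 can be checked with a short loop such as the sketch below. It assumes the database servers answer ping once they are up; the timeout and 30-second poll interval are illustrative only.

     # sketch: wait until at least one AFS database server answers ping
     # assumes a Linux-style ping; run as a small script
     while true; do
         for db in afsdb1 afsdb2 afsdb3; do
             if ping -c 1 -w 5 "$db" >/dev/null 2>&1; then
                 echo "$db is alive - continue with afsmisc2 & afsmisc1"
                 exit 0
             fi
         done
         sleep 30
     done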

afs_rundown

3-April-1998: The rundown procedures have now been modified
Log in to the AFS console account (e.g. on sure01, lxplus etc.) and use the afs_rundown command.
Please note the difference between the AFS console account and the system console accessed by the console manager.

The command afs_rundown is in /afs/cern.ch/user/c/console


     usage : ~/afs_rundown [-reboot] [-all|-afs|-afsnfs|-dce|-host ] [-fake]
     try to shutdown IT/DIS/DFS servers (afs, afsnfs, dce)
     options :
     -reboot : reboot the system (otherwise, it is halted or powered off)
     -all    : afs and afsnfs servers are shutdown
     -afs    : only afs servers
     -afsnfs : only afsnfs servers
     -dce    : only dce servers
     -host   : shutdown a specific host (or several separated by ,)
     -help   : this help
     examples :
     # ~/afs_rundown -host afs31,afs32 -reboot
     -> reboot afs31 and afs32
     # ~/afs_rundown -all
     -> shutdown all servers (power off if possible)
     ************************************************************
     *    For emergency shutdown of computing center, type :    *
     *                  ~/afs_rundown -all                        *
     *            (and answer confirmation questions)           *
     ************************************************************
     

Old rundown procedures

Log in to the local ops account (the ops account for AFS is CONSOLE).
Issue rundown (or rundown -r for a reboot) and wait 3 to 5 minutes.
In case of a reboot, check that the status command returns:
fs is running normally, auxiliary status file server running
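
A typical session might look like the following (the server name afs31 is illustrative):

     ssh -l ops afs31      # log in to the local ops account on the server
     rundown -r            # rundown with reboot; wait 3 to 5 minutes
     status                # after the reboot, should report the line above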


Backups:

AFS disaster/recovery - refer to the AFS backup scheme for additional information.

Currently, AFS restores can be requested by sending a mail to afs.support@cern.ch. These requests are processed by the SysAdmin team.


Additional Information:

(plus links to other documents)

As from 15-Jul-97, please look in AFS Cookbook

AFS status checking (all servers)

  • On any terminal, run /afs/.cern.ch/project/afs/etc/afsconf.pl -c
    This will time out on the failing server, or at least show a drastically increased response time.
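
  If you prefer the check to give up rather than hang indefinitely on a dead server, it can be wrapped in a timeout. This is only a sketch; the 300-second limit is arbitrary and it assumes the GNU coreutils timeout command is available on the machine you run it from.

     timeout 300 /afs/.cern.ch/project/afs/etc/afsconf.pl -c || echo "check did not finish - a server is probably hanging"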

Status checking (individual server)

  • Log in as ops; always log in from the server console or the console manager, do not rely on telnet sessions.

    Issue the status command. This

    • checks the fileserver processes on the AFS servers,
    • checks the volume server processes on the AFS servers,
    • checks the free disk space on the user (not project) home directories.

    Another way to detect whether there has been (or is) a problem is to issue the following from any AFS account:

      bos status afsx -long

      
           e.g. bos status afs31 -long
           Instance fs, (type is fs) currently running normally.
           Auxiliary status is: file server running.
           Process last started at Wed Jul 31 22:25:32 1996 (5 proc starts)
           Last exit at Wed Jul 31 22:25:32 1996
           Last error exit at Wed Jul 31 22:19:21 1996, by file, due to signal 11
           Command 1 is '/usr/afs/bin/fileserver -m 2 -L -nojumbo'
           Command 2 is '/usr/afs/bin/volserver -nojumbo'
           Command 3 is '/usr/afs/bin/salvager'
           


      Look at the last error exit time - this will tell you when the fs process last failed due to an error.
      If the output of bos status afsx -long contains...

      • has core file - indicates the process failed, but this failure will not prevent the server from being restarted.
      • salvaging file system - a file salvage operation is in progress; when it finishes, the fs process will be restarted.

    Machine blocked/hung out of hours

    Basic trouble-shooting

    • run /afs/.cern.ch/project/afs/etc/afsconf.pl -c (it will block on the failing server)
    • bos status afsNN (e.g. afs42);
      this might indicate that the machine is actually salvaging, in which case wait 30-40 minutes for the salvage to complete
    • check for no_contact alarms; try to ping the suspected server
    • are there any network problems? AFS needs the network to work properly
    • check the console
    • check the machine and its disks
    • if no obvious solution is in sight and the problem looks serious (user complaints), then reboot the machine (see below). The checks above can also be scripted, as in the sketch after this list.
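
    A minimal first-line helper, assuming ping and the AFS bos client are available on the machine you run it from (afs42 is only an example name):

         #!/bin/sh
         # sketch: first-line checks for a suspected AFS server, e.g. ./check_afs.sh afs42
         SRV=$1
         echo "== ping $SRV =="
         ping -c 3 "$SRV"
         echo "== bos status $SRV =="
         bos status "$SRV" -long   # look for 'salvaging file system' before deciding on a reboot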

    Restarting the AFS file server process

    Whenever an AFS file server seems to be blocked, i.e.
    • users complain that a certain file is not accessible and that file is on afsXX,
    • /afs/cern.ch/project/afs/etc/afs_checkservers stops just before afsXX, or shows an access time for afsXX that is greater than 20'000'000 (the unit is microseconds, therefore 20 seconds),
    • there is no opportunity to inform the AFS team,
    you can restart the AFS fileserver process using the command:
     /afs/cern.ch/project/afs/etc/fs_restart afsXX
        
    where afsXX is afs32, afsdb1, or whatever AFS file server is concerned. You run this command from the "console" account on a normal machine. You could run it from the file server console itself, but then you would have to "klog.krb console" first. This procedure only restarts the file server process. It is pretty quick (much faster than a complete reboot), relatively non-intrusive in that applications would usually just block briefly but not fail, and it collects the major error logs. Always try it before a reboot - but in case of a serious OS problem a reboot might still help where a file server restart does not.
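
    For example, to restart the file server process on afs32 (both forms below simply restate the procedure described above):

         # from the 'console' account on a normal machine:
         /afs/cern.ch/project/afs/etc/fs_restart afs32

         # or from the file server console itself, after getting tokens:
         klog.krb console
         /afs/cern.ch/project/afs/etc/fs_restart afs32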

    Rebooting AFS servers

    • Always try a "soft" reboot first, i.e. use the afs_rundown command (see below)
    • If this fails, reboot from the console. On Sun servers, connect via the console manager and hit , , <CONTROL-B>; wait for 'ok', then type sync. On Linux, hit Ctrl-Alt-F1, then Ctrl-Alt-Del.
    • some machines have a reset button
    • If this fails: power the machine off, wait 1 minute, power it on again.

    Soft reboot of AFS servers (all except afs7X)

    • First try to contact afs.support (see contact details below).
    • If all the above checks have been run AND there are user complaints, try to reboot the system using the afs_rundown command from the afs console account:
      ~/afs_rundown -host afs44 -reboot

    Rebooting AFS LINUX scratch servers (afs6X)

    • These instructions are for the AFS6x machines only !
    • They are real scratch servers - the difference is that the files/volumes are not backed up.
    • In case of problems outside working hours (user complaints, alarms)
      • soft reboot the server i.e. ~/afs_rundown -host afs6x -reboot
      • if this fails try a reset
      • if this fails, leave the machine down, mail afs.support@cern.ch and call the afs.support team at a reasonable hour

    Hardware problems - calling the manufacturer

    IBM servers

    Currently running: afs90-afs97 with a DS4300 RAID array controller. The equipment is covered by a 24x7, 4-hour intervention time contract! In case of H/W trouble, call IBM at 0800-55-5454. You need the machine model and serial number: both are on a small black label at the front of the machine.

    SUN

    Equipment is covered by a 24x7, 4 hour intervention time contract! In case of trouble call Sun, you need the serial number of the device.

    Transtec disk servers: issue an ITCM ticket

    How to look at the error log on a SUN

    • login as ops on the AFS server which has problems.
    • more /var/adm/messages

    Alarms

    Offline volume(s) in partition(s): /vicepzz

    During working hours, please log the problem and send a mail to afs.support. Outside normal hours apply the following procedure.

    The above message indicates that one or more volumes have gone offline, i.e. become unavailable. Depending on the volume(s), this could be a minor or a major problem.

    From the afs console account, type

    vos listvol afsxx zz | grep -v "On-line"

    (where xx is the AFS server concerned and zz is the partition)

    One or more volumes will be listed offline in the listvol. If the problem occurs during the day, mail AFS support

    Salvage the volume if these conditions are met:

  • 1) there is an "offline" alarm for this volume, and
  • 2) the volume name has a prefix of p., user., s. or q. - see the types of AFS volumes below, and
  • 3) there are complaints from the user(s), and/or
  • 4) other SURE alarms indicate that there are AFS-related service problems.

    To salvage a volume (a worked example follows these steps)

  • 1) make sure the volume name does not end in .backup or .readonly, those cannot be salvaged!
  • 2) log in as 'console' on an AFS machine
  • 3) type /afs/cern.ch/project/afs/etc/salvage 'volume-name'
  • 4) salvaging a volume blocks all accesses for the time of the salvage, which can take several minutes so please be patient (do not hit ctrl-c !)
  • 5) vos exam 'volume-name' will indicate whether the volume is offline/online
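
    Putting the steps together, a typical salvage might look like this (the volume name user.johnh is taken from the example further down and is only illustrative):

         # logged in as 'console' on an AFS machine
         vos exam user.johnh                                # confirm the volume is offline
         /afs/cern.ch/project/afs/etc/salvage user.johnh    # may take several minutes - do not hit ctrl-c
         vos exam user.johnh                                # check that the volume is back online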

    Types of AFS volumes

  • 1) user volumes (e.g user.tim) normally belong to one user and affect only one user.
  • 2) project volumes (e.g. p.lsf) which belong to experiments, projects and services and are shared by multiple users/services.
  • 3) scratch volumes (e.g. q.alsoft.scratch.0) - general scratch volumes
  • 4) personal scratch volumes (e.g. s.cms.slehti.0) scratch volumes belonging to a specific user.
  • 5) system volumes (e.g. sys.sun4x_56.34a.12) - architecture specific.
  • 6) other volumes (root.afs) used by the AFS cell - do NOT salvage these volumes.
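
    A quick way to apply these rules is to look at the volume-name prefix, as in the sketch below. It only encodes the rules listed above and is not an official tool.

         # sketch: decide from the prefix whether a volume may be salvaged out of hours
         VOL=$1
         case "$VOL" in
             *.backup|*.readonly) echo "$VOL: do NOT salvage (backup/readonly volume)" ;;
             user.*|p.*|q.*|s.*)  echo "$VOL: may be salvaged if the alarm/complaint conditions are met" ;;
             *)                   echo "$VOL: not in the salvageable list - do not salvage out of hours, contact afs.support" ;;
         esac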

    Example

      [lxplus055] ~ > vos partinfo afs20
         Free space on partition /vicepa: 154903706 K blocks out of total 171743818
         Free space on partition /vicepb: 49699179 K blocks out of total 88606368
         Free space on partition /vicepda: 40807435 K blocks out of total 49906772
         [lxplus055] vos listvol afs20 b | grep -v "On-line"
         Total number of volumes on server afs20 partition /vicepb: 2642
         **** Could not attach volume 537229091 ****
         Total volumes onLine 2641 ; Total volumes offLine 1 ; Total busy 0
         [lxplus055] ~ > vos exam 537229091
         **** Could not attach volume 537229091 ****
         RWrite: 537229091     Backup: 537229093
         number of sites -> 1
         server afs20.cern.ch partition /vicepb RW Site
         [lxplus055] ~ > vos listvldb 537229091
         user.johnh
         RWrite: 537229091     Backup: 537229093
         number of sites -> 1
         server afs20.cern.ch partition /vicepb RW Site
         [lxplus055] ~ > vos exam user.johnh
         **** Could not attach volume 537229091 ****
         RWrite: 537229091     Backup: 537229093
         number of sites -> 1
         server afs20.cern.ch partition /vicepb RW Site
         

    False no_contact alarms following a "network" problem

    If a no_contact alarm persists even though the server is reachable, then on that server:

    • ps -ef | grep monitor
    • kill the PID of the process "/usr/local/bin/perl ./monitor"; the monitor will restart by itself.

    Example:

         afs34> ps -ef | grep monitor
         ops 27392     1  0 16:56:36 ?        0:00 /usr/local/bin/perl ./monitor
         afs34> kill 27392
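
    The same procedure as a one-line sketch (assumes pkill is available on the server and that only the monitor process matches the pattern):

         pkill -f '/usr/local/bin/perl ./monitor'    # the monitor restarts by itself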


    Contacts and Support:

    Arne Wiebalck can be contacted during working hours; Rainer Toebbicke can be contacted on his CERN mobile phone:
    Week Days (08:00 - 17:30) Week Evenings (17:30 - 22:00) Weekend Days (08:00 - 22:00)

    Bernard Antoine can be contacted:
    Week Days (08:00 - 17:30) Week Evenings (17:30 - 22:00) Weekend Days (08:00 - 22:00)



  • -- FabioTrevisani - 30 Jun 2009
