<H1 align=center>Operations Help Guide<BR>AFS_SERVERS</H1>

<H3>Contents</H3>
<UL>
<LI><A href="#Service">Service Description</A> </LI>
<LI><A href="#Start">Startup and Shutdown Procedures</A> </LI>
<LI><A href="#Backups">Backups</A> </LI>
<LI><A href="#Status">Status Checking</A> </LI>
<LI><A href="#out_hours">Machine blocked/hung out of hours</A> </LI>
<LI><A href="#Contacts">Contacts</A> </LI>
</UL>

<A name=Service> <HR> </A>
<H2 align=center>Service Description:</H2>
<p>AFS is a network-distributed file system, supported by Transarc. It consists of AFS server machines located in the computer centre and AFS clients - UNIX/Linux workstations or PCs running the AFS client - located all over CERN. When a user logs in to an account on LXPLUS, the user's files are located on one of the AFS servers. If the same user runs a batch job on LXBATCH, the job accesses those same files on the AFS server. AFS is a critical service at CERN and data should be available around the clock, throughout the year. </p>

<h3>Logging in</h3>
<dir>
<P>To perform checks etc., use the following from a UNIX session <br> (ssh is a "secure telnet" - passwords are encrypted):</P>
<P><kbd>ssh -l ops afs<i>xx</i></kbd></P>
</dir>

<h3>Servers status</h3>
<dir>
<P>Please note that not all monitored AFS servers are in production. </P>
<UL>
<LI><B>Production servers</B> can be found by executing the script<br> <kbd>/afs/cern.ch/project/afs/etc/afsconf.pl -s</kbd> <br> from the AFS console account (just type <kbd>afs_config</kbd> on the AFS console account). </li>
<LI>The <a href="http://oraweb.cern.ch/pls/cdbsql/web.execute_general_search?p_fld1=clustername&p_data1=afs&p_default=no&p_hn=yes&p_cn=yes&p_csn=yes&p_ctype=yes&p_imp=yes&p_lr=yes&p_outputformat=html">location of the server</a> can be found by looking at the column "Location Room" in the result of that link.</li>
<LI><B>Scratch servers</B> <B>AFS6<i>xx</i></B> are scratch servers running Linux. They are less critical than the main AFS file servers - see the special instructions. The volumes on the scratch servers are NEVER backed up. </li>
<LI><B>Backup servers</B> Many servers run backups overnight. <B>AFS15</B> controls backups; it does not contain any production volumes. Backups end up in TSM and in CASTOR. </LI>
</UL>
</dir>

<P><A name=Start> <HR> </A>
<H2 align=center>Startup and Shutdown Procedures:</H2>
<H3>afs_startup</H3>
<blockquote>
<p>After a power cut, not all machines will come back automatically. The ideal sequence is as follows: </p>
<OL>
<LI>Turn on all SUN & IBM RAIDs (= disks). </LI>
<LI>Turn on afsdb1, afsdb2 and afsdb3, in any order, in quick succession. </LI>
<LI>Wait 5 minutes or until one of afsdb1,2,3 is 'alive' (a scripted check is sketched below).</LI>
<li>Turn on afsmisc2 & afsmisc1. </li>
<LI>Turn on afsdb4 & afsdb5 in building 613. </LI>
<LI>Turn on all other AFS servers, in any order. </LI>
</OL>
</blockquote>
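<blockquote>
<p>For step 3, the 'alive' test can be scripted. The following is a minimal, unofficial sketch, not one of the AFS project tools; it assumes the <kbd>bos</kbd> command is available on the machine you run it from: </p>
<PRE><FONT size=2>
# Illustrative only: poll the AFS database servers until one answers.
# A bosserver that replies to "bos status" counts as 'alive' here.
for try in 1 2 3 4 5 6 7 8 9 10; do
    for db in afsdb1 afsdb2 afsdb3; do
        if bos status $db -noauth > /dev/null 2>&1; then
            echo "$db is alive - continue with the startup sequence"
            exit 0
        fi
    done
    sleep 30    # retry every 30 seconds, i.e. ~5 minutes in total
done
echo "no database server answered after 5 minutes - investigate"
exit 1
</FONT></PRE>
</blockquote>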
<H3>afs_rundown</H3>
<blockquote>
<p><B>3-April-1998: The rundown procedures have now been modified. </B></p>
<p>Log in to the AFS <B>console</B> account (e.g. on lxadm, lxplus, etc.) and use the <kbd>afs_rundown</kbd> command. <BR> Please note the difference between the AFS console account and the system console accessed by the <kbd>connect2console.sh</kbd> command. </p>
<P>The command <kbd>afs_rundown</kbd> is in <code>/afs/cern.ch/user/c/console</code>: </P>
<PRE><FONT size=2>
usage : ~/afs_rundown [-reboot] [-all|-afs|-afsnfs|-dce|-host <HOST>] [-fake]
        try to shutdown IT/DIS/DFS servers (afs, afsnfs, dce)
options :
        -reboot : reboot the system (otherwise, it is halted or powered off)
        -all    : afs and afsnfs servers are shutdown
        -afs    : only afs servers
        -afsnfs : only afsnfs servers
        -dce    : only dce servers
        -host   : shutdown a specific host (or several separated by ,)
        -help   : this help
examples :
        # ~/afs_rundown -host afs31,afs32 -reboot
          -> reboot afs31 and afs32
        # ~/afs_rundown -all
          -> shutdown all servers (power off if possible)
************************************************************
*   For emergency shutdown of computing center, type :    *
*       ~/afs_rundown -all                                 *
*       (and answer confirmation questions)                *
************************************************************
</FONT></PRE>
</blockquote>

<H3>Old rundown procedures</H3>
<blockquote>
<OL>
<li>Log in to the local account as <B>ops</B> (the AFS ops account is CONSOLE). </li>
<li>Issue <kbd>rundown</kbd>, or <kbd>rundown -r</kbd> for a reboot, and wait 3 to 5 minutes. </li>
<li>In case of a reboot, check that the status command returns <BR> <code>fs is running normally, auxiliary status file server running</code> </li>
</OL>
</blockquote>

<P><A name=Backups> <HR> </A>
<H2 align=center>Backups:</H2>
<dir>
<p>AFS disaster/recovery:<br> Currently AFS restores can be requested by sending a mail to <a href="mailto:afs.support@cern.ch">afs.support@cern.ch</a>. These requests are processed by the SysAdmin team. </p>
</dir>

<P><A name=Status> <HR> </A>
<H2 align=center>Status Checking</H2>
<H3>AFS status checking (all servers)</H3>
<blockquote>
<P>On any terminal, run: <kbd>/afs/.cern.ch/project/afs/etc/afsconf.pl -c</kbd><BR> This will time out on the failing server, or at least show a drastically increased response time. </P>
</blockquote>

<H3>Status checking (individual server)</H3>
<blockquote>
<P>First method: </P>
<OL>
<li>Log in as <B>ops</B>; always log in from the server console or through the <kbd>connect2console.sh</kbd> command. Do not rely on telnet sessions.</li>
<li>Issue the <kbd>status</kbd> command. This
<UL>
<LI>checks the fileserver processes on the AFS servers, </LI>
<LI>checks the volume server processes on the AFS servers, </LI>
<LI>checks the free disk space on the user (not project) home directories. </LI>
</UL>
</li>
</OL>
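<P>If a single server is suspected, the probe can be time-bounded so that a hung machine does not block your session. This is a minimal sketch, not an official tool; it assumes the GNU coreutils <kbd>timeout</kbd> command is installed, and afs42 is just an example name: </P>
<PRE><FONT size=2>
# Illustrative only: give the bosserver on afs42 at most 20 seconds
# to answer (the same threshold afs_checkservers uses, see below).
if timeout 20 bos status afs42 > /dev/null 2>&1; then
    echo "afs42 answers normally"
else
    echo "afs42 is slow or unreachable - investigate"
fi
</FONT></PRE>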
<P>Another way to detect whether there has been, or is, a problem is to issue, from any AFS account: </P>
<UL>
<LI><kbd>bos status afs<i>xx</i> -long</kbd>, <br>e.g. <kbd>bos status afs31 -long</kbd>
<PRE><FONT size=2>
Instance fs, (type is fs) currently running normally.
Auxiliary status is: file server running.
Process last started at Wed Jul 31 22:25:32 1996 (5 proc starts)
Last exit at Wed Jul 31 22:25:32 1996
Last error exit at Wed Jul 31 22:19:21 1996, by file, due to signal 11
Command 1 is '/usr/afs/bin/fileserver -m 2 -L -nojumbo'
Command 2 is '/usr/afs/bin/volserver -nojumbo'
Command 3 is '/usr/afs/bin/salvager'
</FONT></PRE>
</li>
<li>Look at the last error exit time - this tells you when the fs process last failed due to an error. <BR>If the output of <kbd>bos status afs<i>x</i> -long</kbd> contains...
<UL>
<LI><code>has core file</code> - the process failed, but this failure will not prevent the server being restarted.</LI>
<LI><code>salvaging file system</code> - a file salvage operation is in progress; when it finishes, the fs process will be restarted.</LI>
</UL>
</LI>
</UL>
</blockquote>

<P><A name=out_hours> <HR> </A>
<H2 align=center>Machine blocked/hung out of hours</H2>
<H3>Basic trouble-shooting</H3>
<UL>
<LI>Run <kbd>/afs/.cern.ch/project/afs/etc/afsconf.pl -c</kbd> (it will block on the failing server). </li>
<LI>Run <kbd>bos status afs<i>XX</i></kbd> (e.g. afs42). <BR> This might indicate that the machine is actually salvaging, in which case wait 30-40 minutes for the salvage to complete. </li>
<LI>Check for no_contact alarms; try to ping the suspected server. </li>
<LI>Are there any network problems? AFS needs the network to work properly. </li>
<LI>Check the console. </li>
<LI>Check the machine and its disks. </li>
<LI>If no obvious solution is in sight and the problem looks serious (user complaints), then <B>reboot the machine</B> (see below). </LI>
</UL>
<H3>Restarting the AFS file server process</H3>
<blockquote>
<p>Whenever an AFS file server seems to be blocked, i.e.: </p>
<UL>
<LI>users complain that a certain file is not accessible and that file is on afs<i>XX</i>, </li>
<LI><kbd>/afs/cern.ch/project/afs/etc/afs_checkservers</kbd> stops just before afs<i>XX</i> or shows an access time for afs<i>XX</i> that is greater than 20'000'000 (the unit is microseconds, therefore 20 seconds), </LI>
<LI>there is no opportunity to inform the AFS team, </LI>
</UL>
<p>you can restart the AFS fileserver process using the command: <br> <kbd>/afs/cern.ch/project/afs/etc/fs_restart afs<i>XX</i></kbd><br> where afs<i>XX</i> is afs32, afsdb1, or whatever AFS file server. Run this command from the <b>console</b> account on a normal machine. You could run it from the file server console itself, but then you would have to <kbd>klog.krb console</kbd> first. This procedure only restarts the file server process. It is quick (much faster than a complete reboot), relatively non-intrusive in that applications will usually just block briefly but not fail, and it collects the major error logs. Always try it before a reboot - but in case of a serious OS problem a reboot might still help where a file server restart does not. A combined check-then-restart sketch follows. </p>
</blockquote>
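<blockquote>
<p>Putting the checks above together, the out-of-hours decision can be sketched as follows. This is illustrative only, combining the documented commands; afs42 is just an example name, and the restart must be run from the console account as described above: </p>
<PRE><FONT size=2>
srv=afs42
# If the server is salvaging, do not restart it: wait the
# 30-40 minutes for the salvage to complete instead.
if bos status $srv | grep -q "salvaging file system"; then
    echo "$srv is salvaging - wait for it to finish"
else
    # Otherwise restart only the fileserver process (no full reboot).
    /afs/cern.ch/project/afs/etc/fs_restart $srv
fi
</FONT></PRE>
</blockquote>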
<H3>Rebooting AFS servers</H3>
<blockquote>
<UL>
<LI>Always try a "soft" reboot first, i.e. use the <kbd>afs_rundown</kbd> command (see below). </li>
<LI>If this fails, reboot from the console:
<UL>
<li>On Sun servers, connect via the console manager and hit<br> <i>RETURN</i>, <i>TILDE</i> (~), <i>CONTROL-B</i>, <br> wait for <B>'ok'</B>, then type <kbd>sync</kbd>. </li>
<li>On Linux, hit <kbd>alt-ctrl-f1</kbd>, then <kbd>alt-ctrl-del</kbd>. </li>
</UL>
</li>
<li>Some machines have a reset button. </li>
<LI>If this fails: power the machine off, wait 1 minute, and power it on again. </LI>
</UL>
<H4>Soft reboot of AFS servers (all except afs7X)</H4>
<UL>
<LI>First try to contact afs.support (see contact details below). </LI>
<LI>If all the above checks have been run AND there are user complaints, try to reboot the system using the afs_rundown command from the AFS console account: <BR> <kbd>~/afs_rundown -host afs44 -reboot</kbd> </LI>
</UL>
<H4>Rebooting AFS LINUX scratch servers (afs6X)</H4>
<UL>
<LI><B>These instructions are for the AFS6x machines only!</B> </LI>
<LI>They are real scratch servers - the difference is that the files/volumes are not backed up. </LI>
<LI>In case of problems outside working hours (user complaints, alarms):<BR>
<UL>
<LI>soft reboot the server, i.e. <kbd>~/afs_rundown -host afs6x -reboot</kbd>; </LI>
<li>if this fails, try a reset; </li>
<LI>if this fails, leave the machine down, mail afs.support@cern.ch and call the afs.support team at a reasonable hour. </LI>
</UL>
</LI>
</UL>
</blockquote>
<H3>Hardware problems - calling the manufacturer</H3>
<blockquote>
<H4>IBM servers</H4>
<P> Currently running: afs90-afs97 with a DS4300 RAID array controller. The equipment is covered by a 24x7, 4-hour intervention time contract! In case of H/W trouble, call IBM at 0800-55-5454. You need the machine model and serial number: both are on a small black label at the front of the machine. </P>
<H4>SUN</H4>
<P> The equipment is covered by a 24x7, 4-hour intervention time contract! In case of trouble call Sun; you need the serial number of the device. </P>
<H4>Transtec disk servers</H4>
<P> Issue an ITCM ticket. </P>
</blockquote>
<h3>How to look at the error log on a SUN</h3>
<UL>
<LI>Log in as ops on the AFS server which has problems. </LI>
<LI>Type: <kbd>more /var/adm/messages</kbd> </LI>
</UL>
<HR width="25%">
<H2>Alarms</H2>
<blockquote>
<H3>Offline volume(s) in partition(s): /vicepzz</H3>
<dir>
<P>During working hours, please log the problem and send a mail to <B>afs.support</B>. Outside normal hours, apply the following procedure. </P>
<P>The above message indicates that one or more volumes have gone offline, i.e. become unavailable. Depending on the volume(s), this could be a minor or a major problem. </P>
<P><B>From the AFS console account</B>, type:<br> <kbd>vos listvol afs<i>xx</i> zz | grep -v "On-line"</kbd> <br> (where <i>xx</i> identifies the AFS server concerned and zz is the partition). </P>
<P>One or more volumes will be listed as offline in the listvol output. If the problem occurs during the day, <A href="mailto:afs.support@cern.ch">mail AFS support</A>. </P>
<P><B>Salvage the volume if these conditions are met:</B> </P>
<OL>
<LI>there is an "offline" alarm for this volume, <B>and</B> </LI>
<LI>the volume name has a prefix of p., user., s. or q. - see the types of AFS volumes below, <B>and</B> </LI>
<LI>there are complaints from the user(s), <B>and/or</B> </LI>
<LI>other alarms indicate AFS-related service problems. </LI>
</OL>
<P><B>To salvage a volume:</B> </P>
<OL>
<LI>make sure the volume name does not end in <code>.backup</code> or <code>.readonly</code>; those cannot be salvaged! (A defensive wrapper is sketched after the volume-type list below.) </LI>
<LI>log in as <b>console</b> on an AFS machine; </LI>
<LI>type <kbd>/afs/cern.ch/project/afs/etc/salvage <i>'volume-name'</i></kbd>; </LI>
<LI>salvaging a volume blocks all accesses for the duration of the salvage, which can take several minutes, so please be patient (<B>do not hit ctrl-c!</B>); </LI>
<LI>issue <kbd>vos exam <i>'volume-name'</i></kbd>; it will indicate whether the volume is offline/online. </LI>
</OL>
<P><B>Types of AFS volumes:</B> </P>
<OL>
<LI>user volumes (e.g. <code>user.tim</code>) normally belong to one user and affect only that user; </LI>
<LI>project volumes (e.g. <code>p.lsf</code>) belong to experiments, projects and services and are shared by multiple users/services; </LI>
<LI>scratch volumes (e.g. <code>q.alsoft.scratch.0</code>) - general scratch volumes; </LI>
<LI>personal scratch volumes (e.g. <code>s.cms.slehti.0</code>) - scratch volumes belonging to a specific user; </LI>
<LI>system volumes (e.g. <code>sys.sun4x_56.34a.12</code>) - architecture specific; </LI>
<LI>other volumes (<code>root.afs</code>) used by the AFS cell - do NOT salvage these volumes. </LI>
</OL>
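<P>The naming rules above can be turned into a defensive wrapper around the salvage script. This is a minimal, unofficial sketch; <code>user.johnh</code> is just the example volume from the session below, and the other documented conditions (alarm, user complaints) must still be checked by hand: </P>
<PRE><FONT size=2>
vol=user.johnh
case "$vol" in
    # .backup and .readonly volumes cannot be salvaged.
    *.backup|*.readonly) echo "refusing: $vol cannot be salvaged" ;;
    # only user, project and scratch volumes qualify for an
    # out-of-hours salvage.
    user.*|p.*|s.*|q.*)  /afs/cern.ch/project/afs/etc/salvage "$vol" ;;
    # anything else (sys.*, root.afs, ...) - leave it to afs.support.
    *) echo "not a user/project/scratch volume - mail afs.support" ;;
esac
</FONT></PRE>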
<P><B>Example</B></P>
<PRE>
[lxplus055] ~ > vos partinfo afs20
Free space on partition /vicepa: 154903706 K blocks out of total 171743818
Free space on partition /vicepb: 49699179 K blocks out of total 88606368
Free space on partition /vicepda: 40807435 K blocks out of total 49906772

[lxplus055] ~ > vos listvol afs20 b | grep -v "On-line"
Total number of volumes on server afs20 partition /vicepb: 2642
**** Could not attach volume 537229091 ****
Total volumes onLine 2641 ; Total volumes offLine 1 ; Total busy 0

[lxplus055] ~ > vos exam 537229091
**** Could not attach volume 537229091 ****
    RWrite: 537229091    Backup: 537229093
    number of sites -> 1
       server afs20.cern.ch partition /vicepb RW Site

[lxplus055] ~ > vos listvldb 537229091
user.johnh
    RWrite: 537229091    Backup: 537229093
    number of sites -> 1
       server afs20.cern.ch partition /vicepb RW Site

[lxplus055] ~ > vos exam user.johnh
**** Could not attach volume 537229091 ****
    RWrite: 537229091    Backup: 537229093
    number of sites -> 1
       server afs20.cern.ch partition /vicepb RW Site
</PRE>
</dir>
</blockquote>

<blockquote>
<H3>False no_contact alarms following a "network" problem</H3>
<dir>
<P>If a no_contact alarm persists even though the server is reachable, then on that server: </P>
<UL>
<LI>type <kbd>ps -ef | grep monitor</kbd>; </LI>
<LI>kill the pid of the process <code>/usr/local/bin/perl ./monitor</code>; the monitor will restart by itself. <br>Example:
<PRE>
<kbd>ps -ef | grep monitor</kbd>
     ops 27392     1  0 16:56:36 ?        0:00 /usr/local/bin/perl ./monitor
<kbd>kill 27392</kbd>
</PRE>
</LI>
</UL>
</dir>
</blockquote>

<A name=Contacts> <HR> </A>
<H2 align=center>Contacts and Support:</H2>
<P>Please consult SDB for the <a href="http://cern.ch/servicedb/index.php?service=3742">Cluster afs</a> service.</P>

<HR>
<H2>Keywords</H2>
<blockquote> AFS servers salvage </blockquote>