Difference: ProductionProceduresBanningASite (1 vs. 4)

Revision 4 - 2009-01-12 - RobertoSantinel

Line: 1 to 1
 
META TOPICPARENT name="ProductionProcedures"

Banning sites

Line: 50 to 49
 Quotes must not be nested inside the quoted comment because this confuses the parser, e.g.
$ dirac-admin-ban-site LCG.Liverpool.uk --comment='Lots of jobs failed with "bus error" in Gauss step'
Changed:
<
<
will record a truncated comment for the banned site: 'Lots of jobs failed with bus'.
>
>
will record a truncated comment for the banned site: 'Lots of jobs failed with bus'.
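
One way to avoid the truncation described above is to remove quote characters from the comment text before it is handed to the command. The following is only a minimal sketch: sanitize_comment is a hypothetical helper and not part of DIRAC, and it assumes that simply deleting the quote characters is acceptable for the comment.

```shell
#!/bin/sh
# sanitize_comment is a hypothetical helper, not a DIRAC command.
# It deletes every single and double quote character from its argument.
sanitize_comment() {
    printf '%s' "$1" | tr -d "\"'"
}

comment=$(sanitize_comment 'Lots of jobs failed with "bus error" in Gauss step')
echo "$comment"
# prints: Lots of jobs failed with bus error in Gauss step

# The cleaned text could then be passed in the form used on this page:
#   dirac-admin-ban-site LCG.Liverpool.uk --comment="$comment"
```

The ban command itself is left commented out so that the sketch can be tried without banning a site.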
 

GGUS tickets

Revision 3 - 2008-12-16 - MarcosASeco

Line: 1 to 1
 
META TOPICPARENT name="ProductionProcedures"

Banning sites

Line: 47 to 47
 
  • An email will automatically be sent to lhcb-grid@cernNOSPAMPLEASE.ch.
  • A comment should always be entered so that the reason for the ban is clear to everyone.
Added:
>
>
Quotes must not be nested inside the quoted comment because this confuses the parser, e.g.
$ dirac-admin-ban-site LCG.Liverpool.uk --comment='Lots of jobs failed with "bus error" in Gauss step'
will record a truncated comment for the banned site: 'Lots of jobs failed with bus'.
 

GGUS tickets

  • The grid expert should immediately submit a GGUS ticket to the site reporting the problem and any relevant information (such as the error message and the names of the local site nodes involved).

Revision 2 - 2008-12-09 - GreigCowan

Line: 1 to 1
 
META TOPICPARENT name="ProductionProcedures"

Banning sites

Line: 14 to 14
 
  • Job is killed by the site batch system for using too much CPU/memory
  • Site configuration problem leading to LHCb software not being accessible
  • Grid middleware at a site has failed
Added:
>
>
  • Site is in downtime (scheduled or unscheduled)
  In the last two cases above, or where it is clear that there is a problem with the site which must be fixed before normal LHCb activity can resume, the option to ban the site should be considered.
Line: 62 to 63
 
  • This will send an email to lcg-grid@cernNOSPAMPLEASE.ch.
  • Shifters should make regular checks of the list of banned sites to ensure that there are no sites which have been forgotten about.
Added:
>
>

Dealing with downtime

Sites regularly have to go into periods of maintenance during which they do not provide the complete set of Grid services. Often this means that the site cannot be used by LHCb, and it should be banned to prevent jobs being scheduled there. In some circumstances the site may still be of use to LHCb, for example when only the storage is taken offline while the CPU remains available; in this case the site could still be used for MC production (Tier-2s). If the site is completely off the grid, the Application status will often be reported as "No Grid CE available" because the site is not in the grid information system. Shifters should check the site downtime calendar regularly to determine when sites should be banned and unbanned. As with any ban, sites should only be unbanned once they have announced that they are officially out of downtime and are passing SAM tests.

 -- GreigCowan - 09 Dec 2008

Revision 1 - 2008-12-09 - GreigCowan

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="ProductionProcedures"

Banning sites

Reason to ban a site

If jobs are failing at a site then the reason for the failures must be investigated by the shifters. There are many possible causes of job failure:

  • DIRAC problem
  • Application crash
  • Job is killed by the site batch system for using too much CPU/memory
  • Site configuration problem leading to LHCb software not being accessible
  • Grid middleware at a site has failed

In the last two cases above, or where it is clear that there is a problem with the site which must be fixed before normal LHCb activity can resume, the option to ban the site should be considered.

List of currently banned sites

$ dirac-admin-get-banned-sites
LCG.AUVER.fr                   2008-10-01 14:42:54 roma Linux_i686_glibc-2.3.3
LCG.BHAM-HEP.uk                2008-11-29 19:39:14 azhelezo Application not found
LCG.Barcelona.es               2008-12-08 18:32:34 gcowan All jobs failing with Application not Found
LCG.Bari.it                    2008-10-19 09:49:56 roma Software Installation
LCG.Bristol-HPC.uk             2008-11-17 08:16:18 roma Site update to SLC4
LCG.CESGA.es                   2008-11-01 17:12:22 roma All production jobs failed or stalled
LCG.CNAF-T2.it                 2008-12-07 13:26:02 gcowan All jobs failing with application not found
LCG.Catania.it                 Server error while serving getSiteMaskLogging: tuple index out of range
LCG.FESB.hr                    2008-12-03 09:51:18 gcowan Productions failing due to no space left on device errors
LCG.Ferrara.it                 2008-10-01 19:44:59 roma CEStateStatus: Draining
LCG.GR-03.gr                   2008-10-01 19:13:02 roma CEStateStatus: Draining
LCG.GR-04.gr                   2008-11-13 10:08:00 rvazquez No comment supplied.

This lists all sites that are currently not being used by DIRAC production. The date when they were banned is given along with the reason for the ban.
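
To help spot sites that have been banned for a long time, the listing above can be filtered by ban date. This is a rough sketch only: list_banned_before is a hypothetical helper and not a DIRAC command. It assumes the column layout shown above (site, date, time, user, comment), with the fourth column taken to be the user who set the ban, and it skips lines that carry no date, such as the server error reported for LCG.Catania.it.

```shell
#!/bin/sh
# list_banned_before is a hypothetical helper, not a DIRAC command.
# Usage: list_banned_before YYYY-MM-DD < listing
# Prints site, ban date and user for every ban made before the cutoff date.
list_banned_before() {
    awk -v cutoff="$1" '
        # Skip lines whose second field is not a date (e.g. server errors).
        $2 !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { next }
        # Dates in YYYY-MM-DD form sort correctly as plain strings.
        ($2 "") < (cutoff "") { print $1, $2, $4 }
    '
}

# Demo on three lines copied from the listing above.
list_banned_before 2008-11-01 <<'EOF'
LCG.AUVER.fr                   2008-10-01 14:42:54 roma Linux_i686_glibc-2.3.3
LCG.Catania.it                 Server error while serving getSiteMaskLogging: tuple index out of range
LCG.FESB.hr                    2008-12-03 09:51:18 gcowan Productions failing due to no space left on device errors
EOF
# prints: LCG.AUVER.fr 2008-10-01 roma
```

In day-to-day use the live listing could be piped in instead of the sample lines, for example: dirac-admin-get-banned-sites | list_banned_before 2008-11-01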

Banning a site

$ dirac-admin-ban-site LCG.CERN.ch --comment="All jobs failing with Application not Found error"

  • An email will automatically be sent to lhcb-grid@cernNOSPAMPLEASE.ch.
  • A comment should always be entered so that the reason for the ban is clear to everyone.

GGUS tickets

  • The grid expert should immediately submit a GGUS ticket to the site reporting the problem and any relevant information (such as the error message and the names of the local site nodes involved).
  • The GGUS ticket should be CC'd to lhcb-grid@cernNOSPAMPLEASE.ch.

Post-banning action

It is important that sites do not remain banned indefinitely. Once the site has acted on and resolved the GGUS ticket, the Grid team should check the results of the SAM jobs to confirm that jobs are once again running at the site. Once the site has been verified as operational, it should be unbanned and the GGUS ticket closed.

$ dirac-admin-allow-site LCG.CERN.ch

  • This will send an email to lcg-grid@cernNOSPAMPLEASE.ch.
  • Shifters should make regular checks of the list of banned sites to ensure that there are no sites which have been forgotten about.

-- GreigCowan - 09 Dec 2008

 