Banning sites

Reason to ban a site

If jobs are failing at a site, the shifters must investigate the reason for the failures. There are many possible causes of job failure:

  • DIRAC problem
  • Application crash
  • Job is killed by the site batch system for using too much CPU/memory
  • Site configuration problem leading to LHCb software not being accessible
  • Grid middleware at a site has failed
  • Site is in downtime (scheduled or unscheduled)

In the last two cases above, or wherever it is clear that the site has a problem that must be fixed before normal LHCb activity can resume, banning the site should be considered.
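The rule above can be sketched as a small shell helper. This is only an illustration: the function name `suggest_action` and the cause labels are hypothetical names invented here, not DIRAC status strings, and the final decision remains with the shifter.

```shell
# Hypothetical helper: map a failure cause to a suggested action.
# $1 = cause label (illustrative names, not DIRAC status strings)
# $2 = "yes" if a site-side fault has been confirmed (defaults to "no")
suggest_action() {
    cause=$1
    site_fault=${2:-no}
    case "$cause" in
        middleware-failure|site-downtime)
            # The last two causes in the list: a ban should be considered.
            echo "consider-ban"
            return ;;
    esac
    if [ "$site_fault" = "yes" ]; then
        # Any other cause that is clearly a site problem.
        echo "consider-ban"
    else
        echo "investigate"
    fi
}

suggest_action site-downtime                 # prints: consider-ban
suggest_action application-crash             # prints: investigate
suggest_action software-not-accessible yes   # prints: consider-ban
```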

List of currently banned sites

$ dirac-admin-get-banned-sites
2008-10-01 14:42:54 roma Linux_i686_glibc-2.3.3
2008-11-29 19:39:14 azhelezo Application not found
2008-12-08 18:32:34 gcowan All jobs failing with Application not Found
2008-10-19 09:49:56 roma Software Installation
2008-11-17 08:16:18 roma Site update to SLC4
2008-11-01 17:12:22 roma All production jobs failed or stalled
2008-12-07 13:26:02 gcowan All jobs failing with application not found
Server error while serving getSiteMaskLogging: tuple index out of range
2008-12-03 09:51:18 gcowan Productions failing due to no space left on device errors
2008-10-01 19:44:59 roma CEStateStatus: Draining
2008-10-01 19:13:02 roma CEStateStatus: Draining
2008-11-13 10:08:00 rvazquez No comment supplied.

The command lists all sites that are currently not being used by DIRAC production. Each entry gives the date and time of the ban, the user who applied it, and the reason for the ban.
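For scripting around this listing, an entry can be split into its fields with the shell's `read` builtin. The sketch below assumes each entry has the shape "date time user comment" as in the sample output above; the real command output may carry additional columns, and the function name `parse_ban_entry` is a hypothetical helper, not part of DIRAC.

```shell
# Hypothetical helper: split one entry of the form
#   "YYYY-MM-DD HH:MM:SS user free-text comment"
# into four fields, printed as "date|time|user|comment".
parse_ban_entry() {
    printf '%s\n' "$1" | {
        # The last variable receives the rest of the line, spaces included.
        read -r day clock user comment
        printf '%s|%s|%s|%s\n' "$day" "$clock" "$user" "$comment"
    }
}

parse_ban_entry "2008-12-08 18:32:34 gcowan All jobs failing with Application not Found"
# prints: 2008-12-08|18:32:34|gcowan|All jobs failing with Application not Found
```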

Banning a site

$ dirac-admin-ban-site --comment="All jobs failing with Application not Found error"

  • An email will automatically be sent to
  • Always enter a comment so that the reason for the ban is clear to everyone.

Do not nest quotes inside quotes in the comment, because this confuses the parser. For example,

$ dirac-admin-ban-site --comment='Lots of jobs failed with "bus error" in Gauss step'
will record the following truncated comment for the banned site: 'Lots of jobs failed with bus'.
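One way to avoid the problem is to strip quote characters from the comment before passing it on. The following is a minimal sketch; `sanitize_comment` is a hypothetical helper name, and it simply deletes every single and double quote from its argument.

```shell
# Hypothetical helper: remove single and double quotes from a comment
# so that no nested quotes reach the command-line parser.
sanitize_comment() {
    printf '%s' "$1" | tr -d "\"'"
}

comment=$(sanitize_comment 'Lots of jobs failed with "bus error" in Gauss step')
echo "$comment"   # prints: Lots of jobs failed with bus error in Gauss step
```

The cleaned string can then be supplied as `--comment="$comment"` in the ban command shown above.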

GGUS tickets

  • The grid expert should immediately submit a GGUS ticket to the site, reporting the problem and any relevant information (such as the error message and the names of the local site nodes involved).
  • The GGUS ticket should be CC'd to

Post-banning action

It is important that sites do not remain banned indefinitely. Once the site has acted on and resolved the GGUS ticket, the Grid team should check the results of the SAM jobs to confirm that jobs are running at the site again. Once the site has been verified as operational, it should be unbanned and the GGUS ticket closed.

$ dirac-admin-allow-site

  • This will send an email to
  • Shifters should check the list of banned sites regularly so that no banned site is forgotten.
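The regular check for forgotten bans can be partly automated. The sketch below assumes the banned-sites listing has been reduced to lines that start with an ISO date (YYYY-MM-DD), as in the sample output earlier on this page; `old_bans` is a hypothetical helper name. Because ISO dates sort lexicographically, a plain string comparison in awk is enough and no date arithmetic is needed.

```shell
# Hypothetical helper: read listing entries on stdin and print those
# whose ban date (first field) is earlier than the cutoff date in $1.
old_bans() {
    # Appending "" forces awk to compare the values as strings.
    awk -v cutoff="$1" '($1 "") < (cutoff "")'
}

printf '%s\n' \
    "2008-10-01 14:42:54 roma Linux_i686_glibc-2.3.3" \
    "2008-12-08 18:32:34 gcowan All jobs failing with Application not Found" |
    old_bans 2008-11-01
# prints: 2008-10-01 14:42:54 roma Linux_i686_glibc-2.3.3
```

Entries printed by such a filter are candidates for follow-up with the site, not for automatic unbanning.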

Dealing with downtime

Sites regularly go into periods of maintenance during which they do not provide the complete set of Grid services. Often this means that the site cannot be used by LHCb, and it should be banned to prevent jobs from being scheduled there. In some circumstances the site may still be of use to LHCb, for example when only the storage is taken offline while the CPU remains available. In this case, the site could still be used for MC production (Tier-2s). If the site is completely off the grid, the Application status will often be reported as "No Grid CE available" because the site is not in the grid information system.

Shifters should check the site downtime calendar regularly to determine when sites should be banned and unbanned. Again, sites should only be unbanned once they have announced that they are officially out of downtime and are passing SAM tests.
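The downtime policy described in this section can be summarised as two small checks. This is a sketch under stated assumptions: the function names, the "up"/"down" and "yes"/"no" arguments, and the returned labels are all hypothetical, and the state of the services has to be determined by the shifter from the downtime calendar and the SAM results.

```shell
# Hypothetical helper: decide how to treat a site during a downtime.
# $1 = CPU (compute) service state, $2 = storage state: "up" or "down".
downtime_action() {
    cpu=$1
    storage=$2
    if [ "$cpu" != "up" ]; then
        echo "ban"                  # no jobs can run at the site
    elif [ "$storage" != "up" ]; then
        echo "mc-production-only"   # CPU is fine, storage is offline
    else
        echo "keep"
    fi
}

# Hypothetical helper: succeeds only if the downtime is officially
# over ($1 = "yes") and the SAM tests are passing ($2 = "yes").
ready_to_unban() {
    [ "$1" = "yes" ] && [ "$2" = "yes" ]
}

downtime_action up down    # prints: mc-production-only
if ready_to_unban yes no; then echo "unban"; else echo "keep banned"; fi
# prints: keep banned
```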

-- GreigCowan - 09 Dec 2008

Topic revision: r4 - 2009-01-12 - RobertoSantinel