This guide is superseded by GridExpertOnCall in this TWiki

  • Expert cell-phone number: 16-1914

The grid expert on call can be found in the Shift database here

The Grid Expert role:

A GEOC (Grid Expert On Call) is, as the name suggests, primarily a Grid Expert. This means that (s)he has to recognize Grid-related problems and open GGUS tickets accordingly.

But a GEOC also has to be an expert shifter: (s)he has to give hints and help to the production shifters, and know how to approach production and user problems. (S)he also has to be able to perform simple DIRAC-related operations, such as restarting a service that is stuck.

Grid Shifters are responsible for flagging issues with the ongoing activities in the Production system and directing reports to the Grid Expert on duty. Grid Experts on duty can make a first pass at solving the reported problems or delegate tasks to the relevant member of the Grid Team (or DIRAC developers) as necessary. Grid Experts are responsible for various tasks relating to the daily functioning of the Production system. Each has the duty to document procedures pertaining to these tasks as necessary, so that other Grid Experts and Shifters can make a first pass at resolving known problems. Grid Experts are 'on call' in order to provide a point of contact for the Grid Shifters.

While 'on call', the Grid Expert should remain in close contact with the shifter during office hours, e.g. by being in the control room. Outside office hours the on-call Grid Expert should be reachable via mobile phone in case of emergencies. Each member of the Grid Team will have at least a primary and a secondary role of expertise to ensure that all activities are covered in case of absence. When 'on call', a Grid Expert can delegate their duties to backup members of the Grid Team if necessary.

Daily meetings

Every day, a GEOC has to chair the daily operations meeting, and to give a report to the daily WLCG operations meeting:

  • The daily operations meeting starts every working day at 11:15 (CERN time) in the LHCb offline control room (2-R 014). The meeting may be cancelled when there is no real data taking and nothing urgent to discuss. An indico meeting should be set up by the GEOC, usually by cloning the previous meeting. The minutes then have to be put in the minutes of daily meeting ELOG.

Since in LCG a problem is not a problem unless a GGUS ticket has been opened, this report has to include a reference to each newly opened ticket, and news (or requests for news) on the older ones has to be briefly discussed within this meeting.

Operational responsibilities:

Operational responsibilities might vary, but include at least:

  • Compile the list of sites entering or leaving downtime as well as associated actions, e.g.
    • Ban site from production mask
    • Ban SE via redirection of associated SEs
    • Remove LFC mirrors from allowed lists
  • Liaise with site admins (GGUS) and with T1 site contacts.
  • Make a first pass at solving problems reported by Grid shifters and users
    • Report problems to other Grid Team members, Operations ELOG, Mailing Lists, DIRAC Developers, GGUS, Tier-1 contacts as necessary
  • Correlate any reported site failures with status in SAM
    • SAM Sensors and Jobs
    • Define any new test modules that should be added to the suite. Roberto Santinelli is in any case in charge of maintaining the suite of tests, so every new test should be discussed with him.
  • Take minutes at the Production meetings (held every Thursday) and, where needed, open a new Savannah Production operations task
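
The first item in the list above, compiling the sites entering or leaving downtime, amounts to comparing two snapshots of the downtime list. A minimal sketch in Python, assuming the downtime information has already been collected into plain sets of site names; the function `downtime_changes` and the site names are hypothetical and are not part of DIRAC:

```python
def downtime_changes(previous, current):
    """Compare two snapshots of the sites in downtime.

    previous, current: iterables of site names (hypothetical input format).
    Returns (entering, leaving) as sorted lists of site names.
    """
    previous, current = set(previous), set(current)
    entering = sorted(current - previous)  # candidates for banning from the production mask
    leaving = sorted(previous - current)   # candidates for being allowed again
    return entering, leaving


if __name__ == "__main__":
    # Illustrative site names only.
    yesterday = {"LCG.Example.zz", "LCG.Other.yy"}
    today = {"LCG.Other.yy", "LCG.Third.xx"}
    entering, leaving = downtime_changes(yesterday, today)
    print("Entering downtime:", entering)
    print("Leaving downtime:", leaving)
```

The associated actions (banning the site or SE, removing LFC mirrors) are deliberately left out of the sketch, since they follow the procedures defined by the Grid Team.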

A non-exhaustive list of more technical responsibilities includes:

- Pilot Operations: tracking problematic CEs through monitoring the progress of pilots submitted via the WMS. Correlate failures with SAM test results.

- DIRAC Monitoring and Logging: monitoring central DIRAC components using the available tools, restarting components as necessary. Categorizing errors in the logging service, identifying the scope and severity of problems relating to the Grid activities e.g. problematic CEs at a site, data access and data transfer failures relating to specific SRM endpoints.

- Data Management: perform maintenance operations according to defined procedures, e.g. file transfer and file removal from all related catalogues, resolution of problematic files etc. While simple operations can be done by the GEOC alone, the Data Manager will be in charge of performing the uncommon ones.

- Accounting / Auditing: monitor the incoming accounting records and compile reports on progress of data and workload management related activities.

- User Support: respond to users on the distributed analysis list as necessary. Announce downtimes of Tier-1 sites and DIRAC WMS interventions. Debug problems reported by users and convey new use-cases to the DIRAC developers.

- Transfer PIT - CASTOR: monitor the transfer activity during Data Taking to ensure that the RAW data are properly transferred from the PIT to CASTOR.
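
As an illustration of the Pilot Operations item above, the following Python sketch flags CEs with a high pilot failure rate and attaches the latest SAM result for each. It assumes the pilot records and SAM results have already been exported into simple Python structures; the function name, the input format, the status string "Failed" and the thresholds are all assumptions made for this example, not DIRAC or SAM interfaces:

```python
from collections import Counter


def problematic_ces(pilots, sam_ok, min_pilots=10, max_failure_rate=0.5):
    """Flag CEs whose pilot failure rate exceeds a threshold.

    pilots: iterable of (ce_name, status) pairs; the status "Failed"
            counts as a failure (assumed status string).
    sam_ok: dict mapping ce_name -> True if the latest SAM test passed.
    Returns a list of (ce_name, failure_rate, sam_passed), sorted by
    failure rate in descending order. sam_passed is None when no SAM
    result is known for that CE.
    """
    total = Counter()
    failed = Counter()
    for ce, status in pilots:
        total[ce] += 1
        if status == "Failed":
            failed[ce] += 1

    flagged = []
    for ce, count in total.items():
        if count < min_pilots:
            continue  # too few pilots to draw a conclusion
        rate = failed[ce] / count
        if rate > max_failure_rate:
            flagged.append((ce, rate, sam_ok.get(ce)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A CE that is flagged here while its SAM test passes may point to a problem the test suite does not cover, which is one way new test modules can be identified.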

Collection of operations procedures

Grid Experts are also responsible for defining Production Operations Procedures for other Grid Experts and Shifters to follow.

Defining Procedures: A Template

  • Recipes for many reported problems can be constructed by the operations team for the production shifters to follow, e.g.
Production Operations Procedure: <SHORT DESCRIPTION>
Date: <DATE>
Author: <NAME>
Version: 1.0

  • Any new tools (e.g. scripts, utilities) developed should be added to the DIRAC3 code repository in the LHCbSystem and made available to the shifters via a client installation.
  • After the tools are available, the above report can be uploaded to the Production Procedures
  • After feedback the operations procedure should be updated accordingly.
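
The template header above can also be generated programmatically, which keeps the fields consistent across procedures. A minimal sketch, assuming the day/month/year date format used in the example procedure on this page; the function `procedure_header` is hypothetical:

```python
from datetime import date


def procedure_header(description, author, version="1.0", when=None):
    """Render the header of a Production Operations Procedure.

    when: a datetime.date; defaults to today.
    """
    when = when or date.today()
    return "\n".join([
        f"Production Operations Procedure: {description}",
        f"Date: {when.strftime('%d/%m/%y')}",
        f"Author: {author}",
        f"Version: {version}",
    ])


if __name__ == "__main__":
    print(procedure_header("<SHORT DESCRIPTION>", "<NAME>"))
```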

Defining Procedures: An Example

  • Many useful tools for Grid Shifters can be written using existing APIs and provided as part of an operational suite with documentation for the steps to
    • Example: A Grid Shifter must reschedule many jobs for production 12345 in status 'Failed', 'Input Data Resolution'
      • Be aware of any unresolved issues requiring investigation

         Production Operations Procedure: Rescheduling jobs of a given ProductionID after transient site problems
         Date: 24/07/08
         Author: S.Paterson
         Version: 1.0
         Description: After a transient SE failure at site LCG.Example.zz many thousands of jobs were left in a failed state after input data resolution.
         Once confirmation has been received that the problem is resolved, the script below can be used to remedy the affected jobs.
         Tools To Use: dirac-production-reschedule-status

            * Run dirac-production-reschedule-status with the following arguments
                  * e.g. 12345 Failed 'Input Data Resolution'
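
Note that the minor status contains spaces and therefore has to reach the tool as a single argument. The Python sketch below builds the argument list and prints a shell-safe form of it. The argument order simply follows the example in the procedure above and has not been checked against the tool itself; the helper `reschedule_command` is hypothetical:

```python
import shlex


def reschedule_command(production_id, status, minor_status):
    """Build the argument list for the tool named in the procedure.

    The order (production ID, status, minor status) is taken from the
    example arguments in the procedure and is an assumption.
    """
    return [
        "dirac-production-reschedule-status",
        str(production_id),
        status,
        minor_status,
    ]


if __name__ == "__main__":
    args = reschedule_command(12345, "Failed", "Input Data Resolution")
    # shlex.join quotes any argument containing spaces, so the printed
    # line can be pasted into a shell as is.
    print(shlex.join(args))
```

Passing the list directly to `subprocess.run` would avoid shell quoting altogether, since each list element is handed over as one argument.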

Feature Requests

Feature requests from Shifters and Experts should go to the DIRAC Savannah project page.

The procedure for submitting feature requests and bug reports can be found here.

Ongoing Activities

Automating procedures to ensure all files are processed for a given reconstruction or stripping production

The processing of all available files for a given production is achievable but currently requires human (expert) interventions. The following page describes the current understanding of how to get to 100% processing efficiency.
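
Checking how far a production is from 100% can be pictured as a set comparison between the files that should be processed and the files that have been. A minimal sketch, assuming both lists have already been obtained as plain lists of file names; the function `processing_summary` and the input format are hypothetical and say nothing about how the lists are retrieved:

```python
def processing_summary(input_files, processed_files):
    """Return (efficiency, missing) for a production.

    input_files: file names the production is expected to process.
    processed_files: file names reported as processed; names that are
                     not part of the input are ignored.
    efficiency is the processed fraction of the input (1.0 if the
    input is empty); missing is the sorted list of unprocessed files.
    """
    inputs = set(input_files)
    done = inputs & set(processed_files)
    missing = sorted(inputs - done)
    efficiency = len(done) / len(inputs) if inputs else 1.0
    return efficiency, missing
```

The `missing` list is what an expert would then investigate file by file.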


Storage Requirements can be accessed here


Since DIRAC v5r0, all the (LHCb)DIRAC services have been migrated to new hardware. The list of LHCb dedicated services should be kept up to date accordingly.

A GEOC should have a basic understanding of how the services are deployed, and be able to tell when a service or agent is stuck and how to restart it.

Monitoring links

There are a great many monitoring links available.

CERN-specific monitoring links

Older operations links

-- FedericoStagni - 15-Nov-2010

Topic revision: r7 - 2016-10-17 - AndrewMcNab