
LHCb GEOC Guide

  • Expert cell-phone number: 16-1914

The grid expert on call can be found in the Shift database here

The Grid Expert role:

A GEOC is, as the name implies, primarily a Grid Expert: (s)he has to recognize Grid-related problems and open GGUS tickets for them.

A GEOC also has to be an expert shifter: (s)he has to give hints and help to the production shifters, and know how to approach production and user problems. (S)he also has to be able to perform simple DIRAC-related operations, such as restarting a stuck service.

Grid Shifters are responsible for flagging issues with the ongoing activities in the Production system and directing reports to the Grid Expert on duty. The Grid Expert on duty can make a first pass at solving the reported problems, or delegate tasks to the relevant member of the Grid Team (or to the DIRAC developers) as necessary. Grid Experts are responsible for various tasks relating to the daily functioning of the Production system. Each has the duty to document the procedures pertaining to these tasks, so that other Grid Experts and Shifters can make a first pass at resolving known problems. Grid Experts are 'on call' in order to provide a point of contact for the Grid Shifters.

While 'on call', the Grid Expert should remain in close contact with the Shifter during office hours, e.g. by being in or around the control room. Outside office hours, the on-call Grid Expert should be reachable via mobile phone in case of emergencies. Each member of the Grid Team will have at least a primary and a secondary role of expertise, to ensure that all activities are covered in case of absence. When 'on call', a Grid Expert can delegate their duties to backup members of the Grid Team if necessary.

Daily meetings

Every day, the GEOC has to chair the daily operations meeting and give a report to the daily WLCG operations meeting.

  • The daily operations meeting starts every working day at 11:15 (CERN time) in the LHCb offline control room (2-R 014). It may be cancelled when there is no real data taking and nothing urgent to discuss. The GEOC should set up an Indico meeting, usually by cloning the previous one. The minutes then have to be posted in the Minutes of daily meeting ELOG.

Since in LCG a problem is not a problem unless a GGUS ticket has been opened, the report to the WLCG operations meeting has to include a reference to each newly opened ticket, and news (or requests for news) on the older tickets should be briefly discussed in that meeting.

Collection of operations procedures

- Pilot Operations: Tracking problematic CEs through monitoring the progress of pilots submitted via the WMS. Correlate failures with SAM test results.

- DIRAC Monitoring and Logging: Monitoring central DIRAC components using the available tools, restarting components as necessary. Categorizing errors in the logging service, identifying the scope and severity of problems relating to the Grid activities, e.g. problematic CEs at a site, data access and data transfer failures relating to specific SRM endpoints.

- Data Management: Perform maintenance operations according to defined procedures, e.g. file transfer and file removal from all related catalogues, resolution of problematic files, etc.

- Accounting / Auditing: Monitor the incoming accounting records and compile reports on the progress of data and workload management related activities.

- User Support: Respond to users on the distributed analysis list as necessary. Announce downtimes of Tier-1 sites and DIRAC WMS interventions. Debug problems reported by users and convey new use-cases to the DIRAC developers.

Grid Experts are also responsible for defining Production Operations Procedures for other Grid Experts and Shifters to follow.
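
The Pilot Operations item above (correlating pilot failures with SAM test results) could be sketched roughly as follows. This is only an illustration under stated assumptions: the input formats, the function name problematic_ces, the status strings and the thresholds are all hypothetical and do not correspond to any actual DIRAC or SAM interface.

```python
from collections import Counter


def problematic_ces(pilots, sam_results, min_pilots=10, max_failure_rate=0.5):
    """Return (ce, failure_rate, sam_status) for CEs with many failed pilots.

    pilots:      iterable of (ce_name, status) pairs, where status is
                 "Done" or "Failed" (hypothetical format).
    sam_results: dict mapping ce_name -> "ok" or "error" (hypothetical format).
    """
    total = Counter()
    failed = Counter()
    for ce, status in pilots:
        total[ce] += 1
        if status == "Failed":
            failed[ce] += 1

    report = []
    for ce in sorted(total):
        if total[ce] < min_pilots:
            continue  # too few pilots to draw a conclusion
        rate = failed[ce] / total[ce]
        if rate > max_failure_rate:
            # A CE that also fails its SAM tests is a stronger candidate
            # for a GGUS ticket than one that fails pilots only.
            report.append((ce, rate, sam_results.get(ce, "unknown")))
    return report
```

A CE that appears in the result with a SAM status of "error" points to a site-side problem, while a status of "ok" or "unknown" suggests looking first at the pilots themselves.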

Feature Requests

Feature requests from Shifters and Experts should go to the DIRAC Savannah project page.

The procedure for submitting feature requests and bug reports can be found here.

Ongoing Activities

Automating procedures to ensure all files are processed for a given reconstruction or stripping production

As discussed in the PASTE meeting of 14th October 2008 (minutes) the processing of all available files for a given production is achievable but currently requires human (expert) interventions. The following page describes the current understanding of how to get to 100% processing efficiency.

Storage


Storage Requirements can be accessed here


DIRAC and LHCbDIRAC

Since DIRAC v5r0, all the (LHCb)DIRAC services have been migrated to new hardware. The List of LHCb dedicated services should be kept up to date accordingly.

A GEOC should have a basic understanding of how the services are deployed, be able to tell when a service or agent is stuck, and know how to restart it.
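
One simple way to think about "stuck" is a component that has not reported any activity for too long. The sketch below illustrates that idea only: the function name stuck_components, the component names, the heartbeat dictionary and the 30-minute threshold are all assumptions made for the example, not part of DIRAC.

```python
from datetime import datetime, timedelta


def stuck_components(last_seen, now, max_silence=timedelta(minutes=30)):
    """Return the sorted names of components silent for longer than max_silence.

    last_seen: dict mapping a component name to the datetime of its last
               recorded activity (hypothetical input, e.g. taken from logs).
    now:       the reference time to compare against.
    """
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_silence
    )


if __name__ == "__main__":
    now = datetime(2010, 11, 15, 12, 0)
    # Component names below are placeholders, not real service names.
    heartbeats = {
        "ExampleSystem/ExampleService": now - timedelta(minutes=5),
        "ExampleSystem/ExampleAgent": now - timedelta(hours=2),
    }
    for name in stuck_components(heartbeats, now):
        print("possibly stuck, consider restarting:", name)
```

The threshold should be chosen per component, since some agents legitimately run long cycles and would otherwise be flagged by mistake.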

Monitoring links

There are a great many monitoring links available.

CERN-specific monitoring links

Older operations links

-- FedericoStagni - 15-Nov-2010

Topic revision: r1 - 2010-11-15 - FedericoStagni
 