WLCG Operations

Introduction and overview

This section covers the organisation of computing operations in WLCG, and in particular describes the most important roles, bodies, communication channels and procedures. The main goal of WLCG computing operations is to provide a smooth and reliable environment for the experiments to run their computing activities, which requires among other things:
  • efficient communication
  • a quick resolution of issues according to agreed targets
  • coordination and decision bodies
  • well defined and agreed procedures
WLCG has already accumulated several years of experience in operations, in close collaboration with OSG, EGEE and EGI. The quality of the operational procedures is more than sufficient to ensure that the experiments can perform their duties with a very small chance of catastrophic incidents (significant loss of delivered luminosity or of archived data). Still, achieving this requires substantial effort, as neither the frequency of incidents nor their impact on the experiments seems to be decreasing. Improvements in this area are strongly correlated with other areas covered by this TEG, in particular problem tracking, monitoring and software deployment. Most of the improvements specific to this area are likely to come from incremental changes as more experience is collected.

Technology and tools

For this area, where the human factor is essential, we also classify the following as "tools" (in a much wider sense):
  • meetings
  • operational roles
  • decision bodies

Meetings

The following meetings are organized for WLCG operations:
  • Daily meeting: held every working day at 15:00 (CERN time). A meeting point for experiments, sites and service providers to discuss ongoing activities, issues of the last 24 hours and announcements of future interventions. It is by far the most effective communication channel available.
  • T1 Service Coordination meeting: a bi-weekly meeting to coordinate WLCG services across all sites and discuss any major incidents and long-standing issues. It has a wider scope than the daily meeting and is particularly suited to discussing and coordinating changes. Its only drawback is that it does not cover Tier-2 sites.
  • Grid Deployment Board: a monthly meeting discussing development activities and future directions in WLCG. It is mainly a way to inform about current and future developments; it can also take decisions, but on longer timescales. Tier-2 sites are also included, though they participate on a voluntary basis and only a few do so.

Roles and bodies

WLCG has defined some specific roles and bodies for various aspects of operations. The most relevant are listed here:
  • Security Officer: the coordinator for all security-related operational issues in WLCG. He/she advises on security risks and contributes to defining security policies for WLCG. To do so, he/she collaborates with other security teams in EGI and OSG, the LHC experiments and other stakeholders.
  • Information Officer: a single point of contact for improving the operational aspects of the Information System, ensuring coherence of the IS infrastructure and its evolution and collecting input from sites and users on behalf of the IS developers. He/she is responsible for coordinating campaigns to achieve specific goals (e.g. improve the stability of the service or the accuracy of the published information).
  • Data Management Officer: acts as a contact point for experiments, sites and the WLCG management whenever global coordination of an action related to data management and storage is required. Typical examples include changes in storage interfaces, central data management service upgrades and deployment of new versions of storage systems. He/she also takes care of the data management and storage session of the Tier-1 service coordination meeting.
  • Grid Deployment Board: described above, it is also capable of taking decisions, as explained.
  • Site administrator: a person in charge of operational duties for a WLCG site and registered as such in GOCDB or OIM. There can be several at a site, and their specific duties are not exposed outside the site.
  • Site security officer: a site contact person for all security issues. He/she can be contacted in case of security incidents at a site, and is in charge of coordinating all actions related to security.
  • Experiment contact person: an individual working at a site and acting as interface between his/her site and a specific LHC experiment. His/her main role is to coordinate all actions to be taken at a site to fulfill the needs of an experiment and, conversely, represent the experiment towards the site. Each experiment (apart from ALICE) has contact people at all Tier-1 sites and CMS has contact people also at Tier-2 sites.

Procedures and policies

Scheduling downtimes

There are two types of downtime that sites can declare for their services:

  • Scheduled downtimes which are planned and agreed in advance
  • Unscheduled downtimes which are triggered by unexpected failures of resources or services

Scheduled downtimes have to be announced at least 24 hours before the actual intervention. Any downtime announced with less notice is considered unscheduled. Announcements of downtimes are sent to stakeholders when the information is entered, and again 24 hours and 1 hour before the intervention. In addition to the formal announcements through tools (e.g. GOCDB), site contacts very often also contact the experiments in advance to schedule downtimes in a way that does not hinder operations more than necessary. For unscheduled downtimes, several templates for site admins are available at https://wiki.egi.eu/wiki/MAN04 to report both "short undetected" downtimes and those that require more than 1 hour to recover the system.
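
The 24-hour rule and the announcement times described above can be illustrated with a short Python sketch. It is not taken from GOCDB or any other WLCG tool: the function names are hypothetical, and skipping reminders that would fall at or before the time of declaration is an assumption made only for this example.

```python
from datetime import datetime, timedelta

# Minimum notice for a downtime to count as scheduled (from the text above).
MIN_NOTICE = timedelta(hours=24)
# Reminders are sent 24 hours and 1 hour before the intervention.
REMINDER_OFFSETS = (timedelta(hours=24), timedelta(hours=1))


def classify_downtime(declared_at: datetime, starts_at: datetime) -> str:
    """Return "scheduled" if declared at least 24 hours ahead, else "unscheduled"."""
    if starts_at - declared_at >= MIN_NOTICE:
        return "scheduled"
    return "unscheduled"


def announcement_times(declared_at: datetime, starts_at: datetime) -> list:
    """Return the times at which announcements would be sent.

    The first announcement goes out when the downtime is entered.
    Assumption: a reminder that would fall at or before the time of
    declaration is skipped.
    """
    times = [declared_at]
    for offset in REMINDER_OFFSETS:
        reminder = starts_at - offset
        if reminder > declared_at:
            times.append(reminder)
    return times


if __name__ == "__main__":
    declared = datetime(2011, 11, 17, 9, 0)
    start = datetime(2011, 11, 19, 9, 0)
    print(classify_downtime(declared, start))
    for t in announcement_times(declared, start):
        print(t.isoformat())
```

With these example dates, the downtime is declared 48 hours ahead, so it counts as scheduled and three announcements are produced. A downtime declared only 6 hours ahead would be classified as unscheduled and, under the assumption above, would keep only the initial announcement and the 1-hour reminder.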

Problem handling

Issues that are not handled within a certain time frame are collected and discussed in the T1 Service Coordination meeting.

Major incidents are discussed at the daily operations meeting whenever they happen and subsequently in the Management Board. A post mortem of any major incident is presented in the form of a "Service Incident Report (SIR)", which is usually provided by the service providers.

On a quarterly basis the Grid Deployment Board also provides a platform for the experiments to present any major issues experienced during the last 3 months that could be of general interest and merit discussion on a WLCG-wide scale.

Works well

The communication channels between WLCG and the experiments are very effective and allow sites to cooperate with the experiments in an efficient way.

Concerning procedures, SIRs are considered very useful, and they should probably be written not only by sites but also by experiments, in particular when incidents affect experiment services at the sites.

In most cases, the response of sites and the time to resolution are considered more than adequate.

Top issues

In some cases, the link between an experiment and a site is not strong enough and should be improved.

Another aspect is that there is no real WLCG operations team: computing operations are mostly done in the experiments and issues are followed up by them, while WLCG provides just coordination. Moreover, while Tier-1 sites are well covered by the Tier-1 Service Coordination Meeting, for Tier-2 sites there is no channel apart from the GDB, which is held only once per month and is not concerned with daily operations.

This is strongly correlated to the need for the experiments to have dedicated contact people at least at the most important sites, which is not really feasible at all sites (CMS is the only exception).

Another class of problems is related to the difficulty that sites sometimes have in knowing the exact requirements of an experiment (in particular if they change over time), as VO cards in CIC are not updated as regularly as they should be. Similarly, some sites express concern about the difficulty of identifying users responsible for abuses (in particular for jobs run by pilot-based frameworks) and about the excessive load on the site (for example on the shared file system used for software installation) caused by some of the jobs, leading in extreme cases to significant performance degradation.

To improve

Apart from addressing the problems mentioned above, some suggestions for improvement have been given.

It would be useful to provide a "whiteboard", similar to the CERN IT Status Board, where experiments can publish news of general interest for WLCG.

-- StefanRoiser - 17-Nov-2011

Topic revision: r6 - 2011-11-25 - AndreaSciaba
 