CMS-VO representative documentation for T2_CH_CSCS and T3_CH_PSI

Introduction

This documentation is intended to aid the CMS-VO representative for the T3 computing site at PSI and the T2 computing site at CSCS with their daily activities. It is not meant to be comprehensive, but rather to cover the basic responsibilities. The external links in this document will invariably change over time; please excuse any broken links, and feel free to fix them.

Responsibilities of the CMS-VO representative

The CMS-VO representative is, together with dedicated personnel at PSI and CSCS, responsible for the proper functioning of the T3 and T2 computing sites. The setup and use-cases of the two sites are rather different, and from here on are treated separately.

T2 computing sites like CSCS are connected to the Worldwide LHC Computing Grid (WLCG), and run analysis, reconstruction, and MC jobs from CMS, ATLAS and LHCb. They also have a storage element (SE), a large pool of disks used to store big datasets. The T2 site is located at CSCS in Lugano. CSCS is a large computing center with a team of dedicated professional computing engineers on-site (as opposed to smaller T2 sites that are sometimes managed by physicists themselves). At CSCS, CMS operations concern regular computing on the Phoenix computing cluster, high-performance computing on Piz Daint (a Cray supercomputer), and the storage element.

The ETH/UZH physics groups use the computing facilities at CSCS like all other grid users, i.e. via the grid; no direct job submission is possible. However, unlike most grid users, the ETH/UZH physics groups have write access to the storage element at T2_CH_CSCS, allowing them to stage out CRAB jobs and order CMS datasets via PhEDEx. A good fraction of the storage element is taken up by central datasets from CMS, ATLAS and LHCb, which are managed by (semi-)automated tools; it is of course key that the ETH/UZH physics groups' private use of the SE does not interfere with the central operations of the collaborations.

Regarding the T2, the responsibility of the CMS-VO rep boils down to making sure the CMS computing workflows are working as intended on the Phoenix and Piz Daint computing clusters, and making sure the storage element is working as intended for both CMS and the ETH/UZH physics groups. For more information on the CMS operations at CSCS, see the attachments TODO.
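When problems with the SE are suspected, the write access described above can be verified by hand. Below is a minimal sketch in Python, assuming a valid grid proxy and the gfal2 command-line utilities (gfal-copy, gfal-rm) on the machine; the SE endpoint and user path are placeholders, not the real T2_CH_CSCS values, so substitute the actual ones. The sketch stages out a small test file and removes it again.

#!/usr/bin/env python
"""Minimal SE write-access check for T2_CH_CSCS (sketch).

Assumes a valid grid proxy and the gfal2 command-line utilities
(gfal-copy, gfal-rm). The SE endpoint below is a placeholder;
substitute the actual T2_CH_CSCS endpoint and your user area.
"""
import subprocess
import tempfile
import uuid

# Placeholder endpoint/path -- replace with the real T2_CH_CSCS SE URL.
SE_URL = "root://storage.example.cscs.ch//store/user/<username>/se_write_test"

def run(cmd):
    """Run a command, echoing it, and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

def main():
    # Create a small local test file.
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
        f.write(b"SE write test\n")
        local_path = f.name

    remote = "%s/%s.txt" % (SE_URL, uuid.uuid4().hex)

    # Stage the file out to the SE and clean it up again.
    run(["gfal-copy", "file://" + local_path, remote])
    run(["gfal-rm", remote])
    print("Write access to the SE looks OK.")

if __name__ == "__main__":
    main()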

Responsibilities regarding T2_CH_CSCS

Frequent

Check the tickets submitted against CSCS on GGUS. At the time of writing, GGUS is the platform on which people involved with computing submit tickets against sites or organizations when they encounter a problem. Unfortunately, GGUS only allows one notification email address per site, which for CSCS is grid@cscs.ch, and the CMS-VO rep is not on it (the email address is defined in GOCDB, here). This means that (unless you code up a solution, see the sketch below!) the CMS-VO rep needs to check the submitted tickets manually. The link to do this changes every once in a while, so be careful. At the time of writing, tickets submitted against CSCS can be found here:

Tickets submitted against CSCS
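If you do want to automate the manual check, the sketch below illustrates one way to query the GGUS ticket search from a script. It is only a sketch: the search URL, query parameters and export format are assumptions and tend to change over time, and access normally requires a grid certificate, so adapt it to whatever the current GGUS interface offers. The requests Python library is assumed to be available.

#!/usr/bin/env python
"""Sketch of an automated check for open GGUS tickets against CSCS.

The GGUS search URL, its query parameters and the export format below
are assumptions; verify them against the current GGUS documentation
before relying on this.
"""
import requests

# Hypothetical search endpoint and parameters.
GGUS_SEARCH_URL = "https://ggus.eu/index.php?mode=ticket_search"
PARAMS = {
    "affectedsite": "CSCS-LCG2",
    "status": "open",
    "writeFormat": "XML",   # assumed export format
}
# Client certificate (user cert/key) for authentication.
CERT = ("/path/to/usercert.pem", "/path/to/userkey.pem")

def open_tickets():
    """Fetch the raw search result for open CSCS tickets."""
    response = requests.get(GGUS_SEARCH_URL, params=PARAMS, cert=CERT, timeout=60)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    result = open_tickets()
    # Parsing is left out here: inspect `result`, extract ticket IDs and
    # subjects, and e.g. send yourself a mail if anything is open.
    print(result[:2000])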

You are not supposed to handle tickets all by yourself, and you can expect support from the staff at CSCS. However, remember that the staff at CSCS consists of computer scientists, not physicists, and frequently your knowledge of CMS workflows and the infrastructure at CSCS is needed to allow efficient communication between CMS (the usual ticket submitter) and CSCS (the ticket receiver).

Monitor the Site Availability. All sites connected to the grid are constantly tested for their availability using SAM tests, small test jobs that check whether basic operations (like stage-out, access to central datasets, etc.) on the sites are working. If the SAM tests fail, you can usually be sure regular jobs would fail as well, and the site will (after some time) be temporarily removed from active production. It is thus important to solve problems with the site availability as soon as possible. More information about SAM tests can be found here.

If, because of some underlying problem, the CSCS T2 site becomes unavailable for CMS jobs or SE operations, a ticket will be submitted against CSCS. However, it takes some time for the ticket to be submitted, and then some more time before it is picked up. Relying on GGUS tickets alone therefore leaves the site unavailable for too long, hurting the overall availability.

The up-to-date site availability measures for CSCS are most easily retrieved using the SAM visualization dashboard:

http://wlcg-sam-cms.cern.ch/templates/ember/#/historicalsmry/heatMap?profile=CMS_CRITICAL&site=T2_CH_CSCS

This dashboard can be used to monitor the availability of all sites (not just CSCS), to rank sites by availability (which within higher management layers is often an important statistic, so don't neglect it!), and most importantly, to see which SAM tests are failing and what their output was. The plots in the dashboard are clickable, so you can easily 'zoom in' on the failing tests at the time of failure.
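To catch availability drops before a GGUS ticket arrives, the dashboard data can in principle be polled from a script. The sketch below only illustrates the idea: the API endpoint and the JSON layout are assumptions (the monitoring APIs behind the dashboard change regularly), so check what the current dashboard actually exposes before relying on it.

#!/usr/bin/env python
"""Sketch of a site-availability watchdog for T2_CH_CSCS.

The endpoint and the response layout are assumptions; the point is only
to illustrate polling the availability numbers and alerting early,
instead of waiting for a GGUS ticket.
"""
import requests

# Hypothetical JSON endpoint of the SAM dashboard -- replace with the
# real API behind http://wlcg-sam-cms.cern.ch once identified.
API_URL = "http://wlcg-sam-cms.cern.ch/api/availability"  # assumed
PARAMS = {"profile": "CMS_CRITICAL", "site": "T2_CH_CSCS"}
THRESHOLD = 0.9  # warn if availability drops below 90%

def check_availability():
    response = requests.get(API_URL, params=PARAMS, timeout=60)
    response.raise_for_status()
    data = response.json()
    # Assumed layout: a list of {"timestamp": ..., "availability": ...}.
    latest = data[-1]
    availability = latest["availability"]
    if availability < THRESHOLD:
        print("WARNING: T2_CH_CSCS availability is %.0f%%" % (100 * availability))
    else:
        print("T2_CH_CSCS availability OK (%.0f%%)" % (100 * availability))

if __name__ == "__main__":
    check_availability()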

Occasionally
