CMS-VO representative documentation for T2_CH_CSCS and T3_CH_PSI



- Monitoring the site availability
- Monitoring the computing over the last hours/years (note the start and end times!)
- Tickets submitted against CSCS (long link)
- Disk usage of the SE. Warning: very large file; recommended to open using curl or wget. This is essentially the sorted output of the du command, with file sizes in bytes.
- Open PhEDEx transfer/deletion requests
- Collection of notes made throughout the last 2 years


This documentation is intended to aid the CMS-VO representative for the T3 computing site at PSI and the T2 computing site at CSCS with their daily activities. This documentation is not meant to be comprehensive, but rather to cover the basic responsibilities. The external links in this document will invariably change over time; please excuse any broken links, and feel free to fix them.

Responsibilities of the CMS-VO representative

The CMS-VO representative is, together with dedicated personnel at PSI and CSCS, responsible for the proper functioning of the T3 and T2 computing sites. The setup and use-cases of the two sites are rather different, and from here on are treated separately.

T2 computing sites like CSCS are connected to the Worldwide LHC Computing Grid (WLCG) and run analysis, reconstruction, and MC jobs from CMS, ATLAS, and LHCb. They also host a storage element (SE), a large pool of disks used to store big datasets. The T2 site is located at CSCS in Lugano. CSCS is a large computing center with a team of dedicated professional computing engineers on-site (as opposed to smaller T2 sites, which are sometimes managed by physicists themselves). At CSCS, CMS operations concern regular computing on the Phoenix computing cluster, high-performance computing on Piz Daint (a Cray supercomputer), and the storage element.

The ETH/UZH physics groups use the computing facilities at CSCS like all grid users, i.e. via the grid; no direct job submission is possible. However, unlike most grid users, the ETH/UZH physics groups have write access to the storage element at T2_CH_CSCS, allowing them to stage out CRAB jobs and order CMS datasets via PhEDEx. A good fraction of the storage element is taken up by central datasets from CMS, ATLAS, and LHCb, which are managed by (semi-)automated tools; it is of course key that the private use of the SE by the ETH/UZH physics groups does not interfere with the central operations of the collaborations.

Regarding the T2, the responsibility of the CMS-VO rep boils down to making sure that the CMS computing workflows work as intended on the Phoenix and Piz Daint computing clusters, and that the storage element works as intended for both CMS and the ETH/UZH physics groups. For more information on the CMS operations at CSCS, see the attachments TODO.

Responsibilities regarding T2_CH_CSCS


Check the tickets submitted against CSCS on GGUS. At the time of writing, GGUS is the platform on which people involved with computing submit tickets against sites or organizations when they encounter a problem. Unfortunately, GGUS only allows one notification email address per site, and for CSCS the CMS-VO rep is not on it (the email address is defined in GOCDB, here). This means that (unless you code up a solution!) the CMS-VO rep needs to check the submitted tickets manually. The link to do this changes every once in a while, so be careful. At the time of writing, tickets submitted against CSCS can be found here:

Tickets submitted against CSCS

You are not supposed to handle tickets all by yourself; you can expect support from the staff at CSCS. However, remember that the CSCS staff are computer scientists, not physicists, and your knowledge of CMS workflows and the infrastructure at CSCS is frequently needed for efficient communication between CMS (the usual ticket submitter) and CSCS (the ticket receiver).
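The "code up a solution" idea mentioned above could start from something like the following sketch: periodically fetch the list of ticket IDs for CSCS and report any that appeared since the last poll. Fetching and parsing the actual GGUS search page is GGUS-specific and left as a stub here; the ticket IDs in the example are made up for illustration.

```python
# Sketch: detect newly submitted GGUS tickets by diffing ticket-ID sets
# between polls. The fetch step (scraping the GGUS search results for
# CSCS) is intentionally left to the caller as a callable.

def poll_once(fetch_ids, seen):
    """Diff the current ticket-ID set against previously seen IDs.

    fetch_ids: callable returning an iterable of ticket IDs.
    seen: set of IDs already known from earlier polls.
    Returns (updated_seen, newly_appeared), with new IDs sorted.
    """
    current = set(fetch_ids())
    return seen | current, sorted(current - seen)

if __name__ == "__main__":
    # Hypothetical ticket IDs, standing in for a real fetch from GGUS.
    seen, fresh = poll_once(lambda: ["141001", "141042"], {"141001"})
    print(fresh)  # only the ticket that was not seen before
```

Running this in a cron job and emailing yourself whenever `fresh` is non-empty would approximate the missing per-person GGUS notification.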

Monitor the Site Availability. All sites connected to the grid are constantly tested for their availability using SAM tests, small test jobs which test if basic operations (like staging-out, access to central datasets, etc.) on the sites are working. If the SAM tests fail, you can usually be sure regular jobs would fail as well, and the site will (after some time) be temporarily removed from active production. It is thus important to solve problems with the site availability as soon as possible. More information about SAM tests can be found here.

If because of some underlying problem the CSCS T2 site becomes unavailable for CMS jobs or SE operations, a ticket will be submitted against CSCS. However, it takes some time for the ticket to be submitted, and then some more time before the ticket is picked up. Relying on GGUS tickets leaves the site unavailable for too long, hurting the overall availability.

The up-to-date site availability measures for CSCS are most easily retrieved using the SAM visualization dashboard:

This dashboard can be used to monitor the availability of all sites (not just CSCS), to rank sites by availability (a statistic that higher management layers often care about, so don't neglect it!), and, most importantly, to see which SAM tests are failing and what their output was. The plots in the dashboard are clickable, so you can easily 'zoom in' on the failing tests at the time of failure.

Help debugging problems. When problems with the CMS workflows on CSCS arise, the CMS-VO rep and the dedicated staff at CSCS work together; typically the computer engineers will outclass the CMS-VO rep in terms of technical know-how, but the CMS-VO rep provides crucial physics context. This is of course a rather broad responsibility, and it would be impossible to list all the debugging techniques that one can apply.

Attend the monthly GridOps meeting. This meeting is scheduled by Pablo. He sends out email invitations containing a link to a twiki page which has all the info needed in order to attend. If you wish to receive these emails, just ask Pablo to send them to you as well.


Check private use of the CSCS SE. This requires downloading the SE disk-usage file (linked at the top of this page) and analyzing which users are using excessive space. Typically about 10 TB of disk usage per user is allowed, unless they are storing data intended for multiple people. Users must delete their own datasets: while you may in principle have the rights to delete data, the user should always do it themselves. The instructions on how to delete files from the T2 SE can be found on the How-to-interact-with-the-SE TWiki.
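The per-user analysis above could be sketched as follows. This assumes a du-style dump with one line per file ("&lt;bytes&gt; &lt;path&gt;", no cumulative directory entries, which would double-count) and usernames appearing in the path as .../user/&lt;name&gt;/...; adapt both assumptions to the actual dump format.

```python
# Sketch: aggregate per-user disk usage from a du-style dump of the SE
# and flag users above the informal ~10 TB guideline mentioned above.
from collections import defaultdict

QUOTA_BYTES = 10 * 1000**4  # ~10 TB, the informal per-user guideline

def usage_per_user(lines):
    """Sum bytes per user. Assumes lines like "<bytes> <path>" with
    one entry per file and paths containing .../user/<name>/..."""
    totals = defaultdict(int)
    for line in lines:
        size, path = line.split(None, 1)
        parts = path.strip().split("/")
        if "user" in parts:
            user = parts[parts.index("user") + 1]
            totals[user] += int(size)
    return dict(totals)

def over_quota(totals, quota=QUOTA_BYTES):
    """Users whose total usage exceeds the quota, sorted by name."""
    return sorted(u for u, b in totals.items() if b > quota)
```

With the real dump you would feed `open(dumpfile)` to `usage_per_user` and contact anyone returned by `over_quota`.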

Managing PhEDEx datasets.

Report at the face-to-face (F2F) meetings with CSCS.

Responsibilities regarding T3_CH_PSI


Support the T3 admin with user questions related to physics code.

Help users with access problems to the storage element.


Check disk usage of /shome, the SE, and /scratch.

Report at the ETH weekly meeting.

Help implementing new features at the T3.

Keep documentation up-to-date.
TODO: also add "the basic recommendation on using the cluster" to these items.
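The disk-usage check listed above can be automated with a small script. A minimal sketch, assuming the T3 areas /shome and /scratch are mounted on the machine where it runs (paths that are not present are simply skipped):

```python
# Sketch: report percent-used for the T3 areas (/shome, the SE mount,
# /scratch). The mount points are assumptions; adapt them to the
# machine at hand.
import shutil

def usage_report(paths):
    """Return {path: percent_used} for each path that exists."""
    report = {}
    for path in paths:
        try:
            total, used, _free = shutil.disk_usage(path)
        except OSError:
            continue  # path not mounted on this machine
        report[path] = 100.0 * used / total
    return report

if __name__ == "__main__":
    for path, pct in usage_report(["/shome", "/scratch", "/"]).items():
        print(f"{path}: {pct:.1f}% used")
```

Run from cron, this gives a quick weekly overview to report at the ETH meeting; for per-user detail on the SE, use the disk-usage dump instead.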

Problem-solving at T2_CH_CSCS

Gathering computing statistics

Essential skills needed

Knowledge of physics software

Knowledge of working on the GPU

Submit test crab jobs

Basic interactions with the storage elements

Topic revision: r37 - 2019-04-10 - ThomasKlijnsma