Tier0 Documentation

Contents:

Data Flow in the Tier-0:

Files from the CMS detector are transferred from Point 5 (P5) to the t0streamer area by the Storage Manager. The Tier-0 picks up all input files (.dat files) from there; depending on the stream a file belongs to, it follows a different processing path in the Tier-0 system. In the end we produce .root files that may belong to several data tiers, the Prompt Calibration Loop (PCL) uploads conditions to the dropbox, and histograms are uploaded to the DQM GUI. The following diagram shows the Tier-0 data flow in detail.

Tier0_Data_Flow.png
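
As a rough illustration of the first step of this flow (input .dat files arriving per stream), here is a minimal sketch in Python that groups streamer files by run and stream. The base path and the directory layout assumed below are illustrative assumptions, not the actual Tier-0 code.

from collections import defaultdict
import os

T0_STREAMER_AREA = "/eos/cms/store/t0streamer"  # assumed base path of the t0streamer area

def group_streamer_files(base=T0_STREAMER_AREA):
    """Walk the streamer area and map (run, stream) -> list of .dat files.
    Assumes a .../<stream>/<run>/ directory convention (an illustrative assumption)."""
    files_by_run_stream = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(base):
        for name in filenames:
            if name.endswith(".dat"):
                parts = dirpath.strip("/").split("/")
                stream, run = parts[-2], parts[-1]
                files_by_run_stream[(run, stream)].append(os.path.join(dirpath, name))
    return files_by_run_stream

if __name__ == "__main__":
    for (run, stream), paths in sorted(group_streamer_files().items()):
        # Express streams are processed with low latency; physics streams are
        # repacked and later reconstructed by PromptReco.
        print(run, stream, len(paths), "files")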

T0 Critical tickets:

Ticket type definitions:

  • If the issue is critical and CMS cannot wait until the next working day for a solution, the ticket should be opened as a GGUS ALARM ticket. Prompt action is guaranteed at any time.
  • If the issue is not critical and CMS can wait until the next working day for a site intervention, the ticket should be opened as a GGUS TEAM ticket. Action is guaranteed only during the next business hours.
  • Detailed and specific documentation can be found Here

General guidelines

  • If in doubt about how to create a ticket, please follow the instructions Here
  • If there is danger of data loss, an ALARM ticket is warranted
  • If the problem is not likely to cause data loss and can wait until the next business hours, a TEAM ticket is appropriate. A TEAM ticket can always be escalated to an ALARM ticket if necessary
  • A normal ticket should not be used, because it cannot be escalated to an ALARM ticket
  • When opening a ticket, cms-crc-on-duty should always be CC'ed
  • If possible, the CRC should be contacted and consulted before opening an ALARM ticket, just to keep them in the loop.
  • Disk quota problems (/store/data quota full): in this case we must contact the Virtual Organization Contact operator (as of 03/2015, Ivan.Glushkov@cern.ch and 0041764877257@mail2sms.cern.ch). Going through IT could take longer. A quick way to check the quota yourself is sketched below.
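
Before escalating a quota problem it can help to confirm that the area really is full. This is a minimal sketch, assuming the eos CLI is available on an lxplus-like node; the EOS instance and quota path below are assumptions and may need adjusting.

import subprocess

EOS_MGM = "root://eoscms.cern.ch"      # assumed CMS EOS instance
QUOTA_PATH = "/eos/cms/store/data"     # assumed quota node for /store/data

def quota_report(path=QUOTA_PATH):
    """Return the raw `eos quota ls` output for the given path (read it by eye)."""
    cmd = ["eos", EOS_MGM, "quota", "ls", path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(quota_report())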

Most seen urgent problems

  • Repack/Express stuck -- files inaccessible/stuck in CASTOR/EOS.
    • For Repack, we can survive with it stuck until PromptReco starts, or a bit before that I would say (44h) - TEAM ticket
    • For Express, we want to be done before the PCL 12h delay - ALARM ticket if during STABLE BEAMS, TEAM ticket if cosmics data taking is taking place.
  • PCL upload failing
    • Within what we can do, this is usually Express jobs failing due to inaccessible files - ALARM ticket
  • Frontier Problems
    • Usually PromptReco will fail when fetching conditions, and so will Reco jobs everywhere. If not much luminosity is being processed, we can open a TEAM ticket, but if we have to catch up on a backlog, it is better to open an ALARM ticket and get the issue fixed ASAP.
    • Express will also fail. If this is preventing STABLE BEAMS Express from running, open an ALARM ticket.

Useful monitoring links

T0 Subsystems:

Interesting links to know when and how collisions are happening:

  • Here is the DAQ System, for real-time monitoring, to know when CMS is taking data

  • Here's a live Twitter feed about the LHC status; CMS usually takes data when there are STABLE BEAMS

  • Here's a set of LHC monitors that are not that helpful by themselves, but if you know how to read them you can start to understand the machine and predict when you have to pay more attention:

  • And finally a tool that helps a lot in picking useful runs for replays, based on peak pileup (heavier events) and how much data was taken (in 1/pb)

Operations tools

How do I give the CSP shifters updated instructions?

If there is an issue we are already aware of and do not need to be notified about, a note should be added to the 'Computing Plan of the Day' by writing an email to cms-crc-on-duty@cern.ch explaining what the new/temporary instructions are. The CRC maintains this page and should integrate your updates. The current plan is available here.

Monitoring questions

How do I check the progress of a run processing in the Tier 0?

Go to the Run tab in the production WMStats and compare the 'run status' with the statuses you can find in this diagram.
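
If you prefer the command line over the WMStats web page, something like the following can work. This is a minimal sketch; the base URL and the API name below are assumptions for illustration (the authoritative view remains WMStats), and a valid grid proxy is assumed.

import os
import requests

# Hypothetical Tier-0 data service endpoint; replace with the real one.
BASE_URL = "https://cmsweb.cern.ch/t0wmadatasvc/prod"
PROXY = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())

def get_json(path):
    """GET a JSON document from the (assumed) Tier-0 data service, authenticating with the proxy."""
    resp = requests.get("%s/%s" % (BASE_URL, path), cert=(PROXY, PROXY), verify=False)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # 'run_config?run=...' is an assumed API name; 315512 is just an example run number.
    print(get_json("run_config?run=315512"))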

PCL

Tier0 WMAgent

Tier0Feeder

StorageManager -> SMDB (@P5) --(replicates to)--> cms_orcon_adg (@CERN) <--(reads)-- Tier0Feeder

Tier0Feeder --(writes)--> cmsr <--(reads)-- Tier0 WMAgent

Two databases

  • cmsr
  • cms_orcon_adg
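
As an illustration of how a component such as the Tier0Feeder might read the replicated StorageManager tables on cms_orcon_adg, here is a minimal sketch using cx_Oracle. The account, placeholder password and the table/column names are illustrative assumptions, not the actual Tier0Feeder code or schema.

import cx_Oracle

DSN = "cms_orcon_adg"        # TNS alias of the ADG replica
USER = "cms_t0_reader"       # hypothetical read-only account
PASSWORD = "changeme"        # placeholder; real credentials live in the agent secrets file

def fetch_new_streamers(last_id):
    """Return streamer-file rows inserted after last_id (hypothetical schema)."""
    with cx_Oracle.connect(USER, PASSWORD, DSN) as conn:
        cur = conn.cursor()
        cur.execute(
            "SELECT id, run_number, stream, lfn "
            "FROM streamer_files WHERE id > :last_id ORDER BY id",
            last_id=last_id,
        )
        return cur.fetchall()

if __name__ == "__main__":
    for row in fetch_new_streamers(last_id=0):
        print(row)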

Keys

Flocking to the Tier0 Pool

Involved teams are:

- Central Production and Reprocessing

- GlideInWMS These guys are also managing:

  • Tier0 central manager: vocms0820
  • Tier0 pool collector

In normal operation Tier0 jobs have higher priority than Central Production jobs (except for LogCollect, Merge and Cleanup jobs, which have the highest priority no matter which workflow or Schedd they belong to). This implies that jobs coming from the Tier0 go ahead of the MCP jobs in the queue. However, if the pool is full, it might take several hours for the Tier0 to use the site at full capacity. In case it is required to start using the resources faster, several "switches" are available to prevent new jobs from reaching the site or even to evict running ones. For disabling flocking to the Tier0 pool, enabling pre-emption in the Tier0 pool, or changing the status of the _CH_CERN sites in SSB, please refer to the T0 Pool Instructions. A quick way to inspect job priorities in the pool is sketched below.
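
For a quick look at how jobs are prioritized in the pool, the HTCondor Python bindings can be used from a node with access to the Tier0 pool collector. This is a minimal sketch; the collector hostname and the attribute selection are assumptions for illustration.

import htcondor

COLLECTOR_HOST = "vocms0820.cern.ch"  # assumed Tier0 central manager / collector

def idle_jobs_by_priority():
    """Query every schedd known to the collector and return idle jobs sorted by JobPrio."""
    collector = htcondor.Collector(COLLECTOR_HOST)
    jobs = []
    for schedd_ad in collector.locateAll(htcondor.DaemonTypes.Schedd):
        schedd = htcondor.Schedd(schedd_ad)
        jobs.extend(schedd.query(
            constraint="JobStatus == 1",  # 1 = Idle
            projection=["ClusterId", "ProcId", "JobPrio", "Owner"],
        ))
    return sorted(jobs, key=lambda ad: ad.get("JobPrio", 0), reverse=True)

if __name__ == "__main__":
    for ad in idle_jobs_by_priority()[:20]:
        print(ad.get("ClusterId"), ad.get("ProcId"), ad.get("JobPrio"), ad.get("Owner"))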

-- VytautasJankauskas - 2019-03-27
