Site Support Team - Documentation / Adding a New CMS Site

The instructions below are for the Site Support team to set up a new site. For site executives/admins there is a separate twiki with more reference material and explanation here.

Requirements

What you will need to be a good site

Tier2

  • be a good grid site in EGI or OSG
  • a farm with at least one CE (better 2 or more)
  • an SRM (please avoid multiple SEs as much as possible)
  • one machine to run a Squid server (two for a large site with, e.g., more than 500 execution slots)
  • a Vobox to run the PhEDEx agents
  • firewall permission for Squid and job monitoring
    • detailed below

Tier3

  • similar to a T2 but usually at smaller scale
  • no requirement to run CMS Grid jobs; the local batch system does not need to be Grid enabled and can be limited to local users.
  • no need to be up 24x7

Fully Operational Sites

T1s & T2s

  1. CMS secretariat & CMS VO: The site executive needs to be properly registered with the CMS secretariat, be included in the CMS VO in VOMS, and be a member of the CERN cms-zh e-group.
  2. CMS Name: Propose a CMS name for the site
  3. CRIC (old SiteDB setup):
    • Send an email to cms-comp-ops-coordinators@cern.ch informing them about the new site and asking for final confirmation.
    • Create a GGUS ticket (make sure that you are able to create GGUS tickets; if you are not, please visit the GGUS registration page)
      • Type of issue: CMS_Register New CMS Site, CMS Support Unit: CMS SiteDB, subject: Tier-{1, 2, 3} Registration
      • Include the following info:
        Title: Tier: CMS Name: Site Executive: Location:
        Example:
          Title: MIT
          Tier: Tier 2
          CMS Name: T2_US_MIT
          LCG Name: MIT_CMS
          Site Executive: Maxim Goncharov, Wei Li
          Location: Boston, MA, USA
          Usage: OSG
      • Please note that you can also provide the following fields (or site executive can fill these fields later on):
        Data Manager: name, e-mail; PhEDEx Contact: name, e-mail; Site Admin: name, e-mail
      • The Site Support team will set up the site and facility in CRIC. There is currently no need for site executives or admins to interact with CRIC if the "Data Manager" and "PhEDEx Contact" information is provided in the ticket (or if the site will not have a PhEDEx endpoint). The "Site Executive" and "Site Admin" lists are CERN e-groups and can be changed here. Put "cms-" into the search field, leave the search set to "begins with", check the "Only groups I own or manage" box and click "Search" to find the admin/exec group of your facility.
  4. GIT: Set up GIT and commit files to SITECONF
  5. CMSSW/CVMFS:
    • CMS software can be installed locally but this is no longer required. Instead, we suggest that sites use the CERN Virtual Machine FileSystem, CVMFS.
    • CVMFS is a read-only network filesystem; more information, including installation and configuration details, can be found on its twiki page.
    • CMS software installations on CVMFS are maintained/managed at CERN.
      • If you add storage.xml and site-local-config.xml files to the siteconf GIT repository they will appear automatically on CVMFS. There is a crontab/script that synchronizes CVMFS every few hours. (If you don't use CVMFS, you need to keep the copies of those files in your local installation in sync with the GIT repository.)
      • The script looks for two files whose names match storage* and site-local-config*.
      • Note: this means if you name files differently, they will likely not appear automatically on CVMFS!
  6. Frontier/Squid: Install and configure Squid for CMS Frontier
  7. glide-in WMS Subscription
    • CMS uses glide-in WMS to utilize Grid resources. Tier-1 and 2 sites thus need to allow pilot jobs of the glide-in WMS factory to execute on their compute elements.
    • Please open a CMS GGUS ticket (https://ggus.eu/?mode=ticket_cms) for Support Unit "Glidein Factory", provide the details about your CEs as listed below, and ask for the queues of your CEs to be added:
      1. Name of CE
      2. Type of CE
      3. Multi-core or single-core
      4. Wall-time limit
      5. Memory per core
      6. Any other special parameters you expect
    • Factory operations will likely add them to the integration setup first and test that things work. Make sure they are then also added to the production factories at CERN, Fermilab, and SDSC.
    • When the entries are ready in the production factories, please set the "Processing" field in SiteDB; this will enable your site in CRAB so that HammerCloud jobs can run.
    • The CE has to be whitelisted in the job dashboard. Inform Edward Karavakis or the CERN MONIT team.
  8. SAM tests: If the previous steps were successfully completed, CMS SAM CE tests should start to appear.
    • SAM tests are driven by CE and SE information in the VO-feed. (GOCDB and MyOSG entries are not used.) In case queue information is not explicitly specified in the glide-in WMS entry for the CE (and then the VO-feed will not include queue information for the CE either), SAM ETF takes the queue information from BDII if available.
    • Start by debugging the results of the following tests (not all of the SAM tests):
      1. org.sam.CREAMCE-JobSubmit
      2. org.cms.WN-env
      3. WN-basic
      4. org.cms.WN-swinst (CVMFS)
      5. org.cms.WN-frontier
      6. org.cms.WN-squid
  9. Rucio Setup
    1. CMS uses Rucio to transfer data to/from sites. The Rucio Storage Element at your site will be configured based on the information in storage.json.
    2. Double-check that storage.json has correct entries for all protocols you would like to use.
    3. Contact the transfer/data management team via GGUS, CMS Support Unit = "CMS Datatransfers", and ask them to set up a Rucio Storage Element for your site.
      • Please provide the site name, storage type, storage technology, and amount of disk space dedicated to central experiment use.
    4. Ask the transfer/data management team to subscribe/make a rule for the active SAM and HammerCloud datasets for your site:
      • old SAM dataset: /GenericTTbar/SAM-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO (4 files, 3.4 GB)
      • old HC dataset: /GenericTTbar/HC-CMSSW_7_0_4_START70_V7-v1/GEN-SIM-RECO (185 files, 603.0 GB)
      • new SAM dataset: /GenericTTbar/SAM-CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/AODSIM (2 files, 2.3 GB)
      • new HC dataset: /GenericTTbar/HC-CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/AODSIM (47 files, 114.9 GB)
      • newest SAM dataset: /GenericTTbar/SAM-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/AODSIM (3 files, 2.3 GB)
      • newest HC dataset: /GenericTTbar/HC-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/AODSIM (46 files, 113.5 GB)
    5. If you would like to contribute disk space for local dataset subscription/storage, please let them know the amount of space and ask them to set up a local Rucio account.
    6. Rucio Link commissioning:
  10. Rucio LoadTests:
    • Rucio transfers a small file in the /store/test/loadtest area between sites to check that the links between RSEs are working. Please contact the transfer/data management team via GGUS, CMS Support Unit = "CMS Datatransfers", and ask for your RSE to be included in the LoadTest setup.
  11. SRM test: Debug the SRM SAM test results - org.cms.SRM
    • The site needs working PhEDEx agents to pass these tests.
  12. /store/unmerged/ cleaning
    • Take a look at the CMS namespace policy and set up cleaning for /store/unmerged/. With the exception of files belonging to ongoing workflows, files older than 2 weeks can be deleted automatically. Datasets that are still active are published here.
  13. Check and debug SAM test results:
    • org.cms.glexec.WN-gLExec
    • org.cms.WN-analysis
    • org.cms.WN-mc
    • org.cms.WN-xrootd-access
    • org.cms.WN-xrootd-fallback
  14. Check and debug HammerCloud tests:
    • To start the tests, contact Andrea Sciaba to have the site added to the appropriate test templates.
  15. Xrootd Architecture - Link
    • Please inform the CMS Site Support team, cms-comp-ops-site-support-team@cern.ch, about the URI of the xrootd endpoints of your site. This should be the site endpoint(s) (or redirector) that is subscribed to the federation and serves the namespace of the PhEDEx node name (and does not redirect to other site(s)).
  16. CMS VO Feed - Link
    • If the site is properly registered, i.e. configured in CRIC, glide-in WMS factory entries are defined, Rucio is configured, and xrootd endpoints are announced, then the site with all its services will appear in the VO-feed.
    • The CMS VO-feed uses information from:
      • Rucio for SEs
      • glide-in WMS factories for CEs
      • a JSON file for xrootd
  17. Monitoring:
    • Is the site automatically appearing in Site Status Board?
      • The CMS VO Feed is the source for the TopologyMaintenances metric in SSB; this means that if the site is in the CMS VO Feed it will appear in SSB as well.
      • Having the site in the CMS VO Feed doesn't mean that the site will be monitored for every metric in SSB; sometimes the person responsible for a metric has to add the site to the test template (for example, Andrea Sciaba for the HammerCloud test).
      • Set values for the cpubound (160) and iobound (161) metrics. cpubound should be the number of cores of the site and iobound the number of cores that the site can sustain running I/O intensive applications (default 10% of cpubound).
    • Is the site automatically appearing in Dashboard?
      • No, Dashboard ops (e.g. Julia) may need to explicitly refresh the import of site names from SiteDB.
    • Is the site automatically appearing in the Site Readiness plots?
      • The site will appear in Site Readiness plots once it appears in SiteReadiness Status metric in SSB (Site has to be added to the test template).
      • Duncan and/or Pepe Flix can give more details about the site readiness machinery.
    • Once in SSB, is the site placed in the Waiting room or in Production?
  18. Waiting Room: Move site out of the SSB Waiting Room
  19. Site is ready to work!
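
The file-name rule in step 5 can be illustrated with a short Python sketch. It only mimics the matching described there (names beginning with storage or site-local-config); the actual synchronization script is not shown on this page, so the function name `will_sync_to_cvmfs` and the use of case-sensitive shell-style matching are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# The two patterns are quoted from step 5; everything else is hypothetical.
SYNCED_PATTERNS = ("storage*", "site-local-config*")


def will_sync_to_cvmfs(filename):
    """Return True if the bare file name matches one of the synced patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in SYNCED_PATTERNS)


if __name__ == "__main__":
    # Print which candidate names would be picked up under this assumption.
    for name in ("storage.xml", "site-local-config.xml", "my-storage.xml"):
        print(name, will_sync_to_cvmfs(name))
```

Running a check like this before committing to SITECONF is one way to catch a renamed file that would otherwise not show up on CVMFS.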
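
For the /store/unmerged/ cleaning in step 12, a minimal Python sketch of the selection logic might look as follows. It assumes the area is visible as a POSIX directory tree and that the directories of still-active workflows are supplied as a list of path prefixes; the function name, the argument names, and the way the protected list is obtained are all hypothetical. The sketch only lists candidates and deletes nothing.

```python
import os
import time

# Age limit from step 12: files older than 2 weeks may be deleted.
MAX_AGE_SECONDS = 14 * 24 * 60 * 60


def find_deletable(unmerged_root, protected_prefixes, now=None):
    """List files under unmerged_root that are older than two weeks and
    do not live under any protected (still active) directory prefix."""
    now = time.time() if now is None else now
    # A trailing separator keeps "/a/b" from also protecting "/a/bc".
    protected = tuple(os.path.join(p, "") for p in protected_prefixes)
    deletable = []
    for dirpath, _dirnames, filenames in os.walk(unmerged_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path.startswith(protected):
                continue  # belongs to an ongoing workflow, keep it
            if now - os.path.getmtime(path) > MAX_AGE_SECONDS:
                deletable.append(path)
    return sorted(deletable)
```

Keeping selection separate from deletion makes it possible to review the candidate list before anything is removed.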

T3s

  • Follow the same steps as for T2s except for the following ones which have different instructions depending on the T3:
  • 3. SiteDB:
    • If you are not Grid-enabling your batch system and accepting pilot jobs from the glide-in WMS factories, then please leave the "Processing" entry in SiteDB blank.
  • 5. CMSSW/CVMFS
    • T3s can "skip" this step.
    • However, if the site wants to use the data they host on the Tier3, we should recommend that they install CVMFS on their local machines to get CMSSW.
  • 6. Frontier/Squid
    • If more than a few CMSSW jobs will run at the site, a local Squid should be commissioned and registered centrally for monitoring.
    • At the very least, the site must point in its site-local-config.xml to the CERN Opportunistic Squids cmsopsquid[1,2].cern.ch
  • 8. SAM tests - There won't be any CE related ones if the site only provides SE.
    • [The SRM test should be addressed though]
  • 12. A Tier-3 has no requirements concerning commissioned links. Note, however, that data are only transferred on commissioned links. The site has to plan from where it intends to download data and needs to commission those links.
  • 15. Since there is no CE there won't be any HC running
  • 17. The VO Feed will miss a CE, which is fine for a SE Tier-3 only.
  • 18. Monitoring for the SE would help the site, but we do not require it.
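
The bracket shorthand cmsopsquid[1,2].cern.ch used in the Frontier/Squid item above stands for two host names. A small Python sketch of that expansion is given below; the helper name `expand_hosts` is hypothetical, and the sketch deliberately does not guess at ports or at the layout of site-local-config.xml.

```python
import re


def expand_hosts(pattern):
    """Expand a single [a,b,...] group in a host pattern into host names."""
    match = re.search(r"\[([^\]]+)\]", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    prefix = pattern[:match.start()]
    suffix = pattern[match.end():]
    return [prefix + part.strip() + suffix for part in match.group(1).split(",")]


if __name__ == "__main__":
    for host in expand_hosts("cmsopsquid[1,2].cern.ch"):
        print(host)
```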

Opportunistic Sites

  • "Opportunistic" means that a site provides resources without a pledge. This usually means there are no guarantees about availability; it can also mean that there is no local CMS support or that the resources are only available for a short time.
  • For sites with a small amount of resources, we don't want to create a new CMS site, because there are administrative and operational overheads attached to having a larger number of CMS sites.
  • For sites with a larger amount of resources, the time period for which we get them is often too short to go through the whole process of adding a new CMS site (which could take 2 weeks or more).
  • It is worthwhile to go through the whole process of adding a new CMS site only for opportunistic sites that provide a large amount of resources for an extended period of time.
Under opportunistic we define two types of sites:
  1. Fully opportunistic:
    • A site that is not part of CMS and that has no local CMS support.
    • We can add resources dynamically to an existing meta-site:
      • T3_US_Parrot
      • T3_EU_Parrot
      • T3_US_ParrotTest (only for tests)
      • more meta-sites can be created for different regions as needed (Asia for instance)
    • OR create a new CMS site if the amount of resources and the time we have them warrants it
    • The dynamic model is useful for frequently adding/removing small amounts of resources by just reconfiguring the frontend/factory.
    • Beyond a certain amount of resources it introduces inflexibility in how we can assign work to these resources.
    • A meta-site with much more than 10,000 slots is a bit unwieldy (if we ever get there we can split this up further into more meta-sites).
  2. Semi-opportunistic
    • An already existing CMS site that wants to provide extra resources for a short period of time.
    • Contact the production/analysis teams to assign more work to this site.

Fully Opportunistic

  1. Confirm that the site has authorization and that it has contacted the corresponding people (ask Oliver Gutsche).
  2. Confirm the amount of resources that the site will provide and for how long.
    • If the site will provide a large amount of resources or will provide them for quite a long time, it is probably better to add it as a new CMS site.
  3. Make sure that the site provides a way to access the resources via glideInWMS (which should exist if it's a grid site)
  4. No SITECONF changes needed
  5. Contact the CMS GlideinWMS Parrot support list
    • To configure the existing glideInWMS factory entry to access the number of slots that the site will provide.
    • To take that factory entry and add it to the T3_US_Parrot or T3_EU_Parrot group
  6. Jobs will be submitted to these resources.
  • The modifications are only in the glideInWMS system and they are completely hidden within the glideInWMS system
    • This is a problem for monitoring: none of our systems would know what is inside T3_US_Parrot or that you added or removed some resources. They just see a collection of resources called T3_US_Parrot, which could be anything and anywhere.
  • SAM and HC tests. We don't need to run those tests at T3_US_Parrot.
  • If we want to evaluate some specific resources, we could ask the glidein Factory to turn off (momentarily) the other sites/CEs associated with T3_US_Parrot, leave on only the one we are interested in, and start sending jobs. All checks will be limited to the glidein level; we won't have monitoring from any of the other regular systems.

For Production

  • Fully opportunistic - parrot - not ready for that at the glidein level
  • Semi-opportunistic - could be set up in an easy way
    • If it is an already existing CMS site and it is already in the production list, it is a matter of increasing the threshold in the WMAgent to send more jobs to the site.
    • If it is a new CMS site, we need to commission the site (1-2 weeks)
      • it should be easier to commission an already existing CMS T3
Topic revision: r68 - 2021-01-29 - StephanLammel
 