Site Support Team - Documentation / Adding a New CMS Site

The instructions below are for the Site Support team to set up a new site. For site executives/admins there is a different twiki with more reference material, explanation, etc. here.

Requirements

What you will need to be a good site

Tier2

  • be a good grid site in EGI or OSG
  • a farm with at least one CE (preferably two or more)
  • an SRM (please avoid multiple SEs as much as possible)
  • one machine to run a Squid server (two for a large site with e.g. more than 500 execution slots)
  • a VOBox to run the PhEDEx agents
  • firewall permission for Squid and job monitoring
    • detailed below

Tier3

  • similar to a T2 but usually at smaller scale
  • no requirement to run CMS Grid jobs; the local batch system does not need to be Grid-enabled and can be limited to local users
  • no need to be up 24x7

Fully Operational Sites

T1s & T2s

  1. CMS secretariat & CMS VO: The site executive needs to be properly registered with the CMS secretariat, be included in the CMS VO in VOMS, and be a member of the CERN cms-zh e-group.
  2. CMS Name: Propose a CMS name for the site
  3. CRIC (replaces the old SiteDB setup):
    • Send an email to cms-comp-ops-coordinators@cern.ch informing them about the new site and asking for final confirmation.
    • Create a GGUS ticket (make sure that you are able to create GGUS tickets; if you are not, please visit the GGUS registration page)
      • Type of issue: CMS_Register New CMS Site, CMS Support Unit: CMS SiteDB, subject: Tier-{1, 2, 3} Registration
      • Include the following info:
        Title, Tier, CMS Name, Site Executive, Location
        Example:
          Title: MIT
          Tier: Tier 2
          CMS Name: T2_US_MIT
          LCG Name: MIT_CMS
          Site Executive: Maxim Goncharov, Wei Li
          Location: Boston, MA, USA
          Usage: OSG
      • Please note that you can also provide the following fields (or the site executive can fill them in later):
        Data Manager: name, e-mail; PhEDEx Contact: name, e-mail; Site Admin: name, e-mail
      • The Site Support team will set up the site and facility in CRIC. There is currently no need for site executives or admins to interact with CRIC if the "Data Manager" and "PhEDEx Contact" information is provided in the ticket (or in case the site will not have a PhEDEx endpoint). The "Site Executive" and "Site Admin" lists are CERN e-groups and can be changed here. Put "cms-" into the search field, leave the search set to "begins with", check the "Only groups I own or manage" box and click "Search" to find the admin/exec group of your facility.
  4. GIT: Set up GIT and commit files to SITECONF
  5. CMSSW/CVMFS:
    • CMS software can be installed locally but this is no longer required. Instead we suggest that sites use the CERN Virtual Machine FileSystem, CVMFS.
    • CVMFS is a read-only network filesystem; more information, including installation and configuration details, can be found on its twiki page.
    • CMS software installations on CVMFS are maintained/managed at CERN.
      • If you add storage.xml and site-local-config.xml files to the siteconf GIT repository they will appear automatically on CVMFS. There is a crontab/script that synchronizes CVMFS every few hours. (If you don't use CVMFS, you need to keep the copies of those files in your local installation in sync with the GIT repository.)
      • The script looks for two files whose names match storage* and site-local-config*.
      • Note: this means that if you name the files differently, they will likely not appear automatically on CVMFS!
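    • For orientation only, a minimal site-local-config.xml sketch is shown below; the site name, CVMFS path, and protocol are placeholder values, and the full set of fields expected for your site is best copied and adapted from the SITECONF of an existing, similar site.

        <site-local-config>
          <site name="T2_XX_Example">  <!-- placeholder CMS site name -->
            <event-data>
              <!-- points jobs at the TFC (storage.xml) committed to SITECONF and synced to CVMFS -->
              <catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T2_XX_Example/PhEDEx/storage.xml?protocol=direct"/>
            </event-data>
          </site>
        </site-local-config>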
  6. Frontier/Squid: Install and configure Squid for CMS Frontier
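    • As a sketch of how the Squid typically enters the site configuration: the proxy is advertised to CMSSW jobs in the calib-data/frontier-connect section of site-local-config.xml. The hostnames and server URL below are placeholders; check the Frontier twiki for the currently recommended server list.

        <calib-data>
          <frontier-connect>
            <!-- local Squid(s) of the site; placeholder hostnames -->
            <proxy url="http://squid1.example.org:3128"/>
            <proxy url="http://squid2.example.org:3128"/>
            <!-- central Frontier launchpad; placeholder, verify against the Frontier twiki -->
            <server url="http://cmsfrontier.cern.ch:8000/FrontierProd"/>
          </frontier-connect>
        </calib-data>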
  7. glide-in WMS Subscription
    • CMS uses glide-in WMS to utilize Grid resources. Tier-1 and Tier-2 sites thus need to allow pilot jobs of the glide-in WMS factory to execute on their compute elements.
    • Please open a CMS GGUS ticket (https://ggus.eu/?mode=ticket_cms) for Support Unit "Glidein Factory", provide the details about your CEs as below, and ask for the queues of your CEs to be added:
      1. Name of CE
      2. Type of CE
      3. Multi-core or single-core
      4. Wall-time limit
      5. Memory per core
      6. Any other special parameters you expect
    • Factory operations will likely add them to the integration setup first and test that things work. Don't forget to also get them added to the production factories at CERN, Fermilab, and SDSC.
    • When the site is ready in the production factories, please set the "Processing" field in SiteDB; this will enable your site in CRAB so that HammerCloud jobs can run.
    • The CE has to be whitelisted in the job dashboard. Inform Edward Karavakis or the CERN MONIT team.
  8. SAM tests: If the previous steps were successfully completed, CMS SAM CE tests should start to appear.
    • SAM tests are driven by CE and SE information in the VO-feed. (GOCDB and MyOSG entries are not used.) In case queue information is not explicitly specified in the glide-in WMS entry for the CE (and then the VO-feed will not include queue information for the CE either), SAM ETF takes the queue information from BDII if available.
    • Start debugging the following test results (not yet all of the SAM tests):
      1. org.sam.CREAMCE-JobSubmit
      2. org.cms.WN-env
      3. WN-basic
      4. org.cms.WN-swinst (CVMFS)
      5. org.cms.WN-frontier
      6. org.cms.WN-squid
  9. PhEDEx Setup
    1. Configure a host to run PhEDEx
    2. Edit your storage.xml file (needs a registered CMS member at the site)
  10. PhEDEx entry (node) & links: (This is done manually)
    • Open a new GGUS ticket to the Transfer team (cmscompinfrasup-datatransfer squad) asking "please create Site node & links in PhEDEx", with the following information:
      • Storage Element host (e.g. cmsdcache.pi.infn.it)
      • Storage Element kind (e.g. Buffer, MSS, Disk)
      • Storage Element technology (e.g. dCache, Castor, DPM, Disk, Other)
    • In this step we can request the creation of links as well (this is not link commissioning)
      • For T3s we don't create all links but only specific requested links - ask which ones.
    • Please add the phedex-node name to the local-stage-out section of the site-local-config.xml file.
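    • A minimal sketch of such a local-stage-out section is shown below; the node name, stage-out command, and catalog URL are placeholders and must be replaced by the values appropriate for your site.

        <local-stage-out>
          <!-- the PhEDEx node name created in this step -->
          <phedex-node value="T2_XX_Example"/>
          <!-- placeholder stage-out command and TFC location -->
          <command value="srmv2-lcg"/>
          <catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T2_XX_Example/PhEDEx/storage.xml?protocol=srmv2"/>
        </local-stage-out>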
  11. PhEDEx agents: Configure PhEDEx agents
    • https://twiki.cern.ch/twiki/bin/view/CMS/PhedexDraftDocumentation#Site_Administrator_Documentation
    • Send your grid user public key (usercert.pem) to cms-phedex-admins[AT]cern.ch. You will receive 3 encrypted e-mails containing PhEDEx roles and passwords for Prod, Debug and Dev. Decrypt the 3 e-mails using the same certificate you sent and put the 3 outputs into one file, to be used to connect to the PhEDEx TMDB.
    • Access to the PhEDEx database from outside CERN is firewalled, and your hosts will need to be granted access. Please send a mail to Physics-Database.Support AT cern.ch and cms-phedex-admins AT cern.ch asking for your host(s) to be allowed to connect. You should give the name of the domain or subnet that needs access. Ideally this domain/subnet should not contain too many machines, but it should be open enough that you can change hosts without having to repeatedly ask for new holes in the firewall.
    • If a site upgrades to SL6, how should PhEDEx be updated?
      • Starting from version 4.2.0, PhEDEx is distributed through CVMFS, but you can still choose to install and configure the PhEDEx agents manually.
      • You can find the instructions for SLC6 here: LINK
    • PhEDEx needs the trivial file catalogue (TFC) of the site. It gets it directly from the agents configured at that site. Specifically, there is an agent whose sole job is to make sure that the TFC is regularly uploaded to the PhEDEx database. PhEDEx does not take the TFC, or anything else, from SITECONF.
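    • For reference, a sketch of a trivial file catalogue (storage.xml) with one direct rule and one SRM rule chained on top of it; the paths, hostnames, and protocol names are placeholders and must be adapted to the storage system of the site.

        <storage-mapping>
          <!-- direct (POSIX-like) access on the worker nodes; placeholder mount point -->
          <lfn-to-pfn protocol="direct" path-match="/+store/(.*)" result="/storage/cms/store/$1"/>
          <!-- SRM access for stage-out and transfers, built on the direct rule; placeholder endpoint -->
          <lfn-to-pfn protocol="srmv2" chain="direct" path-match="(.*)" result="srm://srm.example.org:8443/srm/managerv2?SFN=$1"/>
          <pfn-to-lfn protocol="direct" path-match="/storage/cms/(store/.*)" result="/$1"/>
        </storage-mapping>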
  12. SRM test: Debug SRM SAM tests results - org.cms.SRM
    • The site needs working PhEDEx agents to pass these tests.
  13. LoadTest07 samples: Create and Inject LoadTest07 samples
    • Either create one from scratch or copy an existing one from another site, as detailed in the "Downloading_a_LoadTest07_sample" section of this twiki:
    • If the site is a T3, will it be used for Monte Carlo production? If not, the source LoadTest dataset does not need to be created and injected; only export links need the source LoadTest datasets, and T3s usually have import links only.
  14. PhEDEx Link commissioning:
  15. DDM quota
  16. /store/unmerged/ cleaning
    • Take a look at the CMS namespace policy and set up cleaning for /store/unmerged/. With the exception of files belonging to ongoing workflows, files older than 2 weeks can be deleted automatically. Still-active datasets are published here. Sites are encouraged to take a look at the customizable reference implementation.
  17. HC & SAM tests:
    • Transfer HC & SAM tests samples to site. The datasets are:
      • old SAM dataset: /GenericTTbar/SAM-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO (4 files, 3.4 GB)
      • old HC dataset: /GenericTTbar/HC-CMSSW_7_0_4_START70_V7-v1/GEN-SIM-RECO (185 files, 603.0 GB)
      • new SAM dataset: /GenericTTbar/SAM-CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/AODSIM (2 files, 2.3 GB)
      • new HC dataset: /GenericTTbar/HC-CMSSW_9_0_0_90X_mcRun1_realistic_v4-v1/AODSIM (47 files, 114.9 GB)
      • newest SAM dataset: /GenericTTbar/SAM-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/AODSIM (3 files, 2.3 GB)
      • newest HC dataset: /GenericTTbar/HC-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/AODSIM (46 files, 113.5 GB)
    • The datasets need to be subscribed to the FacOps group (otherwise they will be deleted by DDM once the storage of the site fills up)!
    • Debug tests results:
      • Remaining SAM tests:
      1. org.cms.glexec.WN-gLExec
      2. org.cms.WN-analysis
      3. org.cms.WN-mc
      4. org.cms.WN-xrootd-access
      5. org.cms.WN-xrootd-fallback
    • HammerCloud test
      • To start the tests, contact Andrea Sciaba to add the site to the appropriate test templates.
  18. Xrootd Architecture - Link
    • Please inform the CMS Site Support team, cms-comp-ops-site-support-team@cern.ch, about the URI of the xrootd endpoints of your site. This should be the site endpoint(s) (or redirector) that is subscribed to the federation and serves the namespace of the PhEDEx node name (and does not re-direct to other sites).
  19. CMS VO Feed - Link
    • If the site is properly registered, i.e. configured in CRIC, glide-in WMS factory entries are defined, PhEDEx is configured with agents running, and xrootd endpoints are announced, then the site with all its services will appear in the VO-feed.
    • The CMS VO-feed uses information from:
      • PhEDEx for SEs
      • glide-in WMS factories for CEs
      • a JSON file for xrootd endpoints
  20. Monitoring:
    • Is the site automatically appearing in Site Status Board?
      • The CMS VO feed is the source for the TopologyMaintenances metric in SSB, which means that if the site is in the CMS VO feed it will appear in SSB as well.
      • Having the site in the CMS VO feed doesn't mean that this site will be monitored for every metric in SSB; sometimes the person responsible for the metric has to add the site to the test template - for example: HammerCloud tests (Andrea Sciaba).
      • Set values for the cpubound (160) and iobound (161) metrics. cpubound should be the number of cores of the site and iobound the number of cores that the site can sustain running I/O-intensive applications (default 10% of cpubound).
    • Is the site automatically appearing in Dashboard?
      • No, Dashboard ops (e.g. Julia) may need to explicitly refresh the import of site names from SiteDB.
    • Is the site automatically appearing in the Site Readiness plots?
      • The site will appear in Site Readiness plots once it appears in SiteReadiness Status metric in SSB (Site has to be added to the test template).
      • Duncan and/or Pepe Flix can give more details about the site readiness machinery.
    • Once in SSB, is the site placed in the Waiting room or in Production?
  21. Waiting Room: Move site out of the SSB Waiting Room
  22. Site is ready to work!

T3s

  • Follow the same steps as for T2s, except for the following ones, which have different instructions depending on the T3:
  • 3. SiteDB:
    • If you are not Grid-enabling your batch system to accept pilot jobs from the glide-in WMS factories, then please leave the "Processing" entry in SiteDB blank.
  • 5. CMSSW/CVMFS
    • T3s can "skip" this step.
    • However, if the site wants to use the data it hosts on the Tier-3, we recommend that it install CVMFS on the local machines to get CMSSW.
  • 6. Frontier/Squid
    • If more than a few CMSSW jobs will run at the site, a local Squid should be commissioned and registered centrally for monitoring.
    • At the very least, the site must point in its site-local-config.xml to the CERN Opportunistic Squids cmsopsquid[1,2].cern.ch (see the sketch after this list).
  • 8. SAM tests - There won't be any CE-related ones if the site only provides an SE.
    • [The SRM test should be addressed though]
  • 14. A Tier-3 has no requirements concerning commissioned links. Note however that data are only transferred on commissioned links. The site has to plan from where it intends to download data and needs to commission those links.
  • 17. Since there is no CE, there won't be any HC tests running.
  • 19. The VO feed will be missing a CE, which is fine for an SE-only Tier-3.
  • 20. Monitoring for the SE would help the site, but we do not require it.
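  • For the Frontier/Squid item above, a sketch of what the calib-data section of a site-local-config.xml pointing at the CERN Opportunistic Squids could look like; the port is assumed to be the standard Squid port 3128 and the server URL is a placeholder, so verify both against the Frontier twiki.

      <calib-data>
        <frontier-connect>
          <!-- CERN Opportunistic Squids; port 3128 assumed (standard Squid port) -->
          <proxy url="http://cmsopsquid1.cern.ch:3128"/>
          <proxy url="http://cmsopsquid2.cern.ch:3128"/>
          <!-- placeholder central Frontier server; verify against the Frontier twiki -->
          <server url="http://cmsfrontier.cern.ch:8000/FrontierProd"/>
        </frontier-connect>
      </calib-data>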

Opportunistic Sites

  • "Opportunistic" means that a site will provide resources without a pledge, which usually means there are no guarantees about availability or it can also mean there is no local CMS support or it can mean the resources are only available for a short time.
  • For sites with a small amount of resources, we don't want to create a new CMS site, because there are administrative and operational overheads attached to having a larger number of CMS sites.
  • For sites with a larger amount of resources, often the time period for which we get them is too short to go through the whole process of adding a new CMS site (this could take up to 2 weeks or more).
  • Only for opportunistic sites that provide a large amount of resources for an extended period of time is it worthwhile to go through the whole process of adding a new CMS site.
Under opportunistic we define two types of sites:
  1. Fully opportunistic:
    • A site that is not part of CMS and that has no local CMS support.
    • We can add resources dynamically to an existing meta-site:
      • T3_US_Parrot
      • T3_EU_Parrot
      • T3_US_ParrotTest (only for tests)
      • more meta-sites can be created for different regions as needed (Asia for instance)
    • OR create a new CMS site if the amount of resources and the time we have them warrants it
    • The dynamic model is useful for frequently adding/removing small amounts of resources, just by reconfiguring the frontend/factory.
    • Beyond a certain amount of resources it introduces inflexibility in how we can assign work to these resources.
    • A meta-site with much more than 10,000 slots is a bit unwieldy (if we ever get there we can split this up further into more meta-sites).
  2. Semi-opportunistic
    • An already existing CMS site that wants to provide extra resources for a short period of time.
    • Contact the production/analysis teams to assign more work to this site.

Fully Opportunistic

  1. Confirm that the site has authorization and that it has contacted the corresponding people (ask Oliver Gutsche).
  2. Confirm the amount of resources that the site will provide and for how long.
    • If the site will provide a large amount of resources or will provide them for quite a long time, it is probably better to add it as a new CMS site.
  3. Make sure that the site provides a way to access the resources via glideInWMS (which should exist if it's a grid site)
  4. No SITECONF changes needed
  5. Contact the CMS GlideinWMS Parrot support list
    • To configure the existing glideInWMS factory entry to access the number of slots that the site will provide.
    • To take that factory entry and add it to the T3_US_Parrot or T3_EU_Parrot group
  6. Jobs will be submitted to these resources.
  • The modifications are made only in the glideInWMS system and are completely hidden within it.
    • This is a problem for monitoring: none of our systems know what is inside T3_US_Parrot, or that you added or removed some resources. They just see a collection of resources called T3_US_Parrot, which could be anything and anywhere.
  • SAM and HC tests. We don't need to run those tests at T3_US_Parrot.
  • If we want to evaluate some specific resources, we can ask the glidein factory to turn off (momentarily) the other sites/CEs associated with T3_US_Parrot, leave on only the one we are interested in, and start sending jobs. All checks will be limited to the glidein level; we won't have monitoring from any other regular systems.

For Production

  • Fully opportunistic (Parrot) - we are not ready for that at the glidein level.
  • Semi-opportunistic - could be set up in an easy way:
    • If it is an already existing CMS site and it is already in the production list, it is a matter of increasing the threshold in the WMAgent to send more jobs to the site.
    • If it is a new CMS site, we need to commission the site (1-2 weeks).
      • It should be easier to commission an already existing CMS T3.