Site Support Team - Documentation
Production Related Activities
Relevant SAM tests
After looking at the different metrics, these are the SAM tests that are important for MC production:
- org.cms.WN-env
- org.cms.SRM-GetPFNFromTFC
- org.cms.SRM-VOGet
- org.cms.SRM-VOPut
- org.cms.WN-basic
- org.cms.WN-frontier
- org.cms.WN-mc
- org.cms.WN-squid
- org.cms.WN-swinst
- org.cms.WN-xrootd-fallback ---> This one is not a requirement at the moment, but will be really important in the near future.
Sites Pledges
- For Production WMAgents use the pledges information from:
- SSB - Pledges view
- This information should be updated manually by the workflow team.
- However, this is just a reference for monitoring usage and has, as far as I know, no influence on how much we actually run at a site.
How to update the Pledges for Production
- Find the number of slots (cores) available at the site
- Log in to SSB
- You need to be an SSB admin with Modify Metrics privileges - if you are not, ask dashboard-support.cern.ch
- Go to metric history plot
- Right click on a row from the plot that you want to change the value for.
- Update the Value and Status
How to find the Pledges per Site
- Check http://gstat-wlcg.cern.ch/apps/pledges/resources/
- Hover the mouse over a federated site to see the resources for its individual sites
- Convert HS06 normalized CPU performance to slots
- 1 slot = 10 HS06 (on average)
- Percentage of resources assigned for every VOMS role:
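The HS06-to-slot conversion above can be sketched in a few lines of Python. This is only a rough illustration: the function name and the optional `vo_share` fraction are hypothetical and not part of any SSB or gstat tool, and the factor of 10 HS06 per slot is the average quoted above, so real sites may differ.

```python
HS06_PER_SLOT = 10  # average conversion factor quoted in this document


def hs06_to_slots(hs06_pledge, vo_share=1.0):
    """Estimate the number of job slots behind an HS06 pledge.

    vo_share is a hypothetical fraction (0.0 to 1.0) of the pledge
    assigned to a given VOMS role; it defaults to the whole pledge.
    """
    if hs06_pledge < 0:
        raise ValueError("pledge must be non-negative")
    if not 0.0 <= vo_share <= 1.0:
        raise ValueError("vo_share must be between 0 and 1")
    # Round down: a partial slot cannot run a job.
    return int((hs06_pledge * vo_share) // HS06_PER_SLOT)


if __name__ == "__main__":
    # Example with made-up numbers: a 12000 HS06 pledge, half for production.
    print(hs06_to_slots(12000, 0.5))
```

The result is only an estimate for filling in the pledge value in SSB; the actual number of usable cores should still be checked with the site.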
Sites out of Production (Waiting Room)
- When a site goes IN to the waiting room
- Communicate with Workflow team
- Put the sites in drain (workflow team)
- When a site comes OUT of the waiting room
- Communicate with Workflow team.
- Test whether the site is valid, using re-commissioning workflows (workflow team)
- If the site has been in the WR for more than 8 weeks, it should be re-commissioned according to the procedures.
- If the site was re-commissioned before and has been in the WR for less than 8 weeks, the site just needs to be put out of drain.
- Take the sites out of drain (workflow team)
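The 8-week rule above can be written down as a small decision helper. This is a sketch under assumptions, not an existing tool: the function name and return strings are hypothetical, and the case of exactly 8 weeks, which the rule does not cover, is treated conservatively as requiring re-commissioning.

```python
RECOMMISSION_THRESHOLD_WEEKS = 8  # threshold stated in the rule above


def action_on_leaving_waiting_room(weeks_in_wr, commissioned_before):
    """Return the action to take when a site leaves the waiting room."""
    # Assumption: a site that was never commissioned, or that spent
    # 8 weeks or more in the WR, goes through re-commissioning.
    if not commissioned_before or weeks_in_wr >= RECOMMISSION_THRESHOLD_WEEKS:
        return "recommission"
    # Otherwise the site only needs to be taken out of drain.
    return "remove_from_drain"


if __name__ == "__main__":
    print(action_on_leaving_waiting_room(3, True))
    print(action_on_leaving_waiting_room(10, True))
```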
How to Add/Remove sites in Drain
Use this link:
https://cmssst.web.cern.ch/cgi-bin/set/ProdStatus/T2_DE_RWTH
(Don't forget to change site name)
- IMPORTANT
- If you manually override the status of a site, future automatic updates are blocked. Once your manual operation is done, put the site back into "Automatic state setting/no override".
or
- Log in to any vocms machine
- sudo -u cmst1 /bin/bash
- vim ~cmst1/www/site-limits.conf --> requires AFS permissions to access the file (ask Edgar Fajardo and Jorge Amando Molina-Perez)
- To add: write drain next to the site name
- To remove: delete drain next to the site name and leave it blank
- drain = finish running jobs & don't send any more jobs to the site
- skip = site has never been commissioned & jobs will not be sent
- down = site previously commissioned that is not going to be used anymore
- [blank] = jobs will be sent to the site
- A script running in every agent reads this text file. The cron jobs run every 15 minutes and update the agents by running the following command for each site in the file:
[cmssrv94] /data/srv/wmagent/current > ./config/wmagent/manage execute-agent wmagent-resource-control --drain --site-name=T1_US_FNAL
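To make the relationship between the file and the command concrete, here is a minimal Python sketch of what such a script might do. It is not the actual cron script. The file layout (one site per line, optionally followed by a state word) and both function names are assumptions. Only the `--drain` form of the command appears in this document, so only that form is generated.

```python
from typing import Dict, List

# States described above; an empty string means "jobs will be sent".
VALID_STATES = {"drain", "skip", "down", ""}

# Command form copied from the example above.
DRAIN_COMMAND = ("./config/wmagent/manage execute-agent "
                 "wmagent-resource-control --drain --site-name={site}")


def parse_site_limits(text: str) -> Dict[str, str]:
    """Parse lines of the assumed form 'SITE_NAME [state]' into a dict."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        site = parts[0]
        state = parts[1].lower() if len(parts) > 1 else ""
        if state not in VALID_STATES:
            raise ValueError("unknown state %r for site %s" % (state, site))
        states[site] = state
    return states


def drain_commands(states: Dict[str, str]) -> List[str]:
    """Build the drain command for every site marked 'drain'."""
    return [DRAIN_COMMAND.format(site=site)
            for site, state in states.items() if state == "drain"]


if __name__ == "__main__":
    sample = "T1_US_FNAL drain\nT2_DE_RWTH\n"
    for command in drain_commands(parse_site_limits(sample)):
        print(command)
```

Rejecting unknown state words means that a typo in the file surfaces as an error and does not silently leave a site enabled.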
How to re-commission a site for Workflows
Commission a site in the testbed by assigning WFs via scripts:
- Follow the procedure CompOpsWorkflowComissionT2Site
- The assigned WF is a long WF (>8 000 jobs) intended to take a long time to complete, so we don't have to keep assigning workflows until the testing period (a few days) is complete.
- Follow the assigned WF: Procedures
Sites in Scheduled Downtime
- Here are the policies the Workflow team has regarding this issue: