-- LuisContreras - 28 Feb 2014
-- JulianBadillo - 27 Aug 2014
Introduction
StoreResults workflows are basically merge task workloads. The purpose of this service is to elevate a user dataset to global DBS and PhEDEx, which makes the dataset available for transfer anywhere across the CMS grid. Requests were formerly handled through Savannah; they now arrive as GGUS tickets, and the service will be migrated to ReqMgr 2 soon.
A StoreResults workflow is composed of the following steps:
- Get a ticket on GGUS
- Migrate the dataset from a local DBS instance to global DBS
- Create and assign a StoreResults workflow in ReqMgr
- Wait for the workflow to complete
- Announce the workflow
StoreResults workflows read data from a local dbsUrl. This creates a problem when uploading a dataset to global DBS: the parent cannot be found in global. To avoid parentage problems, the input dataset must be migrated to global DBS before a workflow can run. This transfer uses the DBSMigration service. Input datasets are commonly located at T3s, T2s and EOS. To elevate datasets at FNAL, it might be necessary to move the data from cmssrm.fnal.gov to cmssrmdisk.fnal.gov so that the agent can read the files (if the user dataset was produced before the disk/tape separation).
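To see the parentage that must be resolved before a workflow runs, you can list a dataset's parents in the local DBS instance. This is only a minimal sketch using the dbs3 client (the dataset is the example used throughout this page; verify the exact return structure against your dbs3-client version):
from dbs.apis.dbsClient import DbsApi

# Local (phys03) DBS instance where the user dataset lives
local_api = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/phys03/DBSReader')

dataset = ('/TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola'
           '/jpilot-Summer12-START53_V7C_FSIM-v1_TLBSM_53x_v3-3eec1c547e1536755bef831bcbf18d7a/USER')

# Every parent listed here must exist in global DBS before the
# StoreResults workflow runs; the DBSMigration service takes care of that.
for parent in local_api.listDatasetParents(dataset=dataset):
    print parent['parent_dataset']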
GGUS ticket
StoreResults users send requests through GGUS (https://ggus.eu/?mode=ticket_cms).
Note: For filtering the GGUS tickets you can use the "search" option in GGUS and look for tickets assigned to the "CMS Workflows" Support Unit.
Creating a Store Results Request
Each ticket contains the information needed to create a workflow:
- User dataset
- Local DBS URL
- CMSSW release
- Physics group
See an example ticket here:
110773
You can manually create a request with reqmgr.py.
Before you Start
Handle Savannah Requests - DEPRECATED
- NOTE: These instructions no longer apply since Savannah is retired
- Set up the environment to run the scripts:
source /data/admin/wmagent/env.sh
source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
cd ~/storeResults/
- Open a python interactive console:
python
Python 2.6.8 (unknown, Nov 20 2013, 13:07:46)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
- Type the following instructions in the python console, replacing the %USER and %PASSWORD strings with your info:
from RequestQuery import RequestQuery
ticket = 110773
input_dataset = '/TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola/jpilot-Summer12-START53_V7C_FSIM-v1_TLBSM_53x_v3-3eec1c547e1536755bef831bcbf18d7a/USER'
dbs_url = 'phys03'
cmssw_release = 'CMSSW_5_3_8_patch3'
group_name = 'B2G'
rq = RequestQuery({'ComponentDir':'/home/cmsdataops/storeResults/Tickets'})
report = rq.createRequestJSON(ticket, input_dataset, dbs_url, cmssw_release, group_name)
#You can also print the report
rq.printReport(report)
- 'report' is going to be used in the next section so don't quit the python interpreter.
- The output of this looks like:
>>> report = rq.createRequestJSON(ticket, input_dataset, dbs_url, cmssw_release, group_name)
Processing ticket: 110773
Ticket json local DBS Sites se_names
-------------------- ----- ---------- -------------------------------------------------- --------------------------------------------------
110773 y phys03 T3_US_FNALLPC
- This contains the following information:
- Ticket: the identification number of the ticket from GGUS
- json: whether the json file for the given ticket was created
- Local DBS: the origin DBS instance of the dataset
- Sites: site where the input dataset was created (matched from se_name). This is useful if you assign the workflow manually in ReqMgr.
- se_names: the storage element for the given dataset
Notes
- The JSON file is generated in the Tickets directory as Ticket_TICKETNUM.json.
- 'ComponentDir' is the location where the json files are saved; you may use the folder above as the default, creating it if it does not exist.
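For orientation, the generated file is the request dictionary that reqmgr.py injects later on. A minimal sketch for inspecting it (the exact keys inside the file depend on the RequestQuery version, so the field names in the comment are assumptions):
import json

# Load the ticket JSON produced by RequestQuery (path built from 'ComponentDir' above)
with open('/home/cmsdataops/storeResults/Tickets/Ticket_110773.json') as fd:
    request = json.load(fd)

# Typical StoreResults parameters to look for (names are assumptions):
# InputDataset, DbsUrl, CMSSWVersion, Group
for key, value in sorted(request.items()):
    print '%s: %s' % (key, value)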
Migrate Datasets to Global DBS
- Run createStoreResults.py with the following information, which should come in the ticket:
python createStoreResults.py TICKET DATASET DBS_URL CMSSW_RELEASE GROUP_NAME
TICKET: the ticket number in GGUS. It can be any number; it is used only for tracking.
DATASET: the input dataset. It has to be located at the same Tier-2 that will finally hold the group data in /store/results/.
DBS_URL: for example, "phys01" for https://cmsweb.cern.ch/dbs/prod/phys01/DBSReader.
CMSSW_RELEASE: the release to use for the merge step. In general, always use the version used for the dataset production; if that version is outdated, check which is the closest version available.
GROUP_NAME: the physics group (HIN, HIG, SUS, etc.) requesting the migration. This determines the subdirectory below /store/results/ and sets the appropriate PhEDEx accounting group tag.
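For the example ticket above, the invocation would be (values taken from GGUS ticket 110773 in the previous section):
python createStoreResults.py 110773 /TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola/jpilot-Summer12-START53_V7C_FSIM-v1_TLBSM_53x_v3-3eec1c547e1536755bef831bcbf18d7a/USER phys03 CMSSW_5_3_8_patch3 B2G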
- Wait until the migrations are done. The output of this looks like:
Migrate: from url https://cmsweb.cern.ch/dbs/prod/phys03/DBSReader dataset: /TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola/jpilot-Summer12-START53_V7C_FSIM-v1_TLBSM_53x_v3-3eec1c547e1536755bef831bcbf18d7a/USER
Migration submitted: Request 105003
Timer started, timeout = 600 seconds
Querying migrations status...
Migration to global succeed: /TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola/jpilot-Summer12-START53_V7C_FSIM-v1_TLBSM_53x_v3-3eec1c547e1536755bef831bcbf18d7a/USER
All migration requests are done
Savannah Ticket Migration id Migration Status Dataset
-------------------- --------------- ------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------
110773 105003 successful /TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola/jpilot-Summer12-START53_V7C_FSIM-v1_TLBSM_53x_v3-3eec1c547e1536755bef831bcbf18d7a/USER
- Once the migrations are successful, the script will submit requests to ReqMgr.
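For reference, the migration step drives the DBS3 migration service. A minimal sketch of the same submit-and-poll cycle with the dbs3 client follows; the return structure and the status codes in the comment are assumptions to verify against your dbs3-client version:
import time
from dbs.apis.dbsClient import DbsApi

# Migration service endpoint of global DBS
migrate_api = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/global/DBSMigrate/')

dataset = ('/TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola'
           '/jpilot-Summer12-START53_V7C_FSIM-v1_TLBSM_53x_v3-3eec1c547e1536755bef831bcbf18d7a/USER')

# Submit: migrate the dataset (with its parentage) from phys03 to global
result = migrate_api.submitMigration({
    'migration_url': 'https://cmsweb.cern.ch/dbs/prod/phys03/DBSReader',
    'migration_input': dataset})
migration_id = result['migration_details']['migration_request_id']

# Poll until the migration reaches a final state
# (assumed codes: 0=pending, 1=running, 2=successful, 3=failed, 9=terminally failed)
while True:
    status = migrate_api.statusMigration(migration_rqst_id=migration_id)[0]['migration_status']
    if status in (2, 3, 9):
        break
    time.sleep(30)
print 'Migration %s finished with status %s' % (migration_id, status)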
Submit requests to ReqMgr
- Assign the request only if the migration was successful.
- You should use reqmgr.py and the JSON file created in the previous steps.
- Create the request:
python WmAgentScripts/reqmgr.py -u https://cmsweb.cern.ch -f ./storeResults/Tickets/Ticket_110773.json -j '{"createRequest":{"RequestString":"StoreResults_110773_v1","Requestor":"jbadillo","ProcessingVersion":1}}' --createRequest
- You should see an output like this:
Processing command line arguments: '['WmAgentScripts/reqmgr.py', '-u', 'https://cmsweb.cern.ch', '-f', './storeResults/Tickets/Ticket_110773.json', '-j', '{"createRequest":{"RequestString":"StoreResults_110773_v1","Requestor":"jbadillo","ProcessingVersion":1}}', '--createRequest']' ...
....
INFO:root:Loading file './storeResults/Tickets/Ticket_110773.json' ...
....
INFO:root:Create request 'jbadillo_StoreResults_110773_v1_150121_121546_1955' succeeded.
INFO:root:Approving request 'jbadillo_StoreResults_110773_v1_150121_121546_1955' ...
INFO:root:Request: PUT /reqmgr/reqMgr/request ...
INFO:root:Approve succeeded.
- Take note of the name of the workflow created.
- Assign the workflow using ReqMgr to:
- team: step0
- sites: the sites where the user dataset is located.
- Always check the "Trust site list" option to allow xrootd.
- If (and only if) the dataset is at a T2 or T3, you can choose a "Non-custodial subscription" to the given site.
Notes:
- Increase the version number when assigning retries for the same ticket.
- If you have not injected a json file to ReqMgr before, read section 3 at: https://github.com/dmwm/WMCore/wiki/All-in-one-test
- If the input data is at T3_US_FNALLPC (cmseos.fnal.gov), assign to T1_US_FNAL and enable the xrootd option. The jobs will run at the Tier-1, not at the Tier-3 (cmseos.fnal.gov cannot run jobs).
- PhEDEx subscriptions: data should only be subscribed to Disk, never to Tape (for T1 sites), because users don't have quota for Tape. So always pick "Non-Custodial Sites"; the other parameters can keep their defaults (Subscription Priority = Low, Custodial Subscription Type = Move). This also creates a problem when subscribing to T1_XX_Site_Disk: the agent can't subscribe automatically to Disk only, so the subscription request has to be made manually for T1 sites (this problem will be solved soon). A hedged sketch of such a manual subscription follows below.
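Besides the PhEDEx web UI, one way to make the manual Disk subscription is through the PhEDEx data service. This is only a hedged sketch: the datasvc endpoints and parameters below follow the standard PhEDEx data service recipe, but verify them against the datasvc documentation, and replace the proxy path and dataset with your own:
import subprocess

proxy = '/tmp/x509up_u12345'  # path to your voms proxy (assumption: adjust to yours)
dataset = ('/TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola'
           '/StoreResults-Summer12_START53_V7C_FSIM_v1_TLBSM_53x_v3_3eec1c547e1536755bef831bcbf18d7a-v1/USER')
datasvc = 'https://cmsweb.cern.ch/phedex/datasvc/xml/prod'

# 1. Fetch the XML "data" structure that the subscribe call expects
p = subprocess.Popen(['curl', '-s', '--cert', proxy, '--key', proxy,
                      '%s/data?dataset=%s' % (datasvc, dataset)],
                     stdout=subprocess.PIPE)
data_xml = p.communicate()[0]

# 2. Submit a low-priority, non-custodial subscription to the Disk endpoint
subprocess.call(['curl', '-s', '--cert', proxy, '--key', proxy,
                 '--data-urlencode', 'node=T1_US_FNAL_Disk',
                 '--data-urlencode', 'data=%s' % data_xml,
                 '--data-urlencode', 'level=dataset',
                 '--data-urlencode', 'priority=low',
                 '--data-urlencode', 'custodial=n',
                 '--data-urlencode', 'move=n',
                 '--data-urlencode', 'group=B2G',
                 '%s/subscribe' % datasvc])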
Announcing
The workflow team is responsible for announcing StoreResults workflows, so once the workflows are closed out:
Closing tickets
- If everything with the workflow is ok, when it is completed just reply to the ticket with the name of the elevated dataset (the output dataset of the workflow) and the site where the dataset is subscribed in PhEDEx, something like the following:
Hi,
The elevated dataset is:
/TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola/StoreResults-Summer12_START53_V7C_FSIM_v1_TLBSM_53x_v3_3eec1c547e1536755bef831bcbf18d7a-v1/USER
A replica available to transfer within the CMS grid can be found in PhEDEx at: T1_US_FNAL_Disk
Please check that everything is ok and let us know if there is a problem.
Thanks,
Workflow Team
- Then change the ticket status to "solved".
- If there is a problem, reply to the ticket with a brief explanation of the problem and possible solutions. This is an example: https://savannah.cern.ch/task/?51219
Transfer files from cmssrm.fnal.gov to cmssrmdisk.fnal.gov
NOTE: Only Luis can do this at FNAL; ask him if you need to transfer files from Tape to Disk (for old datasets).
/store/user data at cmssrm.fnal.gov is not available for the agent to read. Data has to be copied to cmssrmdisk.fnal.gov before the workflow can run. First, you have to create a list of files from the dataset.
- Create a directory where you can save the lists, and go there, e.g.:
mkdir /tmp/fileListsFNAL
cd /tmp/fileListsFNAL
- Create the lists by doing:
source ~WmAgentScripts/setenvscript.sh
python ~WmAgentScripts/StoreResults/transferFiles_FNAL.py [dbslocal] [ticket] [dataset]
You can find dbslocal, ticket and dataset in the reports printed out by RequestQuery and MigrationToGlobal.
Now copy the lists to a FNAL lpc machine (i.e. cmslpc42) and log in to that machine:
ssh cmslpc42.fnal.gov
Then do:
curl https://raw.githubusercontent.com/CMSCompOps/WmAgentScripts/master/StoreResults/ftsuser-transfer-submit-list > ftsuser-transfer-submit-list
source /uscmst1/prod/grid/gLite_SL5.csh
voms-proxy-init -voms cms:/cms/Role=production -valid 192:00
ftsuser-delegation
./ftsuser-transfer-submit-list FROMDCACHE ~/lists/<Number of the list>.txt
This will show an output like:
[luis89@cmslpc42 ~]$ ./ftsuser-transfer-submit-list-luis FROMDCACHE ~luis89/lists/50543.txt
Using proxy at /tmp/x509up_u48313
"cred_id": "4ecc44f421bfa9b5",
"job_id": "7a6c044a-e03c-11e3-af2d-782bcb2fb7b4",
"job_state": "STAGING",
"user_dn": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lcontrer/CN=752434/CN=Luis Carlos Contreras Pasuy",
"vo_name": "cms",
"voms_cred": "/cms/Role=production/Capability=NULL /cms/Role=NULL/Capability=NULL /cms/uscms/Role=NULL/Capability=NULL"
"cred_id": "4ecc44f421bfa9b5",
"job_id": "7bf02936-e03c-11e3-a8ea-782bcb2fb7b4",
"job_state": "STAGING",
"user_dn": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lcontrer/CN=752434/CN=Luis Carlos Contreras Pasuy",
"vo_name": "cms",
"voms_cred": "/cms/Role=production/Capability=NULL /cms/Role=NULL/Capability=NULL /cms/uscms/Role=NULL/Capability=NULL"
Ongoing transfers can be monitored through the web interface:
https://cmsfts3-users.fnal.gov:8449/fts3/ftsmon/
To track a transfer, search for its "job_id" in the monitor.
Note: Documentation about these transfers can be found at tinyurl.com/ftsuser
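If you prefer the command line over the web monitor, a job can also be queried through the FTS3 REST interface. A hedged sketch (the REST port 8446 and the /jobs/<job_id> path are assumptions based on standard FTS3 deployments):
import subprocess

proxy = '/tmp/x509up_u48313'  # your grid proxy, as printed by the submit script
job_id = '7a6c044a-e03c-11e3-af2d-782bcb2fb7b4'  # a "job_id" from the submit output

# Standard FTS3 REST layout: https://<server>:8446/jobs/<job_id>
p = subprocess.Popen(['curl', '-s', '--cert', proxy, '--key', proxy,
                      '--capath', '/etc/grid-security/certificates',
                      'https://cmsfts3-users.fnal.gov:8446/jobs/%s' % job_id],
                     stdout=subprocess.PIPE)
print p.communicate()[0]  # JSON including "job_state"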
- When the workflow is done, the data at cmssrmdisk.fnal.gov has to be removed. These transfers are only temporary, so send a mail to FNAL site support with the list of files to be removed (the same list you used in the last step).
Validation
The standard procedure to validate StoreResults is:
- Download the sample json file to be injected into ReqMgr from this link
- Inject the json file into ReqMgr (please read section 3 first: https://github.com/dmwm/WMCore/wiki/All-in-one-test), changing the ReqMgr url to the validation one, i.e. https://cmsweb-testbed.cern.ch/reqmgr
- Assign the workflow:
- You may set the whitelist to T1_US_FNAL, but it is not compulsory; the agent will find where the data is located.
- PLEASE change the ProcessingString from Summer12_DR53X_PU_S10_START53_V7A_v1_TLBSM_53x_v3_99bd99199697666ff01397dad5652e9e to ValidationTest_Summer12_DR53X_PU_S10_START53_V7A_v1_TLBSM_53x_v3
- You may subscribe the data wherever you want, but be careful: StoreResults output should never be subscribed to Tape (users don't have tape quota).
- Depending on the merge parameters you choose, it will create a different number of jobs (~22 with the defaults). The workload has to be basically merge jobs, plus the corresponding logCollect and Cleanup jobs.
- Jobs will run fast (if FNAL has available slots, they should complete in a couple of hours at most).
- All jobs should succeed, and the non-custodial subscription should be checked. Also check that the output dataset is uploaded to DBS3 (this is very important! StoreResults workflows are supposed to read from a local DBS url and upload the output to global DBS). If all is fine, then the StoreResults workflow is ok. A quick programmatic check of the DBS3 upload is sketched below.
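A minimal sketch for confirming that the output dataset reached global DBS, using the dbs3 client (the dataset name is the example announced above; verify the client API against your dbs3-client version):
from dbs.apis.dbsClient import DbsApi

# Global DBS reader instance
api = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/global/DBSReader')

output_dataset = ('/TT_scaleup_CT10_TuneZ2star_8TeV-powheg-tauola'
                  '/StoreResults-Summer12_START53_V7C_FSIM_v1_TLBSM_53x_v3_3eec1c547e1536755bef831bcbf18d7a-v1/USER')

# An empty result means the StoreResults output never reached global DBS
matches = api.listDatasets(dataset=output_dataset)
print 'Found in global DBS: %s' % bool(matches)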