/data/tier0/ (absolute paths are provided in these instructions for clarity). In order to start a new replay, you first need to make sure that the instance is available: check the Tier0 project on Jira.
Check the condor queue with condor_q. If there are any leftover jobs, you can use
condor_rm -all to remove everything.
runningagent (This is an alias included in the cmst1 config; the actual command is: ps aux | egrep 'couch|wmcore|mysql|beam')
/data/tier0/00_stop_agent.sh
/data/tier0/admin/ReplayOfflineConfiguration.py
./00_software.sh      # loads the newest version of the WMCore and T0 github repositories
./00_deploy_replay.sh # deploys the new configuration, wipes the T0AST database, etc.
./00_start_agent.sh # starts the new agent - loads the job list etc.
vim /data/tier0/srv/wmagent/2.0.8/install/tier0/Tier0Feeder/ComponentLog
vim /data/tier0/srv/wmagent/2.0.8/install/tier0/JobCreator/ComponentLog
vim /data/tier0/srv/wmagent/2.0.8/install/tier0/JobSubmitter/ComponentLog
condor_q
runningagent
<meaningfulName>Scenario = "<actualNameOfTheNewScenario>"
defaultRecoTimeout = 48 * 3600 to something higher, like 10 * 48 * 3600. The Tier0Feeder checks this timeout every polling cycle, so when you want to release PromptReco again you just need to go back to the 48h delay.
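A minimal sketch of applying and reverting the change by hand (assuming the setting lives in the offline configuration file referenced below):

# locate the current setting
grep -n "defaultRecoTimeout" /data/tier0/admin/ProdOfflineConfiguration.py
# to hold PromptReco, edit the line to:   defaultRecoTimeout = 10 * 48 * 3600
# to release it again, put it back to:    defaultRecoTimeout = 48 * 3600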
/data/tier0/admin/ProdOfflineConfiguration.py
defaultCMSSWVersion = "CMSSW_7_4_7"
repackVersionOverride = {
    "CMSSW_7_4_2" : "CMSSW_7_4_7",
    "CMSSW_7_4_3" : "CMSSW_7_4_7",
    "CMSSW_7_4_4" : "CMSSW_7_4_7",
    "CMSSW_7_4_5" : "CMSSW_7_4_7",
    "CMSSW_7_4_6" : "CMSSW_7_4_7",
    }

expressVersionOverride = {
    "CMSSW_7_4_2" : "CMSSW_7_4_7",
    "CMSSW_7_4_3" : "CMSSW_7_4_7",
    "CMSSW_7_4_4" : "CMSSW_7_4_7",
    "CMSSW_7_4_5" : "CMSSW_7_4_7",
    "CMSSW_7_4_6" : "CMSSW_7_4_7",
    }
select RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where name = '<CMSSW_X_X_X>';

select EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where name = '<CMSSW_X_X_X>';
UPDATE ( SELECT reco_release_config.released AS released, reco_release_config.delay AS delay, reco_release_config.delay_offset AS delay_offset FROM reco_release_config WHERE checkForZeroOneState(reco_release_config.released) = 0 AND reco_release_config.run_id <= <Replace By the desired Run Number> ) t SET t.released = 1, t.delay = 10, t.delay_offset = 5;
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --cms-name=T1_IT_CNAF --pnn=T1_IT_CNAF_Disk --ce-name=T1_IT_CNAF --pending-slots=100 --running-slots=1000 --plugin=PyCondorPlugin
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Processing --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Production --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Merge --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Cleanup --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=LogCollect --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Skim --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20

A useful command to check the current state of the site (agent parameters for the site, running jobs etc.):
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN -p
datasets = [ "DisplacedJet" ] for dataset in datasets: addDataset(tier0Config, dataset, do_reco = True, raw_to_disk = True, tape_node = "T1_IT_CNAF_MSS", disk_node = "T1_IT_CNAF_Disk", siteWhitelist = [ "T1_IT_CNAF" ], dqm_sequences = [ "@common" ], physics_skims = [ "LogError", "LogErrorMonitor" ], scenario = ppScenario)
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch/CN=110263821
issuer    : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
identity  : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
type      : RFC3820 compliant impersonation proxy
strength  : 1024
path      : /data/certs/serviceproxy-vocms001.pem
timeleft  : 157:02:59
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
issuer    : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft  : 157:02:58
uri       : voms2.cern.ch:15002
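Output like the above can be produced with voms-proxy-info; a sketch, assuming the service proxy path quoted above:

voms-proxy-info -all -file /data/certs/serviceproxy-vocms001.pem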
# Tier0 - /eos/cms/store/t0streamer/ area cleanup script. Running here as cmst0, which has write permission on eos - cms-tier0-operations@cern.ch
0 10,22 * * * lxplus.cern.ch /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.sh >> /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/streamer_delete.log 2>&1

To add a run to the skip list:
/afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py
# run numbers in this list will be skipped in the iteration below
runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357,...]
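After editing the script, a quick hedged check that the new run made it into the list (999999 is a placeholder run number):

grep -n "999999" /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py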
<job_exit_code>: <number of retries>

On every T0 vocms VM, the 00_deploy*.sh script writes the following configuration for Express, Processing and Repack jobs to the main T0 WMAgent config.py file:
...
# Configurable retry number for failing jobs before they go to paused.
# Also, here we need to initialize the job type section in the PauseAlgo first
echo "config.RetryManager.PauseAlgo.section_('Express')" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.Express.retryErrorCodes = { 8001: 0, 70: 0, 50513: 0, 50660: 0, 50661: 0, 71304: 0, 99109: 0, 99303: 0, 99400: 0, 8001: 0, 50115: 0 }" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.section_('Processing')" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.Processing.retryErrorCodes = { 8001: 0, 70: 0, 50513: 0, 50660: 0, 50661: 0, 71304: 0, 99109: 0, 99303: 0, 99400: 0, 8001: 0, 50115: 0 }" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.section_('Repack')" >> ./config/tier0/config.py
echo "config.RetryManager.PauseAlgo.Repack.retryErrorCodes = { 8001: 0, 70: 0, 50513: 0, 50660: 0, 50661: 0, 71304: 0, 99109: 0, 99303: 0, 99400: 0, 8001: 0, 50115: 0 }" >> ./config/tier0/config.py
...

This piece of code only gets executed during the deployment of the T0 WMAgent. If you need to adjust this configuration in an already deployed and running agent, just modify the main WMAgent configuration file, /data/tier0/srv/wmagent/current/config/tier0/config.py. After the modifications, the respective WMAgent component needs to be restarted (RetryManager in this case); there are instructions on how to restart a component in this twiki cookbook.
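A minimal sketch of that restart, following the same pattern used in the "Restarting a component" instructions later in this cookbook:

# source the agent environment and restart the RetryManager so the new retryErrorCodes are picked up
source /data/tier0/admin/env.sh
$manage execute-agent wmcoreD --restart --components=RetryManager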
SELECT id, name, cache_dir FROM wmbs_job WHERE state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused');

You can use this query to get the workflows that have paused jobs:
SELECT DISTINCT(wmbs_workflow.NAME) FROM wmbs_job inner join wmbs_jobgroup on wmbs_job.jobgroup = wmbs_jobgroup.ID inner join wmbs_subscription on wmbs_subscription.ID = wmbs_jobgroup.subscription inner join wmbs_workflow on wmbs_subscription.workflow = wmbs_workflow.ID WHERE wmbs_job.state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused') and wmbs_job.cache_dir like '%Reco%';

Paused jobs can also be in state 'submitfailed'.
cd /data/tier0/srv/wmagent/current/install/tier0
find ./JobCreator/JobCache -name Report.3.pkl

This will return the cache dirs of the paused jobs (it may not work if the jobs were not actually submitted; submitfailed jobs do not create Report.*.pkl).
xrdcp PFN .
# Source environment
source /data/tier0/admin/env.sh
# Fail paused jobs
$manage execute-agent paused-jobs -f -j 10231
# Resume paused jobs
$manage execute-agent paused-jobs -r -j 10231

You can use the following options:
-j job
-w workflow
-t taskType
-s site
-d do not commit changes, only show what would be done

To do mass fails / resumes for a single error code, the following commands are useful:
cp ListOfPausedJobsFromDB /data/tier0/jocasall/pausedJobsClean.txt
python /data/tier0/jocasall/checkPausedJobs.py
awk -F '_' '{print $6}' code_XXX > jobsToResume.txt
while read job; do $manage execute-agent paused-jobs -r -j ${job}; done <jobsToResume.txt
select DISTINCT(tar_details.LFN) from wmbs_file_parent inner join wmbs_file_details parentdetails on wmbs_file_parent.CHILD = parentdetails.ID left outer join wmbs_file_parent parents on parents.PARENT = wmbs_file_parent.PARENT left outer join wmbs_file_details childsdetails on parents.CHILD = childsdetails.ID left outer join wmbs_file_parent childs on childsdetails.ID = childs.PARENT left outer join wmbs_file_details tar_details on childs.CHILD = tar_details.ID where childsdetails.LFN like '%tar.gz' and parentdetails.LFN in ('/store/unmerged/express/Commissioning2014/StreamExpressCosmics/ALCARECO/Express-v3/000/227/470/00000/A25ED7B5-5455-E411-AA08-02163E008F52.root', '/store/unmerged/data/Commissioning2014/MinimumBias/RECO/PromptReco-v3/000/227/430/00000/EC5CF866-5855-E411-BC82-02163E008F75.root');
lcg-cp srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/logs/prod/2014/10/WMAgent/PromptReco_Run227430_MinimumBias/PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar ./PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar
tar -xvf PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar
zgrep <UUID> ./LogCollection/*.tar.gz
S'trivialcatalog_file:/home/glidein_pilot/glide_aXehes/execute/dir_30664/job/WMTaskSpace/cmsRun2/CMSSW_7_1_10_patch2/override_catalog.xml?protocol=direct'

Changes to:
S'trivialcatalog_file:/afs/cern.ch/user/l/lcontrer/scram/plugins/override_catalog.xml?protocol=direct'
eos cp <local file> </eos/cms/store/unmerged/...>
tar -zxvf 68d93c9c-db7e-11e3-a585-00221959e789-46-0-logArchive.tar.gz
# Create a valid proxy
voms-proxy-init -voms cms
# Source CMSSW environment
source /cvmfs/cms.cern.ch/cmsset_default.sh
# Create the scram area (replace the release by the one the job should use)
scramv1 project CMSSW CMSSW_7_4_0
# Go to the src area
cd CMSSW_7_4_0/src/
eval `scramv1 runtime -sh`
# Actually run the job (you can pass the parameter to create a fwjr too)
cmsRun PSet.py

# If you need to modify the job for whatever reason (like dropping some input to get at least some
# statistics for a DQM harvesting job) you first need to get a config dump in python format
# instead of pickle. Keep in mind that the config file is very big.
# Modify PSet.py by adding "print process.dumpPython()" as a last command and run it using python
python PSet.py > cmssw_config.py
# Modify cmssw_config.py (for example find process.source and remove files that you don't want to run on).
# Save it and use it as input for cmsRun instead of PSet.py
cmsRun cmssw_config.py
update lumi_section_closed set filecount = 0, CLOSE_TIME = <timestamp> where lumi_id in ( <lumisection ID> ) and run_id = <Run ID> and stream_id = <stream ID>;

Example:
update lumi_section_closed set filecount = 0, CLOSE_TIME = 1436179634 where lumi_id in ( 11 ) and run_id = 250938 and stream_id = 14;
export PYTHONPATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc491/cms/cmssw-patch/CMSSW_7_5_8_patch1/python

In the previous example we assume the job is using CMSSW_7_5_8_patch1, which is why we point to this particular path in cvmfs. You should modify it according to the CMSSW version your job is intended to use. Now you can use the following snippet to dump the file:
import FWCore.ParameterSet.Config
import pickle

pickleHandle = open('PSet.pkl','rb')
process = pickle.load(pickleHandle)

# This line only prints the python version of the pkl file on the screen
process.dumpPython()

# The actual writing of the file
outputFile = open('PSetPklAsPythonFile.py', 'w')
outputFile.write(process.dumpPython())
outputFile.close()

After dumping the file you can modify its contents. It is not necessary to pickle it again; you can use the cmsRun command normally:
cmsRun PSetPklAsPythonFile.py

or
cmsRun -e PSet.py 2>err.txt 1>out.txt &
# Source environment
source /data/tier0/admin/env.sh
# Run the diagnose script (change run number)
$manage execute-tier0 diagnoseActiveRuns 231087
curl https://github.com/dmwm/WMCore/commit/8c5cca41a0ce5946d0a6fb9fb52ed62165594eb0.patch | patch -d /data/tier0/srv/wmagent/1.9.92/sw.pre.hufnagel/slc6_amd64_gcc481/cms/wmagent/1.0.7.pre6/data/ -p 2

Then init the couchapp; this will create the view. It may take some time if you have a big database to map.
$manage execute-agent wmagent-couchapp-init

Then curl the results for the given time frame (look for the timestamps you need, change user and password accordingly):
curl -g -X GET 'http://user:password@localhost:5984/wmagent_jobdump%2Fjobs/_design/JobDump/_view/statusByTime?startkey=["executing",1432223400]&endkey=["executing",1432305900]'
# source environment
source /data/tier0/srv/wmagent/current/apps/t0/etc/profile.d/init.sh

# go to the job area, open a python console and do:
import cPickle
jobHandle = open('job.pkl', "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob

# for Report.*.pkl do:
import cPickle
jobHandle = open("Report.3.pkl", "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob
import cPickle, os

jobHandle = open('job.pkl', "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()

# Do the changes on the loadedJob

output = open('job.pkl', 'w')
cPickle.dump(loadedJob, output, cPickle.HIGHEST_PROTOCOL)
output.flush()
os.fsync(output.fileno())
output.close()
import FWCore.ParameterSet.Config as cms
import pickle

handle = open('PSet.pkl', 'r')
process = pickle.load(handle)
handle.close()

print process.dumpConfig()
feature=maxRSS
value=15360000

Executing generate_code.sh will create a script named after the feature, e.g. modify_wmworkload_maxRSS.py. The latter modifies the selected feature in the Workflow Sandbox. Once it has been generated, you need to add a call to that script in modify_one_workflow.sh, which calls all the required scripts, creates the tarball and places it where required (the Specs folder). Finally, execute modify_several_workflows.sh, which calls modify_one_workflow.sh for all the desired workflows. This procedure has been followed for several jobs, so for some features the required personalization of the scripts has already been done, and you would just need to comment or uncomment the required lines. As a summary, you would need to proceed as detailed below:
vim list
./print_workflow_config.sh
vim generate_code.sh
./generate_code.sh
vim modify_one_workflow.sh
./modify_several_workflows.sh
vim list
cp modify_pset.py modify_pset_<feature>.py
vim modify_pset_<feature>.py
vim modify_one_job.sh
./modify_several_jobs.sh
# Copy the workflow sandbox from /data/tier0/admin/Specs to your work area
cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/lcontrer/temp

The work area should only contain the workflow sandbox. Go there, then untar the sandbox and unzip WMCore:
cd /data/tier0/lcontrer/temp
tar -xjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2
unzip -q WMCore.zip

Now replace/modify the files in WMCore. Then you have to merge everything again. You should remove the old sandbox and WMCore.zip too:
# Remove former sandbox and WMCore.zip, then create the new WMCore.zip
rm PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 WMCore.zip
zip -rq WMCore.zip WMCore
# Now remove the WMCore folder and then create the new sandbox
rm -rf WMCore/
tar -cjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 ./*
# Clean workarea
rm -rf PSetTweaks/ WMCore.zip WMSandbox/

Now copy the new sandbox to the Specs area. Keep in mind that only jobs submitted after the sandbox is replaced will pick it up. It is also good practice to save a copy of the original sandbox, just in case something goes wrong.
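A short sketch of that backup-and-replace step, reusing the example workflow from above (adjust the work-area path to your own):

# keep a backup of the original sandbox before overwriting it in the Specs area
cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/lcontrer/temp/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2.orig
# copy the rebuilt sandbox back into the Specs area
cp /data/tier0/lcontrer/temp/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/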
# find lumis to update
update lumi_section_closed set filecount = 0 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;
update lumi_section_closed set filecount = 1 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;
delete from wmbs_sub_files_available where fileid in ( ... );
delete from wmbs_fileset_files where fileid in ( ... );
delete from wmbs_file_location where fileid in ( ... );
delete from wmbs_file_runlumi_map where fileid in ( ... );
delete from streamer where id in ( ... );
delete from wmbs_file_details where id in ( ... );
SELECT WMBS_WORKFLOW.NAME AS NAME, WMBS_WORKFLOW.TASK AS TASK, LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION AS SUBSCRIPTION, LUMI_SECTION_SPLIT_ACTIVE.RUN_ID AS RUN_ID, LUMI_SECTION_SPLIT_ACTIVE.LUMI_ID AS LUMI_ID FROM LUMI_SECTION_SPLIT_ACTIVE INNER JOIN WMBS_SUBSCRIPTION ON LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION = WMBS_SUBSCRIPTION.ID INNER JOIN WMBS_WORKFLOW ON WMBS_SUBSCRIPTION.WORKFLOW = WMBS_WORKFLOW.ID;

# This will show the pending active lumi sections for repack. One of these should be related to the corrupted file; compare this result with the first query
SELECT * FROM LUMI_SECTION_SPLIT_ACTIVE;

# You HAVE to be completely sure before deleting an entry from the database (don't do this if you don't understand what it implies)
DELETE FROM LUMI_SECTION_SPLIT_ACTIVE WHERE SUBSCRIPTION = 1345 and RUN_ID = 207279 and LUMI_ID = 129;
reco_locked table

If you want to manually set a run as the fcsr, you have to make sure that it is the lowest run with locked = 0:
update reco_locked set locked = 0 where run >= <desired_run>
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_stream_done?run=305199&stream=ZeroBias
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306462
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306460&primary_dataset=MET

run_dataset_done can be called without any primary_dataset parameter, in which case it reports back the overall PromptReco status. It aggregates over all known datasets for that run in the system (i.e. all datasets for all streams for which we have data for this run).
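A sketch of querying one of these endpoints from the command line (python -m json.tool is only used to pretty-print the reply; depending on the cmsweb setup you may need to pass your grid certificate via curl's --cert/--key options):

curl -s 'https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306462' | python -m json.tool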
source /data/tier0/admin/env.sh
$manage execute-agent wmcoreD --restart --components=<componentName>

Example:
$manage execute-agent wmcoreD --restart --components=DBS3Upload
cd /data/tier0/
./00_stop_agent.sh
./00_start_agent.sh
source /data/tier0/admin/env.sh
$manage execute-agent wmcoreD --restart --component ComponentName
https://github.com/ticoann/WmAgentScripts/blob/wmstat_temp_test/test/updateT0RequestStatus.py
/data/tier0/srv/wmagent/2.0.4/sw/slc6_amd64_gcc493/cms/wmagent/1.0.17.pre4/bin/
if info['Run'] < <RunNumber>

As you should notice, the given run number will be the oldest run shown in WMStats.
$manage execute-agent updateT0RequestStatus.py
# Source environment
source /data/tier0/admin/env.sh
# Add a site to Resource Control - change site, thresholds and plugin if needed
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --cms-name=T2_CH_CERN_T0 --se-name=srm-eoscms.cern.ch --ce-name=T2_CH_CERN_T0 --pending-slots=1000 --running-slots=1000 --plugin=CondorPlugin
# Change/init thresholds by task:
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --task-type=Processing --pending-slots=500 --running-slots=500
# Change site status (normal, drain, down)
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --down
source /data/tier0/admin/env.sh
$manage execute-agent wmagent-unregister-wmstats `hostname -f`
source /data/tier0/admin/env.sh
$manage execute-agent wmagent-resource-control --site-name=<Desired_Site> --task-type=Processing --pending-slots=<desired_value> --running-slots=<desired_value>

Example:
$manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000
$manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=PyCondorPlugin
$manage execute-agent wmagent-resource-control -p
select blockname from dbsbuffer_block where deleted = 0

Some datasets could be marked as subscribed in the database without actually being subscribed in PhEDEx. You can check this with the Transfer Team and, if that is the case, retry the subscription by setting subscribed to 0. You can narrow the query to blocks with a given name pattern or blocks at a specific site.
update dbsbuffer_dataset_subscription set subscribed = 0 where dataset_id in ( select dataset_id from dbsbuffer_block where deleted = 0 <and blockname like...> ) <and site like ...>

Some blocks can be marked as closed while still being open in PhEDEx. If this is the case, you can set their status to 'InDBS' to try closing them again. For example, if you want to close MiniAOD blocks, you can provide a name pattern like '%/MINIAOD#%'. The status attribute can have 3 values: 'Open', 'InDBS' and 'Closed'. 'Open' is the first value assigned to all blocks; when they are closed and injected into DBS, the status is changed to 'InDBS', and when they are closed in PhEDEx, the status is changed to 'Closed'. Setting status to 'InDBS' makes the agent retry closing the blocks in PhEDEx.
update dbsbuffer_block set status = 'InDBS' where deleted = 0 and status = 'Closed' and blockname like ...

If some subscriptions shouldn't be checked anymore, remove them from the database. For instance, if you want to remove RAW subscriptions to disk of all T1s, you can give a path pattern like '/%/%/RAW' and a site like 'T1_%_Disk'.
delete dbsbuffer_dataset_subscription where dataset_id in ( select id from dbsbuffer_dataset where path like ... ) and site like ...
condor_q 52982.15 -l | less -i

To get the condor list by regexp:
condor_q -const 'regexp("30199",WMAgent_RequestName)' -af
condor_qedit <job-id> JobPrio "<New Prio (numeric value)>"
for job in $(condor_q -w | awk '{print $1}')
do
  condor_qedit $job JobPrio "508200001"
done
condor_qedit -const 'MaxWallTimeMins>30000' MaxWallTimeMins 1440
condor_status -pool vocms007 -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")' -af Cpus | sort | uniq -c
condor_status -pool vocms007 -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")' -af Cpus | awk '{sum+= $1} END {print(sum)}'
condor_status -pool vocms007 -const 'State=="Claimed" && ( ClientMachine=!="vocms001.cern.ch" && ClientMachine=!="vocms014.cern.ch" && ClientMachine=!="vocms015.cern.ch" && ClientMachine=!="vocms0313.cern.ch" && ClientMachine=!="vocms0314.cern.ch" && ClientMachine=!="vocms039.cern.ch" && ClientMachine=!="vocms047.cern.ch" && ClientMachine=!="vocms013.cern.ch")' -af Cpus | awk '{sum+= $1} END {print(sum)}'
/etc/condor/config.d/99_local_tweaks.config
MAX_JOBS_RUNNING = <value>
MAX_JOBS_RUNNING = 12000
condor_reconfig
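To verify that the new limit was picked up after the reconfig, something like the following should work (a sketch):

condor_config_val MAX_JOBS_RUNNING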
condor_qedit <job-id> Requestioslots "0"
for job in $(cat <text_file_with_the_list_of_job_condor_IDs>)
do
  condor_qedit $job Requestioslots "0"
done
9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
10.                       "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
...
38. elif p[ 'dn' ] in namesToMapToTIER0:
39.     dnmap[ p['dn'] ] = "cmst0"
/data/certs
admin/env.sh
admin/env_unit.sh
/data/TransferSystem/t0_control.sh
Instance Name | TNS |
---|---|
CMS_T0DATASVC_REPLAY_1 | INT2R |
CMS_T0DATASVC_REPLAY_2 | INT2R |
CMS_T0DATASVC_PROD | CMSR |
/data/tier0/00_stop_agent.sh
ps aux | egrep 'couch|wmcore'
sqlplus <instanceName>/<password>@<tns>

replacing the angle-bracket placeholders with the proper values for each instance.
SQL> password
Changing password for <user>
Old password:
New password:
Retype new password:
Password changed
SQL> exit
/data/tier0/admin/

They are normally named as follows (not all the instances will have all the files):
WMAgent.secrets
WMAgent.secrets.replay
WMAgent.secrets.prod
WMAgent.secrets.localcouch
WMAgent.secrets.remotecouch
/data/tier0/srv/wmagent/current/config/tier0/config.py

There you must look for the entry:
config.T0DAtaScvDatabase.connectUrl

and do the update.
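A quick way to locate that line before editing (just a sketch; it only prints the current value and its line number):

grep -n "connectUrl" /data/tier0/srv/wmagent/current/config/tier0/config.py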
https://session-manager.web.cern.ch/session-manager/
select * from run where run_id = 293501;
select * from streamer join stream on streamer.STREAM_ID=stream.ID where run_id= 293501;
select count(*) from streamer join stream on streamer.STREAM_ID=stream.ID where run_id= 293501 group by stream.ID;
select wmbs_job_state.name, count(*) from wmbs_job join wmbs_job_state on wmbs_job.state = wmbs_job_state.id GROUP BY wmbs_job_state.name;
select id, cache_dir from wmbs_job where STATE =17;
SELECT id, cache_dir FROM wmbs_job WHERE state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused');
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir not like '%Repack%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%PromptReco%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Express%' order by cache_dir;
select id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Express%' and cache_dir not like '%Repack%' order by cache_dir;
select retry_count, id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' and cache_dir not like '%Merge%' order by cache_dir;
select retry_count, id, cache_dir from wmbs_job where STATE =17 and cache_dir like '%Repack%' and cache_dir like '%Merge%' order by cache_dir;
select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID order by RECO_CONFIG.RUN_ID desc;
select min(RECO_CONFIG.RUN_ID) from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where CMSSW_VERSION.NAME = 'CMSSW_7_4_12';
select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where CMSSW_VERSION.NAME = 'CMSSW_7_4_12' order by RECO_CONFIG.RUN_ID desc ;
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID order by EXPRESS_CONFIG.RUN_ID desc ;
select min(EXPRESS_CONFIG.RUN_ID) from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where CMSSW_VERSION.NAME = 'CMSSW_7_4_12';
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where CMSSW_VERSION.NAME = 'CMSSW_7_4_12' order by EXPRESS_CONFIG.RUN_ID desc;
select distinct RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME from RECO_CONFIG inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where RECO_CONFIG.RUN_ID=299325 order by RECO_CONFIG.RUN_ID desc ;
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID where EXPRESS_CONFIG.RUN_ID=299325 order by EXPRESS_CONFIG.RUN_ID desc ;
select distinct EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME from EXPRESS_CONFIG inner join CMSSW_VERSION on EXPRESS_CONFIG.CMSSW_ID = CMSSW_VERSION.ID where EXPRESS_CONFIG.RUN_ID=299325 order by EXPRESS_CONFIG.RUN_ID desc;

NOTE: Remember that the Express configuration includes two CMSSW releases: one for repacking (CMSSW_ID) and another for reconstruction (RECO_CMSSW_ID).
select wmbs_file_details.* from wmbs_job join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID where wmbs_job.ID = 3463;
select * from wmbs_file_runlumi_map where fileid = 8356;
select wmbs_file_details.ID, LFN, FILESIZE, EVENTS, lumi from wmbs_job join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID join wmbs_file_runlumi_map on wmbs_job_assoc.FILEID = wmbs_file_runlumi_map.FILEID where wmbs_job.ID = 3463 order by lumi;
select wmbs_file_details.ID, LFN, FILESIZE, EVENTS, lumi from wmbs_job join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID join wmbs_file_runlumi_map on wmbs_job_assoc.FILEID = wmbs_file_runlumi_map.FILEID where wmbs_job.ID = 3463 and lumi = 73;
select wmbs_job.* from wmbs_job_assoc join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID where wmbs_job_assoc.FILEID = 4400;
select wmbs_job.CACHE_DIR from wmbs_job join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/MuOnia/RAW/v1/000/300/515/00000/D662FD9E-177A-E711-8F1B-02163E019D28.root';
select wmbs_job_mask.*, wmbs_job.CACHE_DIR from wmbs_job_assoc join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID join wmbs_job_mask on wmbs_job.ID = wmbs_job_mask.JOB where wmbs_job_assoc.FILEID = 4400;
select * from wmbs_file_parent where child = 6708;
select wmbs_file_details.* from wmbs_file_parent join wmbs_file_details on wmbs_file_parent.PARENT = wmbs_file_details.ID where child = 6708;
select distinct wmbs_job.* from wmbs_job_assoc join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID where wmbs_job_assoc.FILEID in (select parent from wmbs_file_parent where child = 3756584);
select wmbs_job.CACHE_DIR, wmbs_file_details.FILESIZE from wmbs_file_details join wmbs_file_parent on wmbs_file_parent.CHILD = wmbs_file_details.ID join wmbs_job_assoc on wmbs_file_parent.PARENT = wmbs_job_assoc.FILEID join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/MuOnia/RAW/v1/000/300/515/00000/D662FD9E-177A-E711-8F1B-02163E019D28.root';
select wmbs_job.CACHE_DIR, wmbs_file_details.* from wmbs_file_details join wmbs_file_parent on wmbs_file_parent.CHILD = wmbs_file_details.ID join wmbs_job_assoc on wmbs_file_parent.PARENT = wmbs_job_assoc.FILEID join wmbs_job on wmbs_job_assoc.JOB = wmbs_job.ID join wmbs_job_mask on wmbs_job.ID = wmbs_job_mask.JOB where wmbs_file_details.LFN = '/store/unmerged/data/Run2017C/SingleMuon/ALCARECO/DtCalib-PromptReco-v1/000/299/616/00000/86B8AC3B-5B71-E711-ABA7-02163E0118E2.root' and wmbs_job.CACHE_DIR not like '%Cleanup%';
select wmbs_file_details.* from wmbs_job join wmbs_job_assoc on wmbs_job.ID = wmbs_job_assoc.JOB join wmbs_file_details on wmbs_job_assoc.FILEID = wmbs_file_details.ID where wmbs_job.STATE = 17;
select name from wmbs_fileset;
git clone https://github.com/cms-sw/pkgtools.git   # points HEAD to V00-32-XX
cd pkgtools
git remote -v
git fetch origin V00-32-XX
git checkout V00-32-XX
git pull origin V00-32-XX
# Clone the cmsdist branch comp_gcc630: https://github.com/cms-sw/cmsdist.git

3. Now you should have the build environment properly configured. In order to build a new release:
### RPM cms t0 2.1.5

Normally, you only need to increment this to the tag version created on GitHub before.
%define wmcver 1.2.3
Source0: git://github.com/<YOUR_GH_USERNAME>/T0.git?obj=master/%{realversion}&export=T0-%{realversion}&output=/T0-%{realversion}.tar.gz

If needed, the WMCore release can be adjusted as well (Source1 parameter in the spec file).
# build a new release
pkgtools/cmsBuild -c cmsdist --repository comp -a slc7_amd64_gcc630 --builders 8 -j 4 --work-dir w build t0 | tee logBuild
# upload it:
pkgtools/cmsBuild -c cmsdist --repository comp -a slc7_amd64_gcc630 --builders 8 -j 4 --work-dir w --upload-user=$USER upload t0 | tee logUpload

4. You should build the new release under your personal repository, run unit tests on it and test it in a replay:
# here change the T0 release tag:
TIER0_VERSION=2.1.4
...
# change the deployment source to your personal repo if it's a release from your personal repository
# Vytas private repo deployment
./Deploy -s prep -r comp=comp.[your-CERN-username] -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
./Deploy -s sw -r comp=comp.[your-CERN-username] -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
./Deploy -s post -r comp=comp.[your-CERN-username] -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
# Usual deployment
#./Deploy -s prep -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
#./Deploy -s sw -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION
#./Deploy -s post -r comp=comp -A $TIER0_ARCH -t $TIER0_VERSION -R tier0@$TIER0_VERSION $DEPLOY_DIR tier0@$TIER0_VERSION

That's it. Now you can deploy a new T0 release. Just make sure there are no GitHub conflicts or other errors during the deployment procedure.

6. Then, once your personal release is tested and proven to work properly, you want to build it as the CMS package in the cms-sw/cmsdist repository. To prepare for that:
cd T0 (local T0 repo work area)
# stg branch master -> need to be in master branch
# stg pull -> need to be in sync with updates
# don't build the release, just generate the changelog
. bin/buildrelease.sh --skip-build --wmcore-tag=1.1.20.patch4 2.1.4   # update wmcore-tag (not sure it matters)
# Now it will open an editor window where you can edit the CHANGES file
# usually not desired since it auto-populates with all the changes
# this creates a tag in the local area, DO NOT TAG MANUALLY
# tag needs to be copied to the github user and dmwm repos
# some of these might be redundant by now, I usually do all for safety
git push upstream master
git push origin master
git push --tags upstream master
git push --tags origin master
/data/tier0/sls/scripts/Logs
shutdown -r now
puppet agent -tv
# | Permissions | Owner | Group | Folder Name |
---|---|---|---|---|
1. | (775) drwxrwxr-x. | root | zh | admin |
2. | (775) drwxrwxr-x. | root | zh | certs |
3. | (755) drwxr-xr-x. | cmsprod | zh | cmsprod |
4. | (700) drwx------. | root | root | lost+found |
5. | (775) drwxrwxr-x. | root | zh | srv |
6. | (755) drwxr-xr-x. | cmst1 | zh | tier0 |
stat -c %a /path/to/file
EXAMPLE 1: chmod 775 /data/certs/
EXAMPLE 1: chown :zh /data/certs/
EXAMPLE 2: chown -R cmst1:zh /data/certs/*
File | Description |
---|---|
00_deploy_prod.sh | Script to deploy the WMAgent for production(*) |
00_deploy_replay.sh | Script to deploy the WMAgent for a replay(*) |
00_patches.sh | Script to apply patches from 00_software script. |
00_readme.txt | Some documentation about the scripts |
00_software.sh * | Gets the source code to use from Github for WMCore and the Tier0. Applies the described patches. |
00_start_agent.sh | Starts the agent after it is deployed. |
00_start_services.sh | Used during the deployment to start services such as CouchDB |
00_stop_agent.sh | Stops the components of the agent. It doesn't delete any information from the file system or the T0AST; it just kills the processes of the services and the WMAgent components |
00_wipe_t0ast.sh | Invoked by the 00_deploy script. Wipes the content of the T0AST. Be careful! |
WMAGENT_SECRETS_LOCATION=$HOME/WMAgent.replay.secrets;
sed -i 's+TIER0_CONFIG_FILE+/data/tier0/admin/ReplayOfflineConfiguration.py+' ./config/tier0/config.py
sed -i "s+'team1,team2,cmsdataops'+'tier0replay'+g" ./config/tier0/config.py
# Workflow archive delay
echo 'config.TaskArchiver.archiveDelayHours = 1' >> ./config/tier0/config.py
$manage execute-agent wmagent-unregister-wmstats `hostname -f`
# | Instruction | Responsible Role |
---|---|---|
0 | If there are any exceptions when logging into a new headnode, you should restart the vobox first. See the Restarting a vobox section. | Tier0
1 | Run a replay on the new headnode. Some changes have to be done to safely run it in a Prod instance. Please check the Running a replay on a headnode section | Tier0 |
2 | When the replay is done, deploy the new T0 prod WMAgent instance on the new headnode. You should use a 00_deploy_prod.sh script. | Tier0 |
3 | Check the ProdOfflineConfiguration that is being used. It should have the desired new configuration (acq. era, GTs, processing versions, etc.) | Tier0
4 | Start the Tier0 WMAgent on the new headnode. | Tier0 |
5 | Change the configuration for Grafana monitoring pointing to the proper T0AST instance. (on vocms015 at /data/tier0/sls/etc/config.py) | Tier0 |
6 | Change the acron job execution node on cmst1 lxplus to point to the new headnode. These are the checkActiveRuns and checkPendingTransactions scripts: */10 * * * * lxplus ssh vocms0314 "/data/tier0/tier0_monitoring/src/cmst0_diagnoseActiveRuns/activeRuns.sh" &> /afs/cern.ch/user/c/cmst1/www/tier0/diagnoseActiveRuns.out */5 * * * * lxplus ssh vocms0314 "/data/tier0/tier0_monitoring/src/cmst0_checkPendingSubscriptions/checkPendingSubscriptions.sh" &> /afs/cern.ch/user/c/cmst1/www/tier0/checkPendingSubscriptions.out | Tier0
7 | Change the main Production configuration symlink in the cmst1 lxplus acrontab area at /afs/cern.ch/user/c/cmst1/www/tier0/ : ln -sfn ProdOfflineConfiguration_123.py ProdOfflineConfiguration.py | Tier0
curl https://patch-diff.githubusercontent.com/raw/dmwm/T0/pull/4500.patch | patch -d /data/tier0/srv/wmagent/current/apps/t0/lib/python2.7/site-packages/ -p3

Do not forget to make sure that the destination lib directory exists and that there were no git errors/conflicts when applying the patch.
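A couple of hedged sanity checks around that step (same paths as above; failed hunks leave *.rej files behind):

# make sure the destination lib directory exists before patching
ls -d /data/tier0/srv/wmagent/current/apps/t0/lib/python2.7/site-packages/
# after patching, look for reject files indicating hunks that did not apply
find /data/tier0/srv/wmagent/current/apps/t0/lib/python2.7/site-packages/ -name '*.rej'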
00_stop_agent.sh
service condor stop

If you want your data to still be available, copy your spool directory to disk:
cp -r /mnt/ramdisk/spool /data/
t0_control start
00_start_agent

In particular, check the PhEDExInjector component; if you see errors there, try restarting it after sourcing init.sh:
source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
$manage execute-agent wmcoreD --restart --component PhEDExInjector
sudo su -
cd /etc/condor/config.d/
-rw-r--r--. 1 condor condor  1849 Mar 19  2015 00_gwms_general.config
-rw-r--r--. 1 condor condor  1511 Mar 19  2015 01_gwms_collectors.config
-rw-r--r--  1 condor condor   678 May 27  2015 03_gwms_local.config
-rw-r--r--  1 condor condor  2613 Nov 30 11:16 10_cms_htcondor.config
-rw-r--r--  1 condor condor  3279 Jun 30  2015 10_had.config
-rw-r--r--  1 condor condor 36360 Jun 29  2015 20_cms_secondary_collectors_tier0.config
-rw-r--r--  1 condor condor  2080 Feb 22 12:24 80_cms_collector_generic.config
-rw-r--r--  1 condor condor  3186 Mar 31 14:05 81_cms_collector_tier0_generic.config
-rw-r--r--  1 condor condor  1875 Feb 15 14:05 90_cms_negotiator_policy_tier0.config
-rw-r--r--  1 condor condor  3198 Aug  5  2015 95_cms_daemon_monitoring.config
-rw-r--r--  1 condor condor  6306 Apr 15 11:21 99_local_tweaks.config
# Knob to enable or disable flocking
# To enable, set this to True (defragmentation is auto enabled)
# To disable, set this to False (defragmentation is auto disabled)
ENABLE_PROD_FLOCKING = True
ENABLE_PROD_FLOCKING = False
condor_reconfig
condor_config_val -master gsi_daemon_name
ps aux | grep "condor_negotiator" kill -9 <replace_by_condor_negotiator_process_id>
sudo su -
cd /etc/condor/config.d/
# How to drain the slots
# graceful: let the jobs finish, accept no more jobs
# quick: allow job to checkpoint (if supported) and evict it
# fast: hard kill the jobs
DEFRAG_SCHEDULE = graceful
DEFRAG_SCHEDULE = fast
DEFRAG_SCHEDULE = graceful
/afs/cern.ch/user/e/ebohorqu/public/HIStats/stats.py

For the analysis we need to define certain things:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/RecoStatsProcessing.json

With a separate script in R, I was reading and summarizing the data:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/parse_cpu_info.R

There, the task type should be defined, and also the output file. With this script I was just summarizing CPU data, but we could modify it a little to get memory data. Maybe it is quicker to do it directly with the first python script, if you like to do it :P That script calculates the efficiency of each job:
TotalLoopCPU / (TotalJobTime * numberOfCores)

and an averaged efficiency per dataset:
sum(TotalLoopCPU) / sum(TotalJobTime * numberOfCores)

numberOfCores was obtained from job.pkl; TotalLoopCPU and TotalJobTime were obtained from report.pkl. The job type can be Processing, Merge or Harvesting. For the Processing type, the task can be Reco or AlcaSkim, and for the Merge type, ALCASkimMergeALCARECO, RecoMergeSkim, RecoMergeWrite_AOD, RecoMergeWrite_DQMIO, RecoMergeWrite_MINIAOD or RecoMergeWrite_RECO.
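As a worked example (the numbers are made up purely to illustrate the formula): a 4-core job with TotalLoopCPU = 28800 s and TotalJobTime = 10000 s has an efficiency of 28800 / (10000 * 4) = 0.72, i.e. 72%.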
git checkout master
git fetch dmwm
git pull dmwm master
git push origin master
git checkout -b <branch-name> dmwm/master
git add <file-name>
git commit
git push origin <branch-name>
git commit --amend
git push -f origin <branch-name>
git branch -d <branch-name>

Other useful commands:
git branch
git status
git reset
git diff
git log
git checkout .
Path | Use | Who writes | Who reads | Who cleans |
---|---|---|---|---|
/eos/cms/store/t0streamer/ | Input streamer files transferred from P5 | Storage Manager | Tier-0 worker nodes | Tier-0 t0streamer area cleanup script |
/eos/cms/store/unmerged/ | Store output files smaller than 2GB until the merge jobs put them together | Tier-0 worker nodes (Processing/Repack jobs) | Tier-0 worker nodes (Merge jobs) | ? |
/eos/cms/tier0/ | Files ready to be transferred to Tape and Disk | Tier-0 worker nodes (Processing/Repack/Merge jobs) | PhEDEx Agent | Tier-0 WMAgent creates and auto approves transfer/deletion requests. PhEDEx executes them |
/eos/cms/store/express/ | Output from Express processing | Tier-0 worker nodes | Users | Tier-0 express area cleanup script |
/eos/cms/store/t0streamer/ : The SM writes raw files there, and we delete them with the cleanup script, which runs as an acron job under the cmst0 account. The script keeps data that have not been repacked yet, and otherwise keeps data no older than 7 days. The data get repacked (rewritten) from streamer .dat files into PDs (raw .root files).
/eos/cms/store/unmerged/ : Files which need to be merged into larger files go there (not all files do). The jobs themselves manage the area: after merging, the job deletes the unmerged files.
/eos/cms/store/express/ : Express output after being merged. Jobs from the Tier-0 write to it. Data deletions are managed by DDM.
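For a quick look at these areas from lxplus, something like the following should work (a sketch, assuming the standard CERN EOS client is available):

eos ls -l /eos/cms/store/t0streamer/
eos ls -l /eos/cms/store/express/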
https://cmsweb.cern.ch/phedex/datasvc/json/prod/requestlist?dataset=/*/Tier0_REPLAY_vocms015*/*&node=T2_CH_CERN

The request above will return a JSON resultset as follows:
{ "phedex": { "call_time": 9.78992, "instance": "prod", "request": [ { "approval": "approved", "id": 1339469, "node": [ { "decided_by": "Daniel Valbuena Sosa", "decision": "approved", "id": 1561, "name": "T2_CH_CERN", "se": "srm-eoscms.cern.ch", "time_decided": 1526987916 } ], "requested_by": "Vytautas Jankauskas", "time_create": 1526644111.24301, "type": "delete" }, { ... } ], "request_call": "requestlist", "request_date": "2018-07-17 20:53:23 UTC", "request_timestamp": 1531860803.37134, "request_url": "http://cmsweb.cern.ch:7001/phedex/datasvc/json/prod/requestlist", "request_version": "2.4.0pre1" } }The PhEDEx services not only allows you to create more detailed queries, but are faster than query the information on PhEDEx website.
/afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/streamer_delete.log.
# Firstly:
#   source /data/tier0/admin/env.sh
#   source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
# Then you can use this simple script to retrieve a list of files from the related runs.
# Keep in mind that in the snippet below we are ignoring all Express streams output.
# This is just a snippet, so it may not work out of the box.
from dbs.apis.dbsClient import DbsApi
from pprint import pprint
import os

dbsUrl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'
dbsApi = DbsApi(url = dbsUrl)
runList = [316569]

with open(os.path.join('/data/tier0/srv/wmagent/current/tmpRecovery/', "testRuns.txt"), 'a') as the_file:
    for a in runList:
        datasets = dbsApi.listDatasets(run_num=a)
        pprint(datasets)
        for singleDataset in datasets:
            pdName = singleDataset['dataset']
            if 'Express' not in pdName and 'HLTMonitor' not in pdName and 'Calibration' not in pdName and 'ALCALUMIPIXELSEXPRESS' not in pdName:
                datasetFiles = dbsApi.listFileArray(run_num=a, dataset=pdName)
                #print("For run %d the dataset %s", a, pdName)
                for singleFile in datasetFiles:
                    print(singleFile['logical_file_name'])
                    the_file.write(singleFile['logical_file_name']+"\n")
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ALCALUMIPIXELSEXPRESS'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'Express'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'HLTMonitor'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'Calibration'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ExpressAlignment'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ExpressCosmics'