/data/tier0/ (absolute paths are provided in these instructions for clarity). In order to start a new replay, you first need to make sure that the instance is available; check the Tier0 project on Jira.
condor_q. If there are any leftover jobs, you can use
condor_rm -all to remove everything.
runningagent (this is an alias included in the cmst1 config; the actual command is: ps aux | egrep 'couch|wmcore|mysql|beam')
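For reference, a minimal sketch of how such an alias could be defined (the actual definition in the cmst1 shell configuration may differ):

# Hypothetical alias definition matching the command quoted above
alias runningagent="ps aux | egrep 'couch|wmcore|mysql|beam'"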
/data/tier0/00_stop_agent.sh
/data/tier0/admin/ReplayOfflineConfiguration.py
./00_software.sh # loads the newest version of the WMCore and T0 GitHub repositories
./00_deploy.sh   # deploys the new configuration, wipes the T0AST database, etc.
./00_start_agent.sh # starts the new agent - loads the job list etc.
vim /data/tier0/srv/wmagent/2.0.8/install/tier0/Tier0Feeder/ComponentLog
vim /data/tier0/srv/wmagent/2.0.8/install/tier0/JobCreator/ComponentLog
vim /data/tier0/srv/wmagent/2.0.8/install/tier0/JobSubmitter/ComponentLog
condor_q
runningagent
<meaningfulName>Scenario = "<actualNameOfTheNewScenario>"
defaultRecoTimeout = 48 * 3600 to something higher, like 10 * 48 * 3600. The Tier0Feeder checks this timeout every polling cycle, so when you want to release PromptReco again you just need to go back to the 48h delay.
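As a quick sanity check (a sketch, assuming the timeout is set in the production configuration file referenced below), you can verify the current value before and after editing:

# Show the current defaultRecoTimeout setting
grep -n "defaultRecoTimeout" /data/tier0/admin/ProdOfflineConfiguration.py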
/data/tier0/admin/ProdOfflineConfiguration.py
defaultCMSSWVersion = "CMSSW_7_4_7"
repackVersionOverride = {
    "CMSSW_7_4_2" : "CMSSW_7_4_7",
    "CMSSW_7_4_3" : "CMSSW_7_4_7",
    "CMSSW_7_4_4" : "CMSSW_7_4_7",
    "CMSSW_7_4_5" : "CMSSW_7_4_7",
    "CMSSW_7_4_6" : "CMSSW_7_4_7",
    }

expressVersionOverride = {
    "CMSSW_7_4_2" : "CMSSW_7_4_7",
    "CMSSW_7_4_3" : "CMSSW_7_4_7",
    "CMSSW_7_4_4" : "CMSSW_7_4_7",
    "CMSSW_7_4_5" : "CMSSW_7_4_7",
    "CMSSW_7_4_6" : "CMSSW_7_4_7",
    }
select RECO_CONFIG.RUN_ID, CMSSW_VERSION.NAME
from RECO_CONFIG
inner join CMSSW_VERSION on RECO_CONFIG.CMSSW_ID = CMSSW_VERSION.ID
where name = '<CMSSW_X_X_X>'

select EXPRESS_CONFIG.RUN_ID, CMSSW_VERSION.NAME
from EXPRESS_CONFIG
inner join CMSSW_VERSION on EXPRESS_CONFIG.RECO_CMSSW_ID = CMSSW_VERSION.ID
where name = '<CMSSW_X_X_X>'
UPDATE ( SELECT reco_release_config.released AS released,
                reco_release_config.delay AS delay,
                reco_release_config.delay_offset AS delay_offset
         FROM reco_release_config
         WHERE checkForZeroOneState(reco_release_config.released) = 0
           AND reco_release_config.run_id <= <Replace By the desired Run Number> ) t
SET t.released = 1,
    t.delay = 10,
    t.delay_offset = 5;
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --cms-name=T1_IT_CNAF --pnn=T1_IT_CNAF_Disk --ce-name=T1_IT_CNAF --pending-slots=100 --running-slots=1000 --plugin=PyCondorPlugin
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Processing --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Production --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Merge --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Cleanup --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=LogCollect --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Skim --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_IT_CNAF --task-type=Harvesting --pending-slots=10 --running-slots=20

A useful command to check the current state of the site (agent parameters for the site, running jobs, etc.):
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN -p
datasets = [ "DisplacedJet" ] for dataset in datasets: addDataset(tier0Config, dataset, do_reco = True, raw_to_disk = True, tape_node = "T1_IT_CNAF_MSS", disk_node = "T1_IT_CNAF_Disk", siteWhitelist = [ "T1_IT_CNAF" ], dqm_sequences = [ "@common" ], physics_skims = [ "LogError", "LogErrorMonitor" ], scenario = ppScenario)
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch/CN=110263821
issuer    : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
identity  : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
type      : RFC3820 compliant impersonation proxy
strength  : 1024
path      : /data/certs/serviceproxy-vocms001.pem
timeleft  : 157:02:59
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch
issuer    : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
timeleft  : 157:02:58
uri       : voms2.cern.ch:15002
# Tier0 - /eos/cms/store/t0streamer/ area cleanup script. Running here as cmst0 has writing permission on eos - cms-tier0-operations@cern.ch
0 10,22 * * * lxplus.cern.ch /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.sh >> /afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/streamer_delete.log 2>&1

To add a run to the skip list:
/afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/analyzeStreamers_prod.py
# run number in this list will be skipped in the iteration below runSkip = [251251, 251643, 254608, 254852, 263400, 263410, 263491, 263502, 263584, 263685, 273424, 273425, 273446, 273449, 274956, 274968, 276357,...]
SELECT id, name, cache_dir FROM wmbs_job WHERE state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused');

You can use this query to get the workflows that have paused jobs:
SELECT DISTINCT(wmbs_workflow.NAME)
FROM wmbs_job
inner join wmbs_jobgroup on wmbs_job.jobgroup = wmbs_jobgroup.ID
inner join wmbs_subscription on wmbs_subscription.ID = wmbs_jobgroup.subscription
inner join wmbs_workflow on wmbs_subscription.workflow = wmbs_workflow.ID
WHERE wmbs_job.state = (SELECT id FROM wmbs_job_state WHERE name = 'jobpaused')
and wmbs_job.cache_dir like '%Reco%';

Paused jobs can also be in state 'submitfailed'.
cd /data/tier0/srv/wmagent/current/install/tier0
find ./JobCreator/JobCache -name Report.3.pkl

This will return the cache dir of the paused jobs (this may not work if the jobs were not actually submitted - submitfailed jobs do not create Report.*.pkl).
xrdcp PFN .
# Source environment
source /data/tier0/admin/env.sh

# Fail paused jobs
$manage execute-agent paused-jobs -f -j 10231

# Resume paused jobs
$manage execute-agent paused-jobs -r -j 10231

You can use the following options:
-j job
-w workflow
-t taskType
-s site
-d do not commit changes, only show what would be done

To do mass fails/resumes for a single error code, the following commands are useful:
cp ListOfPausedJobsFromDB /data/tier0/jocasall/pausedJobsClean.txt
python /data/tier0/jocasall/checkPausedJobs.py
awk -F '_' '{print $6}' code_XXX > jobsToResume.txt
while read job; do $manage execute-agent paused-jobs -r -j ${job}; done <jobsToResume.txt
select DISTINCT(tar_details.LFN)
from wmbs_file_parent
inner join wmbs_file_details parentdetails on wmbs_file_parent.CHILD = parentdetails.ID
left outer join wmbs_file_parent parents on parents.PARENT = wmbs_file_parent.PARENT
left outer join wmbs_file_details childsdetails on parents.CHILD = childsdetails.ID
left outer join wmbs_file_parent childs on childsdetails.ID = childs.PARENT
left outer join wmbs_file_details tar_details on childs.CHILD = tar_details.ID
where childsdetails.LFN like '%tar.gz'
and parentdetails.LFN in ('/store/unmerged/express/Commissioning2014/StreamExpressCosmics/ALCARECO/Express-v3/000/227/470/00000/A25ED7B5-5455-E411-AA08-02163E008F52.root',
                          '/store/unmerged/data/Commissioning2014/MinimumBias/RECO/PromptReco-v3/000/227/430/00000/EC5CF866-5855-E411-BC82-02163E008F75.root');
lcg-cp srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/logs/prod/2014/10/WMAgent/PromptReco_Run227430_MinimumBias/PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar ./PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar
tar -xvf PromptReco_Run227430_MinimumBias-LogCollect-1-logs.tar zgrep <UUID> ./LogCollection/*.tar.gz
S'trivialcatalog_file:/home/glidein_pilot/glide_aXehes/execute/dir_30664/job/WMTaskSpace/cmsRun2/CMSSW_7_1_10_patch2/override_catalog.xml?protocol=direct'

Changes to:
S'trivialcatalog_file:/afs/cern.ch/user/l/lcontrer/scram/plugins/override_catalog.xml?protocol=direct'
eos cp <local file> </eos/cms/store/unmerged/...>
tar -zxvf 68d93c9c-db7e-11e3-a585-00221959e789-46-0-logArchive.tar.gz
# Create a valid proxy
voms-proxy-init -voms cms
# Source CMSSW environment
source /cvmfs/cms.cern.ch/cmsset_default.sh
# Create the scram area (replace the release with the one the job should use)
scramv1 project CMSSW CMSSW_7_4_0
# Go to the src area cd CMSSW_7_4_0/src/
eval `scramv1 runtime -sh`
# Actually run the job (you can pass the parameter to create a fwjr too)
cmsRun PSet.py
# If you need to modify the job for whatever reason (like dropping some input to get at least some
# statistics for a DQM harvesting job), you first need to get a config dump in python format
# instead of pickle. Keep in mind that the config file is very big.
# Modify PSet.py by adding "print process.dumpPython()" as a last command and run it using python
python PSet.py > cmssw_config.py
# Modify cmssw_config.py (for example, find process.source and remove files that you don't want to run on).
# Save it and use it as input for cmsRun instead of PSet.py
cmsRun cmssw_config.py
update lumi_section_closed set filecount = 0, CLOSE_TIME = <timestamp> where lumi_id in ( <lumisection ID> ) and run_id = <Run ID> and stream_id = <stream ID>;

Example:
update lumi_section_closed set filecount = 0, CLOSE_TIME = 1436179634 where lumi_id in ( 11 ) and run_id = 250938 and stream_id = 14;
export PYTHONPATH=/cvmfs/cms.cern.ch/slc6_amd64_gcc491/cms/cmssw-patch/CMSSW_7_5_8_patch1/python

In the previous example we assume the job is using CMSSW_7_5_8_patch1, which is why we point to this particular path in cvmfs. You should modify it according to the CMSSW version your job is intended to use. Now you can use the following snippet to dump the file:
import FWCore.ParameterSet.Config
import pickle

pickleHandle = open('PSet.pkl','rb')
process = pickle.load(pickleHandle)
# This line will only print the python version of the pkl file on the screen
process.dumpPython()
# The actual writing of the file
outputFile = open('PSetPklAsPythonFile.py', 'w')
outputFile.write(process.dumpPython())
outputFile.close()

After dumping the file you can modify its contents. It is not necessary to pickle it again; you can use the cmsRun command normally:
cmsRun PSetPklAsPythonFile.py

or
cmsRun -e PSet.py 2>err.txt 1>out.txt &
# Source environment
source /data/tier0/admin/env.sh
# Run the diagnose script (change run number)
$manage execute-tier0 diagnoseActiveRuns 231087
curl https://github.com/dmwm/WMCore/commit/8c5cca41a0ce5946d0a6fb9fb52ed62165594eb0.patch | patch -d /data/tier0/srv/wmagent/1.9.92/sw.pre.hufnagel/slc6_amd64_gcc481/cms/wmagent/1.0.7.pre6/data/ -p 2

Then init the couchapp; this will create the view. It may take some time if you have a big database to map.
$manage execute-agent wmagent-couchapp-init

Then curl the results for the given time frame (look for the timestamps you need, change user and password accordingly):
curl -g -X GET 'http://user:password@localhost:5984/wmagent_jobdump%2Fjobs/_design/JobDump/_view/statusByTime?startkey=["executing",1432223400]&endkey=["executing",1432305900]'
# source environment
source /data/tier0/srv/wmagent/current/apps/t0/etc/profile.d/init.sh

# go to the job area, open a python console and do:
import cPickle
jobHandle = open('job.pkl', "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob

# for Report.*.pkl do:
import cPickle
jobHandle = open("Report.3.pkl", "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()
print loadedJob
import cPickle, os

jobHandle = open('job.pkl', "r")
loadedJob = cPickle.load(jobHandle)
jobHandle.close()

# Do the changes on the loadedJob

output = open('job.pkl', 'w')
cPickle.dump(loadedJob, output, cPickle.HIGHEST_PROTOCOL)
output.flush()
os.fsync(output.fileno())
output.close()
import FWCore.ParameterSet.Config as cms
import pickle

handle = open('PSet.pkl', 'r')
process = pickle.load(handle)
handle.close()
print process.dumpConfig()
feature=maxRSS
value=15360000

Executing generate_code.sh creates a script named after the feature, e.g. modify_wmworkload_maxRSS.py. The latter modifies the selected feature in the workflow sandbox. Once it is generated, you need to add a call to that script in modify_one_workflow.sh, which calls all the required scripts, creates the tarball and places it where required (the Specs folder). Finally, execute modify_several_workflows.sh, which calls modify_one_workflow.sh for all the desired workflows. This procedure has already been followed for several jobs, so for some features the required personalization of the scripts has already been done and you would just need to comment or uncomment the required lines. In summary, you would need to proceed as detailed below:
vim list
./print_workflow_config.sh
vim generate_code.sh
./generate_code.sh
vim modify_one_workflow.sh
./modify_several_workflows.sh
vim list
cp modify_pset.py modify_pset_<feature>.py
vim modify_pset_<feature>.py
vim modify_one_job.sh
./modify_several_jobs.sh
# Copy the workflow sandbox from /data/tier0/admin/Specs to your work area
cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 /data/tier0/lcontrer/temp

The work area should only contain the workflow sandbox. Go there, then untar the sandbox and unzip WMCore:
cd /data/tier0/lcontrer/temp
tar -xjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2
unzip -q WMCore.zip

Now replace/modify the files in WMCore. Then you have to merge everything again. You should remove the old sandbox and WMCore.zip too:
# Remove former sandbox and WMCore.zip, then create the new WMCore.zip
rm PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 WMCore.zip
zip -rq WMCore.zip WMCore
# Now remove the WMCore folder and then create the new sandbox
rm -rf WMCore/
tar -cjf PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 ./*
# Clean workarea
rm -rf PSetTweaks/ WMCore.zip WMSandbox/

Now copy the new sandbox to the Specs area, as in the sketch below. Keep in mind that only jobs submitted after the sandbox is replaced will pick it up. It is also good practice to save a copy of the original sandbox, just in case something goes wrong.
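A minimal sketch of that copy-back step, assuming the same example workflow and work area used above (adjust names and paths to your case):

# Back up the original sandbox before overwriting it (hypothetical backup location)
cp /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 \
   /data/tier0/lcontrer/temp/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2.orig
# Install the rebuilt sandbox into the Specs area
cp /data/tier0/lcontrer/temp/PromptReco_Run245436_Cosmics-Sandbox.tar.bz2 \
   /data/tier0/admin/Specs/PromptReco_Run245436_Cosmics/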
# find lumis to update
update lumi_section_closed set filecount = 0 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;
update lumi_section_closed set filecount = 1 where lumi_id in ( ... ) and run_id = <run> and stream_id = <stream_id>;
delete from wmbs_sub_files_available where fileid in ( ... );
delete from wmbs_fileset_files where fileid in ( ... );
delete from wmbs_file_location where fileid in ( ... );
delete from wmbs_file_runlumi_map where fileid in ( ... );
delete from streamer where id in ( ... );
delete from wmbs_file_details where id in ( ... );
SELECT WMBS_WORKFLOW.NAME AS NAME,
       WMBS_WORKFLOW.TASK AS TASK,
       LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION AS SUBSCRIPTION,
       LUMI_SECTION_SPLIT_ACTIVE.RUN_ID AS RUN_ID,
       LUMI_SECTION_SPLIT_ACTIVE.LUMI_ID AS LUMI_ID
FROM LUMI_SECTION_SPLIT_ACTIVE
INNER JOIN WMBS_SUBSCRIPTION ON LUMI_SECTION_SPLIT_ACTIVE.SUBSCRIPTION = WMBS_SUBSCRIPTION.ID
INNER JOIN WMBS_WORKFLOW ON WMBS_SUBSCRIPTION.WORKFLOW = WMBS_WORKFLOW.ID;

# This shows the pending active lumi sections for repack. One of these should be related to the corrupted file; compare this result with the first query.
SELECT * FROM LUMI_SECTION_SPLIT_ACTIVE;

# You HAVE to be completely sure before deleting an entry from the database (don't do this if you don't understand what it implies)
DELETE FROM LUMI_SECTION_SPLIT_ACTIVE WHERE SUBSCRIPTION = 1345 and RUN_ID = 207279 and LUMI_ID = 129;
reco_locked table

If you want to manually set a run as the fcsr, you have to make sure that it is the lowest run with locked = 0:
update reco_locked set locked = 0 where run >= <desired_run>
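To double-check the result, a sketch of a query showing the lowest unlocked run (which should then be the fcsr); connection placeholders are the same as in the sqlplus examples elsewhere in this guide:

sqlplus <instanceName>/<password>@<tns> <<'EOF'
select min(run) from reco_locked where locked = 0;
EOF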
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_stream_done?run=305199&stream=ZeroBias
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306462
https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306460&primary_dataset=MET

run_dataset_done can be called without any primary_dataset parameter, in which case it reports back the overall PromptReco status. It aggregates over all known datasets for that run in the system (i.e. all datasets for all streams for which we have data for this run).
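For example, a quick way to inspect the aggregated response from the command line (a sketch; jq is assumed to be available, and the endpoint may require a valid grid certificate/proxy depending on the deployment):

# Pretty-print the run_dataset_done response for one run
curl -s "https://cmsweb.cern.ch/t0wmadatasvc/prod/run_dataset_done/?run=306462" | jq .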
source /data/tier0/admin/env.sh
$manage execute-agent wmcoreD --restart --components=<componentName>

Example:
$manage execute-agent wmcoreD --restart --components=DBS3Upload
cd /data/tier0/
./00_stop_agent.sh
./00_start_agent.sh
source /data/tier0/admin/env.sh
$manage execute-agent wmcoreD --restart --component ComponentName
https://github.com/ticoann/WmAgentScripts/blob/wmstat_temp_test/test/updateT0RequestStatus.py
/data/tier0/srv/wmagent/2.0.4/sw/slc6_amd64_gcc493/cms/wmagent/1.0.17.pre4/bin/
if info['Run'] < <RunNumber>

As you should notice, the given run number will be the oldest run shown in WMStats.
$manage execute-agent updateT0RequestStatus.py
# Source environment
source /data/tier0/admin/env.sh

# Add a site to Resource Control - change site, thresholds and plugin if needed
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --cms-name=T2_CH_CERN_T0 --se-name=srm-eoscms.cern.ch --ce-name=T2_CH_CERN_T0 --pending-slots=1000 --running-slots=1000 --plugin=CondorPlugin

# Change/init thresholds by task:
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --task-type=Processing --pending-slots=500 --running-slots=500

# Change site status (normal, drain, down)
$manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN_T0 --down
source /data/tier0/admin/env.sh
$manage execute-agent wmagent-unregister-wmstats `hostname -f`
source /data/tier0/admin/env.sh
$manage execute-agent wmagent-resource-control --site-name=<Desired_Site> --task-type=Processing --pending-slots=<desired_value> --running-slots=<desired_value>

Example:
$manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=3000 --running-slots=9000
$manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=PyCondorPlugin
$manage execute-agent wmagent-resource-control -p
select blockname from dbsbuffer_block where deleted = 0

Some datasets could be marked as subscribed in the database but not actually subscribed in PhEDEx. You can check this with the Transfer Team and, if that is the case, retry the subscription by setting subscribed to 0. You can narrow the query to blocks with a given name pattern or blocks at a specific site.
update dbsbuffer_dataset_subscription set subscribed = 0 where dataset_id in ( select dataset_id from dbsbuffer_block where deleted = 0 <and blockname like...> ) <and site like ...>

Some blocks can be marked as closed but still be open in PhEDEx. If this is the case, you can set their status to "InDBS" to try closing them again. For example, if you want to close MiniAOD blocks, you can provide a name pattern like '%/MINIAOD#%'. The status attribute can have 3 values: 'Open', 'InDBS' and 'Closed'. 'Open' is the first value assigned to all blocks; when they are closed and injected into DBS, the status changes to 'InDBS', and when they are closed in PhEDEx, the status changes to 'Closed'. Setting the status back to 'InDBS' makes the agent retry closing the blocks in PhEDEx.
update dbsbuffer_block set status = 'InDBS' where deleted = 0 and status = 'Closed' and blockname like ...

If some subscriptions shouldn't be checked anymore, remove them from the database. For instance, if you want to remove RAW subscriptions to disk at all T1s, you can give a path pattern like '/%/%/RAW' and a site like 'T1_%_Disk'.
delete dbsbuffer_dataset_subscription where dataset_id in ( select id from dbsbuffer_dataset where path like ... ) and site like ...
condor_q 52982.15 -l | less -i

To get the condor list by regexp:
condor_q -const 'regexp("30199",WMAgent_RequestName)' -af
condor_qedit <job-id> JobPrio "<New Prio (numeric value)>"
for job in $(condor_q -w | awk '{print $1}')
do
    condor_qedit $job JobPrio "508200001"
done
condor_qedit -const 'MaxWallTimeMins>30000' MaxWallTimeMins 1440
condor_status -pool vocms007 -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")' -af Cpus | sort | uniq -c
condor_status -pool vocms007 -const 'Slottype=="Dynamic" && ( ClientMachine=="vocms001.cern.ch" || ClientMachine=="vocms014.cern.ch" || ClientMachine=="vocms015.cern.ch" || ClientMachine=="vocms0313.cern.ch" || ClientMachine=="vocms0314.cern.ch" || ClientMachine=="vocms039.cern.ch" || ClientMachine=="vocms047.cern.ch" || ClientMachine=="vocms013.cern.ch")' -af Cpus | awk '{sum+= $1} END {print(sum)}'
condor_status -pool vocms007 -const 'State=="Claimed" && ( ClientMachine=!="vocms001.cern.ch" && ClientMachine=!="vocms014.cern.ch" && ClientMachine=!="vocms015.cern.ch" && ClientMachine=!="vocms0313.cern.ch" && ClientMachine=!="vocms0314.cern.ch" && ClientMachine=!="vocms039.cern.ch" && ClientMachine=!="vocms047.cern.ch" && ClientMachine=!="vocms013.cern.ch")' -af Cpus | awk '{sum+= $1} END {print(sum)}'
/etc/condor/config.d/99_local_tweaks.config
MAX_JOBS_RUNNING = <value>
MAX_JOBS_RUNNING = 12000
condor_reconfig
condor_qedit <job-id> Requestioslots "0"
for job in $(cat <text_file_with_the_list_of_job_condor_IDs>)
do
    condor_qedit $job Requestioslots "0"
done
9.  namesToMapToTIER0 = [ "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms15.cern.ch",
10.                       "/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch"]
38. elif p[ 'dn' ] in namesToMapToTIER0:
39.     dnmap[ p['dn'] ] = "cmst0"
/data/certs
admin/env.sh admin/env_unit.sh
/data/TransferSystem/t0_control.sh
Instance Name | TNS |
---|---|
CMS_T0DATASVC_REPLAY_1 | INT2R |
CMS_T0DATASVC_REPLAY_2 | INT2R |
CMS_T0DATASVC_PROD | CMSR |
/data/tier0/00_stop_agent.sh
ps aux | egrep 'couch|wmcore'
sqlplus <instanceName>/<password>@<tns>

Replace the placeholders with the proper values for each instance.
SQL> password
Changing password for <user>
Old password:
New password:
Retype new password:
Password changed
SQL> exit
/data/tier0/admin/

They are normally named as follows (not all the instances will have all the files):
WMAgent.secrets
WMAgent.secrets.replay
WMAgent.secrets.prod
WMAgent.secrets.localcouch
WMAgent.secrets.remotecouch
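To see which secrets file a given deployment actually picks up, a quick check (a sketch; the variable name matches the one used in the deployment scripts shown later in this guide, and the files searched here are an assumption):

# Show where the deployment/environment scripts point for the agent secrets
grep -rn "WMAGENT_SECRETS_LOCATION" /data/tier0/00_deploy*.sh /data/tier0/admin/env.sh 2>/dev/null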
/data/tier0/srv/wmagent/current/config/tier0/config.py

There you must look for the entry:
config.T0DAtaScvDatabase.connectUrl

and do the update.
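A quick way to locate the entry before editing (a sketch):

# Find the connectUrl line(s) in the agent configuration
grep -n "connectUrl" /data/tier0/srv/wmagent/current/config/tier0/config.py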
https://session-manager.web.cern.ch/session-manager/
Node | Use | Type |
---|---|---|
vocms001 | | Virtual machine |
vocms015 | | Virtual machine |
vocms047 | | Virtual machine |
vocms0313 | | Physical Machine |
vocms0314 | | Physical Machine |
/data/tier0/sls/scripts/Logs
shutdown -r now
puppet agent -tv
# | Permissions | Owner | Group | Folder Name |
---|---|---|---|---|
1. | (775) drwxrwxr-x. | root | zh | admin |
2. | (775) drwxrwxr-x. | root | zh | certs |
3. | (755) drwxr-xr-x. | cmsprod | zh | cmsprod |
4. | (700) drwx------. | root | root | lost+found |
5. | (775) drwxrwxr-x. | root | zh | srv |
6. | (755) drwxr-xr-x. | cmst1 | zh | tier0 |
stat -c %a /path/to/file
EXAMPLE 1: chmod 775 /data/certs/
EXAMPLE 1: chown :zh /data/certs/
EXAMPLE 2: chown -R cmst1:zh /data/certs/*
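To compare the whole /data area against the table above in one go, something like the following can help (a sketch):

# Print permissions (octal), owner, group and name for each entry under /data
stat -c '%a %U %G %n' /data/*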
glidecondor folder, used to....
File | Description |
---|---|
00_deploy.prod.sh | Script to deploy the WMAgent for production(*) |
00_deploy.replay.sh | Script to deploy the WMAgent for a replay(*) |
00_fix_t0_status.sh | |
00_patches.sh | |
00_readme.txt | Some documentation about the scripts |
00_software.sh | Gets the source code to use from GitHub for WMCore and the Tier0. Applies the described patches, if any. |
00_start_agent.sh | Starts the agent after it is deployed. |
00_start_services.sh | Used during the deployment to start services such as CouchDB |
00_stop_agent.sh | Stops the components of the agent. It doesn't delete any information from the file system or the T0AST; it just kills the processes of the services and the WMAgent components |
00_wipe_t0ast.sh | Invoked by the 00_deploy script. Wipes the content of the T0AST. Be careful! |
Folder | Description |
---|---|
WMAGENT_SECRETS_LOCATION=$HOME/WMAgent.replay.secrets;
sed -i 's+TIER0_CONFIG_FILE+/data/tier0/admin/ReplayOfflineConfiguration.py+' ./config/tier0/config.py
sed -i "s+'team1,team2,cmsdataops'+'tier0replay'+g" ./config/tier0/config.py
# Workflow archive delay echo 'config.TaskArchiver.archiveDelayHours = 1' >> ./config/tier0/config.py
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --cms-name=T0_CH_CERN --pnn=T0_CH_CERN_Disk --ce-name=T0_CH_CERN --pending-slots=1600 --running-slots=1600 --plugin=SimpleCondorPlugin
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Processing --pending-slots=800 --running-slots=1600
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Merge --pending-slots=80 --running-slots=160
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Cleanup --pending-slots=80 --running-slots=160
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=LogCollect --pending-slots=40 --running-slots=80
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Skim --pending-slots=0 --running-slots=0
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Production --pending-slots=0 --running-slots=0
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Harvesting --pending-slots=40 --running-slots=80
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Express --pending-slots=800 --running-slots=1600
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T0_CH_CERN --task-type=Repack --pending-slots=160 --running-slots=320
./config/tier0/manage execute-agent wmagent-resource-control --site-name=T2_CH_CERN --cms-name=T2_CH_CERN --pnn=T2_CH_CERN --ce-name=T2_CH_CERN --pending-slots=0 --running-slots=0 --plugin=SimpleCondorPlugin
# | Instruction | Responsible Role |
---|---|---|
0. | Run a replay in the new headnode. Some changes have to be done to safely run it in a Prod instance. Please check the Running a replay on a headnode section | Tier0 |
1. | Deploy the new prod instance in a new vocmsXXX node and check what we use. Obviously, you should use a production version of the 00_deploy.sh script. | Tier0 |
1.5. | Check the ProdOfflineConfiguration that is being used | Tier0 |
2. | Start the Tier0 instance in vocmsXXX | Tier0 |
3. | THIS IS OUTDATED ALREADY I THINK Coordinate with the Storage Manager so we have a stop in data transfers, respecting run boundaries (before this, we need to check that all the runs currently in the Tier0 are OK with bookkeeping; this means no runs in Active status) | SMOps |
4. | THIS IS OUTDATED ALREADY I THINK Check all transfers are stopped | Tier0 |
4.1. | THIS IS OUTDATED ALREADY I THINK Check http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager | |
4.2. | THIS IS OUTDATED ALREADY I THINK Check /data/Logs/General.log | |
5. | THIS IS OUTDATED ALREADY I THINK Change the config file of the transfer system to point to T0AST1. This means going to /data/TransferSystem/Config/TransferSystem_CERN.cfg and changing the following settings to match the new head node T0AST: "DatabaseInstance" => "dbi:Oracle:CMS_T0AST", "DatabaseUser" => "CMS_T0AST_1", "DatabasePassword" => 'superSafePassword123' | Tier0 |
6. | THIS IS OUTDATED ALREADY I THINK Make a backup of the General.log.* files (this backup is only needed if using t0_control restart in the next step; if using t0_control stop + t0_control start the logs won't be affected) | Tier0 |
7. | THIS IS OUTDATED ALREADY I THINK Restart the transfer system using either A) t0_control restart (will erase the logs) or B) t0_control stop followed by t0_control start (will keep the logs) | Tier0 |
8. | THIS IS OUTDATED ALREADY I THINK Kill the replay processes (if any) | Tier0 |
9. | THIS IS OUTDATED ALREADY I THINK Start notification logs to the SM in vocmsXXX | Tier0 |
10. | Change the configuration for Kibana monitoring to point to the proper T0AST instance | Tier0 |
11. | THIS IS OUTDATED ALREADY I THINK Restart transfers | SMOps |
12. | RECHECK THE LIST OF CRONTAB JOBS Point the acron jobs run as cmst1 on lxplus to the new headnode. They are the checkActiveRuns and checkPendingTransactions scripts. | Tier0 |
00_stop_agent.sh
service condor stop

If you want your data to still be available, then cp your spool directory to disk:
cp -r /mnt/ramdisk/spool /data/
t0_control start
00_start_agent

In particular, check the PhEDExInjector component; if you see errors there, try restarting it after sourcing init.sh:
source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
$manage execute-agent wmcoreD --restart --component PhEDExInjector
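Before restarting, it can help to look at the component log for the actual error (a sketch; the exact install path varies between deployments, hence the find):

# Locate and show the tail of the PhEDExInjector component log
find /data/srv/wmagent/current/install -name ComponentLog -path '*PhEDExInjector*' -exec tail -n 50 {} \;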
Some required packages are missing:
+ for p in '$missingSeeds'
+ echo mesa-libGLU
mesa-libGLU
+ exit 1

One needs to install the package manually (with superuser access):
$ sudo yum install mesa-libGLU
sudo su -
cd /etc/condor/config.d/
-rw-r--r--. 1 condor condor  1849 Mar 19  2015 00_gwms_general.config
-rw-r--r--. 1 condor condor  1511 Mar 19  2015 01_gwms_collectors.config
-rw-r--r--  1 condor condor   678 May 27  2015 03_gwms_local.config
-rw-r--r--  1 condor condor  2613 Nov 30 11:16 10_cms_htcondor.config
-rw-r--r--  1 condor condor  3279 Jun 30  2015 10_had.config
-rw-r--r--  1 condor condor 36360 Jun 29  2015 20_cms_secondary_collectors_tier0.config
-rw-r--r--  1 condor condor  2080 Feb 22 12:24 80_cms_collector_generic.config
-rw-r--r--  1 condor condor  3186 Mar 31 14:05 81_cms_collector_tier0_generic.config
-rw-r--r--  1 condor condor  1875 Feb 15 14:05 90_cms_negotiator_policy_tier0.config
-rw-r--r--  1 condor condor  3198 Aug  5  2015 95_cms_daemon_monitoring.config
-rw-r--r--  1 condor condor  6306 Apr 15 11:21 99_local_tweaks.config
# Knob to enable or disable flocking
# To enable, set this to True (defragmentation is auto enabled)
# To disable, set this to False (defragmentation is auto disabled)
ENABLE_PROD_FLOCKING = True
ENABLE_PROD_FLOCKING = False
condor_reconfig
condor_config_val -master gsi_daemon_name
ps aux | grep "condor_negotiator"
kill -9 <replace_by_condor_negotiator_process_id>
sudo su -
cd /etc/condor/config.d/
# How to drain the slots
# graceful: let the jobs finish, accept no more jobs
# quick: allow job to checkpoint (if supported) and evict it
# fast: hard kill the jobs
DEFRAG_SCHEDULE = graceful
DEFRAG_SCHEDULE = fast
DEFRAG_SCHEDULE = graceful
watch "tail /data/tier0/srv/wmagent/current/install/tier0/Tier0Feeder/ComponentLog; tail /data/tier0/sminject/Logs/General.log; tail /data/tier0/srv/wmagent/current/install/tier0/JobCreator/ComponentLog"
watch "tail /data/TransferSystem/Logs/General.log; tail /data/TransferSystem/Logs/Logger/LoggerReceiver.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckManager.log; tail /data/TransferSystem/Logs/CopyCheckManager/CopyCheckWorker.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorManager.log; tail /data/TransferSystem/Logs/Tier0Injector/Tier0InjectorWorker.log"
cd /data/TransferSystem
./t0_control stop
./t0_control start
cd /data/tier0/sminject
./t0_control stop
./t0_control start
/afs/cern.ch/user/e/ebohorqu/public/HIStats/stats.py

For the analysis we need to define certain things:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/RecoStatsProcessing.json

With a separate script in R, I was reading and summarizing the data:
/afs/cern.ch/user/e/ebohorqu/public/HIStats/parse_cpu_info.R

There, the task type should be defined and also the output file. With this script I was just summarizing CPU data, but we could modify it a little to get memory data. Maybe it is quicker to do it directly with the first python script, if you prefer. That script calculates the efficiency of each job:
TotalLoopCPU / (TotalJobTime * numberOfCores)

and an averaged efficiency per dataset:
sum(TotalLoopCPU) / sum(TotalJobTime * numberOfCores)

numberOfCores was obtained from job.pkl, while TotalLoopCPU and TotalJobTime were obtained from report.pkl. The job type could be Processing, Merge or Harvesting. For the Processing type, the task could be Reco or AlcaSkim, and for the Merge type, ALCASkimMergeALCARECO, RecoMergeSkim, RecoMergeWrite_AOD, RecoMergeWrite_DQMIO, RecoMergeWrite_MINIAOD or RecoMergeWrite_RECO.
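As a worked example with made-up numbers (not taken from any real report): a job with TotalLoopCPU = 29000 s, TotalJobTime = 4000 s and numberOfCores = 8 has an efficiency of 29000 / (4000 * 8) ≈ 0.91:

# Illustrative numbers only
awk 'BEGIN { TotalLoopCPU=29000; TotalJobTime=4000; numberOfCores=8;
             print TotalLoopCPU / (TotalJobTime * numberOfCores) }'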
git checkout master
git fetch dmwm
git pull dmwm master
git push origin master
git checkout -b <branch-name> dmwm/master
git add <file-name>
git commit
git push origin <branch-name>
git commit --amend
git push -f origin <branch-name>
git branch -d <branch-name>

Other useful commands:
git branch
git status
git reset
git diff
git log
git checkout .
Path | Use | Who writes | Who reads | Who cleans |
---|---|---|---|---|
/eos/cms/store/t0streamer/ | Input streamer files transferred from P5 | Storage Manager | Tier-0 worker nodes | Tier-0 t0streamer area cleanup script |
/eos/cms/store/unmerged/ | Store output files smaller than 2GB until the merge jobs put them together | Tier-0 worker nodes (Processing/Repack jobs) | Tier-0 worker nodes(Merge Jobs) | ? |
/eos/cms/tier0/ | Files ready to be transferred to Tape and Disk | Tier-0 worker nodes (Processing/Repack/Merge jobs) | PhEDEx Agent | Tier-0 WMAgent creates and auto approves transfer/deletion requests. PhEDEx executes them |
/eos/cms/store/express/ | Output from Express processing | Tier-0 worker nodes | Users | Tier-0 express area cleanup script |
/eos/cms/store/t0streamer/

The SM writes raw streamer files there, and we delete them with the cleanup script, which runs as an acron job under the cmst0 account. It keeps data that has not been repacked yet, as well as data not older than 7 days. Repacking rewrites the streamer .dat files into PDs (raw .root files).
/eos/cms/store/unmerged/

This is where files that need to be merged into larger files go. Not all files go there. The jobs themselves manage it (after merging, the job deletes the unmerged files).
/eos/cms/store/express/

Express output after being merged. Jobs from the Tier0 write to it. Data deletions are managed by DDM.
For example, to list the requests for datasets matching /*/Tier0_REPLAY_vocms015*/* over node T2_CH_CERN, do:
https://cmsweb.cern.ch/phedex/datasvc/json/prod/requestlist?dataset=/*/Tier0_REPLAY_vocms015*/*&node=T2_CH_CERN

The request above will return a JSON resultset as follows:
{ "phedex": { "call_time": 9.78992, "instance": "prod", "request": [ { "approval": "approved", "id": 1339469, "node": [ { "decided_by": "Daniel Valbuena Sosa", "decision": "approved", "id": 1561, "name": "T2_CH_CERN", "se": "srm-eoscms.cern.ch", "time_decided": 1526987916 } ], "requested_by": "Vytautas Jankauskas", "time_create": 1526644111.24301, "type": "delete" }, { ... } ], "request_call": "requestlist", "request_date": "2018-07-17 20:53:23 UTC", "request_timestamp": 1531860803.37134, "request_url": "http://cmsweb.cern.ch:7001/phedex/datasvc/json/prod/requestlist", "request_version": "2.4.0pre1" } }The PhEDEx services not only allows you to create more detailed queries, but are faster than query the information on PhEDEx website.
/afs/cern.ch/user/c/cmst0/tier0_t0Streamer_cleanup_script/streamer_delete.log.
# Firstly
# source /data/tier0/admin/env.sh
# source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
# then you can use this simple script to retrieve a list of files from related runs.
# Keep in mind that in the below snippet we are ignoring all Express streams output.
# This is just a snippet, so it may not work out of the box.
from dbs.apis.dbsClient import DbsApi
from pprint import pprint
import os

dbsUrl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'
dbsApi = DbsApi(url = dbsUrl)

runList = [316569]

with open(os.path.join('/data/tier0/srv/wmagent/current/tmpRecovery/', "testRuns.txt"), 'a') as the_file:
    for a in runList:
        datasets = dbsApi.listDatasets(run_num=a)
        pprint(datasets)
        for singleDataset in datasets:
            pdName = singleDataset['dataset']
            if 'Express' not in pdName and 'HLTMonitor' not in pdName and 'Calibration' not in pdName and 'ALCALUMIPIXELSEXPRESS' not in pdName:
                datasetFiles = dbsApi.listFileArray(run_num=a, dataset=pdName)
                #print("For run %d the dataset %s", a, pdName)
                for singleFile in datasetFiles:
                    print(singleFile['logical_file_name'])
                    the_file.write(singleFile['logical_file_name']+"\n")
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ALCALUMIPIXELSEXPRESS'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'Express'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'HLTMonitor'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'Calibration'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ExpressAlignment'
and CMS_STOMGR.FILE_TRANSFER_STATUS.STREAM != 'ExpressCosmics'