Alan Malta Twiki Page
- Alan Malta Twiki Page
- Useful links and documents (TODO: need to review and clean up these links)
- Useful Monitoring Links
- CMSWEB related
- RelVal Validation
- Other general hints
- Workload Management (WMCore stuff)
- Setup for Rucio client on lxplus
- Locally killing (aborting/force-completing) a workflow inside an agent
- Pushing couchapps to couch
- Renewing myproxy in the production agents
- Adding a site to the WMAgent resource-control database
- Updating the thresholds for a site. Setting the same values for ALL task types.
- Handy database queries
- Getting the number of files available for unfinished subscriptions (provided a workflow name)
- Getting the number of jobs in the 'created' status by workflows
- Getting a map of workflow -> subscriptions -> fileset for a specific workflow:
- Investigating held jobs in bossAir and wmbs
- Getting the number of jobs in each status for a specific workflow
- Provided a workflow task and an input file, get all the wmbs id that processed part of that input file and failed
- Check number of jobs in each job state
- Provided a workflow, get an overview of the number of jobs in each state and job type.
- Provided a workflow, get an overview of the number of jobs for each task, in each state, job type and job outcome (1 means success).
- Provided a workflow, list the WMBS job id, state and job type for every job.
- Getting an overview of wmbs_job: number of jobs in each wmbs_job state, grouped by the job type and state.
- Getting an overview of bl_runjob: number of jobs in each schedd_status and their status in the agent (active or not).
- Count the job types in created status (in the JobSubmitter queue)
- Get acquired files, subscription and location for a specific workflow id:
- Get the job mask and the amount of input files each wmbs job is supposed to process:
- Get the job mask for every job in a workflow, ordered by workflow task:
- Given a file name, find all its run/lumi mapping:
- Provided a block name (and files if needed), find out which workflow created them:
- Inserting a new PNN into the database and creating its PSN to PNN association (Oracle backend)
- Inserting a new PNN into the database and creating its PSN to PNN association (MySQL backend)
- Unregistering a WMAgent agent from WMStats watch
- WMAgent deployment
- Injecting a PSet configuration file into CouchDB
- Unpickling a WMAgent job config
- Looking for a specific LFN in the agent condor logs
- Running a production job interactively
- Changing schedd prio factor in the production pool
- Update block (and file) location:
- Retrieving logs from castor@CERN:
- Injecting and running a test workflow with reqmgr.py
- SQL hints
- Workflow of request injection into ReqMgr
- Debugging MySQL (MariaDB) issues ('lock wait timeout exceeded'):
- My Personal Preferences
- Related Topics
Useful links and documents (TODO: need to review and clean up these links)
General links
Important Links
Production Statistics
Grid and Transfer Issues
- Downtimes - CMS scheduled and unscheduled downtimes
- SiteReadiness - T1 site availability
- NagiosT2Tests - SAM availability based on Nagios tests (latest results for T2 sites)
- SiteDB - Information about sites and site admins
- PledgeSlots - To check the pledged number of slots
Useful Monitoring Links
CMSWEB related
In order to restart a service running in the CMSWEB cluster (CouchDB in this example), the following command needs to be executed:
sudo -H -u _couchdb bash -l -c "/data/srv/current/config/couchdb/manage restart 'I did read documentation'"
where restart can be replaced by one of: status | stop | start.
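The same pattern applies to the other CMSWEB services, each running under its own service account; for example, to check the status of ReqMgr2 (the _reqmgr2 account name is an assumption here):
sudo -H -u _reqmgr2 bash -l -c "/data/srv/current/config/reqmgr2/manage status 'I did read documentation'"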
- Check if your jobs are matching in the FE, i.e. whether the frontend is requesting glideins for them:
condor_status -any -const 'MyType=="glideresource"&&GLIDEIN_CMSSite=="T1_US_FNAL"' -format '%5i' GlideFactoryMonitorRequestedIdle -format '%s\n' Name | grep main
- Check if new glideins are getting into the queues (if you find 0s, then ask the FE expert):
condor_status -any -const 'MyType=="glideresource"&&GLIDEIN_CMSSite=="T1_US_FNAL"' -format '%5i' GlideFactoryMonitorStatusPending -format '%s\n' Name | grep main
- Check if there are running glideins (notify the FE expert in case of discrepancy):
condor_status -any -const 'MyType=="glideresource"&&GLIDEIN_CMSSite=="T1_US_FNAL"' -format '%5i' GlideFactoryMonitorStatusRunning -format '%s\n' Name | grep main
SVN commands
svn co svn+ssh://svn.cern.ch/reps/WmAgentScripts ./WmAgentScripts
- Adding a file into SVN, note that the file/dir won't appear unless you do an svn commit afterwards:
svn add <file/dir>
- Recursively commits all the changed, added and deleted files to the SVN server:
svn commit -m "<your message>"
GIT commands
- Changing files from an existing commit in a PR:
git reset --soft HEAD~1
git reset FILENAME ### to unstage a file
git checkout -- FILENAME ### to rollback that file to its original state before that commit
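After editing the files, you still need to re-commit and force-push to refresh the PR; a minimal sketch, assuming origin points to your fork and you are on the PR branch:
git add FILENAME
git commit -m "<your new message>"
git push -f origin <your_branch>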
LSF commands
- Getting information about LSF host configuration, including host type, host model, CPU normalization factor, number of CPUs, maximum memory, and available resources:
lshosts -l HOST
- Getting information about the server hosts in the LSF Batch System:
bhosts -l HOST
- Getting a list of hosts that fulfill the resource requirements:
bhosts -R "select[type==SLC6_64]"
- Submitting a LSF job to SLC6 worker nodes:
bsub -R "select[type==SLC6_64]" -q cmsrelval -J "$job" < $job
Condor job status
- 0 - Unexpanded U
- 1 - Idle I
- 2 - Running R
- 3 - Removed X
- 4 - Completed C
- 5 - Held H
- 6 - Submission_err E
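These numeric codes live in the JobStatus ClassAd attribute, so they can be used directly in queries; e.g., a small sketch listing all held jobs together with their hold reason:
condor_q -const 'JobStatus == 5' -af ClusterId ProcId HoldReason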
Hints for cmsDriver.py
This tool creates production configuration files. For example the following command:
# cmsDriver.py SingleMuPt1_cfi.py -s GEN,FASTSIM --conditions=MC_3XY_V10::All --eventcontent=RECOSIM -n 10 --no_exec
- SingleMuPt1_cfi.py: the configuration fragment whose settings go into the config file to be created (placed in Configuration/Generators/).
- GEN,FASTSIM: runs these steps of CMSSW, generation followed by fast simulation.
- MC_3XY_V10::All: the global tag that will be used to apply the alignment and calibration constants.
- RECOSIM: the output file of this job will contain RECOSIM content.
- 10: the number of events to process
Hints for WMStats
One could use the following couch query to get the status timestamps and calculate the time that the workflow took to complete:
https://cmsweb.cern.ch/couchdb/wmstats/_all_docs?keys=%5B%22jen_a_ACDC_Fall53_2011A_METBTag_Run2011A-v1_Prio1_5312p1_131101_030858_2617%22%5D&include_docs=true
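For instance, the document can be fetched with curl and pretty-printed to inspect the status transition timestamps (the exact layout of the request_status field is an assumption):
curl -ks --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb.cern.ch/couchdb/wmstats/_all_docs?keys=%5B%22WORKFLOW_NAME%22%5D&include_docs=true" | python -m json.tool | grep -A3 request_status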
Other general hints
Getting CERN users info
phonebook -lo username
phonebook full_name
Hints for PhEDEx Data Service
- To get information regarding the dataset/block/file, lfn, checksum, etc:
https://cmsweb.cern.ch/phedex/datasvc/xml/prod/data?file=/store/data/Run2010A/EG/RAW-RECO/WZEG-Nov4Skim_v1/0042/7A527E82-31ED-DF11-855E-003048D439C6.root
- Checking request information for a given dataset:
https://cmsweb-testbed.cern.ch/phedex/datasvc/xml/prod/requestlist?dataset=/EG/Run2010A-WZEG-Nov4Skim_v1/RAW-RECO
- Checking request information for a given dataset, transfer type only and destined to T2_DE_DESY only:
https://cmsweb-testbed.cern.ch/phedex/datasvc/xml/prod/requestlist?dataset=/ReggeGribovPartonMC_EposLHC_pPb_4080_4080_DataBS/pPb816Spring16GS-MB_80X_mcRun2_pA_v4-v2/GEN-SIM&type=xfer&node=T2_DE_DESY
- To get information regarding the block (filtering according to the node):
https://cmsweb.cern.ch/phedex/datasvc/xml/prod/blockreplicas?dataset=/Jet/Run2010B-RelValRawSkim-v1/RAW&node=T1_US_FNAL*
- Checking the TFC for a given site:
https://cmsweb.cern.ch/phedex/datasvc/xml/prod/tfc?node=T2_CH_CERN
- To get the PFN of a given file:
https://cmsweb.cern.ch/phedex/datasvc/perl/prod/lfn2pfn?node=T1_FR_CCIN2P3_MSS&lfn=/store/unmerged/logs/prod/2012/12/13/spinoso_HIG-Summer12-00976_R2087_B222_00_LHE_121202_135447_6236/MonteCarloFromGEN/0160/0/013c8d40-4238-11e2-8352-003048f0e3f4-26796-0-logArchive.tar.gz&protocol=srmv2
- Checking for all DQM samples at T2_CH_CERN:
https://cmsweb.cern.ch/phedex/datasvc/xml/prod/data?node=T2_CH_CERN&dataset=*4_4_2*DQM
- Getting the total number of files for a given dataset:
curl -ks --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb.cern.ch/phedex/datasvc/perl/prod/blockreplicas?dataset=/QCD_Pt-30to50_MuEnrichedPt5_TuneZ2_7TeV_pythia6/Summer11Leg-START53_LV4-v1/GEN-SIM" | grep -A1 NAME | grep FILES | sed 's/,//g' | awk '{FILES+=$3}END{printf("total files: %d\n",FILES)}'
- Getting all samples at T2_CH_CERN that match a specific regex:
curl -k -H 'Accept: text/xml' 'https://cmsweb.cern.ch/phedex/datasvc/perl/prod/blockreplicas?dataset=/SingleElectron/CMSSW_7_4_0_pre8*/*&node=T2_CH_CERN'
Hints for DBS3 API
- Querying for a dataset and its status:
# https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datasets?dataset=/RelValSingleMuPt100/CMSSW_7_0_0_pre11-START70_V4_FastSim_TEST_Agent0988_Validation_RelVal-v1/GEN-SIM-DIGI-RECO&dataset_access_type=*
- Requesting dataset/block details, such as number of files, number of events and total file size:
# https://cmsweb-testbed.cern.ch/dbs/prod/global/DBSReader/blocksummaries?dataset=/RelValSingleMuPt100/CMSSW_7_0_0_pre11-START70_V4_FastSim_TEST_Agent0988_Validation_RelVal-v1/GEN-SIM-DIGI-RECO
or
# https://cmsweb-testbed.cern.ch/dbs/prod/global/DBSReader/blocks?dataset=/RelValSingleMuPt100/CMSSW_7_0_0_pre11-START70_V4_FastSim_TEST_Agent0988_Validation_RelVal-v1/DQM&detail=True
Git hints
- Getting stats from the diff between two branches.
# git diff --stat --color comp..comp_gcc481
- Retrieving the diff of a file between two branches:
# git diff --color comp..comp_gcc481 reqmon.spec
RPM hints
- Listing all RPMs installed on the system and grepping for expat packages:
# rpm -qa | grep expat
- Show all the files installed by a given package
# rpm -ql expat-2.0.1-11.el6_2.x86_64
- Listing symbols provided by a given library:
# nm -a libexpat.so
- Listing files stored in an rpm file:
# rpm -qpl file.rpm
Hints for T1_US_FNAL
- Copying a file from FNAL to your machine:
# dccp dcap://cmsdca3.fnal.gov:24145/pnfs/fnal.gov/usr/cms/WAX/11/store/data/Run2010A/EG/RAW-RECO/WZEG-Nov4Skim_v1/0043/6AFF1EFA-3FED-DF11-AAF0-002481E1512E.root .
Troubleshooting Links
Linux Administration
- Tracing a process and sending its system calls/logs to a file:
# strace -r -p PID -f -o outputfile
- Collecting overall IOPS over a 30-minute interval (for the vdb device). We then calculate the average of these values a) considering all samples and b) discarding 0s, and finally list the largest values:
# iostat 1 1800 -d vdb | grep vdb | awk '{ print $2; }' > iops_30min
# cat iops_30min | awk '{ sum += $1 } END { if (NR > 0) print sum / NR }'
# egrep -v '^0.00' iops_30min | awk '{ sum += $1 } END { if (NR > 0) print sum / NR }'
# cat iops_30min | sort -n | tail
- Rsync'ing a file with a cheaper SSH cipher:
# rsync -e "ssh -c arcfour" /data/srv/logs/workqueue/old-logs-20141102-0107.zip amaltaro@alan-wmacloud:/data/amaltaro/
- For the sake of my memory, dumping the last 100 lines of every component log:
# for comp in `cat ../components`; do echo $comp; tail -n100 install/wmagent/$comp/ComponentLog; echo ""; done > currentLogs
- Querying CMSWEB from SL6 nodes:
# wget -q --certificate=$X509_USER_PROXY --private-key=$X509_USER_PROXY --ca-certificate=$X509_USER_PROXY "https://cmsweb.cern.ch/sitedb/data/prod/whoami" -O -
- Applying WMAgent patches:
# wget https://github.com/dmwm/WMCore/pull/$PR_NUM.patch -O - | patch -d apps/wmagent/lib/python2.6/site-packages -p 3
- Removing applied WMAgent patches:
# wget https://github.com/dmwm/WMCore/pull/$PR_NUM.patch -O - | patch -d apps/wmagent/lib/python2.6/site-packages -p 3 -R
or
# patch -p0 -R < patch_file_path.patch
- Component monitoring with `ps` as done in WMCore:
# ps -p PID_NUMBER -o pid,ppid,rss,vsize,pcpu,pmem,cmd -ww | grep PID_NUMBER
- Listing the background linux processes:
# jobs
- Converting an ugly PSet.py into a beautiful one :) :
# python -i PSet.py
>>> print process.dumpPython()
- Adding and committing a file to svn:
# svn add CheckWorkQueueElements.py
# svn commit mc/CheckWorkQueueElements.py
- To list all files greater than ~10G:
# find . -type f -size +10G -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
- Find a given wmbs_id in the local agent:
# find . -name \*job_342678\*
- To check which kind of shell you are using:
# echo $0
- Byte-compiling all Python modules under a directory:
# python -m compileall <DIR>
- To remove lines that contain a string:
sed '/<string>/d' <Input_file> > <Output_file>
- To replace a string with a new one in several files:
sed -i 's/<old_string>/<new_string>/g' <Files>*
- Replacing a comma-separated list of runs with a pipe-separated one:
perl -p -i -e 's/([0-9]{6}),(?=[0-9]{6})/\1|/g' cmsDriver_standard_hlt.txt
- To print only directories
# ls -d */
- To convert an epoch timestamp to human-readable time:
# date -d @<epoch_time>
- To check the used space at T2_CH_CERN for the RelVal team (logical bytes only):
# eoscms quota | grep -A 5 /relval
- List content in CMS EOS space at CERN:
# eoscms ls /store/logs/prod/2015/10/WMAgent/
- Copying files over with XROOTD (you need your proxy and the PFN from storage.xml):
# xrdcp root://cmsxrootd.fnal.gov//store/unmerged/logs/prod/2016/10/21/mewu_RVCMSSW_8_1_0_pre13RunZeroBias2015D__RelVal_zb2015D_161019_140327_9153/HLTDR2_25ns/0000/1/025cb918-9755-11e6-ab8f-02163e00d7b3-0-1-logArchive.tar.gz .
- Copy a file from CMS EOS storage to your local disk:
# xrdcp root://eoscms.cern.ch//eos/cms//store/logs/prod/2015/10/WMAgent/amaltaro_MonteCarlo_Ext_alan_TEST_StageOut_v15_151001_131202_7326/amaltaro_MonteCarlo_Ext_alan_TEST_StageOut_v15_151001_131202_7326-LogCollect-s1wn80-1-logs.tar .
- Checking your AFS quota:
# fs lq ~
- To check information related to a given file in EOS:
# eos fileinfo $LFN
- There are two output streams: stdout for normal output and stderr for errors and warnings.
# command > myfile # will redirect the stdout of command to myfile
# command 2> myfile # will redirect the stderr of command to myfile
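- Both streams can also be redirected in one command:
# command > out.txt 2> err.txt # stdout and stderr to separate files
# command > myfile 2>&1 # both streams to the same file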
- Show all output lines of command except those containing the string str.
# command | grep -v <str>
- To check what services should be automatically started:
# ntsysv
- Listing your cron jobs (fields: minute hour day month week_day command):
# crontab -l
- Killing all processes of a given user:
# pkill -KILL -u <user>
- Command line to comment out all lines in a file (using vim):
:1,$s/^/#/g
- Command to erase the first word in all lines of the file (using vim):
:%norm dw
- Command to find all files ending in .ch:
# find . -name "*.ch"
- To check the selinux status on SLC5:
# cat /etc/sysconfig/selinux
- To list the active internet connections:
# netstat -atunp
- To check if a service is set up in the startup of the machine:
# chkconfig --list | grep <service>
# chkconfig --add <service>
# chkconfig --level <run_levels> <service> on // e.g.: chkconfig --level 2345 pool on
# chkconfig <service> off
- Setting the default gateway and listing the routing table:
# route add default gw <ip>
# route -n
- How to list, remove and install a new kernel:
# rpm -q kernel
# rpm -ev <kernel_returned_above>
# yum install kernel kernel-devel
- Checking the state of a hard disk (in a non-destructive way):
# badblocks -o badblocks_found.dat -n -v /dev/sda4
or
# fsck.ext3 -n -v /dev/sda4
- sed: text manipulation such as replace/delete/etc.:
# sed 's/old_string/new_string/g' <file_in.job >file_out.job
# echo $1 | sed 's/\*//g' # removing the stars from the string
- awk: print only the selected columns of the file, adding a space between the second and fifth columns:
# awk '{print $1 $2 " " $5}' <filename>
- nohup: Keeps a program running in background even after you log off from the terminal. Example:
# nohup program &
- grep: looking for a pattern (in this case, X86_64)
- -R: recursively
- -H: print the filename for each match
- -n: prefix each line of output with the line number within its input file
- -s: suppress error messages about nonexistent or unreadable files
- -I: process a binary file as if it did not contain matching data
- -i: ignore case distinctions in both the PATTERN and the input files
# grep -R -H -I -n -s X86_64 *
- networking: location of the files used to configure the network interfaces
# cd /etc/sysconfig/network-scripts/
- Cleaning the bash history:
# history -c
- Checking out a given branch:
# cvs checkout -r branch_52X Configuration/PyReleaseValidation
Grid Administration
- Commands related to proxy:
# voms-proxy-init -debug -voms cms # the cert/key pair of files should be placed under ~/.globus
# voms-proxy-init --valid 192:00 -voms cms:/cms/Role=production
# globusrun -a -r osgce.hepgrid.uerj.br # testing authentication
# voms-proxy-info -all -file <your proxy>
- Copying the stageout file into your localhost
# globus-url-copy -vb gsiftp://osgce.hepgrid.uerj.br//share/apps/scratch/<user>-stageout/<file> file://localhost//`pwd`/<file>
# srmcp -debug -2 "srm://se-dcache.hepgrid.uerj.br:8443//srm/managerv2?SFN=/pnfs/hepgrid.uerj.br/data/cms/store/user/amaltaro/outfile.root" file:///`pwd`/Alan_output.root
# lcg-cp -b -D srmv2 --vo cms "srm://se-dcache.hepgrid.uerj.br:8443//srm/managerv2?SFN=/pnfs/hepgrid.uerj.br/data/cms/store/user/amaltaro/outfile.root" file:///`pwd`/Alan_output.root
# lcg-cp -b -D srmv2 --vo cms "srm://cmssrm.fnal.gov:8443/srm/managerv2?SFN=/11/store/data/Run2012A/MET/RAW/v1/000/191/226/1859BA1D-DA85-E111-ABB7-003048F118DE.root" file:///`pwd`/Alan_output.root
# lcg-ls -b -D srmv2 --vo cms "srm://se-dcache.hepgrid.uerj.br:8443/srm/managerv2?SFN=/pnfs/hepgrid.uerj.br/data/cms/store/user/amaltaro/"
# lcg-cp -v -b -D srmv2 srm://storm-fe-cms.cr.cnaf.infn.it:8444/srm/managerv2?SFN=/cms$tarball_lfn file:////`pwd`/logArchive.tar.gz
# srmls "srm://se-dcache.hepgrid.uerj.br:8443/srm/managerv2?SFN=/pnfs/hepgrid.uerj.br/data/cms"
- Manually updating CRLs (SE and GridFTP, respectively):
# /opt/vdt/fetch-crl/sbin/fetch-crl --loc /opt/vdt/globus/TRUSTED_CA/ --out /opt/vdt/globus/TRUSTED_CA/
# /opt/vdt-client/fetch-crl/sbin/fetch-crl --loc /opt/vdt-client/globus/TRUSTED_CA/ --out /opt/vdt-client/globus/TRUSTED_CA/
- Submitting jobs via globus:
# globus-job-run osgce.hepgrid.uerj.br:/jobmanager-condor /bin/uname -a
# globus-job-run red.unl.edu:/jobmanager-condor /bin/uname -a
# globus-job-run cit-gatekeeper.ultralight.org:/jobmanager-condor /bin/uname -a
# condor_hold -reason 'alan messing around' 6502.11
# condor_rm -const 'CMS_JobType =?= "Merge"'
# condor_rm -const 'WMAgent_RequestName == "amaltaro_ACDC3_task_SUS-RunIIWinter15wmLHE-00039__v1_T_150817_164946_9264"'
# condor_submit <submit.file>
# watch -n 1 'condor_q'
# condor_qedit 763.0 RequestMemory 2200
# condor_qedit 261.0 DESIRED_Sites "\"T1_US_FNAL\""
# condor_qedit -const 'JobStatus=?=1 && (CMS_JobType =?= "Production" || CMS_JobType =?= "Processing")' JobPrio 999999
# condor_qedit -const 'JobStatus=?=1 && isUndefined(REQUIRED_OS)' REQUIRED_OS "\"any\""
# condor_q -af:h MaxWallTimeMins
# condor_q -af:h WMAgent_SubTaskName RequestCpus RequestMemory RequestDisk MachineCount AccountingGroup MaxWallTimeMins
# condor_rm -name analysis4@analysis4.local64 <user>
# condor_q -global | grep running
# condor_q -hold
# condor_status -submitters
# condor_status -run
# condor_rm <job_id>
# condor_q -bet <job_id>
# condor_q -global -better-analyze <job_id>
# condor_prio -p 20 <job_id> ### Giving the maximum priority in the local queue
# condor_history -const 'regexp("_150903_",WMAgent_RequestName)' -format "%s " MaxWallTimeMins -format "%s " RequestDisk -format "%s \n" WMAgent_SubTaskName
# condor_userprio -all -allusers
# condor_history -const 'WMAgent_RequestName == "amaltaro_RV620pre5slc6_TEST_SL6_FNAL_SingleMuPt100_131010_150812_226"' -format "%s " LastRemoteHost -format "%s " REQUIRED_OS -format "%s \n" WMAgent_SubTaskName
- Getting a total number of autoclusters or listing them in a given schedd
# condor_status -schedd -af:h name Autoclusters | grep submit
# condor_q -name vocms0250.cern.ch -autocluster
- Checking where a workflow ran:
# condor_history -const 'WMAgent_RequestName == "amaltaro_ReDigi_cmsRun2_LumiMask_DMWM_TEST_ALAN_location_v4_160210_100338_3679"' -af:h DESIRED_Sites MachineAttrGLIDEIN_CMSSite0
- Looking for condor jobs running in a specific site:
# condor_q -w -constraint 'MachineAttrGLIDEIN_CMSSite0 == "'T2_DE_DESY'"'
- Condor history to get the exact reason for condor remove:
# condor_history -b -m 1 15824.1 -format '%s ' DiskUsage -format '%s ' NumShadowStarts -format '%s\n' NumJobStarts
- Get the number of running jobs per site and per schedd (run it from any agent):
# condor_q -g -const 'MachineAttrGLIDEIN_CMSSite0=?="T2_CH_CERN"' -format '%s\n' User | sort | uniq -c
- Query for jobs pending with T3_US_NERSC in the DESIRED_Sites list:
# condor_q -const 'JobStatus==1 && stringListIMember("T3_US_NERSC", DESIRED_Sites,",")' -af WMAgent_SubTaskName DESIRED_Sites
- Changing a WMAgent schedd priority in the condor collector (vocms97):
# condor_userprio -all -allusers
# condor_userprio -setfactor cmst1@vocmsxxx.cern.ch 100000.00
- CMSSW/scram basics:
# scramv1 list CMSSW
# scramv1 project -n <project_name> CMSSW CMSSW_4_1_2_patch1
# eval `scramv1 runtime -sh` // or just "cmsenv"
- CRAB client basics:
# crab -create
# crab -validateCfg
# crab -submit
# crab -status
# crab -status -c <project_dir>
# crab -getoutput
# crab -listMatch
# crab -uploadLog
- Checking the validity of the certificates (CEs):
# openssl x509 -in /data/srv/current/auth/proxy/proxy.cert -noout -dates -subject
# openssl x509 -in /etc/grid-security/hostcert.pem -dates -noout
# openssl x509 -noout -subject -in /etc/grid-security/hostcert.pem | sed -e 's/^subject= //'
# openssl x509 -text -in /etc/grid-security/hostcert.pem | grep Subject: # to get the hosts' DN
# openssl x509 -dates -in /etc/grid-security/http/httpcert.pem -noout
# openssl x509 -dates -in /etc/grid-security/rsvcert.pem -noout
# openssl x509 -dates -in /etc/grid-security/containercert.pem -noout
- Starting/stopping services on OSGCE:
# source /opt/osg/setup.sh
# vdt-control -off
# vdt-control -list
# vdt-control -<on/off> <service>
- Logfiles to check the GUMS service:
# tail -f /opt/gums/tomcat/v55/logs/gums-service-admin.log
# vim /etc/grid-security/grid-mapfile
- Extracting the *.pem certificate/key from the *.p12 file:
# openssl pkcs12 -clcerts -nokeys -in cert.p12 -out usercert.pem
# openssl pkcs12 -nocerts -in cert.p12 -out userkey.pem
- Making a DBS query using the CLI (don't forget to initialize your proxy):
# dbs listBlocks --report --path=/MinimumBias/Commissioning10-Sep17ReReco_v2/RECO
# find dataset, block, block.size where dataset=/HIAllPhysics/HIRun2010-v1/RAW and run=152698
- Making a PhEDEx XML query through the URL (if you wish, you can replace "xml" by "json"):
# https://cmsweb.cern.ch/phedex/datasvc/xml/prod/blockreplicas?node=T2_CH_CERN&dataset=*4_4_2*DQM
Workload Management (WMCore stuff)
Setup for Rucio client on lxplus
source /cvmfs/cms.cern.ch/cmsset_default.sh
voms-proxy-init -voms cms -rfc -valid 192:00
source /cvmfs/cms.cern.ch/rucio/setup-py3.sh
export RUCIO_ACCOUNT=`whoami`
If you want to query the Integration instance, you can use it like this:
rucio --host http://cms-rucio-int.cern.ch --auth-host https://cms-rucio-auth-int.cern.ch whoami
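Once the client is set up, a couple of handy calls (a sketch; the block/container name below is hypothetical):
rucio list-rules --account $RUCIO_ACCOUNT
rucio list-dataset-replicas cms:/SomePrimaryDataset/SomeEra-v1/AODSIM#some-block-uuid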
Locally killing (aborting/force-completing) a workflow inside an agent
$manage execute-agent kill-workflow-in-agent pdmvserv_task_HIN-pPb816Spring16GS-00123__v1_T_170913_201506_4093
Pushing couchapps to couch
$manage execute-agent wmagent-couchapp-init
Renewing myproxy in the production agents
From one of the agents, check the timeleft for the long term proxy:
(export X509_USER_CERT=/data/certs/myproxy.pem; export X509_USER_KEY=/data/certs/myproxy.pem; myproxy-info -v -l amaltaro -s "myproxy.cern.ch" -k amaltaroCERN)
Renewing the "amaltaroFNAL" long term proxy. Connect to a FNAL node or cmslpc cluster and run:
myproxy-init -v -l amaltaro -c 720 -t 168 -s "myproxy.cern.ch" -x -Z "/DC=org/DC=incommon/*/CN=(wmagent/|)wmagent.fnal.gov" -n -k amaltaroFNAL
Renewing the "amaltaroCERN" proxy. Connect to vocms0192 and run:
myproxy-init -v -l amaltaro -c 720 -t 168 -s "myproxy.cern.ch" -x -Z "/DC=ch/DC=cern/OU=computers/CN=wmagent/(vocms0192|vocms0255).cern.ch" -n -k amaltaroCERN
Adding a site to the WMAgent resource-control database
Method 1: if the site is properly registered in SiteDB, including a proper PSN and PNN
$manage execute-agent wmagent-resource-control --plugin=SimpleCondorPlugin --pending-slots=100 --running-slots=200 --add-one-site T3_US_Colorado
Method 2: mostly for T3s or opportunistic resources that don't have a PNN (storage). Be *careful*, because wrong information can break AgentStatusWatcher
$manage execute-agent wmagent-resource-control --site-name=T1_US_FNAL --cms-name=T1_US_FNAL --ce-name=T1_US_FNAL --pnn=T1_US_FNAL_Disk --plugin=SimpleCondorPlugin --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_US_FNAL --task-type=Processing --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_US_FNAL --task-type=Production --pending-slots=1500 --running-slots=4000
$manage execute-agent wmagent-resource-control --site-name=T1_US_FNAL --task-type=Merge --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_US_FNAL --task-type=Cleanup --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_US_FNAL --task-type=LogCollect --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_US_FNAL --task-type=Skim --pending-slots=50 --running-slots=50
$manage execute-agent wmagent-resource-control --site-name=T1_US_FNAL --task-type=Harvesting --pending-slots=10 --running-slots=20
Updating the thresholds for a site. Setting the same values for ALL task types.
$manage execute-agent wmagent-resource-control --site-name=T1_US_FNAL --pending-slots=100 --running-slots=1000 --apply-to-all-tasks
Handy database queries
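Most of these queries can be run from the agent database prompt; if my memory serves, the manage script provides a target for it (this target name is an assumption):
$manage db-prompt wmagent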
Getting the number of files available for unfinished subscriptions (provided a workflow name)
SELECT wmbs_subscription.id AS subId, wmbs_workflow.task AS wfName, wmbs_workflow.injected As wfInj, wmbs_fileset.name AS fsetName,
wmbs_fileset.open AS fsetOpen, wmbs_subscription.finished AS subFin, wmbs_sub_types.name AS subType, count(wmbs_sub_files_available.fileid) AS filesAvail
FROM wmbs_subscription
INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id
INNER JOIN wmbs_sub_files_available ON wmbs_sub_files_available.subscription = wmbs_subscription.id
WHERE wmbs_workflow.name='pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573'
AND wmbs_subscription.finished = 0
GROUP BY wmbs_subscription.id, wmbs_workflow.task, wmbs_workflow.injected, wmbs_fileset.name, wmbs_fileset.open,
wmbs_subscription.finished, wmbs_sub_types.name;
Getting the number of jobs in the 'created' status by workflows
SELECT wmbs_workflow.name, count(wmbs_job.id)
FROM wmbs_job
INNER JOIN wmbs_jobgroup ON
wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON
wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_sub_types ON
wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_job_state ON
wmbs_job.state = wmbs_job_state.id
INNER JOIN wmbs_workflow ON
wmbs_subscription.workflow = wmbs_workflow.id
WHERE wmbs_job_state.name = 'created'
GROUP BY wmbs_workflow.name;
Getting a map of workflow -> subscriptions -> fileset for a specific workflow:
# SELECT wmbs_subscription.id AS subId, wmbs_workflow.task AS wfName, wmbs_workflow.injected As wfInj, wmbs_fileset.name AS fsetName,
wmbs_fileset.open AS fsetOpen, wmbs_subscription.finished AS subFin, wmbs_sub_types.name AS subType
FROM wmbs_subscription
INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id
WHERE wmbs_workflow.name='sagarwal_task_TOP-RunIIFall18wmLHEGS-00109__v1_T_181105_200726_3792'
AND wmbs_fileset.open = 1
AND wmbs_subscription.finished = 0
ORDER BY fsetName;
Investigating held jobs in bossAir and wmbs
SELECT rj.id, rj.wmbs_id, rj.grid_id, rj.status, rj.sched_status, rj.status_time,
wmbs.state AS wmbs_state, wmbs.state_time AS wmbs_state_time, wmbs.outcome AS wmbs_outcome, wmbs.cache_dir
FROM bl_runjob rj INNER JOIN wmbs_job wmbs ON rj.wmbs_id = wmbs.id
WHERE sched_status = 5;
Getting the number of jobs in each status for a specific workflow
SELECT wmbs_job_state.name, count(wmbs_job.id)
FROM wmbs_job
INNER JOIN wmbs_jobgroup ON
wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON
wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_sub_types ON
wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_job_state ON
wmbs_job.state = wmbs_job_state.id
INNER JOIN wmbs_workflow ON
wmbs_subscription.workflow = wmbs_workflow.id
WHERE wmbs_workflow.name = 'pdmvserv_task_HIG-RunIIWinter15wmLHE-00561__v1_T_151207_101950_7247'
GROUP BY wmbs_job_state.name;
Provided a workflow task and an input file, get all the wmbs id that processed part of that input file and failed
SELECT wj.id AS jobid, wfd.id AS fileid, wfd.events, wfd.first_event, wfd.merged
FROM wmbs_file_details wfd
INNER JOIN wmbs_job_assoc wja ON wja.fileid = wfd.id
INNER JOIN wmbs_job wj ON wj.id = wja.job
INNER JOIN wmbs_jobgroup wjg ON wjg.id = wj.jobgroup
INNER JOIN wmbs_subscription ws ON ws.id = wjg.subscription
INNER JOIN wmbs_workflow ww ON ww.id = ws.workflow
WHERE wfd.lfn = 'MCFakeFile-pdmvserv_SUS-RunIISpring16FSPremix-00055_00034_v0__160929_122033_2363-Production-fa3a91bc780de492218c2fe34047a1c3'
AND wj.outcome=0 AND ww.task = '/pdmvserv_SUS-RunIISpring16FSPremix-00055_00034_v0__160929_122033_2363/Production';
Check number of jobs in each job state
SELECT wmbs_job_state.name, COUNT(*) FROM wmbs_job
JOIN wmbs_job_state ON (wmbs_job.state = wmbs_job_state.id)
GROUP BY wmbs_job.state, wmbs_job_state.name;
Provided a workflow, get an overview of the number of jobs in each state and job type.
SELECT wmbs_sub_types.name AS job_type, wmbs_job_state.name as state, COUNT(*) AS count FROM wmbs_job
INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_workflow ON wmbs_workflow.id = wmbs_subscription.workflow
INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_job_state ON wmbs_job.state = wmbs_job_state.id
WHERE wmbs_workflow.name = 'cmsunified_ACDC0_task_TOP-RunIISummer15wmLHEGS-00341__v1_T_191210_085250_7580'
GROUP BY wmbs_sub_types.name, wmbs_job_state.name;
Provided a workflow, get an overview of the number of jobs for each task, in each state, job type and job outcome (1 means success).
SELECT wmbs_workflow.task, wmbs_sub_types.name AS job_type, wmbs_job_state.name as state, wmbs_job.outcome, COUNT(*) AS count FROM wmbs_job
INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_workflow ON wmbs_workflow.id = wmbs_subscription.workflow
INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_job_state ON wmbs_job.state = wmbs_job_state.id
WHERE wmbs_workflow.name = 'pdmvserv_task_TOP-RunIISummer15wmLHEGS-00427__v1_T_191009_224926_441'
GROUP BY wmbs_workflow.task, wmbs_sub_types.name, wmbs_job_state.name, wmbs_job.outcome
ORDER BY wmbs_workflow.task;
Provided a workflow, list the WMBS job id, state and job type for every job.
SELECT wmbs_sub_types.name AS job_type, wmbs_job_state.name as state, wmbs_job.id FROM wmbs_job
INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_workflow ON wmbs_workflow.id = wmbs_subscription.workflow
INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_job_state ON wmbs_job.state = wmbs_job_state.id
WHERE wmbs_workflow.name = 'cmsunified_ACDC0_task_TOP-RunIISummer15wmLHEGS-00341__v1_T_191210_085250_7580'
ORDER BY wmbs_sub_types.name, wmbs_job_state.name;
Getting an overview of wmbs_job: number of jobs in each wmbs_job state, grouped by the job type and state.
SELECT wmbs_sub_types.name AS job_type, wmbs_job_state.name as state, COUNT(*) AS count FROM wmbs_job
INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_job_state ON wmbs_job.state = wmbs_job_state.id
GROUP BY wmbs_sub_types.name, wmbs_job_state.name;
Getting an overview of bl_runjob: number of jobs in each schedd_status and their status in the agent (active or not).
SELECT bl_status.name AS sched_status, status AS active, count(*) AS count
FROM bl_runjob
INNER JOIN bl_status ON bl_runjob.sched_status = bl_status.id
LEFT OUTER JOIN wmbs_users ON wmbs_users.id = bl_runjob.user_id
INNER JOIN wmbs_job ON wmbs_job.id = bl_runjob.wmbs_id
GROUP BY sched_status, active;
Count the job types in created status (in the JobSubmitter queue)
SELECT wmbs_sub_types.name, COUNT(*) FROM wmbs_job
INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
WHERE wmbs_job.state = (SELECT id FROM wmbs_job_state WHERE name = 'created')
GROUP BY wmbs_subscription.subtype, wmbs_sub_types.name;
Get acquired files, subscription and location for a specific workflow id:
SELECT wmbs_sub_files_acquired.fileid, wmbs_sub_files_acquired.subscription, wpnn.pnn AS pnn
FROM wmbs_sub_files_acquired
INNER JOIN wmbs_subscription ON wmbs_sub_files_acquired.subscription = wmbs_subscription.id
INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
INNER JOIN wmbs_file_location ON wmbs_sub_files_acquired.fileid = wmbs_file_location.fileid
INNER JOIN wmbs_pnns wpnn ON wmbs_file_location.pnn = wpnn.id
WHERE wmbs_workflow.id=207 ORDER BY pnn;
Get the job mask and the amount of input files each wmbs job is supposed to process:
SELECT wmbs_job.id, wmbs_job_mask.firstlumi, wmbs_job_mask.lastlumi, count(wmbs_job_assoc.fileid) AS num_files
FROM wmbs_job INNER JOIN wmbs_job_mask ON wmbs_job.id=wmbs_job_mask.job
INNER JOIN wmbs_job_assoc ON wmbs_job.id=wmbs_job_assoc.job
WHERE cache_dir like '%TOP-RunIISummer16DR80Premix-00266_1/TOP-RunIISummer16MiniAODv2-00263_0/JobCollection%'
GROUP BY wmbs_job.id, wmbs_job_mask.firstlumi, wmbs_job_mask.lastlumi
ORDER BY num_files DESC, wmbs_job.id;
Get the job mask for every job in a workflow, ordered by workflow task:
SELECT wmbs_job_mask.*, wmbs_workflow.task FROM wmbs_job_mask
INNER JOIN wmbs_job ON wmbs_job.id = wmbs_job_mask.job
INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
WHERE wmbs_workflow.name = 'amaltaro_TC_PreMix_Apr2019_Validation_190416_212623_616' ORDER BY wmbs_workflow.task;
Given a file name, find all its run/lumi mapping:
SELECT * FROM wmbs_file_runlumi_map
WHERE fileid=(SELECT id from wmbs_file_details where lfn='MCFakeFile-pdmvserv_task_HIG-RunIISummer15wmLHEGS-02325__v1_T_190324_065355_1119-HIG-RunIISummer15wmLHEGS-02325_0-a7eaaaa702533cd0213c409632df836f');
Provided a block name (and files if needed), find out which workflow created them:
SELECT dbsbuffer_block.id AS blockid, dbsbuffer_block.blockname AS blockname,
dbsbuffer_file.id AS fileid, dbsbuffer_file.lfn AS lfn,
dbsbuffer_workflow.id AS wfid, dbsbuffer_workflow.name AS name FROM dbsbuffer_block
INNER JOIN dbsbuffer_file ON dbsbuffer_block.id = dbsbuffer_file.block_id
INNER JOIN dbsbuffer_workflow ON dbsbuffer_file.workflow = dbsbuffer_workflow.id
WHERE dbsbuffer_block.blockname
IN ('/TTJets_SingleLeptFromT_TuneCP5_13TeV-madgraphMLM-pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM#db434bf2-1827-484d-93b6-2317a2a9d422',
'/TTJets_SingleLeptFromT_TuneCP5_13TeV-madgraphMLM-pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM#6070c12c-9db4-4a2e-bc06-31210e06d6dc')
AND dbsbuffer_file.lfn
IN ('/store/mc/RunIIAutumn18MiniAOD/TTJets_SingleLeptFromT_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/102X_upgrade2018_realistic_v15-v1/70000/78504BE6-7D0C-504D-AC33-4633FB046646.root',
'/store/mc/RunIIAutumn18MiniAOD/TTJets_SingleLeptFromT_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/102X_upgrade2018_realistic_v15-v1/70000/99D23191-62CC-8A4B-8DDF-175BB2786FFC.root');
Inserting a new PNN into the database and creating its PSN to PNN association (Oracle backend)
First we need to create the new PNN:
INSERT INTO wmbs_pnns (id, pnn) VALUES (wmbs_pnns_SEQ.nextval, 'T2_US_Caltech_Ceph');
Next we need to create the site vs PNN association:
INSERT INTO wmbs_location_pnns (location, pnn)
VALUES(
(SELECT id from wmbs_location where site_name='T2_US_Caltech'),
(SELECT id from wmbs_pnns where pnn='T2_US_Caltech_Ceph')
);
Last but not least, do not forget that we do not have autocommit enabled for Oracle, so:
commit;
Inserting a new PNN into the database and creating its PSN to PNN association (MySQL backend)
First we need to create the new PNN:
INSERT INTO wmbs_pnns (pnn) VALUES ('T2_US_Caltech_Ceph');
Next we need to create the site vs PNN association:
INSERT INTO wmbs_location_pnns (location, pnn)
VALUES(
(SELECT id from wmbs_location where site_name='T2_US_Caltech'),
(SELECT id from wmbs_pnns where pnn='T2_US_Caltech_Ceph')
);
Unregistering a WMAgent agent from WMStats watch
The smart way is:
$manage execute-agent wmagent-unregister-wmstats `hostname -f`
The harder way.
First you need to source the agent environment:
source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
Then run the following from a Python interpreter (adjusting the agent and WMStats URLs):
from WMCore.Services.WMStats.WMStatsWriter import WMStatsWriter
wmstats = WMStatsWriter("https://cmsweb-testbed.cern.ch/couchdb/wmstats")
agentUrl = "cmsgwms-submit1.fnal.gov:9999"
wmstats.deleteDocsByIDs([agentUrl])
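Alternatively, since these are plain CouchDB documents keyed by the agent URL, the same can be done via the CouchDB REST API; a sketch (the document ID format is an assumption):
curl -ks --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb-testbed.cern.ch/couchdb/wmstats/cmsgwms-submit1.fnal.gov:9999"
curl -ks --cert $X509_USER_PROXY --key $X509_USER_PROXY -X DELETE "https://cmsweb-testbed.cern.ch/couchdb/wmstats/cmsgwms-submit1.fnal.gov:9999?rev=REV_FROM_FIRST_CALL"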
WMAgent deployment
1. Once the agent is fully drained, with no jobs in the system, all blocks/files injected, etc., you can connect to the machine and stop the agent and its services. Also cross-check that all processes have been stopped:
ps aux | egrep 'wmcore|couch|beam|mysql'
2. Check the validity of the service certificate and remove (or not) the old WMAgent version:
openssl x509 -in /data/certs/servicecert.pem -dates -subject -noout
3. Start a fresh terminal and update the secrets file:
vim /data/admin/wmagent/WMAgent.secrets
cd /data/srv
4. Get the deployment script from GH:
wget --no-check-certificate https://raw.githubusercontent.com/amaltaro/scripts/master/deployProd.sh
5. Run the deployment script providing arguments in command line:
./deployProd.sh -w 0.9.95b -c HG1406e -s slc5_amd64_gcc461 -t reproc_lowprio -f mysql -n 3 | tee log
6. Cross-check all the logs to make sure everything succeeded
7. Source the new WMA environment
source /data/admin/wmagent/env.sh
8. Double check the agent configuration
vim config/wmagent/config.py
9. Provided everything is ok, you can start the agent
$manage start-agent
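To cross-check that all components came up cleanly, you can check the agent status and tail the component logs, reusing the one-liner from the Linux Administration section (the status target of the manage script is an assumption):
$manage status
for comp in `ls install/wmagent`; do echo $comp; tail -n5 install/wmagent/$comp/ComponentLog; echo ""; done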
Injecting a PSet configuration file into CouchDB
In order to upload a PSet.py configuration file to central
CouchDB, we need both CMSSW and WMCore libraries in the path.
1. log in to lxplus or aiadm
2. create a CMSSW project area:
cd workarea/
source /cvmfs/cms.cern.ch/cmsset_default.sh
export SCRAM_ARCH=slc6_amd64_gcc630
scram project CMSSW_9_3_7
cd CMSSW_9_3_7/src
cmsenv
3. download and change the PSet.py file you want to upload
4. clone the wmcontrol repo:
git clone https://github.com/cms-PdmV/wmcontrol
5. access the wmcontrol/modules dir. Then create your proxy and source the wmcontrol script:
export X509_USER_PROXY=/tmp/x509up_u5514
source /afs/cern.ch/cms/PPD/PdmV/tools/wmclient/current/etc/wmclient.sh
6. finally, open the python interpreter and run something like
from wma import upload_to_couch
step_cfg_name = "/afs/cern.ch/user/a/amaltaro/workarea/CMSSW_9_3_7/src/pset.py"
section_name = "alan_test"
user_name = "amaltaro"
group_name = "DATAOPS"
url = "https://cmsweb-testbed.cern.ch/couchdb"
res = upload_to_couch(step_cfg_name, section_name, user_name, group_name, url=url)
Unpickling a WMAgent job config
Once you have a logArchive tarball, you will find two PSet files in it: PSet.py is just a wrapper, while PSet.pkl holds the real job configuration.
You first need a CMSSW environment (to use the FWCore module):
amaltaro@lxplus135:~ $ cd workarea/trustPU/
source /cvmfs/cms.cern.ch/cmsset_default.sh
export SCRAM_ARCH=slc6_amd64_gcc530
scram project CMSSW_8_1_0
cd CMSSW_8_1_0/src
cmsenv
cp ~/workarea/trustPU/cmsRun1/PSet.pkl .
then you can unpickle and dump its configuration with dumpPython() on the process object.
amaltaro@lxplus135:~/workarea/trustPU/CMSSW_5_3_19/src $ python
import pickle
with open('PSet.pkl', 'rb') as pf:  # pickle files should be opened in binary mode
    procObj = pickle.load(pf)
with open('myPSet.py', 'w') as f:
    f.write(procObj.dumpPython())
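Alternatively, within the same CMSSW environment, edmConfigDump does the unpickling and dumping in one step:
edmConfigDump PSet.py > myPSet.py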
Looking for a specific LFN in the agent condor logs
First, we need to get the list of files available for a workflow, inside a specific agent, e.g.:
tree -f -i ../install/wmagent/JobArchiver/logDir/p/pdmvserv_task_SUS-RunIIFall15DR76-00098__v1_T_160223_044155_6562/ | grep tar > alan
Then, we extract the condor.out log file from each of those tarballs and grep for the LFN with:
for f in `cat alan`; do echo $f; tar -jxf $f --wildcards -O '*/condor.*.out' | grep "2673FE5D-DCDC-E511-A120-001E67396ACC.root"; done > alanLog
Now check the text file alanLog and you should see which tarball has it.
Running a production job interactively
Connect to a machine that has access to EOS at CERN and grab a logArchive tarball. Then you can create a CMSSW environment and run the job like:
cd /data/amaltaro/
source /cvmfs/cms.cern.ch/cmsset_default.sh # to get scram in the path
export SCRAM_ARCH=slc5_amd64_gcc462
scram project CMSSW_5_3_2_patch5
cd CMSSW_5_3_2_patch5/src
cmsenv
#git cms-addpkg Configuration/PyReleaseValidation
scramv1 b
### untar the logarchive
tar xvzf 6754b712-fab8-11e3-b35b-00155dffc454-0-0-logArchive.tar.gz
cmsRun -e PSet.py 2>err.txt 1>out.txt &
Changing schedd prio factor in the production pool
- Connect to vocms97 as _condor user
- Check the schedd priorities:
condor_userprio -all -allusers
- And change it to the number you want, e.g.:
condor_userprio -setfactor cmst1@vocms85.cern.ch 15000.00
Update block (and file) location:
SQL> SELECT * FROM dbsbuffer_block WHERE status!='Closed';
ID BLOCKNAME LOCATION CREATE_TIME STATUS STATUS3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1611 /ZPrimeToTTJets_M1000GeV_W100GeV_Tune4C_13TeV-madgraph-tauola/Fall13-POSTLS162_V1-v1/GEN-SIM#f3b9fc9c-1bc6-11e4-b97e-003048f02c8a 217 1407150203 InDBS Pending
1606 /RPV_SM2_Muon_TuneZ2star_8TeV-madgraph-tauola/Summer12-START53_V19_FSIM_PU_S12-v1/AODSIM#3eec6722-1be5-11e4-b97e-003048f02c8a 217 1407163215 InDBS Pending
SQL> SELECT * FROM dbsbuffer_location WHERE id=217;
ID SE_NAME
---------- -------------------------------------------------------------------------------------------------
217 ntugrid6.phys.ntu.edu.tw
SQL> SELECT * FROM dbsbuffer_location WHERE se_name='ntugrid4.phys.ntu.edu.tw';
ID SE_NAME
---------- -------------------------------------------------------------------------------------------------
161 ntugrid4.phys.ntu.edu.tw
SQL> UPDATE dbsbuffer_block SET location=161 WHERE location=217 AND status='InDBS';
2 rows updated.
SQL> SELECT dbsbuffer_file.id, dbsbuffer_file_location.location
FROM dbsbuffer_file INNER JOIN dbsbuffer_file_location
ON dbsbuffer_file.id = dbsbuffer_file_location.filename
WHERE dbsbuffer_file.block_id=1611 OR dbsbuffer_file.block_id=1606;
ID LOCATION
---------- ----------
222253 217
222502 217
SQL> UPDATE (SELECT dbsbuffer_file.id, dbsbuffer_file_location.location
FROM dbsbuffer_file INNER JOIN dbsbuffer_file_location
ON dbsbuffer_file.id = dbsbuffer_file_location.filename
WHERE dbsbuffer_file.block_id=1611 OR dbsbuffer_file.block_id=1606)
SET location=161;
2 rows updated.
Retrieving logs from castor@CERN:
- First log in to aiadm (since some clients are not available in vocms049) and request a stage of the file:
stager_get -M /castor/cern.ch/cms/store/logs/prod/2015/04/WMAgent/whatever-1-logs.tar
- Then you can check the status of this request by (STAGED means you can copy it):
stager_qry -M /castor/cern.ch/cms/store/logs/prod/2015/04/WMAgent/whatever-1-logs.tar
- Finally, you can copy this file with xrdcp:
xrdcp root://castorcms//castor/cern.ch/cms/store/logs/prod/2015/04/WMAgent/whatever-1-logs.tar .
- If you retrieve data from EOS CERN, just:
xrdcp root://eoscms.cern.ch//store/lhe/10643/8TeV_GoGo_2j_gluino1550_chargino1025_run22892_unwt_aw_winclusive.lhe .
Injecting and running a test workflow with reqmgr.py
- Get the ReqMgr client and the template you want to inject:
wget --no-check-certificate https://raw.githubusercontent.com/dmwm/WMCore/master/test/data/ReqMgr/reqmgr.py
wget --no-check-certificate https://raw.githubusercontent.com/dmwm/WMCore/master/test/data/ReqMgr/requests/MonteCarlo.json
- Then create the request, pointing it to the correct ReqMgr URL (usually in cmsweb). You also need to provide new values for all those "OVERRIDE-ME" fields. E.g.:
python2.6 reqmgr.py -u https://cmsweb-testbed.cern.ch -f MonteCarlo.json -j '{"createRequest": {"Campaign": "DMWM_Test", "RequestString" : "MonteCarlo_Alan_25Jul2014"}}' -i
- The command above will create a new request in ReqMgr (which re-uses a configuration previously injected into couchdb). You should get the request name in the output of that command. E.g.:
INFO:root:Approving request 'amaltaro_MonteCarlo_Alan_25Jul2014_140725_170927_3909' ...
- Now with the request name in hands, you can assign this workflow (again overriding all "OVERRIDE-ME" fields). E.g.
python2.6 reqmgr.py -u https://cmsweb-testbed.cern.ch -f MonteCarlo.json -r amaltaro_MonteCarlo_Alan_25Jul2014_140725_170927_3909 -j '{"assignRequest": {"SiteWhitelist": ["T1_US_FNAL"], "Team": "testbed-dmwm", "dashboard": "test", "AcquisitionEra": "DMWM_Test", "ProcessingString": "MC_TEST_Alan_25Jul2014"}}' -g
- That's all, now you have to monitor your workflow...
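For basic monitoring, you can for instance fetch the workflow document from WMStats, as in the WMStats hint earlier on this page:
curl -ks --cert $X509_USER_PROXY --key $X509_USER_PROXY "https://cmsweb-testbed.cern.ch/couchdb/wmstats/_all_docs?keys=%5B%22amaltaro_MonteCarlo_Alan_25Jul2014_140725_170927_3909%22%5D&include_docs=true"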
SQL hints
- Updating rogue jobs in the database that are still marked as executing but are gone long ago:
# mysql> UPDATE wmbs_job SET state=(SELECT id from wmbs_job_state WHERE name='cleanout') WHERE state=(SELECT id from wmbs_job_state WHERE name='executing');
- Selecting only duplicate values:
# mysql> SELECT lfn, COUNT(*) FROM dbsbuffer_file GROUP BY lfn HAVING COUNT(*) > 1;
- Getting table constraints information (Oracle):
# SQL> SELECT cons.constraint_type, cons.constraint_name, cols.column_name
FROM all_constraints cons, all_cons_columns cols
WHERE cols.table_name = 'DBSBUFFER_FILE'
AND cons.constraint_name = cols.constraint_name;
- Getting tablespace usage and quota (Oracle):
# select * from user_ts_quotas;
# select SEGMENT_NAME, SEGMENT_TYPE, TABLESPACE_NAME, BYTES, BLOCKS from user_segments ORDER BY BYTES DESC;
- Getting the PhEDEx subscription status for a given dataset:
# mysql> SELECT * FROM dbsbuffer_dataset_subscription WHERE id=1252;
- Then you can see which dataset this id belongs to:
# mysql> SELECT * FROM dbsbuffer_dataset WHERE id=211;
- In case you want to update the site this subscription is supposed to go to (e.g. T0_CH_CERN_MSS):
# mysql> UPDATE dbsbuffer_dataset_subscription SET site='T0_CH_CERN' WHERE id=1252 AND subscribed=0;
- Selecting all subscription ids, dataset names and sites for subscriptions that need to be made to T0_CH_CERN_MSS:
# mysql> SELECT dbsbuffer_dataset_subscription.id, dbsbuffer_dataset_subscription.site, dbsbuffer_dataset.path
FROM dbsbuffer_dataset INNER JOIN dbsbuffer_dataset_subscription
ON dbsbuffer_dataset.id=dbsbuffer_dataset_subscription.dataset_id
WHERE dbsbuffer_dataset_subscription.site='T0_CH_CERN_MSS';
- Select and group dataset_id (print how many times they appear in the table):
# mysql> SELECT dataset_id,count(*) FROM dbsbuffer_dataset_subscription WHERE site='T0_CH_CERN_MSS' OR site='T0_CH_CERN' GROUP BY dataset_id;
- Getting all wmbs subscriptions and their status for a specific workflow:
# mysql> SELECT wmbs_workflow.id AS wf_id, wmbs_subscription.id AS sub_id, wmbs_workflow.task, wmbs_subscription.finished
FROM wmbs_workflow INNER JOIN wmbs_subscription
ON wmbs_workflow.id=wmbs_subscription.workflow
WHERE wmbs_workflow.name='amaltaro_ReReco_Agent105_CDB16_forceComplete_150413_202829_507';
- Checking files available (jobs) and their location for a specific subscription:
# SELECT wmbs_sub_files_available.fileid, wlpnn.pnn AS pnn
FROM wmbs_sub_files_available
INNER JOIN wmbs_file_location ON
wmbs_sub_files_available.fileid = wmbs_file_location.fileid
INNER JOIN wmbs_location_pnns wlpnn ON
wmbs_file_location.location = wlpnn.location
WHERE wmbs_sub_files_available.subscription = 24125;
- Getting a map of PSNs for a given list of PNNs:
# select wmbs_location.id as site_id, wmbs_location.site_name, wmbs_location.state, wmbs_pnns.pnn from wmbs_location_pnns
inner join wmbs_pnns on wmbs_location_pnns.pnn = wmbs_pnns.id
inner join wmbs_location on wmbs_location_pnns.location = wmbs_location.id
where wmbs_pnns.pnn in ('T2_UK_London_IC', 'T2_UK_SGrid_RALPP', 'T2_UK_London_Brunel');
- Grouping available files by subscription:
# select subscription,count(*) from wmbs_sub_files_available group by subscription;
- Show all tables in MySQL:
# mysql> SHOW tables;
- Getting thresholds per site per task:
# mysql> SELECT wmbs_location.site_name as site, wmbs_sub_types.name AS task, rc_threshold.max_slots, rc_threshold.pending_slots
FROM wmbs_location INNER JOIN rc_threshold ON rc_threshold.site_id = wmbs_location.id
INNER JOIN wmbs_sub_types ON wmbs_sub_types.id = rc_threshold.sub_type_id ORDER BY site;
- Describing a table in MySQL:
# mysql> DESCRIBE table_name;
- Printing a general overview of tables and their column names and sizes in MySQL:
# mysql> SELECT TABLE_NAME,COLUMN_NAME,DATA_TYPE,CHARACTER_MAXIMUM_LENGTH,COLUMN_TYPE FROM information_schema.columns WHERE table_schema = 'wmagent';
- Changing a table to increase a column's length:
# mysql> ALTER TABLE wmbs_fileset MODIFY name varchar(650);
- Find subscriptions not finished:
# SQL> select wmbs_subscription.workflow, wmbs_subscription.id, wmbs_subscription.fileset, wmbs_fileset.open FROM wmbs_subscription
INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id where wmbs_subscription.finished = 0 ORDER BY wmbs_subscription.workflow;
- Find files available for a subscription not finished:
# SQL> select wmbs_subscription.workflow, wmbs_subscription.id, wmbs_subscription.fileset, wmbs_sub_files_available.fileid
FROM wmbs_sub_files_available INNER JOIN wmbs_subscription ON wmbs_sub_files_available.subscription = wmbs_subscription.id
WHERE wmbs_subscription.finished = 0;
- Find files acquired for a subscription not finished:
# SQL> select wmbs_subscription.workflow, wmbs_subscription.id, wmbs_subscription.fileset, wmbs_sub_files_acquired.fileid
FROM wmbs_sub_files_acquired INNER JOIN wmbs_subscription ON wmbs_sub_files_acquired.subscription = wmbs_subscription.id
WHERE wmbs_subscription.finished = 0 and rownum < 3;
- Getting available LFNs for subscriptions not finished:
# SQL> select wmbs_subscription.workflow, wmbs_subscription.id, wmbs_subscription.fileset, wmbs_sub_files_available.fileid, wmbs_file_details.lfn
FROM wmbs_sub_files_available INNER JOIN wmbs_subscription ON wmbs_sub_files_available.subscription = wmbs_subscription.id
AND wmbs_subscription.finished = 0
INNER JOIN wmbs_file_details ON wmbs_sub_files_available.fileid = wmbs_file_details.id and rownum < 3;
- Insert a row into an oracle table:
# SQL> INSERT INTO REQMGR_REQUEST_TYPE
(TYPE_ID, TYPE_NAME) VALUES (122, 'StepChain');
- Show all tables in Oracle SQL:
# SQL> SELECT tname FROM tab;
- Show all indexes and columns of a table in Oracle:
# SQL> SELECT b.uniqueness, a.index_name, a.table_name, a.column_name
FROM all_ind_columns a, all_indexes b
WHERE a.index_name=b.index_name
AND a.table_name = upper('WMBS_PNNS')
ORDER by a.table_name, a.index_name, a.column_position;
- Describing a table in Oracle SQL:
# SQL> DESC table_name;
Workflow of request injection into ReqMgr
First we call the putRequest() method from (L480):
WMCore/HTTPFrontEnd/RequestManager/ReqMgrRESTModel.py
then a makeRequest() call is made (L468):
WMCore/HTTPFrontEnd/RequestManager/ReqMgrWebTools.py
which will build and validate(?) the request schema (L367):
WMCore/HTTPFrontEnd/RequestManager/ReqMgrWebTools.py
falling into this buildWorkloadForRequest() call (L99):
WMCore/RequestManager/RequestMaker/Registry.py
which finally instantiates the correct request spec file and its parent (usually StdBase):
WMCore/WMSpec/StdSpecs/StdBase.py
WMCore/WMSpec/StdSpecs/StepChain.py
Debugging MySQL (MariaDB) issues ('lock wait timeout exceeded'):
- Connect to the database and first check what the lock wait timeout is:
# mysql> show variables like 'innodb_lock_wait_timeout';
- Now we can see the list of locked tables:
# mysql> show open tables where in_use>0;
- Now we list the current processes; one of them must be locking our table (time is in seconds):
# mysql> show processlist;
- Kill the greedy process:
# mysql> kill put_process_id_here;
- If we want many more details about these events, we can issue:
# mysql> show engine innodb status\G
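- The information_schema also exposes the running InnoDB transactions, which helps to map a locked table back to the offending connection before killing it:
# mysql> SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id FROM information_schema.INNODB_TRX;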
My Personal Preferences
- Show tool-tip topic info on mouse-over of WikiWord links, on or off:
- Set LINKTOOLTIPINFO = off
- More preferences
TWiki has system wide preferences settings defined in TWikiPreferences. You can customize preferences settings to your needs: To overload a system setting, (1) do a "raw view" on TWikiPreferences, (2) copy a Set VARIABLE = value
bullet, (3) do a "raw edit" of your user profile page, (4) add the bullet to the bullet list above, and (5) customize the value as needed. Make sure the settings render as real bullets (in "raw edit", a bullet requires 3 or 6 spaces before the asterisk).
Related Topics