ILCDIRAC monitoring pages
JIRA
https://its.cern.ch/jira/browse/ILCDIRAC
Shifter pages to check
Check these pages for anything out of the ordinary.
- IT Status Board: Incidents, announcements, and changes to CERN IT infrastructure
Grid Resource Sites
Websites offering overviews of, or portals to, the grid:
- GGus: Ticketing for LCG (and some OSG) sites
- GStat: Information about sites, their queues and resources
- GOCDB: Downtime and resource overview
Overview sites
The following links point to useful pages to monitor the status of the system:
Debugging CEs and SEs
How to access the underlying CEs directly when debugging issues: Cream, ARC, Globus, and SEs/FileCatalogs
Checking the Status of Machines, Agents, and Services
Castor Diskpool, Unstaging Files
When the castor diskpool is getting full, one needs to unstage files from it.
Clone the ILCDiracOps git repository and run the script:
StagerScripts/UnstageProdFiles -P'prodID' -F'REC|SIM'
Here prodID is the first production to check, and REC|SIM are the file types to unstage.
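As a rough illustration of how the placeholders in the command above are filled in, the following Python sketch assembles the argument list for one invocation. The helper name `build_unstage_command` and the production ID 1234 are hypothetical; the sketch assumes the option value is attached directly to the flag, which is what the shell produces from the `-P'prodID'` form shown above. It only builds and prints the command and does not run it.

```python
import shlex


def build_unstage_command(prod_id, file_types):
    """Return the argument list for one UnstageProdFiles call.

    prod_id    -- first production to check (hypothetical value in the demo)
    file_types -- file types to unstage, joined with '|' as in -F'REC|SIM'
    """
    if not file_types:
        raise ValueError("at least one file type is required")
    return [
        "StagerScripts/UnstageProdFiles",
        f"-P{prod_id}",
        "-F" + "|".join(file_types),
    ]


# Hypothetical production ID, used for illustration only.
argv = build_unstage_command(1234, ["REC", "SIM"])

# shlex.join quotes the '|' so the line can be pasted into a shell safely.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` from the root of the cloned ILCDiracOps repository would run the script without any shell quoting concerns.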
Production Output Data Checking
Sometimes something goes wrong and files that should not exist do exist. For example, a job can fail but its output files are picked up before they are removed, a job can be rescheduled for no good reason, or a job can fail between uploading its output data and creating the removal requests.
Here are some scripts to deal with these output files; they are also in the ILCDiracOps repository:
Check if successful jobs have used the same input data for a given production:
CheckProdJobs -P'ProdID'
Check if the output from failed jobs still exists:
CheckFailedJobsFourOutputData -P'ProdID'
Check if there are production files without ancestors:
CheckProductionAncestry -p'prodID' -t'filetype'
Also use the dirac-transformation-cli command to check and change files for productions.
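The idea behind the three checks listed in this section can be illustrated on in-memory data. This is a minimal sketch and not the implementation of the ILCDiracOps scripts: the job records, the status names "Done" and "Failed", the file paths, and all function names are assumptions made for the example, and no DIRAC services are queried.

```python
from collections import defaultdict


def duplicated_inputs(jobs):
    """Map each input file used by more than one successful job to the job IDs."""
    users = defaultdict(list)
    for job in jobs:
        if job["status"] == "Done":  # assumed name of the success status
            for lfn in job["inputs"]:
                users[lfn].append(job["id"])
    return {lfn: ids for lfn, ids in users.items() if len(ids) > 1}


def leftover_failed_outputs(jobs, existing_files):
    """Return outputs of failed jobs that are still present in the catalog."""
    leftovers = set()
    for job in jobs:
        if job["status"] == "Failed":  # assumed name of the failure status
            leftovers.update(set(job["outputs"]) & existing_files)
    return leftovers


def files_without_ancestors(ancestors):
    """Return, sorted, the files whose list of ancestors is empty."""
    return sorted(lfn for lfn, parents in ancestors.items() if not parents)


# Hypothetical records for one production.
jobs = [
    {"id": 1, "status": "Done", "inputs": ["/prod/in_a"], "outputs": ["/prod/out_1"]},
    {"id": 2, "status": "Done", "inputs": ["/prod/in_a"], "outputs": ["/prod/out_2"]},
    {"id": 3, "status": "Failed", "inputs": ["/prod/in_b"], "outputs": ["/prod/out_3"]},
]
existing_files = {"/prod/out_1", "/prod/out_2", "/prod/out_3"}
ancestors = {"/prod/out_1": ["/prod/in_a"], "/prod/out_4": []}

print(duplicated_inputs(jobs))
print(leftover_failed_outputs(jobs, existing_files))
print(files_without_ancestors(ancestors))
```

In this toy data set, jobs 1 and 2 share an input file, the output of failed job 3 is still registered, and one file has no recorded ancestor, so each check reports exactly one problem.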