Categories of files/samples on _Disk:
- files/samples without custodial location (we produced them but don’t want them to go to tape)
- files/samples with custodial location, we decided to keep them on _Disk permanently
- RAW of 2012 data for re-reconstruction passes
- AOD of 2012 data for analysis
- => the rules we used for populating the _Disk endpoint
- input to workflows which are not in the other categories
- can be cleaned up after we have checked that the workflow is finished
- start talking to the workflow team => why can’t the workflow team delete the input samples after they have announced the output?
- there is also a web interface where you can see if a workflow is still running or not
- output of workflows (AOD, AODSIM)
- in general we would like output of recent workflows to stay on disk
- old output can be considered for cleanup
* important that we check that there is a custodial location
* we split in the above 3 categories
* make lists for all samples split into era/tier
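The era/tier lists could be produced with a few lines of Python. This is only a sketch: it assumes dataset names of the form /Primary/Processed/TIER and takes the era to be everything before the first "-" in the processed part. The function name and the example dataset names are made up for illustration.

```python
from collections import defaultdict


def split_by_era_and_tier(datasets):
    """Group dataset names by (era, tier).

    Assumption: names look like /Primary/Era-Rest/TIER, and the era is
    the text before the first '-' in the middle part.
    """
    groups = defaultdict(list)
    for name in datasets:
        parts = name.strip().split("/")
        # A well-formed name splits into ['', primary, processed, tier].
        if len(parts) != 4 or parts[0] != "":
            continue  # skip anything that is not a dataset name
        processed, tier = parts[2], parts[3]
        era = processed.split("-", 1)[0]
        groups[(era, tier)].append(name.strip())
    return dict(groups)


if __name__ == "__main__":
    # Illustrative names only, not taken from a real site listing.
    example = [
        "/MinimumBias/Run2012A-PromptReco-v1/AOD",
        "/MinimumBias/Run2012A-v1/RAW",
    ]
    for (era, tier), names in sorted(split_by_era_and_tier(example).items()):
        print(era, tier, len(names))
```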
Goal:
to have the T1_*_Disk endpoints filled below 80% of their capacity
Procedure
- check every week; if a _Disk endpoint is more than 80% full, make a suggestion for what to clean up
- check for samples that are on several _Disk endpoints
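The two checks could be scripted roughly as follows. This is a minimal sketch, not the script actually in use: the input mappings, the function names and any numbers fed to them are hypothetical, and where the usage figures and dataset lists come from is left open.

```python
from collections import defaultdict

THRESHOLD = 0.80  # goal: keep every T1_*_Disk endpoint below 80% of capacity


def endpoints_needing_cleanup(usage, threshold=THRESHOLD):
    """Return (endpoint, fill fraction, bytes to free) for endpoints above threshold.

    usage: hypothetical mapping endpoint -> (used_bytes, capacity_bytes).
    The result is sorted with the fullest endpoint first.
    """
    flagged = []
    for endpoint, (used, capacity) in usage.items():
        if capacity <= 0:
            continue  # no usable capacity information
        fraction = used / capacity
        if fraction > threshold:
            # Amount that has to go to get back down to the threshold.
            to_free = used - threshold * capacity
            flagged.append((endpoint, fraction, to_free))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


def datasets_on_several_endpoints(datasets_by_endpoint):
    """Return dataset -> sorted list of endpoints, for datasets on more than one.

    datasets_by_endpoint: hypothetical mapping endpoint -> iterable of dataset names.
    """
    locations = defaultdict(set)
    for endpoint, datasets in datasets_by_endpoint.items():
        for dataset in datasets:
            locations[dataset].add(endpoint)
    return {d: sorted(eps) for d, eps in locations.items() if len(eps) > 1}
```

The second function only reports duplicates; which copy to drop is still a decision for whoever prepares the cleanup suggestion.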
First to clean up:
* GEN-SIM which has already been processed and is not expected to be processed again
1- Get all GEN-SIM datasets at T1_UK_RAL_Disk
https://cmsweb.cern.ch/phedex/datasvc/xml/prod/blockreplicasummary?node=T1_UK_RAL_Disk&create_since=0&dataset=/*/*/GEN-SIM
or
python datasvc.py --service blockreplicasummary --options "node=T1_UK_RAL_Disk&create_since=0&dataset=/*/*/GEN-SIM" --path /phedex/block/name | tee cleaning_blocks
2- Exclude datasets without custodial location
cut -d '#' -f 1 cleaning_blocks | sort -u > cleaning_datasets
awk '{system("python checkReplica.py --option custodial:y --dataset "$1)}' cleaning_datasets
3- Exclude datasets that are the input of a workflow that is still active, i.e. whose status matches "assigned|acquired|running|running-open|running-closed|assignment-approved"
4- Exclude datasets that are the output of a workflow which finished within the last ~6 months
5- Cross-check if remaining datasets have a custodial location
6- Send the list to Andrew (andrew.lahiff@stfcNOSPAMPLEASE.ac.uk) and Dave (dmason@fnalNOSPAMPLEASE.gov)
7- If they approve the list, create a deletion request.
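The filtering in steps 2 to 4 could be sketched in Python as below. This is an illustration under assumptions, not the procedure's actual tooling: the workflow records are hypothetical dictionaries with keys "status", "input", "outputs" and "finished" (how they are obtained is not covered here), the set of datasets with a custodial location is assumed to be known already, and "~6 months" is approximated as 183 days.

```python
from datetime import date, timedelta

# Statuses from step 3 that mark a workflow as still active.
ACTIVE_STATUSES = {
    "assigned", "acquired", "running", "running-open",
    "running-closed", "assignment-approved",
}


def datasets_from_blocks(block_names):
    """Turn block names (dataset#block-id) into a sorted list of unique datasets."""
    return sorted({b.strip().split("#", 1)[0] for b in block_names if b.strip()})


def cleanup_candidates(block_names, custodial, workflows, today, recent_days=183):
    """Apply steps 2-4 and return the datasets that remain as cleanup candidates.

    custodial: set of datasets known to have a custodial location.
    workflows: hypothetical records, e.g.
        {"status": "running", "input": "/A/..", "outputs": [..], "finished": None}
    """
    # Step 2: keep only datasets that have a custodial location.
    candidates = set(datasets_from_blocks(block_names)) & set(custodial)
    cutoff = today - timedelta(days=recent_days)
    for wf in workflows:
        # Step 3: the input of an active workflow must stay on disk.
        if wf.get("status") in ACTIVE_STATUSES:
            candidates.discard(wf.get("input"))
        # Step 4: the output of a recently finished workflow stays on disk.
        finished = wf.get("finished")
        if finished is not None and finished >= cutoff:
            candidates.difference_update(wf.get("outputs", ()))
    return sorted(candidates)


if __name__ == "__main__":
    with open("cleaning_blocks") as handle:
        blocks = handle.read().splitlines()
    # Placeholder inputs: fill in the custodial set and the workflow records.
    print("\n".join(cleanup_candidates(blocks, set(), [], date.today())))
```

The result still needs the cross-check in step 5 and the approval in steps 6 and 7 before any deletion request is made.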
Second to clean up:
data RAW: first 2011, then 2012