Procedure for recovering lost data at T2s
Aim
This procedure describes how to identify and recover lost data at a CMS T2. It is assumed that a list of lost files (lfn, pfn, pnfsid) is available;
the example uses dCache.
There are various types of data that can be lost:
- Official Datasets
- Production files to be uploaded to T1s
- User files
- Not Identified
For each of the identified types, a different procedure has to be followed:
- Official datasets: retransfer them
- Production files: ask for invalidation
- User/Other files: I hope you have a backup
Some useful scripts can be found here:
http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/UserCode/leo/Utilities/
Category Identification
Retrieving lost files from PhEDEx
The required input is a file list. The files registered in PhEDEx can be retrieved using the BlockConsistencyCheck utility, from the PhEDEx node:
PHEDEX/Utilities/BlockConsistencyCheck -verbose -db ~/config/DBParam.CSCS:Prod/CSCS \
-buffer T2_CH_CSCS -block '%' -tfc config/SITECONF/T2_CH_CSCS/PhEDEx/storage.xml > consistency_check_`date '+%Y%m%d'`.txt
-verbose lists all the missing files and not only the blocks, while `date '+%Y%m%d'` returns the current date in the YYYYMMDD format.
The output file contains a lot of data, but in this case we are interested only in the missing files (not in the size mismatches), so:
cat consistency_check_20091029.txt | grep LFN | grep "SizeMismatch" | \
cut -d= -f2 | awk '{print $1}' | grep store > missingFiles_20091029_mismatch.txt
cat consistency_check_20091029.txt | grep LFN | grep "Missing" | \
cut -d= -f2 | awk '{print $1}' | grep store > missingFiles_20091029_missing.txt
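The same extraction can be sketched in Python, which avoids hard-coding the date in two places. This is only a minimal sketch that mirrors the grep/cut/awk pipeline above step by step; the function name extract_lfns is illustrative, and the sample report lines are invented, since the exact BlockConsistencyCheck output format is not shown here.

```python
def extract_lfns(lines, status):
    """Return the LFNs from report lines that mention `status`.

    Mirrors: grep LFN | grep <status> | cut -d= -f2 | awk '{print $1}' | grep store
    """
    lfns = []
    for line in lines:
        if "LFN" not in line or status not in line:
            continue
        fields = line.split("=")
        # cut -d= -f2: second '='-separated field (whole line if there is no '=')
        field = fields[1] if len(fields) > 1 else line
        tokens = field.split()
        # awk '{print $1}' then grep store
        if tokens and "store" in tokens[0]:
            lfns.append(tokens[0])
    return lfns


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "LFN=/store/mc/a.root Missing",
        "LFN=/store/mc/b.root SizeMismatch",
    ]
    for lfn in extract_lfns(sample, "Missing"):
        print(lfn)
```

To run it on a real report, pass the lines of the consistency check file (for example open("consistency_check_20091029.txt")) and write the returned list to a file, one lfn per line.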
missingFiles_20091029_missing.txt is the file you are interested in, and it contains a list of lfns. Suppose you have a file containing all the lost files in pfn format: you need to convert it to
lfn format to be able to compare the two lists.
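A pfn to lfn conversion could look like the sketch below. It assumes that the lfn is simply the part of the pfn starting at the first "/store/"; the real mapping is defined by the site's storage.xml (TFC), so check it before relying on this shortcut. The function name pfn_to_lfn and the example pfn are hypothetical.

```python
def pfn_to_lfn(pfn, marker="/store/"):
    """Strip everything before the first `marker` from a pfn.

    Assumption: the lfn starts at the first occurrence of "/store/".
    """
    index = pfn.find(marker)
    if index < 0:
        raise ValueError("no %r found in pfn: %s" % (marker, pfn))
    return pfn[index:].strip()


if __name__ == "__main__":
    # Hypothetical pfn, for illustration only.
    example = "/pnfs/example.org/data/cms/store/user/someone/file.root"
    print(pfn_to_lfn(example))
```

Applied line by line to the pfn list, this gives an lfn list that can be compared directly with the PhEDEx one.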
Finding categories
First of all, production files need to be found. It is assumed that a production file has no other replicas.
data_getNotReplicatedFiles.py takes an lfn list as input and checks for files with no other replicas (please modify the MY_SITE variable in the script accordingly). To use it, dbssql is needed:
$ wget --no-check-certificate https://cmsweb.cern.ch/dbs_discovery/dbssql
$ chmod +x dbssql
$ ./data_getNotReplicatedFiles.py lost-lfn.lst
In this case, lost-lfn_withReplica.lst will contain the files with replicas, while lost-lfn_noReplica.lst will contain the production files. Now we need to remove the production files from the full list:
$ sort lost-lfn.lst > lost-lfn_sorted.lst && sort lost-lfn_withReplica.lst > lost-lfn_withReplica_sorted.lst
$ comm -23 lost-lfn_sorted.lst lost-lfn_withReplica_sorted.lst > lost-lfn-2.lst
$ rm lost-lfn_sorted.lst lost-lfn_withReplica_sorted.lst
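Subtracting one lfn list from another, as the comm -23 step does, can also be sketched in Python without temporary sorted files. This is a minimal illustration, not part of the original scripts; the name subtract_lists is hypothetical. Unlike comm, it also collapses duplicate entries.

```python
def subtract_lists(all_lfns, to_remove):
    """Return the sorted, de-duplicated entries of all_lfns not in to_remove."""
    remove = {lfn.strip() for lfn in to_remove}
    kept = {lfn.strip() for lfn in all_lfns if lfn.strip()}
    return sorted(kept - remove)


if __name__ == "__main__":
    # Invented lfns, for illustration only.
    full = ["/store/b.root", "/store/a.root", "/store/c.root"]
    for lfn in subtract_lists(full, ["/store/b.root"]):
        print(lfn)
```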
Given lost-lfn-2.lst, we now need to divide it into categories. data_divideCases.py is a script which takes an lfn list as input and outputs a file for each of the following categories:
- LT07 (LoadTest files)
- Unmerged
- User
- Other
If a second lfn list is given, it is subtracted from the first; the resulting list is *_common.lst. So:
# ./data_divideCases.py lost-lfn-2.lst missingFiles_20091029_missing.txt
# ls *.lst
lost-lfn-2_common.lst lost-lfn-2_LT07.lst lost-lfn-2_Unmerged.lst
lost-lfn-2.lst lost-lfn-2_Other.lst lost-lfn-2_User.lst
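The kind of classification that produces these files can be sketched as below. The actual rules used by data_divideCases.py are not shown on this page: the path prefixes here are assumptions based on the usual CMS namespace layout, and the function name divide_cases is hypothetical, so adapt them to your site before use.

```python
# Assumed prefixes; they are NOT taken from data_divideCases.py.
RULES = [
    ("LT07", "/store/PhEDEx_LoadTest07/"),
    ("Unmerged", "/store/unmerged/"),
    ("User", "/store/user/"),
]


def divide_cases(lfns):
    """Group lfns by the first matching prefix; unmatched ones go to 'Other'."""
    cases = {name: [] for name, _ in RULES}
    cases["Other"] = []
    for lfn in lfns:
        for name, prefix in RULES:
            if lfn.startswith(prefix):
                cases[name].append(lfn)
                break
        else:
            # No rule matched this lfn.
            cases["Other"].append(lfn)
    return cases


if __name__ == "__main__":
    # Invented lfns, for illustration only.
    demo = ["/store/user/someone/f.root", "/store/mc/g.root"]
    for name, files in divide_cases(demo).items():
        print(name, len(files))
```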
Actions
"Sandbox.PhEDEx" files
The lost blocks are listed in the consistency check output (consistency_check_20091029.txt in the example above). You can use phedex_getMissingBlocks.py to retrieve them:
# ./phedex_getMissingBlocks.py consistency_check_20091116.txt
/InclusiveMu5_Pt350/Summer09-MC_31X_V3_7TeV_AODSIM-v1/AODSIM#a8fcf7cb-2132-4412-bc58-7da615c39f9f
/InclusiveMu5_Pt350/Summer09-MC_31X_V3_7TeV_AODSIM-v1/AODSIM#f68fa3a8-d590-42e9-88a4-7ad0601d17bd
/InclusiveMu5_Pt350/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#16664801-953f-4c6a-8451-c3cba277c92c
/InclusiveMu5_Pt350/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#b3dede4f-b8a0-46ec-a8c8-dd9a7f0b96d1
/InclusiveMu5_Pt350/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#f4732cba-6597-499f-8e83-df14f8cfce9e
/InclusiveMu5_Pt50/Summer09-MC_31X_V3_7TeV_AODSIM-v1/AODSIM#39f31108-2271-453c-a516-a84c53ed281d
/PhotonJet_Pt300/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#34d43ac1-96b5-4f4d-b5d3-2cd84a41261e
/PhotonJet_Pt300/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#e00557d7-2b8b-4b81-800e-12f217ce1c60
/PhotonJet_Pt3000/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#4c1c7820-9be8-4c9c-86e9-bd4933caf7e1
/PhotonJet_Pt30to50/Summer09-MC_31X_V3_7TeV_AODSIM-v1/AODSIM#046a5481-be3a-4040-9b9c-691ae07ab3f6
/PhotonJet_Pt30to50/Summer09-MC_31X_V3_7TeV_AODSIM-v1/AODSIM#a2218ab8-4c19-4f53-ac76-2f568013d3a7
/PhotonJet_Pt470/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#4fbca47e-248c-4f1c-9216-b4fb9b4c47de
/PhotonJet_Pt470/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#8addac52-ef86-440c-8a1a-86568ebc02b8
/PhotonJet_Pt470/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#9a78fcbc-cb39-488f-bfd5-912ff94d9518
/PhotonJet_Pt500toInf/Summer09-MC_31X_V3_7TeV_AODSIM-v1/AODSIM#f9280a19-c5c8-4860-95f3-37a0883bc372
/PhotonJet_Pt50to80/Summer09-MC_31X_V3_7TeV_AODSIM-v1/AODSIM#899a1674-301a-46cd-9f02-f2f34504ffb7
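When preparing a retransfer request it can help to group the missing blocks by dataset. The sketch below assumes only what the listing above shows, namely that a block name is a dataset path followed by "#" and a block identifier; the function name blocks_by_dataset is hypothetical.

```python
from collections import defaultdict


def blocks_by_dataset(block_names):
    """Map each dataset path to the list of its block identifiers."""
    grouped = defaultdict(list)
    for name in block_names:
        name = name.strip()
        if not name:
            continue
        # Split "dataset#block-id" at the first '#'.
        dataset, _, block_id = name.partition("#")
        grouped[dataset].append(block_id)
    return dict(grouped)


if __name__ == "__main__":
    blocks = [
        "/PhotonJet_Pt300/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#34d43ac1-96b5-4f4d-b5d3-2cd84a41261e",
        "/PhotonJet_Pt300/Summer09-MC_31X_V3_AODSIM-v1/AODSIM#e00557d7-2b8b-4b81-800e-12f217ce1c60",
    ]
    for dataset, ids in blocks_by_dataset(blocks).items():
        print(dataset, len(ids))
```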
"Production" files
These are the files contained in lost-lfn_noReplica.lst. Open a Savannah ticket to DataOps (https://savannah.cern.ch/support/?group=cmscompinfrasup), attaching the file list.
"LoadTest" files
These are part of the LoadTest used for debugging links. Open a Savannah ticket as above or, if there are many missing files, retransfer them from another site.
"User" files
Send a sad mail to the affected users.
"Unmerged" files
They should not be a problem; check with the Prod team.
"Other" files
You actually need to parse the list and see what they are...
--
LeonardoSala - 30-Oct-2009