Consistency Checks
Introduction
Periodically, we run a set of checks at the T1s and T2s to make sure that the files stored in their storage elements and the files registered in the database are in sync.
Types of Checks
Currently, these consist of:
- StorageConsistencyCheck (SCC) - Checks whether all files in the storage are registered in the database. Files that are not registered are called 'orphans' and are removed from the storage: since nobody knows they are present at the site, they are useless.
- BlockDownloadVerify (BDV) - Checks that all blocks registered in the DB as present at the site can actually be accessed, and that the sizes of the files in them match the sizes registered in the DB at file creation.
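Conceptually, the SCC reduces to a set difference between the LFNs in a site's storage dump and the LFNs registered in the database. The sketch below only illustrates that idea. It is not how SCCHelper.sh is implemented, and the function name `find_orphans` and the sample LFNs are hypothetical.

```python
def find_orphans(dump_lfns, registered_lfns):
    """Return LFNs that appear in the storage dump but not in the database.

    Hypothetical helper: both inputs are plain iterables of LFN strings.
    """
    registered = set(registered_lfns)
    seen = set()
    orphans = []
    for lfn in dump_lfns:
        lfn = lfn.strip()
        # Skip blank lines and duplicate entries in the dump.
        if not lfn or lfn in seen:
            continue
        seen.add(lfn)
        if lfn not in registered:
            orphans.append(lfn)
    return orphans


if __name__ == "__main__":
    # Made-up LFNs, for illustration only.
    dump = ["/store/mc/a.root", "/store/data/b.root", "/store/data/b.root"]
    database = ["/store/mc/a.root"]
    print(find_orphans(dump, database))
```

Files returned by such a comparison are only potential orphans; they still have to be reviewed as described in the SCC procedure.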
Procedure
SCC
- The coordination of the checks is currently done through GGUS Tickets.
- To start, a ticket is opened for the site in question, asking the site admins to provide a storage dump for their site.
- The dump should contain the following directories:
- /store/mc
- /store/data
- /store/generator
- /store/results
- /store/hidata
- /store/himc
- /store/lumi
- /store/relval
- Storage dump tools can be found at StorageDumps.
- When the site provides the full storage dump (lfn_list.txt), the actual check is performed by running:
- ./SCCHelper.sh --db ~/path_to/DBParam.Jorge:Prod/OPSJORGE --node T1_FR_CCIN2P3_Disk --dump ~/CC_April_2016/CCIN2P3/SCC/LFN.store.disk.txt --output ~/CC_April_2016/CCIN2P3/SCC/
- The script writes the dataset and file status of the potential orphan files to its output directory.
- (Orphans are files that are present at the site but not registered in TMDB, so nobody knows about them and they are useless.)
- Use this list to identify the true orphans. The major distinction is between files that belong to a dataset still in production at the site ("fake orphans") and files that belong to old, deprecated, or deleted datasets, or invalid files ("true orphans").
- The resulting list of true orphans is sent back to the site admins, requesting that they delete these files from their storage.
- Common issue: every deletion request approved after the dump was taken produces fake orphans:
- They appear in the dump because they were at the site when the dump was taken.
- They appear in the orphan list because, following the deletion request, they are no longer registered at the site.
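The separation of true orphans from fake orphans described above can be sketched as follows. This is an illustration under assumptions, not part of SCCHelper.sh: the function `split_orphans`, the mapping from LFN to dataset name, and the two dataset sets are all hypothetical inputs that the operator would have to assemble from the script output and the list of recently approved deletion requests.

```python
def split_orphans(orphan_datasets, in_production, deleted_after_dump=frozenset()):
    """Split potential orphans into (true_orphans, fake_orphans).

    orphan_datasets    -- dict mapping each potential orphan LFN to its dataset
    in_production      -- datasets still in production at the site
    deleted_after_dump -- datasets covered by deletion requests approved
                          after the dump was taken
    """
    true_orphans, fake_orphans = [], []
    for lfn, dataset in sorted(orphan_datasets.items()):
        if dataset in in_production or dataset in deleted_after_dump:
            # Still being produced, or already handled by a deletion request.
            fake_orphans.append(lfn)
        else:
            true_orphans.append(lfn)
    return true_orphans, fake_orphans


if __name__ == "__main__":
    # Made-up dataset names and LFNs, for illustration only.
    potential = {
        "/store/mc/old/f1.root": "/Old/Dataset/AODSIM",
        "/store/mc/new/f2.root": "/New/Dataset/AODSIM",
        "/store/data/gone/f3.root": "/Gone/Dataset/RAW",
    }
    true_o, fake_o = split_orphans(
        potential,
        in_production={"/New/Dataset/AODSIM"},
        deleted_after_dump={"/Gone/Dataset/RAW"},
    )
    print("true orphans:", true_o)
    print("fake orphans:", fake_o)
```

Only the first list would be sent back to the site admins for deletion.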
BDV
- A full set of BDV checks is injected for the site:
- Inject the BDV size test for all blocks at the site (use --expire 8640000 for sites with many files, such as T1_US_FNAL_Buffer):
- ./BlockDownloadVerify-injector.pl --db ~/path_to/DBParam.Jorge:Prod/OPSJORGE --node T1_ES_PIC_Disk --block % --test size --expire 2592000 --priority 1 --force --verbose
- The BDV checks are complete when no tests with status None, Active, Queued, or Error remain on the Data->Verification page. At that point, run the following command:
- ./BDVParser.sh --db ~/path_to/DBParam.Jorge:Prod/OPSJORGE --node T1_FR_CCIN2P3_Buffer --day 5 --output ~/CC_April_2016/CCIN2P3/BDV/ --verbose
- The script creates two lists: one for global invalidation and one for local invalidation. The invalidation instructions are at CompOpsTransferTeamFileInvalidations.
- Always ask the site to cross-check the results, because the BDV agent may mark a file as failed when there was only a transient error at the time the test ran.
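The two --expire values used above are easier to reason about in days. The sketch below assumes the option is expressed in seconds, which the values suggest but which this page does not state; the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def days_to_expire_seconds(days):
    """Convert a lifetime in days to seconds (assumed unit of --expire)."""
    return days * SECONDS_PER_DAY


def expire_seconds_to_days(seconds):
    """Convert an --expire value (assumed to be seconds) back to days."""
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for value in (2592000, 8640000):
        print(value, "seconds =", expire_seconds_to_days(value), "days")
```

Under that assumption, 2592000 corresponds to 30 days and 8640000 to 100 days, which is consistent with giving large sites more time to finish their tests.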
Notes
Scripts
BDV Test Status
- BDV tests status:
- 1) None == pending in DB, still in queue
- 2) Queued == pending in agent
- 3) Active == running now
- 4) OK == all files OK
- 5) Fail == some files not OK
- 6) Error == test didn't work (e.g. crashed)
- 7) Indeterminate == cannot run this test (e.g. a checksum test requested for files without a checksum, or empty blocks)
- Status comments :
- <1> Usually, the status will change like this: None -> Queued -> OK/Fail
- <2> If the status is Error, ask the site to check their agent logs and fix the agent if necessary.
- <3> If the status is Indeterminate, ignore the test: this status is intended for tests that cannot produce a result, such as tests on empty blocks.
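The status rules above, together with the completion criterion from the BDV procedure (no tests left in None, Active, Queued, or Error), can be summarized in a short sketch. The status strings come from this page; the function names and the idea of feeding in a plain list of status strings are assumptions made for illustration.

```python
# Statuses that mean the round is not finished yet (from the BDV procedure).
BLOCKING_STATUSES = {"None", "Queued", "Active", "Error"}
# Statuses that represent a final outcome for a test.
FINAL_STATUSES = {"OK", "Fail", "Indeterminate"}


def checks_completed(statuses):
    """True when no test is still pending, running, or in Error."""
    return not any(status in BLOCKING_STATUSES for status in statuses)


def summarize(statuses):
    """Count tests per status; unknown status strings raise ValueError."""
    counts = {}
    for status in statuses:
        if status not in BLOCKING_STATUSES | FINAL_STATUSES:
            raise ValueError("unknown BDV status: %r" % status)
        counts[status] = counts.get(status, 0) + 1
    return counts


if __name__ == "__main__":
    sample = ["OK", "OK", "Fail", "Indeterminate", "Queued"]
    print(summarize(sample))
    print("completed:", checks_completed(sample))
```

In this sketch, Indeterminate tests do not block completion, matching the rule that they are ignored, while Error tests do, since they need follow-up with the site.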
Node name for the scripts
- For T2 and T1_Disk endpoints, you should use:
- T1 disk node name: T1_XX_YYY_Disk (e.g. T1_DE_KIT_Disk)
- T2 disk node name: T2_XX_YYY (e.g. T2_US_Vanderbilt)
- For T1 tape endpoints (MSS/Buffer), you should use:
- T1_XX_YYY_Buffer (e.g. T1_DE_KIT_Buffer)
- DO NOT USE T1_XX_YYY_MSS
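The naming rules above can be encoded in a small helper that builds the value for --node and refuses _MSS names. This is a sketch only: the function `node_name` and its `endpoint` argument are hypothetical and not part of the scripts.

```python
def node_name(site, endpoint="disk"):
    """Return the node name to pass to --node for a site.

    site     -- base site name such as "T1_DE_KIT" or "T2_US_Vanderbilt"
    endpoint -- "disk" or "tape" (hypothetical argument for this sketch)
    """
    if site.endswith("_MSS"):
        raise ValueError("do not use _MSS node names; use the _Buffer node")
    tier = site.split("_", 1)[0]
    if tier == "T2":
        if endpoint != "disk":
            raise ValueError("only disk endpoints are described for T2 sites")
        return site
    if tier == "T1":
        if endpoint == "disk":
            return site + "_Disk"
        if endpoint == "tape":
            return site + "_Buffer"
        raise ValueError("endpoint must be 'disk' or 'tape'")
    raise ValueError("only T1 and T2 sites are covered by these rules")


if __name__ == "__main__":
    print(node_name("T1_DE_KIT"))
    print(node_name("T1_DE_KIT", "tape"))
    print(node_name("T2_US_Vanderbilt"))
```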
Schedule
Since 2011, the tests have been conducted every month at the T1s. Since 2012, they have been conducted a few times a year at the T2s.
Current round of ongoing checks:
Feb.2014
Previous rounds of checks:
T2 matrix
Topic revision: r39 - 2017-02-20