Consistency Checks

Introduction

Periodically, we run a set of checks at the T1s and T2s to make sure that the files stored in their storage elements and the files registered in the database are in sync.

Types of Checks

Currently, these consist of:

  • StorageConsistencyCheck (SCC) - Checks whether all files in the storage are registered in the database. Files that are not registered are called 'orphans' and are removed from the storage: since nobody knows they are at the site, they are useless.
  • BlockDownloadVerify (BDV) - Checks that all blocks registered in the DB as present at the site can actually be accessed, and that the sizes of their files match the sizes registered in the DB at file creation.

Procedure

SCC

  • The coordination of the checks is currently done through GGUS Tickets.
  • First, a ticket is opened to the site in question, asking the site admins to provide a storage dump for their site.
    • The dump should contain the following directories:
      • /store/mc
      • /store/data
      • /store/generator
      • /store/results
      • /store/hidata
      • /store/himc
      • /store/lumi
      • /store/relval
    • Storage dump tools can be found at StorageDumps.

  • When the site provides the full storage dump (lfn_list.txt), the actual check is performed by running:
    • ./SCCHelper.sh --db ~/path_to/DBParam.Jorge:Prod/OPSJORGE --node T1_FR_CCIN2P3_Disk --dump ~/CC_April_2016/CCIN2P3/SCC/LFN.store.disk.txt --output ~/CC_April_2016/CCIN2P3/SCC/
    • The script writes the dataset and file status of the potential orphan files to the output directory.
    • (Orphans are files that are present at the site but not registered in TMDB, so nobody knows about them and they are useless.)
    • Use this list to identify the true orphans. The main distinction is between files belonging to datasets still in production at the site ("fake orphans") and files belonging to old, deprecated or deleted datasets, or invalid files ("true orphans").
  • The resulting list of true orphans is sent back to the site admins, requesting that they delete these files from their storage.
  • Common issue: every deletion request approved after the dump was taken produces false orphan files:
    • They appear in the dump because they were at the site when the dump was taken.
    • They appear in the orphan list because the deletion request has since removed their registration at the site.
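
The comparison behind the SCC can be pictured as a set difference. The sketch below is not the SCCHelper.sh implementation, whose internals are not described on this page. It assumes, hypothetically, that the dump and the database listing are each available as a plain list of LFNs and that the directories of datasets still in production are known; all function names are illustrative.

```python
# Illustrative sketch only; not the SCCHelper.sh implementation.
# Assumed inputs: LFNs from the site dump, LFNs registered in the database
# for the node, and the directories of datasets still in production.

EXPECTED_PREFIXES = (
    "/store/mc", "/store/data", "/store/generator", "/store/results",
    "/store/hidata", "/store/himc", "/store/lumi", "/store/relval",
)


def missing_prefixes(dump_lfns):
    """Return the expected top-level directories with no file in the dump."""
    return [p for p in EXPECTED_PREFIXES
            if not any(lfn.startswith(p + "/") for lfn in dump_lfns)]


def potential_orphans(dump_lfns, registered_lfns):
    """Files present in the storage dump but not registered in the database."""
    return sorted(set(dump_lfns) - set(registered_lfns))


def split_orphans(orphans, active_prefixes):
    """Separate true orphans from fake ones (files under an active dataset)."""
    true, fake = [], []
    for lfn in orphans:
        if any(lfn.startswith(p.rstrip("/") + "/") for p in active_prefixes):
            fake.append(lfn)
        else:
            true.append(lfn)
    return true, fake
```

Files removed by a deletion request approved after the dump was taken would also land in the potential-orphan list here, which is why the list needs a manual review before it is sent to the site.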

BDV

  • Inject the full set of BDV checks for the site:
    • Inject a BDV size test for all blocks at the site (use --expire 8640000 for sites with lots of files, such as T1_US_FNAL_Buffer):
    • ./BlockDownloadVerify-injector.pl --db ~/path_to/DBParam.Jorge:Prod/OPSJORGE --node T1_ES_PIC_Disk --block % --test size --expire 2592000 --priority 1 --force --verbose
  • Wait until the BDV checks have completed, i.e. there are no tests with status None, Active, Queued or Error in the Data->Verification page. Then run the following command:
    • ./BDVParser.sh --db ~/path_to/DBParam.Jorge:Prod/OPSJORGE --node T1_FR_CCIN2P3_Buffer --day 5 --output ~/CC_April_2016/CCIN2P3/BDV/ --verbose
    • The script creates two lists: one for global invalidation and one for local invalidation. The invalidation instructions are at CompOpsTransferTeamFileInvalidations.
  • Always ask the site to cross-check the results, because the BDV agent may mark a file as failed when there was only a transient error at the time the test ran.
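
The two --expire values quoted above are exact multiples of 86400, which suggests they are given in seconds: 2592000 is 30 days and 8640000 is 100 days. The sketch below rests on that assumption and only reuses the flags shown in the injector command above; the helper names are hypothetical.

```python
# Sketch: build the injector command line from the flags shown above.
# Assumption: --expire is given in seconds (2592000 = 30 days).

SECONDS_PER_DAY = 24 * 60 * 60


def expire_seconds(days):
    """Convert a test lifetime in days to seconds."""
    return days * SECONDS_PER_DAY


def injector_command(db, node, expire_days=30):
    """Return the argument list for a size test on all blocks at a node."""
    return [
        "./BlockDownloadVerify-injector.pl",
        "--db", db,
        "--node", node,
        "--block", "%",
        "--test", "size",
        "--expire", str(expire_seconds(expire_days)),
        "--priority", "1",
        "--force",
        "--verbose",
    ]
```

For a large site, `injector_command(db, "T1_US_FNAL_Buffer", expire_days=100)` would give the longer lifetime mentioned above. The returned list can be passed to `subprocess.run` as is.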

Notes

Scripts
BDV Test Status
  • BDV tests status:
    • 1) None == pending in DB, still in queue
    • 2) Queued == pending in agent
    • 3) Active == running now
    • 4) OK == all files OK
    • 5) Fail == some files not OK
    • 6) Error == test didn't work (e.g. crashed)
    • 7) Indeterminate == cannot run this test (e.g. ask for checksum test of files without checksum, empty blocks)

  • Status comments :
    • <1> Usually, the status changes like this: None -> Queued -> OK/Fail
    • <2> If the status is Error, ask the site to check their agent logs and fix the agent if necessary.
    • <3> If the status is Indeterminate, ignore the test: this status is meant for tests that cannot produce a result, for example on empty blocks.
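
The status rules above can be condensed into a small lookup. This is a minimal sketch, assuming the statuses have already been collected as strings (for example, read off the Data->Verification page); the names `round_complete` and `follow_up` are illustrative and not part of any existing script.

```python
# Sketch of the status handling described above; names are illustrative.

# A round is not finished while any test is in one of these states.
UNFINISHED = {"None", "Queued", "Active", "Error"}

FOLLOW_UP = {
    "OK": "no action",
    "Fail": "list for invalidation after the site cross-checks",
    "Error": "ask the site to check the agent logs",
    "Indeterminate": "ignore",
}


def round_complete(statuses):
    """True when no test is still pending, queued, running or in error."""
    return not any(status in UNFINISHED for status in statuses)


def follow_up(status):
    """Suggested action for a status; statuses without an entry mean 'wait'."""
    return FOLLOW_UP.get(status, "wait")
```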

Node name for the scripts

  • For T2 and T1 disk endpoints, use:
    • T1 disk node name: T1_XX_YYY_Disk (e.g. T1_DE_KIT_Disk)
    • T2 disk node name: T2_XX_YYY (e.g. T2_US_Vanderbilt)
  • For T1 tape endpoints (MSS/Buffer), use:
    • T1_XX_YYY_Buffer (e.g. T1_DE_KIT_Buffer)
    • DO NOT USE T1_XX_YYY_MSS
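
A quick sanity check on the node name before launching a script can catch the _MSS mistake early. The function below is a hypothetical helper based only on the naming rules listed above; it is not part of the existing scripts.

```python
# Sketch: validate a node name against the rules listed above.

def check_node_name(node):
    """Return None if the name is usable, otherwise a short reason."""
    if node.endswith("_MSS"):
        return "use the _Buffer node instead of _MSS"
    if not node.startswith(("T1_", "T2_")):
        return "expected a T1 or T2 node"
    if node.startswith("T1_") and not node.endswith(("_Disk", "_Buffer")):
        return "T1 nodes need a _Disk or _Buffer suffix"
    return None
```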

Schedule

Since 2011, the tests have been conducted every month at the T1s. Since 2012, they have also been conducted a few times a year at the T2s.

Current round of ongoing checks:

Feb.2014

Previous rounds of checks:

T1s            T2s
Aug.2012       Dec2012
Sep.2012       Nov2012
Oct.2012       Sep2012
Nov.2012       Aug2012
Dec.2012       Jun. 2012
Jan.2013       Apr. 2012
August.2013    Mar. 2012
               Nov. 2013
               November.2013
T2 matrix
Topic revision: r39 - 2017-02-20 - unknown
 