DDMOperationsScripts

Introduction

This page lists the scripts written by the DDM Operations Group to manage DDM datasets and their associated physical files and catalog entries. They complement the functionality provided by Rucio.

The latest versions of the scripts described on this page are stored in gitlab under the atlas-adc-ddm-support group. Old scripts are in the subversion repository.

Tools for site administrators and cloud support

Dark data and lost files detection

  • DDMDarkDataAndLostFiles (new documentation still in development, feedback is welcome).
  • Lost or corrupted files: lost files are files that are recorded in the Rucio catalog but are no longer present on the storage element. The topic is described on a dedicated twiki page: DDMLostFiles.
  • Dark data cleaning: what dark data is and how to clean it is described on a separate twiki page: DDMDarkDataCleaning.

Tools for DDM operations (and/or ADC operations)

Monitoring and reporting scripts

Storage monitor and RRD plots

Tables and plots per RSE, with numbers for storage (free, used), Rucio, dark data, unlocked data, free-deletion target, primary-occupancy target etc., are available at http://adc-ddm-mon.cern.ch/ddmusr01/. The details are described in the gitlab repository. The scripts are executed every hour on aiatlasddm003 by ddmusr01.

A daily cron backs up the rrd files on the ddmusr01 EOS space:

# Backup rrd files used for space monitoring
00 08 * * * kinit -kt /data/ddmusr01/keytab ddmusr01@CERN.CH; /data/ddmusr01/backuprrd.sh
The backed-up files are at /eos/atlas/user/d/ddmusr01/rrdbackup.

Backlogs

A report of rules in trouble is sent to atlas-adc-ddm-ops with the subject "[STUCK/SUSP RULES]". The details are described in the gitlab repository. The scripts are executed every day at 8 AM on aiatlas114 by adcmusr1.

Lifetime model

More details are on the dedicated twiki page DDMLifetimeModel.

Manual distribution scripts

RPG

  • Replication Policy on the Grid: The tool used for automatic replication of data when Rucio subscriptions are not appropriate
  • Code and configs are here in DDM gitlab project: https://gitlab.cern.ch/atlas-adc-ddm/rpg
  • The main code is in RPG.py; it takes as argument a configuration file describing the datasets and the policy to apply.
    • How it works:
      • Scan the list of destination sites for datasets matching the pattern
      • Scan the list of source sites for datasets matching the pattern and metadata
      • If a dataset does not have the required number of replicas on the destinations, create Rucio rules
      • A limit can be applied on the total number of active rules to all destinations, to avoid e.g. overloading tape buffers
  • RPG runs via crons on ddmusr01@aiatlasddm003
    • Code: /data/ddmusr01/rpg/
    • Logs: /data/ddmusr01/log/
  • The machine is fully managed via puppet so any changes to crons or configuration files must be done through gitlab. Puppet automatically syncs the machine with the gitlab repo.
    • Manual changes on the machine will be overwritten!
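The replication loop described above can be sketched as a minimal, self-contained Python simulation. This is illustrative only: the real implementation is RPG.py in the gitlab repository, driven by a configuration file and the Rucio client; the data structures and the function name here are assumptions made for the example.

```python
# Sketch of the RPG decision loop described above (illustrative only;
# the real RPG.py talks to Rucio and reads a configuration file).
import fnmatch

def plan_rules(destinations, sources, pattern, copies, max_active_rules):
    """Return (dataset, destination) pairs for which Rucio rules
    should be created, respecting a cap on total active rules."""
    plans = []
    active = 0
    # Datasets already present on the destination sites
    have = {}
    for dest, datasets in destinations.items():
        for ds in datasets:
            have.setdefault(ds, set()).add(dest)
    # Scan the source sites for datasets matching the pattern
    for src, datasets in sources.items():
        for ds in datasets:
            if not fnmatch.fnmatch(ds, pattern):
                continue
            missing = copies - len(have.get(ds, ()))
            for dest in destinations:
                # The cap on active rules protects e.g. tape buffers
                if missing <= 0 or active >= max_active_rules:
                    break
                if dest not in have.get(ds, set()):
                    plans.append((ds, dest))
                    have.setdefault(ds, set()).add(dest)
                    missing -= 1
                    active += 1
    return plans

# Hypothetical site and dataset names, for illustration
destinations = {"SITE-A_DATATAPE": ["data18.A"], "SITE-B_DATATAPE": []}
sources = {"SITE-C_DATADISK": ["data18.A", "data18.B", "mc16.X"]}
print(plan_rules(destinations, sources, "data18.*", 1, 10))
```

Here data18.A already has its required replica, mc16.X does not match the pattern, so only data18.B gets a new rule.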

Migration of data from disk to tape

  • RPG (see above) transfers datasets from disk to T1 tape. It manages the queue of rules at each site and includes a delay between task completion and migration, to allow the dataset to be deleted in case of errors.
  • Several configuration files for this script are used for managing different types of datasets.

How to decommission or migrate an RSE

The procedures to decommission a site and to migrate an old RSE to a new one are very similar. The process should be organised well in advance and followed up with DDM Ops; a JIRA ticket in the DDMOps project should be opened for this purpose. The connection with the Panda queue update is detailed in this page. The procedure for these two operations is detailed below. Items in purple can be done by the cloud squad, whereas items in green must be done by DDM central operations. Items in uppercase Roman numerals (I, II, III, IV...) are needed for the migration only; items in Arabic numerals (1, 2, 3, 4...) are needed for both the decommissioning and the migration:

  1. Setup the new RSE in AGIS
  2. Set up the new Panda queues to point to the new RSE. If you want to use the computing resources of the old site during the RSE migration, modify the old Panda queues to point to the new RSE instead

  1. Set the following RSE attributes :
    • Goal: stop storing new files on the old RSE
    • For all storage classes (DATADISK, SCRATCHDISK, LOCALGROUPDISK...):
      • rucio-admin rse set-attribute --rse ENDPOINT --key dontkeeplog --value TRUE
        • It prevents the replication of distributed log datasets to the site (should have been set to TRUE by default for any T3 or test storage)
        • The status can be checked with rucio list-rses --expression dontkeeplog=True (WARNING: the info is cached on different servers; it should not take more than an hour for them to be synchronised)
      • rucio-admin rse set-attribute --rse ENDPOINT --key greedyDeletion --value TRUE
        • It enables the reaper in greedy mode for the site, i.e. replicas that are no longer protected by a rule are deleted as soon as possible instead of only when free space is needed
      • rucio-admin rse set-attribute --rse ENDPOINT --key bb8-enabled --value FALSE
        • It prevents the rebalancing of datasets to the site (should have been set to FALSE by default for any T3 or test storage)
        • The status can be checked with rucio list-rses --expression bb8-enabled=True (WARNING: the info is cached on different servers; it should not take more than an hour for them to be synchronised)
    • If the site includes a SCRATCHDISK endpoint, also set the following attribute:
      • rucio-admin rse set-attribute --rse ENDPOINT --key notforextracopy --value TRUE
        • It prevents extra copies of users' datasets being made at the site
    • Warning: if the storage is no longer accessible, the reaper should be run with the mock protocol (no replication; only the Rucio catalog is cleaned):
      • /usr/bin/rucio-reaper --threads-per-worker 5 --greedy --scheme MOCK --include-rses RSE
      • This may require the MOCK protocol to be added to the endpoint (to check), via the API: Client.add_protocol(RSE_name, {'impl': 'rucio.rse.protocols.mock.Default', 'scheme': 'mock'})
  2. Blacklist the old RSE for writing in AGIS
    • curl --cert $X509_USER_PROXY --key $X509_USER_PROXY -k -X GET 'https://atlas-agis-api.cern.ch/request/ddmendpointstatus/update/set_location_status/?json&ddmendpoint=ENDPOINT&activity=uw&value=OFF&reason=decommissioning&expiration=2020-01-01T00:00:00'
    • Substitute ENDPOINT with the old RSE name
    • To blacklist all old RSEs in a site: curl --cert $X509_USER_PROXY --key $X509_USER_PROXY -k -X GET 'https://atlas-agis-api.cern.ch/request/ddmendpointstatus/update/set_location_status/?json&site=SITE&activity=uw&value=OFF&reason=decommissioning&expiration=2020-01-01T00:00:00'
  3. Replicate all unique data using BB8 in decommissioning mode:
    • rucio-bb8 --decomission ENDPOINT 1000000000000000
    • Datasets are replicated randomly on the Grid (there is no automatic tool to replicate from one endpoint to another)
  4. Wait for the replication to complete
    • If after 1 month some rules are still not replicated, investigate
  5. Check what is left, either from the dump produced the next day or with a long query on the DB
    • To get the dump: curl -o rse_dump "https://rucio-hadoop.cern.ch/replica_dumps?rse=RSE&date=DD-MM-YYYY", substituting today's date and the old RSE name. The dump is bzipped.
      • Make sure to specify the date, since by default you will get the last non-empty dump
      • If there are no replicas left, the curl will produce an HTML error message (and hence bunzip2 will fail)
      • To double-check that dumps from today are OK, try to download a dump for another RSE. If that also fails there may be a general problem, so check previous days or use the DB query
    • DB query: select * from atlas_rucio.replicas where rse_id=atlas_rucio.rse2id('...')
    • Possible leftovers:
      • Files with no locks and no tombstones, which can result from leaks in the judge; a tombstone can be set manually with rucio-admin replicas set-tombstone --rse ENDPOINT scope:name
      • Files used as sources for ongoing transfers. Depending on the urgency, either wait for the transfers to finish or manually remove the entries from the sources table (this risks failing the transfers, but other sources should be retried)
  6. Run reaper again if necessary
  7. Check the next day's dump or run the query again; there should now be nothing left
  8. Disable the old RSE in AGIS
  9. Delete the RSE in Rucio (rucio-admin rse delete)
    • This will likely fail because the RSE counters are not in sync (rucio list-rse-usage shows a negative number of files and bytes); in that case the counters need to be corrected in the DB.
    • Check the log of the reaper responsible for the RSE; if there are sudden tracebacks, some files were probably forgotten.
  10. Inform the site that they may take back the storage. Anything left in the namespace is dark data and can be deleted.
  11. If the whole site is being decommissioned:
    • The services behind the old RSE(s) should be disabled in AGIS
    • Make sure no SAM tests are running against the site
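The attribute settings in step 1 are repetitive across endpoints, so a short helper can generate the exact rucio-admin commands listed above. This is a sketch that only prints command strings; the endpoint names are placeholders and nothing is sent to Rucio.

```python
# Sketch: build the rucio-admin commands from step 1 for every
# endpoint of a site being decommissioned. Only command strings are
# produced; nothing talks to Rucio.
def decommission_commands(endpoints):
    cmds = []
    for rse in endpoints:
        # Attributes set on every storage class
        attrs = [("dontkeeplog", "TRUE"),
                 ("greedyDeletion", "TRUE"),
                 ("bb8-enabled", "FALSE")]
        # Extra attribute for SCRATCHDISK endpoints only
        if rse.endswith("SCRATCHDISK"):
            attrs.append(("notforextracopy", "TRUE"))
        for key, value in attrs:
            cmds.append("rucio-admin rse set-attribute "
                        f"--rse {rse} --key {key} --value {value}")
    return cmds

# Placeholder endpoint names for illustration
for cmd in decommission_commands(["SITE_DATADISK", "SITE_SCRATCHDISK"]):
    print(cmd)
```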
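For step 5, a downloaded dump can be sanity-checked before running bunzip2. As noted above, if no replicas are left the server returns an HTML error page instead of a bzip2 file; the bzip2 magic bytes "BZh" make the two easy to tell apart. The line-per-replica layout assumed in the demo below is illustrative; check it against a real dump.

```python
# Sketch for step 5: sanity-check a downloaded replica dump.
import bz2
import os
import tempfile

def dump_kind(path):
    """Classify a downloaded dump by its first bytes."""
    with open(path, "rb") as f:
        head = f.read(3)
    return "bz2" if head == b"BZh" else "not-bz2 (likely an HTML error page)"

def count_replicas(path):
    """Count the lines (replicas) in a valid bzipped dump."""
    with bz2.open(path, "rt") as f:
        return sum(1 for _ in f)

# Demo with a tiny fake dump (one replica line; field layout illustrative)
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bz2")
tmp.write(bz2.compress(b"SITE_DATADISK\tscope\tfile1\n"))
tmp.close()
print(dump_kind(tmp.name), count_replicas(tmp.name))  # → bz2 1
os.unlink(tmp.name)
```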

Legacy scripts

Scripts used in the past have their own twiki page: DDMOperationsScriptsHistory.


Major updates:
-- StephaneJezequel - 01 Jun 2007

Responsible: TomasKouba
Last reviewed by: Never reviewed

Topic attachments
  • consistencycheck_clean.tgz (tgz, 12.5 K, 2011-02-03, ErmingPei): SE/LFC/DQ2 consistency check and dark data cleaning
  • getsrmv1Files.py.txt (txt, 1.7 K, 2011-04-18, GraemeAStewart): list LFC entries which do not correspond to DDM end points in a cloud
Topic revision: r151 - 2020-09-30 - NicoloMagini1
 