DDMOperationsScripts
Introduction
This page lists all scripts written by the DDM Operations Group to manage DDM datasets and the associated physical files and catalog entries. They complement the Rucio functionality.
The latest versions of the scripts described on this page are stored in GitLab under the atlas-adc-ddm-support group. Old scripts are in the subversion repository.
Tools for site administrators and cloud support
Dark data and lost files detection
- DDMDarkDataAndLostFiles (new documentation still in development, feedback is welcome).
- Lost or corrupted files: lost files are files recorded in the Rucio catalog but no longer present on the storage element. The topic is described on a dedicated twiki page: DDMLostFiles.
- Dark data cleaning: what dark data is and how to clean it is described on a separate twiki page: DDMDarkDataCleaning
Tools for DDM operations (and/or ADC operations)
Monitoring and reporting scripts
Storage monitor and RRD plots
Tables and plots per RSE showing storage (free, used), Rucio, dark data, unlocked, free deletion target, primary occupancy target, etc. are available at
http://adc-ddm-mon.cern.ch/ddmusr01/
The details are described in the gitlab repository. The scripts are executed every hour on aiatlasddm003 by ddmusr01.
A daily cron backs up the rrd files on the ddmusr01 EOS space:
# Backup rrd files used for space monitoring
00 08 * * * kinit -kt /data/ddmusr01/keytab ddmusr01@CERN.CH; /data/ddmusr01/backuprrd.sh
The backed up files are at
/eos/atlas/user/d/ddmusr01/rrdbackup
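A backed-up RRD file can be inspected directly from the EOS area. This is a sketch only: the file name is a placeholder and rrdtool is assumed to be available (e.g. on lxplus, where EOS is mounted):
ls /eos/atlas/user/d/ddmusr01/rrdbackup/
rrdtool info /eos/atlas/user/d/ddmusr01/rrdbackup/SOME_RSE.rrd         # structure of one archive (placeholder file name)
rrdtool lastupdate /eos/atlas/user/d/ddmusr01/rrdbackup/SOME_RSE.rrd   # most recent datapoint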
Backlogs
A report of rules in trouble is sent to atlas-adc-ddm-ops with the subject "[STUCK/SUSP RULES]". The details are described in the gitlab repository. The scripts are executed every day at 8 AM on aiatlas114 by adcmusr1.
Lifetime model
More details are on the dedicated twiki page DDMLifetimeModel.
Manual distribution scripts
RPG
- Replication Policy on the Grid: The tool used for automatic replication of data when Rucio subscriptions are not appropriate
- Code and configs are here in DDM gitlab project: https://gitlab.cern.ch/atlas-adc-ddm/rpg
- The main code is in RPG.py; it takes as argument a configuration file which describes the datasets and the policy to apply.
- How it works (a minimal sketch follows this list):
- Scan the list of destination sites for datasets matching the pattern
- Scan the list of source sites for datasets matching the pattern and metadata
- If a dataset does not have the required number of replicas on the destinations, create Rucio rules
- A limit can be applied on the total number of active rules to all destinations, to avoid e.g. overloading tape buffers.
- RPG runs via crons on ddmusr01@aiatlasddm003
- Code: /data/ddmusr01/rpg/
- Logs: /data/ddmusr01/log/
- The machine is fully managed via puppet so any changes to crons or configuration files must be done through gitlab. Puppet automatically syncs the machine with the gitlab repo.
- Manual changes on the machine will be overwritten!
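The loop below is a minimal, hypothetical sketch of the cycle described above, written with plain Rucio CLI calls. RPG.py itself uses the Rucio Python API and a configuration file; the scope, pattern and RSE name here are invented examples, and CLI option syntax may differ between Rucio versions:
#!/bin/bash
# Hypothetical illustration of one RPG cycle (not the real RPG.py)
PATTERN="data18_13TeV.*.DAOD_PHYS.*"   # dataset name pattern (placeholder)
DEST_RSE="SOMESITE_DATADISK"           # destination RSE (placeholder)
# scan for datasets matching the pattern
for did in $(rucio list-dids --short --filter 'type=dataset' "data18_13TeV:${PATTERN}"); do
  # skip datasets that already have a replica at the destination
  if rucio list-dataset-replicas "${did}" | grep -q "${DEST_RSE}"; then
    continue
  fi
  # otherwise create a rule for one copy at the destination
  # (the real RPG also caps the total number of active rules per destination; omitted here)
  rucio add-rule "${did}" 1 "${DEST_RSE}"
done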
Migration of data from disk to tape
- RPG (see above) transfers datasets from disk to T1 tape. It manages the queue of rules at each site and includes a delay between task completion and migration, to allow the dataset to be deleted in case of errors.
- Several configuration files for this script are used for managing different types of datasets.
How to decommission or migrate an RSE
The procedures to decommission a site or to migrate an old RSE to a new one are very similar. The process should be organised well in advance and followed up with DDM Ops. A JIRA ticket in the DDMOps section should be opened for this purpose.
The connection with the Panda queue update is detailed on this page. The procedure for these two operations is detailed below. Items in Purple can be done by the Cloud squad, whereas items in Green must be done by DDM Central operations. Items in uppercase Roman numerals (I, II, III, IV...) are needed for the migration only; items in Arabic numerals (1, 2, 3, 4...) are needed for both the decommissioning and the migration:
- Set up the new RSE in AGIS
- Point the new Panda queues to the new RSE. If you want to use the computing resources of the old site during the RSE migration, modify the old Panda queues to point to the new RSE
- Set the following RSE attributes:
- Goal: stop storing new files on the old RSE
- For all storage classes (DATADISK, SCRATCHDISK, LOCALGROUPDISK...):
- rucio-admin rse set-attribute --rse ENDPOINT --key dontkeeplog --value TRUE
- This prevents the replication of distributed log datasets to the site (it should already be TRUE by default for any T3 or test storage)
- The status can be checked with rucio list-rses --expression dontkeeplog=True (warning: the information is cached on different servers, so it can take up to an hour for them to be synchronised)
- rucio-admin rse set-attribute --rse ENDPOINT --key greedyDeletion --value TRUE
- This enables the reaper in greedy mode for the site, i.e. everything that can be deleted is deleted immediately, without waiting for the free-space threshold to be reached
- rucio-admin rse set-attribute --rse ENDPOINT --key bb8-enabled --value FALSE
- This prevents the rebalancing of datasets to the site (it should already be FALSE by default for any T3 or test storage)
- The status can be checked with rucio list-rses --expression bb8-enabled=True (warning: the information is cached on different servers, so it can take up to an hour for them to be synchronised)
- If the site includes a SCRATCHDISK endpoint, also set the following attribute:
- rucio-admin rse set-attribute --rse ENDPOINT --key notforextracopy --value TRUE
- This prevents extra copies of users' datasets from being placed at the site
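- The attribute settings above can be applied in one go with a loop like the following (a sketch only; the endpoint names are placeholders for the old site's actual RSEs):
for ENDPOINT in OLDSITE_DATADISK OLDSITE_SCRATCHDISK OLDSITE_LOCALGROUPDISK; do
  # attributes needed for every storage class of the old site
  rucio-admin rse set-attribute --rse "$ENDPOINT" --key dontkeeplog --value TRUE
  rucio-admin rse set-attribute --rse "$ENDPOINT" --key greedyDeletion --value TRUE
  rucio-admin rse set-attribute --rse "$ENDPOINT" --key bb8-enabled --value FALSE
done
# notforextracopy is only relevant for the SCRATCHDISK endpoint
rucio-admin rse set-attribute --rse OLDSITE_SCRATCHDISK --key notforextracopy --value TRUE
# verify (allow up to an hour for the caches to synchronise)
rucio list-rses --expression dontkeeplog=True
rucio list-rses --expression bb8-enabled=True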
- If the storage is no longer accessible (so no data can be replicated away), the reaper should be run with the mock protocol so that only the Rucio catalog is cleaned:
- /usr/bin/rucio-reaper --threads-per-worker 5 --greedy --scheme MOCK --include-rses RSE
- This may require the MOCK protocol to be added to the endpoint first (to be checked), e.g. via the Python API:
from rucio.client import Client
Client().add_protocol(RSE_name, {'impl': 'rucio.rse.protocols.mock.Default', 'scheme': 'mock'})
- Blacklist the old RSE for writing in AGIS
- curl --cert $X509_USER_PROXY --key $X509_USER_PROXY -k -X GET 'https://atlas-agis-api.cern.ch/request/ddmendpointstatus/update/set_location_status/?json&ddmendpoint=ENDPOINT&activity=uw&value=OFF&reason=decommissioning&expiration=2020-01-01T00:00:00'
- Substitute ENDPOINT with the old RSE name
- To blacklist all old RSEs in a site:
curl --cert $X509_USER_PROXY --key $X509_USER_PROXY -k -X GET 'https://atlas-agis-api.cern.ch/request/ddmendpointstatus/update/set_location_status/?json&site=SITE&activity=uw&value=OFF&reason=decommissioning&expiration=2020-01-01T00:00:00'
- Replicate all unique data using BB8 in decommissioning mode:
- rucio-bb8 --decomission ENDPOINT 1000000000000000
- Datasets are replicated randomly over the Grid (there is no automatic tool to replicate from one specific site to another)
- Wait for the replication to complete
- If after one month some rules have still not been replicated, investigate
- Check what is left, either from the dump produced the next day or with a (long) query on the DB
- To get the dump: curl -o rse_dump "https://rucio-hadoop.cern.ch/replica_dumps?rse=RSE&date=DD-MM-YYYY", substituting today's date and the old RSE name. The dump is bzipped.
- Make sure to put the date, since by default you will get the last non-empty dump
- If there are no replicas left, the curl will produce an HTML error message (and hence bunzip2 will fail)
- To double-check that today's dumps are OK, try to download a dump from another RSE. If that fails there may be a general problem, so check previous days or use the DB query
- DB query:
select * from atlas_rucio.replicas where rse_id=atlas_rucio.rse2id('...')
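- For the dump option above, a concrete example (a sketch only; the RSE name is a placeholder):
RSE=OLDSITE_DATADISK                   # old RSE name (placeholder)
DATE=$(date +%d-%m-%Y)                 # the dump URL expects DD-MM-YYYY
curl -o rse_dump.bz2 "https://rucio-hadoop.cern.ch/replica_dumps?rse=${RSE}&date=${DATE}"
bunzip2 rse_dump.bz2                   # fails if curl returned an HTML error page, i.e. no replicas are left
wc -l rse_dump                         # number of replicas still registered on the RSE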
- Possible leftovers:
- Files with no locks and no tombstones can be created by leaks in the judge; a tombstone can be set manually with rucio-admin replicas set-tombstone --rse ENDPOINT scope:name
- Files used as sources for ongoing transfers. Depending on the urgency, either wait for the transfers to finish or manually remove the entries from the sources table (this risks failing the transfer, but other sources should be retried)
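- If many leftover files need a tombstone, the command above can be run in a loop (a sketch only; the DIDs and the endpoint name are placeholders taken from the dump or the DB query):
for did in scope1:file1 scope2:file2; do
  rucio-admin replicas set-tombstone --rse OLD_ENDPOINT "$did"
done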
- Run the reaper again if necessary
- Check the next day's dump or run the query again; there should now be nothing left
- Disable the old RSE in AGIS
- Delete the RSE in Rucio (rucio-admin rse delete)
- This will likely fail because the RSE counters are not in sync (rucio list-rse-usage shows a negative number of files and bytes); in that case the counters need to be corrected in the DB
- Check the log of the reaper responsible for the RSE: if there are sudden tracebacks, some files were probably forgotten
- Inform the site that they may take back the storage. Anything left in the namespace is dark data and can be deleted.
- If the whole site is being decommissioned:
- The services behind the old RSE(s) should be disabled in AGIS
- Make sure no SAM tests are running against the site
Legacy scripts
Scripts used in the past have their own twiki page
DDMOperationsScriptsHistory
Major updates:
--
StephaneJezequel - 01 Jun 2007
Responsible:
TomasKouba
Last reviewed by:
Never reviewed