DDMOperationProcedures

This page is deprecated! Use the new one: DDMOperationsGroup

Introduction

DDM Operations is responsible for keeping files and datasets on Storage Elements healthy. This means:

  • Keeping the location catalog consistent (mainly lost replicas)
  • Keeping the DDM, LFC and SE catalogs consistent with each other (mainly lost files)
  • Keeping file checksums consistent with the values registered in DDM/LFC (corrupted files)

Procedures accessible to users/shifters

Checksum error triggered by dq2-get/lcg-cp

DDM tools can check the consistency of a file on Storage against the LFC/DDM catalogs (filled when the file is registered in DDM, before any replication) on a file-by-file basis. As soon as you have a doubt about a file, follow this procedure (a checksum-comparison sketch is given after the list):

  1. Check the file consistency:
    • If the file belongs to a tid dataset: run a script to check the consistency of the file on the Storage at the source T1: link
    • If the file does not belong to a tid dataset: use dq2-get, which will copy the file locally, compute the checksum and report any inconsistency
  2. If the file is not corrupted on the source T1 Storage:
    • Run dq2-get to check the file consistency on the SE used by your application. If the file is correct, go to the next point. If the file is not correct, fill a Savannah ticket to DDM Ops providing the dataset name and the file name. Somebody with special privileges will do the cleaning (not automatic yet). Consider the file as lost.
    • Check within your application. For example, it is possible that the file was not copied to the scratch disk associated with the CPU because it was full, or the copy timed out before the file was completely copied.
  3. If the file is corrupted at the source T1, it needs to be deleted from the Storage and DDM. Fill a Savannah ticket to DDM Ops providing the dataset name and the file name. Somebody with special privileges will do the cleaning (not automatic yet). Consider the file as lost.
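
The checksum values used throughout this page (the ad:xxxxxxxx strings) are Adler32 checksums. The following minimal Python sketch, assuming you already have a local copy of the file and using an illustrative file name and catalogue value, shows how such a comparison can be done by hand; it is not an official DDM tool.

import zlib

def adler32_of(path, blocksize=1024 * 1024):
    # Compute the Adler32 checksum of a local file, formatted like DDM ("ad:" + 8 hex digits)
    value = 1  # Adler32 seed value
    f = open(path, 'rb')
    chunk = f.read(blocksize)
    while chunk:
        value = zlib.adler32(chunk, value)
        chunk = f.read(blocksize)
    f.close()
    return 'ad:%08x' % (value & 0xffffffff)

# Illustrative values: the file name and catalogue checksum are examples, not real entries
catalog_checksum = 'ad:c22efab5'  # value reported by dq2-list-files / LFC
local_checksum = adler32_of('myfile.root.1')
if local_checksum != catalog_checksum:
    print('Checksum mismatch (%s != %s): the file is potentially corrupted'
          % (local_checksum, catalog_checksum))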

Procedures to be done by DDM Ops expert

The whole system shutdown

Communication

Well in advance of the work:

  • Notify cs-ops about the downtime of the httpd
  • The services running Apache:
    • Central catalogs (voatlas03, voatlas05, voatlas26, voatlas72, voatlas146)
    • Collectors (voatlas154)
    • Recovery service (atlddm32)
    • These correspond to the nodes voatlas03, voatlas05, voatlas26, voatlas72, voatlas146, voatlas154, atlddm32. All nodes should be set in maintenance mode.

Stopping the services
Recovery service
atlddm32
ssh ddmusr01@<hostname>
dashb-agent-list
dashb-agent-stop dq2consistencyagent
exit
ssh <hostname>
sudo /sbin/service httpd status
sudo /sbin/service httpd stop
Victor
atlas-ddm-cleaning
 
Site services
voatlas113 voatlas114 voatlas115 voatlas116 voatlas117 atlddm18 atlddm19 atlddm24 atlddm28 atlddm29 atlddm35
ssh ddmusr01@<hostname>
dashb-agent-list
dashb-agent-stop dq2agents
Santa Claus
voatlas09
ssh ddmusr01@voatlas09
dashb-agent-list
dashb-agent-stop SantaClaus-STEPT1T1
dashb-agent-stop SantaClaus-beam
dashb-agent-stop SantaClaus-hi
dashb-agent-stop SantaClaus-muoncalib
dashb-agent-stop SantaClaus-HIST
dashb-agent-list
Deletion service
voatlas66 voatlas68
ssh $host sudo /etc/init.d/dq2-deletionagents status
ssh $host sudo /etc/init.d/dq2-deletionagents stop
Tracker
voatlas22
ssh $host sudo /etc/init.d/dq2-consistencyagents status
ssh $host sudo /etc/init.d/dq2-consistencyagents stop
Collectors
voatlas127 voatlas154
prevent the cron jobs from running
ssh $host sudo mv /opt/dq2/etc/dq2.cfg  /opt/dq2/etc/dq2.cfg.$(date +%Y%m%d)
stop the httpd
ssh $host sudo /etc/init.d/httpd status
ssh $host sudo /etc/init.d/httpd stop
Popularity
voatlas89
ssh $host sudo /etc/init.d/lighttpd status
ssh $host sudo /etc/init.d/lighttpd stop
Central catalogs
voatlas03 voatlas05 voatlas26 voatlas72 voatlas146
  • stop httpd and prevent auto-restart of httpd
    ssh $host 
    /etc/init.d/httpd status
    sudo /etc/init.d/httpd stop
    sudo mv /etc/http/conf/http.conf /etc/http/conf/http.conf.$(date +%Y%m%d)

Tracer
The Tracer module is part of the central catalogs; there are no separate nodes.

Read/Write services:

  • Central catalogs
  • Tracker
  • Collectors

Read only services:

  • Recovery service
  • Deletion service
  • Site services
  • Santa Claus
  • Popularity

cf. https://svnweb.cern.ch/trac/dq2/wiki/ATLR%20-%3E%20ADCR%20migration

The whole system restart

Starting the services
httpd nodes to be brought back into production:

Central catalogs
voatlas03 voatlas05 voatlas26 voatlas72 voatlas146
  • restore the httpd configuration and restart httpd
    ssh $host 
    sudo mv /etc/http/conf/http.conf.YYYYMMDD /etc/http/conf/http.conf
    sudo /etc/init.d/httpd status
    sudo /etc/init.d/httpd start

Collectors
voatlas127 voatlas154
move back the cfg file and restart the httpd
ssh $host 
sudo ls -lt /opt/dq2/etc
sudo mv /opt/dq2/etc/dq2.cfg.YYYYMMDD  /opt/dq2/etc/dq2.cfg
sudo /etc/init.d/httpd status
sudo /etc/init.d/httpd start

Tracker
voatlas22
ssh $host 
sudo /etc/init.d/dq2-consistencyagents status
sudo /etc/init.d/dq2-consistencyagents start

Popularity
voatlas89
ssh $host 
sudo /etc/init.d/lighttpd status
sudo /etc/init.d/lighttpd start

Deletion service
voatlas66 voatlas68
ssh $host 
ls -lt /opt/dq2/etc
sudo /etc/init.d/dq2-deletionagents status
sudo /etc/init.d/dq2-deletionagents start

Site services
voatlas113 voatlas114 voatlas115 voatlas116 voatlas117 atlddm18 atlddm19 atlddm24 atlddm28 atlddm29
ssh ddmusr01@<hostname>
ls -lt /opt/dq2/etc/
dashb-agent-list
dashb-agent-start dq2agents
tail -f /var/log/dq2/dq2.log
spare site service node
atlddm35 -- do nothing

Recovery service
atlddm32
ssh ddmusr01@<hostname>
dashb-agent-list
dashb-agent-start dq2consistencyagent
exit
ssh <hostname>
sudo /sbin/service httpd status
sudo /sbin/service httpd start

Santa Claus
voatlas09
ssh ddmusr01@voatlas09
dashb-agent-list
dashb-agent-start SantaClaus-STEPT1T1 ; tail -f /var/log/dq2/SantaClaus-STEPT1T1.log
dashb-agent-start SantaClaus-beam ; tail -f /var/log/dq2/SantaClaus-beam.log
dashb-agent-start SantaClaus-hi ; tail -f /var/log/dq2/SantaClaus-hi.log
dashb-agent-start SantaClaus-muoncalib ; tail -f /var/log/dq2/SantaClaus-muoncalib.log
dashb-agent-start SantaClaus-HIST ; tail -f /var/log/dq2/SantaClaus-HIST.log
dashb-agent-list

Victor
atlas-ddm-cleaning
 

Tracer
The Tracer module is part of the central catalogs; there are no separate nodes.

ToA

Recipe for introduction in ToA

Necessary Information

  • To include a site in the ToA, the site has to provide:
    • the site name in the GOCDB (if registered in GOCDB)
    • the name of the associated cloud
    • the SE name (example: srm://lapp-se01.in2p3.fr/dpm/in2p3.fr/home/atlas)
    • information about space token availability
    • the email address of the responsible person (if not provided already)
    • 'seinfo' is necessary for the site to be used in ganga analysis

  • It can be useful to check the SE software version
  • For any new GOCDB entry, atlas-grid-installATcernDOTch should be informed to set up the parameters for the Installation DB (especially to define DQ2_LOCAL_SITE_ID when running $VO_ATLAS_SW_DIR/ddm/latest/setup.sh)

Savannah Ticket

  • The request with the necessary information should be posted in DDM Ops Savannah.

Site Name

  • The name in ToA will be_ .
    • Space tokens are described in this Twiki (Please pay attention to the acls).
    • If the site name does not follow the above convention, it will not appear on the dashboard automatically.
      • then, ddm-ops needs to contact the dashboard support.

Information Propagation

  • Central Catalog -- After the update of ToA, the central catalog is updated automatically, but it can take some time (up to 1800 sec).
  • Site Services
  • drchecker needs update (does not automatically import ToA sites)
  • PD2P if necessary

Recipe for removal from ToA

To remove a site (DDM endpoint) from DDM, here is the list of actions to be done (mainly by the cloud responsible):

  • The cloud/site responsible is to:
    • inform users on the Grid Announcements Hypernews (2 weeks in advance) that, in order to keep data in their SCRATCHDISK or LOCALGROUPDISK area, they should migrate their data
    • ensure that no other applications still point to these DDM endpoints:
      • Panda queues pointing at the endpoint
      • Automatic data distribution (AK47) and associated monitoring tools. If needed, redefine the shares
      • PD2P
    • prepare the list of datasets to be migrated to another place
  • The cloud/site manager fills a DDM Ops Savannah ticket to dq2-ddm-ops to request :
    • First step
      • Kill subscriptions (can be done with atlas/Role=production : dq2-delete-subscription)
      • Site is put with ReadOnlyPermission acl in ToA (no possibility for normal users to run dq2-put or make subscriptions)
      • Remove the DDM endpoints from DDM SS (no subscription processed)
      • Do data migration to another site (should be minimized as much as possible), look at the document
    • Second step
      • Remove all the files/datasets from the site (can be done with atlas/Role=production). There are 2 possibilities:
        • dq2-delete-replicas: cleans the DDM location and triggers cleaning of the LFC and SE (a bulk-deletion sketch is given after this list)
        • If the storage is decommissioned, one can just delete the DDM location and request the T1 to clean the LFC (the SE will disappear)
    • Third step
      • Remove the DDM endpoint from the Central Deletion service (log on to the deletion service machines as ddmusr01 and edit /opt/dq2/etc/dq2.cfg, then stop and restart the service)
    • Fourth step
      • Remove the reference to the site in ToA
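
A minimal sketch of the bulk clean-up mentioned in the second step, assuming the dq2-delete-replicas client discussed above is installed and that it accepts the dataset name followed by the DDM endpoint (check dq2-delete-replicas -h before use); the endpoint name and input file are placeholders.

import subprocess

endpoint = 'SOMESITE_DATADISK'  # placeholder DDM endpoint being decommissioned
# one dataset name per line, prepared by the cloud/site responsible
datasets = [line.strip() for line in open('datasets_to_remove.txt') if line.strip()]

for dsn in datasets:
    # requires a proxy with atlas/Role=production, as stated above
    rc = subprocess.call(['dq2-delete-replicas', dsn, endpoint])
    if rc != 0:
        print('dq2-delete-replicas failed for %s (exit code %d)' % (dsn, rc))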

Data Distribution

Disabling a T1 for T0 export

There are several things to be done (each described in the following subsections):

  1. Site exclusion in DDM
  2. Disable the site in SantaClaus
  3. Redirect already subscribed datasets if any
  4. Resume the export to the site

Site exclusion in DDM

The site (or the specific endpoint) needs to be excluded (blacklisted) when there are failures or the data should not be sent to the site.

Disable the site in SantaClaus

A site being excluded in DDM does not mean that data are not assigned to the site.

  • Data assignment and subscriptions are made independently of the site exclusion
  • Site exclusion simply stops the treatment of subscriptions
  • So, when a site is excluded in DDM but not disabled in SantaClaus, the subscriptions for the site will pile up

In SantaClaus, we can disable a site for TAPE and/or DISK.

  • TAPE for RAW data export
  • DISK for both ESD and AOD

For T0 export, especially for RAW data, we should not leave the subscriptions unprocessed for a long time.

  • We should ensure there are two copies of RAW data as soon as possible
  • Although there is buffer space for several days at T0 to keep data on disk, there is a possibility that we lose the data both from disk and tape.

  • For ESD and AOD, we need to decide what to do taking the following into account:
    • there are other replicas of the same dataset, so we can let the subscriptions pile up if the downtime is not too long.
    • If we disable the site for the export of AOD, subscriptions for the AOD missing at the site need to be resumed (see the section below, "Resume the export").
    • If we do not disable the site for the export of ESD, letting the subscriptions pile up, and then later find the site is down too long, then we need to reassign the ESD to another site (see the section below, "Redirect already subscribed datasets").
    • In general, we can let the data be assigned to the site DATADISK if the downtime is short (~1 day).

Disabling RAW data export to a site

  • Go to the SantaClaus machine and edit the configuration to set DATATAPE Availability to 0

Disabling ESD and AOD export to a site

  • Go to the SantaClaus machine and edit the configuration to set DATADISK Availability to 0

Redirect already subscribed datasets

We can think of two different ways:

  A. manual resubscription
  B. resetting the SantaClaus decision and letting it do the resubscription

Usually B. is better because SantaClaus knows all the parameters needed for the subscriptions.

B. Resetting the SantaClaus decision

Note:

  • SantaClaus treats (re-treats) only datasets created within a certain period (see service-config.xml for intervalSecs)
    • datasets older than this period are not treated.
    • you can change the period temporarily if needed.

Resume the export to the site

SantaClaus automatically resumes subscriptions for DATADISK.

  • By simply restoring the availability for DATADISK, SantaClaus starts making subscriptions for the datasets missing at the site.
  • Note that datasets older than the defined period (see service-config.xml for intervalSecs) are not picked up
    • you need to change the period temporarily to make SantaClaus treat them,
    • or otherwise you need to resume the subscriptions by hand.

Replication of data to tape

Some outputs of MC production and reprocessing must be archived on tape; however, there should be a delay so that bad tasks can be aborted and we avoid writing bad data to tape. Currently this is implemented as 30 days after dataset freezing time. A DDM ops script scans for datasets to be archived and subscribes them to T1 tape. The data distribution is weighted by free space, where free space is calculated as the pledged tape space for the current year minus what DQ2 reports as used on the tape (see the illustrative sketch below).
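
As an illustration of this weighting (not the actual Migrate.py code), the sketch below computes free tape space per endpoint as pledge minus used space and derives the corresponding shares; all endpoint names and numbers are invented.

# Invented pledge and usage figures (TB) for two hypothetical tape endpoints
pledged_tb = {'T1_A_DATATAPE': 5000.0, 'T1_B_DATATAPE': 3000.0}
used_tb = {'T1_A_DATATAPE': 4200.0, 'T1_B_DATATAPE': 1800.0}

# free = pledged tape space for the current year - space DQ2 reports as used
free_tb = {}
for site in pledged_tb:
    free_tb[site] = max(pledged_tb[site] - used_tb.get(site, 0.0), 0.0)

total_free = sum(free_tb.values())
for site in sorted(free_tb):
    weight = free_tb[site] / total_free if total_free > 0 else 0.0
    print('%s: %.0f%% of new tape subscriptions' % (site, 100.0 * weight))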

The script runs under user ddmusr01 on aiatlasddm001.cern.ch and uses the ddm admin robot certificate to make subscriptions. The logic is done by /data/ddmusr01/src/adcops/ddmscripts/manual_distribution/Migrate.py and the particular dataset patterns to migrate to tape are specified in conf files in the same directory. Cron jobs run for each of the conf files once per day:

# migrate data to tape
00 20 * * * /data/ddmusr01/src/adcops/ddmscripts/manual_distribution/MigrateToTape.sh /data/ddmusr01/src/adcops/ddmscripts/manual_distribution/MigrateToTapeData.conf 
00 08 * * * /data/ddmusr01/src/adcops/ddmscripts/manual_distribution/MigrateToTape.sh /data/ddmusr01/src/adcops/ddmscripts/manual_distribution/MigrateToTapeMC.conf 

# migrate NTUP_COMMON to tape
00 02 * * * /data/ddmusr01/src/adcops/ddmscripts/manual_distribution/MigrateToTape.sh /data/ddmusr01/src/adcops/ddmscripts/manual_distribution/MigrateToTapeNTUPCOMMONData.conf 
00 14 * * * /data/ddmusr01/src/adcops/ddmscripts/manual_distribution/MigrateToTape.sh /data/ddmusr01/src/adcops/ddmscripts/manual_distribution/MigrateToTapeNTUPCOMMONMC.conf 

These configurations are for long running campaigns and are changed only a few times per year or less. For a one-off replication edit MigrateToTape.conf and run the script directly.

Logs of the script activity can be found in /data/ddmusr01/log.

Deletion Service

Monitoring

If you urgently need the central deletion service to clean a specific site, you have to restrict the list of sites it processes.

To add/remove a site, edit the file /opt/dq2/etc/dq2.cfg and modify the list of sites. Afterwards, stop and restart the services:

  • dashb-agent-stop dq2deletionagents
  • dashb-agent-start dq2deletionagents

Cleaning Agent

DQ2 crons

On atlddm31:

> crontab -l
30 * * * *    /opt/dq2/bin/python /usr/lib/python2.5/site-packages/dq2/info/server/collectors/ToACollector.py   > /tmp/toacollector.log 2>&1
0 */2 * * *   /data/ddmusr01/vomscollector.sh > /tmp/vomscollector.log 2>&1
0 * * * *     /opt/dq2/bin/python /usr/lib/python2.5/site-packages/dq2/info/server/collectors/AMICollector.py   > /tmp/amicollector.log 2>&1

Repairing datasets / files

Recover datasets from aborted tasks

If a physics group requests to resurrect some tasks or datasets, the actions are:

  • Ask Pavel to resurrect the tasks (otherwise the datasets will be deleted later)
  • To prevent the 'cleaning tool' from deciding that the dataset you got back has to be deleted, ask Alexei to remove it from the 'obsolete' list http://panda.cern.ch/?mode=listAbortedDatasetsState
  • Ask for the list of datasets
  • Check if the dataset still contains files. If not, the dataset cannot be recovered (it is difficult to register files correctly on the Grid).

Checksum or file size error reported after dq2-put (only for input to evgen)

  • Dataset copied with dq2-put (usually input files to evgen production)
    • Usually, the file was badly copied to the local SE and this wrong information was propagated to the LFC and DDM. After the file is overwritten on the SE, get the checksum and size values
    • To correct the DDM entry (you can get it with dq2-list-files DATASET), send the information to be corrected to atlas-dq2-dmod.NOTSPAM.cern.ch
    • To correct the LFC entry, go to /afs/cern.ch/atlas/offline/external/GRID/ddm/Operations/correct_LFC and run (more information in the code itself):
python fixlfcchksum.py SITENAME   GUID  CHECKSUMTYPE NEWCHECKSUM

Wrong guid already registered on LFC

This is an old problem which affects tasks < 10000. To cure the problem, go to /afs/cern.ch/atlas/offline/external/GRID/ddm/Operations/correct_LFC and run (after having correctly defined the LFC catalog):

python correctBadRegistrationSandjay.py

Remove corrupted files from dataset definition and remove files from the Storage/LFC

File corrupted at the source T1

When a file has been checked as corrupted at the source T1 (where the dataset was produced and from where it was replicated to other T1s), the file should be removed from the dataset definition and on storages.

Run python /afs/cern.ch/atlas/offline/external/GRID/ddm/Operations/corrupted_file/remove_corrupted_everywhere.py DATASETNAME GUID1 GUID2 ... with Role=production.

The script:

  • removes the files from the dataset
  • creates a temporary dataset (trash.*) containing the problematic files, with the same location as DATASETNAME
  • declares the temporary dataset for deletion everywhere to the central deletion service (and the temporary dataset is erased)

File corrupted on a specific site (not from the production tools)

If the dataset is not produced by the production tool, it is usually not possible to know from DDM where it was produced. The file should be deleted and resubscribed (if possible).

  • get the lfc host for the site
    /afs/cern.ch/atlas/offline/external/GRID/ddm/Operations/dq2-get-site-lfc DQ2-SITE
  • set LFC_HOST
    export LFC_HOST=...
  • Delete the replica on the problematic site only
    lcg-del surl
  • Force consistency check of the dataset at the site (so that DDM realises that a file was removed on this site)
    dq2-check-replica-consistency DATASET SITE
  • Resubscribe the dataset to the site
     dq2-register-subscription DATASET SITE

File with zero-size at CERN castor

Only DDM admins can do the following procedure:

  • login to the specific machine (see DDMOperationProceduresInternalNotes)
  • remove the file: nsrm $CASTORFILE
  • put a non-zero file: rfcp $TMPFILE $CASTORFILE
  • delete it as usual: LFC_HOST=prod-lfc-atlas-local.cern.ch lcg-del $SURL
  • Force a consistency check of the dataset at the site (so that DDM realises that a file was removed at this site)
  • Resubscribe the dataset to the site

remarks

  • lcg-del for such a file would fail
    Function 'lfc_unregister_pfns': invalid arguments
    lcg-del: Invalid argument
  • nsrm does not work even if the file and the directory are zp group-writable, because of the ACLs
    % nsls -l $CASTORFILE
    -rwxrwxr--   1 atlas003 zp                 32258236 Feb 27 16:07 /castor/cern.ch/grid/atlas/...
    
    % nsls -ld $CASTORDIR
    drwxrwxr-x 777 atlas003 zp                        0 Feb 27 16:07 /castor/cern.ch/grid/atlas/...
    
    % nsgetacl $CASTORFILE
    # file: /castor/cern.ch/grid/atlas/...
    # owner: atlas003
    # group: zp
    user::rwx
    user:atlas003:rwx               #effective:rwx
    user:atlas004:rwx               #effective:rwx
    group::r-x              #effective:r-x
    mask::rwx
    other::r--
    
    % nsgetacl $CASTORDIR
    # file: /castor/cern.ch/grid/atlas/...
    # owner: atlas003
    # group: zp
    user::rwx
    user:atlas003:rwx               #effective:rwx
    user:atlas004:rwx               #effective:rwx
    group::r-x              #effective:r-x
    mask::rwx
    other::r-x
    default:user::rwx
    default:user:atlas003:rwx
    default:user:atlas004:rwx
    default:group::r-x
    default:mask::rwx
    default:other::r-x

Correcting LFC ACLs

If the LFC directory in question is owned by Kors' DN, then DDM admins can correct it themselves

  • login to the specific account (see DDMOperationProceduresInternalNotes)
  • set LFC_HOST
    export LFC_HOST=$(/afs/cern.ch/atlas/offline/external/GRID/ddm/Operations/dq2-get-site-lfc DQ2-SITE)

  • to make the directory atlas-wide writable
    lfc-setacl -m 'g:atlas:rwx,m:rwx,d:g:atlas:rwx,d:m:rwx' LFC_DIRECTORY

Lost Dataset

File a ticket with the task id at https://savannah.cern.ch/bugs/?group=adco-support

lost/corrupted files at the TZERO

merge.RAW can be reproduced within ~ a week

  • see below

data09_calib.*.calibration_LArElec-*.daq.RAW

calibration_muon_all.*\.RAW, merge.NTUP_MUONCALIB

see some memo at DDMOperationProceduresInternalNotes#Lost_Corrupted_files_at_TZERO

Recovering lost files at the TZERO

On Mar 23, 2010, at 4:37 PM, Simone Campana wrote:

  1. ) Get the info below (from Armin).
    • eg.
      Dataset: data10_900GeV.00149907.physics_CosmicCalo.merge.RAW
      location: CERN-PROD_TZERO
      
      Bad file:
      LFN:       data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.1
      PFN:       /castor/cern.ch/grid/atlas/tzero/prod1/perm/data10_900GeV/physics_CosmicCalo/0149907/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.1
      GUID:      8C06DE72-522A-DF11-8596-003048D3C928
      checksum:  ad:c22efab5
      size:      2554582892
      
      Regenerated file:
      LFN:       data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.2
      PFN:       /castor/cern.ch/grid/atlas/tzero/prod1/perm/data10_900GeV/physics_CosmicCalo/0149907/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.2
      GUID:      F866CDF0-6636-DF11-9498-001617C3B724
      checksum:  ad:1995fb98
      size:      2554582900
  2. ) Remove all subscriptions for this dataset (keep note of them). In this case there were none
  3. ) Ask Vincent to re-open the dataset if it is frozen
  4. ) Create a new dataset version
    $ dq2-register-version data10_900GeV.00149907.physics_CosmicCalo.merge.RAW
    
    Dataset data10_900GeV.00149907.physics_CosmicCalo.merge.RAW version 3 is open.
  5. ) Ask Armin to register the good file both in DDM and LFC (in this case I did it myself to test the full flow).
  6. ) Find the GUID for the BAD file in the central catalog
    $ dq2-list-files data10_900GeV.00149907.physics_CosmicCalo.merge.RAW | grep data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.1
    
    data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.1    8C06DE72-522A-DF11-8596-003048D3C928    ad:c22efab5    2554582892
  7. ) Remove the bad file. You could do it with lcg-del, but:
    $ lcg-lr --vo atlas guid:8C06DE72-522A-DF11-8596-003048D3C928
    
    srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/tzero/prod1/perm/data10_900GeV/physics_CosmicCalo/0149907/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.1
    
    $ lcg-del --vo atlas srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/tzero/prod1/perm/data10_900GeV/physics_CosmicCalo/0149907/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.1
    
    srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/tzero/prod1/perm/data10_900GeV/physics_CosmicCalo/0149907/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.1: [SE][srmRm][SRM_AUTHORIZATION_FAILURE] Permission denied
    This is because T0 files are not group-writable (and this should remain so). So I did (as ddmusr03@atlddm16):
    $ export LFC_HOST=lfc-atlas.cern.ch
    $ nsrm /castor/cern.ch/grid/atlas/tzero/prod1/perm/data10_900GeV/physics_CosmicCalo/0149907/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.1
    $ lcg-uf --vo atlas guid:8C06DE72-522A-DF11-8596-003048D3C928 srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/tzero/prod1/perm/data10_900GeV/physics_CosmicCalo/0149907/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW/data10_900GeV.00149907.physics_CosmicCalo.merge.RAW._lb0039._0001.1 
  8. ) Delete the bad file from the dataset definition
    [atlddm16] /afs/cern.ch/user/d/ddmusr03 > dq2-delete-files data10_900GeV.00149907.physics_CosmicCalo.merge.RAW 8C06DE72-522A-DF11-8596-003048D3C928
  9. ) Freeze the dataset
    $ dq2-freeze-dataset data10_900GeV.00149907.physics_CosmicCalo.merge.RAW
    
    Dataset data10_900GeV.00149907.physics_CosmicCalo.merge.RAW frozen
  10. ) Check consistency for every location
    $ dq2-list-dataset-replicas data10_900GeV.00149907.physics_CosmicCalo.merge.RAW
    INCOMPLETE: BNL-OSG2_DATADISK,BNL-OSG2_DATATAPE,CERN-PROD_TZERO
    COMPLETE:
    
    $ dq2-check-replica-consistency data10_900GeV.00149907.physics_CosmicCalo.merge.RAW CERN-PROD_TZERO
    Refresh request took into account for Dataset data10_900GeV.00149907.physics_CosmicCalo.merge.RAW 
    
    $ dq2-list-dataset-replicas data10_900GeV.00149907.physics_CosmicCalo.merge.RAW
    INCOMPLETE: BNL-OSG2_DATADISK,BNL-OSG2_DATATAPE
    COMPLETE: CERN-PROD_TZERO
  11. ) Re-insert the subscription and (more generally, or in addition) place a subscription for each incomplete replica:
    $ dq2-register-subscription --archive=primary data10_900GeV.00149907.physics_CosmicCalo.merge.RAW BNL-OSG2_DATADISK 
    Dataset data10_900GeV.00149907.physics_CosmicCalo.merge.RAW subscribed (archived: None) to BNL-OSG2_DATADISK.
    
    $ dq2-register-subscription --archive=custodial data10_900GeV.00149907.physics_CosmicCalo.merge.RAW BNL-OSG2_DATATAPE
    Dataset data10_900GeV.00149907.physics_CosmicCalo.merge.RAW subscribed (archived: None) to BNL-OSG2_DATATAPE.
    • Need to set the archived properly...

eg. bug #65017: data10_900GeV.00149907.physics_CosmicCalo.merge.RAW

AMICollector

[atlddm31] /afs/cern.ch/user/d/ddmusr01 > /opt/dq2/bin/python /usr/lib/python2.5/site-packages/dq2/info/server/collectors/AMICollector.py
Fri May  7 13:31:29 2010-ami collector >> start
Fri May  7 13:31:29 2010> Added namespace cond09_mc

The cron runs every day at midnight:

> crontab -l
...
5 0 * * *     /opt/dq2/bin/python /usr/lib/python2.5/site-packages/dq2/info/server/collectors/AMICollector.py > /tmp/amicollector.log 2>&1

HOTDISK decommission

The files from cond* and ddo.*.DBRelease datasets are also put in CVMFS, so T2s with a working CVMFS can decommission the HOTDISK endpoint. The procedure to follow when such a decommissioning is done:

  1. create a savannah ticket to follow the procedure
  2. ask AK47 maintainers (Alexei Klimentov) to exclude the endpoint from dataset distribution (otherwise AK47 will start subscribing DB and SW releases to DATADISK)
  3. blacklist the endpoint as source and destination (or set permission group to hotdiskDecommission in AGIS - this would leave only delete and administer permissions on for Role=production)
  4. remove the endpoint from Site service configuration (+restart the agent) => no datasets transferred
  5. delete all existing datasets
  6. remove the endpoint from Deletion service when the deletion is complete (check in deletion monitoring) and restart the agent
  7. remove the subscriptions to this endpoint
  8. disable the endpoint in AGIS

Some of these endpoints were already decommissioned: 100428, 100815, 101299

Procedures to be done by DDM Ops and sites/clouds (with production role)

Procedures for Cloud Squads

Missing files

There are two things to do:

  • To confirm the file is supposed to be at the site
  • To confirm if the file is lost at the site, or unavailable temporarily

To confirm the file is supposed to be at the site

  • If the missing file was found by a production job, follow the instruction on ADCoS#Missing_files
  • If the missing file was found by dq2-get, then check whether the file is registered for the site:
        dq2-ls -f -L DQ2-SITE_NAME DATASET | grep FILE
    You should see [X] at the beginning of the line if the file is registered for the site.
        dq2-ls -fp  -L DQ2-SITE_NAME DATASET | grep FILE
    You should get the SURL for the file if the file is registered for the site.

To confirm whether the file is lost at the site or only temporarily unavailable

  • First, try to get the file yourself :
    1. You can get the SURL (something like a URL starting with "srm://") of the missing file(s) by navigating the Panda Monitor (see the slides at ADC Operations Weekly Meeting 30 July 2009)
      1. on the job information page, click the file name whose status is 'missing'.
      2. you will then be led to the file search page; look at the bottom of the page: if the file is registered in the LFC, you will see its SURLs.
    2. then try to download the file
      lcg-cp --vo atlas <SURL> <LOCALFILE> 
      • If it is accessible, it was a temporary problem. Nothing to do unless the same failure keeps occurring.
      • If you cannot get the SURL, then follow the procedures below

  • If lcg-cp failed, then verify that the replica at the site is really listed in the cloud LFC :
    1. In order to get the LFC hostname,
      • If you have the Panda Site ID; use the script /afs/cern.ch/atlas/offline/external/GRID/ddm/Operations/panda-info-site-lfc PANDA-SITEID
        • if you are not at CERN, simply copy the file to your local system and execute it there
      • If you have the DQ2 Site Name; use the script /afs/cern.ch/atlas/offline/external/GRID/ddm/Operations/dq2-get-site-lfc DQ2-SITE
        • if you are not at CERN, simply copy the file to your local system and execute it there after setting up dq2 environment
      • If you are not sure about either of them, please ask the experts either at the #ADC_Virtual_Control_Room, or via the ADCoS team ML
    2. set the LFC_HOST
      • for bash/zsh :
        export LFC_HOST=$(<the_script> <SITE_NAME>)
      • for csh/tcsh :
        setenv LFC_HOST `<the_script> <SITE_NAME>`
    3. check the replica with lcg-ls:
      • if you know the SURL
        lcg-ls -l <SURL>
      • if you know only the GUID:
        lcg-lr --vo atlas guid:<GUID>
        and then find the corresponding SURL and
        lcg-ls -l <SURL>
        • How to find the corresponding SURL? Maybe you can guess from the hostname (right after "srm://"). If not, consult the expert (see above).

    • If a replica for the site under investigation is not listed, then it is not a site problem
    • If a replica is listed you should try to access it using lcg-cp
      lcg-cp --vo atlas <SURL> <LOCALFILE> 
      (if you have not done so as the first step)

    • If it is accessible, it was a transitory problem
    • If it is inaccessible, it is a site problem

If it is a site problem; go to the next section. ( #In_case_some_files_have_been_los)

The recovery procedures for the expert shifters

  1. wait for the site's response in case it is a site problem
  2. Check if the file is available at the original T1 (following this instruction)
  3. if the file is corrupted/missing at the original T1; follow the procedure to remove files everywhere and from the DQ2
  4. if the file is NOT corrupted/missing at the original T1, but just at the detected site; follow the procedure to recover the file at the site.

In case some files seem to be lost at a site

If the problem is detected by the site:

  • the site has to provide a list of SURLs (srm://....). It is important that the files are completely deleted from the SE (physically and in the storage catalog)
  • If the site can post a Savannah ticket: this list has to be posted as a Savannah ticket in DDM-Ops Savannah and the ticket assigned to the cloud squad. Attach the list of lost files to the ticket.
  • If the site cannot post a Savannah ticket, they should ask a member of the cloud support to do it (if the site doesn't know the cloud support persons, contact the cloud support email address atlas-adc-cloud-<cloud>@cern.ch, where <cloud> is one of ['us','ng','es','de','fr','ca','it','tw','nl','uk'])

If the problem is detected by an analysis or production tool or by a shifter:

  • the shifter (DAST or ADCoS) has to validate the problem:
    • site not in downtime
    • check one or two files with dq2-get
If the problem is confirmed, the shifter has to issue:
  • a GGUS ticket to the site (copy to the cloud support mailing list: atlas-adc-cloud-fr@cernNOSPAMPLEASE.ch)
    • to provide the site with a list of inaccessible files
    • to ask the site if the files can be recovered (in case a disk server is down)
    • to ask the site if other files are affected by the same problem
  • a Savannah ticket in DDM-Ops Savannah referring to the GGUS ticket. The ticket should be assigned to the squad support.
When the GGUS ticket is answered, the squad support should update the Savannah ticket with the list of lost files.

When the list of problematic SURLs is defined, they will be submitted to the recovery service (right now this can be done only by the DDM central ops team), which will clean the LFC, refresh the DQ2 location catalog, and either recover the files if possible or remove them from the dataset definition if not.

Find and delete dark data

SE/LFC/DDM comparison

Dark files can be detected because they are not associated with an LFC or DDM entry. Their detection and deletion (through the central deletion service) is described in DDMOperationsScripts#Consistency_check_and_Dark_data. However, the central deletion service does not delete directories, and it happens that some files cannot be deleted (a problem observed at least with DPM). This chapter provides the list of directories which can be completely cleaned by site admins (including their content).

Bulk erase in SE

The following paragraphs provide a list of directories in the SE which must be completely cleaned. This usually cannot be done centrally through the central deletion service (directories or dark files).

  • All directories which are not related to ATLAS space tokens are not under ADC responsibility. Sites have to decide by themselves if they should be kept, but they are not part of the pledged resources.

  • ATLASMCDISK
    • .../atlasmcdisk in all ATLAS Grid sites
    • ATLASMCDISK space token should disappear

  • ATLASDATADISK
    • .../atlasdatadisk/panda (Now running in PRODDISK)
    • .../atlas/atlasdatadisk/ccrc08_run2

  • ATLASPRODDISK
    • .../atlasproddisk/mc08 (mc08 project not used anymore)
    • .../atlasproddisk/mc09_10TeV (mc09_10TeV project not used anymore)
    • ../atlasproddisk/mc09_14TeV (mc09_14TeV project not used anymore)
    • ../atlasproddisk/mc09_900GeV (mc09_900GeV project not used anymore)
    • .../atlasproddisk/panda/01 , .../atlasproddisk/panda/02 , .../atlasproddisk/panda/03 , ..... , .../atlasproddisk/panda/31 (the pattern of input panda datasets has changed and jobs no longer write into these directories)
    • If the SE technology provides the functionality, all files older than 60 days in .../atlasproddisk should be deleted (a clean-up sketch is given after this list)

  • ATLASHOTDISK (Cleaning only for non-US T2 and T3)
    • ...../atlashotdisk/ (no longer used; replaced by CVMFS)
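
For the last ATLASPRODDISK point, here is a minimal sketch of a 60-day clean-up, assuming the site admin has a POSIX view of the atlasproddisk namespace (the path below is a placeholder); on most SE technologies the equivalent has to be done with the storage system's own tools.

import os
import time

root = '/storage/atlas/atlasproddisk'   # placeholder: site-specific mount point
cutoff = time.time() - 60 * 24 * 3600   # files older than 60 days
dry_run = True                          # set to False to actually delete

for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            if os.path.getmtime(path) < cutoff:
                print(('would delete ' if dry_run else 'deleting ') + path)
                if not dry_run:
                    os.remove(path)
        except OSError:
            pass  # file disappeared or is unreadable; skip it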

Bad files notification

From time to time, notifications are sent to cloud supports about bad files. These notifications look like :

[DQ2 notification] Potentially bad files

This is an automatic message from DQ2 recovery service. Please do not reply to this mail. If you have questions, please send them to atlas-dq2-ops@cern.ch
In the last day some files in sites from your cloud have been identified as potentially lost or corrupted. You can find the list below :

UAM-LCG2_DATADISK data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00 NTUP_SUSYSKIM.607852._000174.root.1 srm://grid002.ft.uam.es/pnfs/ft.uam.es/data/atlas/atlasdatadisk/data11_7TeV/NTUP_SUSYSKIM/f403_m975_p832/data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00/NTUP_SUSYSKIM.607852._000174.root.1__DQ2-1324344973 File corrupted
CERN-PROD_LOCALGROUPDISK data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00 NTUP_SUSY.277004._000093.root.1 srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/atlaslocalgroupdisk/data10_7TeV/NTUP_SUSY/r1647_p306_p307_p428/data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00/NTUP_SUSY.277004._000093.root.1 Source file doesn't exist
...

These notifications come from the DQ2 recovery service, which collects errors reported by the DDM dashboard or by the pilot. Two kinds of errors are collected:

  • Potentially lost files: "source file doesn't exist", "No such file or directory", ...
  • Potentially corrupted files: "Source file/user checksum mismatch"

The lists of potentially lost or corrupted files are also available via HTTP or an API. See DDMOperationsScripts#SuspiciousFiles.

The cloud squad needs to check whether these files are really bad or whether they are false positives. The checking procedures are the following:

  • For lost files, use dq2-get specifying the files and the DDM endpoint, which tries to download the files and tells success or failure.
    dq2-get -s CERN-PROD_LOCALGROUPDISK -f NTUP_SUSY.277004._000093.root.1 data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00 
    Querying DQ2 central catalogues to resolve datasetname data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00
    Datasets found: 1
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00: Directory data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00 already exists
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00: Querying DQ2 central catalogues for replicas...
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00: Using complete replica at given site
    Querying DQ2 central catalogues for files in dataset...
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00: Complete replica available at local site
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00: Using site CERN-PROD_LOCALGROUPDISK
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00: Querying local file catalogue of site CERN-PROD_LOCALGROUPDISK...
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00/NTUP_SUSY.277004._000093.root.1: Getting SRM metadata
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00/NTUP_SUSY.277004._000093.root.1: might not be cached. Staging will take some time.
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00/NTUP_SUSY.277004._000093.root.1: Starting transfer
    
    data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00/NTUP_SUSY.277004._000093.root.1: external failed:
    stdout:
    
    
    stderr:Using grid catalog type: UNKNOWN
    Using grid catalog : (null)
    VO name: atlas
    Checksum type: None
    Trying SURL srm://srm-atlas.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/grid/atlas/atlaslocalgroupdisk/data10_7TeV/NTUP_SUSY/r1647_p306_p307_p428/data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00/NTUP_SUSY.277004._000093.root.1 ...
    [SE][Ls][SRM_INVALID_PATH] No such file or directory
    lcg_cp: No such file or directory
    ...
    
    ** Some files failed transfer, reached maximum re-trial times, please try again later **
    
    Download Summary:
    File: data10_7TeV.00165632.physics_JetTauEtmiss.merge.NTUP_SUSY.r1647_p306_p307_p428_tid277004_00/NTUP_SUSY.277004._000093.root.1, FAILED
    
    Number of datasets requested: 1
    Total number of files in dataset: 320
    Number of files or lfn patterns specifically requested by user: 1
    Number of file download attempts by dq2-get (not including retrials): 1
    Number of successful file download attempts: 0
    Number of failed file download attempts: 1
    Number of files where validation was skipped: 0
    Number of download retrials: 6
    Finished

  • For corrupted files, use dq2-get specifying the files and the DDM endpoint, which downloads the files and verifies their checksum:
    dq2-get -s UAM-LCG2_DATADISK -f NTUP_SUSYSKIM.607852._000174.root.1 data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00 
    Querying DQ2 central catalogues to resolve datasetname data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00
    Datasets found: 1
    data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00: Querying DQ2 central catalogues for replicas...
    data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00: Using complete replica at given site
    Querying DQ2 central catalogues for files in dataset...
    data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00: Using site UAM-LCG2_DATADISK
    data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00: Querying local file catalogue of site UAM-LCG2_DATADISK...
    data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00/NTUP_SUSYSKIM.607852._000174.root.1__DQ2-1324344973: Getting SRM metadata
    data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00/NTUP_SUSYSKIM.607852._000174.root.1__DQ2-1324344973: is cached.
    data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00/NTUP_SUSYSKIM.607852._000174.root.1__DQ2-1324344973: Starting transfer
    data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00/NTUP_SUSYSKIM.607852._000174.root.1__DQ2-1324344973: checksums do not match, deleting file (local disk: ad:83969465, DQ2 catalogues: ad:594b362f)
    ...
    
    ** Some files failed transfer, reached maximum re-trial times, please try again later **
    
    Download Summary:
    File: data11_7TeV.00189049.physics_Muons.merge.NTUP_SUSYSKIM.f403_m975_p832_tid607852_00/NTUP_SUSYSKIM.607852._000174.root.1__DQ2-1324344973, FAILED
    
    Number of datasets requested: 1
    Total number of files in dataset: 366
    Number of files or lfn patterns specifically requested by user: 1
    Number of file download attempts by dq2-get (not including retrials): 1
    Number of successful file download attempts: 0
    Number of failed file download attempts: 1
    Number of files where validation was skipped: 0
    Number of download retrials: 6
    Finished

Warning, important Be careful: the file is downloaded locally to compute the checksum. Check that you have enough disk space.

If the files are confirmed to be bad, please follow the instructions in the recovery procedure. If they are false positives, you don't need to do anything.

In case some files were confirmed to be lost or corrupted at a site

Bad files need to be declared to the DQ2 recovery service: http://bourricot.cern.ch/dq2/recovery/ . This service will automatically clean the LFC, refresh the DQ2 location catalog, and either recover the files if possible or remove them from the dataset definition if not; finally it notifies the owners of the dataset replicas about the files they lost.

Warning, important The service is not optimized for big incidents (it can consume approx. 50k lost files per day). In case of a big loss it is important to coordinate (and prioritize) the recovery with other ADC groups. It is also useful to adjust the cron job sending daily notification emails (the frequency in voatlas264:/etc/cron.d/other.cron and the time range in voatlas264:/data/ddmusr01/consistency/informUsersNew.py).

You should set the DQ2 environment and initialize your proxy. You need to have general production role (/atlas/role=production) or be part of team group (/atlas/team/role=null).

Then, execute the script described at DDMOperationsScripts#Recovery_of_lost_or_corrupted.

Warning, important You do not have to clean the lost files from the LFC before you declare them to the Recovery Service; in fact, it is important that you do not do so! The LFC is used to extract the GUIDs of the files, and these GUIDs are then used to identify all the datasets the files belong to. If you clean the LFC, this operation will not work. The LFC clean-up will be done automatically by the Service.
Tip, idea You can follow the recovery procedure on this page : http://bourricot.cern.ch/dq2/recovery/

Verify corrupted files at the source T1

To check whether a corrupted file is also corrupted at the 'original' T1 (where the dataset was produced): as of 8 October 2010 (following the recommendation in https://savannah.cern.ch/bugs/index.php?69616), no general script is provided any more; proceed as follows:

  1. find the 'original' cloud using dq2-get-metadata $DATASET | grep origin | awk '{print $NF}'
    • If you cannot find the 'original' cloud in the metadata, fill a Savannah ticket to DDM Ops providing the dataset name and the file name. Somebody with special privileges will do the cleaning (not automatic yet). Consider the file as lost.
  2. find the T1 DQ2site of the cloud using dq2-ls -r $DATASET
    • If the 'original' T1 DQ2site is _*TAPE, do not try the following but simply submit a Savannah ticket to DDM Ops providing the dataset name and the file name.
  3. go to a directory with enough space and where it does not get affected by others (eg. /tmp/your_username)
    mkdir -p /tmp/$(whoami) && cd /tmp/$(whoami) && df -h .
  4. use dq2-get -f $FILE -s $T1_DQ2SITE $DATASET which will copy the file locally, compute the checksum and report inconsistency

Be careful: the file is downloaded locally to compute the checksum. Check that you have enough disk space and that the downloaded file does not get disturbed by others.

If you see the message ERROR : Checksum in DQ2 (xxxxxxxx) and on the SE (xxxxxxxx) are different !!!, please report to DDM Ops Savannah providing the dataset name, the production T1 and the file name(s) (but not the job number), and the fact that you checked that the file was corrupted at the source T1. The DDM Ops experts will then delete the files from DDM and the sites following the deletion procedure.

Handling lists of transfer errors from the dashboard

There can be certain errors reported on the DDM dashboard for which you would like to handle all the files at once (e.g. declaring all lost).

  • In the dashboard Matrix view select the appropriate sites or cloud, then click on the box and the last number in the popup to give the list of errors at the bottom of the page
  • Either select the particular error code you are interested in or the total number
    • Beware that the same error can be reported as different error codes for different files/sites
  • Clicking on the error number brings up a table of files
  • On the top right of the page is a disk icon with a pull down menu to get the data in JSON or XML format - click on JSON
  • This gives the first 50 errors; if there are more, you can edit the URL setting limit=x, where x is the maximum number of errors you want. There is a limit on how big x can be; if you need more, then also change the offset value and do multiple calls
  • Save this page locally, or download it directly, for example with python urllib (see the sketch below)
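
A minimal sketch of the urllib download mentioned in the last point, written for the Python 2 environment used elsewhere on this page; the URL is a placeholder that should be replaced with the JSON link copied from the dashboard, with limit/offset appended as described above.

import urllib

# Paste the JSON link copied from the dashboard page here (placeholder value),
# appending limit/offset as described above
url = 'PASTE_DASHBOARD_JSON_LINK_HERE' + '&limit=500&offset=0'
urllib.urlretrieve(url, 'details.json')  # the saved file is read by the snippet below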

To declare all these files as bad:

import json

# Read the error details saved from the dashboard and collect the source SURLs
errors = json.loads(open('details.json').read())
surls = []
for err in errors['details']:
    surls.append(err['src_surl'])
surls = list(set(surls)) # to eliminate duplicates

# Declare the files as bad to DQ2 so the recovery service handles them
from dq2.clientapi.DQ2 import DQ2
dq2 = DQ2()
dq2.declareBadFiles(surls, reason='files got corrupted')

Data transfer errors

Triggered by FTS

Deletion of user datasets requested by users

Information in this link

Links


Major updates:
-- StephaneJezequel - 04 May 2008

Responsible: StephaneJezequel
Last reviewed by: Never reviewed
