DDM scripts developed for the NYU Tier3

Introduction

To handle data files efficiently on the NYU Tier3, a number of scripts have been developed. For better code management they have since been added to the group's SVN repository, which can be accessed here.

At the moment there are 5 Python scripts in the package, all installed under /export/share/atlas/ddm/ at NYU. The scripts need no installation step. You just check out the package into a local directory like this:

svn co svn+ssh://svn.cern.ch/reps/atlasgrp/Institutes/NYU/Tier3/ddm/trunk ddm

Running the different scripts requires different environments to be configured. These are detailed further down on the page. All the scripts accept a few options. If I'm not mistaken, every script accepts the "-h" command line option and prints some information about itself in response. In the following I describe how to use each of the scripts.

Dataset directory

The

/export/share/atlas/ddm/decodeFilepath.py

script is basically just a slight improvement on Doug's original code. It needs the "DQ2 environment" to run:

setupATLAS
localSetupDQ2Client
voms-proxy-init -voms atlas

It works the same way as his script. You pass a DQ2 dataset name to the script, and it returns the directory in which this dataset should end up on the XRootD servers.

Downloading large amounts of data

FTS (File Transfer Service) is now available for the NFS node. This means that transfers can be initiated using dq2-get's FTS plugin, which downloads the files in a more server-friendly way than a regular dq2-get call.

The local DQ2 client is configured to use the FTS plugin, so setting up the environment for an FTS transfer is the same as setting it up for using dq2-get in "regular" mode:

setupATLAS
localSetupDQ2Client
voms-proxy-init -voms atlas

Note that to write data to the GridFTP server running on the NFS node, your certificate has to be added to the list of privileged certificates that are allowed to do this. (Contact the cluster administrator if the following instructions don't work for you.)

Manual method

The file transfer can be initiated with this command:

dq2-get -Y -q FTS -o https://fts.usatlas.bnl.gov:8443/glite-data-transfer-fts/services/FileTransfer -S gsiftp://t3nfs.physics.nyu.edu/atlas <dataset name>

Note that you shouldn't specify a sub-directory on the GridFTP server. dq2-get will take care of copying the files into a directory structure.

The recommendation is that the transfer should be initiated from a specific site whenever possible. So check which sites have the desired dataset, and initiate the transfer from them like this:

dq2-get -Y -L <site name> -q FTS -o https://fts.usatlas.bnl.gov:8443/glite-data-transfer-fts/services/FileTransfer -S gsiftp://t3nfs.physics.nyu.edu/atlas <dataset name>

You can find some more information about the FTS plugin here.
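If you start such transfers often, it can help to assemble the command in one place. The following is only a minimal sketch, not part of the ddm package: the function name build_dq2_get_command is made up for illustration, and the endpoint and destination are simply copied from the first command above. It only builds the argument list and does not run dq2-get.

```python
# Values copied from the manual dq2-get command shown on this page.
FTS_ENDPOINT = ("https://fts.usatlas.bnl.gov:8443/"
                "glite-data-transfer-fts/services/FileTransfer")
DESTINATION = "gsiftp://t3nfs.physics.nyu.edu/atlas"


def build_dq2_get_command(dataset, site=None):
    """Return the dq2-get argument list for an FTS transfer of `dataset`.

    Hypothetical helper. If `site` is given, "-L <site>" is added so that
    the transfer is initiated from that site.
    """
    command = ["dq2-get", "-Y"]
    if site is not None:
        command += ["-L", site]
    command += ["-q", "FTS", "-o", FTS_ENDPOINT, "-S", DESTINATION, dataset]
    return command


# Print the command instead of running it, so it can be checked first.
print(" ".join(build_dq2_get_command("<dataset name>", site="<site name>")))
```

With the DQ2 environment set up, the returned list could be handed to subprocess.check_call() to actually start the transfer.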

Automated method

You can start the download of a specific dataset or a list of datasets using the

/export/share/atlas/ddm/dq2_fts_get.py

script. For instance, to download the latest datasets taken by ATLAS and turned into NTUP_2LHSG2 ntuples, you can do the following:

/export/share/atlas/ddm/dq2_fts_get.py -d data11_7TeV.*.physics_Egamma.*.NTUP_2LHSG2.*p600/
/export/share/atlas/ddm/dq2_fts_get.py -d data11_7TeV.*.physics_Muons*.NTUP_2LHSG2.*p600/
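To get a feeling for what such wildcard patterns select, you can try them out offline. The sketch below uses Python's fnmatch module as a stand-in. That DQ2 interprets the wildcards the same way is an assumption, and the dataset names in the list are made up purely for illustration.

```python
from fnmatch import fnmatchcase

# Made-up dataset names, used only to illustrate the matching.
EXAMPLE_DATASETS = [
    "data11_7TeV.00180001.physics_Egamma.merge.NTUP_2LHSG2.f368_m716_p600/",
    "data11_7TeV.00180001.physics_Muons.merge.NTUP_2LHSG2.f368_m716_p600/",
    "data11_7TeV.00180001.physics_Egamma.merge.NTUP_TOP.f368_m716_p569/",
]


def select_datasets(pattern, names):
    """Return the names that match the shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


for pattern in ("data11_7TeV.*.physics_Egamma.*.NTUP_2LHSG2.*p600/",
                "data11_7TeV.*.physics_Muons*.NTUP_2LHSG2.*p600/"):
    print(pattern)
    for name in select_datasets(pattern, EXAMPLE_DATASETS):
        print("   ", name)
```

Keep in mind that an unquoted * on the command line is also subject to filename expansion by the shell, so putting the pattern in single quotes is the safer choice.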

Data replication to the worker nodes

The script

/export/share/atlas/ddm/xrd_replicate_to_workers.py

can be used to replicate files downloaded to the NFS node to the worker nodes. It needs a ROOT environment to be set up:

setupATLAS
localSetupROOT

For instance, I'm copying the NTUP_HSG2 ntuples to the worker nodes with the command:

/export/share/atlas/ddm/xrd_replicate_to_workers.py /atlas/dq2/data11_7TeV/NTUP_HSG2/

As you might guess, you have to specify the directory that you want to replicate (recursively) to the workers. If you've uploaded some files to your personal directory on the NFS node, you can use this script to easily replicate those to the worker nodes as well.

Unfortunately the NTUP_HSG2 ntuples are a data management nightmare. Currently we have about 15k files of this type on the NFS node. My script has to copy them to the workers file by file, because it checks for every file whether it's already there. But opening a new connection with xrd for each one is quite costly.

Luckily the script can simply be stopped and restarted, but it would still make a lot of sense if we didn't have to deal with files ~2 MB in size...

Note that once the FRM daemons are set up on the cluster, the data replication procedure will change quite a bit.
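The "check every file, copy only what is missing" logic is what makes the replication restartable. The following is a minimal sketch of that idea only: it works on ordinary local directories with os and shutil instead of XRootD, and the function name replicate_tree is made up. It is not the code of xrd_replicate_to_workers.py.

```python
import os
import shutil


def replicate_tree(source_dir, target_dir):
    """Copy every file under source_dir to target_dir, skipping existing ones.

    Local-filesystem stand-in for the XRootD copy. Returns the list of
    relative paths that were actually copied in this run.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        relative_dir = os.path.relpath(dirpath, source_dir)
        for filename in sorted(filenames):
            relative_path = os.path.normpath(
                os.path.join(relative_dir, filename))
            destination = os.path.join(target_dir, relative_path)
            if os.path.exists(destination):
                continue  # already there, e.g. from an interrupted run
            os.makedirs(os.path.dirname(destination), exist_ok=True)
            shutil.copy2(os.path.join(source_dir, relative_path), destination)
            copied.append(relative_path)
    return copied
```

Running the function a second time on the same directories copies nothing, which is why stopping and restarting is harmless. Note that this sketch only tests for existence, so a file left half-copied by an interruption would be skipped.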

PROOF dataset creation

The idea with the

/export/share/atlas/ddm/pq2_file_list_maker.py

script is to make it simple to define PQ2 datasets (http://root.cern.ch/drupal/content/pq2-tools) on a PROOF farm. To use this script, you need to have the "DQ2 environment" set up:

setupATLAS
localSetupDQ2Client
voms-proxy-init -voms atlas

To define a PQ2 dataset, you first need a text file with the names of the files that you want to put into the dataset.

This script can be used, for instance, to create a list of all the NTUP_TOP D3PD files from the Egamma stream that are already downloaded, like this:

/export/share/atlas/ddm/pq2_file_list_maker.py -d data11_7TeV*physics_Egamma*NTUP_TOP*p569/ -s head -o data11_7TeV.physics_Egamma.NTUP_TOP.p569.AllYear.HEAD

This creates a text file called data11_7TeV.physics_Egamma.NTUP_TOP.p569.AllYear.HEAD, which can then be used to create a PQ2 dataset. For that you would use the command:

pq2 put -d data11_7TeV.physics_Egamma.NTUP_TOP.p569.AllYear.HEAD

Remember that you need to run the dataset file creation script in a different environment than the PQ2 scripts. (The latter need a ROOT environment.)
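If you ever need to put together such a file list by hand, for example for files in your personal directory, a few lines of Python are enough. This is only a sketch: it assumes that the list is a plain text file with one file name per line, and the function name write_file_list is made up. It is not how pq2_file_list_maker.py itself is implemented.

```python
def write_file_list(file_names, output_name):
    """Write the given file names to output_name, one per line.

    Assumes a plain one-name-per-line format. Duplicates are dropped and
    the names are sorted, so repeated runs produce the same file.
    Returns the number of names written.
    """
    unique_names = sorted(set(file_names))
    with open(output_name, "w") as output:
        for name in unique_names:
            output.write(name + "\n")
    return len(unique_names)
```

Sorting and removing duplicates is a design choice of the sketch, made so that two lists can be compared easily with diff.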

Recursive file removal

The final script

/export/share/atlas/ddm/xrd_remove_recursive.py

should be used with extreme care. I wrote it to make it simpler to remove a whole directory from the XRootD servers. (It's like "rm -rf".) To use it, you need to have a ROOT environment set up:

setupATLAS
localSetupROOT

Luckily, by default regular users are not allowed to remove central datasets themselves, but you could still easily remove all your personal files from the XRootD servers with it. Because of this, the script asks for confirmation (typing "yes" to a question) after it tells you how many files and directories it's about to delete from the specified server. (To delete files from the worker nodes, you have to run this script on each worker node in turn.)
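The "count first, then ask, then delete" safeguard described above can be sketched as follows. This is an illustration on a local directory using os and shutil, with made-up function names. The real xrd_remove_recursive.py talks to the XRootD servers and may well be structured differently.

```python
import os
import shutil


def count_entries(top):
    """Return (number of files, number of directories) at and below top."""
    n_files, n_dirs = 0, 1  # the 1 counts the top directory itself
    for _dirpath, dirnames, filenames in os.walk(top):
        n_dirs += len(dirnames)
        n_files += len(filenames)
    return n_files, n_dirs


def remove_recursive(top, ask=input):
    """Delete top and everything below it, but only after an explicit "yes".

    `ask` is the function used to put the question to the user. It can be
    replaced when the function is used non-interactively.
    Returns True if the directory was removed, False otherwise.
    """
    n_files, n_dirs = count_entries(top)
    answer = ask("About to delete %d files and %d directories under %s. "
                 "Type \"yes\" to continue: " % (n_files, n_dirs, top))
    if answer.strip() != "yes":
        return False
    shutil.rmtree(top)
    return True
```

Anything other than a literal "yes" leaves the directory untouched, which mirrors the behaviour described for the real script.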

-- AttilaKrasznahorkay - 01-Aug-2011

Topic revision: r2 - 2011-08-02 - AttilaKrasznahorkay
 