Middleware Readiness Working Group

ATLAS participates in the WLCG Middleware Readiness Working Group, which tests new versions of middleware products to evaluate their readiness for use by the experiments. ATLAS' role is to dedicate computing and data resources at participating sites for this testing. HammerCloud is used as the testing framework.

The following sites will actively deploy test versions of middleware and agree to dedicate some resources to ATLAS. Here resources means CPU, disk space and hardware for services, as well as technical support:

| Site | Product | ATLAS Component | Importance to ATLAS | Contact |
| Triumf | dCache SE | Panda pilot/DQ2/Rucio | High | Simon Liu, Di Qing |
| NDGF-T1 | dCache SE | Panda pilot/DQ2/Rucio | High | Gerd Behrmann |
| Edinburgh | DPM SE | Panda pilot/DQ2/Rucio | High | Wahid Bhimji |
| QMUL | StoRM SE | Panda pilot/DQ2/Rucio | High | Chris Walker |
| INFN-T1 | StoRM SE | Panda pilot/DQ2/Rucio | High | Salvatore Tupputi |
| OSG | Xrootd | Panda pilot | High | OSG contacts |
| CERN | FTS3 | DQ2/Rucio | High | CERN contacts |
| Napoli | CREAM-CE | Pilot factory | High | Alessandra Doria |
| - | HTCondor | Pilot factory | High | Peter Love |
| - | VOMS Client | Panda pilot | Low | - |
| - | LFC | Panda pilot/DQ2/Rucio | Very low | - |
Notes: LFC is no longer used by ATLAS.

Configuration and Monitoring

| Site | Product | Panda Queue(s) | DDM Endpoint(s) | Jobs | Data Transfer |
| Triumf | dCache SE | TRIUMF_PPS, ANALY_TRIUMF_PPS | TRIUMF-LCG2-MWTEST_DATADISK, TRIUMF-LCG2-MWTEST_SCRATCHDISK, TRIUMF-LCG2-MWTEST_DATATAPE | TRIUMF_PPS, ANALY_TRIUMF_PPS | Rucio transfers |
| NDGF-T1 | dCache SE | ARC-TEST | NDGF-T1-MWTEST_DATADISK | ARC-TEST | Rucio transfers |
| UKI-SCOTGRID-ECDF | DPM SE | UKI-SCOTGRID-ECDF_TEST | UKI-SCOTGRID-ECDF-RDF_DATADISK, UKI-SCOTGRID-ECDF-RDF_PRODDISK | UKI-SCOTGRID-ECDF_TEST | Rucio transfers |
| QMUL | StoRM SE | UKI-LT2-QMUL_TEST | UKI-LT2-QMUL-MWTEST_DATADISK | UKI-LT2-QMUL_TEST | Rucio transfers |
| INFN-T1 | StoRM SE | INFN-T1_TEST | INFN-T1-MWTEST_DATADISK | INFN-T1_TEST | Rucio transfers |
| INFN-NAPOLI-ATLAS | CREAM CE | INFN-NAPOLI-RECAS-TEST | - | INFN-NAPOLI-RECAS-TEST | - |
| CERN | HTCondor | Most of the above | - | Pilot factory monitor | - |

Notes:

  • The pilot factory serves all the test Panda queues except ARC-TEST
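
The Jobs column above points at the job monitoring for each test queue. Purely as an illustration, job activity on a test queue can also be checked programmatically; the sketch below queries the BigPanDA monitor for recent jobs on one of the test queues listed above. The URL, query parameters and response structure are assumptions and may differ from the monitor actually linked in the table.

```python
# Minimal sketch (assumptions): query the BigPanDA monitor for recent jobs on a
# test Panda queue. The base URL, the query parameters and the shape of the JSON
# response are assumed, not taken from this page.
import requests

QUEUE = "TRIUMF_PPS"  # one of the test queues from the table above
URL = "https://bigpanda.cern.ch/jobs/"

resp = requests.get(
    URL,
    params={"computingsite": QUEUE, "hours": 12, "json": 1},
    headers={"Accept": "application/json"},
    timeout=60,
)
resp.raise_for_status()
jobs = resp.json().get("jobs", [])  # assumed response layout

# Count jobs per status to get a quick health check of the test queue.
counts = {}
for job in jobs:
    status = job.get("jobstatus", "unknown")
    counts[status] = counts.get(status, 0) + 1
print(QUEUE, counts)
```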

Testing Procedure

Services

At each testing site a Panda resource should be set up and linked to the product under test (e.g. an SE or a client library). When a new version of the product is released for testing, the site should install it. A constant stream of HammerCloud jobs will run at the site, and this may be ramped up during testing periods. For Storage Elements under test, DDMFunctionalTests transfers will be directed to and from the SE. The regular ATLAS pilot can be used.
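
As a minimal illustration of following a storage element under test, the sketch below uses the Rucio Python client to read back the space usage reported for one of the MWTEST endpoints in the table above. It assumes a working Rucio client configuration (rucio.cfg and a valid proxy); the endpoint name is taken from the Triumf row and should be substituted as appropriate.

```python
# Minimal sketch, assuming a configured Rucio client (rucio.cfg, X.509 proxy).
# It reads the usage reported for a middleware-test endpoint so that the effect
# of HammerCloud jobs and DDM functional-test transfers can be followed.
from rucio.client import Client

client = Client()
rse = "TRIUMF-LCG2-MWTEST_DATADISK"  # test endpoint name from the table above

for usage in client.list_rse_usage(rse):
    # Each record reports a usage source together with used bytes; comparing
    # sources over time helps spot stuck transfers or deletions.
    print(rse, usage.get("source"), usage.get("used"), usage.get("total"))
```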

Clients

For testing clients, a new version can be installed in CVMFS and the pilot can be directed to use the test version on the test resource. The grid clients that ATLAS currently uses are located in the ATLAS Local Root Base (ALRB) area of the ATLAS CVMFS repository, where special setup scripts allow the grid clients to interact cleanly with the ATLAS software. The maintainer of ALRB tests new versions of middleware for compatibility with ATLAS. Eventually ATLAS may take grid clients from grid.cern.ch, but this is not foreseen in the immediate future.
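
The actual switch between production and test clients is handled by ALRB and the pilot configuration. Purely to illustrate the idea, the sketch below shows how a wrapper could prefer a test client installation under CVMFS when a flag is set; the environment variable and the test directory are hypothetical and do not represent the real ALRB interface.

```python
# Hypothetical sketch only: select a test grid-client installation from CVMFS
# when an (invented) flag is set, otherwise fall back to the production ALRB area.
# The real switch is done through ALRB / pilot configuration, not this code.
import os

PROD_CLIENTS = "/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase"  # production ALRB area
TEST_CLIENTS = "/cvmfs/atlas.cern.ch/repo/sw/mw-test-clients"  # hypothetical test area


def client_base(use_test: bool) -> str:
    """Return the client installation directory this pilot should use."""
    base = TEST_CLIENTS if use_test else PROD_CLIENTS
    if not os.path.isdir(base):
        raise RuntimeError(f"client area not mounted: {base}")
    return base


if __name__ == "__main__":
    # MW_TEST_CLIENTS is an invented flag a dev pilot on a test queue might set.
    base = client_base(os.environ.get("MW_TEST_CLIENTS") == "1")
    os.environ["PATH"] = f"{base}/bin:" + os.environ.get("PATH", "")
    print("using grid clients from", base)
```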

In addition to the ALRB testing, an automatic procedure will send dev versions of the ATLAS pilot, configured to use the test versions of the clients, to the test Panda queues. This makes it easier to monitor the jobs than having a small fraction of dev pilots running on the regular queues.

HTCondor

HTCondor is a special case. HTCondor can refer either to the CE service or to the software used by pilot factories to submit jobs to (any kind of) CE. Here we only consider the software used by the pilot factories. The ATLAS viewpoint was summarised by Simone Campana:

  • The WLCG MW officer defines baseline releases for Condor clients plus Grid Computing Elements (of any flavour).
  • When a new release comes out (of Condor or of one of the Grid CEs), the MW officer and the MW Readiness working group decide whether it is just a cosmetic change or something worth testing.
  • The MW Readiness working group negotiates with the sites in case a CE with the latest version needs to be installed/configured, or asks ATLAS to deploy whatever new version of Condor in one of the factories.
  • ATLAS launches some verification tests (basically send pilots, verify they pick up a job and terminate gracefully) and reports on the outcome. ATLAS will also provide all the information needed to debug a problem in case one appears.
  • The MW Readiness working group drives the process of understanding where the issue is and gets in touch with the various parties (Condor people, CE developers, ATLAS) to solve it.
  • The process iterates until there is a good understanding (a bug fixed, a suggested configuration, or even a declared incompatibility).
  • The information is passed back to the MW officer, who defines a new baseline release or in any case takes the outcome into account.
  • WLCG operations propagates the information and possible actions.

To summarize, I see the MW Readiness working group and the MW officer as the ones driving the process. ATLAS would have a role similar to the other stakeholders (we provide pilot factories, Panda, the monitoring, and effort in reporting issues in a non-ATLAS-specific manner) and contributes to the effort.

In addition, we would like to distinguish between products where the volunteer sites have a close link with the product team, and products like HTCondor where the testers and the product team are separate. It is natural that in the latter case the validation procedure may take longer. A sketch of the pilot verification step is shown below.
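
Purely as an illustrative sketch of the verification step above (send pilots and check that they run and finish cleanly), the snippet below uses the HTCondor Python bindings to submit a trivial job through a schedd and poll it until it completes. It does not represent the actual pilot factory code; the executable and the polling logic are placeholders, and the submit call assumes recent bindings.

```python
# Illustrative sketch only (not the pilot factory): submit a trivial test job
# with the HTCondor Python bindings and poll until it leaves the queue.
# Assumes recent bindings (schedd.submit); older releases use schedd.transaction().
import time
import htcondor

schedd = htcondor.Schedd()  # local schedd, as on a pilot factory host

sub = htcondor.Submit({
    "executable": "/bin/sleep",   # stand-in for a pilot wrapper
    "arguments": "30",
    "output": "mwtest.$(ClusterId).out",
    "error": "mwtest.$(ClusterId).err",
    "log": "mwtest.$(ClusterId).log",
})

cluster = schedd.submit(sub).cluster()
print("submitted test job, cluster", cluster)

# Poll the queue; JobStatus 4 means Completed, and a finished job eventually
# disappears from the queue altogether.
while True:
    ads = schedd.query(constraint=f"ClusterId == {cluster}",
                       projection=["JobStatus"])
    if not ads or ads[0]["JobStatus"] == 4:
        print("test job finished (or left the queue) -> submission path looks OK")
        break
    time.sleep(30)
```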

Adding test endpoints to ATLAS

There must be test Panda Resource(s) and/or DDM Endpoint(s) associated with the site. These are set up in AGIS.

DDM Endpoints

  • The site should:
    • Set up a storage service running the test version of the product
    • Create the corresponding SRM space tokens and directories as in the production endpoints (see StorageSetUp)
      • The required tokens are ATLASDATADISK, ATLASPRODDISK and ATLASSCRATCHDISK
    • Register the endpoint in GOCDB (with the production and monitored flags set to true for now)
    • Provide between 1 and 10 TB of total space (a T1 SCRATCHDISK must be larger than 10 TB to avoid blacklisting)
  • ADC Central operations set up the DDM endpoints in AGIS:
    • SITE-MWTEST_DATADISK (for input files)
    • SITE-MWTEST_PRODDISK (T2 only, for MC production output files)
    • SITE-MWTEST_SCRATCHDISK (for analysis output files)
    • The endpoints should be picked up automatically by Rucio (see the sketch below)
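
To check that a new endpoint has indeed been picked up, the following minimal sketch (assuming a configured Rucio client) looks the endpoint up by name and prints its attributes. The endpoint name follows the SITE-MWTEST pattern above and should be substituted for the site being added.

```python
# Minimal sketch, assuming a configured Rucio client: confirm that a newly
# added middleware-test endpoint is known to Rucio and inspect its attributes.
from rucio.client import Client

client = Client()
rse = "TRIUMF-LCG2-MWTEST_DATADISK"  # substitute the SITE-MWTEST_* endpoint being added

info = client.get_rse(rse)               # raises if the RSE is not registered
attrs = client.list_rse_attributes(rse)  # e.g. site, cloud, tape flags

print(rse, "known to Rucio; deterministic =", info.get("deterministic"))
for key, value in sorted(attrs.items()):
    print(f"  {key} = {value}")
```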

Panda Resource

  • The site should:
    • Ensure that the test storage is accessible from the worker nodes in the same way as the normal storage
    • Allow 1-2% of worker node resources to be dedicated to testing during testing periods
  • ADC Central operations set up:
    • For production jobs: SITE-TEST, with DDM endpoint SITE-MWTEST_PRODDISK (DATADISK for a T1)
    • For analysis jobs: ANALY-SITE-TEST, with DDM endpoint SITE-MWTEST_SCRATCHDISK
    • These queues can use the same CEs as the production resources
      • Unless the CE itself is under test, in which case the test CE should be used
    • When a queue is created, its status should be set to test
    • Subscribe the HammerCloud input datasets to SITE-MWTEST_DATADISK as described here (a Rucio sketch is shown below)
    • HammerCloud jobs will then be sent automatically
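
As a minimal sketch of the dataset subscription step, a replication rule can be created with the Rucio client to place a copy of the HammerCloud input dataset at the test DATADISK. The scope and dataset name below are placeholders (the real dataset is the one referred to above), and a configured Rucio client with an account allowed to create rules is assumed.

```python
# Minimal sketch, assuming a configured Rucio client with an account allowed
# to create rules. Scope and dataset name are placeholders for the actual
# HammerCloud input dataset; the RSE is the site's MWTEST DATADISK endpoint.
from rucio.client import Client

client = Client()

dids = [{"scope": "hc_test", "name": "hc_test.input.dataset.placeholder"}]  # placeholder DID
rule_ids = client.add_replication_rule(
    dids=dids,
    copies=1,
    rse_expression="TRIUMF-LCG2-MWTEST_DATADISK",  # substitute SITE-MWTEST_DATADISK
    lifetime=30 * 24 * 3600,                       # optional: expire the test copy after 30 days
    comment="MW readiness: HammerCloud input for the test queue",
)
print("created rule(s):", rule_ids)
```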


Major updates:
-- DavidCameron - 04 Mar 2014

Responsible: DavidCameron
Last reviewed by: Never reviewed
