-- JamieShiers - 07 Mar 2006

Experiment Reports


  • We would like to replicate on the PPS the same functionality we are using now for the production on SC3.
  • So, we need the same set of services. The only possible difference is that, if VOMS becomes the standard for authentication, we would switch to it.
  • As for the sites, we would like to start with some T1s and T2s. The priority is making our jobs work on a few sites before increasing the scale. CERN, CC-IN2P3, CNAF, GridKa + GSI, and Torino would be OK with us.

Piergiorgio Cerello


At the moment, only 6 or 7 CEs are passing the site functional tests. The killer test is the replica-management one. I do not believe every site on the grid is having replica-management problems; I suspect it is some "central" problem, e.g. the BDII not responding. I would like to ask for a statement about this, and for some action to be taken ASAP, such as rerunning the tests. Moreover, I would suggest that the CIC-on-duty take a critical look at the results and wait before publishing them, since to me it is quite obvious that something pathological is going on at the time of the tests.

Simone Campana


Status and Upcoming items for this week

-Data Transfers:

FTS integration: CMS hopes to begin validating FTS transfers driven from PhEDEx this week.

-Analysis Activities:

CMS has validated the new data management infrastructure from CRAB and should move to larger-scale testing with the Job Robot by the end of next week, beginning at a rate of 1k jobs per day.

-Simulation Activities:

The first 500k events were produced with the new framework, driven by the new production agents, primarily at local sites. Grid validation continues.


LHCb would like pre-production services (CE, FTS servers, SRM-enabled SE) at their Tier-1 sites (CERN, CNAF, FZK, PIC, IN2P3-Lyon). It was noted that RAL and NIKHEF/SARA were not offering a PPS service. LHCb will require the following central services for the PPS: a Resource Broker and an LFC service. The corresponding client tools will be shipped and installed at run time during the DIRAC installation. VO boxes should be made available at all sites (including non-PPS Tier-1s). The VO boxes will run an LHCb "Request DB" (a flat file) and the LHCb TransferAgent.

It would be extremely beneficial if the PPS CE accessed the production batch system and the PPS SE allowed access to the currently stored data.

In SC4, LHCb would like to have a read-only local LFC for redundancy purposes at each Tier-1; we would like to test this at at least one Tier-1 centre during the PPS phase. This read-only LFC will be a mirror of the central master, which involves database replication, not just another instance of the LFC.
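The read-only mirror implies a strict read/write split on the client side: lookups may be served locally, while registrations must go to the central master and reach the mirror via database replication. A conceptual sketch of that routing, assuming a hypothetical client-side wrapper (the LFC itself is accessed through its own client libraries; the class and method names here are invented purely for illustration):

```python
# Conceptual sketch only: routing catalog operations when a Tier-1 hosts
# a read-only LFC mirror of the central master. The catalog objects and
# their methods are hypothetical stand-ins, not the real LFC API.
class CatalogRouter:
    def __init__(self, local_mirror, central_master):
        self.local = local_mirror      # read-only replica at the Tier-1
        self.master = central_master   # writable central LFC master

    def lookup(self, lfn):
        # Reads hit the local mirror, for latency and redundancy.
        return self.local.get_replicas(lfn)

    def register(self, lfn, surl):
        # Writes must go to the master; database replication then
        # propagates the new entry to the read-only mirrors.
        return self.master.add_replica(lfn, surl)
```

The key point is that the mirror is kept consistent at the database level, so the client never writes to it directly.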

Nick Brook

Status of Pre-Production System

Feedback from experiments on required services


For the pre-production services (PPS):

- LFC catalog required at all pre-production Tier1s

- FTS server at all pre-production Tier1s and at the pre-production Tier0. FTS channels between all pre-production Tier1s, and FTS channels between each pre-production Tier1 and all its associated pre-production Tier2s. CERN should have FTS channels defined to all pre-production Tier1s.

- (there is an "SRM" column on the Excel sheet with the list of services: I guess this is required by default, right?)

[ for more details, please check ATLAS SC4 plans during Mumbai workshop ]
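The channel topology requested above can be enumerated as a quick sketch. The site names and Tier-1/Tier-2 associations below are illustrative placeholders, not the agreed PPS site list:

```python
# Hedged sketch: enumerate the FTS channels implied by the PPS topology
# described above (full mesh among Tier-1s, Tier-0 star, Tier-1 to its
# Tier-2s). Site names and associations are hypothetical examples.
from itertools import permutations

tier0 = "CERN"
tier1s = ["CNAF", "FZK", "PIC", "IN2P3"]
tier2s = {"CNAF": ["TORINO"], "FZK": ["GSI"]}  # hypothetical associations

channels = set()

# Channels between all pre-production Tier-1s (both directions).
for src, dst in permutations(tier1s, 2):
    channels.add((src, dst))

# CERN (Tier-0) to/from every Tier-1.
for t1 in tier1s:
    channels.add((tier0, t1))
    channels.add((t1, tier0))

# Each Tier-1 to/from its associated Tier-2s.
for t1, associated in tier2s.items():
    for t2 in associated:
        channels.add((t1, t2))
        channels.add((t2, t1))

print(len(channels))  # 12 + 8 + 4 = 24 channels in this toy topology
```

Even this toy configuration shows how quickly the channel count grows with the Tier-1 mesh, which is why the production service defers the new Tier-1 channels until SC4.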

For the production services (PS): exactly the same components as for the PPS, BUT we do not need the new FTS Tier1 servers/channels for production before June (SC4 start). We just need what we have today: an FTS server at the Tier0 with channels to all Tier1s. What we also have today, namely the local LFC production catalogs at the Tier0 and Tier1s and the VO BOXes at the Tier0 and Tier1s, is all we need up until the SC4 start. (No new FTS requirements for the production service until the SC4 start in June.)

An important point on VO BOXes: ATLAS requires VO BOXes at the Tier0 and Tier1s only. BUT we do not require sites to deploy a separate VO BOX for production and pre-production, as our code can easily coexist on the same box and is not CPU/disk intensive. So, for ATLAS, sites may declare the PPS VO BOX and the PS VO BOX to be the same machine with the same endpoint. If sites or LCG decide to deploy two instances, that is fine of course, but there is no real gain in doing so. * The catch here is updating the gLite FTS client libraries: there may be conflicts between multiple versions! * The two separate instances of our code can point to either the PPS or the PS.


-BDII CMS currently relies on the GIP for software publishing to the RB. Unless the gLite 3.0 RB uses another technique, we probably need this at all sites.

-FTS CMS expects to have access to FTS servers at all LCG Tier-1 centers

-RGMA R-GMA is not currently used by the experiment's monitoring.

-RB A sufficient number of RBs are needed to reach the desired goal of 25k batch slots occupied in a 24-hour period. Expect 2-3.

-WMS Only needed for central instances. New components in the WMS, including Condor-C for submission, will be tested.

-CE Needed at all sites

-gLite CE Needed at enough sites for scale tests; if successful, needed at all sites.

-UI Needed for client submissions: any place where users run CRAB. For testing, CERN, FNAL, Bologna, etc. should be sufficient.

-LFC (Central and Local) Only a central instance for the DLS is required. Sites wishing to test LFC as an alternative to the trivial file catalog can deploy it.

-VOMS One central instance

-VOBOX CMS services will be needed to reasonably test gLite 3.0. PhEDEx is needed at Tier-1 and Tier-2 sites for gLite 3.0 testing.
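The RB sizing in the list above ("expect 2-3" for 25k occupied batch slots in 24 hours) can be sanity-checked with back-of-envelope arithmetic. The per-RB submission rate below is an assumed figure for illustration, not one stated in the report:

```python
import math

# Back-of-envelope check on the RB count. The per-RB daily submission
# rate is an assumption chosen for illustration, not a measured figure.
target_slots = 25_000          # batch slots to fill within 24 hours
jobs_per_rb_per_day = 10_000   # assumed sustainable rate per RB

rbs_needed = math.ceil(target_slots / jobs_per_rb_per_day)
print(rbs_needed)  # 3
```

With an assumed throughput of roughly 10k jobs per day per broker, three RBs cover the goal, consistent with the "2-3" estimate.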

Metrics for experiment testing of PPS


- The pre-production service will be used as soon as it is available, and its usage will not go away when SC4 starts. There may be periods where the pre-production service is not extensively used, but the goal is, from now on, to always develop against the pre-production service.

- The first use of the PPS will be an intense Tier0 export test in March/April (one week). LCG and the sites are free to propose the most convenient week. The goal is for all participating sites to achieve their MoU rates. This exercise would ideally run on the production service, but the goal is to exercise data management using the new middleware (and, by using the PPS, we expect faster upgrade cycles in case of problems). We expect data to be stored on tape during this exercise, but it may be scratched (along with all catalog/disk entries) at a later date, to be discussed with the sites. This will be very much an exercise for the Tier1 storage elements! More details on this exercise will be sent out soon, but again, please take a look at our Mumbai presentation for rates, etc.

- Before SC4 starts, the PPS will also be used for a distributed production exercise, to test the integration of data management and workload management.


The first test CMS will perform with the PPS is the successful submission of CRAB (CMS Remote Analysis Builder) jobs. For these to be successful, the PPS resources need access to the local mass storage at a reasonable level (1 MB/s per batch slot is sufficient). The goal is to demonstrate that the bottlenecks in the speed of the current RB submissions have been removed. By the end of March / beginning of April, CMS would like to be submitting with the Job Robot.
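The 1 MB/s-per-slot requirement above translates directly into an aggregate storage-bandwidth figure per site. The slot count below is an illustrative example, not a stated PPS site size:

```python
# Rough sizing of the local mass-storage read bandwidth implied by the
# CMS requirement of 1 MB/s per batch slot. The slot count is a
# hypothetical example, not a figure from the report.
mb_per_slot = 1        # MB/s per batch slot (from the CMS requirement)
batch_slots = 500      # illustrative number of slots at a PPS site

aggregate_mb_s = mb_per_slot * batch_slots
print(aggregate_mb_s)  # 500 MB/s of local mass-storage throughput
```

Even a mid-sized site therefore needs substantial mass-storage read throughput before the CRAB tests can be considered meaningful.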

The second test will be the integration of the new production agent to submit to the PPS. By the beginning to middle of April, we wish to have a validated module for submission to gLite 3.0.

Review of Experiment Plans and Site Setup

ATLAS WMS plans for 2006

ATLAS is currently running productions on the EGEE Grid using two job submission systems: the EDG Resource Broker and the Condor-G based system. Both of them are interfaced to the ATLAS production system (ProdSys) and share part of their code base.

During the last few months of 2005 and the beginning of 2006, we tested the functionality and performance of the new gLite WMS, and adapted our ProdSys executor Lexor to use it. The gLite WMS has new features with respect to the old RB that make it more attractive and usable: faster response time, an interface to the data management system, and the possibility of bulk submission. Although those tests were performed in a restricted environment, they showed that, already within that environment, the new WMS performs better than the old RB.

We therefore expect the new WMS to be available for large-scale testing in the context of the gLite 3.0 release, first on the pre-production service, then in the SC4 setup.

During 2006, in the context of SC4 but also for other distributed productions, we plan to submit jobs to the EGEE Grid using the gLite WMS and compare its global performance for production and analysis job submission with the Condor-G based system. Without such a large-scale test and comparison, it will not be possible to take informed decisions on the best way to submit Grid jobs for ATLAS.

We therefore intend to make intense use of both the gLite and Condor-G based systems during the full year 2006; we assume that their performance will confirm the present results and continue to evolve in line with the ATLAS requirements. Under this assumption, only these two systems will be used in the ATLAS productions on the LCG resources in the EGEE Grid in 2006.

Dario Barberis

This topic: LCG > WebHome > LCGServiceChallenges > ServiceChallengeMeetings > SCWeeklyPhoneCon060313
Topic revision: r5 - 2006-03-13 - JamieShiers