-- JamieShiers - 21 Feb 2006
Summary of SC4 Workshop in Mumbai
A successful - and sometimes animated - SC4 workshop was held in Mumbai prior to CHEP 2006. Good progress was made towards agreement in a number of areas, including storage management, services (support and operation), and the overall goals of the experiments for the SC4 service phase.
The slides - and conclusions from some areas - can be found on the agenda page (see under 'Hot Links' on the SC Wiki).
See also the file "sc4-expt-plans.ppt" on the Service Challenge talks and documents page, which includes the issues listed below.
Decisions
- For support issues, it has been agreed that we will use helpdesk@ggus.org (or www.ggus.org) from now on. The existing support lists will be closed down by May 2006 at the latest.
Outstanding Issues
- The details of the T1<->T1 transfers still need to be finalised. A "dTeam" phase should be foreseen, to ensure that the basic infrastructure is set up. Similarly for T1->T2. A possible scenario follows (a rough sizing sketch is given after this list):
- All Tier1s need to set up an FTS service and configure channels to enable transfers to/from all other Tier1s.
- dTeam transfers at 5MB/s (10MB/s?) need to be demonstrated between each T1 and all other T1s.
- ATLAS:
- It is understood that the Tier0 exercise in March does not include any transfers to external sites - this is foreseen for June.
- The data rates presented for data export to Tier1s - 720MB/s - do not include the full ESD to BNL. This was mentioned verbally at the workshop. If confirmed, this would add 80MB/s to both the total rate and the rate to BNL, bringing the total export rate for ATLAS to 800MB/s.
- Job submission rates per target grid need to be defined.
- ALICE:
- An LFC service, acting as a local file catalog, is required at all sites serving ALICE (task force communication).
- The "proof@caf" issue has not been discussed at the Tier0.
- xrootd is requested at all sites. This is being negotiated on a site-by-site basis.
- CMS:
- A new schedule needs to be produced taking into account the official dates for SC4 and the gLite release schedule.
- The exact implications of the "trivial file catalog" implementation - in terms of the precise service(s) that sites need to deploy - need to be defined urgently.
- There has been some clarification of the job submission plans, namely gLite 3.0 WMS for LCG sites and Condor-G for OSG sites. The split of the job load across the grids / sites also needs to be defined and agreed.
- CMS stress the need to test the entire data management chain with files >2GB, to ensure that these are fully supported by all relevant components and services.
- LHCb:
- The mention of "xrootd" post-SC4 is assumed to imply an evaluation and does not imply a specific service request.
- General:
- The detailed schedule and resource requirements need to be discussed and agreed once the above issues are resolved.
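As a rough illustration of the scale of the T1<->T1 setup referred to above, the sketch below (Python, using the Tier-1 names from the throughput table later in this page as an assumed site list) enumerates the full mesh of FTS channels each site would need to configure and the aggregate dTeam rate it would have to sustain if every pairwise link runs at 5MB/s. It is a back-of-the-envelope sizing aid, not part of the agreed plan.

    # Illustrative sketch only: estimates the number of FTS channels and the
    # aggregate dTeam rate per Tier-1 for a full T1<->T1 mesh. The site list and
    # the 5 MB/s per-link figure come from the notes above; nothing here talks
    # to a real FTS service.
    TIER1S = ["ASGC", "TRIUMF", "BNL", "FNAL", "NDGF", "PIC",
              "RAL", "SARA", "IN2P3", "FZK", "CNAF"]
    RATE_PER_LINK_MB_S = 5  # dTeam target per T1-T1 link (10 MB/s is the higher option)

    def mesh_summary(sites, rate_mb_s):
        """Return (channels per site, total directed channels, per-site aggregate MB/s)."""
        n = len(sites)
        channels_per_site = n - 1        # one outbound channel to every other T1
        total_channels = n * (n - 1)     # directed channels over the whole mesh
        per_site_rate = channels_per_site * rate_mb_s
        return channels_per_site, total_channels, per_site_rate

    if __name__ == "__main__":
        per_site, total, rate = mesh_summary(TIER1S, RATE_PER_LINK_MB_S)
        print(f"{per_site} outbound channels per Tier-1, {total} directed channels in all")
        print(f"aggregate outbound dTeam rate per Tier-1: {rate} MB/s")

With eleven Tier-1s this gives 10 outbound channels and a 50MB/s aggregate dTeam load per site, which is why a dedicated dTeam phase to shake out the infrastructure looks worthwhile before experiment data flows.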
Experiment Production Plans
ALICE
The first point of this year's PDC'06/SC4 plan is the scheduled rerun of the SC3 T0 disk -> T1 disk transfers (max 150MB/s). These will be scheduled transfers through the FTD-FTS system, and the target T1s are CNAF, IN2P3 Lyon, GridKa and RAL. Data generated during PDC'05 and available at CERN will be used. The amounts of data to be transferred to each centre will depend on the available storage capacity; a possible scenario is to remove the data from the target SE once it has been successfully transferred. The target is 150MB/s aggregate throughput sustained for 7 days.
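For orientation, 150MB/s sustained for 7 days corresponds to roughly 90TB in total. The short sketch below is an illustrative back-of-the-envelope calculation (the even split across the four target Tier-1s is an assumption, not part of the ALICE plan).

    # Back-of-the-envelope estimate of the data volume implied by the ALICE
    # target of 150 MB/s aggregate throughput sustained for 7 days.
    rate_mb_s = 150
    days = 7
    seconds = days * 24 * 3600
    total_tb = rate_mb_s * seconds / 1e6           # MB -> TB (decimal units)
    print(f"total volume: {total_tb:.0f} TB over {days} days")

    # Hypothetical even split over the four target Tier-1s named above.
    tier1s = ["CNAF", "IN2P3 Lyon", "GridKa", "RAL"]
    for site in tier1s:
        print(f"  {site}: ~{total_tb / len(tier1s):.0f} TB if shared evenly")

Since the scenario above allows data to be removed from the target SE after a successful transfer, the storage actually occupied at each centre at any one time could be well below these totals.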
In parallel to the file transfers, we will continue to run jobs to test the stability of the complete system.
The requirement for LFC as a local catalog at all sites was clarified.
ATLAS
ATLAS' SC4 requests are summarised as follows:
- March-April (pre-SC4): 3-4 weeks for internal Tier-0 tests (Phase 0)
- April-May (pre-SC4): tests of distributed operations on a "small" testbed (the pre-production system)
- Last 3 weeks of June: Tier-0 test (Phase 1) with data distribution to Tier-1s (720MB/s + full ESD to BNL)
- 3 weeks in July: distributed processing tests (Part 1)
- 2 weeks in July-August: distributed analysis tests (Part 1)
- 3-4 weeks in September-October: Tier-0 test (Phase 2) with data to Tier-2s
- 3 weeks in October: distributed processing tests (Part 2)
- 3-4 weeks in November: distributed analysis tests (Part 2)
CMS
CMS emphasised the requirement to test the entire chain using files larger than 2GB (to make sure that no hidden limitations remain).
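The concern behind the >2GB requirement is typically a 32-bit size or offset somewhere in the chain (a signed 32-bit integer overflows at 2^31 bytes, about 2.1GB). A minimal sketch of the kind of post-transfer check that could be run is shown below; it is illustrative only, assumes a locally readable copy of the file, and the path and size in the usage comment are placeholders.

    # Minimal sketch: verify that a file transferred through the data management
    # chain really has the expected size and was not silently truncated at the
    # 32-bit boundary (2**31 bytes). Paths and sizes are placeholders.
    import os

    TWO_GIB = 2**31  # the classic signed 32-bit limit

    def check_large_file(path, expected_size):
        """Compare the on-disk size with the size recorded at the source,
        flagging the classic truncation at the signed 32-bit boundary."""
        actual = os.path.getsize(path)
        if actual == expected_size:
            return True
        hint = " (truncated at the 2 GiB boundary?)" if actual in (TWO_GIB, TWO_GIB - 1) else ""
        raise RuntimeError(f"{path}: {actual} bytes, expected {expected_size}{hint}")

    # Example usage with made-up values:
    # check_large_file("/storage/cms/test_3gb.dat", 3 * 1024**3)

A checksum comparison between source and destination would catch more subtle corruption, but a simple size check is already enough to expose the most common 2GB limitation.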
The timeline presented below needs to be aligned with the official SC4 schedule. The overall timeline (copied from previous minutes - see the CMS SC4 workshop presentation for details) is as follows:
March 1st
CMS expects to be able to integrate the analysis batch submission tool (CRAB) into gLite 3.0 pre-production as soon as it is available. The plan is for 6 weeks of functionality and stability testing; total resource requirements are modest and can be met by the available pre-production sites. Integration of the new CMS production environment, submitting to gLite 3.0, is expected in the same time frame. This should allow CMS to exercise the two main processing applications needed for the remainder of SC4.
March 15
CMS expects a release of PhEDEx that can use FTS to drive transfers.
April 1
CMS would like to begin low-level continuous transfers between sites that support CMS. The goal is 20MB/s (~2TB/day) of continuous running. Three groups have been identified to supervise the transfer systems, and CMS has also developed a heartbeat monitor for PhEDEx. There is also a ramp-up to demonstrate particular numbers of TB per day between tiers; these numbers should be agreed by next week.
April 15
Begin production-scale running on gLite 3.0 with the simulation and analysis applications. The goal by the end of the year is to have successfully demonstrated 50k-100k jobs submitted per day.
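To put that goal in perspective, the small calculation below (illustrative arithmetic, not a CMS figure) converts the 50k-100k jobs/day target into the sustained submission rate the workload management systems would need to absorb.

    # Convert the CMS end-of-year goal of 50k-100k submitted jobs per day into a
    # sustained submission rate. Purely illustrative arithmetic.
    for jobs_per_day in (50_000, 100_000):
        per_second = jobs_per_day / (24 * 3600)
        print(f"{jobs_per_day} jobs/day ~= {per_second:.2f} jobs/s sustained")

That is roughly 0.6-1.2 jobs per second averaged around the clock, before allowing for peaks or for the split across LCG and OSG submission routes.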
June 1
CMS expects a 10TB sample of data in the new CMS Event Data Model to be available for transfer and analysis access.
May 29 - June 12
A two-week period of running to demonstrate the low-level functionality of all elements of the CMS computing model.
July-August
CMS expects production for the 2006 data challenge at a rate of 25M events per month. This should not require more than the CMS share of the computing facilities.
September
Preparations for the Computing Software Analysis Challenge 2006 (CSA06).
October
Execute CSA06.
LHCb
In preparation for the SC4 production phase (June on), LHCb foresee generating 100M B-physics + 100M minimum bias events (event generation, detector simulation & digitisation). This will require 3.7 MSI2k·months of CPU (~2-3 months of running) and 125 TB on MSS at the Tier-0 (keeping the MC truth). It is foreseen to start in mid-March, ramping up to full production by end-March.
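As a cross-check on these figures (a sketch under the stated assumptions, not part of the LHCb plan), 3.7 MSI2k·months delivered over 2-3 months implies a sustained capacity of roughly 1.2-1.9 MSI2k, and 200M events into 125TB corresponds to an average event size of about 0.6MB.

    # Rough cross-check of the LHCb figures quoted above (3.7 MSI2k.months of CPU,
    # 200M events, 125 TB on MSS). Illustrative arithmetic only.
    cpu_msi2k_months = 3.7
    for months in (2, 3):
        print(f"over {months} months: ~{cpu_msi2k_months / months:.1f} MSI2k sustained")

    events = 200e6          # 100M B-physics + 100M minimum bias
    storage_tb = 125
    event_size_mb = storage_tb * 1e6 / events   # TB -> MB (decimal units)
    print(f"average event size: ~{event_size_mb:.2f} MB/event")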
Preparing for SC4 Disk-Disk and Disk-Tape Throughput Tests in April
The table below gives the well-known target rates that each site should achieve.
It is important to emphasise that these are daily averages sustained over extended periods - not one-time peaks.
| Site | Disk-Disk (MB/s) | Disk-Tape (MB/s) |
| ASGC | 100 | 75 |
| TRIUMF | 50 | 50 |
| BNL | 200 | 75 |
| FNAL | 200 | 75 |
| NDGF | 50 | 50 |
| PIC | 100 | 75 |
| RAL | 150 | 75 |
| SARA | 150 | 75 |
| IN2P3 | 200 | 75 |
| FZK | 200 | 75 |
| CNAF | 200 | 75 |
As usual, we will first run the disk-disk throughput test and then disk-tape.
(The rate for BNL assumes that a full copy of the ESD is exported there.)
(In July, the disk-tape rates go up to full-nominal, i.e. the disk-disk rates in the table above).
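Since the targets are daily averages rather than peaks, it can be useful to restate them as volumes per day. The sketch below (illustrative only, decimal units) does the conversion for the disk-disk column of the table above.

    # Convert the nominal disk-disk targets (MB/s, from the table above) into the
    # daily volumes they imply, since the targets are daily averages rather than
    # one-time peaks. Decimal units (1 TB = 1e6 MB) are assumed.
    disk_disk_mb_s = {
        "ASGC": 100, "TRIUMF": 50, "BNL": 200, "FNAL": 200, "NDGF": 50,
        "PIC": 100, "RAL": 150, "SARA": 150, "IN2P3": 200, "FZK": 200, "CNAF": 200,
    }
    seconds_per_day = 24 * 3600
    for site, rate in disk_disk_mb_s.items():
        tb_per_day = rate * seconds_per_day / 1e6
        print(f"{site:7s} {rate:4d} MB/s ~= {tb_per_day:5.1f} TB/day")
    total = sum(disk_disk_mb_s.values())
    print(f"aggregate: {total} MB/s ~= {total * seconds_per_day / 1e6:.0f} TB/day")

For example, a 200MB/s site must move about 17TB per day, and the full table corresponds to an aggregate of 1.6GB/s, or roughly 138TB per day, out of the Tier-0.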