-- JamieShiers - 01 Mar 2006

The Pre-Production Service

  • gLite 3.0 timeline and high-level PPS plans (presentation to the Feb 28 MB).

The final gLite 3.0.0 Beta, including the new CE and WMS/LB, has been frozen and will now be passed to PPS for further validation.

The distribution includes components from gLite 1.5 and LCG-2_7_0 along with numerous updates. Patches and configuration updates will be released periodically after they have been tested.

Scope and Requirements from Experiments

The currently planned initial deployment of gLite 3.0 services on the pre-production service (PPS) is shown in the spreadsheet linked here. Those boxes marked with a "?" show where the service does not currently exist at a particular site but where we expect one or more experiments will want the service to be made available. At the meeting we will ask the experiments to give their updates to this spreadsheet. Please bear in mind that once gLite 3.0 has been deployed on the PPS, it will be working more or less at the maximum capacity of the people and nodes involved. Thus any additional services required by an experiment (including those marked with "?") will require that experiment to negotiate with the relevant site for extra PPS resources.

The assumption is that testing of new releases will be done in a number of stages, including:

  • Initial testing of the software/services, including any new interfaces / functionality;
  • Feedback loop of bug-fixing, documentation, deployment of fixed versions as required.

Once a reasonable level of confidence has been reached, 'stress-testing' will be carried out at an agreed level, e.g.:

  • Defined rate of job submission;
  • Agreed number of sites;
  • Agreed data rates / volumes across specified number of sites.

It is clear that there will never be sufficient resources to fully duplicate the production environment, but both basic functionality and limited-scale tests need to be performed in conditions as close to production as is feasible.

This may well require reconfiguration of some of the services and careful scheduling of test activities.

Timeline

  • 2nd half of March: initial testing by experiments, resolution of critical bugs;
  • 1st half of April: build up in scope / scale of tests, resolution of critical bugs;
  • 2nd half of April: 'final' testing, shake-down at as large a scale as possible, final fixing and preparation of distribution kits.

It is inevitable that some problems will still slip through this procedure - which we need to continuously improve - and will require scheduled quick-fixes of the production system.

Experiment Production Plans and Issues

The monthly outline listed below is based on the SC4 Workshop in Mumbai and subsequent discussions.

We will update the schedule as required. Detailed resource allocation / coordination will be handled by Harry.Renshall@cern.ch in conjunction with the sites / experiments and Bernd.Panzer@cern.ch for the Tier0.

This will be discussed in more detail at the Management Board on March 7th, including the level of detail required by sites, etc.

March

  • ALICE: bulk production at T1/T2; data back to T0
  • ATLAS: 3-4 weeks Mar/Apr T0 tests
  • CMS: PhEDEx integration with FTS
  • LHCb: start generation of 100M B-physics + 100M min bias events (2-3 months; 125 TB on MSS at Tier-0)

April

  • ALICE: first push out of sim. data; reconstruction at T1s.
  • ATLAS: see above
  • CMS: 10TB to tape at T1s at 150MB/s
  • LHCb: see above
  • dTeam: T0-T1 at nominal rates (disk); 50-75MB/s (tape)
  • Sites: setup of FTS (or equivalent) services (see under Tier1-Tier1 transfers) & end-points; testing of all combinations of T1<->T1 and T1<->supported T2 transfers

  • Detailed timeline for T0->T1 throughput tests can be found here.

May

  • Deployment of gLite 3.0 at major sites for SC4 production
  • Sites: see above

June

  • ALICE:
  • ATLAS: Tier-0 test (Phase 1) with data distribution to Tier-1s (3 weeks)
  • CMS: 2-week re-run of SC3 goals (beginning of month)
  • LHCb: reconstruction/stripping: 2 TB/day out of CERN - 125 TB on MSS @ Tier1s

July

  • ALICE: Reconstruction at CERN and remote centres
  • ATLAS:
  • CMS: bulk simulation (2 months)
  • LHCb: see above
  • dTeam: T0-T1 at full nominal rates (to tape)

August

  • ALICE:
  • ATLAS:
  • CMS: bulk simulation continues
  • LHCb: Analysis on data from June/July until spring 07 or so

September

  • ALICE: Scheduled + unscheduled (T2s?) analysis challenges
  • ATLAS:
  • CMS:
  • LHCb: see above

October

  • Start of WLCG production service

Status of Production Services

We have agreed with the experiments how to configure their dedicated Castor-2 instances, and are deploying them. Status:

  • ALICE: in production for the past few weeks
  • CMS: ready
  • LHCb, ATLAS: to be delivered this week

We also want to deliver a setup for SC4 throughput tests this week.

Lastly, we are working on setting up an SRM endpoint for the throughput tests (v1.1), and a v2.1 SRM endpoint for interoperability tests. With the discussions about durable/permanent storage classes ongoing, these endpoints will not be the Mother Of All SRM Endpoints...

Jan van Eldik, IT-FIO

Oracle Server Deployment at Tier1s

  • What backend database services are required at which sites according to the agreed plan?
  • What are the licensing and support issues if Oracle is the backend database?
  • Which Tier1 sites do not currently have appropriate Oracle licenses and corresponding support contract?

  • More info...
Dear All,

To prepare for our discussions with Oracle on s/w and support license coverage, I'd like to collect the information you have on the Oracle database server CPUs which you use (or plan to use) for FTS, LFC and VOMS deployment. Please note that I have already got the information for the experiment databases and for Castor-2, so the numbers I'd like your feedback on concern only the basic middleware components.

Based on what I have seen, it looks to me as if 4 CPU licenses (e.g. 2 dual-CPU boxes or 1 two-node (dual-CPU) cluster) would cover the license need for these components for the next year. Please note that this does not mean that sites which don't currently deploy LFC/FTS/VOMS on Oracle will have to. I'm just trying to ensure that we are fully covered.

If you think this proposal (4 licensed CPUs) does not cover you, then please let me know, preferably before Monday, as I would like to propose this number at Jamie's SC co-ordination meeting (Monday afternoon).

Cheers, Dirk

Site   | Services           | Oracle (Y/N)
ASGC   | LFC, FTS           |
CNAF   | LFC, FTS, VOMS(?)  |
PIC    | LFC, FTS           |
IN2P3  | LFC, FTS           |
FZK    | LFC, FTS           |
RAL    | LFC, FTS           |
BNL    |                    |
FNAL   |                    |
TRIUMF | LFC, FTS           |
SARA   | LFC, FTS           |
NDGF   |                    |

Check-point on SC4 Infrastructure Preparations

Maybe just covered in the roundtable below?

Roundtable

Important issues not already addressed by the above and / or weekly status reports.

Experiments

ALICE

ATLAS

CMS

LHCb

A quick heads up for the LHCb T1 site admins. I have begun submission of transfers on the matrix of channels connecting LHCb T1 centers. I will maintain two transfer jobs on each channel until the 1.5TB present on dedicated disk at each of the sites is replicated to the others.
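
For illustration only, the sketch below shows the kind of per-channel submission loop described above, using the standard glite-transfer-submit client from the gLite FTS command-line tools. The FTS endpoint, the list of sites and the SURLs are placeholders rather than the actual LHCb configuration, and the real transfers are of course driven by LHCb's own tooling rather than a script like this.

    # Hedged sketch: keep submitting FTS jobs over the matrix of T1-T1 channels.
    # Endpoint, site names and SURLs are placeholders, not the real LHCb setup.
    import itertools
    import subprocess

    FTS_SERVICE = "https://fts.example.cern.ch:8443/path/to/FileTransfer"  # placeholder
    T1_SITES = ["CNAF", "PIC", "IN2P3", "GRIDKA", "RAL", "SARA"]           # illustrative
    JOBS_PER_CHANNEL = 2

    def submit(source_surl, dest_surl):
        """Submit one transfer job and return the FTS job identifier."""
        result = subprocess.run(
            ["glite-transfer-submit", "-s", FTS_SERVICE, source_surl, dest_surl],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()

    # One pass over the full matrix of channels between the T1 centres.
    for src, dst in itertools.permutations(T1_SITES, 2):
        for n in range(JOBS_PER_CHANNEL):
            job_id = submit(
                "srm://srm.%s.example/lhcb/sc4/file%04d" % (src.lower(), n),  # placeholder SURLs
                "srm://srm.%s.example/lhcb/sc4/file%04d" % (dst.lower(), n),
            )
            print(src, "->", dst, job_id)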

Throughput plots can be seen here.

Andrew Cameron Smith

Sites

CNAF

(Mar 02 2006)

  • FTS

The FTS server version is still 1.3; we plan to upgrade it to 1.4 in April. We currently have the following channels configured:

  • T1-T1: CNAF-PIC, CNAF-GRIDKA, CNAF-SARA, CNAF-IN2P3
  • T1-T2: CNAF-Bari, CNAF-Catania, CNAF-Legnaro, CNAF-Milan, CNAF-Pisa, CNAF-Torino

The T1-T2 channel configuration is complete; a hedged sketch of how such channels are defined is shown below.
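
As a minimal sketch only, the snippet below shows how such channels could be scripted with the FTS administration CLI. The service URL, the site name strings and the idea of driving glite-transfer-channel-add from a script are illustrative assumptions; the exact command options may differ between FTS 1.3 and 1.4, so this is not a record of the actual CNAF configuration.

    # Hedged sketch: script the creation of the CNAF FTS channels listed above.
    # Service URL and site names are placeholders; glite-transfer-channel-add
    # options/behaviour may differ between FTS versions.
    import subprocess

    FTS_SERVICE = "https://fts.cnaf.infn.it:8443/path/to/ChannelManagement"  # placeholder

    CHANNELS = [
        # (channel name, source site, destination site) -- names illustrative
        ("CNAF-PIC",    "INFN-CNAF", "PIC"),
        ("CNAF-GRIDKA", "INFN-CNAF", "FZK"),
        ("CNAF-SARA",   "INFN-CNAF", "SARA"),
        ("CNAF-IN2P3",  "INFN-CNAF", "IN2P3"),
        ("CNAF-BARI",   "INFN-CNAF", "INFN-BARI"),
        # ... remaining T1-T2 channels (Catania, Legnaro, Milan, Pisa, Torino)
    ]

    for name, src, dst in CHANNELS:
        # Create the channel; a real setup would then tune streams/limits,
        # e.g. with glite-transfer-channel-set.
        subprocess.run(
            ["glite-transfer-channel-add", "-s", FTS_SERVICE, name, src, dst],
            check=True,
        )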

  • CASTOR2

  1. Castor2 is under testing at CNAF, and stress tests from the local WNs to Castor2 will continue over March.
  2. The client part of the installation (on the WNs) is complete now.
  3. The Castor2 stager suffers from a known memory leak problem.
  4. We will soon upgrade to 2.0.3-0.
  5. A backup for DLF and the DB will be added.
  6. We plan to involve one experiment in the Castor2 testing around the end of March. Castor2 will be put in production afterwards, depending on the test results gathered during March.

  • T1-T2 throughput testing
We need to discuss internally how to proceed with the disk-disk throughput tests involving CNAF and the T2 federation of INFN. The feasibility and the schedule of such tests depend on the pre-SC4 activities scheduled in April and on the results of the Castor2 tests.

PIC

(Mar 3)

FTS server deployed - currently being tested internally.

Re-installing services for pre-production and preparing for the gLite 3.0 deployment.

Deploying a Castor2 instance (for the SRM tape service). We hope to have it ready for testing in a month or so.

Deploying a new dCache infrastructure (for the SRM disk service). We hope to have it ready for testing in a month or so.

Tier2 Workshop

Any feedback on the Tier2 workshop is much appreciated (including proposals for contributions, session chairs, and parallel sessions / tutorials for the Thursday / Friday or the 9-11 slots on Monday / Tuesday).

AOB

Next con-call: Monday 13th March, 16:00 Geneva time. Dial-in: +41227676000, access code 0164222.
