-- JamieShiers - 22 Feb 2006

Preparation for SC4

Tier0 - Tier1 Throughput tests

Disk-disk Tier0-Tier1 tests at the full nominal rate are scheduled for April.

The proposed schedule is as follows:

  • April 3rd (Monday) - April 13th (Thursday before Easter) - sustain an average daily rate to each Tier1 at or above the full nominal rate. (This is the week of the GDB + HEPiX + LHC OPN meeting in Rome...)
  • Any loss of average rate >= 10% needs to be:
  1. accounted for (e.g. explanation / resolution in the operations log)
  2. compensated for by a corresponding increase in rate in the following days
  • We should continue to run at the same rates unattended over Easter weekend (14 - 16 April).
  • From Tuesday April 18th - Monday April 24th we should perform the tape tests at the rates in the table below.
  • From after the con-call on Monday April 24th until the end of the month experiment-driven transfers can be scheduled.

  • To ensure a timely start on April 3rd, preparation with the sites needs to start now (March). dTeam transfers (low priority) will therefore commence to the same end-points as in the SC3 disk-disk re-run as soon as sites confirm their readiness.
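
The rule on lost rate above can be illustrated with a short sketch. The minutes only say that a loss of 10% or more must be accounted for and compensated in the following days; the function name, the sample figures and the choice to spread the shortfall evenly over the remaining days are assumptions made here for illustration.

```python
def compensation_plan(nominal_mb_s, daily_avg_mb_s, days_remaining, threshold=0.10):
    """Return (flagged_days, extra_mb_s) for one site.

    flagged_days: indices of days whose average rate fell short of the
                  nominal rate by `threshold` (10%) or more.
    extra_mb_s:   extra rate needed on each remaining day to recover the
                  shortfall (assumption: spread evenly over those days).
    """
    if days_remaining <= 0:
        raise ValueError("no days left in which to compensate")
    flagged = [
        day for day, rate in enumerate(daily_avg_mb_s)
        if (nominal_mb_s - rate) / nominal_mb_s >= threshold
    ]
    shortfall = sum(nominal_mb_s - daily_avg_mb_s[day] for day in flagged)
    return flagged, shortfall / days_remaining


# Hypothetical example: a 200 MB/s site that averaged 150 MB/s on day 1.
flagged, extra = compensation_plan(200, [200, 150, 195], days_remaining=5)
print(flagged, extra)
```

Each flagged day would still need an explanation or resolution entry in the operations log; the sketch covers only the arithmetic.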

Site     Disk-Disk (MB/s)  Disk-Tape (MB/s)
ASGC     100               75
TRIUMF   50                50
BNL      200               75
FNAL     200               75
NDGF     50                50
PIC      60*               60
RAL      150               75
SARA     150               75
IN2P3    200               75
FZK      200               75
CNAF     200               75

  • (*) The nominal rate for PIC is 100 MB/s, but it will be limited to 60 MB/s by the WAN until ~November 2006.
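
As a rough cross-check of what the table implies for the Tier0, the per-site rates can be summed. This is a sketch only: it assumes the table values are in MB/s (as the PIC note suggests) and uses decimal units (1 TB = 10^6 MB); the names are chosen here for illustration.

```python
# (disk-disk, disk-tape) target rates per Tier1, copied from the table above.
# Assumption: all values are in MB/s.
TARGET_RATES = {
    "ASGC": (100, 75), "TRIUMF": (50, 50), "BNL": (200, 75),
    "FNAL": (200, 75), "NDGF": (50, 50), "PIC": (60, 60),
    "RAL": (150, 75), "SARA": (150, 75), "IN2P3": (200, 75),
    "FZK": (200, 75), "CNAF": (200, 75),
}


def aggregate(rates):
    """Sum the disk-disk and disk-tape rates over all sites."""
    disk = sum(d for d, _ in rates.values())
    tape = sum(t for _, t in rates.values())
    return disk, tape


def daily_volume_tb(rate_mb_s):
    """Data volume moved in 24 hours at a sustained rate, in decimal TB."""
    return rate_mb_s * 86400 / 1e6


disk_total, tape_total = aggregate(TARGET_RATES)
print(disk_total, tape_total)        # 1560 760
print(daily_volume_tb(disk_total))   # about 134.8 TB per day
```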

The above plan was agreed.

Tier1 - Tier1 and Tier1 - Tier2 Transfer Tests

Prior to experiment-driven transfers of this nature in the SC4 production phase (June 1st on), we need to verify that the basic infrastructure is set up and ready.

This requires:

  • All Tier1s to set up an FTS service (or state alternative plans, which need to be agreed with the other Tier1 sites to/from which transfers will be driven, including the possibility for sites to control transfers to/from their site).
  • All Tier1s need to define channels pulling FROM all other Tier1s. i.e. the FTS server at a given site is responsible for running transfers for which it is the destination. Site admins on both ends will be authorised to control the channel.
  • All Tier1s need to provide the list of Tier2s to/from which transfers will be driven and define the corresponding channels on their FTS.
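
The pull model described above (each FTS server runs the transfers for which its site is the destination) can be sketched as follows. The Tier1 list is taken from the rate table earlier in these minutes; the SOURCE-DESTINATION channel naming and the function name are assumptions for illustration, not FTS configuration syntax.

```python
# Tier1 sites as listed in the throughput table.
TIER1S = ["ASGC", "TRIUMF", "BNL", "FNAL", "NDGF", "PIC",
          "RAL", "SARA", "IN2P3", "FZK", "CNAF"]


def pull_channels(tier1s):
    """For each destination site, list the channels its FTS server would own.

    Assumption: a channel is named SOURCE-DESTINATION and is defined on the
    FTS server of the destination site.
    """
    return {
        dest: [f"{src}-{dest}" for src in tier1s if src != dest]
        for dest in tier1s
    }


channels = pull_channels(TIER1S)
print(len(channels["PIC"]))                       # 10 channels pulling into PIC
print(sum(len(c) for c in channels.values()))     # 110 channels in total
```

With 11 Tier1s, each site defines 10 incoming channels, so the full Tier1-Tier1 matrix has 110 entries.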

This setup should preferably be completed in March (April and May are going to be very very busy - we'll come to that later...)

The service team will collect a list of manager DNs for each site and add it to the wiki.

As targets, the following is proposed:

  • Successful transfers need to be demonstrated for the full matrix (just a tick or a cross in a table).
  • Each Tier1 needs to demonstrate that it can send and receive 10MB/s to any other Tier1 (to be defined, so it's not always the same one...)
  • Each Tier1 needs to demonstrate 5MB/s to/from supported Tier2s.
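
The tick-or-cross matrix mentioned in the targets could be produced along these lines. This is a minimal sketch: the measured rates are hypothetical sample values, "Y"/"N" stand in for tick/cross, and the function names are invented here.

```python
from itertools import permutations


def check_matrix(sites, measured_mb_s, target_mb_s=10.0):
    """Map each ordered (source, destination) pair to True (tick) or False (cross).

    Pairs with no measurement count as a cross.
    """
    return {
        (src, dst): measured_mb_s.get((src, dst), 0.0) >= target_mb_s
        for src, dst in permutations(sites, 2)
    }


def render(sites, results):
    """Plain-text table: rows are sources, columns are destinations."""
    width = max(len(s) for s in sites) + 1
    lines = ["".ljust(width) + "".join(s.ljust(width) for s in sites)]
    for src in sites:
        cells = []
        for dst in sites:
            mark = "-" if src == dst else ("Y" if results[(src, dst)] else "N")
            cells.append(mark.ljust(width))
        lines.append(src.ljust(width) + "".join(cells))
    return "\n".join(line.rstrip() for line in lines)


# Hypothetical measurements in MB/s, keyed by (source, destination).
sample = {("CNAF", "PIC"): 12.0, ("PIC", "CNAF"): 10.0, ("RAL", "CNAF"): 4.0}
sites = ["CNAF", "PIC", "RAL"]
print(render(sites, check_matrix(sites, sample)))
```

The same check with `target_mb_s=5.0` would apply to the Tier1-Tier2 target.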

The above plan was agreed. Service team or site phonecon will be available to answer any questions or clarify the plan.

Registration of sites in GOCDB

All sites in the EGEE infrastructure must be registered in the GOC database (https://goc.grid-support.ac.uk/gridsite/gocdb2/index.php)

The Site Registration Procedure is explained in http://edms.cern.ch/document/503198

Click here to view a file listing those Tier-1/SC sites in the GOC database along with the information that is missing for each site. Of course, if a site has not been registered in the GOC database then it won't appear here. Also, some of the site names in the GOC database are rather cryptic so I may have missed them.

It was agreed that the "service challenge contact" should not be used - we will move all SC4 operational issues to the standard GOC DB site contact.

All sites agreed to update the GOC DB as above.

{Site, User} Support Issues

Flavia outlined the benefits of using the GGUS support portal, e.g. the ease of tracking a problem that is shared across multiple middleware components, or where the source of the problem is not initially obvious.

It was agreed by all experiments that issues will be submitted via the GGUS portal. Direct mails to individuals should not be used.

Preparation for SC4 Production Phase

All participating sites need to foresee upgrading services to gLite 3.0 during May, to be fully ready for a production start June 1st.

Operations procedure

Maite outlined the operations procedure.

  1. GGUS should be used to report problems.
  2. The EGEE broadcast tool should be used to announce site problems or maintenance - the mails are sent to the GOC DB site contacts. As a temporary measure they will also be sent to the "service challenge contacts" until these are phased out fully.

Experiment Testing of gLite 3.0 on the Pre-Production System

gLite 3.0 will be available on the pre-production system, which includes the following sites (see table below), as from mid-March.

It is essential that it is well tested under production conditions by the experiments, prior to final packaging for distribution to sites end-April.

CERN      PPS-CERN
France    IN2P3-CC-PPS
Germany   FZK-PP
Greece    EGEE-SEE-CERT
Greece    PreGR-02-UPATRAS
Italy     PPS-PADOVA
Italy     PPS-CNAF
Portugal  PPS-LIP
Poland    PPS-CYFRONET
Spain     PPS-PIC
Spain     PPS-IFIC
Spain     CESGA-PPS
Taiwan    Taiwan-PPS
UK        UKI-ScotGrid-Gla-PPS
UK        IC-HEP-PPS (Candidate)
UK        Birmingham Pre-Production Site (Candidate)

How this fits with the production efforts going on at some of these sites needs to be discussed further.

Preparation for Experiment Production

The current plans can be found in "sc4-expt-plans.ppt" attached to the talks and documents page. Some small changes in schedule are foreseen (e.g. to match the SC4 / gLite schedules, to accommodate the requirements of the different experiments), with a final plan ready for endorsement at the March GDB.

Once these plans are 'finalised', we will have to schedule the actual work...

Scheduled Events

  • PIC will be in scheduled downtime from 10th to 12th April, because the yearly electrical maintenance work in our building takes place on those days. (Gonzalo Merino)

  • T2 workshop at CERN June 12 - 14 followed by tutorials. This will be in the Council Chamber and will start late (11:00) on both Monday and Tuesday (due to room availability, but will allow time for travel and/or side meetings, e.g. between sites and their supported experiments).

Roundtable

Experiments

ALICE

A local LFC is needed at every site. At sites supporting ALICE, it should be installed on a machine that can support the agreed load.

Testing FTS service with CNAF. Want to test with Lyon, FZK, GridKA and RAL.

Q. What infrastructure remains from SC3 rerun?

A. Transitory phase for SRM endpoints at CERN at the moment. The SRM endpoints used in (pre-rerun) SC3 are still maintained. FIO are setting up the SC4 setup now. Don't use the current SRM endpoints for really heavy throughput testing; they should be able to run 150 MB/s fairly well.

ALICE are running these transfer tests now. T1s should be ready to receive this traffic now (150 MB/s total spread over the four T1 sites).

ATLAS

Ongoing production. For SC4 - preparing tests for March.

Q. What will be available for T0 to T1?

A. Same answer: the SC4 CASTOR instance for ATLAS will be available. Will finalise the mini-testbed setup soon for testing in March.

Want to test T1 to T2 sites. This means the T1 sites need to set up FTS tests to form a mini testbed - this will be done on the pre-production testbed. It should be discussed in more detail next Monday.

LHCb

Production is currently suspended due to problems with VOMS and the mapping of LHCb users to special local accounts. A full detailed description of the problem will be submitted to GGUS. T1-T1 part of SC3 ongoing - the last channel, between CNAF & PIC, has been established. Large data transfers in T1-T1 should now start.

Preparation for SC4: LHCb production due to start in earnest 1st April but will ramp up production during March.

Willing to look at the pre-production system but seek clarification on how this will interact with the current production system.

CMS

[ awaiting update via email ]

Sites

KNU

  1. We have installed DPM-1.3.8, which includes an SRM interface. We are ready to exchange data with CERN and to join the data transfer test.
  2. The SRM storage currently available is 1.8 TByte. We are going to install a further 6 TByte of storage devices for the SRM. In addition to these, 14 TByte of disk will be added to the SRM storage device.
  3. We will transfer data over the 10 Gbps GLORIAD link (Korea-USA-Europe-CERN).
  4. For the file transfer test, it would be much better if someone at CERN could help us transfer the data and test our storage elements. Mr. Daehee Han (hanbi@knu.ac.kr) at KNU will be responsible for the data transfer test.

TRIUMF

About to upgrade to latest dCache version. Ordering two tape drives for SC4.

NDGF

Setting up a distributed dCache system (central DB and pools at all NDGF sites) - 2 weeks' work. Tape backend based in Copenhagen and a couple of sites in Sweden.

PIC

FTS server deployed - currently being tested internally.

Re-installing services for preproduction, preparing the glite-3.0 deployment.

Deploying a Castor2 (for SRM-tape service). Hope to have it ready for testing in a month or so.

Deploying a new dCache infrastructure (for SRM-disk service). Hope to have it ready for testing in a month or so.

ASGC

Currently using a Castor 1 server - having problems with the dCache server.

Various service problems caused by network problems.

SARA

Configured a disk-only area and a tape area on dCache. Added these to the BDII. Upgraded FTS to version 1.4.

RAL

In the process of doing aggregate transfer tests to T2 sites; working to understand the network problems seen in these. Waiting for the OPN link to come back for the link to CERN (SURFnet). Working on Castor2 deployment - this is going well - will start interoperability testing soon.

SC4 for April will likely still use dCache to tape, but would like to test with Castor 2 as well if possible.

New STK robot being installed.

BNL

Upgrading FTS to 1.4. New tapes being ordered for SC4.

FNAL

CMS ready to start the cosmic challenge (relatively small data rate, small file sizes). Want to transfer that at high priority from CERN to FNAL. Upgrading to LCG 2.7.0 next week. A robot facility for experiment data is in the process of being purchased.

CNAF

(Mar 02 2006)

* FTS

The FTS server version is still 1.3; we plan to upgrade it to 1.4 in April. We currently have the following list of channels configured:

  • T1-T1: CNAF-PIC, CNAF-GRIDKA, CNAF-SARA, CNAF-IN2P3
  • T1-T2: CNAF-Bari, CNAF-Catania, CNAF-Legnaro, CNAF-Milan, CNAF-Pisa, CNAF-Torino

The T1-T2 channels configuration is complete.

* CASTOR2

  1. Castor2 is under testing at CNAF, and stress tests from the local WNs to Castor2 will continue over March.
  2. The client part of the installation (on the WNs) is complete now.
  3. The Castor2 stager suffers from a known memory leak problem.
  4. We will soon upgrade to 2.0.3-0.
  5. A backup for DLF and the DB will be added.
  6. We plan to involve one experiment in the Castor2 testing around the end of March. Castor2 will be put in production afterwards, depending on the test results gathered during March.

* T1-T2 throughput testing

We need to discuss internally how to proceed with the disk-disk throughput tests involving CNAF and the T2 federation of INFN. The feasibility and the schedule of such tests depend on the pre-SC4 activities scheduled in April and on the results of the Castor2 tests.

GRIDKA

Announcement for 8th March: we will work on our network configuration on 8th March and will therefore not be reachable on that day.

Currently we are working on a dCache upgrade (adding Pools, working on our tape connection) as well as preparing the update of our FTS server to 1.4.

Question on storage types/classes: how do we do this? With different endpoints? The agreement isn't clear yet. This needs to be followed up.

CCIN2P3

Scheduled downtime on 22nd March: all services will be stopped. SRM/dCache will probably be restarted 23rd March in the morning. dCache will be upgraded to 1.6.6.5 and the pnfs DB will be migrated to PostgreSQL.

Both SC3 disk-disk and disk-tape endpoints are still available for background transfers.

DESY

Following the re-run of the SC3 throughput test, DESY is currently focusing on providing services for the experiments. In particular, DESY is participating in time-critical Grid/LCG-based event simulation for CMS in preparation for the Physics TDR. With that completed over the weekend, the majority of the computing and storage resources that are part of the LCG environment are now devoted to running digitization with pile-up of the simulated events. This is a very I/O-demanding task that taxes the dCache pool nodes and the network between the dCache servers and the worker nodes. Together with the ongoing CMS analysis based on the CRAB tool, a network bandwidth of more than 400 MB/s is achieved on a regular basis.

GSI

  • We have an SRM operable with dCache backend, currently only disk based.
  • We upgraded our network bandwidth to the outside world to 100 Mb/s.
  • We are currently upgrading to gLite 1.5 and LCG 2.7.

Scotgrid

Upgraded to 2.7.0 on production.

Trying to work out what the best role for the pre-production service is for ScotGrid - if the FTS needs to be run at ~full production rates on the pre-production system, this is not something that we can support.

AOB

Noted issue of versioning of pre-production FTS. The production SC4 setup will run the current production version of FTS (1.4). The pre-production will be testing out 1.5. What do sites that are participating in both production and pre-production do?

Topic revision: r14 - 2007-02-02 - FlaviaDonno