-- JamieShiers - 30 Jan 2006

Present

  • Sites: (all T1s + DESY)

  • Experiments: (all LHC experiments)

Absent

(No one)

CERN Power Cut Summary

Due to an incident in the new 513 sub-station there was a power failure at 22:30 during the night. Batch and data services and the CASTOR system remain down this morning. (25/01/06)

...

A full explanation posted 27.1.2006

The computer centre power failure on Tuesday night has been traced to a manufacturing fault with a circuit breaker in one of the main 2000A switchboards in the new substation. This problem was entirely unexpected---no equivalent failure has been seen at CERN in over 20 years. TS has asked the switchboard suppliers for an explanation and will be undertaking a systematic review of the remaining 94 circuit breakers in the near future. This exercise can be done without any service interruption but will require short periods when the physics computing farms run without UPS coverage.

With many apologies for the inconvenience; we really hope that we are through our teething problems with the new substation.

...

The above incident has again highlighted some gaps in our operational procedures (how to notify sites and experiments efficiently); a proposal for tightening these will be made at the Mumbai workshop.

(Mail was sent to the info-experiments@cern.ch mailing list).

SC3 Activity Report

The SC3 disk-disk throughput rerun officially ended on Thursday, 26 January. The LCG MB has declared the rerun a success, after it demonstrated unattended, reliable running throughout the weekend at > 900 MB/s essentially at all times! Many thanks to the CASTOR-2 team, and to David Gutierrez Rueda and Edoardo Martelli, who debugged the dedicated network links to the destination sites.

This week we tried boosting the rates for individual sites, with the others switched off. Each site has its own peculiarities, so there is no common recipe for getting good performance; nearby sites also need different tunings from sites in the US, Canada and Taiwan. The highest rate we saw was 250 MB/s for 5 hours to SARA (NL) (also later seen to FNAL).
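
As a rough, unofficial illustration of the tuning involved (a sketch only, not a recipe agreed at the meeting), the Python snippet below estimates how many concurrent files are needed to reach a target aggregate rate for an assumed average per-file throughput; the 5 MB/s figure is just an example, roughly what TRIUMF reports below for 30 concurrent files at 150 MB/s.

    import math

    def concurrent_files_needed(target_mb_s, per_file_mb_s):
        """Estimate the number of concurrent transfers needed to reach a target
        aggregate rate, assuming each file sustains per_file_mb_s on average
        (ignores SRM negotiation overhead and contention on shared links)."""
        return math.ceil(target_mb_s / per_file_mb_s)

    # Example: at ~5 MB/s per file (roughly 150 MB/s over 30 files, as TRIUMF
    # reports below), a 250 MB/s target needs on the order of 50 concurrent files.
    print(concurrent_files_needed(250, 5))   # -> 50
    print(concurrent_files_needed(150, 5))   # -> 30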

Next week we will do disk-tape throughput tests for some of the sites, at the level their currently available hardware permits. In April these tests will be redone at the level desired for SC4.

The SC3 rerun has allowed us to identify some shortcomings in the FTS and in the storage systems at participating sites.

By SC4 those issues are expected to have been dealt with.

In summary, many sites are now reaching their "nominal" or even "recovery" throughput rates. We still need further work to be able to sustain these rates for the period of an LHC run, together with the addition of the tape backend, but these results are very encouraging - thanks to all!

Tape test preparations

The paths we currently have for sites that want to participate in the tape tests are listed below. The target rate is 50 MB/s to each participating site (see the volume sketch after the list).

  • BNL /pnfs/usatlas.bnl.gov/sc_tape/casey
  • DESY /pnfs/desy.de/dteam/generated/SC-3/maart
  • GRIDKA /pnfs/gridka.de/sc3/dteam
  • IN2P3 /pnfs/in2p3.fr/data/dteam/hpss/tape-sc3 (updated)
  • PIC /castor/pic.es/sc3/scratch
  • SARA /pnfs/grid.sara.nl/data/dteam/sc3tape (updated)
  • TRIUMF /atlas

Please send corrections and ensure the destination pools are configured for writing to tape. Tell us when we can start.
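
For bookkeeping only, here is a small Python sketch of the destination paths above together with the data volume implied by the 50 MB/s per-site target; the aggregate figure simply assumes all listed sites write concurrently, which is not a commitment from the meeting.

    # Destination paths for the disk-tape tests, as listed above.
    TAPE_TEST_PATHS = {
        "BNL":    "/pnfs/usatlas.bnl.gov/sc_tape/casey",
        "DESY":   "/pnfs/desy.de/dteam/generated/SC-3/maart",
        "GRIDKA": "/pnfs/gridka.de/sc3/dteam",
        "IN2P3":  "/pnfs/in2p3.fr/data/dteam/hpss/tape-sc3",
        "PIC":    "/castor/pic.es/sc3/scratch",
        "SARA":   "/pnfs/grid.sara.nl/data/dteam/sc3tape",
        "TRIUMF": "/atlas",
    }

    TARGET_MB_S = 50  # per-site target rate for the tape tests

    per_site_tb_per_day = TARGET_MB_S * 86400 / 1e6      # ~4.3 TB/day per site
    aggregate_mb_s = TARGET_MB_S * len(TAPE_TEST_PATHS)  # 350 MB/s if all run at once

    print("Per site:  ~%.1f TB/day to tape at %d MB/s" % (per_site_tb_per_day, TARGET_MB_S))
    print("Aggregate: %d MB/s across %d sites" % (aggregate_mb_s, len(TAPE_TEST_PATHS)))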

Continuing "Background" Disk-Disk Transfers

We would like to continue background disk-disk transfers as "dteam" to continue to iron out remaining operational and software problems and as part of the ramp-up to sustained operation (see LHC operation schedule in attached picture). The only way to be sure that we will be ready to run sustained transfers for LHC running periods is to practise, practise and practise. These transfers will be configured to take no more than 10% of the available bandwidth but will expand when there is no "production" load.

This was agreed by all.
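
One possible reading of the 10% rule above, as a purely illustrative sketch (the real throttling will of course be done via the FTS channel settings, not code like this):

    def background_rate_cap(link_capacity_mb_s, production_mb_s):
        """Illustrative cap for background dteam transfers: stay within 10% of
        the link while production traffic is present, but expand to use the
        idle capacity when there is no production load."""
        if production_mb_s <= 0:
            return link_capacity_mb_s                   # no production load: expand
        return min(0.10 * link_capacity_mb_s,           # otherwise keep to <= 10%
                   link_capacity_mb_s - production_mb_s)

    print(background_rate_cap(1250, 0))    # idle 10 Gb/s link  -> 1250.0 MB/s
    print(background_rate_cap(1250, 900))  # busy link          -> 125.0 MB/s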

(An LHC run is foreseen to be 1 day setup with beam, 20 days physics, 4 days machine development and 3 days technical stop, repeated 7 times, mid-April to end-October.)
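
For reference, the arithmetic behind that schedule, plus the volume a sustained rate implies over the physics days of a run (the 200 MB/s figure is only an example, not a site target from this meeting):

    # One LHC run cycle as foreseen above, in days.
    CYCLE = {"setup with beam": 1, "physics": 20,
             "machine development": 4, "technical stop": 3}
    N_CYCLES = 7

    cycle_days = sum(CYCLE.values())            # 28 days per cycle
    total_days = cycle_days * N_CYCLES          # 196 days, roughly mid-April to end-October
    physics_days = CYCLE["physics"] * N_CYCLES  # 140 days of physics

    def volume_pb(rate_mb_s, days):
        """Data volume in PB moved at a sustained rate over the given number of days."""
        return rate_mb_s * 86400 * days / 1e9

    # Example only: sustaining 200 MB/s for the physics days of a run moves ~2.4 PB.
    print(cycle_days, total_days, physics_days)
    print("%.1f PB" % volume_pb(200, physics_days))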

Experiment Round-table and Issues

  • LHCb. Would like to redo service challenge exercises:
    • T1 to T1 exercise. Ready to start this very soon - it was agreed that the target rate (< 10 MB/s) would likely not interfere with the tape tests. It should last < 2 weeks provided target rates are achieved. Most seed data is already at the T1 sites - the remaining data will be transferred from T0; again, this is at a sufficiently low rate that it should not interfere with the tape tests.
    • T0 to T1 exercise. Talk next week.

  • CMS. Workshop this week to discuss SC4 planning: would like to test new event data model.
    • Will continue background production data transfers at rates of a few MB/s.
    • LFC service required? Yes, central LFC instance will be required. Local catalogs not critical (though optional in model).
    • Would like to start making use of the FTS infrastructure.

  • ATLAS. Preparing for the Mumbai workshop. Would like a testbed of several T1s and T2s to test primary use cases. More details next week.

  • ALICE. Producing data and wants to test FTS as soon as possible.
    • Preliminary solution for FTS being integrated into the software framework.
    • Desire for "central service" - to discuss at Mumbai workshop.

Site Round-table and Issues

  • ASGC. Are continuing tests to obtain the target rate.

  • BNL. Achieved good rates of 150 MB/s. Ready for tape tests.

  • FNAL. Last week dCache was upgraded, improving the cost parameters / pool-selection algorithm. This has led to better rates. Working to understand remaining issues.

  • TRIUMF. Ran over the weekend with 30 concurrent files and got a stable 150 MB/s. srmcp is needed to improve the rates.

  • IN2P3. Interested in obtaining details of the new FNAL configuration.
    • FTS - work is needed to be able to run srm-cp. Gavin: realistically this will only happen after CHEP.
    • Tape tests started - the rate is too high for the tapes, so the incoming file rate will be reduced.

  • GridKA. Tape tests starting.

  • CNAF.
    • Hope to have 10 gig connection in production by mid-February.
    • Castor2 installation is underway.
    • FTS 1.4 - is this recommended? Gavin: wait for the memory leak / restart fix (coming soon).
    • Question: T2 to T2 - is this a use-case the sites should support?
      • This will be discussed at the workshop.

  • RAL.
    • Requested whether they can still run disk-disk transfers from CERN to RAL concurrently with the T1<->T2 transfers. In reality T1->T2 transfers will have to happen concurrently anyway, and most T2->T1 transfers enter RAL on a different network from the CERN traffic, so this will help us load RAL. No problem to do this.
    • Maarten said the load generator could be switched on and the FTS channel disabled; RAL can control this from their end.

  • NIKHEF/SARA.
    • After the power cut the CERN-SARA link was down. The exact source of the problem is unknown to us at the moment. Yesterday it was repaired by Global Crossing and it is up and running again.
    • The disk-disk test last week went fine until the power cut.
    • At the moment we have set up special pools for the disk-tape test and we are ready for it.

  • DESY
    • DESY took part in the "tuning run" last week, in which we proposed a series of tests varying the number of simultaneously active files and the number of associated streams. Also part of the plan was to use srmcp rather than SRM get/put.
    • Because of the power cut at CERN last Tuesday, and the fact that it took almost a day to recover from the loss of services, this plan could not be carried out; on Thursday morning SC3 was declared finished. Therefore no conclusive results were obtained.
    • DESY is ready to start the throughput test including transparent file migration to tape and asks for the channel to be activated as soon as possible. DESY anticipates sustaining a rate close to 100 MB/s.
    • However, major maintenance is planned starting on Tuesday morning at 07:00 and lasting until Wednesday. The channel therefore needs to be switched off at 07:00.

  • PIC. Currently reconfiguring pools. Hope transfers will start beginning of next week.

  • NDGF. Last week 150 MB/s. Working on dCache tape installation for April (tape backend still to be determined).
    • 10gig line hopefully by summer.

  • Korea KNU. Currently preparing for SC4: installing more file servers and the tape library server.

  • NOTE: Sites should feel free to change the parameters on their FTS channels - but please inform the service team.

SC4 Preparations

We will start, and review weekly, the actions required for SC4 preparation.

There is still an open action item from SC3, namely:

  • All sites to publish the name of their SE in the information system. (2 sites still do not do this...)

AOB

  • An update of the high-level POW for the MB can be found in the attached files.
Topic attachments

  • sc-mb-jan31.ppt (PowerPoint, 1226.5 K, 2006-01-31, JamieShiers) - Update of POW actions for MB
  • schedule.ps (PostScript, 125.3 K, 2006-01-30, JamieShiers) - LHC Operating Schedule
  • status_29_01_06.jpg (JPEG, 146.8 K, 2006-01-30, JamieShiers) - ALICE running jobs