-- JamieShiers - 19 Jan 2006

Present

Experiments: ALICE (Patricia Mendez, Kilian Schwarz), ATLAS (Zhongliang Ren), CMS (Jon Bakken - also representing FNAL & US-CMS), LHCb (Nick Brook, Andrei Tsaregorodtsev)

Sites: BNL, CERN, DESY, FNAL, GridKa, IN2P3, NDGF, PIC, RAL, SARA, TRIUMF, Kyungpook National University

Absent

ASGC

Summary of Last Week's Activities

The SC3 throughput rerun started in earnest on Tuesday, for ~10 days. It uses a new CASTOR-2 cluster whose remaining issues were all fixed by Monday evening, thanks to a great effort by the CASTOR team.

Firewall problems on the new high-performance networks continued to hinder a few destination sites, but have finally been solved.

We have reached a rate of over 1 GB/s to the combined set of sites!!

We have had unattended stable running for over 24 hours at over 800 MB/s, and over 900 MB/s most of the time.

The total rate is limited to ~1 GB/s by a common switch. The combined power of the sites would be able to absorb over 1.3 GB/s. Next week we will try to tune individual sites to see the maximum rate per site, while keeping the others at a low level of background traffic.

Participating sites: ASCC, BNL, CNAF, DESY, FNAL, GRIDKA, IN2P3, NDGF, PIC, RAL, SARA, TRIUMF.
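
The per-site tuning plan above implies a simple budget: whatever the background sites consume through the common switch is not available to the site under test. A minimal sketch in Python, assuming the ~1 GB/s switch limit and the twelve participating sites from these minutes. The 20 MB/s background rate and the function name are illustrative assumptions, not agreed values:

```python
# Rough budget for a single-site tuning test behind the common switch.
# SWITCH_CAP_MB_S and SITES come from the minutes; the background rate
# passed in below is a hypothetical value chosen only for illustration.
SWITCH_CAP_MB_S = 1000  # ~1 GB/s limit of the common switch

SITES = ["ASCC", "BNL", "CNAF", "DESY", "FNAL", "GRIDKA",
         "IN2P3", "NDGF", "PIC", "RAL", "SARA", "TRIUMF"]


def headroom_for_site(site, background_mb_s,
                      cap_mb_s=SWITCH_CAP_MB_S, sites=SITES):
    """Return the MB/s left for `site` when every other site runs at
    `background_mb_s` through the same switch."""
    if site not in sites:
        raise ValueError(f"unknown site: {site}")
    others = len(sites) - 1
    return max(0, cap_mb_s - others * background_mb_s)


if __name__ == "__main__":
    # Assumed background level of 20 MB/s per site (not from the minutes).
    print(headroom_for_site("RAL", 20))
```

With these assumed numbers, about 780 MB/s of the switch capacity would remain for the site being tuned.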

Future SC3 Rerun Activities

  • Disk - disk
  • Disk - tape
  • Experiment activities

Site Status Reports, including foreseen interventions

  • ASCC
  • BNL
  • CERN
    • LCG RAC + CASTOR2 DB intervention foreseen during the Thursday slot.

  • CNAF
  • DESY
    • Following the reload of the firewall with proven software last Tuesday, DESY has experienced stable running at an average rate of O(80) MB/s. This is what DESY had expected from a shared 1 Gbps DFN/GEANT link.
    • In addition to the throughput test, the same dCache instance was used to import datasets from FNAL for CMS via PhEDEx. The average rate was >13 MB/s, at times bringing the overall rate close to 100 MB/s.
    • For the tuning session tomorrow, DESY asks to move to srmcp rather than continue using srm get/put.

  • FNAL
    • Is it planned to use staggered-in-time 3rd party SRM-to-SRM transfers via srmcp in this week's running? If so, we would like to continue with these tests. (A: one more bug-fix to test - will turn on srmcp Tuesday 24th after getting a baseline.)
    • Has there been any progress on the malloc errors that prevent running with 20 streams? (A: still investigating.)

  • GRIDKA
  • IN2P3
  • NDGF
    • Basically running OK, except for a 5% failure rate due to a timeout error that we haven't nailed down yet.

  • PIC
  • RAL
    • Managed to sustain ~100 MB/s network rate over the weekend without too much difficulty.
    • The imbalance in the use of gridftp servers, in particular gftp0444, appears to be down to other (non-SC3) transfers to that host. On restart, gftp0444 went from < 10 MB/s between 9-10 GMT to > 60 MB/s between 12-13 GMT.
    • The RAL-CERN link will be down for maintenance at some point on Tuesday. We believe it will be in the afternoon but don't have a definite start time or duration yet.
    • May participate in tape tests, but unsure at the moment.

  • SARA
    • There is a networking problem that is still under investigation: ARP requests from the CERN router are filtered by our Force10 switch for some reason. We have "solved" this by letting the host involved ping the CERN router every two minutes. Apart from that, everything ran perfectly fine up until yesterday, when the transfers were switched off for the site-by-site tests. On 20 January we exceeded the target data rate of 150 MB/s.

  • TRIUMF
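
DESY's figures above can be checked against the nominal capacity of its link. A rough sketch in Python, assuming decimal units (1 Gbps = 1000 Mbps) and ignoring protocol overhead and other traffic on the shared link. The 13 MB/s PhEDEx figure is the lower bound quoted in the report, and the function name is illustrative:

```python
# Compare DESY's reported rates with the raw capacity of a 1 Gbps link.
# Assumption: decimal units and no allowance for protocol overhead.

def gbps_to_mb_per_s(gbps):
    """Convert a link speed in Gbit/s to MByte/s (8 bits per byte)."""
    return gbps * 1000 / 8


link_mb_s = gbps_to_mb_per_s(1)      # raw capacity of the shared link
sc3_mb_s = 80                        # average SC3 rate reported by DESY
phedex_mb_s = 13                     # lower bound for the PhEDEx import
combined_mb_s = sc3_mb_s + phedex_mb_s

if __name__ == "__main__":
    print(f"link capacity : {link_mb_s:.0f} MB/s")
    print(f"SC3 share     : {sc3_mb_s / link_mb_s:.0%}")
    print(f"SC3 + PhEDEx  : {combined_mb_s} MB/s "
          f"({combined_mb_s / link_mb_s:.0%} of the link)")
```

On these assumptions the SC3 traffic alone takes roughly two thirds of the raw link capacity, and SC3 plus PhEDEx about three quarters, which is consistent with DESY's expectation for a shared link.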

Experiment Status Reports, Outlook and Issues

  • ALICE
    • Nothing to report.

  • ATLAS
    • In ATLAS we have been running the T0 operation test since last week. The test mimics the raw data flow from the Event Filter at Point 1 to CASTOR2 storage at about 320 MB/s, the data distribution from CASTOR2 to the T0 batch farm (LCF) for real-time event reconstruction, and the distribution of both raw and ESD/AOD data from the reconstruction to the various T1 centers. Currently, to avoid interference with the on-going SC3 rerun, data distribution to the T1s has been switched off. Also, the event reconstruction is a scaled-down fake reconstruction (real ATLAS software but with reduced functionality); otherwise 3000 kSI2K CPUs would be needed for the full-scale T0 operation test. The full T0 operation data flow rate has been reached or exceeded, which proves that the current CERN-wide infrastructure can already meet the ATLAS run-time requirement.

  • CMS
    • We asked for information/plots on the success rate of initial transfers, that is, how many transfers require retries. Has there been any progress on this front? (A: will be reinstated Tuesday 24th.)

  • LHCb
    • Want to restart T1-T1 transfers, which require distribution of seed data to T1 sites. Agreed that this will start after Thursday morning's intervention on the Oracle DBs.

SC3 Disk-Tape Rerun

The following sites are able to take part, dates permitting:

  • FZK (2 drives), IN2P3 (only if end-Jan / early Feb), PIC, TRIUMF (2 drives), DESY, SARA, BNL
Topic revision: r8 - 2007-02-02 - FlaviaDonno
 