-- GavinMcCance- 16 Jan 2006

  • Present: Jan, Olof, Maarten, Harry, James, Gavin, Jamie (via phone)
  • Expts called in: LHCb, CMS
  • Absent: Alice, Atlas

Outlook for SC3 Disk-Disk re-run

Goal: a stable rate of at least 800 MB/s with the current software, sustained for 3-4 days.

If we can do that, then try SRM copy to see how that affects the rate.

Main work for next 24 hours is optimising the rates to obtain stable running.

  • For deleting files (to avoid filling up space), the recommendation is to do this locally at each site (see the sketch after this list).

  • Would be nice to see all the parameters for the sites.
--> nice web page. --> also document in the wiki to keep track of all the changes.
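As a rough illustration of the local-deletion recommendation above, here is a minimal sketch; the pool directory path, file pattern and age threshold are illustrative assumptions, not agreed values.

    import os
    import time

    # Hypothetical pool directory holding transferred test files (assumption).
    POOL_DIR = "/pool/sc3/test-files"
    MAX_AGE_SECONDS = 6 * 3600  # assumed threshold: delete files older than 6 hours

    def clean_pool(pool_dir, max_age):
        """Delete old test files locally on the disk server,
        rather than issuing remote deletions through SRM."""
        now = time.time()
        for name in os.listdir(pool_dir):
            path = os.path.join(pool_dir, name)
            # Only remove regular files older than the threshold.
            if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
                os.remove(path)

    if __name__ == "__main__":
        clean_pool(POOL_DIR, MAX_AGE_SECONDS)

Run periodically (e.g. from cron, as one site reports doing below), this keeps the pools from filling up during the re-run.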

Reports from Sites


CERN

Since this afternoon the rate has been 50% of what it should be, scaled equally across all sites. CERN is investigating.

Threading problems on the new DLF server, only seen at production loads. The Castor team downgraded the DLF server version; a fix is available and will be deployed soon.

Noted that only a small subset of the test files is being used (370 of 8000), causing poor load distribution. --> problem in the test-load generator.
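A minimal sketch of the presumed fix in the test-load generator, assuming it draws source files from a catalogue list; the catalogue paths and batch size here are illustrative assumptions, not the actual generator code.

    import random

    # Hypothetical catalogue of the full test-file set (8000 entries).
    CATALOGUE = ["/castor/cern.ch/sc3/testfile-%04d" % i for i in range(8000)]

    def pick_source_files(n):
        """Draw n distinct source files uniformly across the whole
        catalogue, instead of repeatedly reusing a small subset."""
        return random.sample(CATALOGUE, n)

    # Example: choose 50 files for one batch of concurrent transfers.
    batch = pick_source_files(50)

Sampling uniformly over all 8000 files should spread the read load evenly across the disk servers holding them.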


Running well.


BNL

Doing 50 concurrent files with 10 streams each - the optimal values for these still need to be found. BNL have two people on call for monitoring.
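As a back-of-the-envelope check on why too many files/streams can exhaust memory (see the Castor1 report below), here is a minimal sketch; the 2 MB per-stream buffer comes from the FNAL recommendation further down, and treating one TCP buffer per stream as the dominant cost is an assumption.

    # Rough memory footprint of a transfer configuration: one socket
    # buffer per TCP stream (simplifying assumption).
    def buffer_memory_mb(concurrent_files, streams_per_file, buffer_mb=2.0):
        return concurrent_files * streams_per_file * buffer_mb

    # 50 files x 10 streams = 500 streams -> ~1000 MB of socket buffers,
    # enough to trouble a 2006-era disk server.
    print(buffer_memory_mb(50, 10))  # 1000.0
    # At 2 streams per file (the previous tuning) -> ~200 MB.
    print(buffer_memory_mb(50, 2))   # 200.0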


CNAF

Out-of-memory problems on Castor1 - probably too many files / streams. We will reduce this number and start to tune it; 2 streams should be OK (previous tuning).

Currently running with asymmetric routing: CNAF->CERN uses the new 10-gig link, CERN->CNAF the previous 2 x 1-gig network. If the tuning doesn't work, maybe we should change the routing to fully symmetric 2 x 1 gig.


Working well until hit by a firewall issue. A fix is underway - hopefully the engineers will have it done tonight.

Michael: Changing to SRM copy -> we should get at least 50% more.


FNAL

Running at 80 MB/s. Increasing the number of transfers should increase the rate linearly. There is an unbalanced queuing issue on the pools - this is being looked at by experts. Recommend 20 streams per transfer with a 2 MB TCP buffer.
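The stream/buffer recommendation can be sanity-checked against the TCP bandwidth-delay product; this is a minimal sketch, and the ~120 ms CERN-FNAL round-trip time is an assumed figure, not one quoted in the meeting.

    # Upper bound on TCP throughput: window size / round-trip time.
    def max_rate_mb_per_s(buffer_mb, rtt_s, streams=1):
        return streams * buffer_mb / rtt_s

    # With an assumed ~120 ms transatlantic RTT, a 2 MB buffer caps each
    # stream at ~16.7 MB/s; 20 streams give ~333 MB/s of headroom per transfer.
    print(max_rate_mb_per_s(2.0, 0.120))      # ~16.7
    print(max_rate_mb_per_s(2.0, 0.120, 20))  # ~333.3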

50-vs-15 concurrent transfers issue: FTS is asked to do 50 concurrent transfers, but FNAL only sees 15 - CERN will check. If we switch to SRM copy, FNAL can monitor the transfers.


Pool nodes got totally filled up; files were deleted and the pool nodes had to be restarted. The fix is in the 1.6.6-4 dCache release (currently running 1.6.6-1) - it will be deployed after the re-run to maintain stability of the system. A cron job is now running to clean up files.


80 MB/s - good given the 1-gig link.


NDGF

Firewall problems are being looked at; CERN and NDGF network experts are investigating.


Running well.


Reconfiguring: increased the number of disk servers from 4 to 9. gridFTP memory problems - too many streams? Best effort at the weekend, with 2 people keeping an eye on it.


Reconfiguring. Almost done. Added 4 more pool nodes. Ready to start with more transfers. There was a DNS problem, now resolved (propagation delay).


Working well at 80 MB/s. Can increase the number of files. Suggest increasing the buffer size and reducing the number of streams.

Reports from Experiments


CMS: no issues.

