-- JamieShiers - 22 Mar 2006

Achieved Rates - Daily Averages for Disk-Disk Transfers

  • Rates achieved as per Gridview - see SC4blog for more information on events, interventions, changes etc.

  • Note that the official goal is to meet or exceed the target rate every day (or make up for it shortly after - more information below). In addition, the startup (ramp-up to full nominal rates) needs to be rapid - essentially a step function at the beginning of each LHC running period.

  • BOLD - sites meeting or exceeding their nominal target for that day

  • ITALICS - sites within 15% of their nominal target for that day

Site-specific Issues and Plans

CNAF

  • As far as I understand, the CASTOR2 LSF plugin and rmmaster version you are running is leaking jobs. We had the same problem at CERN during the SC3 re-run in January. The symptom is that LSF dispatches some fraction of the jobs without the message box 4, which is posted by the plugin. It is briefly described on slide 4 in CASTOR2 training session 5: http://indico.cern.ch/getFile.py/access?resId=33&materialId=0&confId=a058153. Unfortunately it only describes a workaround for 'get' requests, while you are seeing the problem for 'put'. The procedure should be similar, but you have to decide the filesystem yourself (for 'get' it is much simpler since the list of target filesystems is already posted in msg box 1). Alternatively you may just post an empty or non-existing filesystem, which will cause the job to fail. Simply killing the job with bkill would also work, but would result in an accumulation of rows with STATUS=6 in the SUBREQUEST table. The problem has been fixed in 2.0.3 (a workaround for a time window in LSF), so the best would be to upgrade. Olof
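
  • A minimal sketch of the clean-up check implied above, assuming an Oracle stager database reachable via cx_Oracle and that the status column of the SUBREQUEST table is literally named STATUS; the connect string is purely illustrative:

```python
# Minimal sketch, not CASTOR2 tooling: count the SUBREQUEST rows left in
# STATUS=6 after killing leaked LSF jobs with bkill. The connect string and
# the assumption that the status column is named STATUS are illustrative
# only - adapt to the local stager schema.
import cx_Oracle

def count_failed_subrequests(connect_string):
    """Return the number of SUBREQUEST rows with STATUS=6."""
    conn = cx_Oracle.connect(connect_string)   # e.g. "stager/secret@stagerdb" (hypothetical)
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM SUBREQUEST WHERE status = 6")
        (count,) = cur.fetchone()
        return count
    finally:
        conn.close()

if __name__ == "__main__":
    print("SUBREQUEST rows with STATUS=6:", count_failed_subrequests("stager/secret@stagerdb"))
```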

  • We are upgrading CASTOR2 to the latest release and we should be ready to repeat the disk-disk transfer tests during the first half of May. Luca

FZK

  • The link to FZK should be moved to the dedicated circuit in the early-June timeframe, I think. For the moment the GEANT IP service has a limitation of single streams <= 1Gb/sec due to packet re-ordering in the M160 routers. I am not sure whether, in reality, this would be a problem for the multi-stream usage in SC4; however, traffic via GEANT-IP will be using a shared network, so performance may vary. David

  • We are in the process of analyzing the limiting factors. Some key (network, dCache) people are not present this week. A report and a throughput-improvement plan will follow next week (Wednesday). First things that come to mind: there is a difference in the setup compared to the one used in SC3, most notably the 2.6 kernel on the gridftp receivers. Another cause could be the cost computation dCache performs: new nodes received an inordinate amount of traffic and left the remaining nodes idling, which could be tuned better. I do not think the single-stream limitation is the real problem because, as far as I know, the link runs via the same path and hardware as in SC3 and we did not see the problem then. Correct me if I'm wrong. Jos
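
  • A rough back-of-the-envelope check of the single-stream limit mentioned above; the number of parallel gridftp transfers is an assumption, not a measurement:

```python
# Back-of-the-envelope check (assumed stream count, not a measurement): does a
# <= 1 Gb/s per-stream limit on the shared GEANT IP path cap the 200 MB/s
# FZK disk-disk target when FTS drives several gridftp transfers in parallel?
import math

STREAM_CAP_MB_S = 1000 / 8.0      # 1 Gb/s per single TCP stream ~ 125 MB/s
TARGET_MB_S = 200.0               # FZK nominal disk-disk rate
CONCURRENT_STREAMS = 10           # assumed number of parallel transfers

aggregate_cap = CONCURRENT_STREAMS * STREAM_CAP_MB_S
streams_needed = math.ceil(TARGET_MB_S / STREAM_CAP_MB_S)

print(f"per-stream cap            : {STREAM_CAP_MB_S:.0f} MB/s")
print(f"aggregate cap ({CONCURRENT_STREAMS} streams) : {aggregate_cap:.0f} MB/s")
print(f"streams needed for target : {streams_needed}")
```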

  • Due to the Easter holidays and the ongoing work on our tape connection at GridKa, we won't be able to start the tape tests before the middle of next week. Doris

  • ... so we will continue to monitor disk - disk transfers ... Jamie

| FZK | Apr20 | Apr21 | Apr22 | Apr23 | Apr24 |
| Target: 200 | 142 | 113 | 139 | 140 | 128 |

IN2P3

  • No theories yet - configuration at IN2P3 should be able to handle up to 300MB/s.

  • No limiting factors have been found at IN2P3-CC during the SC4 disk-disk transfers, except a memory issue on the SRM node which led to a drop in the rate every night at 21:40 GMT during the DB backup (20 min). Memory will be upgraded and IN2P3 will study the possibility of distributing the dCache core services over 2 different hosts. Another possible issue is concurrent access to the SRM by ATLAS; this has to be investigated. It seems that in most cases FTS was not able to send enough data with the initial number of files in the channel (15) once all sites were competing: with 15 files at the beginning of the challenge, IN2P3 could sustain > 200MB/s, whereas one week later the rate was half of that with 30 files. When increasing to 45, the rate immediately doubled but then started to decrease again. Moreover, we noticed that the number of concurrent transfers was often lower than the number of files set in the channel. Lionel
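
  • Purely illustrative arithmetic based on the figures quoted above (the exact MB/s values are assumed for the example): it converts the channel rate and the number of concurrent files into per-file throughput, which is what actually degraded over the period:

```python
# Purely illustrative: converts the quoted channel rates and concurrent file
# counts into per-file throughput. The MB/s values are approximations assumed
# for the example, not measured figures.
samples = [
    ("start of challenge, 15 files", 15, 210.0),  # ">200 MB/s" with 15 files
    ("one week later, 30 files",     30, 105.0),  # roughly half that rate
    ("after increase to 45 files",   45, 210.0),  # rate immediately doubled
]

for label, nfiles, rate_mb_s in samples:
    per_file = rate_mb_s / nfiles
    print(f"{label:30s}: {rate_mb_s:5.0f} MB/s total -> {per_file:4.1f} MB/s per concurrent file")
```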

ASGC

  • The target here is to reach stable transfers at 50MB/s. Having sustained this over several days, we would try to ramp up to the full nominal rate by the end of May. Jamie

| ASGC | Apr20 | Apr21 | Apr22 | Apr23 | Apr24 |
| Target: 50 | 51 | 39 | 46 | 50 | 51 |

Week two (April 10 on)

  • The week-two average of the summed Tier1 rates is 1262 MB/s - 79% of the target.

| Site | Disk-Disk | Week1 Average | Week2 Average | Apr10 | Apr11 | Apr12 | Apr13 | Apr14 | Apr15 | Apr16 | Apr17 |
| TRIUMF | 50 | 54 | 63 | 62 | 69 | 63 | 63 | 60 | 60 | 62 | 63 |
| BNL | 200 | 191 | 199 | 220 | 199 | 204 | 168 | 122 | 139 | 284 | 257 |
| FNAL | 200 | 101 | 231 | 168 | 289 | 224 | 159 | 218 | 269 | 258 | 261 |
| PIC | 60 | 49 | 78 (5 days) | 49 | - | 24 | 72 | 76 | 75 | 84 | 82 |
| RAL | 150 | 118 | 136 | 137 | 124 | 106 | 142 | 139 | 131 | 151 | 160 |
| SARA | 150 | 120 | 178 | 173 | 158 | 135 | 190 | 170 | 175 | 206 | 213 |
| IN2P3 | 200 | 165 | 157 | 86 | 133 | 157 | 183 | 193 | 167 | 166 | 167 |
| FZK | 200 | 104 | 142 | 97 | 174 | 141 | 159 | 152 | 144 | 139 | 130 |
| CNAF | 200 | 80 | 88 | 82 | 121 | 96 | 123 | 77 | 44 | 132 | 32 |
| ASGC | 100 | | 24 | 22 | 33 | 25 | 26 | 21 | 19 | 22 | 24 |
| NDGF | 50 | | 28 (5 days) | - | - | - | 14 | 38 | 32 | 35 | 20 |
| DESY | 60 | 70 | 74 | 71 | 77 | 69 | 72 | 76 | 73 | 76 | 76 |
| TOTAL (T1s) | 1600 | | | 1096 | 1300 | 1175 | 1046 | 1266 | 1255 | 1539 | 1409 |
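
  • A small sketch of how the week-two averages and the daily totals in the table are derived; only three rows are copied here, with missing '-' days skipped:

```python
# Minimal sketch of how the week-two averages in the table above are derived;
# only three rows are copied here, with None standing in for missing ('-') days.
daily = {
    "TRIUMF": [62, 69, 63, 63, 60, 60, 62, 63],          # Apr10 - Apr17
    "BNL":    [220, 199, 204, 168, 122, 139, 284, 257],
    "NDGF":   [None, None, None, 14, 38, 32, 35, 20],    # only 5 days with data
}

def week_average(rates):
    """Average over the days with data; missing days are skipped."""
    valid = [r for r in rates if r is not None]
    return sum(valid) / len(valid)

for site, rates in daily.items():
    print(f"{site:6s}: week-two average ~ {week_average(rates):.0f} MB/s")

# The TOTAL (T1s) row sums the Tier1 rows for each day; here only a partial
# sum over the three copied sites is shown for Apr10.
apr10_partial = sum(r[0] for r in daily.values() if r[0] is not None)
print("Apr10 partial total (these three sites):", apr10_partial, "MB/s")
```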

Week one (April 3 on)

| Site | Disk-Disk | Apr3 | Apr4 | Apr5 | Apr6 | Apr7 | Apr8 | Apr9 | Weekly average | Average from startup | Target |
| TRIUMF | 50 | 44 | 42 | 55 | 62 | 56 | 55 | 61 | 54 | 54 (>100%) | 50 |
| BNL | 200 | 170 | 103 | 173 | 218 | 227 | 205 | 239 | 191 | 191 (>95%) | 200 |
| FNAL | 200 | - | - | 38 | 80 | 145 | 247 | 198 | 101 | 141 (>70%) | 200 |
| PIC | 60 | - | 18 | 41 | 22 | 58 | 75 | 80 | 49 | 42 (70%) | 60 |
| RAL | 150 | 129 | 86 | 117 | 128 | 137 | 109 | 117 | 118 | 118 (~80%) | 150 |
| SARA | 150 | 30 | 78 | 106 | 140 | 176 | 130 | 179 | 120 | 120 (80%) | 150 |
| IN2P3 | 200 | 200 | 114 | 148 | 179 | 193 | 137 | 182 | 165 | 165 (>80%) | 200 |
| FZK | 200 | 81 | 80 | 118 | 142 | 140 | 127 | 38 | 104 | 104 | 200 |
| CNAF | 200 | 55 | 71 | 92 | 95 | 83 | 80 | 81 | 80 | 80 | 200 |
| ASGC | 100 | - | 7 | 23 | 23 | - | - | 12 | | | 100 |
| NDGF | 50 | - | - | - | - | - | 14 | - | | | 50 |
| DESY | 60 | - | 68 | 63 | 75 | 74 | 68 | 74 | | 70 | 60 |
| TOTAL (T1s) | 1600 | 709 | 599 | 911 | 1089 | 1215 | 1179 | 1187 | | 984 (61.5% of target) | |

  • Week one summary: there is an improvement with time in the above: the total average daily rate out of CERN increases but then plateaus at around 1200 MB/s, while the number of participating sites builds up, as does their stability. This is a clear indication of one or more limiting factors at CERN that need to be understood before trying to push the rate up to the target.

Schedule & Targets for SC4

Tier0 - Tier1 Throughput tests

Disk - disk Tier0-Tier1 tests at the full nominal rate are scheduled for April.

The proposed schedule is as follows:

  • April 3rd (Monday) - April 13th (Thursday before Easter) - sustain an average daily rate to each Tier1 at or above the full nominal rate. (This is the week of the GDB + HEPiX + LHC OPN meeting in Rome...)
  • Any loss of average rate >= 10% needs to be:
  1. accounted for (e.g. explanation / resolution in the operations log)
  2. compensated for by a corresponding increase in rate in the following days (see the bookkeeping sketch after this list)
  • We should continue to run at the same rates unattended over Easter weekend (14 - 16 April).
  • From Tuesday April 18th - Monday April 24th we should perform the tape tests at the rates in the table below.
  • From after the con-call on Monday April 24th until the end of the month experiment-driven transfers can be scheduled.
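
  • A sketch of the bookkeeping implied by the >=10% rule above, using made-up daily rates for a hypothetical 200 MB/s site:

```python
# Sketch of the bookkeeping implied by the >=10% rule, for a hypothetical
# 200 MB/s site with made-up daily rates: flag days more than 10% below
# nominal and track whether the shortfall is made up on the following days.
NOMINAL = 200.0
daily_rates = [205, 160, 150, 230, 240, 210, 220]   # illustrative values only

balance = 0.0   # positive = MB/s-days still to be made up
for day, rate in enumerate(daily_rates, start=1):
    if rate < 0.9 * NOMINAL:
        print(f"day {day}: {rate:.0f} MB/s is >10% below nominal - needs an ops-log entry")
    balance += NOMINAL - rate
    status = "behind" if balance > 0 else "on/ahead of"
    print(f"day {day}: cumulative balance {balance:+.0f} MB/s-days ({status} target)")
```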

  • To ensure a timely start on April 3rd, preparation with the sites needs to start now (March). dTeam transfers (low priority) will therefore commence to the same end-points as in the SC3 disk-disk re-run as soon as sites confirm their readiness.

| Site | Disk-Disk (MB/s) | Disk-Tape (MB/s) |
| TRIUMF | 50 | 50 |
| BNL | 200 | 75 |
| FNAL | 200 | 75 |
| PIC | 60* | 60 |
| RAL | 150 | 75 |
| SARA | 150 | 75 |
| IN2P3 | 200 | 75 |
| FZK | 200 | 75 |
| CNAF | 200 | 75 |
| ASGC | 100 | 75 |
| NDGF | 50 | 50 |
| DESY | 60 | 60 |

  • (*) The nominal rate for PIC is 100MB/s, but transfers will be limited by the WAN until ~November 2006.
Topic attachments

  • HighEnergyDataPump.ppt (1492.5 K, 2006-04-06, JamieShiers) - High Energy Data Pump summary at Rome HEPiX (April 2006)