-- JamieShiers - 22 Mar 2006
Achieved Rates - Daily Averages for Disk-Disk Transfers
- Rates achieved are as reported by Gridview.
- See the SC4blog for more information on events, interventions, changes, etc.
- Note that the official goal is to meet or exceed the target rate every day (or to make up for any shortfall shortly afterwards - more information below). In addition, the start-up (ramp-up to full nominal rates) needs to be rapid - essentially a step function at the beginning of each LHC running period.
- BOLD - sites meeting or exceeding their nominal target for that day
- ITALICS - sites within 15% of their nominal target for that day
Site-specific Issues and Plans
CNAF
- As far as I understand, the CASTOR2 LSF plugin and rmmaster version you are running are leaking jobs. We had the same problem at CERN during the SC3 re-run in January. The symptom is that LSF dispatches some fraction of the jobs without message box 4, which is posted by the plugin. It is briefly described on slide 4 of CASTOR2 training session 5: http://indico.cern.ch/getFile.py/access?resId=33&materialId=0&confId=a058153 . Unfortunately it only describes a workaround for 'get' requests, while you are seeing the problem for 'put'. The procedure should be similar, but you have to decide the filesystem yourself (for 'get' it is much simpler, since the list of target filesystems is already posted in msg box 1). Alternatively, you may just post an empty or non-existent filesystem, which will cause the job to fail. Simply killing the job with bkill would also work, but it results in an accumulation of rows with STATUS=6 in the SUBREQUEST table. The problem (a workaround for a time window in LSF) has been fixed in 2.0.3, so the best option would be to upgrade. Olof (a scripted sketch of this workaround is given below)
- We are upgrading CASTOR2 to the latest release and should be ready to repeat the disk-to-disk transfer tests during the first half of May. Luca
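The workaround Olof describes above could, in principle, be scripted. The following is a hypothetical sketch only: the queue name, the dummy filesystem string and the idea of scanning all running jobs are assumptions on top of his description; it uses the standard LSF commands bjobs/bread/bpost, and avoids bkill since that leaves STATUS=6 rows behind in the SUBREQUEST table.
<verbatim>
#!/usr/bin/env python
# Hypothetical sketch of the workaround described above: find LSF jobs that were
# dispatched without message box 4 and post a dummy filesystem so they fail fast.
# Queue name and dummy value are assumptions - adapt to the local setup.
import subprocess

QUEUE = "castor"      # assumed name of the CASTOR2 LSF queue
FS_MSGBOX = 4         # message box the plugin normally posts (per the note above)
DUMMY_FS = "/nonexistent"   # a non-existing filesystem, per Olof's suggestion

def run(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = p.communicate()
    return p.returncode, out.decode()

def running_jobs(queue):
    # Parse the default bjobs output: skip the header line, keep numeric job ids.
    rc, out = run(["bjobs", "-r", "-q", queue])
    jobs = []
    for line in out.splitlines()[1:]:
        fields = line.split()
        if fields and fields[0].isdigit():
            jobs.append(fields[0])
    return jobs

def has_msgbox(jobid, idx):
    # bread returns non-zero (and no output) if nothing was posted to that box.
    rc, out = run(["bread", "-i", str(idx), jobid])
    return rc == 0 and out.strip() != ""

for jobid in running_jobs(QUEUE):
    if not has_msgbox(jobid, FS_MSGBOX):
        # Posting a non-existent filesystem makes the job fail cleanly; bkill
        # would also work but leaves STATUS=6 rows in SUBREQUEST behind.
        run(["bpost", "-i", str(FS_MSGBOX), "-d", DUMMY_FS, jobid])
</verbatim>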
FZK
- The link to FZK should be moved to the dedicated circuit in the timeframe of early June, I think. For the moment the IP service of GEANT has a limitation of single streams <= 1Gb/sec due to packet re-ordering in the M160 routers. I am not sure whether this would be a problem in practice for the multi-stream usage in SC4 (a rough estimate is sketched at the end of this subsection). In any case, traffic via GEANT-IP uses a shared network, so performance may vary. David
- We are in the process of analyzing the limiting factors. Some key (network, dCache) people are not present this week; a report and a throughput-improvement plan will follow next week (Wednesday). First things that come to mind: there is a difference in the setup compared to the one used in SC3, most notably the 2.6 kernel on the GridFTP receivers. Another possible cause is the cost computation dCache performs: new nodes received an inordinate amount of traffic and left the remaining nodes idling, which could be tuned better. I do not think the single-stream limitation is the real problem because, as far as I know, the link runs via the same path and hardware as in SC3 and we did not see the problem then. Correct me if I'm wrong. Jos
- Due to the Easter holidays and the ongoing work on our tape connection at GridKa, we won't be able to start the tape tests before the middle of next week. Doris
- ... so we will continue to monitor disk-disk transfers ... Jamie
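As a rough check of the single-stream limit David mentions above, here is a back-of-the-envelope sketch; the only inputs are the 1Gb/s per-stream cap quoted above and FZK's 200MB/s nominal disk-disk target from the tables below, and the conclusion is only an estimate.
<verbatim>
# Rough estimate of whether the GEANT-IP single-stream cap matters for FZK.
# Inputs from this page: per-stream cap 1 Gb/s, FZK nominal target 200 MB/s.
PER_STREAM_CAP_MBPS = 1000 / 8.0   # 1 Gb/s expressed in MB/s (= 125 MB/s)
FZK_TARGET_MBPS = 200              # nominal disk-disk rate for FZK

min_streams = FZK_TARGET_MBPS / PER_STREAM_CAP_MBPS
print("Minimum concurrent streams needed: %.1f" % min_streams)   # ~1.6
# With the O(10) or more concurrent FTS files/streams used in SC4, each stream
# only needs a few MB/s, so the per-stream cap should not be the bottleneck.
</verbatim>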
IN2P3
- No theories yet - the configuration at IN2P3 should be able to handle up to 300MB/s.
- No limiting factors have been found at IN2P3-CC during the SC4 disk-disk transfers, except a memory issue on the SRM node which led to a drop in the rate every night at 21:40 GMT during the DB backup (20 min). The memory will be upgraded, and IN2P3 will study the possibility of distributing the dCache core services over 2 different hosts. Another possible issue is concurrent access to the SRM by ATLAS; this has to be investigated. It seems that in most cases FTS was not able to send enough data with the initial number of files on the channel (15) when all sites were competing: with 15 files at the beginning of the challenge, IN2P3 could sustain > 200MB/s, whereas one week later the rate was half of that with 30 files. When increasing to 45, the rate immediately doubled but then started to decrease again. Moreover, we noticed that the number of concurrent transfers was often lower than the number of files set on the channel. Lionel
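A back-of-the-envelope reading of the figures Lionel quotes above (a sketch only; the per-file rates are simply derived from the aggregate rates and files-per-channel settings he gives, not measured separately):
<verbatim>
# Effective per-file throughput implied by the figures quoted above.
# (label, aggregate rate in MB/s, number of files configured on the FTS channel)
observations = [
    ("start of challenge",  200, 15),   # > 200 MB/s with 15 files
    ("one week later",      100, 30),   # roughly half that with 30 files
    ("after raising to 45", 200, 45),   # rate doubled again, then decayed
]
for label, rate, files in observations:
    print("%-20s %5.1f MB/s per file" % (label, float(rate) / files))
# The per-file rate drops from ~13 MB/s to ~3-4 MB/s as more sites compete, so
# raising the files-per-channel setting only compensates temporarily; it does
# not address whatever is limiting the per-transfer throughput.
</verbatim>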
ASGC
- The target here is to reach stable transfers at 50MB/s. Having sustained this over several days, we would then try to ramp up to the full nominal rate by the end of May. Jamie
Week two (April 10 on)
- The week 2 average of the summed Tier1 rate is 1262 MB/s - 79% of the 1600 MB/s target (a small check of this figure is sketched after the table below).
| Site | Disk-Disk Target | Week 1 Average | Week 2 Average | Apr 10 | Apr 11 | Apr 12 | Apr 13 | Apr 14 | Apr 15 | Apr 16 | Apr 17 |
| TRIUMF | 50 | 54 | 63 | 62 | 69 | 63 | 63 | 60 | 60 | 62 | 63 |
| BNL | 200 | 191 | 199 | 220 | 199 | 204 | 168 | 122 | 139 | 284 | 257 |
| FNAL | 200 | 101 | 231 | 168 | 289 | 224 | 159 | 218 | 269 | 258 | 261 |
| PIC | 60 | 49 | 78 (5 days) | 49 | - | 24 | 72 | 76 | 75 | 84 | 82 |
| RAL | 150 | 118 | 136 | 137 | 124 | 106 | 142 | 139 | 131 | 151 | 160 |
| SARA | 150 | 120 | 178 | 173 | 158 | 135 | 190 | 170 | 175 | 206 | 213 |
| IN2P3 | 200 | 165 | 157 | 86 | 133 | 157 | 183 | 193 | 167 | 166 | 167 |
| FZK | 200 | 104 | 142 | 97 | 174 | 141 | 159 | 152 | 144 | 139 | 130 |
| CNAF | 200 | 80 | 88 | 82 | 121 | 96 | 123 | 77 | 44 | 132 | 32 |
| ASGC | 100 | | 24 | 22 | 33 | 25 | 26 | 21 | 19 | 22 | 24 |
| NDGF | 50 | | 28 (5 days) | - | - | - | 14 | 38 | 32 | 35 | 20 |
| DESY | 60 | 70 | 74 | 71 | 77 | 69 | 72 | 76 | 73 | 76 | 76 |
| TOTAL (T1s) | 1600 | | | 1096 | 1300 | 1175 | 1046 | 1266 | 1255 | 1539 | 1409 |
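The headline figure quoted above is presumably just the average of the daily Tier1 totals in the bottom row of the table; a minimal check, assuming exactly that and nothing more:
<verbatim>
# Average the week-2 daily Tier1 totals (bottom row of the table above) and
# compare with the 1600 MB/s summed Tier1 target.
daily_totals = [1096, 1300, 1175, 1046, 1266, 1255, 1539, 1409]  # Apr 10-17, MB/s
TARGET = 1600                                                    # summed Tier1 target

week_avg = sum(daily_totals) / float(len(daily_totals))
print("Week 2 average: %.0f MB/s = %.0f%% of target" % (week_avg, 100 * week_avg / TARGET))
# Prints ~1261 MB/s = 79% of target; the 1262 MB/s quoted above presumably
# comes from the unrounded daily figures.
</verbatim>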
Week one (April 3 on)
| Site | Disk-Disk Target | Apr 3 | Apr 4 | Apr 5 | Apr 6 | Apr 7 | Apr 8 | Apr 9 | Weekly average | Average from startup | Target |
| TRIUMF | 50 | 44 | 42 | 55 | 62 | 56 | 55 | 61 | 54 | 54 (>100%) | 50 |
| BNL | 200 | 170 | 103 | 173 | 218 | 227 | 205 | 239 | 191 | 191 (>95%) | 200 |
| FNAL | 200 | - | - | 38 | 80 | 145 | 247 | 198 | 101 | 141 (>70%) | 200 |
| PIC | 60 | - | 18 | 41 | 22 | 58 | 75 | 80 | 49 | 42 (70%) | 60 |
| RAL | 150 | 129 | 86 | 117 | 128 | 137 | 109 | 117 | 118 | 118 (~80%) | 150 |
| SARA | 150 | 30 | 78 | 106 | 140 | 176 | 130 | 179 | 120 | 120 (80%) | 150 |
| IN2P3 | 200 | 200 | 114 | 148 | 179 | 193 | 137 | 182 | 165 | 165 (>80%) | 200 |
| FZK | 200 | 81 | 80 | 118 | 142 | 140 | 127 | 38 | 104 | 104 | 200 |
| CNAF | 200 | 55 | 71 | 92 | 95 | 83 | 80 | 81 | 80 | 80 | 200 |
| ASGC | 100 | - | 7 | 23 | 23 | - | - | 12 | | | 100 |
| NDGF | 50 | - | - | - | - | - | 14 | - | | | 50 |
| DESY | 60 | - | 68 | 63 | 75 | 74 | 68 | 74 | | 70 | 60 |
| TOTAL (T1s) | 1600 | 709 | 599 | 911 | 1089 | 1215 | 1179 | 1187 | | 984 (61.5% of target) | |
- Week one summary: there is an improvement with time in the above: the total average daily rate out of CERN increases but then plateaus around 1200MB/s, while the number of participating sites and their stability both build up. There is a clear indication of one or more limiting factors at CERN which need to be understood before trying to push the rate up to the target.
Schedule & Targets for SC4
Tier0-Tier1 Throughput tests
Disk-disk Tier0-Tier1 tests at the full nominal rate are scheduled for April.
The proposed schedule is as follows:
- April 3rd (Monday) - April 13th (Thursday before Easter): sustain an average daily rate to each Tier1 at or above the full nominal rate. (This is the week of the GDB + HEPiX + LHC OPN meetings in Rome...)
- Any loss of average rate >= 10% needs to be:
- accounted for (e.g. explanation / resolution in the operations log)
- compensated for by a corresponding increase in rate over the following days (a simple way to compute the required make-up rate is sketched after this list)
- We should continue to run at the same rates unattended over the Easter weekend (14 - 16 April).
- From Tuesday April 18th to Monday April 24th we should perform the tape tests at the rates in the table below.
- From after the con-call on Monday April 24th until the end of the month, experiment-driven transfers can be scheduled.
- To ensure a timely start on April 3rd, preparation with the sites needs to start now (March). dTeam transfers (low priority) will therefore commence to the same end-points as in the SC3 disk-disk re-run as soon as sites confirm their readiness.
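A minimal sketch of the compensation rule above (a hypothetical helper, not an official SC4 tool; the site, daily rates and number of make-up days are purely illustrative, taken from the RAL week-one row):
<verbatim>
# Hypothetical helper for the make-up rule above: shortfalls against the nominal
# rate have to be recovered by running above nominal on the following days.
def daily_deficits(nominal, achieved):
    # Shortfall on each day (this simple version counts every shortfall,
    # not only days more than 10% below nominal).
    return [max(0, nominal - r) for r in achieved]

nominal = 150                                   # RAL disk-disk target, MB/s
achieved = [129, 86, 117, 128, 137, 109, 117]   # RAL daily averages, Apr 3-9
deficit = sum(daily_deficits(nominal, achieved))
makeup_days = 3                                 # illustrative recovery window

print("Accumulated deficit: %d MB/s-days" % deficit)
print("Run at %.0f MB/s for %d days to catch up"
      % (nominal + deficit / float(makeup_days), makeup_days))
</verbatim>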
| Site | Disk-Disk (MB/s) | Disk-Tape (MB/s) |
| TRIUMF | 50 | 50 |
| BNL | 200 | 75 |
| FNAL | 200 | 75 |
| PIC | 60* | 60 |
| RAL | 150 | 75 |
| SARA | 150 | 75 |
| IN2P3 | 200 | 75 |
| FZK | 200 | 75 |
| CNAF | 200 | 75 |
| ASGC | 100 | 75 |
| NDGF | 50 | 50 |
| DESY | 60 | 60 |
- (*) The nominal rate for PIC is 100MB/s, but it will be limited by the WAN until ~November 2006.