A number of regions reported high load and apparently non-optimal usage of resources by the Biomed VO, which is currently running some challenges, e.g.:
Highly loaded LFCs in France.
Multiple worker nodes in the UK accessing identical files at remote SEs.
Attendance
EGEE
Asia Pacific ROC:
Central Europe ROC: Małgorzata Krakowian
OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen, Diana, Maria
French ROC: Osman, Pierre
German/Swiss ROC: Angela Poschlad, Wen Mei
Italian ROC:
Northern Europe ROC:
Russian ROC: Lev Shamardin
South East Europe ROC: Kai
South West Europe ROC: Kostas
UK/Ireland ROC: Jeremy Coles
GGUS: Torsten
GOCDB: Gilles Mathieu
WLCG
WLCG Service Coordination: Harry Renshall, Jamie Shiers
Nick to add a hyperlink for the Alcatel meeting call-back to the agenda and minutes template. Update: the link was always there, but it now uses a more visible font.
Hydra testing: ROCs are asked if they wish to help with certifying, testing or using Hydra.
EGEE Items From ROC Reports
ROC France
INFORMATION IN2P3-CC: The central LFC for the Biomed VO is currently overloaded due to growth in Biomed activity. Even though the hardware was upgraded as an emergency measure on Friday, the problem is still there. The problem might be due to some limitations in the number of simultaneous connections between the LFC and the Oracle DB. We will contact LFC support to find a good (and scalable) solution. Sorry for the inconvenience.
Comments from Pierre
The Biomed jobs appear to be using a bad (inefficient) access method against the LFC.
ROC UK/I
A Biomed user's activity has caused site instabilities by repeatedly transferring the same 2.8GB file to WNs across EGEE from a single UK site SE. After being ticketed, the user produced more replicas, but there is concern about this data distribution model and the bandwidth stress. For a related GGUS ticket see GGUS:43489. The user responded quickly. We may be seeing signs of the limit of the standard submission approach/model: "We are submitting these jobs with the native EGEE command glite-wms-job-submit. These grid jobs are then accessing the 2.8GB data file through the command lcg-cp. So we decided neither where the jobs are scheduled nor which file replica is used by these jobs. The EGEE middleware is deciding." Because of the I/O limitations the Biomed jobs are often quite inefficient.
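For reference, a minimal sketch of the submission pattern described in the quote above; the JDL file name, LFN and local paths are hypothetical and not taken from the actual Biomed setup:

  # On the UI: submit via the WMS, which chooses the CE
  # (the user does not pick the destination site).
  glite-wms-job-submit -a -o jobids.txt biomed-analysis.jdl

  # Inside the job on the WN: fetch the 2.8GB input by logical file name.
  # The LFN is resolved through the LFC and lcg-cp picks one of the available
  # replicas, so the user also does not choose which SE serves the file.
  lcg-cp --vo biomed lfn:/grid/biomed/data/input-2.8GB.dat file:$PWD/input.dat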
Comments from Jeremy
Results were given for one site, but in fact all UK sites are being hit hard, e.g. lots of requests for the same files. The user has adjusted, but more could be done.
Comments from Nick
We can contact the VO to have them change their habits.
Comments from Jeremy
The VO may just say the middleware is deficient.
Comments from Nick
Action on Nick: contact the Biomed VO.
ROC UK/I
UKI-NORTHGRID-LANCS-HEP saw a problem with a recent WN update: GGUS:43473. The ticket seems to bounce around without anybody really knowing how to help! The point to note is that it is likely a site problem, but the site/ROC has struggled to understand it as it apparently requires middleware expert help. The site will try a reinstall with 64-bit gLite to remove the 64/32-bit incompatibilities, but there is no real understanding of the problem yet.
Site availability does not take into account SRMv2 systems. As a result the overall RAL availability is dependent on a dCache service which is no longer considered a front-line service. SRMv2 not being in the overall availability figures is a problem with the monitoring, not the site. Update: the WLCG Management Board decided on Tuesday to use SRMv2 in the availability calculations as of December (in lieu of the SRMv1 tests). This will be discussed with the EGEE ROC Managers to ask them to ratify it.
As of Wednesday, WLCG will consider the SRMv2 tests as the important ones; should we do the same for EGEE? (See John's email, to be included in the minutes.)
ROC UK/I
On the topic of SAM, has there been any progress on centrally identifying common problems seen in SAM? On 19th November from 18:00-21:00 UK time a number of sites saw the same (top-level BDII?) problem. It would save much time if these errors could be automatically flagged as possibly due to an offsite problem.
Comment from John
We shall try to send broadcasts in such situations.
Comment from Jeremy
This would help to avoid duplicated effort across many sites.
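A quick check along these lines, which a site could run when many SAM tests fail at once, is to query a top-level BDII directly; a minimal sketch, assuming the CERN top-level BDII lcg-bdii.cern.ch (any top-level BDII hostname can be substituted):

  # Query the top-level BDII for site entries; a timeout or empty result here
  # suggests the failures are due to an offsite (information system) problem
  # rather than the local site.
  ldapsearch -x -LLL -H ldap://lcg-bdii.cern.ch:2170 -b o=grid \
      '(objectClass=GlueSite)' GlueSiteName | head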
Later there will be PATCH:2417 with a similar fix for the CREAM CE.
Java Bouncy Castle problems
Extract from broadcast:
A few days ago jpackage updated bouncycastle to version 1.41. This version causes problems for several glite nodes as it places the jars in a new directory. The glite developers are currently working on patches to solve this issue. For the time being please make sure that your site DOES NOT UPGRADE to bouncycastle 1.41.
Node types affected by this problem:
glite-UI
glite-MON
glite-CREAM
glite-FTS_oracle
glite-WN
glite-TORQUE_utils
glite-LSF_utils
glite-CONDOR_utils
glite-VOMS_mysql
glite-VOMS_oracle
glite-VOBOX
lcg-CE
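Until the gLite patches are released, one way for a site to hold back the package is to exclude it from yum updates; a minimal sketch using the standard yum exclude mechanism (nothing gLite-specific is assumed):

  # Update the node while skipping any bouncycastle packages.
  yum update --exclude='bouncycastle*'

  # Alternatively, add a permanent exclusion to /etc/yum.conf (or to the
  # relevant repository file under /etc/yum.repos.d/):
  #   exclude=bouncycastle*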
WLCG Items
Upcoming WLCG Service Interventions
Interventions
Monday December 1st: possibly a VOMS intervention for the LHC VOs. A broadcast will be sent if this is going to happen.
Update: this is going to happen; the transparent intervention has now been entered in the GOCDB.
ATLAS Service
BNL<->CNAF network problems solved. Took a couple of weeks.
ALICE Service
CMS Service
Everything quite smooth, ramping down from exercises of the past few weeks.