LCG Management Board |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Date/Time |
Tuesday 18 August 2009 16:00-17:00 – Phone Meeting |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Agenda
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Members |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
(Version 1 – 22.8.2009) |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Participants |
A.Aimar (notes), D.Barberis, O.Barring, D.Britton, Ph.Charpentier, D.Duellmann, M.Ernst, I.Fisk, D.Foster, Qin Gang, J.Gordon, A.Heiss, M.Kasemann, M.Lamanna, P.Mato, G.Merino, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, J.Shiers (chair), O.Smirnova, R.Tafirout |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Invited |
J.Andreeva, L.Perini |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Action List |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Mailing List Archive |
https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/ |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Next Meeting |
Tuesday 1 September 2009 16:00-17:00 – Phone Meeting |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1. Minutes and Matters arising (Minutes)
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1.1 Minutes of Previous Meeting
No comments were received about the minutes. The minutes of the previous MB meeting were approved.

1.2 ALICE Requirements on SL5 and CREAM Deployment (Slides) – Y.Schutz

ALICE requested all WLCG Sites to provide SL5 Worker Nodes and the CREAM CE as soon as possible.

SL(C)5
ALICE has expressed its wish, since last year, to have the WNs on SL5 as soon as possible. All ALICE Sites should have migrated to SL5 by mid-September 2009. Sites that have not completed the migration will not be able to participate in production until the migration is complete. ALICE does not wish to maintain double support for the sites (SL5 and SL4) nor hybrid setups (for example, WNs on SL5 and VOBOX on SL4).

VOBOX
Therefore the same deadline also applies to VOBOXes. The migration of the gLite-VOBOX service to SL5 is being carried out by both IT-GD and IT-GS developers.

CREAM-CE
ALICE has expressed its interest in having the CREAM-CE at all sites (in parallel to the LCG-CE) before real data taking. ALICE would like to have the CREAM-CE system in parallel to the LCG-CE at all sites by November 2009. The current situation (CREAM-CE, VOBOX with CREAM clients + gridftp server) is not evolving: CERN, RAL, Torino, CNAF, Legnaro, SPBu, KISTI, Kolkata, FZK, IHEP, Prague, Subatech, SARA. This is the same situation as at CHEP09.

ALICE would like to know from the Sites by when they can meet the agreed deadline.

J.Gordon noted that the request is in reality about the resources allocated to ALICE at the Sites.

M.Schulz added that the VOBOX on SL5 is currently being clarified, in terms of the packages to include, and will be defined by the end of the week. Testing will start right after. The SL5 WNs, instead, are all ready for deployment.

A.Heiss added that FZK will move all WNs for ALICE to SL5. For the VOBOXes they are not sure they can manage by September, as it depends on hardware arrival and on the presence of the people responsible.

Ph.Charpentier asked that the Sites give a new name to the CE referring to the SL5 batch queue. In DIRAC each CE refers to a specific platform, and it should be properly advertised in the BDII.

J.Gordon noted that WLCG asked that a number of Sites run the CREAM-CE, and that target was achieved; but maybe not enough of them are supporting ALICE.

Ph.Charpentier asked what happens now with the request. Who is going to follow it up?

Y.Schutz asked for a report at the GDB about the progress of these requests.

J.Gordon replied that the request had already been made and will also be reported at the GDB. The MB endorses the request but the VOs should follow the progress.

M.Schulz added that a global target can be added to the milestones for the Sites. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2. Action List Review (List of actions)
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The MB Action List was not reviewed this week. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3. LCG Operations Weekly Report (Slides) – H.Renshall
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1 GGUS Tickets and Alarms
There were three alarm tickets:
- From CMS to CERN on 12 August, when an ETICS virtual machine flooded the network with DHCP requests (not reproducible), overloading a network switch and then a router in front of castorcms and castoralice. The switch was stopped overnight.
- From ATLAS to DE-KIT on 9 August (when the MC disk space filled up), exceptionally, to keep MC Production running. 2 TB were added (with thanks from ATLAS). KIT think this did not warrant an alarm ticket, since experiment production space should be well planned. ATLAS agree, but at the time did not know the status of some obsolete data deletion services at FZK.
- From ATLAS to RAL on 7 August for hanging LFC connections. The front-end servers were rapidly restarted, fixing the problem.

Incidents leading to service incident reports:
- RAL air conditioning (chiller) failure from 12 to 17 August. Draft SIR available (see later report).
- Fibre cut between Madrid and Geneva: LHCOPN primary and secondary down. SIR requested.

Slide 4 shows the VO Availability results and one can notice:
- The scheduled NIKHEF move to the new computer centre was performed from the 10th to the 14th for the normal servers, which were back up on the 14th. Disk servers and worker nodes are being moved this week.
- The earlier but still ongoing ALICE downtime at SARA was due to a failed ALICE VObox there. There is a separate ALICE VObox at NIKHEF; it is not clear if/how the two are coupled.
- RAL had air conditioning stoppages due to water chiller failures starting on the 12th and continuing until Monday 17th.

On the RAL Air Conditioning stoppage the following report was sent. The SIR from the LHC-OPN is still expected.
3.2 Miscellaneous Reports

CERN
CERN responded to the Red Hat 4 and 5 zero-day exploit of loaded modules that was exposed on the morning of Friday 14th August. The workaround of modifying /etc/modprobe.conf was validated as good enough to protect lxplus/lxbatch for the weekend. It was rapidly propagated, but nodes with a suspect module loaded (mostly bluetooth was found) needed a reboot: one lxplus node (of 50) and 150 lxbatch nodes (of 1000). The Experiment-info list was informed of the reboots, with apologies for any losses, and also of the later decision to reboot the lxbuild machines. The Linux for controls list was also advised, to include WAN-accessible DAQ and accelerator services. CERN actions were minuted in the daily report.

LHC-OPN
The fibre from Madrid to Geneva was cut at about 12.30 on Tuesday 4th August due to public construction work, taking out both the primary and secondary OPN links. The first indication was a GGUS ticket from an ATLAS user at 17:04 on 4 August reporting file transfer errors shown in the dashboard DDM web page for PIC_DATADISK, where PIC thought the problems lay with FZK and INFN. The reply from the OPN was then confusing, as it referred to previous non-OPN routing problems between PIC and FZK and CNAF. Did PIC know the OPN was down? Probably not. And why not? This ticket was superseded by an ATLAS team ticket at 20.00. The reply on 5 August did mention that the OPN was down and referred to a GGUS LHCOPN ticket. These cross references are part of the agreed procedure. The T1s were confused between the issue of the routing when the OPN is down and the OPN failure itself, and the tickets reflect this. The main concern is why there was no GGUS LHCOPN ticket earlier if the break happened at 12.30 on the 4th, as this would have saved the Experiments and Sites a lot of time. We cannot see any dashboard-style monitoring of the OPN available to us; some monitoring is password protected. Total downtime was 26 hours and a SIR has been requested.
DE-KIT
Several FZK disk servers (ALICE and LHCb) were hit over several days by a bug in the BIOS that falsely detected overheating and shut down the machine. In addition, one of three FZK tape robots broke down several times from 8 August, leaving the tape service intermittently degraded. It was finally fixed on 13 August.

WLCG is now formally requesting all sites to upgrade worker nodes to SL(C)5, following ATLAS confirmation.

3.3 Summary
In summary the main issues of the last two weeks were:
- Should space exhaustion warrant an alarm ticket to save production time?
- This month’s computer centre infrastructure failure was air conditioning at RAL.
- LHCOPN monitoring/alarms/information flow for the sites and experiments needs improvement.
- Serious zero-day exploit on SL4 and 5 diverted a large amount of expertise.
- Universal migration to SL(C)5 worker nodes requested.

D.Barberis thanked FZK for adding disk upon ATLAS’ request. A disk space request should not be an alarm, but the persons on shift were not experts and that was the only solution found. The reason for the lack of disk space was that the deletion scripts did not work.

J.Gordon added that installations of Red Hat 5 that do not run SELinux may have to install additional security updates as critical, which would otherwise not be needed. If ATLAS allowed the use of SELinux it would help the Sites.

D.Barberis replied that the only reason is the error code libraries.

D.Duellmann added that the problem is due to the ORACLE client, and the new ORACLE client 11 still has to be validated in the Applications Area.

Ph.Charpentier added that the AF stated the request, and it is independent of which Experiment needs it. It should apply to all Sites, for all VOs.

J.Gordon noted that if ATLAS requires something and a Site is not supporting ATLAS, the Site could avoid applying the patch.

P.Mato noted that other VOs may also use the same ORACLE client (e.g. LHCb) and the Site should not look at the specific VO request.

J.Shiers suggested that the issue be clarified and reported to the MB.

D.Barberis highlighted the fact that even if a new release is deployed the old ones will still be required for 12 months (i.e. end of 2010).

J.Gordon noted that in this way old security issues will remain until all Sites can enable the SELinux extensions. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4. Update on the EU Proposals (Slides) – J.Shiers
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Additional Material:
- Wiki
- "P0" SSC preparation page (Indico)
- EGI SA1 Draft, as an example
- HEP brief input to EGI editorial board (end July)
- MB Agenda July 7th (previous update)
- SA4 outline
- Agenda of HEP SSC call August 17
- Draft WLCG SA4 task descriptions

J.Shiers presented the status of the EU proposals and asked for input from the Experiments. There were references to several documents and material (links above) but the situation is summarized in the Slides attached.

The main repository of information is the HEP SSC wiki: https://twiki.cern.ch/twiki/bin/view/LCG/HEPSSCPreparationWiki
From this page one can find pointers to the Indico category, the mailing list and all documents and presentations (check uploaded material on the wiki page).

4.1 Overview
The previous update to the MB was made on July 7th (link). Since that date several events have happened:
- The call has officially opened (July 30)
- HEP has provided input to the EGI editorial board on its outline plans (July 31)
- A “HEP” member of the editorial board has been proposed (J.Shiers)
- Timetables for preparing the relevant proposals in response to 1.2.1.1 (EGI) & 1.2.1.2 (Service Deployment) + 1.2.3 (Virtual Research Communities) have been prepared
- Laura Perini is coordinating the preparation of the single proposal to cover 1.2.1.1 & 1.2.1.2
- Cal Loomis is coordinating the “P0” proposal (see e-mail link) for an SSC against 1.2.3

4.2 Manpower Estimates
The table below was used in response to the editorial board. The funding from the EU will be 50% of the resources below (except for row 11).
Following yesterday’s call there are the following changes:
- Move 1 and 2 (liaison functions) to SSC
- Diane and AMGA: combination of other communities (e.g. AMGA) and/or SSC

The remaining estimated effort (6 + 10 + 4 + 2) is “about right” but needs further (“final”) discussion plus justification.
- 22 FTE at 50% EU funding over 3 years = 11 x 3 x EUR100K = EUR3.3M
- 1.2.1.2 total is expected to be capped at EUR5M (out of EUR25M total for 1.2.1.1 & 1.2.1.2)

Ph.Charpentier asked about the resources for Diane and AMGA.

J.Shiers replied that they are removed from the table above and will be covered by other sciences if they want them. The total of the table above is now 22, not 26 as before the changes.

4.3 Draft “EGI Proper” Proposal
EGI Proper, coordinated by L.Perini, will cover 1.2.1.1 & 1.2.1.2. It is a working title and will change to a meaningful acronym.

From the Activity SA4 (“Services for Communities of Heavy Grid Users”) document (link) the Communities identified as Heavy Users Communities (HUCs) are:
- High Energy Physics (HEP)
- Life Sciences (LS)
- Astronomy and Astrophysics (AA)
- Computational Chemistry and Material Sciences (CCMT)
- Earth Sciences (ES)

The tasks currently foreseen in the SA draft are as follows (all prefixed by TSA4., e.g. TSA4.1 for management):
- Management
- Hosting of Community Specific Services (TSA4.2)
- Hosting of VO Specific Services (TSA4.3)
- Support of Frameworks (TSA4.4)
- Support of Scientific Gateways
- Support for interoperability within SA4 services
- Support for transition to SA4 services
4.4 Next Steps
There is too little information on the VO services needed. Input from the VOs is needed urgently.

We should write 3 task descriptions covering the above areas (~1 page, perhaps more for VO services):
- Deadline is Friday 21st August
- Drafts are attached (pages 3 to 6 in link)

Next con-call is Thursday at 16:00. From Thursday’s call on: move focus to SSC preparation.
- SSC deadline: Monday 31st August

Propose to use next Tuesday’s MB slot for further SSC discussions, plus probably Thursday too (both at 16:00 UTC+2).

There is still time for further iteration prior to, during and even after EGEE’09 in Barcelona! (HEP SD & SSC sessions).

The current SSC proposals are:
- EGI SSC P0 (this proposal): High-Energy Physics (HEP), Life Science (LS), Comp. Chem./Material Science (CMST), Complex Systems (CS), and Grid Observatory (GO) (C. Loomis, CNRS, FR). Deadline: end August 2009
- EGI SSC P1: Astron/Astrophysics, Earth Science, Fusion (Claudio Vuerli, INAF, IT)
- CUE: Dissemination, Training, Outreach to business (Roberto Barbera, INFN, IT)

The future timetable is:
- Thursday 20th Aug: “HEP SSC” con-call @ 16:00
  – Iterate on SA4 input to 1.2.1.2
  – Start to discuss plans for P0 SSC for 1.2.3
- Tuesday 25th Aug: “HEP SSC” con-call in MB slot
  – Any feedback from SA4 input to 1.2.1.2? Actions?
  – First drafts of sub-contributions to HEP SSC

Further calls and meetings are to be defined and several sessions are already scheduled at EGEE’09.
Ph.Charpentier asked whether Hosting of Community Specific Services (TSA4.2) and Hosting of VO Specific Services (TSA4.3) include providing the running software and the client software needed to use such Services (e.g. FTS, DPM, LFC, etc.). The support of the dashboard Services also needs to be clarified. He also noted that the expected deadline was September, not end of August.

J.Shiers replied that the proposals all together must provide all that is needed, but it does not all have to be in a single proposal. Development should not be included, but support and evolution of the software needed can be included.

L.Perini added that if the call specifies Service Deployment it may include the support of the software needed. Further development can be added to the JRA (Joint Research Activities), not to the SA activities.

M.Kasemann added that the liaison functions with the Experiments should also be added to the manpower requests.

J.Shiers agreed but reminded the MB that major changes to the manpower data cannot be made.

Ph.Charpentier asked what the Analysis Task for WLCG is, and whether the VO services should all be listed (Panda, DIRAC, PhEDEx, ALIEN, etc.).

J.Shiers replied that the Analysis Support seems to require 2 more people. The list of what the VOs need should be provided as their input. Each Experiment could specify, as input, its requests in a page listing all the software and services expected. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5. AOB
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5.1 Visits to the Tier-1 Sites
M.Kasemann proposed organising visits to the Tier-1 Sites. The goal is not to scrutinize the situation but to tune expectations to the current status of the Sites and to meet the people at the Sites. J.Shiers replied that WLCG would be interested, and D.Barberis added that ATLAS would also participate and supported the proposal. Which Sites could be visited should be discussed at the next meeting in 2 weeks. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6. Summary of New Actions |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
No new actions. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||