LCG Management Board
Tuesday 18 August 2009 16:00-17:00 – Phone Meeting
(Version 1 – 22.8.2009)
A.Aimar (notes), D.Barberis, O.Barring, D.Britton, Ph.Charpentier, D.Duellmann, M.Ernst, I.Fisk, D.Foster, Qin Gang, J.Gordon, A.Heiss, M.Kasemann, M.Lamanna, P.Mato, G.Merino, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, J.Shiers (chair), O.Smirnova, R.Tafirout
Tuesday 1 September 2009 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments received about the minutes. The minutes of the previous MB meeting were approved.
1.2 ALICE Requirements on SL5 and CREAM Deployment (Slides) – Y.Schutz
ALICE requested all WLCG Sites to provide as soon as possible SL5 Worker Nodes and the CREAM CE.
ALICE has been expressing its wish to have the WNs on SL5 since last year.
All ALICE Sites should have migrated to SL5 by mid-September 2009.
Sites that have not completed the migration will not be able to participate in production until the migration is complete. ALICE does not wish to maintain double support at the sites (SL5 AND SL4), nor hybrid setups (for example WNs on SL5 and the VOBOX on SL4).
Therefore the same deadline applies also to VOBOXES. The migration of gLite-VOBOX service to SL5 is ongoing by both IT-GD and IT-GS developers.
ALICE has expressed its interest in having the CREAM-CE at all sites (in parallel to the LCG-CE) before real data taking: the CREAM-CE should run in parallel to the LCG-CE at ALL sites by November 2009.
The list of sites currently providing the requested setup (CREAM-CE, VOBOX with CREAM clients plus a gridftp server) is:
CERN, RAL, Torino, CNAF, Legnaro, SPBu, KISTI, Kolkata, FZK, IHEP, Prague, Subatech, SARA
This is the same situation as at CHEP09; it is not evolving.
ALICE would like to know from the Sites by when they can fulfil the agreed deadline.
J.Gordon noted that the request is in reality about the resources allocated to ALICE at the Sites.
M.Schulz added that the SL5 VOBOX is currently being clarified in terms of which packages to include, and will be defined by the end of the week; testing will start right after. The SL5 WNs, instead, are all ready for deployment.
A.Heiss added that FZK will move all WNs for ALICE to SL5. For the VOBOXES they are not sure they will manage by September, as it depends on the arrival of hardware and the availability of the people responsible.
Ph.Charpentier asked that the Sites give a new name to any CE that refers to an SL5 batch queue: in DIRAC each CE refers to a specific platform, and the platform should be properly advertised in the BDII.
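As an illustration of what is being asked for (the hostnames, queue and site names below are hypothetical, not taken from the minutes), a site publishing a separately named SL5 CE in the BDII would advertise the platform through the GLUE 1.3 operating-system attributes of the associated SubCluster:

```ldif
# Hypothetical GLUE 1.3 fragment; host, queue and cluster names are examples only.
dn: GlueCEUniqueID=ce-sl5.example-site.org:2119/jobmanager-lcgpbs-alice,mds-vo-name=resource,o=grid
GlueCEUniqueID: ce-sl5.example-site.org:2119/jobmanager-lcgpbs-alice
GlueCEName: alice
GlueCEInfoHostName: ce-sl5.example-site.org

# The platform of the WNs behind the CE is advertised on the SubCluster.
dn: GlueSubClusterUniqueID=ce-sl5.example-site.org,GlueClusterUniqueID=ce-sl5.example-site.org,mds-vo-name=resource,o=grid
GlueHostOperatingSystemName: ScientificSL
GlueHostOperatingSystemRelease: 5.3
```

With a distinct GlueCEUniqueID per platform, a framework such as DIRAC can map each CE to exactly one platform rather than a mixed SL4/SL5 queue.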
J.Gordon noted that WLCG asked that a number of Sites run the CREAM-CE, and that target was achieved; but maybe not enough of them are supporting ALICE.
Ph.Charpentier asked what happens now with the request. Who is going to follow it up?
Y.Schutz asked for a report at the GDB about the progress of these requests.
J.Gordon replied that the request had already been made and will also be reported at the GDB. The MB endorses the request, but the VOs should follow up on the progress.
M.Schulz added that a global target can be added to the milestones for the Sites.
2. Action List Review (List of actions)
The MB Action List was not reviewed this week.
3. Weekly Report (Slides)
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.
All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
3.1 GGUS Tickets and Alarms
There were three alarm tickets:
- From CMS to CERN on 12 August when an ETICS virtual machine flooded network with DHCP requests (not reproducible) overloading a network switch then a router in front of castorcms and castoralice. Switch was stopped overnight.
- ATLAS to DE-KIT on 9 August (when MC disk space filled up) exceptionally to keep MC Production running. 2TB added (with thanks from ATLAS). But KIT think this did not warrant an alarm ticket – experiment production space should be well planned – ATLAS agree but at the time did not know status of some obsolete data deletion services at FZK.
- ATLAS to RAL on 7 August for hanging LFC connections. Front end servers were rapidly restarted fixing problem.
Incidents leading to service incident reports:
- RAL air conditioning (chiller) failure from 12-17 August. Draft SIR available (see later report).
- Fibre cut between Madrid and Geneva – LHCOPN primary and secondary down. SIR requested.
Slide 4 shows the VO Availability results and one can notice:
- NIKHEF performed its scheduled move to the new computer centre from the 10th to the 14th; normal servers were back up on the 14th. This week disk servers and worker nodes are being moved.
- The earlier but still ongoing ALICE downtime at SARA was due to a failed ALICE VObox there. There is a separate ALICE VObox at NIKHEF – it is not clear if/how the two are coupled.
- RAL had air conditioning stoppages due to water chillers’ failures starting on 12th continuing till Monday 17th.
On the RAL Air Conditioning stoppage the following report was sent. The SIR from the LHC-OPN is still expected.
3.2 Miscellaneous Reports
CERN responded to the Red Hat 4 and 5 zero-day exploit of loaded modules that was exposed on the morning of Friday 14th August. The workaround of modifying /etc/modprobe.conf was validated as good enough for the weekend for protecting lxplus/lxbatch. It was rapidly propagated, but nodes with a suspect module already loaded (mostly bluetooth was found) needed a reboot – one lxplus node (of 50) and 150 lxbatch nodes (of 1000).
Experiment-info list was informed of reboots with apologies for any losses and also of later decision to reboot lxbuild machines. Linux for controls list was also advised to include WAN accessible DAQ and accelerator services. CERN actions were minuted in the daily report.
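The minutes do not record the exact contents of the workaround; a sketch of the kind of /etc/modprobe.conf change typically used against this class of exploit is shown below. It blocks the automatic loading of rarely used protocol modules; the module list is illustrative only, not the list actually applied at CERN.

```conf
# Additions to /etc/modprobe.conf (sketch; module list is illustrative).
# Mapping "install <module>" to /bin/true makes the kernel's module
# auto-loader run a no-op instead of loading the module, so an
# unprivileged user cannot trigger the vulnerable code path.
install bluetooth /bin/true
install appletalk /bin/true
install ipx /bin/true
install sctp /bin/true
```

Note that, as the minutes observe, this only prevents new loads: nodes where a suspect module was already loaded still required a reboot.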
The fibre from Madrid to Geneva was cut at about 12:30 on Tuesday 4th August by public construction work, taking out both the primary and secondary OPN links. The first indication was a GGUS ticket from an ATLAS user at 17:04 on 4 August reporting file transfer errors in the dashboard DDM web page for PIC_DATADISK, where PIC thought the problems lay with FZK and INFN. The reply from the OPN was then confusing, as it referred to previous non-OPN routing problems between PIC and FZK/CNAF. Did PIC know the OPN was down? Probably not. And why not?
This ticket was superseded by an ATLAS team ticket at 20:00. The reply on 5 August did mention that the OPN was down and referred to a GGUS LHCOPN ticket; these cross references are part of the agreed procedure. The Tier-1s were confused between the issue of routing while the OPN is down and the OPN failure itself, and the tickets reflect this.
The main concern is why there was no GGUS LHCOPN ticket earlier if the break happened at 12.30 on 4th as this would have saved Experiment and Sites a lot of time. We cannot see any dashboard style monitoring available to us of the OPN – some monitoring is password protected.
Total downtime was 26 hours and a SIR has been requested.
Several FZK disk servers (ALICE and LHCb) were hit over several days by a bug in the BIOS that falsely detected overheating and shut down the machine.
In addition, one of three FZK tape robots broke down several times from 8 August leaving the tape service intermittently degraded. Finally fixed on 13 August.
Following ATLAS' confirmation, WLCG is now formally requesting all sites to upgrade their worker nodes to SL(C)5.
In summary the main issues of the last two weeks were:
- Should space exhaustion warrant an alarm ticket to save production time?
- This month’s computer centre infrastructure failure was air conditioning at RAL.
- LHCOPN monitoring/alarms/information flow for the sites and experiments needs improvement.
- Serious zero-day exploit on SL4 and 5 diverted a large amount of expertise.
- Universal migration to SL(C)5 worker nodes requested.
D.Barberis thanked FZK for adding disk upon ATLAS' request. A disk space request should not be an alarm, but the persons on shift were not experts and that was the only solution they found. The lack of disk space was caused by the deletion scripts not working.
J.Gordon added that installations of Red Hat 5 that do not run SELinux may have to install, as critical, additional security updates that would otherwise not be needed. If ATLAS allowed the use of SELinux it would help the Sites.
D.Barberis replied that the only reason is the error code libraries.
D.Duellmann added that the problem is due to the ORACLE client, and the new ORACLE client 11 still has to be validated in the Applications Area.
Ph.Charpentier added that the AF stated the request, and it is independent of which Experiment needs it. It should apply to all Sites, for all VOs.
J.Gordon noted that if ATLAS requires something, a Site not supporting ATLAS could avoid applying the patch.
P.Mato noted that other VOs (e.g. LHCb) may also use the same ORACLE client, and a Site should not look only at the specific VO making the request.
J.Shiers suggested that the issue is clarified and reported to the MB.
D.Barberis highlighted the fact that even if a new release is deployed the old ones will still be required for 12 months (i.e. until the end of 2010).
J.Gordon noted that in this way old security issues will remain open until all Sites can enable the SELinux extensions.
4. Update on the EU Proposals (Slides) – J.Shiers
J.Shiers presented the status of the EU proposals and asked for input from the Experiments. There were references to several documents and material (links above) but the situation is summarized in the Slides attached.
The main repository of information is the HEP SSC wiki: https://twiki.cern.ch/twiki/bin/view/LCG/HEPSSCPreparationWiki
From this page one can find pointers to the Indico category, the mailing list and all documents and presentations (check uploaded material on the wiki page).
The previous update to MB was made on July 7th (link).
Since that date several events have happened:
- The call has officially opened (July 30)
- HEP has provided input to the EGI editorial board on its outline plans (July 31)
- A “HEP” member of the editorial board has been proposed (J.Shiers)
- Timetables for preparing the relevant proposals in response to 1.2.1 (EGI) & 1.2.2 (Service Deployment) + 1.2.3 (Virtual Research Communities) have been prepared
- Laura Perini is coordinating the preparation of the single proposal to cover 1.2.1 & 1.2.2
- Cal Loomis is coordinating the “P0” proposal (see e-mail link ) for an SSC against 1.2.3
4.2 Manpower Estimates
The table below was used in response to the editorial board. The funding from the EU will be 50% of the resources below (except for row 11).
Following yesterday’s call there are the following changes:
- Move 1 and 2 (liaison functions) to SSC
- Diane and AMGA: combination of other communities (e.g. AMGA) and / or SSC
The remaining estimated effort (6 + 10 + 4 + 2) is “about right” but needs further (“final”) discussion plus justification.
- 22 FTE at 50% EU funding over 3 years = 11 x 3 x EUR100K = EUR3.3M
- The SA4 total is expected to be capped at EUR5M (out of the EUR25M total for 1.2.1 & 1.2.2)
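The cost arithmetic above can be written out term by term (EUR100K per FTE-year is the figure used in the slide):

```latex
\underbrace{22~\text{FTE} \times 50\%}_{11~\text{EU-funded FTE}}
\;\times\; 3~\text{years} \;\times\; \text{EUR}\,100\text{K/FTE-year}
\;=\; \text{EUR}\,3.3\text{M}
```

This is well within the stated EUR5M cap, leaving headroom for the items still under discussion.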
Ph.Charpentier asked about the resources for Diane and AMGA.
J.Shiers replied that they were removed from the table above and will be covered by other sciences if they want them. The total in the table above is now 22 FTEs, not 26 as before the changes.
4.3 Draft “EGI Proper” Proposal
"EGI Proper", coordinated by L.Perini, will cover 1.2.1 & 1.2.2. It is a working title and will change to a meaningful acronym.
From the Activity SA4 (“Services for Communities of Heavy Grid Users”) document (link) the Communities identified as Heavy Users Communities (HUCs) are:
- High Energy Physics (HEP)
- Life Sciences (LS)
- Astronomy and Astrophysics (AA)
- Computational Chemistry and Material Sciences (CMST)
- Earth Sciences (ES)
The tasks currently foreseen in the SA draft are as follows (all prefixed by TSA4., e.g. TSA4.1 for management):
- Hosting of Community Specific Services (TSA4.2)
- Hosting of VO Specific Services (TSA4.3)
- Support of Frameworks (TSA4.4)
- Support of Scientific Gateways
- Support for interoperability within SA4 services
- Support for transition to SA4 services
4.4 Next Steps
There is too little information on the VO services needed; the Experiments' input is needed urgently.
We should write 3 task descriptions covering the above areas ~ 1 page (perhaps more for VO services)
- Deadline is Friday 21st August
- Drafts are attached (pages 3 to 6 in link)
The next con-call is Thursday at 16:00. From Thursday's call on, the focus moves to SSC preparation.
- SSC deadline: Monday 31st August
Propose to use next Tuesday’s MB slot for further SSC discussions, plus probably Thursday too (both at 16:00 UTC+2).
There is still time for further iteration prior to, during and even after EGEE’09 in Barcelona! (HEP SD & SSC sessions).
The current SSC proposals are:
- EGI SSC P0 (this proposal): High-Energy Physics (HEP), Life Science (LS), Comp. Chem./Material Science (CMST), Complex Systems (CS), and Grid Observatory (GO) (C. Loomis, CNRS, FR)
Deadline: end August 2009
- EGI SSC P1: Astron/Astrophysics, Earth Science, Fusion (Claudio Vuerli, INAF, IT)
- CUE: Dissemination, Training, Outreach to business (Roberto Barbera, INFN, IT)
The future timetable is:
- Thursday 20th Aug: “HEP SSC” con-call @ 16:00
– Iterate on SA4 input to 1.2.1
– Start to discuss plans for P0 SSC for 1.2.3
- Tuesday 25th Aug: “HEP SSC” con-call in MB slot
– Any feedback from SA4 input to 1.2.1? Actions?
– Drafts of sub-contributions to the HEP SSC
Further calls and meetings are to be defined and several sessions already scheduled at EGEE’09.
Ph.Charpentier asked whether Hosting of Community Specific Services (TSA4.2) and Hosting of VO Specific Services (TSA4.3) include providing the software to run and the client software needed to use such Services (e.g. FTS, DPM, LFC, etc.). Support of the dashboard Services also needs to be clarified. He also noted that the expected deadline was September, not the end of August.
J.Shiers replied that the proposals all together must provide all that is needed, but it does not all have to be in one proposal. Development should not be included, but support and evolution of the needed software can be.
L.Perini added that if the call specifies Service Deployment it may include support of the needed software. Further development can be added to the JRA (Joint Research Activities), not to the SA activities.
M.Kasemann added that also the liaison functions with the Experiments should be added to the manpower requests.
J.Shiers agreed but reminded that major changes in the manpower data cannot be done.
Ph.Charpentier asked what the Analysis Task for WLCG is, and whether the VO services should all be listed (Panda, DIRAC, PhEDEx, AliEn, etc.).
J.Shiers replied that the Analysis Support seems to require 2 more people. The list of what the VO needs should be provided as their input. Each Experiment could specify, as input, their requests in a page with all the software and services expected.
5.1 Visits to the Tier-1 Sites
M.Kasemann proposed organising visits to the Tier-1 Sites. The goal is not to scrutinize the situation but to tune expectations to the current status of the Sites and to meet the people working there.
J.Shiers replied that WLCG would be interested and D.Barberis added that also ATLAS would participate and supported the proposal.
Which Sites could be visited should be discussed at the next meeting in 2 weeks.
6. Summary of New Actions
No new actions.