LCG Management Board
Tuesday 12 May 2009 16:00-18:00 – F2F Meeting
(Version 1 – 18.5.2009)
A.Aimar (notes), I.Bird (chair), D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, S.Foffano, Qin Gang, J.Gordon, F.Hernandez, M.Kasemann, S.Lin, H.Marten, P.Mato, P.Mendez Lorenzo, G.Merino, A.Pace, R.Pordes, H.Renshall, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon
Mailing List Archive
Tuesday 26 May 2009 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments received about the minutes. The minutes of the previous MB meeting were approved.
2. Action List Review (List of actions)
No progress for CMS.
J.Templon added that ALICE replied positively with a few comments and NL-T1 has only to implement some minor changes.
The Experiments agreed to present their dataflow and rates at May’s GDB (on the following day).
Later in this meeting.
Started, but not yet completed.
L.Dell’Agnello added that the discussion is still on-going between CNAF and the Security team.
On User Accounting:
J.Gordon will regularly add the list of Sites that publish the information. OSG does not have the DN but only the CN information; the result is that individuals will have separate EGEE and OSG identities in the accounting reports.
This is more a GDB issue.
Done by S.Foffano.
To be done.
To be done.
3. LCG Operations Weekly Report (Slides) – H.Renshall
Summary of status and progress of the LCG Operations. It covers the activities since the last MB meeting.
The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
This report covers the service for the two weeks, i.e. the period 26 April to 9 May.
The GGUS ticket rate was normal and no alarm tickets were posted.
Two incidents leading to SIRs were submitted to WLCG Operations:
- IN2P3: 03 May, 44 hours, cooling system down
- SARA: 04 May, 36 hours, MSS tape backend down
Two Central Service (short) outages with post-mortem reports:
- CERN: 05 May 2.5 hours myproxy service down
- CERN: 06 May 12 hours new CERN CA too large
3.2 GGUS Tickets
The SAM results for each VO are represented below.
In the dashboard above one can observe several “red” periods for all VOs.
- CE test failure on lcgce004.gridpp.rl.ac.uk (but use CREAM)
P.Mendez-Lorenzo noted that probably some ALICE tests are still using the LCG-CE while RAL has moved completely to CREAM CEs.
- CE test failure on ce-001.cnaf.infn.it
- Low efficiency of data transfer to AGLT2 cosmics muon calibration site (site problem).
CASTOR and LFC data: nearly ready to restart functional tests as a prerequisite to participating in STEP’09.
L.Dell’Agnello noted that ce-001.cnaf.infn.it is a test CE and should not be tested.
H.Renshall replied that he will check the situation with ATLAS.
- IN2P3 cooling failure took down most services.
- IN2P3 cooling failure
- Multiple problems at CNAF from invalid tURLs, configuration of the new pilot role in StoRM, and follow-on ACL problems.
L.Dell’Agnello added that the new “pilot” role requested by the VOs causes problems with the current StoRM configuration. There is a workaround, but the problem was difficult to track down.
3.3 Experiments Activities
- Migration of PANDA (Production and Distributed Analysis) from the Oracle test cluster to the production cluster was completed on 6 May. The final step of the move from BNL to CERN will be to integrate the US and French clouds on 11 May.
- Scheduled CERN castoratlas upgrade generated many ATLAS helpdesk calls including from shifters. Better communication is needed.
- Noticed that some 24TB of ‘free space’ disappeared after the upgrade – in fact this is correct, because offline pools (usually temporary) are no longer included, in order to avoid space overcommitment.
- Took cosmics data last week. Data was then exported and reconstructed as a computing exercise only (no physics interest).
Completed the first phase of MC09 generation of 10 million minimum bias events. Some problems merging outputs into 5 GB files: hitting a size limit at FZK and ACL problems at CNAF. Preparing to start generation of 10^9 events.
Moving very large files (>~100 GB) which are too big for PIC worker nodes; they probably need to rethink their policy. Transfer timeouts are also being seen for files >~50 GB.
M.Kasemann noted that the issues should now be solved.
3.4 IN2P3 Cooling Incident
The entire batch system and the front-end services to the MSS were out of service from Sunday 3 May at 21:00, after a high temperature level, which had already led to 200 WNs being stopped, triggered a chiller outage and a chain reaction taking all chillers down.
Progressive restart of the services was done from Monday 4 May onwards, as the cooling system was back to an operational state during Sunday night.
A second incident happened on Wednesday at 75% load.
Action: Reinforcement of the cooling system is now planned for early June at the earliest; meanwhile the IN2P3 site is running in degraded mode at about 65% of batch capacity.
F.Hernandez added that the cooling upgrade was due earlier in 2009 but was delayed; otherwise the incident would not have happened. They plan to stay at the 65% level until the cooling system is upgraded.
3.5 NL-T1 Backend Outage
On Monday 4 May at 15:11 CET the dmf daemon (part of the SGI Data Management Facility, managing tape migration/recall) crashed, and some ATLAS data exports failed. Unfortunately this was only discovered on Tuesday 5 May, a national holiday in the Netherlands, and could only be fixed on Wednesday 6 May.
While investigating what happened on 4 May, it was noticed that the backup services at SARA also had trouble starting at exactly the same point in time. They do not have sufficient logging material to pinpoint the problem exactly, but the evidence so far suggests a hiccup on their storage area network as the probable cause.
Action: As a result, they will monitor the dmf daemon more closely (via Nagios). This work was already planned.
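Monitoring of this kind is typically done with a small Nagios-style plugin. The sketch below is only an illustration: the process name "dmfdaemon" and the use of pgrep are assumptions, not details from the report; only the standard Nagios exit-code convention (0=OK, 2=CRITICAL) is relied upon.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_process(name):
    """Return OK if a process with exactly this name is running,
    CRITICAL otherwise (checked via pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    if result.returncode == 0:
        print("OK - %s is running" % name)
        return OK
    print("CRITICAL - %s is not running" % name)
    return CRITICAL

# "dmfdaemon" is a hypothetical name for the DMF daemon's process.
status = check_process("dmfdaemon")
```

Nagios would schedule such a check periodically and raise an alarm on a CRITICAL result, which is what "monitor the dmf daemon more closely" amounts to in practice.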
J.Templon added that a parallel problem was occurring: a number of tape drives were malfunctioning, which made this major issue more difficult to detect.
3.6 Central Services Outage no 1 (MyProxy)
The MyProxy service was not working at all (upload, download or info) on 4 May 2009 from ~08:50 to 11:15. The problem affected users, and two GGUS tickets were opened: CMS CRAB (analysis) jobs could not be submitted, and ALICE users failed executing the myproxy-init command.
The problem was a subject mismatch in the new host certificates replaced that morning at ~08:50, following the procedure currently in place.
As follow-up, if there is a mismatch between the existing host certificate subject and the new one being requested, a warning should be generated and the host certificate should not be replaced.
3.7 Central Services Outage no 2 (CERN CA)
Following the update of the CA rpms to version 1.29-1 on Wednesday 6 May 2009, several IT and Experiment services were affected by failing client-based certificate authentication: the ATLAS Central Catalog, Nagios services, SAM Portal and GridView Portal. The problem was due to the increased number of certificate authorities (the length of the string), and a rollback to the previous rpms was made.
Analysis of the CAs showed 5 that are not used by HEP, and new rpms were made without them, even though removing them violates IGTF rules.
This is seen as a temporary solution and a Savannah bug has been raised. Meanwhile, new CA upgrades will be tested more carefully on the PPS.
J.Templon noted that Sites could accept CAs which are not banned by the IGTF.
3.8 WLCG Services Summary
The SCOD rota is working smoothly; the new ideas from different members had a positive impact. Better (more complete and consistent) reporting of problems from some sites (for the STEP’09 site metric).
There are still some long unscheduled downtimes with little or no explanation beyond that in the EGEE broadcast. An automatic report of interventions in the previous period would really help as a KPI; can the CIC reports contribute?
The rule “Never on a Friday” – do not schedule interventions on Fridays or before holidays, no matter how lucky you feel. This applies to all of CERN too.
Despite all efforts, awareness of major interventions not always optimal.
For STEP’09 planning Sites should update their Tier 1 capacities.
K.Bos noted that ATLAS needs to know the status of the Sites for STEP’09, and a round table should be done at the GDB. He stressed that at least the basic functional tests must pass at the Sites. ASGC does not seem ready to pass these tests. ASGC should focus on making the Tier-1 work, not keep working on DPM; DPM is not a priority for ATLAS. ATLAS needs the Tier-1 working urgently.
Qin Gang replied that the priorities will be redefined and the Tier-1 will be the focus of the ASGC work. Maybe this was not clear to the local administrators until now.
I.Bird asked for a clear statement of the estimate, with dates and milestones, for setting the Tier-1 up. A clear set of operations with dates should be sent to the MB.
Action: ASGC should distribute a detailed plan on setting up the Tier-1 systems.
4. Update on EGI (Slides) – I.Bird
I.Bird summarized the status and progress on the EGI preparation.
4.1 Plans after EGEE: EGI
The goal of EGI is to provide the long-term sustainability of grid infrastructures in Europe. The approach followed is the establishment of a new federated model bringing together NGIs to build the EGI Organisation and to provide coordination and operation of a common multi-national, multi-disciplinary Grid infrastructure in order to:
- enable and support international Grid-based collaboration
- provide support and added value to NGIs
- liaise with corresponding infrastructures outside Europe
Below is the timeline expected:
A transition plan is being defined for after EGEE.
The current status of the EGI preparation is:
- Final blueprint published end December
- EGEE transition plan produced; it helped with the final blueprint. While the transition cannot be fully implemented, it clarifies the expected state at the end of the project
- EGI.org + NGIs will take over the infrastructure – transition plan
- EGI_DS Policy Board has selected Amsterdam as location for EGI.org.
The establishment of the EGI organisation will require that:
- A Council of NGIs is formed, with an initial MoU (LoI in the first instance) for those that participate. At least 10 NGIs should sign in order to launch EGI.
- EGI.org must appoint a Director and form teams to develop transition plans, etc.
The WLCG OB has made a statement of support for the process and of willingness to work with EGI.org and the NGIs; it expects to participate via the User Forum.
4.2 EGI Timescales and Funding
Below are the timescales, as presently understood.
Early April: MoU and LoI available.
End April: LoI signed by interested NGIs.
May 6: (proto)EGI Council established
- NGIs signing LoI are constituents
- The EGI Project(s) team confirmed; this includes the project director(s) identification/confirmation (the person who will lead the team(s))
June: MoU signed, (full)EGI Council setup
- “The Transition towards EGI” Deliverable published
March—May: EGI.org setup preparation
- Includes search for EGI.org director and identification of EGI.org key personnel
June: EGI.org director appointed/identified
September: EGI.org setup at the latest (one month before the call closure)
July—December: EGI Project(s) preparation and submission
The MoU signing will continue after June, but the “latecomers” may not have direct influence on the composition of the project preparation team or on the selection of the EGI.org director.
The funding can come from 3 EC Calls, closing in November:
- EGI.org and ongoing support for existing large application communities (targeted call)
- Community support; Specialised Support Centres (SSC in EGI blueprint)
- Middleware, repositories, etc
4.3 CERN’s and WLCG Proposed Role
CERN’s role will have 3 facets:
- As part of EGI.org
- As an SSC for HEP
- As part of a middleware consortium, with a need to support existing middleware. What is the future?
The WLCG should actively work in:
- Ensuring that the service does not break in the transition
- Ensuring that WLCG requirements are communicated to EGI.org and all the NGIs
- Evolution of the service, middleware, etc. What is the mechanism for discussion and interaction?
- Support for sites in countries outside Europe and the US that today have a relationship with EGEE for support/coordination: Latin America, Asia, Canada. And others?
5. WLCG Technical Forum (Slides; More Information) – I.Bird
The need for a WLCG Technical Forum has already been agreed in previous meetings.
A technical forum is needed to discuss issues of evolution, improvement, etc. among WLCG stakeholders; it should also be able to provide input on a common WLCG position to software providers (e.g. EGEE, EGI, OSG).
It should be a forum in which all WLCG parties can:
- discuss longer term needs and developments of middleware and other services
- prepare for the sustainability and evolution of the existing middleware in the light of changing technology and experience
- possibly think again of common solutions in some areas where it is clear that existing solutions are weak
It needs to represent all the stakeholders – Experiments, Sites, Grid projects, etc.; but should not be too large and it should, depending on the topic, bring in the appropriate experts when necessary.
The Forum does not take final decisions but should produce clear documents for discussion in the GDB, and for endorsement by the WLCG MB.
The mandate proposed is the following:
Provide detailed technical feedback and requirements on grid middleware and services, providing the WLCG position on technical issues:
- Advise the MB on grid middleware and service needs and problems.
- Discuss the evolution and development of middleware and services, in particular driving common solutions where possible.
- Prioritise WLCG needs and requirements.
- Report to the WLCG MB
5.3 Proposed Membership
- Maarten Litmaath
- Experiments: 1+1 per experiment
- Sites: 1+1 for Tier 1s; 1+1 for Tier 2s; 1+1 for Tier 0.
- Infrastructures: 1+1 from each EGEE, OSG, ARC
- Bring in additional experts as needed
- Members must have the mandate to speak for their community (of course also talking to them about the issues)
- Suggest setting up a strong non-face-to-face channel
- Monthly face to face/video meeting
5.4 Initial Topics
The initial topics that need to be addressed are:
- Data management – review where we are with SRM vs. actual needs
- Analysis model topics
- Support for virtualisation (many sub topics)
- Pilot jobs support
It should also provide WLCG technical input to the EGI discussions:
- Needs from the NGIs
- Input to proposal for the SSC for HEP
- Input to the middleware discussion – we need to be explicit and clear about the WLCG needs in the near future
J.Templon pointed out that the Tier-1 representative(s) are from Sites that also provide grid services to several other communities, not only WLCG.
I.Bird noted that the Sites’ representatives must represent the Tier-1 opinions and not their own specific ones, even if they do not agree with them. Sites are not there to represent the needs of non-WLCG VOs.
J.Gordon asked what will be the interaction with the GDB.
I.Bird replied that at the GDB the Forum’s discussions will be presented to the wider audience for feedback and comments.
L.Dell’Agnello noted that the Tier-1 representative must organize meetings with the other Tier-1 Sites, in order to really represent the Tier-1 views and not their own.
Ph.Charpentier noted that there is a need for a technical forum to discuss the current “small but important” matters: something à la Architects Forum, not à la Baseline Services WG. Current urgent issues should be discussed (priorities, urgent bugs, etc.).
M.Kasemann agreed that part of the mandate should include operational problems to follow and solve, for instance the change of SRM error codes discussed a few weeks ago at the MB.
I.Bird replied that this is also included in the mandate in the part about “Advise the MB on grid middleware and service needs and problems”.
M.Kasemann agreed that some kind of “Middleware Architects Forum” is needed.
M.Schulz added that it is important that the Forum discusses prioritization of ongoing matters. The Forum should foster what is needed by the WLCG, even if it does not match what is expected by other VOs (e.g. phasing out the LCG CE).
J.Templon noted that one has to be careful because this forum cannot take decisions for all VOs, including those that are not represented. For instance deciding to move to SRM V2 as default in lcg-utils has an impact on other VOs therefore the decision cannot be taken by this Forum.
O.Smirnova noted that other middleware needs to have a common forum with some power about priorities on common issues (e.g. SRM). It should not just be one more input from the Experiments and Sites.
J.Gordon replied that it should be similar to the TMB. But sometimes the Experiments’ TMB representatives do not present the same views as presented at the MB; Experiments have to make sure that their views are represented consistently.
I.Bird proposed to launch the Forum with these guidelines and then see how it progresses on the short- and long-term topics. If MB members have input on this mandate and guidelines, they should send it to him.
6. Proposal for MSS Metrics Dashboard (Current Metrics; Slides) – A.Aimar
A.Aimar presented a proposal for collecting MSS Metrics from the Tier-0 and Tier-1 Sites.
6.1 Background Information
Currently the MSS Metrics are published manually on a wiki page (https://twiki.cern.ch/twiki/bin/view/LCG/MssEfficiency) but are updated irregularly, except by a few Sites (e.g. FNAL). In addition, the agreed update frequency is only weekly.
In order to provide constant monitoring of the situation, for STEP’09 and beyond, a more automated and frequent method is needed.
6.2 Proposal of Using SLS
The proposal is to move to a method for automatically collecting this information using SLS. SLS is the method already used at CERN for all Services and has also been adopted by the WLCG and some of the Experiments (ATLAS and LHCb). SLS is an easy way to collect all this information without any development. Thanks to Di Girolamo for his help on SLS.
An SLS page would be used to visualise the overall status of the WLCG Tier-0 and Tier-1 MSS Systems. Below is the SLS usage already in place for the WLCG Services.
A page for the WLCG Tier-1 Tape Systems would summarize the global status.
And one can also see the details of each Site. Currently the Additional Data Service Information contains the 8 values published in the manually-updated wiki page (with fake “1234” data in the image below). The information in the SLS pages can be customized (e.g. manager, etc.).
As ATLAS already does (shown in the page below), one can then obtain graphs of the published data and show those graphs directly in other pages outside SLS: in specific pages, iGoogle, or other web page aggregators.
In order to provide the information, each Site should:
- Publish at a reachable http URL an XML file with a simple but precise format.
- In the XML file, each data value must be updated by the Site:
<numericvalue name="Total_Data_Rate_Read" desc="Total Data Rate Read" timestamp="2009-05-11T15:48:45">1234</numericvalue>
<numericvalue name="Total_Data_Rate_Write" desc="Total Data Rate Write" timestamp="2009-05-11T15:48:45">1234</numericvalue>
Which Tape Metrics should be published should be agreed among Sites and Experiments; the proposal is to start from the list provided by CERN.
An XML file per Site will be provided temporarily, with all data values set to “0”, until the Site provides the file with actual data.
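The publication step a Site would automate can be sketched as follows. This is a minimal illustration: only the <numericvalue> element format is taken from the example above; the metric list, dummy values and fixed timestamp are assumptions, and the agreed metric names would come from the CERN proposal.

```python
from datetime import datetime

def sls_numericvalue(name, desc, value, ts):
    """Format one metric as an SLS <numericvalue> element,
    following the format shown in the example above."""
    stamp = ts.strftime("%Y-%m-%dT%H:%M:%S")
    return ('<numericvalue name="%s" desc="%s" timestamp="%s">%s</numericvalue>'
            % (name, desc, stamp, value))

# Illustrative metrics with the dummy "1234" values; the agreed
# list should be taken from the CERN proposal.
metrics = [
    ("Total_Data_Rate_Read", "Total Data Rate Read", 1234),
    ("Total_Data_Rate_Write", "Total Data Rate Write", 1234),
]

# Fixed timestamp so the example is reproducible; a site's cron job
# would use the actual collection time instead.
now = datetime(2009, 5, 11, 15, 48, 45)
for name, desc, value in metrics:
    print(sls_numericvalue(name, desc, value, now))
```

A Site would run such a script periodically (e.g. hourly), write the output into the XML file at the agreed http URL, and SLS would pull it from there.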
J.Gordon, H.Marten and L.Dell’Agnello noted that currently the data in the wiki page is collected at the Sites once a week, for instance using scripts analysing the log files, not in a live manner. Not all sensors are available yet; Sites without sensors will have to implement them.
A.Aimar replied that, while Sites automate the collection of this information, they can manually update the XML file; but then the frequency will only be weekly, which is not adequate for live monitoring of the Tape Systems. This is clearly not a good solution.
L.Dell’Agnello noted that Sites should write sensors to collect this data.
A.Aimar noted that all Sites have rich displays of this kind of information but all in different formats and sometimes only on web pages accessible locally.
We need this information formatted in a uniform way and accessible via http from CERN. SLS simply pulls these XML files and updates the SLS dashboards.
Ph.Charpentier noted that the information in SLS makes sense only if it is updated at a high frequency (e.g. one hour frequency) and not manually. Experiments would really benefit from this quasi-live information.
A.Aimar agreed that the information should be collected frequently; the idea of updating the XML manually is obviously not going to work, because the data will be marked as obsolete in SLS.
Ph.Charpentier noted that Sites have advanced but site-specific systems, and this is not easy for the Experiments: monitoring information is available in 10 different ways, depending on the Site.
J.Gordon noted that originally I.Bird asked that Sites state themselves whether they can reach the agreed rates and take care of the bottlenecks they detect.
A.Aimar added that the Experiments have also asked to be able to see this information in a more uniform way; for this reason the manually-updated wiki page was created.
K.Bos supported this proposal and asked that it be implemented as soon as possible. Experiments need to know the rates reached at every Site, with frequently updated information.
H.Marten noted that all Sites should then publish the same information, in the same units, etc., in order to have a uniform view of the Tier-1 Sites.
F.Hernandez noted that IN2P3 does not have rates per Experiment because their tape system is shared by several VOs even outside WLCG.
J.Gordon replied that even if it is a global value it is still useful for the Experiments, in order to see whether a Site is overloaded and has a long queue of jobs waiting for the MSS system to respond. Even the total is better than nothing.
T.Cass noted that the SLS pages can be accessed by anyone authenticated at CERN.
A.Aimar noted that the same authentication level is already required for the manually-updated wiki page.
J.Templon asked whether all that Sites have to provide is this XML file, updated hourly for instance.
A.Aimar replied positively and added that the next steps should be:
- Experiments check whether the proposed metrics are adequate
- Sites specify which metrics they can already provide and whether they are per VO or global.
- A.Aimar will prepare the dummy XML file for each Site to fill and send the instructions to be followed.
K.Bos added that all metrics are useful to understand the situation when there is a problem. The minimal set should include the “data in/out” and “number of files in/out” per space token, so that they can understand whether it has to do with accelerator data, MC production, etc. The set presented by CERN is a good set of metrics.
M.Kasemann proposed to start from the metrics in the CERN proposal.
T.Cass added that the list from CERN is the data currently collected in SLS and it is the result of the experience of working with the Experiments for several years.
The Experiments agreed on the metrics needed; the CERN set is considered a good starting point.
- Sites will reply which information they can already provide and whether it can be reported per Experiment.
7. Tier-1 and Tier-2 SAM Reports (Tier1_Reliab_200904.zip; Tier2_Reliab_200904.pdf; comments_received; email_received)
A.Aimar collected comments from Sites and Experiments about the SAM results (Tier1_Reliab_200904.zip) for April 2009:
- From the Sites about the OPS Tests
- From the Experiments about their VO-specific Tests
J.Gordon commented that the scheduled down-times are already described in the broadcasts and in the GOCDB database. Sites should not be requested to repeat the same report at the MB.
A.Aimar replied that, now that reliability is reported regularly, it is also possible to check the reasons for which Sites schedule themselves down. Maybe not all Sites have the same criteria: some provide partial services, while others prefer to schedule themselves down. He will check how to extract the information from the GOCDB.
J.Gordon asked whether the MB actually wants to check the reasons why Sites schedule themselves down. The reasons could be interesting if the MB agrees to study the monthly report; otherwise they are not useful.
7.1 Comments on the OPS Tests
J.Templon commented that SARA and NIKHEF are now a single Site in the BDII and therefore the reports should now be about NL-T1. But the two old Sites have not been removed, and this requires some consolidation work.
J.Gordon suggested using this information for a few months; then one will see whether it is useful for the MB.
Action: A.Aimar will see how to extract the information about scheduled downtimes from GOCDB and report it to the MB every month.
Ph.Charpentier questioned again the algorithms used. He noted that if a Site is scheduled down for 24h it is considered available (reliable), whereas if it is scheduled down for 23h59 and unavailable for 1 minute it is considered as 24h unavailable (not reliable).
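The asymmetry he describes follows from the definition reliability = uptime / (total time − scheduled downtime). The sketch below illustrates it; treating a fully scheduled-down period as 100% reliable (zero denominator) is a stated assumption of this illustration, not a quoted WLCG rule.

```python
def reliability(up_minutes, scheduled_down_minutes, total_minutes=24 * 60):
    """Reliability = uptime / (total time - scheduled downtime).
    When the whole period is scheduled down the denominator is zero;
    by convention (an assumption here) the site counts as 100% reliable."""
    denom = total_minutes - scheduled_down_minutes
    if denom == 0:
        return 1.0
    return up_minutes / denom

# Scheduled down the full 24h: counted as fully reliable.
print(reliability(up_minutes=0, scheduled_down_minutes=24 * 60))        # 1.0
# Scheduled down 23h59 and unavailable the remaining minute:
# 0% reliable for the whole day.
print(reliability(up_minutes=0, scheduled_down_minutes=23 * 60 + 59))   # 0.0
```

One minute of difference in the scheduled downtime thus flips the daily figure from 100% to 0%, which is the point being questioned.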
7.2 Comments on the VO Tests
G.Merino commented that the results from the VO-specific tests do not match the availability from the dashboards. The VOs have other information and it does not always match. He would like a report from the Experiments about their different displays and error-reporting techniques. For instance, CMS has 2 tests in SAM and 8 tests in its dashboard.
P.Mendez Lorenzo commented that the ALICE dashboard has more checks than the SAM ALICE tests, which check just that the VOBoxes are running.
Action: The Experiments will have to explain their VO SAM tests and the tests that are reported in their dashboards.
9. Summary of New Actions