LCG Management Board
Tuesday 3 February 2009 - 16:00-17:00
(Version 1 – 05.2.2009)
A.Aimar (notes), D.Barberis, L.Betev, I.Bird (chair), D.Britton, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, S.Foffano, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, P.Mato, G.Merino, S.Newhouse, A.Pace, R.Pordes, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon
Tuesday 10 February 2009 16:00-17:00 – F2F Meeting
1. Minutes and Matters Arising (Minutes)
1.1 Minutes of Previous Meeting
No comments. The minutes of the previous MB meeting were approved.
2. Action List Review (List of actions)
27.1.2009: J.Templon reported that, in order to speed up the process, SA3 asked NIKHEF to do a test installation and deployment at NIKHEF before sending it to certification at CERN.
CERN: Done for ALICE, ATLAS, LHCb. CMS still need to agree on the SLA document.
NL-T1: J.Templon reported that the NL-T1 SLA has been sent to the Experiments for review and approval.
NDGF: O.Smirnova reported that NDGF has sent their SLA proposal to ALICE and is waiting for a reply.
Not done yet. I.Bird proposed that someone be appointed to complete this action.
Done. Discussed on the 3 February.
M.Schulz reported that there were two problems:
- Ticket opened 5 December, closed 8 December. The SAM tests of SRMv2 were timing out under heavy load, which was not reported. The time-out for these tests has been changed. The SAM team suggests that ATLAS use their own UI, not the one used by SAM for the OPS tests.
- The large amount of “unknown availability” was due to lock files not being cleaned up. They are there to avoid overlapping of tests. There is a LEMON alarm signalling these lock files for clean-up, but nobody receiving the alarm (A.Di Girolamo or J.Novak) was acting on it. This should now be fixed.
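The lock-file scheme described above can be sketched as follows. This is an illustrative Python sketch, not the actual SAM implementation; the lock path, timeout and function name are invented:

```python
import os
import tempfile
import time

# Invented path and timeout, for illustration only.
LOCK = os.path.join(tempfile.gettempdir(), "sam_srmv2.lock")
STALE_AFTER = 3600  # seconds before a leftover lock counts as stale

def run_exclusive(test):
    """Run a test only if no other instance holds the lock file.

    The lock prevents overlapping test runs; a leftover (stale) lock
    should raise an alarm for clean-up rather than silently block all
    future runs, which is what happened in the incident above."""
    if os.path.exists(LOCK):
        age = time.time() - os.path.getmtime(LOCK)
        if age > STALE_AFTER:
            print(f"ALARM: stale lock file ({age:.0f}s old) needs clean-up")
        return False  # another run in progress, or blocked by a stale lock
    open(LOCK, "w").close()
    try:
        test()
        return True
    finally:
        os.remove(LOCK)

run_exclusive(lambda: print("test executed"))
```

The key point is the explicit alarm on a stale lock: if nobody acts on it, every subsequent run is skipped and the availability becomes “unknown”.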
3. LCG Operations Weekly Report (Slides, Weekly Report) – J.Shiers
Summary of status and progress of the LCG Operations since last MB meeting. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
3.1 GGUS Summary
Nothing special and no alarms this week. CMS and ALICE were reminded that they should use GGUS routinely for support requests.
An FAQ page is in preparation, explaining the differences between “TEAM” tickets and normal tickets, which now go directly to sites.
3.2 ASGC Update
J.Shih connected to last Thursday’s 3D meeting and confirmed that they were on target to put the “3D”-related services back in production in the first week of February. The mid-February (pre-CASTOR F2F@RAL) deadline for a clean CASTOR+SRM+DB installation may well be too aggressive and needs to be followed up.
What can we learn from this experience?
- Standard configurations help: It is easier to diagnose problems, easier for sites to help each other and easier to avoid problems in the first place by following a well-trodden route;
- Manpower and knowledge levels must be adequate. The levels needed to support key WLCG services have to be clarified and eventually reviewed – what are affordable/reasonable support costs for, e.g., “curating a PB of data”?
The ASGC situation will be followed up in the next weeks.
3.3 LCG-cp Issues
This item spiked in the middle of the week (it was reported as fixed on Thursday) but highlights one or more weaknesses in the testing and roll-out procedure. Without being too specific to this particular problem, better coordination of testing and roll-out appears to be necessary. Comments are expected from the groups and Experiments concerned; this could be coordinated through the daily OPS meeting.
Ph.Charpentier noted that the problem was not actually fixed; the installation was rolled back.
A.Pace noted that the SRM issues are known, but the choice is between fixing the current implementation and writing a new version from scratch. The current tests do not cover all the code branches. The changes in the SRM server error codes caused major problems for the CASTOR SRM client, which was correcting for, and sometimes circumventing, the error codes from the servers.
J.Shiers noted that the preparation and delivery of a new version would take several months and is therefore not a viable option.
Ph.Charpentier noted that the test cases should at least cover the typical usage of the Experiments. This would reduce the late discovery of most problems.
D.Barberis noted that this period is adequate for changes; later, when data taking approaches, there will be no way to deploy new SRM software.
F.Donno suggested that every bug fix should also leave behind a test that verifies that the bug does not reappear.
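As an illustration of this practice, a fix can leave a small regression test in the suite. The function and the bug below are made up for the example, not taken from the actual lcg-cp/SRM code:

```python
# Hypothetical fixed function: suppose a trailing slash in a SURL once
# broke lookups (an invented defect, for illustration only).
def normalise_surl(surl: str) -> str:
    return surl.rstrip("/")  # the fix: strip any trailing slashes

def test_trailing_slash_regression():
    # Kept in the test suite so the original bug cannot silently reappear.
    assert normalise_surl("srm://host/path/") == "srm://host/path"
    assert normalise_surl("srm://host/path") == "srm://host/path"

test_trailing_slash_regression()
```

The test encodes the exact input that once failed, so any reintroduction of the bug is caught immediately by the suite.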
Ph.Charpentier asked to be informed when new versions are available in the Applications Area software repository. The Experiments did not know that a new version was available.
3.4 Oracle-related Issues
FZK: LFC and FTS at FZK were down over the weekend due to a problem with the Oracle backend. Oracle support was involved; the “SIR” has not yet been provided.
NL-T1: Later in the week, the migration of the LFC backend at NL-T1 to a RAC environment came with a huge performance loss. This is believed to be due to the “mainly write” load from ATLAS, plus the “new world record” in files per day reported by ATLAS, compared to the read-only access to the LHCb LFC.
J.Templon reported that now the LFC has been moved back to the old hardware and the migration will be redone with the lessons learned.
Big IDs: More big ID problems occurred this week, again at RAL. The ASGC DB is now back in production and will be resynchronised. All these problems seem to occur at the CASTOR Sites.
L.Dell’Agnello reported that the Big IDs problem has NOT appeared at CNAF until now.
3.5 Other Services Issues
The usual level of operational issues continues, but with improved (or at least improving) reporting by e-mail or at the meeting. It is hard to get the balance exactly right, but the level of detail provided now is about right. It is probably cheaper to provide a report than not, and it is a very convenient way of communicating (bi-directionally) with the community.
The fix for the FTS delegation bug has been back-ported to FTS 2.1 and it is in the hands of the certification team under patch #2760.
It has been suggested that a dedicated WLCG section of the weekly joint operations meeting is now redundant; removing it would help streamline the meetings and save some time.
3.6 File Loss at FZK NOT Reported
However, some significant problems do not get reported to the meeting and this should be corrected asap. For example, from CMS Facilities Operations on Friday last week:
This issue was NEVER reported to the Operations meetings. Both Sites and VOs should have reported it.
A.Heiss stated that the CMS representative in the GridKA Technical Advisory Board asked not to report that issue.
M.Kasemann added that CMS knew immediately about the issue, and at FZK they wanted to understand the issue before reporting it. But such a major issue should have been reported by both the Site and the VO.
A.Heiss asked what a Site should do if a VO at the Technical Advisory Board asks not to report.
I.Bird agreed that an issue need not be reported until it is understood; but once it is confirmed as a real problem it should be reported asap.
J.Gordon also proposed that the technical overview boards at the Sites be officially asked not to hide problems from the WLCG.
Ph.Charpentier noted that the Sites must report to the Operations meeting and then from there to the MB if needed.
4. Requirements to report WLCG Installed Capacity information (Slides) – R.Pordes
R.Pordes presented the method to automatically report installed capacity at the WLCG Sites.
This is the summary of the work done with F.Donno and J.Gordon.
The requirements document was extracted from the combined requirements/implementation document after the implementation was completed.
Here is the latest version:
There are 2 sets of requirements which differ in the rate at which they must be provided:
- Monthly capacities/usage for management reports
- Dynamic/timely information for resource availability and allocation.
R.Pordes focused on the sentences in bold as requirements (slide 3) from the Experiments and the MB.
For Disk Storage the values must be collected in the same way for all infrastructures. There are several possibilities, for example:
- Monthly average
- Value at the end of the month
- Highest value during the month
It is crucial that the infrastructures agree on the same values. Otherwise the reports are inconsistent.
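The three candidate figures can be computed from daily samples as in this sketch; the function name and sample values are hypothetical, chosen to show how the options diverge:

```python
def monthly_report_values(daily_tb):
    """Return the three candidate monthly disk-capacity figures
    discussed above, from a list of daily samples in TB."""
    return {
        "monthly_average": sum(daily_tb) / len(daily_tb),
        "end_of_month": daily_tb[-1],
        "highest": max(daily_tb),
    }

# A site down on the last day of the month: the end-of-month figure
# reports 0 TB although capacity was available on all the other days.
samples = [100, 120, 120, 0]
print(monthly_report_values(samples))
# {'monthly_average': 85.0, 'end_of_month': 0, 'highest': 120}
```

The example makes the inconsistency concrete: depending on the convention chosen, the same site reports 0, 85 or 120 TB for the same month.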
J.Templon noted that currently the value is the one at the time of filling the monthly report.
J.Gordon noted that the agreement was precisely to report the value at the end of the month, not the value in the middle of the following month, when the report is filled in or corrected by the Site.
R.Pordes asked whether it would not be better to report an average over the month instead. If a Site is down on the last day of the month, it reports no resources even though they were available on all the other days of that month.
J.Templon proposed that the highest value could be taken. The reason is that it gives an indication of how much was necessary at the peak of usage by a given VO.
I.Bird reminded the MB that this is the value reported to the management, not the dynamic value used by the Experiments to choose where to submit their jobs.
R.Pordes noted that actually both the average and the highest value provide useful information and should be reported.
The MB agreed that the topic should be discussed in the working group and a proposal brought to the MB.
The second set of requirements is for dynamic usage.
F.Donno explained that the points above are there to make sure that the schema is not changed in a non-compatible way.
R.Pordes agreed but noted that this is more a constraint than a requirement.
M.Kasemann asked how the disk in front of the MSS is accounted to the VOs.
J.Gordon reminded the MB that the VO pledges include, for instance, the disk caches in front of the MSS.
M.Kasemann concluded that then CMS must include this amount of disk in their revised requirements.
5. Status of OSG Collecting Installed Capacity (Slides) – R.Pordes
R.Pordes also summarized the status of the work going on in OSG in order to collect the installed capacity and provide the information to the APEL accounting collector.
US ATLAS and US CMS appreciate the additional 6 weeks for comment and review of the document and sign off on the WLCG Installed Capacity Document V1.8.
OSG will report capacity as agreed to in the current WLCG MOU resource commitments, not beyond. They will not report additional capacity provided beyond their pledges.
The monthly reporting information will be transmitted by OSG-Operations specifically to the report collectors (APEL, etc.).
- The mechanisms will be similar to those existing for the OSG reporting of accounting and availability information.
- The reporting will NOT be through the WLCG BDIIs used by the resource brokers and workload management systems.
I.Bird noted that the total resources of ATLAS and CMS will not be realistic if there are hidden resources in OSG.
R.Pordes replied that in the APEL accounting all the resources will be reported as dynamic information, but not in the monthly reports.
M.Kasemann noted that the accounting values will still be correct, like for sites that go beyond the pledges because of opportunistic usage of other resources.
J.Templon noted that if WLCG is “world-wide”, it is not good that OSG does not report everything.
I.Bird replied that he is ready to ask the US representatives at the RRB what resources they report.
M.Schulz asked why the information is not provided via the standard BDII.
R.Pordes replied that US ATLAS does not use it, but they may implement a dummy BDII in order to report.
OSG plan to participate in joint activities for
- planning the timeline and steps needed for deployment, analysis and validation of the information at the WLCG and
- Ongoing validation of the information.
Is F.Donno the contact for both of these? They expect a several-month (3-4) window during which the information is included in A.Aimar’s reports, there is focused and detailed validation, and the information is not reported at a level higher than the MB.
We plan to implement a means to fix data at the OSG management layer rather than going back to the site administrators.
With April 2009 as the time of the “APEL/CESGA” reports then July 2009 would be the timeframe for the reports to be published officially.
I.Bird replied that the reports have been shown to the LHCC to present the kind of reports that will be provided.
F.Donno is the contact person but someone else will have to be nominated in the coming months.
I.Bird asked for a milestone plan for getting the OSG accounting information into APEL.
R.Pordes agreed to provide, within 2 weeks, the milestones for OSG reporting installed capacity into APEL.
6. Update on the Installed Capacity Document (Slides) – F.Donno
F.Donno presented the comments received on the previous version of the document, both about computing and storage resources.
She also presented an operational plan for the next few months.
The new version of the document is v1.8 publicly available:
- It includes a description for pure xrootd installations
- Integration of the new benchmark HEP-SPEC
- A few more comments received from OSG
6.1 Computing Capacity: Changes (slide 3)
Sites that decide to publish CPU power using the new benchmark HEP-SPEC:
- MUST use the GlueHostBenchmarkSI00 attribute to publish CPU power. In this case, the power MUST be expressed in SpecInt2000 using the scaling factor that can be found in the proposal of the HEPiX group.
- MUST also publish the attribute GlueHostProcessorOtherDescription: Benchmark=<value>-HEP-SPEC, where <value> is the CPU power expressed in the new unit.
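For illustration, the two attribute values could be derived from a HEP-SPEC measurement as below. The helper name is invented, and the factor of 250 SI2K per HEP-SPEC unit is an assumption based on the commonly quoted HEPiX conversion; sites should take the actual scaling factor from the HEPiX proposal itself:

```python
# Assumed scaling factor (see the HEPiX proposal for the official value).
HEPSPEC_TO_SI2K = 250

def glue_benchmark_attributes(hep_spec: float) -> dict:
    """Build the two GLUE attribute values named in the document
    for a CPU power measured in the new HEP-SPEC unit."""
    return {
        "GlueHostBenchmarkSI00": int(hep_spec * HEPSPEC_TO_SI2K),
        "GlueHostProcessorOtherDescription": f"Benchmark={hep_spec}-HEP-SPEC",
    }

print(glue_benchmark_attributes(8.0))
# {'GlueHostBenchmarkSI00': 2000, 'GlueHostProcessorOtherDescription': 'Benchmark=8.0-HEP-SPEC'}
```

Publishing both values lets consumers keep using SI2K while the raw HEP-SPEC figure remains recoverable from the description attribute.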
The OSG reporting does not use the GlueCECapabilities attribute but the GlueSiteSponsor attribute. Therefore the formulas had to be changed (slide 3).
6.2 Xrootd Installations
Pure xrootd installations will now also be included (slide 4). The changes involve the description of pure xrootd installations; the other resources, which use xrootd just as an additional access method, are already accounted for.
- GlueSEImplementationName/Version: It includes xrootd and the version of the protocol.
- GlueControlProtocol/AccessProtocol: Pure xrootd installations will publish both control and access protocol (xroot is both a storage control and file access protocol) to distinguish from “xroot door”-only installations such as dCache, DPM, and CASTOR.
- It is advisable that pure xrootd installations publish only one SA.
Installed Capacity = ∑ (over all WLCG SAs) GlueSACapability(InstalledCapacity)
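The summation over storage areas can be sketched as follows; the capability strings use a simplified "key=value" form, not the exact GLUE syntax:

```python
def total_installed_capacity(sa_capabilities):
    """Sum InstalledCapacity over all WLCG storage areas (SAs).

    Each SA is assumed to publish a list of capability strings such
    as 'InstalledCapacity=1000' (simplified, illustrative format)."""
    total = 0
    for capabilities in sa_capabilities:
        for capability in capabilities:
            key, _, value = capability.partition("=")
            if key == "InstalledCapacity":
                total += int(value)
    return total

print(total_installed_capacity([["InstalledCapacity=1000"],
                                ["InstalledCapacity=2500", "Other=x"]]))
# 3500
```

Capabilities other than InstalledCapacity are simply ignored, so the sum stays correct for SAs that publish several capability entries.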
6.3 Status of the Document on Installed Capacity
The new version of the document (v1.8) has been published with all the details, operational instructions for site administrators, and explanatory examples (the OSG examples are still missing).
Technical agreement on the content of the document: approved by both EGEE and OSG (see the previous section).
6.4 Implementation Plan for EGEE
Below is the operational plan agreed with the WN working group, Gstat, EGEE operations, the developers, and the gLite certification team.
This plan assumes that the document is approved today.
- Presentation of the requirements and of the document at the next EGEE operations meeting.
- Presentation at the next GDB (February 11th 2009).
- Creation of a support group in GGUS to support sites. To be in place by the 18th of February 2009
- Nothing to be done for DM or WMS clients. Nothing to be done for SAM tests.
- Next version of YAIM supporting configuration to be released by end of February 2009.
- New Gstat sanity checks will go to production at the end of February 2009.
- New GridMap will go to production at the end of February 2009. It will:
  - size the sites by #LogicalCPUs (Cores) or Installed Capacity (SI2K) as defined in the document;
  - have a button to show the OSG sites contributing to WLCG (OSG sites are shown if they are listed in SAM and also in the BDII);
  - allow interactive exploration of the PhysicalCPU and LogicalCPU numbers of sites (this helps with checking whether the values are set correctly);
  - if the online data feed for WLCG topology information arrives in time, also use it as the source for the "tiers" button, which will then show WLCG sites only.
In principle APEL will produce the first reports for Tier-2s in April 2009. These are just test reports to verify that the system as conceived can work.
Official reports will be produced later on in the year.
The described operational plan has been approved by the EGEE side of WLCG. Engagement with OSG has started.
Progress is being made. A list of the current concerns and ideas is being created.
A concrete plan is being worked on. OSG’s top concerns regarding deployment:
- The ability to "override" any final numbers collected in APEL which might be incorrect.
- USATLAS has requested that they not send any data via the BDII which could result in jobs matching their sites in a WMS.
- USCMS appears to be interested in having separate data paths between the "normal" BDII, which might be used by meta-schedulers, and any BDII used for accounting data.
I.Bird asked why GridMap is involved in the plan.
F.Donno replied that GridMap can be used as a validation tool to visually check the amount of resources reported.
I.Bird asked whether storage accounting and reporting for the installed CPU capacity is ready in APEL.
J.Gordon replied that the Installed CPU capacity per VO is not complex, with only 4 values per Site to publish. This can be produced directly from APEL.
F.Hernandez asked whether both CPU units are going to be published.
F.Donno replied that, in order to track the change of unit, it is best to have both units: one can then monitor who is using the new and the old benchmark. The unit in the report is mandatory, otherwise one does not know who is using which unit.
J.Templon added that IN2P3 can migrate to the new unit. The new field should have the “value-HEPSPEC” form specified, while in the old field the unit is still SPECint2000; therefore the two remain comparable.
I.Bird added that Sites should also re-benchmark the typical hardware they already have. In this way a Site can migrate to the new unit for all resources, not just for the newly acquired hardware.
7. AOB
I.Bird reported that the Experiments’ spokespersons (or their representatives) have been invited to the F2F Meeting, on 10 February, to discuss the LHC schedule that will be defined at the workshop in Chamonix on Friday.
8. Summary of New Actions
R.Pordes agreed to provide, within 2 weeks, the milestones for OSG reporting installed capacity into APEL.