LCG Management Board

Date/Time: Tuesday 3 February 2009 - 16:00-17:00
Agenda
Members
(Version 1 – 05.2.2009)
Participants: A.Aimar (notes), D.Barberis, L.Betev, I.Bird (chair), D.Britton, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, S.Foffano, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, P.Mato, G.Merino, S.Newhouse, A.Pace, R.Pordes, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon
Invited: F.Donno
Action List
Mailing List Archive: https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/
Next Meeting: Tuesday 10 February 2009, 16:00-17:00 – F2F Meeting
1. Minutes and Matters Arising (Minutes)

1.1 Minutes of Previous Meeting
No comments. The minutes of the previous MB meeting were approved.
2. Action List Review (List of actions)
27.1.2009: J.Templon reported that, in order to speed up the process, SA3 asked NIKHEF to do a test installation and deployment at NIKHEF before sending it to certification at CERN.

- CERN: Done for ALICE, ATLAS and LHCb. CMS still needs to agree on the SLA document.
- NL-T1: J.Templon reported that the NL-T1 SLA has been sent to the Experiments for review and approval.
- NDGF: O.Smirnova reported that NDGF has sent their SLA proposal to ALICE and is waiting for a reply.

Not done yet. I.Bird proposed that someone be appointed to complete this action.

Done. Discussed on the 3 February.

M.Schulz reported that there were two problems:
- A ticket opened on 5 December and closed on 8 December: the SAM tests of SRMv2 were timing out under heavy load, and this was not reported. The time-out for these tests has since been changed. The SAM team suggests that ATLAS use their own UI rather than the one used by SAM for the OPS tests.
- The large amount of “unknown availability” was due to lock files that were not cleaned up. The lock files are there to avoid overlapping test runs. A LEMON alarm exists to report these lock files for clean-up, but nobody receiving the alarm (A.Di Girolamo or J.Novak) was acting on it. This should now be fixed.
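For illustration only, below is a minimal sketch of the kind of lock-file guard described above; the paths, lock age and test invocation are hypothetical and this is not the actual SAM implementation.

```python
import os
import sys
import time

# Hypothetical lock location used to prevent overlapping test runs.
LOCK_FILE = "/var/lock/sam-srmv2-test.lock"
MAX_LOCK_AGE = 2 * 3600  # seconds; locks older than this are treated as stale


def acquire_lock():
    """Refuse to start if a fresh lock exists; remove stale locks and continue."""
    if os.path.exists(LOCK_FILE):
        age = time.time() - os.path.getmtime(LOCK_FILE)
        if age < MAX_LOCK_AGE:
            # A recent lock means another test run is (probably) still active.
            # If stale locks are never cleaned up, tests stop running and the
            # availability shows up as "unknown".
            return False
        os.remove(LOCK_FILE)
    with open(LOCK_FILE, "w") as fh:
        fh.write(str(os.getpid()))
    return True


def release_lock():
    if os.path.exists(LOCK_FILE):
        os.remove(LOCK_FILE)


if __name__ == "__main__":
    if not acquire_lock():
        sys.exit(0)  # another run holds the lock; skip this cycle
    try:
        pass  # the actual SRMv2 test would run here
    finally:
        release_lock()
```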
3. LCG Operations Weekly Report (Slides, Weekly Report) – J.Shiers
Summary of the status and progress of LCG Operations since the last MB meeting. The daily meeting summaries are always available at: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1 GGUS Summary
Nothing special and no alarms this week. CMS and ALICE were reminded that they should use GGUS routinely for support requests.
An FAQ page is in preparation explaining the differences between “TEAM” and normal tickets, which now go directly to sites.

3.2 ASGC Update
J.Shih connected to last Thursday’s 3D meeting and confirmed that they were on target to put the “3D”-related services back in production in the first week of February. The mid-February (pre-CASTOR F2F at RAL) deadline for a clean CASTOR+SRM+DB installation may well be too aggressive and needs to be followed.
What can be learned from this experience?
- Standard configurations help: it is easier to diagnose problems, easier for sites to help each other, and easier to avoid problems in the first place by following a well-trodden route.
- Manpower and knowledge levels must be adequate. The levels needed to support key WLCG services must be clarified and eventually reviewed – what are affordable and reasonable support costs for, e.g., “curating a PB of data”?
The ASGC situation will be followed up in the coming weeks.

3.3 LCG-cp Issues
This issue spiked in the middle of the week (it was reported as fixed on Thursday) but highlights one or more weaknesses in the testing and roll-out procedure. Without being too specific to this particular problem, better coordination of testing and roll-out appears to be necessary. Comments are expected from the groups and Experiments concerned or affected; this could be coordinated through the daily OPS meeting.

Ph.Charpentier noted that the problem was not actually fixed; the installation was rolled back.
A.Pace noted that the SRM issues are known, but the choice is between fixing the current version or writing a new one from scratch. The current tests do not cover all the branches. The changes in the SRM server error codes caused major problems for the CASTOR SRM client, which was correcting for, and sometimes circumventing, the error codes returned by the servers.

J.Shiers noted that preparing and delivering a new version would take several months and is therefore not a viable option.

Ph.Charpentier noted that the test cases should at least cover the typical use cases of the Experiments. This would reduce the late discovery of most problems.

D.Barberis noted that this period is suitable for changes; later, when data-taking approaches, there will be no way to deploy new SRM software.

F.Donno suggested that every bug fix should also leave behind a test that verifies that the bug does not reappear.

Ph.Charpentier asked to be informed when new versions are in the Applications Area software repository. The Experiments did not know that a new version was available.

3.4 Oracle-related Issues
FZK: The LFC and FTS at FZK were down over the weekend due to a problem with the Oracle backend. Oracle support was involved and the “SIR” has not been provided.
NL-T1: Later in the week, the migration of the LFC backend at NL-T1 to a RAC environment came with a huge performance loss. This is believed to be due to the “mainly write” load from ATLAS, plus the “new world record” in files per day reported by ATLAS, compared to the read-only access of the LHCb LFC.

J.Templon reported that the LFC has now been moved back to the old hardware and the migration will be redone, applying the lessons learned.

Big IDs: More big ID problems, this week at RAL again. The ASGC DB is now back in production and will be resynchronised. It seems that all these problems occur at the CASTOR Sites.

L.Dell’Agnello reported that the Big IDs problem has NOT appeared at CNAF until now.

3.5 Other Services Issues
The usual level of operational issues continues, but with improved (or at least improving) reporting by e-mail or at the meeting. It is hard to get the balance exactly right, but the level of detail provided now is about right. It is probably cheaper to provide a report than not, and it is a very convenient way of communicating with the community (bi-directionally).
The fix for the FTS delegation bug has been back-ported to FTS 2.1 and is in the hands of the certification team under patch #2760.
It has been suggested that a dedicated WLCG section of the weekly joint operations meeting is now redundant: removing it would help streamline the meetings and save some time.

3.6 File Loss at FZK NOT Reported
However, some significant problems do not get reported to the meeting, and this should be corrected as soon as possible. For example, from CMS Facilities Operations on Friday last week:
This issue was NEVER reported to the Operations meetings. Both Sites and VOs should have reported it.

A.Heiss stated that the CMS representative in the GridKa Technical Advisory Board asked not to report that issue.

M.Kasemann added that CMS knew about the issue immediately, and FZK wanted to understand it before reporting it. But this major issue should have been reported by both the Site and the VO.

A.Heiss asked what a Site should do if a VO at the Technical Advisory Board asks it not to report an issue.

I.Bird agreed that it is acceptable not to report an issue until it is understood; but once it is confirmed as a real problem it should be reported as soon as possible.

J.Gordon also asked that the technical overview boards at the Sites be officially requested not to hide problems from the WLCG.

Ph.Charpentier noted that the Sites must report to the Operations meeting, and from there to the MB if needed.
4. Requirements to report WLCG Installed Capacity information (Slides) – R.Pordes
R.Pordes presented the method to automatically report installed capacity at the WLCG Sites. This is a summary of the work done with F.Donno and J.Gordon. The requirements document was prepared out of the combined requirements/implementation document after the implementation; here is the latest version.
There are two sets of requirements, which differ in the rate at which they must be provided:
- Monthly capacities/usage for management reports
- Dynamic/timely information for resource availability and allocation
R.Pordes focused on the sentences in bold as requirements (slide 3) from the Experiments and the MB.
For Disk Storage the values must be collected in the same way for all infrastructures. There are several possibilities, for example:
- the monthly average;
- the value at the end of the month;
- the highest value during the month.
It is crucial that the infrastructures agree on the same values, otherwise the reports are inconsistent.
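To illustrate how much the three options can differ, here is a minimal sketch with invented daily samples; none of these numbers or choices are part of the agreed reporting.

```python
from statistics import mean

# Invented daily installed-disk samples (TB) for one site over a 30-day month;
# the site happens to be down on the last day.
daily_disk_tb = [500] * 27 + [520, 520, 0]

monthly_average = mean(daily_disk_tb)   # ~485 TB
end_of_month    = daily_disk_tb[-1]     # 0 TB - the case R.Pordes warns about
monthly_peak    = max(daily_disk_tb)    # 520 TB

print(monthly_average, end_of_month, monthly_peak)
```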
J.Templon noted that currently the value reported is the one at the time of filling in the monthly report.

J.Gordon noted that the agreement was precisely to report the value at the end of the month, not a value from the middle of the following month, when the report is filled in or corrected by the Site.

R.Pordes asked whether it would not be better to report an average over the month instead. If a Site is down on the last day of the month, it reports no resources even though they were available for all the other days of that month.

J.Templon proposed that the highest value could be taken, as it gives an indication of how much was needed at the peak of usage by a given VO.

I.Bird reminded the MB that this is the value reported to the management, not the dynamic value used by the Experiments to choose where to submit their jobs.

R.Pordes noted that both the average and the highest value provide useful information and should be reported.

The MB agreed that the topic should be discussed in the working group and a proposal brought back to the MB.

The second set of requirements concerns dynamic usage.
F.Donno explained that the points above are there to make sure that the schema is not changed in a non-compatible way.

R.Pordes agreed, but noted that this is more a constraint than a requirement.

M.Kasemann asked how the disk in front of the MSS is accounted to the VOs.

J.Gordon reminded that the pledges from the VOs include, for instance, the disk caches of the MSS.

M.Kasemann concluded that CMS must then include this amount of disk in their revised requirements.
5. Status of OSG Collecting Installed Capacity (Slides) – R.Pordes
R.Pordes also summarized the status of the work going on in OSG to collect the installed capacity and provide the information to the APEL accounting collector.
US ATLAS and US CMS appreciate the additional 6 weeks for comment and review of, and sign-off on, the WLCG Installed Capacity Document V1.8.

5.1 Plans
OSG will report capacity as agreed in the current WLCG MOU resource commitments, not beyond. They will not report additional capacity provided beyond their pledges.
The monthly reporting information will be transmitted from OSG-Operations specifically to the report collectors (APEL, etc.):
- The mechanisms will be similar to those existing for the OSG reporting of accounting and availability information.
- The reporting will NOT be through the WLCG BDIIs used by the resource brokers and workload management systems.

I.Bird noted that the total resources of ATLAS and CMS will not be realistic if there are hidden resources in OSG.

R.Pordes replied that in the APEL accounting the resources will all be reported as dynamic information, but not in the monthly reports.

M.Kasemann noted that the accounting values will still be correct, as for sites that go beyond their pledges because of opportunistic usage of other resources.

J.Templon noted that if WLCG is “world-wide” it is not very good that OSG does not report everything. I.Bird replied that he is ready to ask the US representative at the RRB what resources they report.

M.Schulz asked why the information is not provided via the standard BDII.

R.Pordes replied that US ATLAS does not use it, but they may implement a dummy BDII in order to report.

OSG plans to participate in joint activities for:
- planning the timeline and steps needed for deployment, analysis and validation of the information at the WLCG; and
- ongoing validation of the information. Is F.Donno the contact for both of these?
They expect a several-month (3-4) window during which the information is included in A.Aimar’s reports, there is focused and detailed validation, and the information is not reported at a higher level than the MB. OSG plans to implement a means to fix data at the OSG management layer rather than going back to the site administrators. With April 2009 as the time of the “APEL/CESGA” reports, July 2009 would be the timeframe for the reports to be published officially.

I.Bird replied that the reports have already been shown to the LHCC, to present the kind of reports that will be provided. F.Donno is the contact person, but someone else will have to be nominated in the coming months.

I.Bird asked for a milestone plan for getting the OSG accounting information into APEL.

New Action: R.Pordes agreed to provide, within 2 weeks, the milestones for OSG reporting installed capacity into APEL.
6. Installed Resource Capacity: Update and Operational Plan (Slides, Document) – F.Donno
F.Donno presented the comments received on the previous version of the document, both about computing and storage resources. She also presented an operational plan for the next few months.
The new version of the document is v1.8, publicly available:
- It includes a description for pure xrootd installations.
- Integration of the new benchmark HEP-SPEC.
- A few more comments received from OSG.

6.1 Computing Capacity: Changes (slide 3)
Sites that decide to publish CPU power using the new benchmark HEP-SPEC:
- MUST use the GlueHostBenchmarkSI00 attribute to publish CPU power. In this case, the power MUST be expressed in SpecInt2000 using the scaling factor that can be found in the proposal of the HEPiX group.
- MUST also publish the attribute GlueHostProcessorOtherDescription: Benchmark=<value>-HEP-SPEC, where in <value> they report the CPU power expressed in the new unit.
The OSG reporting does not use the GlueCECapability attribute but the GlueSiteSponsor attribute. Therefore the formulas had to be changed, as shown on slide 3.
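A minimal sketch of the two attribute values a site publishing in HEP-SPEC would provide under the rules above; the conversion factor used here is only an assumed placeholder, the authoritative scaling factor being the one defined in the HEPiX proposal.

```python
# The scaling factor below is only an assumed placeholder for this sketch;
# the value that MUST be used is the one given in the HEPiX group proposal.
ASSUMED_SI2K_PER_HEP_SPEC = 250.0


def glue_benchmark_attributes(hep_spec_power):
    """Return the two GLUE values a site publishing in HEP-SPEC provides."""
    si2k = hep_spec_power * ASSUMED_SI2K_PER_HEP_SPEC
    return {
        # CPU power still expressed in SpecInt2000, as required.
        "GlueHostBenchmarkSI00": int(round(si2k)),
        # The new-unit value is carried in the free-form description attribute.
        "GlueHostProcessorOtherDescription": "Benchmark=%g-HEP-SPEC" % hep_spec_power,
    }


# Hypothetical worker node rated at 8 HEP-SPEC per core.
print(glue_benchmark_attributes(8.0))
```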
6.2 Xrootd Installations
The xrootd installations will also be included now (slide 4). The changes involve the description of pure xrootd installations; resources that use xrootd only as an additional access method are already accounted for.
- GlueSEImplementationName/Version: includes xrootd and the version of the protocol.
- GlueControlProtocol/AccessProtocol: pure xrootd installations will publish both control and access protocol (xroot is both a storage control and a file access protocol), to distinguish them from “xroot door”-only installations such as dCache, DPM and CASTOR.
- It is advisable that pure xrootd installations publish only one SA.
Installed Capacity = Σ over all WLCG GlueSA of GlueSACapability(InstalledCapacity)
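A minimal sketch of the aggregation implied by the formula above, assuming the GlueSACapability values have already been retrieved from the information system; the input records are invented for illustration.

```python
def installed_capacity(storage_areas):
    """Sum the InstalledCapacity capability over all WLCG storage areas (GlueSA)."""
    total = 0
    for sa in storage_areas:
        # GlueSACapability is a multi-valued list of "key=value" strings.
        for cap in sa.get("GlueSACapability", []):
            key, _, value = cap.partition("=")
            if key == "InstalledCapacity":
                total += int(value)
    return total


# Invented records for two storage areas (units as published by the sites).
sas = [
    {"GlueSACapability": ["InstalledCapacity=200000"]},
    {"GlueSACapability": ["InstalledCapacity=150000"]},
]
print(installed_capacity(sas))  # 350000
```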
6.3 Status of the Document on Installed Capacity
The new version of the document (v1.8-7) has been published with all details, operational instructions for site administrators and explanatory examples (the OSG examples are still missing). Technical agreement on the content of the document: approved by both EGEE and OSG (see the previous section).

6.4 Implementation Plan for EGEE
Below is the operational plan agreed with the WN working group, Gstat, EGEE operations, the developers and the gLite certification team. This plan assumes that the document is approved today.
Plan description:
- Presentation of the requirements and of the document at the next EGEE operations meeting.
- Presentation at the next GDB (11 February 2009).
- Creation of a support group in GGUS to support sites, to be in place by 18 February 2009.
- Nothing to be done for DM or WMS clients. Nothing to be done for SAM tests.
- Next version of YAIM supporting the configuration to be released by end of February 2009.
- New Gstat sanity checks to go to production at the end of February 2009.
- New GridMap to go to production at the end of February 2009. The new GridMap will:
  - size the sites by #LogicalCPUs (cores) or Installed Capacity (SI2K), as defined in the document;
  - have a button to show the OSG sites contributing to WLCG (OSG sites are shown if they are listed in SAM and also in the BDII);
  - allow the PhysicalCPU and LogicalCPU numbers of sites to be explored interactively (this helps to check whether the values are set correctly);
  - if the online data feed for WLCG topology information arrives in time, add it as the source for the “tiers” button, which will then show WLCG sites only.
In principle APEL will produce the first reports for Tier-2s in April 2009. These are just test reports to verify that the system as conceived can work. Official reports will be produced later in the year.
The described operational plan has been approved by the EGEE side of WLCG. Engagement with OSG has started and progress is being made: a list of the current concerns and ideas is being created, and a concrete plan is being worked on.
OSG’s top concerns regarding deployment are:
- The ability to “override” any final numbers collected in APEL which might be incorrect.
- US ATLAS has requested that they not send any data via the BDII which could result in jobs matching their sites in a WMS.
- US CMS appears to be interested in having separate data paths between the “normal” BDII, which might be used by meta-schedulers, and any BDII used for accounting data.

I.Bird
asked why GridMap is involved in the plan.

F.Donno replied that GridMap can be used as a validation tool, to visually check the amount of resources reported.

I.Bird asked whether storage accounting and the reporting of installed CPU capacity are ready in APEL.

J.Gordon replied that installed CPU capacity per VO is not complex, with 4 values per Site to publish; this can be produced directly from APEL.

F.Hernandez asked whether both CPU units are going to be published.

F.Donno replied that, in order to check the change of unit, it is best to publish both units: one can then monitor who is using the new and who is using the old benchmark. Specifying the unit in the report is mandatory, otherwise one does not know which unit a site is using.

J.Templon added that IN2P3 can migrate to the new unit. The new field should carry the “value-HEPSPEC” form; in the old field the unit is still the old SPECint2K, therefore the values remain comparable.

I.Bird added that Sites should also re-benchmark the typical hardware they already have. In this way a Site can migrate to the new unit for all its resources, not just for newly acquired hardware.
7. AOB
I.Bird reported that the Experiments’ spokespersons (or their representatives) have been invited to the F2F Meeting, on 10 February, to discuss the LHC schedule that will be defined at the workshop in Chamonix on Friday.
8. Summary of New Actions

New Action: R.Pordes agreed to provide, within 2 weeks, the milestones for OSG reporting installed capacity into APEL.