LCG Management Board

Date/Time

Tuesday 3 February 2009 - 16:00-17:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=49389

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 05.2.2009)

Participants

A.Aimar (notes), D.Barberis, L.Betev, I.Bird (chair), D.Britton, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, S.Foffano, Qin Gang, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, P.Mato, G.Merino, S.Newhouse, A.Pace, R.Pordes, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Invited

F.Donno

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 10 February 2009 16:00-17:00 – F2F Meeting

1.   Minutes and Matters Arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments. The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions) 
 

 

  • SCAS Testing and Certification

27.1.2009: J.Templon reported that, in order to speed up the process, SA3 asked NIKHEF to do a test installation and deployment locally before sending SCAS to certification at CERN.

  • VOBoxes SLAs:
    • Experiments should respond to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).
    • NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

CERN: Done for ALICE, ATLAS and LHCb. CMS still needs to agree on the SLA document.

NL-T1: J.Templon reported that the NL-T1 SLA has been sent to the Experiments for review and approval.

NDGF: O.Smirnova reported that NDGF has sent their SLA proposal to ALICE and is waiting for a reply.

  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. The information is best provided in the form already used by the Experiments’ data-flow documents (e.g. “Dataflow from LHCb”).

Not done yet. I.Bird proposed that someone be appointed to complete this action.

  • 09 Dec 2008 - Comments on the proposal for collecting Installed Capacity (F.Donno's document) should be sent to the MB mailing list.

Done. Discussed on 3 February.

  • 3 Feb 2009 - M.Schulz will report on the issues with the SAM servers (and the issue of the ATLAS lock files).

M.Schulz reported that there were two problems:

-       A ticket was opened on 5 December and closed on 8 December: the SAM tests of SRMv2 were timing out under heavy load. This had not been reported. The time-out for these tests has been changed. The SAM team suggests that ATLAS use their own UI, not the one used by SAM for the OPS tests.

-       The large amount of “unknown availability” was due to lock files not being cleaned up. They are there to avoid overlapping tests. There is a LEMON alarm reporting these lock files, but nobody receiving the alarm (A.Di Girolamo or J.Novak) was acting on it. This should now be fixed.
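For illustration, a minimal sketch of the lock-file pattern involved (hypothetical path and timings, in Python; this is not the actual SAM implementation):

    # Minimal sketch (hypothetical) of the lock-file pattern: a stale lock
    # left behind by a crashed test run blocks later runs unless it is
    # detected and removed.
    import os, time

    LOCK = "/tmp/sam_srmv2_test.lock"   # hypothetical path
    MAX_AGE = 2 * 3600                  # assume a test never runs > 2 hours

    def acquire_lock():
        if os.path.exists(LOCK):
            age = time.time() - os.path.getmtime(LOCK)
            if age < MAX_AGE:
                return False            # a test is (probably) still running
            os.remove(LOCK)             # stale lock: clean it up and go on
        open(LOCK, "w").close()
        return True

In the incident above the cleanup relied on a LEMON alarm and manual action rather than an automatic age check, so runs kept being skipped and the tests showed up as “unknown”.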

  • Actions for moving to the new CPU unit
    • Convert the current requirements to the new unit.
    • Tier-1 Sites and the main Tier-2 Sites should buy the license for the benchmark.
    • A web site at CERN should be set up to store the values from WLCG.
    • A group should prepare the plan for the migration regarding the CPU power published by sites through the Information System (J.Gordon replied that this will be discussed at the MB next week).
    • Pledges and Requirements need to be updated.

 

3.   LCG Operations Weekly Report (Slides, Weekly Report) – J.Shiers

 

Summary of the status and progress of LCG Operations since the last MB meeting. The daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      GGUS Summary

Nothing special and no alarms this week. CMS and ALICE were reminded that they should use GGUS routinely for support requests.

 

VO affected    USER   TEAM   ALARM   TOTAL
ALICE             1      0       0       1
ATLAS            22      9       0      31
CMS               3      0       0       3
LHCb              0      4       0       4

 

An FAQ page is in preparation explaining the differences between “TEAM” tickets, which now go directly to sites, and normal tickets.

3.2      ASGC Update

J.Shih connected to last Thursday’s 3D meeting and confirmed that ASGC was on target to put the “3D”-related services back in production in the first week of February. The mid-February (pre-CASTOR F2F@RAL) deadline for a clean CASTOR+SRM+DB installation may well be too aggressive and needs to be followed up.

 

What can we learn from this experience?

-       Standard configurations help: it is easier to diagnose problems, easier for sites to help each other, and easier to avoid problems in the first place by following a well-trodden route;

-       Manpower and knowledge levels must be adequate. The levels needed to support key WLCG services have to be clarified and eventually reviewed – what are affordable / reasonable support costs for, e.g., “curating a PB of data”?

 

The ASGC situation will be followed up in the next weeks.

3.3      LCG-cp Issues

This was an item that spiked in the middle of the week (it was reported as fixed on Thursday) but it highlights one or more weaknesses in the testing and roll-out procedure. Without being too specific to this particular problem, better coordination of testing and roll-out appears to be necessary. Comments are expected from the groups and experiments concerned or affected; this could be coordinated through the daily OPS meeting.

 

Ph.Charpentier noted that the problem was not actually fixed: the installation was rolled back.

A.Pace noted that the SRM issues are known, but the choice is between fixing the current version and making a new one from scratch. The current tests do not cover all the branches. The changes in the SRM server error codes caused major problems to the CASTOR SRM client, which was correcting for the problems and sometimes circumventing the error codes returned by the servers.

J.Shiers noted that the preparation and delivery of a new version would take several months; it is therefore not a viable option.

Ph.Charpentier noted that the test cases should at least cover the typical usage of the Experiments. This would reduce the late discovery of most problems.

 

D.Barberis noted that this period is suitable for changes; later, when data-taking approaches, there will be no way to deploy new SRM software.

F.Donno suggested that every bug fix should also leave behind a test that verifies that the bug does not reappear.
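As a minimal sketch of how such a regression test could look (the function, names and error codes below are hypothetical, in Python; they are not taken from the actual SRM test suite, where the same idea would apply):

    # Minimal sketch of a regression test left behind by a bug fix.
    # The function, the error codes and the scenario are hypothetical.
    import unittest

    def client_error_code(server_reply):
        # Toy stand-in for the client-side handling of SRM server replies.
        # The (hypothetical) bug: unknown server codes were silently
        # rewritten to SRM_SUCCESS; the fix propagates them unchanged.
        return server_reply.get("returnStatus", "SRM_FAILURE")

    class TestErrorCodeRegression(unittest.TestCase):
        def test_unknown_code_is_not_masked(self):
            # Before the fix this reply was "corrected" to SRM_SUCCESS;
            # this test keeps the bug from silently reappearing.
            reply = {"returnStatus": "SRM_INTERNAL_ERROR"}
            self.assertEqual(client_error_code(reply), "SRM_INTERNAL_ERROR")

    if __name__ == "__main__":
        unittest.main()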

 

Ph.Charpentier asked to be informed when new versions are available in the Applications Area software repository. The Experiments did not know there was a new version available.

 

3.4      Oracle-related Issues

FZK: LFC and FTS at FZK were down over the weekend due to a problem with the Oracle backend. Oracle support was involved, but the “SIR” has not yet been provided.

 

NL-T1: Later in the week the migration of the LFC backend at NL-T1 to a RAC environment came with a huge performance loss. This is believed to be due to the “mainly write” load from ATLAS, plus the “new world record” in #files/day reported by ATLAS, compared to the read-only access to the LHCb LFC.

 

J.Templon reported that now the LFC has been moved back to the old hardware and the migration will be redone with the lessons learned.

 

Big IDs: more big ID problems, this week at RAL again. The ASGC DB is now back in production and will be resynchronised. It seems that all these problems occur at the CASTOR Sites.

 

L.Dell’Agnello reported that the Big IDs problem has NOT appeared at CNAF so far.

3.5      Other Services Issues

The usual level of operational issues continues, but with improved (or at least improving) reporting by e-mail or at the meeting. It is hard to get the balance exactly right, but the level of detail provided now is about right. It is probably cheaper to provide a report than not, and it is a very convenient way of communicating (bi-directionally) with the community.

 

The fix for the FTS delegation bug has been back-ported to FTS 2.1 and it is in the hands of the certification team under patch #2760.

It has been suggested that a dedicated WLCG section of the weekly joint operations meeting is now redundant: removing it would help streamline meetings and save some time.

3.6      File Loss at FZK NOT Reported

However, some significant problems do not get reported to the meeting and this should be corrected asap. For example, from CMS Facilities Operations on Friday last week:

 

We (CMS Facilities Operations) were informed yesterday evening by the FZK dCache/Tape Admins, that there is a severe file loss at FZK. They discovered a bug in a script, responsible for the migration of files between the dCache write-pools and the tape library. In some specific constellation, if the TSM/TSS process of the tape library was killed and restarted for some reason, a wrong error code was passed from the migration script to dCache and the files were marked as being on tape, although they never made it there.

 

The admins provided us now a list of about 500 lost files which could not be recovered on e.g. a read pool on our site. Our admins ensured us that the issue is understood and the bug is fixed, but nevertheless, this is a real mess.

 

 

This issue was NEVER reported to the Operations meetings. Both Sites and VOs should have reported it.

 

A.Heiss stated that the CMS representative in the GridKA Technical Advisory Board asked not to report that issue.

M.Kasemann added that CMS knew immediately about the issue, and FZK wanted to understand it before reporting it. But this major issue should have been reported by both Site and VO.

 

A.Heiss asked what a Site should do if a VO at the Technical Advisory Board asks it not to report an issue.

I.Bird agreed that an issue need not be reported while it is still being clarified; but once it is a real problem it should be reported asap.

 

J.Gordon also asked that the technical overview boards at the Sites be officially asked not to hide problems from the WLCG.

Ph.Charpentier noted that the Sites must report to the Operations meeting and then from there to the MB if needed.

 

 

4.   Requirements to report WLCG Installed Capacity information (Slides) – R.Pordes

 

R.Pordes presented the method to automatically report installed capacity at the WLCG Sites.

This is the summary of the work done with F.Donno and J.Gordon.

 

The requirements document was prepared out of the combined requirements/implementation document after the implementation.

 

Here is the latest version:

https://twiki.cern.ch/twiki/pub/LCG/WLCGCommonComputingReadinessChallenges/installedcapactiyreportingrequirements-v3-1-2.pdf

 

There are 2 sets of requirements which differ in the rate at which they must be provided:

-       Monthly capacities/usage for management reports

-       Dynamic/timely information for resource availability and allocation.

 

R.Pordes focused on the sentences in bold as requirements (slide 3) from the Experiments and the MB.

 

The WLCG MOUs Current Pledge and Resources table defines the pledges of resources from Tier-1 and Tier-2 sites to meet the needs of the LHC experiments. Both this table and the needs of the experiments are updated annually. The management and oversight bodies are charged with assessing whether the agreed-upon set of resources is indeed available, whether the needs of the experiments are and can be met, and with capacity planning and negotiation for the future. The WLCG has been reporting the CPU usage of available resources on a monthly basis for Tier-1s for several years and for Tier-2s for the last year. However, there has been no automated reporting of the usage of archival and disk storage. Additionally, there is no ongoing reporting of the resources actually available at any site. The funding agencies and oversight boards require this information in order to do their assessments.

The management of ALICE, ATLAS, CMS and LHCb have asked for the information to be available on an ongoing basis in order to assess the installed base of resources and plan for the future. The manual reporting done by Tier-1s will not scale to all Tier-2s, so it is an MB requirement that the data collection be automatic.

The WLCG baseline services define the requirements for publishing dynamic information to enable the resource brokers and job schedulers to ensure maximal efficiency in the use of the installed base of resources. It is therefore imperative that accurate and complete information be published about the capacity and current availability of the resources.

Published disk resources should include disk cache in front of tape and other areas requested by the experiments, not just disk for permanent files. Other resources used internally by a site for optimization, testing or operation do not have to be reported.

 

 

For Disk Storage the values must be collected in the same way for all infrastructures. There are several possibilities, for example:

-       Monthly average

-       Value at the end of the month

-       Highest value during the month

 

It is crucial that the infrastructures agree on the same values. Otherwise the reports are inconsistent.

 

There are two classes of requirements: the first covering the needs for monthly publishing of installed capacity and resource usage; the second providing timely information to be used by the experiments for ongoing resource monitoring and scheduling.  The requirements are:

1. Provide the WLCG management with a monthly view of the "total installed capacity" at a site. This includes resources that are not currently in use but are available to be deployed at short notice. Measurement errors associated with jobs running over month boundaries and other uncertainties will limit the accuracy. This should be accurate to within 10%.

2. Publish information to provide the WLCG and VO management with a monthly view of the resource assignment per experiment at sites. This includes only resources that have been configured and explicitly assigned to a VO. An exception is made for shared resources; in this case it is accepted to provide only aggregated information for the common usage.

 

Notes

Since the measurements are only required to be monthly, it cannot be assumed that any published values are up to date. An agreement needs to be reached on the period during which data are valid; e.g. data could always be taken to refer to the previous calendar month.

 

A decision needs to be taken on how disk storage is measured for each month. Possibilities include: the last measurement taken each month, the highest value, the average, a time integral.    
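For illustration, a short Python sketch (with invented daily samples) showing how the four candidate figures differ for the same month; it includes the case of a site that is down on the last day, which is raised in the discussion below:

    # Illustrative only: four candidate "monthly disk capacity" figures
    # computed from the same invented daily samples (TB installed per day).
    samples = [500, 500, 520, 520, 520, 0]       # site down on the last day

    last_value    = samples[-1]                  # value at month end: 0
    highest_value = max(samples)                 # peak during the month: 520
    average       = sum(samples) / len(samples)  # monthly average: ~426.7
    time_integral = sum(samples)                 # TB-days for daily samples: 2560

    print(last_value, highest_value, average, time_integral)

The spread between these figures (0 vs. 520 TB here) is exactly why the infrastructures must agree on a single definition.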

 

 

 

J.Templon noted that currently the reported value is the one at the time of filling in the monthly report.

J.Gordon noted that the agreement was precisely to report the value at the end of the month, not in the middle of the following month when the report is filled in or fixed by the Site.

 

R.Pordes asked whether it would not be better to report an average over the month instead. If a Site is down on the last day of the month it reports no resources, even though they were working on all the other days of that month.

 

J.Templon proposed that the highest value could be taken. The reason is that it gives an indication of how much was necessary at the peak of usage by a given VO.

 

I.Bird recalled that this is the value reported to the management, not the dynamic value used by the Experiments to choose where to submit their jobs.

 

R.Pordes noted that actually both the average and the highest value provide useful information and should be reported.

 

The MB agreed that the topic should be discussed in the working group and a proposal brought to the MB.

 

The second set of requirements concerns dynamic usage. They are:

       Publishing information to allow experiment operators to monitor the experiment’s usage of the resources. This is especially important for monitoring the usage of storage staging areas or of disk-only areas.

       Support for users to make maximal use of the installed resource base. Any changes made to the information service schema or semantics should not impact existing uses of the information service.  Such changes should either be backwards compatible or be made after careful consultation with users and application developers.

It is not clear whether the level of detail in the 2nd requirement is sufficient to meet the stated goal of “support for users to make maximal use of the installed resource base”.

 

 

F.Donno explained that the points above are there to make sure that the schema is not changed in a non-compatible way.

R.Pordes agreed but noted that this is more a constraint than a requirement.

 

M.Kasemann asked how disk in front of the MSS is accounted to the VOs.

J.Gordon recalled that the pledges from the VOs include, for instance, the disk caches of the MSS.

M.Kasemann concluded that then CMS must include this amount of disk in their revised requirements.

 

5.   Status of OSG Collecting Installed Capacity (Slides) – R.Pordes

 

R.Pordes also summarized the status of the work going on in OSG in order to collect the installed capacity and provide the information to the APEL accounting collector.

US ATLAS and US CMS appreciated the additional 6 weeks for comment and review of the document, and signed off on the WLCG Installed Capacity Document V1.8.

5.1      Plans

OSG will report capacity as agreed to in the current WLCG MOU resource commitments, not beyond. They will not report additional capacity provided beyond their pledges.

 

The monthly reporting information will be transmitted by OSG-Operations directly to the report collectors (APEL, etc.):

-       The mechanisms will be similar to those existing for the OSG reporting of accounting and availability information.

-       The reporting will NOT be through the WLCG BDIIs used by the resource brokers and workload management systems.

 

I.Bird noted that the total resources of ATLAS and CMS will not be realistic if there are hidden resources in OSG.

R.Pordes replied that in the APEL accounting all resources will be reported as dynamic information, but not in the monthly reports.

 

M.Kasemann noted that the accounting values will still be correct, as for sites that go beyond their pledges because of opportunistic usage of other resources.

 

J.Templon noted that, if WLCG is “world-wide”, it is not very good that OSG does not report everything.

I.Bird replied that he is ready to ask the US representatives at the RRB what resources they report.

 

M.Schulz asked why the information is not provided via the standard BDII.

R.Pordes replied that US ATLAS does not use it, but they may implement a dummy BDII in order to report.

 

OSG plan to participate in joint activities for

-       planning the timeline and steps needed for deployment, analysis and validation of the information at the WLCG; and

-       ongoing validation of the information.

 

Is F.Donno the contact for both of these? OSG expects a several-month (3-4) window during which the information is included in A.Aimar’s reports, there is focused and detailed validation, and the information is not reported at a higher level than the MB.

 

OSG plans to implement a means to fix data at the OSG management layer rather than going back to the site administrators.

With April 2009 as the time of the first “APEL/CESGA” reports, July 2009 would be the timeframe for the reports to be published officially.

 

I.Bird replied that sample reports have already been shown to the LHCC to illustrate the kind of reports that will be provided.

F.Donno is the contact person but someone else will have to be nominated in the coming months.

 

I.Bird asked for a milestone plan for getting the OSG accounting information into APEL.

 

New Action:

R.Pordes agreed to provide, within 2 weeks, the milestones for OSG reporting installed capacity into APEL.

 

6.   Installed Resource Capacity: Update and Operational Plan (Slides, Document) – F.Donno 

 

F.Donno presented the comments received on the previous version of the document, covering both computing and storage resources.

She also presented an operational plan for the next few months.

 

The new version of the document, v1.8, is publicly available. It includes:

-       a description of pure xrootd installations

-       the integration of the new benchmark HEP-SPEC

-       a few more comments received from OSG

6.1      Computing Capacity: Changes (slide 3)

Sites that decide to publish CPU power using the new benchmark HEP-SPEC:

-       MUST use the GlueHostBenchmarkSI00 attribute to publish CPU power. In this case, the power MUST be expressed in SpecInt2000 using the scaling factor that can be found in the proposal of the HEPiX group.

-       MUST also publish the attribute GlueHostProcessorOtherDescription: Benchmark=<value>-HEP-SPEC, where in <value> they report the CPU power expressed in the new unit (see the illustrative example below).
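As an illustration (all numbers are invented; the scaling factor of roughly 250 SI2K per HEP-SPEC unit is assumed here only for the example – the authoritative factor is the one in the HEPiX proposal), a sub-cluster whose CPU power is 8.0 in the new unit might publish:

    # Hypothetical GLUE 1.3 attributes for a sub-cluster rated 8.0 HEP-SPEC.
    # Assumed scaling for the example: 8.0 x ~250 = 2000 SI2K.
    GlueHostBenchmarkSI00: 2000
    # The same power expressed in the new unit, as required above.
    GlueHostProcessorOtherDescription: Benchmark=8.0-HEP-SPEC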

 

The OSG reporting does not use the GlueCECapabilities attribute but GlueSiteSponsor. Therefore the formulas had to be changed accordingly (see slide 3).

 

 

6.2      Xrootd Installations

The xrootd installations will also be included now (slide 4). The changes involve the description of pure xrootd installations; the other resources, which use xrootd just as an additional access method, are already accounted for.

-       GlueSEImplementationName/Version: It includes xrootd and the version of the protocol.

-       GlueControlProtocol/AccessProtocol: Pure xrootd installations will publish both control and access protocol (xroot is both a storage control and file access protocol) to distinguish from “xroot door”-only installations such as dCache, DPM, and CASTOR.

-       It is advisable that pure xrootd installations publish only one SA (an illustrative sketch follows the formula below).

 

Installed Capacity = Σ (over all WLCG GlueSAs) GlueSACapability(InstalledCapacity)
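A sketch of what a pure xrootd installation might publish, following the description above (the values are invented; the attribute names are those of the GLUE 1.3 LDAP schema):

    # Hypothetical GLUE 1.3 entries for a pure xrootd storage element.
    GlueSEImplementationName: xrootd
    GlueSEImplementationVersion: 2.9.0
    # xroot is published both as control and as access protocol; this
    # distinguishes a pure installation from an "xroot door" on dCache,
    # DPM or CASTOR.
    GlueSEControlProtocolType: xroot
    GlueSEAccessProtocolType: xroot
    # A single Storage Area carries the installed capacity (GB, invented).
    GlueSACapability: InstalledCapacity=100000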

6.3      Status of the Document on Installed Capacity

The new version of the document (v1.8) has been published with all details, operational instructions for site administrators and explanatory examples (the OSG examples are still missing).

 

https://twiki.cern.ch/twiki/pub/LCG/WLCGCommonComputingReadinessChallenges/WLCG_GlueSchemaUsage-1.8.pdf

 

Technical agreement on the content of the document: approved by both EGEE and OSG (see the previous section).

6.4      Implementation Plan for EGEE

Below is the operational plan agreed with the WN working group, Gstat, EGEE operations, the developers and the gLite certification team.

This plan assumes that the document is approved today.

 

Plan description.

-       Presentation of the requirements and of the document at the next EGEE operations meeting.

-       Presentation at the next GDB (February 11th 2009).

 

-       Creation of a support group in GGUS to support sites. To be in place by the 18th of February 2009

 

-       Nothing to be done for DM or WMS clients. Nothing to be done for SAM tests.

-       Next version of YAIM supporting configuration to be released by end of February 2009.

 

-       New Gstat sanity checks will go to production at the end of February 2009.

-       New GridMap will go to production at the end of February 2009.

-       It will size the sites by #LogicalCPUs (Cores) or Installed Capacity (SI2K), as defined in the document.

-       It will have a button to show the OSG sites contributing to WLCG (OSG sites are shown if they are listed in SAM and also in the BDII).

-       It will allow one to interactively explore the PhysicalCPU and LogicalCPU numbers of sites (this helps to check whether the values are set correctly).

-       If we get the online data feed for WLCG topology information in time, we'll also add this as the source for the "tiers" button. This will then show WLCG sites only.

 

In principle APEL will produce the first reports for Tier-2s in April 2009. These are just test reports to verify that the system as conceived can work.

Official reports will be produced later on in the year.

 

The described operational plan has been approved by the EGEE side of WLCG. Engagement with OSG has started.

Progress is being made. A list of the current concerns and ideas is being created.

 

A concrete plan is being worked on. OSG’s top concerns regarding deployment:

-       The ability to "override" any final numbers collected in APEL which might be incorrect.

-       USATLAS has requested that they not send any data via the BDII which could result in jobs matching their sites in a WMS. 

-       USCMS appears to be interested in having separate data paths between the "normal" BDII, which might be used by meta-schedulers, and any BDII used for accounting data.

 

I.Bird asked why GridMap is involved in the plan.

F.Donno replied that GridMap can be used as a validation tool to visually check the amount of resources reported.

 

I.Bird asked whether storage accounting and the reporting of installed CPU capacity are ready in APEL.

J.Gordon replied that the installed CPU capacity per VO is not complex, with 4 values per Site to publish; this can be produced directly from APEL.

 

F.Hernandez asked whether both CPU units are going to be published.

F.Donno replied that, in order to track the change of unit, it is best to have both units: one can then monitor who is using the new and the old benchmarks. The unit in the report is mandatory, otherwise one does not know who is using which unit.

 

J.Templon added that IN2P3 can migrate to the new unit. The new field should have the “value-HEPSPEC” specified, while in the old field the unit is still the old SPECint2K; therefore the values remain comparable.

 

I.Bird added that Sites should also re-benchmark the typical hardware they already have. In this way a Site can migrate to the new unit for all resources, not just for the newly acquired hardware.

 

7.   AOB

 

 

I.Bird reported that the Experiments’ spokespersons (or their representatives) have been invited to the F2F Meeting on 10 February to discuss the LHC schedule that will be defined at the workshop in Chamonix on Friday.

 

8.    Summary of New Actions

 

 

New Action:

R.Pordes agreed to provide, within 2 weeks, the milestones for OSG reporting installed capacity into APEL.