LCG Management Board

Date/Time

Tuesday 28 April 2009 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=55736

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 2.5.2009)

Participants

A.Aimar (notes), D.Barberis, O.Barring, I.Bird(chair), Ph.Charpentier, L.Dell’Agnello, D.Duellmann, M.Ernst, I.Fisk, S.Foffano, Qin Gang, J.Gordon, M.Kasemann, M.Lamanna, P.Mato, P.McBride, G.Merino, S.Newhouse, A.Pace, H.Renshall, Y.Schutz, J.Shiers, O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 5 May 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes.

The minutes of the previous MB meeting were then approved.

1.2      Tier-2 SAM Reports (CPU count, etc) (Tier2_Reliab_200903.pdf)

A.Aimar distributed the Tier-2 SAM reports. Some Sites show zero as CPU count because the information is taken from the new Installed Capacity mechanism adopted last month. The situation will have to be verified with the new reports for April (next week).

 

J.Gordon objected that the request to update the Installed Capacity is being fulfilled by the Sites; it is GridView that is not reporting all the information correctly, which is why some Sites show zero as CPU count. He is following up the issue with S.Traylen.

1.3      VO Tier-1 SAM Reports: Comments Received (Comments.pdf; Summary-Tier1-Avail.pdf; Tier1_Reliab_200903.zip)

The VOs were requested to comment on their SAM reports for March. See attachment.

 

The only major “unknown” period was for LHCb in the middle of March, as explained by R.Santinelli:

 

“There is a long gray period from 11th to 16th of March commonly to all T1's. This is because SAM suite was not submitting because the [CERN] WMS were sick. (Remedy ticket open, patch applied).”

 

I.Bird asked that at the next F2F meeting the VOs comment on their April SAM reports.

 

2.   Action List Review (List of actions) 
 

  • VOBoxes SLAs:
    • CMS: Several SLAs still to approve (ASGC, IN2P3, CERN and PIC).
    • ALICE: Still to approve the SLA with NDGF. Comments exchanged with NL-T1.

No progress.

  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. The best would be to have the information in the form already provided by LHCb (Dataflow from LHCb).

The Experiments agreed to present their dataflow and rates at May’s GDB.

  • M.Schulz will summarize the situation of the User Analysis WG in an email to the WLCG MB.

Not done.

  • I.Bird will look for a chairperson and also distribute a proposal for the mandate of the WLCG Technical Forum, together with a reminder of the GDB’s mandate.

Later in this meeting.

  • 5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

Next week.

On User Accounting:

  • Tier-1 Site should start publishing the UserDN information.
  • Countries should comment on the policy document on user information accounting
  • The CESGA production Portal should be verified. J.Gordon will check and send again the information to the MB on which portal to use (prod or pre-prod).
  • Actions for moving to the new CPU unit
    • Convert the current requirements to the new unit.

Done by S.Foffano.

    • Tier-1 Sites and the main Tier-2 Sites should buy the license for the benchmark.

To be done.

    • A web site at CERN should be set up to store the values from WLCG Sites.

To be done.

    • A group should prepare the plan for the migration of the CPU power published by Sites through the Information System.
    • Pledges and Requirements need to be updated.

 

 

3.   LCG Operations Weekly Report (Slides) – D.Duellmann

Summary of status and progress of the LCG Operations. It covers the activities since the last MB meeting.

The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      GGUS Summary

There were no alarms this week, and one can see below that the level of tickets, after a low number over Easter, is back to normal.

 

 

For Site availability, one can see below some issues for LHCb at CNAF and NL-T1.

 

L.Dell’Agnello added that he will investigate the issue and Ph.Charpentier noted that there is a GGUS ticket open.

 

J.Templon commented that a critical network switch was not on a UPS, so a power outage caused the switch to go down.

 

 

Other issues this month were:

-       ASGC cleanup for ATLAS
The LFC was purged, isolating ATLAS and adding redundancy. The Tier-2 cleanup is ongoing with help from DPM support.

 

D.Duellmann noted that in general Sites should not hesitate to contact service support for help.

 

-       CC-IN2P3:
Service Incident Report received (see below)

 

-       CERN: Significant performance boost in CASTOR after memory upgrade (8->16GB/node) for repack stager database.

 

Note: All sites should regularly review DB clusters for memory utilisation / swapping during increased load periods.

 

Common DB Monitoring: a common DB monitoring framework was proposed at the DB workshop at PIC. Sites can install a small agent that sends monitoring information to a central database, where the data can be compared and analysed and KPIs can be extracted.

The CASTOR DB teams should agree to install these agents.
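
As an illustration only, below is a minimal sketch of what such a site-local agent could look like, assuming a hypothetical central collector that accepts metric samples over HTTP; the endpoint URL, site name and metric list are all invented for the example and the values are placeholders.

```python
# Minimal sketch of a site-local DB monitoring agent (hypothetical example).
# It samples a few database metrics and pushes them to a central collector so
# that values from all Tier-1 sites can be compared and KPIs derived.
import json
import time
import urllib.request

CENTRAL_COLLECTOR = "https://db-monitoring.example.org/api/metrics"  # hypothetical
SITE_NAME = "CNAF"          # would be configured per site
SAMPLE_INTERVAL = 300       # seconds between samples


def collect_metrics():
    """Gather a small set of local DB metrics.

    A real agent would query the Oracle instance (e.g. v$ views) through the
    Oracle client; placeholder values keep this sketch self-contained.
    """
    return {
        "site": SITE_NAME,
        "timestamp": int(time.time()),
        "sessions_active": 42,         # placeholder value
        "buffer_cache_hit_pct": 98.7,  # placeholder value
        "redo_mb_per_min": 12.3,       # placeholder value
    }


def push(metrics):
    """POST one metrics sample to the central collector as JSON."""
    req = urllib.request.Request(
        CENTRAL_COLLECTOR,
        data=json.dumps(metrics).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


if __name__ == "__main__":
    while True:
        push(collect_metrics())
        time.sleep(SAMPLE_INTERVAL)
```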

3.2      IN2P3 MSS Outage 20th April

Below is the IN2P3 SIR report.

 

Duration: 12 hours

Date: April 20th 2009

Description: Hardware failure of the robotic library inducing a global outage of the MSS.

Impact: MSS

 

Batch was unavailable for any job depending on MSS.

Local backup service interrupted during the outage.

Estimated 18% shortfall of running (non-grid) jobs during outage (jobs locked in queue).

 

Timeline of the Incident

Monday April 20th

11:12 incident report opened against our robotic supplier

12:15 robotic supplier on site – hardware and software checks start

18:00 final tests and diagnosis

19:30 hardware change

22:30 robotic library in operational state

23:10 MSS system accessible in degraded mode

Tuesday April 21st

11:30 Full usage of the MSS restored

 

Full report at https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents

 

 

The investigation is still ongoing at IN2P3 with their hardware providers.

3.3      “Big ID” problem with Oracle in CASTOR

There is now a better understanding of the pre-conditions under which the problem occurs. It affects the Oracle server-side data buffer when using “returning into” from OCCI with bulk operations.
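
For context, the pattern in question is a bulk DML statement with a “RETURNING ... INTO” clause that hands generated identifiers back to the client in one round trip. CASTOR does this from C++/OCCI; the sketch below only illustrates the equivalent pattern in Python with cx_Oracle against a hypothetical table, and is not the actual CASTOR code.

```python
# Illustrative sketch of a bulk insert with "RETURNING ... INTO" (the kind of
# operation affected by the "Big ID" problem). Table, column names and
# credentials are hypothetical.
import cx_Oracle

conn = cx_Oracle.connect("user/password@castordb")  # placeholder connect string
cur = conn.cursor()

rows = [("file_a",), ("file_b",), ("file_c",)]

# One out-bind variable collects the generated id for each row of the batch.
ids = cur.var(int, arraysize=len(rows))
cur.setinputsizes(None, ids)

cur.executemany(
    """INSERT INTO request (file_name)
       VALUES (:1)
       RETURNING id INTO :2""",
    rows,
)

# Each batch entry returns a list of ids; a corrupted buffer would surface
# here as an implausibly large ("big") id value.
for i, (name,) in enumerate(rows):
    print(name, "->", ids.getvalue(i))

conn.commit()
```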

 

We cannot yet exclude a more general Oracle consistency issue. A test case has been developed and the problem is now reproducible in minutes.

 

More Big ID cases were observed at CERN and T1s. So far they cause failed CASTOR requests, but no data loss.

 

Further investigations are ongoing with Oracle development and several workaround scenarios are being discussed.

Finding the optimal one (in terms of risk and effort) requires more detail from Oracle about their knowledge of the corruption process.

 

CERN’s direct contact into Oracle development is being used to speed up the information exchange, but the priority of this Oracle fix may need to be raised.

3.4      Other Service Issues

LHCb : LFC use via CORAL:

-       Issues with DB credential look-up from LFC: discussions between the LHCb, LFC and CORAL teams on possible optimisations. Is LHCb accessing the correct LFC instance? Read-write versus read-only LFC? Use the Site-local LFC instead of the Tier-0 instance?

ALICE: pre-certification WMS 3.2 deployed at two sites (GRIF).

-       Issues observed with the job wrapper were fixed.

 

CERN/CASTOR: a single user caused an unintended “DoS attack”.

-       Better CASTOR user monitoring tools in 2.1.8 will address this area

3.5      DB Workshop at PIC

Slide 9 shows the summary of the DB Workshop at PIC (Agenda); all Tier-1 Sites participated.

It would probably be good to have a report at the next GDB meeting.

 

The main outcome is that the Experiments agreed that the allocated DB resources are in place for STEP09 and for the 2009/10 activities.

 

Main outcomes

 

• Allocated DB resources for STEP'09 and for the 2009/2010 run

• Positive test and deployment reports for DB set-ups on RHEL 5 – CERN upgrade by summer 2009

• Tracking and announcing procedures reviewed and confirmed

• Good progress with the AMI replication preparations between IN2P3 and CERN

• Procedures and responsibilities for re-instantiation of DBs between Tier-1 sites confirmed

• Final NDGF database setup being set up in Oslo - DataGuard-based migration from Helsinki

• Pilot queries being proposed by ATLAS to throttle DB workload

• Failover LFC database based on DataGuard between CNAF and ROMA1

• No plans to upgrade to Oracle 11 before the LHC run - but 11.2 beta evaluations ongoing

• Several CASTOR DB operational issues and procedures discussed

• Define a standard CASTOR DB configuration between Tier-1 sites (test setup at Tier-0)

 

 

 

4.   C-RRB Meeting and Next Steps (Slides) – I.Bird

 

I.Bird summarized the C-RRB meeting that took place in the morning:

4.1      C-RRB Meeting

The updated requests were presented as discussed, as well as a preliminary C-RSG report concluding that the 2008/9 pledges should be sufficient for 2009/10.

 

This was strongly contested by the Experiments, and all agreed that more work was needed to really understand the models and assumptions.

 

The DG proposed that Experiments and RSG work together to resolve differences and make a recommendation by the summer. This is needed in any case for September to make decisions for 2010 resources.

Next steps:

-       WLCG should schedule a meeting with all concerned parties: Experiments, C-RSG, and LHCC referees.

-       WLCG also needs to discuss how to present the estimated requirements for 2011/12 (by October), because the funding agencies are asking for these now, even though the LHC schedule is not yet defined.

 

D.Barberis and M.Kasemann noted that the LHC schedule is needed in order to make better calculations: the Experiments need to know whether there is going to be a long shutdown or not. Such changes can completely alter the Experiments’ estimates of resource requirements.

 

I.Bird noted that this is one more reason to have the LHCC Referees really involved in this requirements process.

All Experiments replied that, until now, the LHCC Referees have had very little presence and communication with the Experiments.

 

J.Templon added that it is quite normal that the RSG finds it strange that the requirements are almost unchanged despite the delay.

 

Ph.Charpentier noted that the estimates proposed by the RSG should be clarified with them. They arrive at very different values, and one needs to understand how the Experiments calculate theirs. The RSG uses too simple a model for the calculations; it worked last year but not in this case.

 

M.Kasemann noted that many changes are due to a different usage of disk (higher than planned) and of tape systems (almost unused). If there are major differences, the funding agencies will have to decide at the RRB whether to accept the Experiments’ estimates or those of the RSG.

 

Proposal:

I.Bird agreed with the comments and proposed that:

-       The Experiments work independently and directly with the Referees and the RSG group to address each issue, the assumptions made, and the major discrepancy that results.

-       The LHCC Referees should keep the LHCC informed on the progress of the discussions with the Experiments.

-       A report is needed during the summer and a deadline should be defined: mid-June is the proposed date.

 

I.Bird reported that the other issues mentioned at the RRB meeting were the following:

-       The EGEE–EGI transition is critical: it must be ensured that the WLCG service is not disrupted. WLCG must discuss the NGI plans in the upcoming GDBs.

-       Monitoring of resources and tools must be automated. “These will be essential in understanding usage and efficiency and in particular bottlenecks and experiment model problems”

 

D.Barberis noted that the mandate of the RSG also includes verifying that the Sites actually provide the amount of resources pledged.

This part of the mandate has not yet been implemented by the RSG and they should be reminded of it. The RSG is focusing only on checking the requirements from the Experiments, not on the pledges from the Sites or the efficiency of the resources provided.

 

5.   Follow-up WLCG Technical Forum (Slides; More Information) – I.Bird

 

As agreed previously I.Bird presented the mandate of the GDB and the proposed mandate for the new Technical Forum.

 

The text from the slides is reproduced below:

 

GDB

 

The GDB has several roles, including:

-       to make agreements between the resource centres and the experiments (VOs) - in practice this will require background agreements with other organisations that run various parts of the infrastructure - such as grid operations, certificate authorities, organisations providing VO management, network providers, etc.;

-       to make agreements between resource centres;

-       to make agreements between resource centres and CERN;

-       to review schedules, service performance, resource utilisation, etc.;

-       to exchange information.

 

From TDR:

-       The Grid Deployment Board (GDB) is the forum within the Project where the computing managements of the experiments and the regional computing centres discuss and take, or prepare, the decisions necessary for planning, deploying, and operating the LHC Computing Grid. Its membership includes: as voting members — one person from each country with a regional computing centre providing resources to an LHC experiment ..., a representative of each of the experiments; ... The GDB reports to the LCG Management Board.

 

Technical Forum

 

-       Forum in which to discuss longer term needs and developments of middleware and other services

-       Must prepare for the sustainability and evolution of the existing middleware in the light of changing technology and experience.

-       Can we think again of common solutions in some areas where it is clear that existing solutions are weak?

-       It needs to represent all the stakeholders – experiments, sites, grid projects, etc.; but should not be too large.
Depending on the topic – bring in the appropriate experts

 

-       It does not take decisions – should produce clear documents for discussion in the GDB, and potential agreement in the MB

 

-       I think we need a forum within which to discuss how we drive the next evolution of the middleware and other key services – and provide a coherent WLCG view

-       Should it be a one-off (write document and stop) – like hepcal/baseline etc?

-       Should it be a “standing” group – like the AF?

 

 

J.Templon asked where the planning of software deployment fits; it should be in the GDB mandate.

I.Bird replied that the first GDB bullet, “agreements between resource centres and Experiments”, covers this role, and several other bullets cover the same aspects.

 

M.Kasemann stressed that developers in the Experiments should also be included in the discussions of this Technical Forum. The GDB can be involved to explain the proposals, which would then be approved by the MB.

 

Ph.Charpentier noted that short-term problems should also be discussed and solved. It should not be a one-off committee but a permanent, regular forum that also addresses and solves short-term issues. Escalating everything is not always necessary; for instance, the move to SRM2 in the LCG tools was not an item for the MB.

 

J.Templon suggested looking at the GAG mandate and membership.

 

I.Bird summarized the discussion as:

-       Should be a standing group

-       Not necessarily always meet in person

-       A core group, plus invited experts

-       A good chairman must be found

 

6.   MSS Sites Metrics (Slides; WLCG-T1-MSS Metrics.pdf) - Sites' Round-table

 

Sites have replied to the request about which MSS metrics can be collected automatically.

Their replies are collected in this document WLCG-T1-MSS Metrics.pdf.

 

Some Sites provide URLs, others want to update the wiki manually, and for others plugins could be created.

The goal is to be able to fetch these values and show them on a single web page, during STEP09 and afterwards.
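
As a rough sketch of what such a single summary page could look like, the example below assumes each site exposes its metrics at a simple JSON URL; the URLs, site list and metric field names are all hypothetical, and sites that update the wiki manually or provide plugins would have to be merged in separately.

```python
# Sketch of a collector that fetches per-site MSS metrics and renders them on
# a single HTML page. Site endpoints and metric field names are assumptions.
import json
import urllib.request

# Hypothetical per-site endpoints publishing tape/MSS metrics as JSON.
SITE_ENDPOINTS = {
    "CERN":  "https://castor.cern.ch/mss-metrics.json",
    "CNAF":  "https://storage.cnaf.infn.it/mss-metrics.json",
    "IN2P3": "https://cc.in2p3.fr/mss-metrics.json",
}

FIELDS = ["data_written_tb", "data_read_tb", "active_drives", "mounts_per_hour"]


def fetch(url):
    """Fetch one site's metrics; return None if the site is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)
    except Exception:
        return None


def render(results):
    """Render a minimal HTML table with one row per site."""
    rows = []
    for site, metrics in sorted(results.items()):
        if metrics is None:
            cells = "<td colspan='%d'>no data</td>" % len(FIELDS)
        else:
            cells = "".join("<td>%s</td>" % metrics.get(f, "-") for f in FIELDS)
        rows.append("<tr><td>%s</td>%s</tr>" % (site, cells))
    header = "".join("<th>%s</th>" % f for f in FIELDS)
    return "<table><tr><th>Site</th>%s</tr>%s</table>" % (header, "".join(rows))


if __name__ == "__main__":
    results = {site: fetch(url) for site, url in SITE_ENDPOINTS.items()}
    print(render(results))
```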

 

L.Dell’Agnello reported that CNAF has internal monitoring but updates the wiki page at CERN manually.

O.Barring noted that CNAF already has the necessary Nagios plugins for the CASTOR MSS tape metrics.

 

M.Lamanna noted that the work done, for instance, for ATLAS could be reused.

I.Bird agreed but noted that one needs the global view of the Sites, not the VO-specific one.

 

7.   AOB

 

7.1      User Accounting (Sites Publishing UserDN)

J.Gordon attached the result of a query showing which Sites have been publishing UserDN information this year.

He will look into providing a more regularly updated page.

 

The information on the Portal is, of course, visible only to those who have the authorization or have run some jobs.

 

8.    Summary of New Actions