LCG Management Board

Date/Time

Tuesday 11 November 2008 16:00-18:00 – F2F Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=39179

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 16.11.2008)

Participants

A.Aimar (notes), D.Barberis, O.Barring, I.Bird (chair), D.Britton, F.Carminati, Ph.Charpentier, L.Dell’Agnello, M.Ernst, X.Espinal, I.Fisk, D.Foster, J.Gordon, F.Hernandez, M.Kasemann, M.Lamanna, H.Marten, P.Mato, A.Pace, B.Panzer, Di Qing, M.Schulz, Y.Schutz, J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 18 November 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

A clarification from F.Hernandez was added to section 3 of the minutes of the previous meeting; the proposed change was applied to the minutes right away.

 

The minutes of the previous MB meeting were then approved. 

1.2      Site Reliability Reports for OPS and VOs (Reliability Reports Oct 2008)

Starting this month, the MB will also review the VO-specific SAM test results.

 

I.Bird added that these VO results, and the tests used to calculate them, need to be reviewed, just as was done for the OPS tests in 2007.

 

Action:

VOs and Sites should review the VO tests and results.

 

H.Marten asked that, in addition to the summaries in PDF format, the data also be provided in a format that can be extracted and exported, to Excel for instance.

 

J.Gordon and I.Bird added that once the VO SAM tests and data are agreed in detail, an overview can be provided. For the moment the details need to be validated by each VO and Site.

 

Action:

A.Aimar should distribute to the MB list where/how the reliability data can be extracted/exported from GridView into other formats.

 

2.   Action List Review (List of actions) 
 

  • For the ATLAS Job Priorities deployment the following actions should be performed :

-       DONE. A document describing the shares wanted by ATLAS

-       DONE. Selected sites should deploy it and someone should follow it up.

-       ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end

 

M.Lamanna and D.Barberis reported that ATLAS will prepare a request to WLCG: there are 4 user groups that should be recognised by the Sites for production, analysis and local communities. The solution adopted is much simpler than originally planned.

 

This action can be considered done.  

  • SCAS Testing and Certification

 

To be reported in one of the next meetings.

  • F.Donno will distribute a document describing how the installed capacity accounting is collected; it should describe in detail the proposed mechanism for sites to publish their inhomogeneous clusters using the current Glue 1.3 Cluster/SubCluster schema structures.

Not done yet. Will be discussed at the GDB on the following day.

  • P.Mato should report to the MB the progress on the SL5 testing by the Experiments.

Done later in this meeting.

 

 

3.   LCG Operations Weekly Report (Slides) – J.Shiers

Summary of the status and progress of LCG Operations; this report covers the last two weeks.

The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Overview

The overall goal is to move rapidly to a situation where weekly reporting is largely automatic, so that one can instead focus on and follow up the exceptions. There is now a table of Service Incident Reports (see slide 2) but one needs to follow up on each case – the follow-up cannot be automated.

 

It is proposed to add a table of alarm/team tickets and their timeline, plus a summary of scheduled/unscheduled interventions, including a cross-check of the advance warning against the WLCG/EGEE targets (e.g. one cannot “schedule” a 5-hour downtime 5 minutes beforehand).

 

3.2      SIRs Received

Four site incident reports were received this week; the shortest incident lasted 10 hours, the longest 2 weeks.

 

Site  | Date   | Duration  | Service         | Impact
------|--------|-----------|-----------------|-------
PIC   | 31 Oct | 10 hours  | SRM             | Down
NL-T1 | 21 Oct | 12 hours  | Most            | Down
ASGC  | 25 Oct | ?Days?    | CASTOR          | Down
SARA  | 28 Oct | ?7 hours? | SE/SRM/tape b/e | Down

 

No other SIRs for which a report was requested are awaited this week.

 

Other issues of the week:

-       VOMS service at CERN: just a 5-minute loss of service and 2 hours of degraded service (see weekly minutes)

-       Long report on Monday from ATLAS, including dCache issues (pnfs performance and number of entries per directory)

-       Frequent ORA-14403: cursor invalidation detected after getting DML partition lock (RAL)

-       Additional double disk failure affecting another DS at RAL

-       Another “big ids being inserted into id2type”

-       We had 12 occurrences of ORA-00001 errors in the stager, all generated by PtG requests on the same thread (TID=24).  They all came to get one of 2 files. (RAL)

-       Problem with the LHCb T0D1 DS outstanding for one month

 

Slides 5 to 12 show the problems found at PIC, NL-T1 and ASGC.

 

The discussion focused on the fact that there must be methods to contact other sites by phone in case of power problems. Each site should have the other sites’ emergency contacts available.

 

I.Bird reminded the MB that every Tier-1 must have these emergency procedures in place, including whom to contact and how. Every site should have an emergency plan for power failures.

3.3      ASGC Issues

ASGC had a major Oracle problem, and the solution was found during a conference call with CASTOR and Oracle experts at CERN. ASGC needs a full FTE to manage the Oracle databases, ideally a DBA holding an Oracle Certified Professional certificate.

 

ASGC should also participate in the Distributed Databases meetings and in the CASTOR conference calls (ASGC currently never attends those meetings). The conference calls with Asia-Pacific should probably restart.

 

When should issues be escalated from a Site to become a general issue? It should never again take 2 weeks: one day at most.

 

The MB agreed on the following proposal:

-       In the case of a major service degradation or outage for more than one working day, the (Tier0, Tier1) site must provide at least some (useful) information on the problem and state of follow-up by the daily WLCG operations call one working day later – and make every effort to attend that meeting

 

There will be a Workshop this week and many other issues will be discussed. It is clear that there is still much room for improvement, and the continued goal must be to get the services as reliable as possible as soon as possible – in particular, making the most of the remaining period of EGEE III funding.

 

Target: using metrics discussed and agreed at the MB (or elsewhere where appropriate), the weekly operations report should “normally” (i.e. at least 3 times per month) have no specific problems or anomalies to discuss.

3.4      Targets for Services

Targets (not commitments) proposed for Tier0 services

-       Similar targets requested for Tier1s/Tier2s

-       Experience from first week of CCRC’08 suggests targets for problem resolution should not be too high (if ~achievable)

-       The MoU lists targets for responding to problems (12 hours for T1s)

 

-       Targets for Tier1s: 95% of problems resolved <1 working day?

-       Targets for Tier2s: 90% of problems resolved < 1 working day?

-       A post-mortem should be triggered when targets are not met!

 

Time Interval | Issue (Tier0 Services)                                    | Target
--------------|-----------------------------------------------------------|-------
End 2008      | Consistent use of all WLCG Service Standards              | 100%
30’           | Operator response to alarm / call to x5011 / alarm e-mail | 99%
1 hour        | Operator response to alarm / call to x5011 / alarm e-mail | 100%
4 hours       | Expert intervention in response to above                  | 95%
8 hours       | Problem resolved                                          | 90%
24 hours      | Problem resolved                                          | 99%

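As an illustration of how such response-time targets could be checked against measured data, a minimal sketch follows; the helper function and the response times are hypothetical examples, not part of any WLCG tool:

```python
# Hypothetical sketch: checking operator response times against the
# Tier0 targets in the table above (99% within 30', 100% within 1 hour).
# The response times below are illustrative, not real measurements.

def fraction_within(times_min, limit_min):
    """Fraction of responses that arrived within limit_min minutes."""
    return sum(t <= limit_min for t in times_min) / len(times_min)

responses = [5, 12, 25, 28, 40, 8, 15]  # minutes to operator response

meets_30min_target = fraction_within(responses, 30) >= 0.99
meets_1hour_target = fraction_within(responses, 60) >= 1.00
print(meets_30min_target, meets_1hour_target)  # → False True
```

With these invented numbers, one response out of seven exceeds 30 minutes, so the 99% target is missed while the 1-hour target is met.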
 

J.Templon noted that the last year of EGEE funding should be spent strengthening the existing services, and asked which ones in particular.

J.Shiers suggested that (1) data management and (2) software deployment would be two useful topics for the sites to work on. If they want to participate, training can be organized. New development should be postponed until there is a stable set of services at the Sites.

 

J.Gordon noted that some services have received urgent requests from the Experiments even though there will not be time to deploy them (e.g. FTS 2.2).

 

 

4.   Milestones Review / New Milestones (HLM_20081107;   Slides) – I.Bird

 

Before discussing new milestones the MB reviewed the existing incomplete VOBoxes milestones in the current dashboard (HLM_20081107).

4.1      VOBoxes Milestones

The milestones on VOBoxes are still to be completed at several Sites:

-       CERN is waiting for the SLAs to be approved by the Experiments.

-       At NL-T1 the document is still under discussion and should be completed.

-       NDGF was not present at the meeting.

-       IN2P3 is waiting for feedback from CMS; feedback from LHCb has been received.

 

New Action:

VOBoxes SLAs:

-       Experiments should respond to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).

-       NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

4.2      Middleware Milestones

Slide 2 shows the Middleware milestones that were presented at the MB and that should be defined in terms of future dates. O.Keeble should report on them.

 

 

  • SL5 WN available (actually define SL5/compiler combination)
    • Certified and available for deployment
    • Tested by experiments
  • FTS on SL4 – deployed at Tier 1s
    • On SL5 – available for deployment
  • SCAS
    • Certification and avail for deployment
    • Verification by ATLAS and LHCb (others?)
    • Deployed at Tier 1s; deployed at Tier 2s
  • Pilots – framework reviews finished?
  • CREAM
    • Verify can be used to replace LCG-CE

    • Deploy at a few Tier 1s; verify use by experiments/sites (manageability)

    • Availability of WMS/ICE
    • Availability of Condor_g client for CREAM
  • Installation mechanism – after GDB discussion
  • SRM – short term solutions (end of year):
    • Availability/deployment (for each system)

 

4.3      Accounting Milestones

The milestones on accounting should also be set.

 

 

  • Accounting reports
    • T2 reports (add wall time, etc.)
  • Update of info providers for
    • Cpu installed capacity
    • Storage capacity
  • Reporting – updates to APEL/portal
  • ? Status of user-level accounting?
  • Benchmarks (by the WG)
    • Set up wiki with benchmark method, conversion proposal, measured benchmarks
    • Convert experiment requirements
    • Convert pledges

 

 

J.Gordon reported that User Level Accounting is available but visible only from the development CESGA portal. It will be publicly available in the next few weeks. Many sites are not yet publishing correct data, and other sites want to know the policies for protecting the privacy of this user data.

4.4      Reliability Metrics

 

  • Validate experiment SAM tests
    • Sign-off on set of experiment-defined critical tests to be used to measure availability
  • Targets for VO-specific reliabilities?
    • Start high? 98%?? For Tier 1s
  • Reporting of Tier 2 federation (weighted) reliabilities
    • Target 95% (now)
  • Update OPS tests:
    • For SRM v2
    • Better combination of services (need a proposal)
  • Nagios (or equiv) installed at Tier 1 and Tier 2 sites,
    • so that sites can receive alarms and problem notification
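The Tier-2 federation (weighted) reliability mentioned above can be sketched as follows; the weighting by pledged capacity and all the numbers are illustrative assumptions, not the official WLCG algorithm:

```python
# Hypothetical sketch: a Tier-2 federation reliability computed as the
# average of the member sites' reliabilities, weighted by pledged capacity.
# Site names, capacities and reliabilities below are invented examples.

def federation_reliability(sites):
    """Capacity-weighted average reliability of a federation."""
    total = sum(capacity for _, capacity, _ in sites)
    return sum(capacity * rel for _, capacity, rel in sites) / total

federation = [
    # (site, pledged capacity, monthly reliability)
    ("Site-A", 400, 0.97),
    ("Site-B", 250, 0.92),
    ("Site-C", 150, 0.99),
]

print(round(federation_reliability(federation), 3))  # → 0.958
```

A federation at 0.958 would meet the 95% target quoted above even though one member site, taken alone, would not.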

 

 

J.Gordon noted that the data from the Tier-2 Sites should be verified because not all of them use the same policy. For instance, the Finnish Sites report wall-clock time and therefore show 100% efficiency.
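J.Gordon’s point can be illustrated with the usual definition of CPU efficiency; this is a sketch and the job numbers are invented:

```python
# Hypothetical sketch: CPU efficiency as commonly defined in accounting,
# efficiency = CPU time / wall-clock time. A site that publishes its
# wall-clock time as CPU time will always appear 100% efficient.

def cpu_efficiency(cpu_time_s, wall_time_s):
    """CPU time divided by wall-clock time for a job."""
    return cpu_time_s / wall_time_s

# A job using 6 h of CPU over 8 h of wall-clock time:
print(cpu_efficiency(6 * 3600, 8 * 3600))  # → 0.75

# If wall-clock time is reported in place of CPU time:
print(cpu_efficiency(8 * 3600, 8 * 3600))  # → 1.0
```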

4.5      Metrics

 

  • Tape metrics – should start reporting – what?
    • Need to agree (for Tier 1) experiment write and read rates from tape
    • Sites to show that these rates can be achieved
  • Job reliability
    • Look at dashboards for specific metrics to follow?
  • Site dashboards
    • We need these in place
  • Monitoring for Castor, DPM, dCache, etc.
  • Other metrics to monitor performance of Tier 1s and Tier 2s (and Tier 0!)
    • User support response?
    • Service downtimes?

 

 

J.Gordon asked that Sites receive clear information from the Experiments on the rates to be reached.

I.Bird added that each site should propose the metrics for their MSS systems.

 

Next Actions:

The next steps will be to propose some new milestones and target dates.

Proposals for metrics are also needed from the Tier-1 sites, as are data flows from the Experiments.

 

5.   ARDA Final Status Report (Slides) – M.Lamanna

 

M.Lamanna presented a final report on the ARDA project, which spanned the four years of EGEE 1 and EGEE 2.

5.1      Beginning of the Project

A Roadmap for Distributed Analysis (ARDA) was started after a workshop, one of the most attended (Miron, Predrag, Torre, etc.), held at the same time as the OGSI-to-WSDL announcement (and the premature death of GT3).

 

It then became A Realisation of Distributed Analysis (ARDA), with EGEE effort (4 persons) plus 4 matching-funded persons from WLCG. The initial mantra was “Production is understood, Analysis not yet...”

5.2      First Phase

In the first phase, contact with the experiments was via separate contact persons (one per experiment), with a specific agreement on the activity. There were various levels of integration, and the team was sometimes a bit sidetracked.

Exchanges via the ARDA team were very useful.

 

Slides 5 and 6 show the role of ARDA in providing the intermediate layer between the middleware and the experiments’ frameworks.

5.3      Second Phase

The second phase saw closer collaboration with each experiment, also on “critical path” activities. Some prototypes were stopped, for example the analysis system ASAP, because the official tool CRAB gained more and more momentum; but the ARDA contributions were reused, in this specific case in CRAB itself and especially in the monitoring (dashboard).

 

In this second phase other activities, such as the Dashboard, could be expanded to attract more experiments.

5.4      Summary of ARDA activities

ARDA participated and contributed to several projects. Below is a short description of each contribution; see slides (from 9 onwards) for more details and pictures.

 

AMGA - ARDA Metadata Grid Access

-       Metadata catalogue: obvious starting point for ARDA

-       Studied existing systems in the experiments

-       Initially contributed an I/F (it was outside of the scope of JRA1) and a working prototype (endorsed by GAG)

-       Basis for a few interesting contributions to the field (master and PhD students)

-       Has been part of the gLite distribution since 2006

-       Collaborative effort coordinated by the original developer but all effort coming from outside (Catania, Korea, Clermont-Ferrand, etc)

-       Now coordinating the release process, adding new features etc...

 

Adopted by LHCb for their logging and bookkeeping catalogue (used until now; a migration is currently taking place).

 

Great success in EGEE. Some examples:

-       Earth sciences:

-       Climatology (climatology centre DKRZ in Hamburg; also in D-Grid)

-       UNOSAT (Access of satellite images) (see slide 12)

-       Biomedical sciences:

-       WISDOM (in-silico drug searches)

-       Health-e-Child (see slide 11)

-       Digital imaging

-       Non-LHC: partners in EGEE 3 are using their resources on this subject

 

Dashboard

The Dashboard was a major success of the project. It was born as the CMS Dashboard and was then generalized, reusing components of the CMS analysis prototype and with the fundamental contribution of MonALISA.

Its scope has progressively grown: it has been adopted by more experiments, new (non-HEP) VOs are interested, and other activities have been added to the dashboards (data transfer, site status, middleware errors, etc.).

 

Slide 14 shows the CMS dashboard with the number of CRAB analysis jobs, and slide 15 the progress of commissioned sites.

 

Slide 17 shows the data transfer to Tier-1 sites.

 

 

5.5      API Service (ALICE)

Analysis as an interactive service, interfaced with gLite. It was proposed very early for ALICE and inspired other developments in ARDA.

5.6      GANGA

Ganga was a common project (ATLAS and LHCb). ARDA started collaborating on the LHCb side only; after ~18 months, the ARDA/ATLAS contribution joined Ganga.

Due to the EGEE links, considerable interest is present also outside HEP.

 

Adoption is excellent: Ganga is the entry point to the ATLAS (PanDA) and LHCb (DIRAC) systems; in the case of ATLAS it is complemented by pAthena.

User feedback is excellent; the several tutorials paid off.

In addition, communities keep discovering and adopting Ganga without ARDA’s direct involvement (e.g. MINOS, discovered by googling). See slide 27 for the Ganga communities.

 

The slides also show the number of Ganga users per month (in 2007).

 

 

5.7      Legacy and Outlook

ARDA ended smoothly with EGEE 2. CERN plays an important role in EGEE 3, with a slightly smaller effort, building also on the ARDA experience.

 

The persons from the ARDA team are in general still contributing to the LHC experiments and WLCG.

As for the tools:

-       Ganga: centre of the analysis in ATLAS and LHCb

-       Dashboard: more and more used in the experiments and in the infrastructure

 

Experience was not lost, and “ex-ARDA” people are still providing excellent work in the various areas of WLCG.

 

M.Lamanna advocated the approach of close collaboration with the experiments as very positive: Experiments benefit from “back office” collaboration,

-       which leads to true commonality.

Did ARDA fulfil its initial mandate (“production is understood, analysis not yet”)?

-       Not yet: analysis is a moving target.

-       Analysis (in particular high-performance data access) is still evolving.

 

An approach à la ARDA might be useful.

 

D.Barberis expressed his appreciation and thanks for the work done by M.Lamanna and by the whole ARDA project.

 

M.Kasemann added his appreciation for the work done and noted that convergence was reached where it was possible. ARDA had a crucial role in promoting all possible common projects across Experiments. He then asked what will happen now to the separate projects.

 

M.Lamanna replied that the ex-ARDA projects are now ready to continue independently with clear leadership.

 

I.Bird concluded by thanking M.Lamanna for the excellent work in managing the ARDA project, as well as the whole ARDA team for its work.

 

6.   SL5 testing on the Experiments and App. Area Software (Slides) – P.Mato

 

6.1      Initial Hypothesis

Experiment and AA software has stronger dependencies with the compiler than the OS. The OS weak dependency supported by previous migrations (SLC3->SLC4)

 

Compiler upgrades usually offer sizable benefits (performance, better standards compliance) but have always been a difficult task: many code modifications/adaptations are often needed, and mixing libraries from different compiler versions is always problematic.

 

The Architects Forum (AF) agreed on the following 3-phase strategy, to maximize the probability of having validated Experiment software ready when sizeable resources are installed with SLC5.

 

J.Templon asked how much this is CERN specific vs. the standard RHEL and SL versions.

P.Mato replied that there are no CERN-specific dependencies.

6.2      Initial Strategy

  1. Verify that the existing SLC4 (gcc-3.4) binaries run normally on SLC5

-       Should allow running “old releases” of experiment software unchanged on SLC5 nodes 

-       Make sure that all needed backward-compatibility modules are in the current SLC5 distribution

  2. Produce a native build for SLC5 with gcc-3.4 and verify that it works properly

-       We do not expect any major work here since the code runs correctly with this version of the compiler.

  3. Native builds for SLC5 and gcc-4.3 (skipping the native version 4.1 of gcc)

-       gcc-4.3 is a better compiler (stricter, better performance) and will probably be kept for a while

-       The LCG-AA nightly builds need to be made fully operational for SLC5

-       We expect most of the problems here: code adaptations, full data validations, etc. 

-       Requires middleware client libraries to be made available for gcc-4.3

6.3      Experiments Testing

ATLAS

Several as yet unresolved problems remain:

-       The ROOT 5.18.00-based releases are incompatible with SELinux apart from the new 5.18.00f version (against which none of the ATLAS releases is built).

-       When the above is bypassed, reconstruction jobs fail with an OpenAFS problem

-       There appears to be a massive memory leak associated with running 32-bit applications on a 64-bit kernel. Investigations are still ongoing.

 

I.Bird asked whether these memory leak problems are specific to ATLAS or general.

This is unclear for the moment.

 

ATLAS has not yet started phases 2 and 3.

 

CMS

CMS is confident that it can manage Phase 1 well:

-       Back-ported the single patch (for the SELinux and ROOT/Cintex problem) into ROOT 5.18/00a plus patches, which is what they currently use for both the 22X and 30X integration builds

-       Initial tests do not show further problems

-       Better testing will be possible when 3_0_0_pre2 comes out

 

CMS want to skip phase 2 altogether and go directly to phase 3

-       Started some tests with gcc-4.3 last week

-       They need to make more progress before they know how difficult it will be

 

For CMS Online

-       Need to migrate to a 64-bit SLC5 kernel and this implies re-writing a number of CMS-specific drivers

 

LHCb

-       Not working on phase 1 yet; they will profit from the ATLAS and CMS progress

-       As soon as the LCG-AA nightlies for SLC5 are operational, they will start integrating and testing their software (phases 2 and 3)

ALICE

-       All OK for SLC5 in 32- and 64-bit mode and for both compilers, gcc-3.4 and gcc-4.3

-       Fixing all compiler warnings emitted by gcc-4.3

 

Ph.Charpentier asked when the file access libraries (dCache, GFAL, etc.) are going to be ported to gcc 4.3. Multi-platform support is required on the WN machines, i.e. more than one gcc version supported on each machine.

 

M.Schulz replied that SL5 with gcc 4.3 is another platform to support (and to create in ETICS). This platform should also be requested from the EGEE TMB as a priority platform compared to others. For now, the WN software can be ported either within ETICS or manually.

 

M.Schulz explained that EGEE middleware must support the standard compiler of the platform, and for SL5 gcc 4.1 is the system’s native compiler. This policy has never been challenged at the EGEE TMB by the HEP VOs.

There are no resources to support many combinations of OS and gcc compilers; therefore the issue should be raised at the TMB. The native compiler is also used for all the binaries needed for SL5. If another compiler is used, all gLite externals may need to be recompiled.

 

M.Schulz also added that the middleware team has no resources for porting to many other platforms or adding new features.

P.Mato noted that the WN client libraries are sufficient for the Experiments’ software.

 

New action:

The DM and dCache teams should report on the client tools, presenting estimated timelines and issues for porting them to gcc 4.3.

 

7.   Discussion on Disk Procurement – I.Bird

 

The proposal of splitting the disk procurement was raised a few times during the last few weeks; the MB should agree on a policy and timeline valid for all sites.

 

The proposal is to have 50% by 1 April and the other 50% by 1 September or end of September. This means the installation must be done during the summer.

 

 

J.Gordon noted that end September seems late.

 

I.Bird asked for the opinion of the Experiments:

 

-       D.Barberis expressed the worry that the installation might slip too far beyond September.

-       Ph.Charpentier noted that the installations should be finished by the time the reprocessing starts during the winter shutdown.

-       Y.Schutz commented that also ALICE needs the full capacity by September.

-       M.Kasemann asked whether 50/50 is the right split, and whether it is really necessary that the end of September not slip into the end of the year.

 

I.Bird finally proposed, and the MB accepted:

-       50% by 1 April

-       50% by 1 September

 

After being announced to the next OB and RRB committees, and agreed, this change will be introduced in the Annex of the MoU.

 

8.   AOB

 

 

LHCC Referees Meeting

At the LHCC Referees meeting (next Tuesday) the Experiments are invited to present their progress.

 

User Analysis Working Group

The MB asked about progress on the User Analysis working group.

M.Schulz replied that the membership has been defined but the group has not met yet.

 

9.    Summary of New Actions

 

 

Action:

VOs and Sites should review the VO tests and results.

 

New Action:

VOBoxes SLAs:

-       Experiments should respond to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).

-       NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

 

 

Action:

A.Aimar should distribute to the MB list where/how the reliability data can be extracted/exported into other formats.

 

Next Actions:

The next steps will be to propose some new milestones and target dates.

Proposals for metrics are also needed from the Tier-1 sites, as are data flows from the Experiments.

 

New action:

The DM and dCache teams should report on the client tools, presenting estimated timelines and issues for porting them to gcc 4.3.