LCG Management Board

Date/Time:

Tuesday 18 September 2007 16:00-17:00 - Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=17201

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 22.9.2007)

Participants:

A.Aimar (notes), D.Barberis, O.Barring, K.Bhatt, I.Bird (chair), Ph.Charpentier, S.Digamber, T.Doyle, M.Ernst, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, J.Knobloch, M.Lamanna, E.Laure, P.Mato, G.Merino, R.Pordes, Di Qing, H.Renshall, Y.Schutz, Z.Sekera, J.Shiers, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 25 September 2007 16:00-17:00 - Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous meeting were approved.

1.2      24 Sept: LCG-LHCC Referees Meeting (Agenda) - J.Shiers

J.Shiers presented the proposed agenda for the Referees Meeting on the 24th September 2007 and asked for speakers from the Experiments.

 

Update: Below is the updated agenda with all speakers.

 

12:00 

Status of SRM v2.2 Production Service Deployment (20') Jamie Shiers (CERN)

  • Status of the Deployment Schedule
  • WLCG Management Follow-up

 

12:20 

Status of Dress Rehearsal Preparations (40') Harry Renshall (CERN)

  • Service, Site and Experiment Readiness for the above

 

13:00 

Plans for End-User Analysis (40')

  • ATLAS M4 Experience - Outlook (Kors Bos)
  • CMS CSA07 Analysis Plans (Ian Fisk)

1.3      SRM Update – J.Shiers

More detailed information is also available from the WLCG SRM Production Deployment meeting (Agenda, Minutes).

 

The main conclusions of the SRM Production Deployment meeting were:

-       Definition of the production roll-out (CERN and FZK will be first, in early November).

-       A workshop in Edinburgh for the Tier-1 sites and some major Tier-2 sites.

-       The dCache developers agreed to (1) fix the “return code” issues by mid-October and (2) produce, for the workshop, the documentation needed to set up and configure the new features.

-       The goal now is to have CERN and some other Tier-1 sites in production by the end of 2007, and all Tier-1 sites by February 2008.

1.4      Sites Resources Planning

H.Renshall asked that the Tier-1 sites send him their acquisition plans for CPU, disk and tape for the next 6 months.

 

New Action:

21 October 2007 - Sites should send to H.Renshall their resource acquisition plans for CPU, disk and tape until April 2008.

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 10-July 2007 - INFN will send the MB a summary of their findings about HEP applications benchmarking.

Not done. Could be for the F2F meeting in October.

  • 28 August 2007 - SAM tests in July 2007: ALICE will investigate and report on the VO-specific results for all tests.

Done. ALICE: The issues seem fixed. The ALICE-specific SAM results in September are considerably better.

 

F.Hernandez added that it was difficult to find out why the tests were failing and who the contacts for the VO-specific SAM tests were within the experiments. He asked that the sites be informed about the contact persons and that the VO-specific tests be adequately documented.

New Action:
25 Sept 2007 - A.Aimar will collect the information about the VO-specific tests (contacts, test documentation, etc.) and add it to the SAM wiki page.

  • 18 Sept 2007 – Sites and Experiments will send to J.Shiers the name of their representative in the CCRC Coordination team.

Missing: FNAL, ASGC, CMS, LHCb.

Update: At the meeting the following names were proposed: FNAL: I.Fisk, CMS: M.Kasemann, LHCb: N.Brook.

ASGC will send the representative name to J.Shiers.

  • 18 Sept 2007 – Next week D.Liko will report a short update about the start of the tests in the Job Priorities (JP) working group.

No news.

E.Laure added that the Job Priorities status and plans were also going to be discussed at the TCG on the following day.

3.    Simplification of Future Quarterly Reports (Document) - A.Aimar

 

A.Aimar had distributed the proposal for changes to the Quarterly Reports and asked for the MB's approval in order to proceed with the proposal to the Overview Board.

 

Decision:

The MB agrees to propose to the Overview Board some changes to the Quarterly Status and Progress Reports (see Document).

 

 

4.    Sites Reliability Reports (Aug_07; Sept_07; Sites Reports to Complete)

 

For the information of the MB, A.Aimar distributed the Site Reliability data, including the VO-specific SAM results for August and September (up to the 17th).

 

He also asked the sites to complete the availability reports for August 2007 and mail them back to him by Friday 21 September (Sites Reports to Complete).

 

 

5.    Agenda of the LHCC Comprehensive Review (Agenda) - I.Bird

 

 

I.Bird asked for feedback from the MB about the draft agenda distributed by L.Robertson (see email and draft agenda).

5.1      Day 1: Monday

NOTE – All time slots should be seen as a maximum of 75% presentation, with a mandatory 25% for discussion.

 

Monday 19 November 2007

 

09:00 Overview - (40')  Les Robertson

09:40 Resources and Accounting (20')  Sue Foffano

 

10:00 Coffee

10:30->13:00    Stream A - part 1  - Applications

 

 

10:30  Application Area (2h30')   Pere Mato

-          Overview (40')

-          Simulation & generators (30')

-          Core Libraries and Services (30')

-          Persistency Framework Projects (30')

-          Software process (20’)

 

10:30->13:00    Stream B - part 1 - Mass Storage, Fabric,  Networking

 

10:30  Mass Storage Progress (1h30')

-          CASTOR (20')   Tony Cass

-          dCache (20')   Patrick Fuhrmann

-          DPM (20')  Jean-Philippe Baud

-          SRM v2.2 & Experiment Progress (30')   Flavia Donno

12:00   CERN Fabric - Tier-0 + CAF Status - Performance - Reliability (30')   Bernd Panzer

12:30   Networking including the LCG OPN (20')   David Foster

 

13:00  LUNCH

14:00->17:30    Stream A - part 2 – Grid, 3D

 

14:00  Middleware (1h00')

-          Status of the middleware to support baseline services

--  general

--  EGEE specific

-- OSG specific

-- NDGF specific

 

15:00 Grid Deployment (1h30')

-          Middleware Deployment (30')

 

15:30   Coffee

 

-          Operations - EGEE and OSG (1h00')

 

17:00  3D project (30’)

 

14:00->17:30    Stream B - part 2 - Tier-1s, Tier-2s

 

14:00  Tier-1 Status (1h30')

-          Summary of the status of the Tier-1s (30')

-          Reports from 3 Tier-1 sites (1h00')

 

15:30   Coffee

 

16:00  Tier-2 Status (1h30')

-          Summary of the status of the Tier-2s (30')

-          Reports from 3 Tier-2 sites (1h00')

 

17:30     Visit to the CERN Computer Centre

 

 

The agenda has parallel streams on:

-       Applications Area, Mass Storage and Fabrics (Monday AM)

-       Middleware Status and Tier-1 and Tier-2 sites Status (Monday PM)

 

For the morning sessions P.Mato will organize the Applications Area speakers; the speakers for MSS and Fabrics are already defined.

 

For the afternoon sessions C.Grandi will organize the Middleware streams. Speakers are missing for the Tier-1 and Tier-2 Sites Status.

 

I.Bird asked for volunteers for:

-       Two speakers to summarize the status of the Tier-1 and Tier-2 sites

-       Two Tier-1 sites from different regions (maybe one EGEE, one OSG)

-       Two Tier-2 sites from different regions (one EGEE, one OSG)

 

R.Tafirout volunteered to provide the presentation of the TRIUMF Tier-1 site.

 

R.Pordes agreed to provide the summary of the OSG Tier-2 sites.

 

The UK could provide a summary of the UK Tier-2 sites.

 

New Action:

25 Sept 2007 - I.Bird will propose speakers for the Tier-1 and Tier-2 presentations at the LHCC Comprehensive Review.

 

5.2      Day 2: Tuesday

Tuesday 20 November 2007

 

08:30->12:30    Service & Experiment Status

 

08:30  Service Overview (40') Jamie Shiers

 

09:20  Experiment Status Overview (40’)

 

10:00 Coffee

 

10:30 Experiment-specific sessions – demonstrations or presentations

-          ALICE (30min.)

-          ATLAS (30 min.)

-          CMS (30 min.)

-          LHCb (30 min.)

 

12:30  Management, planning and communications (20')

 

 

The second day includes:

-       LCG Services Overview, by J.Shiers,

-       Overview of the Experiments Status

-       Individual presentation of each LHC Experiment with demonstrations

 

The demonstrations/presentations should include how the experiments execute and monitor the main tasks on the grid.

 

Volunteers are needed for the Experiments overview and presentations.

 

M.Kasemann volunteered to provide the Overview for the LHC Experiments.

 

R.Pordes added that at the OSG review CMS demonstrated the submission, running and monitoring of (short) jobs.

The Experiments replied that such jobs would take long and that slots should be reserved in advance to make sure they finish in time.

 

New Action:

28 Sept 2007 - The Experiments should propose the speakers and the content of the demonstrations for the LHCC Comprehensive Review.

 

D.Barberis asked whether, as for the Experiments, future LHCC reviews will have a different format and whether the MB will propose some changes.

 

R.Pordes asked that, if possible, the review be scheduled in rooms where a remote phone connection is available.

I.Bird replied that this will be taken into account but cannot be guaranteed at the moment.

 

 

6.    Computation of Site Availability Metrics in GridView (Paper; Slides) - S.Digamber

 

S.Digamber presented the new method of computation of the availability metrics in GridView.

The details are better described in a document available here.

 

Reported below are extracts of the presentation (see Slides) and of the discussion at the meeting.

 

Gridview computes Service Availability Metrics per VO using SAM test results.

The Computed Metrics include:

-       Service Status

-       Availability

-       Reliability

 

The above Metrics are computed:

-       per Service Instance

-       per Service (e.g. CE) for a site

-       per Site

-       Aggregate of all Tier-1/0 sites

-       Central Services

-       over various periodicities like Hourly, Daily, Weekly and Monthly

 

The status of a service instance, service or a site is the state at a given point in time (e.g. UP, DOWN, SCHEDULED DOWN or UNKNOWN).

 

The Availability of a service instance, service or a site over a given period is defined as the fraction of time it was UP during the given period.

The Reliability of a service instance, service or a site over a given period is defined as:

-       The fraction of time it was UP (Availability), divided by the scheduled availability (1 - Scheduled Downtime - Unknown Interval) for that period

-       Reliability = Availability / (1 - Scheduled Downtime - Unknown)

-       Reliability = Availability / (Availability + Unscheduled Downtime)

The three definitions give the same value for Reliability (see Paper); this is the Reliability definition approved at the LCG MB meeting of 13 February 2007 (Section 1.2).
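
For illustration, a minimal numeric sketch (with made-up figures, not actual measurements) showing that the two formulas agree, since Availability, Scheduled Downtime, Unknown and Unscheduled Downtime add up to 1 by definition:

    # Assumed example fractions for one month (not real data)
    availability         = 0.85
    scheduled_downtime   = 0.05
    unknown              = 0.02
    unscheduled_downtime = 1 - availability - scheduled_downtime - unknown   # 0.08

    r1 = availability / (1 - scheduled_downtime - unknown)          # 0.85 / 0.93
    r2 = availability / (availability + unscheduled_downtime)       # 0.85 / 0.93
    assert abs(r1 - r2) < 1e-9                                      # both give ~0.914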

 

Slide 4 shows some examples of computation of Availability and Reliability using the new algorithm.

 

Differences Between Old and New Computation

 

The old (current) algorithm 

-       Computes Service Status on Discrete Time Scale with precision of an hour

-       Test results sampled at the end of each hour

-       Service status is based only on latest results for an hour

-       Availability for an hour is always 0 or 1

 

The New Algorithm

-       Computes Service Status on Continuous Time Scale

-       Service Status is based on all test results

-       Computes Availability metrics with higher accuracy

-       Conforms to Recommendation 42 of EGEE-II Review Report about introducing measures of robustness and reliability

-       Computes reliability metrics as approved by LCG MB, 13 Feb 2007
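
A minimal sketch of the difference (assumed data layout, not the actual Gridview code), for a single hour in which a service was DOWN for 15 minutes in the middle:

    # Assumed format: (start_minute, end_minute, status) intervals within one hour
    intervals = [(0, 20, "UP"), (20, 35, "DOWN"), (35, 60, "UP")]

    # Old algorithm: only the latest result, sampled at the end of the hour,
    # counts, so the hourly availability is always 0 or 1.
    old_availability = 1.0 if intervals[-1][2] == "UP" else 0.0       # -> 1.0

    # New algorithm: continuous time scale, availability is the fraction
    # of the hour actually spent UP.
    up_minutes = sum(end - start for start, end, status in intervals if status == "UP")
    new_availability = up_minutes / 60.0                              # -> 0.75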

 

J.Templon asked if in practice the SAM tests are still going to be run once per hour or if this implies some real change in the measurements.

S.Digamber replied that SAM now allows the selection of the test frequency and that Gridview can take any test frequency into account.

But changes to the current metrics will be discussed first.

 

The major differences (slide 6) between the old (current) and the new algorithm are:

-       Service Status computation on a continuous time scale

-       Consideration of Scheduled Downtime (SD): a service may pass tests even during SD, which leads to an inaccurate Reliability value; the new algorithm ignores the test results and marks the status as SD (see the sketch after this list)

-       Validity of test results: 24 hours for all tests in the old case; defined by the VO separately for each test in the new method; earlier results are invalidated after a scheduled interruption

-       Handling of UNKNOWN status
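
A minimal sketch (assumed names and data layout, not the actual Gridview implementation) of the scheduled-downtime and result-validity handling described in the list above:

    # During a scheduled downtime (SD) the test results are ignored and the
    # status is marked as SD, so a passing test during SD no longer inflates
    # the Reliability figure.
    def effective_status(test_status, in_scheduled_downtime):
        return "SCHEDULED DOWN" if in_scheduled_downtime else test_status

    # Outside SD, a test result is only used while it is within its validity
    # window (now defined per test by the VO) and was not taken before the
    # last scheduled interruption.
    def result_is_usable(result_time, now, validity_seconds, last_sd_end=None):
        if now - result_time > validity_seconds:
            return False                  # result older than this test's window
        if last_sd_end is not None and result_time < last_sd_end:
            return False                  # result predates the scheduled interruption
        return True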

 

Computation of Service Instance Status

Each VO marks critical tests for each service. A Service should be considered UP only if all critical tests are passed.

 

The Old (Current) Algorithm:

-       Marks status as UP even if test result of only one of the many critical tests is available and UP

-       Ignores the status of those critical tests whose results are not known

 

New Algorithm

-       Marks status as UNKNOWN even if the status of all known critical tests is UP and one of the critical tests is UNKNOWN

-       When any of the known results is DOWN, status is marked as DOWN anyway
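
A minimal sketch (hypothetical test names, not the actual SAM/Gridview code) of the new behaviour for one service instance:

    def instance_status(critical_test_results):
        # critical_test_results: each critical test mapped to 'UP', 'DOWN',
        # or None when no valid result is available
        results = list(critical_test_results.values())
        if any(r == "DOWN" for r in results):
            return "DOWN"        # a failing critical test brings the instance DOWN
        if any(r is None for r in results):
            return "UNKNOWN"     # a missing critical result now gives UNKNOWN, not UP
        return "UP"              # UP only when every critical test passed

    # The old algorithm would have reported UP here, ignoring the missing result.
    print(instance_status({"test-A": "UP", "test-B": None}))   # -> UNKNOWN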

 

Slide 9 (see with animation) shows an example of the computations with the old and the new algorithm.

 

Discovery of New Service Instances

Old (Current) Algorithm

-       identifies service instances based on available test results

-       computes status of only those services for which at least one critical test is defined and result of at least one critical test is available

-       leads to unexpected behavior like a service instance suddenly disappearing from the Gridview monitoring page

 

New Algorithm

-       tracks service instances based on their registration in the Information System, either in the BDII or in the GOCDB

-       computes status of all service instances supporting the given VO irrespective of whether critical tests are defined or results available

 

J.Gordon asked whether a failure of a site's RB will cause the site to be considered failing.

S.Digamber replied that the RB is a central service and its failure does not affect the site's status.

 

Site Status Computation (slides 11 and 12):

-       The status of a service (e.g. CE) is computed by ORing the statuses of all its instances

-       The status of a site is computed by ANDing the statuses of all registered site services

-       The status of a site should be UP only if the status of all registered site services is UP

 

Old (Current) Algorithm

-       Marks status of site as UP even if status of only one of the many registered services is known and UP

-       Ignores those services for which the status is not known

 

New Algorithm

-       Marks status of site as UNKNOWN even if the status of all known services is UP and one of the services is UNKNOWN

-       When any of the known services is DOWN, status is marked as DOWN anyway
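
A minimal sketch of the OR/AND combination above (illustration only; the SCHEDULED DOWN state is omitted for brevity):

    def or_status(instance_statuses):    # service status: UP if any instance is UP
        if "UP" in instance_statuses:
            return "UP"
        return "UNKNOWN" if "UNKNOWN" in instance_statuses else "DOWN"

    def and_status(service_statuses):    # site status: UP only if every service is UP
        if "DOWN" in service_statuses:
            return "DOWN"
        return "UNKNOWN" if "UNKNOWN" in service_statuses else "UP"

    ce   = or_status(["UP", "DOWN"])     # two CE instances, one still working -> UP
    se   = or_status(["UNKNOWN"])        # single SE instance with no result   -> UNKNOWN
    site = and_status([ce, se])          # new algorithm: site is UNKNOWN, not UP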

 

F.Hernandez noted that when the SE is down, the CE at the site also fails, because a SAM test verifies the access to the SE from the CE. This causes the CE to be marked as failing.

S.Digamber replied that this is not handled at the level of Gridview but depends on which tests are considered critical for the CE. This could be configured, but it should be discussed outside the meeting, since the definition of which tests are critical is part of the SAM configuration.

 

Ph.Charpentier added that computing the result as an “OR” of the CEs is not valid, because a single failing CE may be causing a major problem for the site.

I.Bird noted that this is an issue regarding the tests, not the computation for sites and services. The SAM tests should address these issues at a lower level than Gridview, which only combines the results. These issues are valid and should be raised and taken into account in future improvements of SAM and Gridview.

 

Availability and Reliability Computation

Hourly

-       Hourly Availability, Scheduled Downtime and Unknown Interval for a service instance, service and site are computed from the respective status information

-       Hourly Reliability is computed from the Availability, Scheduled Downtime and Unknown Interval for that hour

 

Daily, Weekly and Monthly

-       Daily, Weekly and Monthly Availability, Scheduled Downtime and Unknown Interval are computed by averaging the Hourly figures

-       Daily, Weekly and Monthly Reliability figures are computed directly from the Availability, Scheduled Downtime and Unknown Interval figures over the corresponding periods
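
A minimal sketch (assumed data layout, not the actual Gridview code) of the daily aggregation described above; weekly and monthly figures follow the same pattern:

    def daily_metrics(hourly):
        # hourly: 24 tuples of (availability, scheduled_downtime, unknown) for one day
        n = len(hourly)
        availability = sum(a for a, s, u in hourly) / n
        sched_down   = sum(s for a, s, u in hourly) / n
        unknown      = sum(u for a, s, u in hourly) / n
        denominator  = 1 - sched_down - unknown
        reliability  = availability / denominator if denominator > 0 else 0.0
        return availability, sched_down, unknown, reliability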

 

Conclusions

Old (Current) Algorithm

-       coarse-grained representation of the service status

-       arrives at a conclusion based on incomplete data

-       can be misleading, giving the impression that everything is OK when the actual state is not known

-       may adversely affect the actual availability of the service, as it might hide some service breakdowns rather than generate alerts

 

New Algorithm

-       resolves ambiguities present in the old (current) algorithm

-       reports the true status of the service, stating it as UNKNOWN whenever there is not adequate data to establish the state of the service

-       fine-grained model; computes the service availability metrics with more accuracy

-       the implementation is ready and awaits management approval for deployment

 

Decision:

The MB agreed that the old and new computation should be compared by the sites. Feedback should be sent to the MB list.

Unless major issues arise the new system will be put in production at the end of September.

 

Further improvements will be collected and discussed at the GDB for inclusion in future versions of SAM and Gridview.

 

New Action:

25 Sept 2007 - MB members should send feedback on the new Gridview computation algorithm.

 

 

Update: The information on the old and the new (for testing) Gridview instances was distributed after the meeting.

 

GridView with the new algorithm is available here:
http://gvdev.cern.ch/GVPC/same_index.php

 

The current (old) algorithm is here as usual:
http://gridview.cern.ch/GRIDVIEW/same_index.php

 

 

7.    AOB

 

H.Renshall asked for an update of the acquisition plans from the sites. See Section 1.4 above.

 

8.    Summary of New Actions

 

 

25 Sept 2007 - A.Aimar will collect the information about the VO-specific tests (contacts, test documentation, etc.) and add it to the SAM wiki page.

25 Sept 2007 - MB members should send feedback on the new Gridview computation algorithm.

25 Sept 2007 - I.Bird will propose speakers for the Tier-1 and Tier-2 presentations at the LHCC Comprehensive Review.

 

28 Sept 2007 - The Experiments should propose the speakers and the content of the demonstrations for the LHCC Comprehensive Review.

 

21 October 2007 - Sites should send to H.Renshall their resource acquisition plans for CPU, disk and tape until April 2008.

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.