LCG Management Board

Date/Time:

Tuesday 18 July 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a063093

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 19.7.2006)

Participants:

A.Aimar (notes), D.Barberis, L.Bauerdick, S.Belforte, I.Bird, K.Bos, N.Brook, T.Cass, Ph.Charpentier, M.Delfino, I.Fisk, B.Gibbard, F.Hernandez, M.Lamanna, E.Laure, H.Marten, M.Mazzucato, B.Panzer, Di Quing, L.Robertson (chair), Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 24 July from 16:00 to 17:00

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

The minutes of the 4 and 11 July 2006 meetings were approved.

 

J.Gordon (not present; he made the request via email) asked that the minutes of the 11 July record that all services are *covered* at the SCM but some (dCache, WMS, etc.) are not *represented* by the responsible people. J.Shiers confirmed that all services are covered at the SCM meeting. The minutes were changed accordingly.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

  • 23 May 06 - Tier-1 sites should confirm via email to J.Shiers that they have set up and tested their FTS channel configuration for transfers from all Tier-1 and to/from Tier-2 sites. It is not sufficient to set up the channels: the action requires confirmation via email that transfers from all Tier-1 sites and to/from the "known" Tier-2 sites have been tested.

 

To be done: BNL, FNAL, NDGF and TRIUMF.

 

Update 18 July 2006

-          BNL will check the status and send a message to J.Shiers.

-          FNAL: Meeting the same day about the FTS setup. Update will be sent to J.Shiers.

-          NDGF: no news.

-          TRIUMF: incomplete. They have tested their Tier-1/Tier-2 associations but, among the Tier-1 sites, only the link to IN2P3.

 

  • 31 May 06 - K.Bos should start a discussion forum to share experience and tools for monitoring rates and capacities and to provide information as needed by the VOs. The goal is then to make possible a central repository to store effective tape throughput monitoring information.

 

Action to be removed.

The fabrics and tools aspects could be followed up in the context of HEPiX. I.Bird will ask the HEPiX organizers.

The service aspects are going to be discussed at the Service Challenge Technical Day, which is scheduled for the 15 September 2006. Agenda.

 

  • 13 Jun 06 - D.Liko to distribute the Job Priority WG report to the MB.

 

Update. The report and a note on the status of the WG were distributed to the MB after the meeting.

 

Here it is:

 

http://egee-intranet.web.cern.ch/egee-intranet/NA1/TCG/wgs/priority.htm

 

Look for this section, and click
----------
First implementation plan (July 2006):
    - Implementation plan
    - WG Status
----------

 

  • 30 Jun 06 - J.Gordon reports on the defined use cases and policies for user-level accounting, in agreement with the security policy working group, independently of the tools and technology used to implement it.

 

Not done.

The proposal was presented to the GDB in June with a request for feedback, but none was received.

A reminder was given at the July GDB. If J.Gordon does not receive any use cases he will propose some himself.

 

  • 20 Jul 06 - Tier-1 sites should nominate one or two representatives to the TCG. Nomination should be sent to I.Bird before the 18 July 2006. The representative should represent the needs of the sites not his/her own site or personal views.

 

No nominations have been received from any site. Sites should send their proposals to I.Bird.

 

3.       Scheduled Downtime and Availability Measurements (more information) – L.Robertson, F.Hernandez

 

3.1         Availability Measurements in June (here) – L.Robertson

 

Note:

-          Site availability is calculated taking into account all services that should be deployed.

-          Data was not properly recorded during 18-20 June; those dates have been removed from the graphs and from the calculation of the average values.

-          NDGF and BNL do not have any availability tests working and are therefore not included in the calculation of the averages.

 

The main notable issues in June are:

-          The average availability is 70% (down from 77% in May)

-          Two sites reached the target (two also in May)

-          Two sites within 10% of the target (three in May)

 

There is a regression compared to the previous month and the MB discussed the causes:

-          FZK’s gap (from the 8th to the 28th) is explained by H.Marten in an email (NICE login required) and is now being corrected. But this could recur if the setup at FZK changes again and the configuration of the tests is not modified accordingly.

-          INFN/CNAF was installing Castor 2 in June and this made all Castor availability tests fail. The SAM tests are combined in a logical “and”; therefore the whole site is considered unavailable under the current criteria.

-          The cluster monitored by SAM at SARA-NIKHEF is the one at SARA, which is not the major cluster (which is at NIKHEF). SARA-NIKHEF should discuss with Piotr Nyczyk in order to provide the information needed to adequately monitor the services at SARA-NIKHEF.

-          BNL is not monitored because their firewall and proxy setup does not work with the network access (no proxy support) used by the SAM tests. Discussions with Piotr Nyczyk are ongoing.

-          NDGF does not run the LCG software and the functional tests fail. NDGF should find a way to report their status via the SAM system in the same format as all other sites.

 

The results of the SAM tests are available all the time on the web (http://lcg-sam.cern.ch:8080/sqldb/site_avail.xsql). Therefore the sites should monitor their values regularly so that any service unavailability is discovered promptly and corrective action taken immediately.

 

VO specific tests can always be added in the SAM framework; experiments can provide their tests and publish them via the SAM framework. Contact Piotr for details.

3.2         How to Deal with Scheduled Downtime (presentation) – F.Hernandez

 

The inclusion of scheduled downtime in the availability metrics has been discussed via email for the past couple of weeks.

F. Hernandez provided a summary of the discussion and his proposal on the subject.

 

Currently the services at the sites are tested at regular intervals (slide 2). The monitoring system probes the BDII, CE, and SE/SRM services at every site, categorizes each service as OK, DOWN or DEGRADED, and then combines the results in a logical “and” in order to provide a single “site availability” value.
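The combination rule can be sketched as follows. This is only an illustration of the logical “and” described above, not the actual SAM implementation; the service names and statuses are taken from the description in these minutes.

```python
# Minimal sketch of the SAM combination rule described above (illustrative,
# not the real SAM code): per-service results are ANDed into one site value.

OK, DOWN, DEGRADED = "OK", "DOWN", "DEGRADED"

def site_available(service_status):
    """The site counts as available only if every probed service is OK."""
    return all(status == OK for status in service_status.values())

# Hypothetical example: one failing service marks the whole site unavailable.
print(site_available({"BDII": OK, "CE": OK, "SE/SRM": DOWN}))  # False
```

Under this rule a single failing service (e.g. Castor at INFN/CNAF in June) makes the entire site count as unavailable, which is exactly the behaviour discussed in section 3.1.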

 

Scheduled downtime, even if announced in the GOC database, is not taken into account. Nodes marked “not-to-be-monitored” in the GOC DB are probed anyway.

 

Meanwhile (slide 3) the definition of Availability in the WLCG MoU (http://lcg.web.cern.ch/LCG/C-RRB/MoU/WLCGMoU.pdf) is different:

MoU Availability = Time Running / Scheduled Uptime

 

The document of J.Shiers approved by the MB, “Scheduling of Service Interruptions at WLCG Sites” defined the convention for announcing interruptions:

-          Announce the interruptions through the CIC portal and store information in the GOC

-          Interruption <= 4 hours: announce 1 working day in advance

-          4 < Interruption <= 12 hours: announce at the operations meeting

-          Interruption > 12 hours: announce one week in advance

 

“All interruptions that do not follow this procedure shall be deemed to be
unscheduled and shall be thus accounted in the corresponding Site Availability reports”
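The convention amounts to a simple classification by duration. The sketch below is only an illustration of the three rules listed above, assuming durations expressed in hours:

```python
def required_notice(duration_hours):
    """Return the announcement rule for a scheduled interruption, following
    the convention in "Scheduling of Service Interruptions at WLCG Sites"
    as summarized above (durations in hours)."""
    if duration_hours <= 4:
        return "announce 1 working day in advance"
    elif duration_hours <= 12:
        return "announce at the operations meeting"
    else:
        return "announce one week in advance"

print(required_notice(3))   # announce 1 working day in advance
print(required_notice(24))  # announce one week in advance
```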

 

The CIC Portal (slide 4 and 5) also has an “operations metrics” web section:

http://cic.in2p3.fr  >> CIC Staff >> Operations Metrics >> Site View

But it is still using the SFT test results and does not account for scheduled downtime.

 

J.Gordon’s and T.Cass’ contributions, discussed via email, propose the definition of two metrics:

-          Availability
- total fraction of time that a service is available to the user
- does not take into account periods of scheduled downtime
- this is what is measured currently via SAM

-          Reliability
- the fraction of time that a service is actually up when it is supposed to be up
- excludes periods of scheduled downtime
- other metrics in this category: Mean Time Between Failures and number of episodes
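A small numerical sketch may make the distinction concrete. The figures below are hypothetical, and time is accounted in hours over a reporting period:

```python
def availability(total, unscheduled_down, scheduled_down):
    """Fraction of total time the service was up; scheduled downtime
    still counts against availability (the quantity SAM measures)."""
    up = total - unscheduled_down - scheduled_down
    return up / total

def reliability(total, unscheduled_down, scheduled_down):
    """Fraction of *scheduled* uptime the service was actually up;
    scheduled downtime is excluded from the denominator (MoU definition:
    Time Running / Scheduled Uptime)."""
    scheduled_up = total - scheduled_down
    return (scheduled_up - unscheduled_down) / scheduled_up

# Hypothetical month: 720 h total, 24 h scheduled, 48 h unscheduled downtime.
print(round(availability(720, 48, 24), 3))  # 0.9
print(round(reliability(720, 48, 24), 3))   # 0.931
```

Reliability is always at least as high as availability, which is why the MoU target (expressed as Time Running / Scheduled Uptime) is in effect a reliability target.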

The metrics should only be calculated by one tool and in a transparent way (slide 7). The result should be available via the CIC portal and the tools should enforce the convention for announcing in advance the interruptions.

Other tools needed for immediate unscheduled downtime (e.g. for emergency) should also be made available.

 

The discussion that followed highlighted and agreed that:

-          The SAM tests will progressively replace all the SFT tests and are also going to be used by the CIC Portal (I.Bird).

-          The dual measures better represent the situation, and reliability metrics should also be distributed by the SAM system (F.Hernandez). J.Gordon had already spoken to P.Nyczyk about it.

-          The target number in the MoU is in fact a target for reliability, and the MoU should be changed accordingly.

-          The frequency of the tests should be tuned. Currently the interval is always one hour; therefore many shorter interruptions cannot be detected. The interval between tests should probably be smaller than the Mean Time To Failure. (Ph.Charpentier)

-          In order to measure downtime correctly, the test frequency should increase when a site is down; otherwise every downtime is measured as at least one hour. (L.Robertson)

-          The list of which services must be running in order for a site to be considered up should be tuned better. In some cases many services are running but the site is still considered down. A more granular way to distinguish the functions (SE, CE, etc.) is needed. (J.Templon)

-          A new value should be introduced in the GOC DB in order to represent scheduled downtime for a given site (J.Shiers and I.Bird)

 

4.      Accounting Data (tables) - L.Robertson

 

 

The tables for June 2006 are available. The sites should send back their feedback to L.Robertson.

 

L.Robertson asked if the sites would agree to send the data earlier, just a few days after beginning of the month.
But, because some operations are not automated at all sites, the MB decided to leave the timescale unchanged, for now.

 

5.      Schedule for Re-planning after the LHC Changes - L.Robertson

 

 

At the GDB meeting in July the new LHC schedule for 2007 and 2008 was presented by J.Engelen. Then a meeting was held of the LCG management and the LHC experiment computing coordinators with J.Engelen, L.Evans and S.Myers in order to understand better the probable schedule and operational parameters during the first two years.


Now experiments should re-define, according to the new understanding of the parameters, the capacities needed from 2006 to 2008 from CERN and the Tier-1 and Tier-2 centers.

 

L.Robertson noted that the goal is to present the new required capacities at the OB in early September, and proposed the date of 15 August for the experiments’ submission of their needs for 2007 and 2008.

 

Ph.Charpentier and L.Bauerdick noted that in 2007 not a lot of data will be stored, but that it is very important that the ramp-up to 2008 is properly planned, and equipment procured and installed.

 

Experiments should report their new values (energy, luminosity, etc.) for 2007 and 2008, and then prepare the new capacities needed, taking into consideration the beam-time in 2007 and the expected efficiency.

2009 will be discussed next year.

 

Action:

L.Robertson will assemble the current estimates for energy, luminosity, and efficiency for planning purposes, and distribute them to the MB list.

 

L.Robertson noted that the 15 August deadline is needed in order to be operational in April 2007: the values must be presented to the OB in September in order to be ready for the RRB in October and to allow sites to proceed with procurement.

 

L.Bauerdick announced that probably CMS will ask for a more formal process for defining the needs of the experiments and for calculating the pledges at the various sites. This may involve the action of the Resources Scrutiny Committee that is mentioned in the MoU.

 

L.Robertson noted that the current numbers were presented by the experiments to the sites and were approved by the LHCC in the past and proposed to follow the same process.

 

M.Mazzucato noted that the discussion of the MoU capacity values was held at “country level” and not at “site level”. Therefore changing those numbers will imply a reassessment of the MoU, involving the representatives of the countries in the MoU.

 

T.Cass pointed out that CERN and some other sites have a long procurement process, and a large purchase needs the approval of the Finance Committee, which meets rarely. Therefore the values have to be known at least six months in advance.

 

L.Robertson noted that this Scrutiny Committee could be set up only by next RRB meeting in October and would not reach conclusions before April 2007.

 

Decision:

ALICE, ATLAS and LHCb agreed that the process of providing new estimates was acceptable for them and that mid-August could be their deadline. NOTE: ATLAS was not present at that moment but agreed outside the meeting.

CMS did not agree and will send more information to the MB.

 

Action:

15 Aug 06 - ALICE, ATLAS and LHCb send to C.Eck their new capacity requirements for 2007 and 2008, for the Tier-1 and Tier-2 sites.

 

 

6.      AOB

 

6.1         Draft agenda for the comprehensive review (agenda)

 

Will be discussed at next MB meeting.

 

 

7.      Summary of New Actions

 

 

Action:

15 Aug 06 - L.Robertson will assemble the current estimates for energy, luminosity and efficiency for planning purposes, and distribute it to the MB list.

 

Action:

15 Aug 06 - ALICE, ATLAS and LHCb send to C.Eck their new capacity requirements for 2007 and 2008, for the Tier-1 and Tier-2 sites.

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.