LCG Management Board

Date/Time

Tuesday 9 September 2008, 16:00-18:00 - F2F Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=39170

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 2 - 20.9.2008)

Participants

A.Aimar (notes), D.Barberis, I.Bird(chair), D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, A.Di Girolamo, M.Ernst, X.Espinal, S.Foffano, F.Giacomini, J.Gordon, F.Hernandez, M.Kasemann, E.Laure, H.Marten, P.Mato, M.Lamanna, P.Mendez-Lorenzo, A.Pace, R.Pordes, Di Qing, H.Renshall, R.Santinelli, M.Schulz, Y.Schutz, J.Shade, K.Skaburskas, O.Smirnova, R.Tafirout, J.Templon 

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 16 September 2008 16:00-18:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

1.1      Minutes of Previous Meeting 

The minutes of the previous MB meeting were approved.

1.2      QR Preparation (June-September)

The next Quarterly Report will cover the period from June to September 2008 (instead of June-August), in order to also report on the activities that followed the LHC start-up.

 

As usual, the Experiments will be asked to present to the MB a short summary of the quarter. That information will then be used in the Quarterly Report.

1.3      Reliability Availability Report - August 2008 (T1_200808; T2_200808)

For information, the monthly Reliability and Availability Reports are distributed to the MB (links in the section title). Please send comments to lcg.office@cern.ch.

1.4      Speakers for referees' meeting - Ian Bird

There is a meeting of the LHCC Referees in 2 weeks (22 September 2008). They have asked for a summary status report covering all 4 Experiments (20 minutes in total).

 

Y.Schutz agreed to present the summary at the Referees' meeting. The other 3 Experiments should send him the slides to present.

 

2.   Action List Review (List of actions)

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

About LCAS: Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm their interest.

About SCAS: The SCAS server seems to be ready and “certifiable” in a week. The client is still incomplete.

No news for this week about LCAS and SCAS.

  • For the ATLAS Job Priorities deployment the following actions should be performed:

-       DONE. A document describing the shares wanted by ATLAS

-       DONE. Selected sites should deploy it and someone should follow it up.

-       ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end

 

ATLAS will report on the status of the tests. No news at the F2F meeting.

 

D.Barberis reported that the ATLAS tests on job priorities are supposed to take place at CNAF (Milan and Naples), but those sites were busy with other issues during the summer. Once the tests in Italy are completed, some tests will also be executed at NL-T1 for final verification.

J.Gordon noted that, once ATLAS has completed the tests, the other VOs should then be asked to prepare their shares and priorities.

  • 19 Aug 2008 - New service related milestones should be introduced for VOMS and GridView.

To be discussed at the MB in the future.

  • M.Schulz should present an updated list of SAM tests, for instance testing SRMv2 rather than SRMv1.

This is discussed later in this meeting.

  • J.Shiers will ask SAM to review the MoU requirements and define SAM tests to verify them.

To be removed: a working group will review the MoU, looking for which SAM tests could be useful (and implementable).

 

3.   LCG Operations Weekly Report (Slides) - H.Renshall

H.Renshall presented a summary of status and progress of the LCG Operations.

The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Sites Reports

 

CERN

A large number of software changes took place during the week, with rolling kernel upgrades on all Quattor-managed machines, but most interventions were transparent to the users.

 

The worker node upgrades triggered an LSF problem whereby the appearance of a new system directory prevented LSF from determining the CPU speed and memory attributes of a node, which delayed the scheduling of some jobs. A patch was received the same day.

 

There was a scheduled SRM intervention on 3 September, when all CERN SRM2 endpoints were down for 1 hour, from 10:00 to 11:00.

 

There was an LSF hiccup during the daily reconfiguration on 4 September, when LSF failed to read a local configuration file and many jobs were put into the lost+found queue. Manual intervention put the jobs back into their correct queues, and a possible timeout-window problem has been identified inside the LSF reconfiguration.

 

RAL

The long-standing CASTOR2 bulk-insert ‘Oracle constraint violation’ problem has been fixed for all instances over the last week. It required configuration changes to the ATLAS/LHCb RAC and to the RAC hosting the common name server. A post-mortem analysis is in preparation.

 

RAL also had several other unrelated CASTOR Oracle DB problems during the week, with a few hours of downtime.

 

J.Gordon noted that the Oracle problems had been solved at CERN, with Oracle's help, a long time ago, but RAL did not know about it. The information had not been shared with the other Sites, and the issue at RAL was only solved because someone at CERN noticed it and sent the solution to RAL.

 

ASGC

ASGC upgraded to CASTOR 2.1.7, even though it had been warned of the problems at RAL, and is seeing some 32/64-bit related problems.

 

BNL

On Friday 29 August at 7:01 am, data transfer failures to several sites were reported by automated service probes at BNL, indicating network connectivity problems at the OPN level. BNL later received a message from the USLHCnet NOC, at 1:24 pm, confirming an outage of the circuit provided by Colt. A backup circuit was believed to be in place but was not operational.

Around 8:30 pm, while data replication from/to BNL was progressing well, connectivity issues were observed between hosts at CERN connected via the OPN and the PanDA servers/services running at BNL. At the time there were no problems reaching the PanDA services at BNL from CERN hosts that are not routed via the OPN (e.g. lxplus). As a temporary workaround, policy-based routing was re-enabled in the BNL firewall. The issue was followed up with priority at BNL and CERN and, as a result, some configuration changes have been made at BNL.

(Update at meeting: M.Ernst pointed out that these problems are understood but the fixes are not yet in place)

 

CNAF

The GGUS alarm ticket routing was found to have been incomplete since 17 July. Fortunately only one real alarm ticket was raised before this was corrected (on 4 September), and that ticket had in any case already been solved.

 

CNAF has performed the Oracle July patch upgrades and plans to upgrade to CASTOR 2.1.7 this week (!) with a downtime for all VOs.
(Update at meeting: L.Dell’Agnello clarified that, because of some problems with Oracle, the upgrade is postponed until next week)

 

General

gLite 3.1 Update 30 was released. It includes the fix for the VDT limitation of 10 proxy delegations, which makes WMS proxy renewal usable again by LHCb and ALICE.

Update 30 contained a new version of gfal/lcg_utils that is incompatible with gLite 3.0 top-level BDIIs. Advisory EGEE broadcasts were made, asking sites to upgrade or to point to other BDIIs.
(Update at meeting: M.Schulz noted that the gLite 3.0 BDII had already been announced as no longer supported)

3.2      Experiments Reports

 

ALICE

On 1 September there was an unscheduled power cut in the ALICE server room (in building 30) which lasted beyond the local UPS protection time. ALICE production, which depends on the ALIEN central services hosted there, stopped for 2 hours.

 

A new, fully backwards-compatible release of ALIEN has been made. It is already installed at CERN and some smaller sites, and makes subsequent upgrades of the VOBoxes much easier.

 

LHCb

Ongoing data access problems at NL-T1 needing daily restarts of services.

 

LHCb has issued GGUS tickets requesting that space tokens missing at some Tier-1 sites be urgently set up.

 

CMS

The magnet ran at 3 Tesla over the end-of-August weekend, taking cosmics. Migration of data to the CAF was found to be failing; it was debugged, but the backlog took 2-3 days to recover.

 

On 2 September the DEFAULT service class of the CASTORCMS instance was temporarily unavailable. This is understood to have been caused by a single user filling all the scheduling slots available in the castorcms default service class; whether that user should instead be using cmscaf is being addressed. The CAF is heavily used: the CASTOR pools/subclusters on cmscaf reached a peak of 3.8 GB/s at 2 am CERN time.

 

CMS performed the last midweek 2-day (Wednesday and Thursday) global cosmics run before beam.

 

CMS requested a customised elog service to be run by IT (the initial setup is being done by GS). The elogs over the last two weeks show that most problems at Tier-1s are solved within 24 hours, but this is not the case for Tier-2s.

 

ATLAS

CASTOR services at RAL resumed early in the week, but 4000 files were permanently lost due to an earlier, unrelated disk failure.

 

There are known problems exporting data to NL-T1, where the storage setup is thought to be out of balance between disk and tape.
(Update at meeting: +700 TB of disk has now passed acceptance at SARA and is being prepared for use)

 

Daily conference calls started with ASGC to resolve problem of lack of disk space due to large volume of ‘dark’ (i.e. not catalogued) data.

 

Overnight on the 3rd, dashboard monitoring for the Tier-0 and for data export stopped under a huge load of call-backs and timeouts on the server side. It was fixed during the day; it is not known whether the cause was site-services, DDM or dashboard related.

 

ATLAS started a programme to simulate pre-staging, reprocessing job load and use of the conditions database by running a large reprocessing test of FDR and cosmics data.

3.3      Conclusions

There were (too) many software upgrades, and this is not expected to slow down.

There were many miscellaneous failures, and these will also continue.

Maintaining good-quality LCG services over the next months is going to require constant vigilance and will be labour-intensive.

 

M.Kasemann asked that the upgrades introduced be limited to patches and not include new features that may cause new issues. He also added that until now new deployments of components have not caused problems for CMS, and the upgrades were transparent to the CMS users.

 

I.Bird agreed that the certification effort should be spent on testing patches and not on new features and components.

 

18 Sep 2008 - Sites should send to A.Aimar and H.Renshall the status of their procurement for the 2009 installations and whether they are on track for installation by April 2009.

 

J.Templon informed the MB that the tender problems (for 2008) at NL-T1 are solved, but he is not sure the cooling systems and their housing will be ready before some time next year.

 

4.   High Level Milestones Update (HLM_20080907) - Round table

 

The MB reviewed the High Level Milestones (HLM_20080907).

 

24x7 Support

Completed by all Sites.

 

VOBoxes Support

 

WLCG-07-04 - Apr 2007 - VOBoxes SLA Defined
Sites propose and agree with the VO the level of support (upgrade, backup, restore, etc.) of VOBoxes

WLCG-07-05 - May 2007 - VOBoxes SLA Implemented
VOBoxes service implemented at the site according to the SLA

WLCG-07-05b - Jul 2007 - VOBoxes Support Accepted by the Experiments
VOBoxes support level agreed by the experiments (tracked separately for ALICE, ATLAS, CMS and LHCb)

 

IN2P3 - The document is ready but needs to be validated, then proposed to the Experiments and implemented. This will be done in the next few weeks.

CERN - The support is de facto available but needs to be officially approved.

ASGC - An SLA is defined and in place and needs to be signed off by the Experiments.

NDGF - The ALICE VOBoxes have 7 different functions, which should be described in detail. The other VOs do not have any VOBoxes at NDGF.

PIC - The SLA is in place, and the CMS contact persons need to be defined. For ATLAS no VOBox is needed at PIC for now.

NL-T1 - The SLA document is not finalized; as the VOBoxes are working, it was not a major priority compared to other issues. SARA and NIKHEF both need to approve the document.

 

D.Barberis clarified that all ATLAS VOBoxes are running at CERN except for one at BNL.

 

CAF at CERN

 

WLCG-07-40 - Oct 2007 - Experiments provide the Test Setup for the CAF
Specification of the requirements and setup needed by each Experiment

 

The CAF, although used in different ways, is defined and in use by all 4 Experiments.

 

Tape Metrics

 

WLCG-08-03 - April 2008 - Tape Efficiency Metrics Published
Metrics are collected and published weekly

 

IN2P3 - Metrics are defined, but they do not cover everything that was requested, nor enough to clearly understand the performance. More work is needed.

I.Bird suggested that the sites publish the information they have.

F.Hernandez replied that for the moment the information is really not sufficient.

 

ASGC - The site has taken the scripts from CERN, but has since proceeded with an upgrade of CASTOR and still needs to implement the metrics requested.

 

Tier-1 2008 Procurement (as before the meeting).

WLCG-07-17 - 1 Apr 2008 - MoU 2008 Pledges Installed
To fulfill the agreement that all sites procure their MoU pledges by April of every year.

Status per site:

ASGC        - Sept 2008
CC IN2P3    - CPU: OK May; Disk: Sep 08
CERN        - Apr 2008
DE-KIT      - Apr 2008
INFN CNAF   - CPU: Jul 08; Disk: Sept 08
NDGF        - CPU: OK May; Disk: Sep 08
PIC         - CPU: OK May; Disk: Jul 08
RAL         - Apr 2008
SARA NIKHEF - Nov 2008
TRIUMF      - Apr 2008
BNL         - CPU: Jun 08; Disk: Jul 08
FNAL        - CPU: 80%; Disk: OK May

 

ASGC - Will send a confirmation to H.Renshall of how much is installed and available.

IN2P3 - Disk is provided at 60% of the pledge. Next week the rest of the 2008 pledge and 50% of the 2009 pledge are going to be delivered.

CERN - The total amount is on site but there is a problem with the supplied material (rack rails).

INFN - The hardware is in place but not yet powered on or installed. This should be done in the next few weeks.

NDGF - Disk procurement is completed, but the installation will only be done by the end of September.

PIC - Disk and CPU are installed.

DE-KIT - All hardware is on site, running in test mode, and will be made available on October 1st.

SARA - There are problems with the cooling system and its housing; a new estimate will be given, but it will be in 2009.

BNL - All CPU is installed and available. The extension of power and cooling at the end of September will then allow the rest of the disk to be installed.

FNAL - Not represented at the meeting.

 

2009 Procurement

WLCG-08-04 - Sep 2008 - Sites Report on the Status of the MoU 2009 Procurement
Sites report whether they are on track to meet the MoU pledges by April 2009, or the date by which the pledges will be fulfilled.

This will be checked at the end of September and reported in the QR and to the Reviewers.

 

Here is the dashboard updated as discussed during the meeting:

https://twiki.cern.ch/twiki/pub/LCG/MilestonesPlans/WLCG_High_Level_Milestones_20080909.pdf

 

 

5.   Review of the SAM Tests

 

5.1      Update of the Current SAM Tests (Slides) - K.Skaburskas, J.Shade

The update covered the new SRMv2 tests, which are now in production (but with alarming turned off), the Data Management tests, and possible improvements to the availability calculations.

 

Additional slides were included but not discussed:

-       Steps taken to stabilize the SAM service

-       APEL-pub test is finally implemented (checks that sites are correctly publishing accounting information)

-       Other sensors forthcoming (WMS, MyProxy, AMGA)

-       Pointer to Operations Automation Strategy document

 

SRM V2 Tests

The current situation of the SRM endpoints in EGEE is the following:

-       ~278 SRMv1s

-       ~296 SRMv2s

 

Site availability calculations currently take only SRMv1 into account, yet SRMv2 endpoints are now predominant. For this reason the SAM team have developed SRMv2-specific probes, starting from the CMS probes but completely rewriting them (they are simplified, and timeouts are now used).

 

For the SRMv2 tests, unlike the SRMv1 ones, usage of the LFC has been removed and the tests have been decoupled from the BDII.

 

All 7 tests are currently critical so that their history can be viewed in the SAM portal, but COD alarms are currently suppressed and site availability figures are not impacted. The probes run as a sequence (see the sketch after the list below):

-       SRMv2-get-SURLs - get full SRM endpoints and space areas (VO dependent) from BDII

-       SRMv2-ls-dir - check that the VO's top-level space area(s) in the SRM can be listed. Only the top level is listed, since otherwise this could be a lengthy operation.

-       SRMv2-put - copy a local file to the SRM into default space area(s)

-       SRMv2-ls - list (previously copied) file(s) on the SRM

-       SRMv2-gt - get Transport URLs for the file(s) copied to storage

-       SRMv2-get - copy given remote file(s) from SRM to local file(s)

-       SRMv2-del - delete given file(s) from SRM

 

More details available at https://twiki.cern.ch/twiki/bin/view/LCG/SAMSensorsTests#SRMv2
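
To illustrate how the seven probes chain together, here is a minimal sketch of a sequential driver with a per-step timeout (one of the improvements of the rewritten probes). It is not the actual SAM sensor code: the command lines are placeholders and the timeout value is an assumption.

import subprocess

# Ordered probe chain: each later step depends on the file written by
# SRMv2-put, so the chain stops at the first failure.
# The commands below are placeholders, NOT the real SAM test implementations.
PROBES = [
    ("SRMv2-get-SURLs", ["echo", "query the BDII for endpoints and space areas"]),
    ("SRMv2-ls-dir",    ["echo", "list the VO top-level space area"]),
    ("SRMv2-put",       ["echo", "copy a local file into the default space area"]),
    ("SRMv2-ls",        ["echo", "list the file just copied"]),
    ("SRMv2-gt",        ["echo", "get transport URLs for the copied file"]),
    ("SRMv2-get",       ["echo", "copy the file back from the SRM"]),
    ("SRMv2-del",       ["echo", "delete the file from the SRM"]),
]

def run_probe_chain(timeout_seconds=300):  # assumed per-step timeout
    for name, command in PROBES:
        try:
            result = subprocess.run(command, capture_output=True,
                                    timeout=timeout_seconds)
            status = "OK" if result.returncode == 0 else "ERROR"
        except subprocess.TimeoutExpired:
            status = "ERROR (timeout)"
        print(f"{name}: {status}")
        if status != "OK":
            break  # later steps would fail anyway

if __name__ == "__main__":
    run_probe_chain()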

 

Slide 5 shows the latest test results.

There are 267 SRMv2 endpoints, of which 51 are in ERROR status:

-       3 - SRMv2-get-SURLs - GlueServiceEndpoint and/or GlueSAPath are not published

-       25 - SRMv2-ls-dir - “CGSI-gSOAP: Could not open connection!”: the endpoint is in the information system but nothing listens on the port, or the port is firewalled

-       13 - SRMv2-gt - inconsistent information published for the transport protocols (mostly dCache)

-       10 - various other problems (< 5% of the total number of nodes)

 

In the future the following features could be added:

-       Introduce some of the SRMv2 enhancements to SRMv1 sensors. Is it worth it?

-       Differentiation between Central and Local LFCs. Don’t try to write to Read-Only Local LFCs

-       Integration of SRMv2 results in GridView availability calculations

-       Distinguish between storage implementations? Aggregating the different types of storage endpoints per site would need the implementation name added to the topology database; the proposed combination is an AND of the different storage types and an OR of the multiple instances in each storage class (see the sketch after this list)

-       Use of Message Bus & WLCG Probe formats. Preliminary work has already started
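
As a minimal sketch of the proposed combination (assuming the AND/OR rule above; this is not the actual GridView algorithm), the site-level storage status would be computed as:

def storage_availability(instances_by_class):
    """Combine per-endpoint test results into one site-level storage status.

    instances_by_class maps a storage class (e.g. a storage type or
    implementation) to a list of booleans, one per instance
    (True = the instance passed its SAM tests).
    """
    # OR within a class: the class is up if at least one instance is up.
    # AND across classes: the site's storage is up only if every class is up.
    return all(any(instances) for instances in instances_by_class.values())

# Example: one of two disk endpoints is down but the class is still up,
# so the site's storage counts as available.
print(storage_availability({"disk": [True, False], "tape": [True]}))   # True
print(storage_availability({"disk": [False, False], "tape": [True]}))  # False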

 

Critical Tests

One should decide which of the seven tests listed above are critical and how to integrate the SRMv2 tests in the availability calculation.


 

The tests SRMv2-ls and SRMv2-gt seem less critical, but this should be discussed with the Sites and the VOs.

 

I.Bird asked whether the SRM tests and the SE tests are redundant. If so, the duplication should be removed.

 

Availability Calculations

Computing the OPS availability as an “OR of all SE services” is not very relevant for the VOs; it is not meaningful unless it is split by VO.

5.2      SAM test vs. MoU Requirements

The SAM tests should continue to be used and improved, for instance for reporting to the MB and for alarm purposes. But the MoU Requirements should be measured by suitable SAM tests and reported in a format that is easily comparable to the MoU statements.

 

Ph.Charpentier and M.Kasemann asked whether the Experiments should test the Sites against the MoU as they are, or only after the Experiments have patched the problems with ad hoc fixes. These workarounds actually hide issues that the sites should fix instead.

 

I.Bird replied that the tests should clarify the issues both with and without the Experiments' workarounds.

Decision:

A working group should define the reasonable and useful set of metrics to be used to check the MoU Requirements.

5.3      OPS vs. VO Tests (VO Tests Description; VO_SAM_200808, Alice VO Tests)

An easy example to see is RAL in August (see VO_SAM_200808), where the site appears 100% up while it was unavailable for ATLAS ~50% of the time.

 

The MB agreed that the usage of the OPS SAM tests should be evaluated, but from next month the VO SAM tests should also be at the centre of the Sites' reliability and availability reporting.

The VO reports will be distributed for comments, annotated and then used for reporting to the WLCG management.

5.4      T2 Reports (Report)

The MB was asked about the format of the reports:

 

Federations Availability

The reporting by federation is not weighted by resources. A small site and a major site influence the average of the federation in the same way.

What weight could be used? CPU and disk? Installed capacity or pledges?
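
As a minimal sketch of one option under discussion (a resource-weighted average; whether the weights should be installed or pledged CPU/disk is exactly the open question above):

def federation_availability(sites):
    """Weighted average of site availabilities for one federation.

    sites is a list of (availability, weight) pairs, where availability is
    in [0, 1] and weight is some resource measure, e.g. installed or
    pledged CPU plus disk capacity (the open choice discussed above).
    """
    total_weight = sum(weight for _, weight in sites)
    return sum(avail * weight for avail, weight in sites) / total_weight

# A small site (weight 1) at 50% and a major site (weight 10) at 100%:
# the unweighted mean would be 0.75, the weighted average is ~0.95.
print(federation_availability([(0.5, 1), (1.0, 10)]))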

 

I.Bird proposed limiting the reporting to single sites: the average for a federation is misleading, and each federation can check its own sites and their availability.

 

6.   AOB
 

 

No AOB.

 

7.    Summary of New Actions

 

 

New Action:

18 Sep 2008 - Sites should send to A.Aimar and H.Renshall the status of their procurement for the 2009 installations and whether they are on track for installation by April 2009.