LCG Management Board

Date/Time

Tuesday 15 April 2008, 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=31113

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 26.4.2008)

Participants  

A.Aimar (notes), D.Barberis, I.Bird (chair), D.Britton, T.Cass, L.Dell’Agnello, D.Duellmann, M.Ernst, X.Espinal, I.Fisk, J.Gordon, F.Hernandez, M.Kasemann, M.Lamanna, U.Marconi, H.Marten, P.McBride, A.Pace, B.Panzer, R.Pordes, Di Qing, Y.Schutz, J.Shiers, O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 29 April 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      Update on Resources Installations at the Tier-1 Sites

For next week's workshop, the 2008 capacities at the Tier-1 Sites are needed. Sites should send A.Aimar or H.Renshall their plans for the 2008 installations:

-       what will be installed for May and

-       when the full capacity for 2008 will be in place.

 

New Action:

30 Apr 2008 – Sites send to H.Renshall their plans for the 2008 installations: what will be installed for May and when the full capacity for 2008 will be in place.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

-       29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.

Not done. The GridView team was asked, but the recalculation still needs to be implemented.

-       18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.

 

Not Done.

At the next meeting the MB will review the wiki pages of the tape metrics for all Sites.

 

-       Experiments should provide the read and write rates that they expect to reach, as clear values (MB/sec, files/sec, etc.) covering all phases of processing and re-processing.

 

Not Done.

J.Templon suggested that the Experiments provide the same information in the format used by LHCb. Here is the example from LHCb: https://twiki.cern.ch/twiki/pub/LCG/GSSDLHCB/Dataflows.pdf

 

I.Bird noted that this data is needed for the LHCC Referees meeting that will take place on 5 May 2008.

 

-       31 March 2008 - OSG should prepare Site monitoring tests equivalent to those included in the SAM testing suite. J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.

 

Ongoing. The equivalence of the OSG tests is not yet officially confirmed.

 

-       15 Apr 2008 - A.Aimar will distribute the VO-specific milestones for March 2008.

 

Done. It will be discussed later in this meeting.

 

-       15 Apr 2008 - A.Aimar: a note should be added to the Tier-2 Reliability reports indicating that the US Sites are not yet reporting availability and reliability data.

 

Done. The US Tier-2 Sites now show “n/a” instead of the 0% reported before.

 

New Action:

31 May 2008 - OSG Tier-2 Sites should report availability and reliability data into SAM and GOCDB so that the monthly Tier-2 report can include their information.

 

-       18 Apr 2008 - M.Ernst should clarify the situation with the ATLAS Sites that are not providing an SRM interface (apparently they use xrootd only), and how their availability and reliability are reported.

 

Done.

M.Ernst explained that all US ATLAS Sites have committed to having an SRM interface by the end of May 2008.

 

-       18 April 2008 - I.Bird and A.Aimar will propose new milestones to the Management Board.

 

Not done.

 

3.   CCRC08 Update (CERN IT Service Status Board; Slides; Weekly minutes) - J.Shiers
 

J.Shiers presented a summary of status and progress of the recent CCRC08 activities.

3.1      LCG OPN Tests

LCG OPN tests took place on Wednesday 9 April from 15.00 to 19.00 CET.

 

The plan of the test is here: https://twiki.cern.ch/twiki/bin/view/LHCOPN/BackupTest

The tests simulated:

-       RAL unreachable for 15-20 minutes between 16:45 and 17:15

-       PIC unreachable for 15-20 minutes between 17:15 and 17:45

 

The goal of the test was to verify that all the backup solutions worked as expected. Tier-1 Sites with a backup link should be up all the time, but at the moment this cannot be guaranteed and there may be outages at any time at any Tier-1 Site.

 

Result: The OPN test ran successfully, without any noticeable degradation.

3.2      Databases

DB Migrations - The test migration to new quad-core HW has been successfully performed on all LHC offline Oracle databases concerned. The final migrations of the LHC RAC production databases to the new HW will go ahead as planned as follows:

-       CMSR & LHCBR: Tuesday 15.04

-       ATLR: Wednesday 16.04

-       LCGR: Thursday 17.04

 

These major migrations (which include a move from 32 to 64 bits) will be performed with a downtime of two hours, thanks to the use of Oracle Data Guard technology.

 

The downstream databases for the ATLAS and LHCb Streams setup will be migrated to the new hardware on Tuesday 15 April and Wednesday 16 April.

 

Clean-up of ATLAS LFC - The LFC deployment team has proposed a clean-up of the ATLAS LFC local catalog deployed on the LCG RAC; this will follow the hardware upgrade and the ATLAS local LFC will not be available for about two hours.

 

Solved RAC5 and 6 Problems - A blocking issue with the monitoring of the storage arrays on RAC5 and RAC6 has been solved with the help of IT-FIO-TSI.

 

Problems with the Infotrend Storage – The new Infotrend storage has shown further problems with its controllers. So far 7 controllers out of 60 new arrays (3 in the previous week, 4 last week) have had issues requiring the vendor’s intervention.

 

CNAF Move Completed - CNAF finished the computer centre move last Monday, 7th April, after 2 weeks of downtime. LFC/LHCb replication was synchronized in less than 1 hour. ATLAS replication is still pending because CNAF is now migrating the production servers for ATLAS. The CNAF ATLAS database is scheduled to be ready next Monday, 14th April.

 

BNL Upgraded to 64 Bits - BNL successfully completed an intervention to upgrade the production servers for ATLAS from 32 to 64 bits. Even though the intervention was extended by one day due to some complications, there was no impact on the ATLAS replication environment. The migration process had been tested using a test-bed at CERN.

 

Problems with the New HW - Problems with the new hardware (~10% of storage controllers), not seen during acceptance tests, are a concern. Rather than postpone the migration to new hardware at CERN until after the May challenge, the on-going plan is to retain the old hardware as “stand-by” databases for the new. The new hardware should still be tested, because it is the hardware that will be used in the future.

3.3      Other Services

WMS on SLC4 Tests - Experiments have been invited to join a production scale test of the SL4 version of the WMS.

 

Many Interventions Planned - See also the CERN IT Service Status Board: a large number of interventions are scheduled for this week.
http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/

3.4      Sites

Site issues and comments reported during the week:

-       RAL (Tue): Would like information on Tier-1 resource requirements for the May CCRC from CMS and ALICE as soon as possible.

-       NL-T1 (Tue): Confirmed that they expect the new (2007 pledges) hardware resources to start being published during the last 2 weeks of May.

-       NL-T1 (Wed): Reported that ATLAS jobs disappeared rapidly around 21.00 last night (14 April). K.Bos will follow up and investigate what happened.

-       GRIF (Wed): Plan to install the new version of DPM, which will correct ACL support for ATLAS. S.Jezequel is going to provide a script to correct the ACLs of files already in the GRIF ATLAS DPM.

-       RAL (Fri): Have upgraded all their LHC Castor instances to version 2.1.6.

-       NL-T1 (Fri): SARA is having problems completing their dCache upgrade, and this has caused batch job failures at both SARA and NIKHEF.

 

J.Gordon added that the information for RAL is needed from all Experiments. CMS and ALICE have not provided such information yet.

M.Kasemann replied that he will find that information and CMS will provide it to RAL.

3.5      Experiments

 

ATLAS

ATLAS provides detailed daily reports on plans, current activities and problems, as well as follow-up on issues carried over from previous days. These cover both Experiment and Site status, including follow-up from Site experts on issues seen.

 

Here is the link to the notes from last week’s meetings. It would be interesting to get feedback from Sites on the level of detail provided and whether similar information from all Experiments would be more useful.

 

ALICE, CMS, LHCb

Below is the information provided by the other 3 Experiments.

 

-       CMS (Tue): The recent SAM SRM test failures, due to information disappearing from the BDII, have been fixed.

-       LHCb (Tue): Preparing for infrastructure testing to start on 18 April. The slow throughput recently seen on CERN-to-CNAF traffic has been corrected (the cause is not known). The LHCb LFC is now replicated to NL-T1, and replication to PIC is scheduled to start next week.

-       LHCb (Wed): For the May CCRC they are requesting a new space token (LHCBUSER) of type “custodial online” to be deployed at CERN and the Tier-1 Sites. They would require 3.3 TB of such space at CERN, about 3 TB at NL-T1 and less at the other Tier-1 Sites. When asked, J.Templon said that NL-T1 should be able to provide this space, provided it is within the MoU envelope of LHCb.

-       LHCb (Fri): The main activity next week will be movement of data in preparation for CCRC’08 phase 2. There is currently a problem with the RAL LFC.

3.6      Outlook

Next week’s WLCG Collaboration Workshop – over 200 registered people so far – will cover lessons learned / feedback from February as well as planning for May. The agenda leaves plenty of time for discussion, both in the meeting room and outside.

 

Looking forward, in the next couple of months, we have:

-       CCRC’08 post-mortem workshop June 12 – 13 at CERN,

-       HEP Application Cluster sessions at EGEE’08 in Istanbul

 

Around that time, hopefully overlapping with data taking, one will have to start thinking about 2009: middleware versions, storage-ware, resources, etc., and the plans for testing it all again.

As during March/April, we foresee continuing with the same schedule of conference calls and weekly summaries after CCRC’08.

 

J.Gordon asked whether there is going to be a similar approach in 2009, with combined challenges at the beginning of the year.

J.Shiers replied that probably tests in February and commissioning in May will be needed.

 

I.Bird reminded the Tier-1 Sites that, in order to have the 2009 pledges in place by April 2009, sites should already be preparing their procurement procedures and starting the process. In addition, the accelerator schedule for 2009 should be known in order to plan in more detail.

 

New Action:

I.Bird will ask CERN for information about the accelerator plans for 2009.

 

4.   Next Referees Meeting (May 5) - WLCG One-Day Review (June 30 or July 1) – I.Bird
 

Referees Meeting - Next Referees meeting should be prepared and will be on Monday 5 May 2008 (http://indico.cern.ch/conferenceDisplay.py?confId=27533).

I.Bird will ask for input on what to put in the agenda.

 

WLCG Review – This year there will be 2 one-day WLCG Reviews; one in June and one in November 2008. The details will be discussed and prepared in time.

 

At the end of June the review will probably cover the issues raised at the previous review, i.e. SRM, Castor, 24x7 and CCRC. It would also be good to have the Experiments’ feedback about their CCRC experience.

 

5.   LCG 3D project status and proposed future steps (Slides) - D.Duellmann

D.Duellmann summarized the status of the 3D project and proposed the next steps.

 

The Distributed Database Deployment project was started by the PEB in July 2004.

http://lcg.web.cern.ch/LCG/PEB/Minutes/Minutes20040720.html

 

The original proposal is described in those minutes.

 

5.1      Current Status

Large-scale database installations are now in place at the CERN Tier-0 and at ten Tier-1 Sites, and have been deployed in production mode since April 2007, according to the resources requested by the Experiments.

 

Suitable replication technologies have been identified and integrated into the Sites’ database services. The Experiments have tested the installations during several large production activities. Site administrators and Experiments regularly use the LCG-wide DB procedures and monitoring tools that have been developed in the 3D project.

 

Data replication is now widely used and mastered:

-       ATLAS, CMS and LHCb use Streams replication for consistent replication between online and offline

-       ATLAS and LHCb also use Streams for Tier-1 replication

-       CMS uses FroNTier for Tier-1 and Tier-2 Sites

5.2      Software Integration

Site replica look-up and grid authentication (including VOMS roles) have been integrated into the LCG Persistency Framework.

 

The FroNTier protocol has been integrated into the LCG Persistency Framework; FroNTier servers were set up as part of the 3D project and are now operated directly by CMS.

 

Some remaining scalability and security issues are being addressed by the CORAL server development in the Persistency Framework.

5.3      Proposal

The main service development goals of the project are now achieved. The resulting services, procedures and tools are already well integrated into the daily database operations at CERN and at the Tier-1 Sites.

 

The proposal is to conclude LCG 3D as a service development project.

Any remaining Experiment validation milestones could still be defined, if not (yet) part of the general CCRC’08 activity, and the work will be summarised by issuing a final project report.

5.4      Move to Operations Mode

During the project a well-functioning group of database experts at the Experiments and Sites has been formed; this communication channel has been very productive and should be maintained.

 

D.Duellmann proposed to continue with the established regular meetings and workshops in order to:

-       coordinate database version upgrades

-       schedule regular resource, hardware and licence reviews

-       host technical discussions between database experts from the Experiments and the LCG Sites.

 

The natural choice as chair of these Database Operational Meetings and workshops would be M.Girone, from the Tier-0 physics database service.

5.5      Acknowledgements

D.Duellmann concluded by thanking the many Experiment and Site database experts, the collaborators from the CERN Openlab and the software development projects, who made it possible to run the 3D project successfully without significant dedicated resources.

 

I.Bird thanked D.Duellmann and the participants in the 3D project for the successful completion of this important project for the LCG.

He also asked how future developments will be managed.

 

D.Duellmann replied that there were no resources dedicated to the 3D project; therefore, if the group continues to meet, small developments can be organized within that same group.

 

I.Bird asked whether some final validation milestones from the Experiments are still needed.

D.Barberis replied that he will report for ATLAS, but CCRC08 in May will already constitute the final 3D test for all Experiments.

 

Decision:

The MB agreed that the 3D project is considered concluded and the final tests will be done during CCRC08-May.

 

6.   Tier-1 and Tier-2 Reliability and Availability - March 2008 (VO_SAM_200803; Tier-1_SR_200803; Tier-2_SR_200803) A.Aimar

The reliability calculated by the VO-specific tests should again be reported and discussed monthly at the MB meetings.

This is the most effective way to improve on the matter. Below is the Reliability calculated for the general OPS tests and for the LHC Experiments.

 

 

Reliability bands: >= 93%, >= 84%, < 84%

Site            OPS    ALICE   ATLAS   CMS    LHCb    SAM Site Name
CERN-PROD       96%    78%     0%      90%    100%    CERN-PROD
DE-KIT          97%    93%     79%     50%    100%    FZK-LCG2
FR-CCIN2P3      96%    74%     14%     87%    99%     IN2P3-CC
IT-INFN-CNAF    87%    86%     82%     94%    100%    INFN-T1
NDGF            99%    100%    n/a     -      -       NDGF-T1
ES-PIC          86%    100%    69%     80%    100%    pic
UK-RAL          95%    95%     0%      80%    100%    RAL-LCG2
NL-T1           87%    70%     0%      -      100%    SARA-MATRIX
TW-ASGC         96%    100%    0%      95%    -       Taiwan-LCG2
CA-TRIUMF       99%    -       0%      -      -       TRIUMF-LCG2
US-FNAL-CMS     69%    -       -       100%   -       USCMS-FNAL-WC1
US-T1-BNL       92%    -       100%    -      -       BNL-LCG2

 

D.Barberis proposed that at the next meeting ATLAS invite A.Di Girolamo to give a summary of the ATLAS tests and details of the issues found at the Sites.

 

I.Bird proposed that in the future the Experiments should report on the values of their VO-specific tests (those where Sites are below the target).

This could be a topic for the F2F meeting in May.

 

J.Templon reminded the MB that it is important that the reliability calculation uses only tests that are really critical.
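
The point above can be illustrated with a minimal sketch, which is an assumption for illustration only and not the actual GridView/SAM algorithm: a site counts as “up” for a given hour only if all *critical* tests passed, and reliability differs from availability by excluding scheduled downtime from the denominator. The test names used are hypothetical.

```python
# Illustrative sketch only -- NOT the actual GridView/SAM computation.
# Assumptions: per-hour test results and a configured critical-test list.

def hour_status(results, critical):
    """A site is 'up' for an hour only if every critical test passed;
    failures of non-critical tests are ignored."""
    return all(results.get(test) == "ok" for test in critical)

def availability_reliability(hours):
    """hours: list of (results_dict, scheduled_downtime_flag) samples."""
    critical = {"job-submission", "srm-put", "srm-get"}  # hypothetical names
    total = len(hours)
    up = sum(1 for results, _ in hours if hour_status(results, critical))
    sched = sum(1 for results, flag in hours
                if flag and not hour_status(results, critical))
    availability = up / total
    # Reliability removes scheduled downtime from the denominator.
    reliability = up / (total - sched) if total > sched else 1.0
    return availability, reliability

# Example: 20 good hours, 2 hours scheduled downtime, 2 unscheduled failures.
ok = {"job-submission": "ok", "srm-put": "ok", "srm-get": "ok"}
bad = {"job-submission": "error", "srm-put": "ok", "srm-get": "ok"}
hours = [(ok, False)] * 20 + [(bad, True)] * 2 + [(bad, False)] * 2
avail, rel = availability_reliability(hours)
print(f"availability={avail:.0%} reliability={rel:.0%}")  # 83% vs 91%
```

The sketch shows why the choice of critical tests matters: a failure of any test in the critical set marks the whole hour as down, so adding more aggressive tests (as ATLAS did in March) can sharply lower the reported numbers.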

 

D.Barberis distributed via email a summary after the MB meeting:

 

From: Alessandro Di Girolamo <Alessandro.Di.Girolamo@cern.ch>
Date: 15 April 2008 17:27:05 GMT+02:00
To: Dario Barberis <Dario.Barberis@cern.ch>
Subject: Re: [LCG MB] March 2008: VO-specific SAM tests

As discussed two months ago, we changed the SAM ATLAS CE list of critical tests. This is the main reason for the (bad) availability results of March.
Together with the critical tests, we also changed the policy of "blacklisting" a CE when it fails: now there is no blacklist. This was done to avoid this new, more aggressive testing policy becoming just a blind blacklisting of all the CEs, while the results can be used to understand how much work is still to be done at many Sites.

This is the list of the critical tests for the ATLAS Site Services:
CE:
    Job Submission
    CA certs version
    VO Tag management
    VO sw directory
    GangaRobot
SE (SRM):
    Put: copy and register a file to the SE
    Get: copy back the file from the SE
    Del: Delete a file from the SE

GangaRobot not working on many Sites, and VO Tag management (mainly on the CERN CEs), are the most common causes of failures.

- ASGC: SE ok; CE (3 CEs): GangaRobot never worked; the upgrade of lcg-utils seems not to have been done
- BNL: as usual they are outside the BDII, so they cannot be tested in the standard SAM way
- CERN-PROD: SE ok; CE (22 CEs): VO Tag management. They do not publish the VO tags correctly. A ticket was opened more than one month ago.
- FZK: SE ok, except for the last 3 days of March when they suffered an unscheduled downtime; CE (3) ok, except for 2 days. They seem to have suffered all month from a problem with one of their CEs, and when they put the other two in scheduled downtime, failures came from the one mentioned above. It seems to have been a technical problem of downtime scheduling.
- IN2P3-CC: the SE problem has been solved (they were not publishing in the IS a default path for lcg-utils to write into); CE (3): problems with GangaRobot, still not working at the Site; moreover they suffered Job Submission problems
- INFN-T1: downtime for almost 10 days, the rest was almost ok. One detail: the StoRM SE is not under SAM testing; this should still be solved for ATLAS, since ops is ok for them
- NDGF: as BNL
- NIKHEF: SE ok; CE: problems with GangaRobot
- PIC: downtime of 5-6 days (15-20 March); the CEs had problems that have since been solved. Now they are ok
- RAL: SE ok; CE (1): problems with GangaRobot
- SARA: CE problems with GangaRobot and with the VO Tag publication. Their problems with the SE seem to be more a problem of the tests
- TRIUMF: the SE seems to be failing but is not really: the tests running now, due to some SAM technical limitations, test more than one endpoint together, so if one of them fails, the test result is an error. Among the endpoints being tested there must be an old endpoint that should be taken out of the SAM DB (our action). The CE is ok

 

7.   AOB

7.1      Tier-2 in Turkey

I.Bird reported that two Tier-2 Sites in Turkey want to know from ATLAS and CMS who their respective Tier-1 Sites will be.

 

D.Barberis replied that for ATLAS the Tier-1 is NL-T1. Sites without a “national” grouping to a Tier-1 are assigned to NL-T1.

P.McBride replied that CMS will define this later. I.Fisk added that they have requested an association with the US Tier-1 Site (FNAL), but maybe a Site in a time-zone closer to Turkey would be more appropriate.

 

8.   Summary of New Actions

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.

 

New Action:

30 Apr 2008 – Sites send to H.Renshall their plans for the 2008 installations: what will be installed for May and when the full capacity for 2008 will be in place.

 

New Action:

31 May 2008 - OSG Tier-2 Sites should report availability and reliability data into SAM and GOCDB so that the monthly Tier-2 report can include their information.

 

New Action:

I.Bird will ask CERN for information about the accelerator plans for 2009.