LCG Management Board

Date/Time:

Tuesday 29 May 2007 16:00-17:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=13797

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 31.5.2007)

Participants:

A.Aimar (notes), D.Barberis, I.Bird, Ph.Charpentier, L.Dell’Agnello, D.Duellmann, M.Ernst, I.Fisk, S.Foffano, J.Gordon, C.Grandi, F.Hernandez, M.Lamanna, J.Knobloch, G.Merino, B.Panzer, L.Robertson (chair), J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 5 June 2007 16:00-18:00 - F2F Meeting

1.      Minutes and Matters arising (minutes) 

 

1.1         Minutes of Previous Meeting

Minutes of the previous MB meeting approved.

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 15 May 2007 - J.Gordon agreed to make a proposal on how to also store non-grid usage in the APEL repository.

Not done. J.Gordon said that he would present a proposal in a couple of weeks.

 

  • 22 May 2007 - I.Fisk will circulate to the MB the description of the CMS SAM tests.

Not done. He will do it this week.

 

  • 22 May 2007 - L.Robertson will distribute a proposal that had been prepared after the MB presentation on benchmarking in March.

Done. Proposal distributed and a talk is scheduled for the F2F MB meeting next week.

 

  • 29 May 2007 - A.Aimar and L.Robertson circulate follow-up milestones on the VOs Top 5 Issues to the MB.

Not done. Will be done in the next couple of weeks.

 

3.      Data Integrity (document) - B.Panzer

 

On 8 April 2007 B.Panzer distributed a document on data integrity, covering the causes of data errors (transfers, memory, disks, etc.) and some possible solutions, or at least strategies for detecting those problems.

 

The same issue was also discussed at HEPiX and input is expected from the Experiments, in particular on what is acceptable in terms of data loss and undetected errors. Further discussion with the Experiments and the Sites is needed in order to have the issue clearly specified and to reach agreement on the appropriate actions to be taken.

 

L.Robertson asked for reactions from the Experiments and the Sites.

 

J.Templon added that inserting checksums in the catalogues, and in many other databases at the sites, is a major change to all of the middleware and to the services. Instead, the Experiments should perform these calculations in their applications and store the error checks as separate data files.

 

J.Gordon noted that the Experiments do not always read all the data in each file but only selected blocks of data, and there cannot be a checksum for each of those blocks (or at least this would be an expensive solution).

 

G.Merino added that the transfer services (FTS) should check that the data is valid before and after the execution of the transfer.

 

Ph.Charpentier agreed that this check should be done by FTS which is performing the transfer (like FTP does).

 

C.Grandi replied that these are “third party transfers” for FTS, so the checksum should be calculated on the SE at the site. This would have to be added to the SRM systems at the sites, but it is very complicated to agree on it and then add it to all implementations.

 

D.Düllmann proposed that the most effective place to generate a checksum and check for data loss would be in the application level read and write functions – ROOT and the functions used by the experiment frameworks.
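As an illustration of the application-level approach discussed above, the following is a minimal sketch, not something presented at the meeting. It assumes Adler-32 as the checksum (the minutes do not name an algorithm) and stores the value in a separate sidecar file next to the data, in the spirit of J.Templon's suggestion. The function names, the exception class and the ".adler32" suffix are hypothetical.

```python
import zlib


class ChecksumError(IOError):
    """Raised when a file's content does not match its stored checksum."""


def write_with_checksum(path, data):
    """Write data to path and its Adler-32 checksum to a sidecar file."""
    with open(path, "wb") as f:
        f.write(data)
    checksum = zlib.adler32(data) & 0xFFFFFFFF
    # Hypothetical convention: checksum kept as hex text in "<path>.adler32".
    with open(path + ".adler32", "w") as f:
        f.write(f"{checksum:08x}\n")


def read_verified(path):
    """Read path; raise ChecksumError if it no longer matches the sidecar."""
    with open(path + ".adler32") as f:
        expected = int(f.read().strip(), 16)
    with open(path, "rb") as f:
        data = f.read()
    actual = zlib.adler32(data) & 0xFFFFFFFF
    if actual != expected:
        raise ChecksumError(f"{path}: expected {expected:08x}, got {actual:08x}")
    return data
```

This only detects corruption of a whole file. It does not address recovery, nor the partial reads of selected blocks noted by J.Gordon.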

 

Ph.Charpentier also noted that the document mentions an error rate of 1 byte in 10 million under some conditions. This is clearly not acceptable. If true, every file larger than 10 MB could be corrupted.
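To put the quoted rate in perspective, the short calculation below is a sketch under a simplifying assumption that is not stated in the minutes: byte errors are independent and occur uniformly at a rate p per byte. A file of n bytes is then affected with probability 1 - (1 - p)^n. The function names are illustrative, and 10 MB is taken as 10,000,000 bytes.

```python
import math


def prob_file_corrupted(size_bytes, byte_error_rate):
    """P(at least one bad byte) = 1 - (1 - p)^n, assuming independent errors."""
    # log1p/expm1 keep precision when the per-byte rate is very small.
    return -math.expm1(size_bytes * math.log1p(-byte_error_rate))


def expected_bad_bytes(size_bytes, byte_error_rate):
    """Mean number of corrupted bytes in a file of the given size."""
    return size_bytes * byte_error_rate


if __name__ == "__main__":
    rate = 1e-7  # 1 byte in 10 million, as quoted from the document
    for size in (1_000_000, 10_000_000, 100_000_000):
        print(size, prob_file_corrupted(size, rate), expected_bad_bytes(size, rate))
```

Under this model a 10 MB file has one bad byte on average and a probability of about 63% (1 - 1/e) of containing at least one, which is consistent with the concern raised above.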

 

B.Panzer mentioned several open issues:

-          What is the acceptable level of error for the Experiments?

-          And what do we do once an error is detected? How to recover the data? Find the files somewhere else?

-          Where should the original checksum be stored? For checking it every time the file is used? Read it from the catalogue?

-          How do we proceed with compressed files? If just one byte is wrong the whole file is unusable.

 

L.Robertson proposed that the issue be investigated more actively than it has been so far. The Experiments and the Sites should meet, discuss the issues and propose some actions to the MB. D.Barberis and Ph.Charpentier (other experiments not represented) agreed that the Experiments will look into the issue in the next couple of weeks.

 

J.Shiers noted that in Victoria there are two BoF sessions still free that could be useful for this topic.

 

Decision:

B.Panzer should organise a meeting to consider these issues and report to the GDB in July proposing any actions that should be taken.

 

 

4.      High Level Milestones Update (HLM Dashboard) - Roundtable

 

4.1         Milestone dashboard BEFORE the roundtable (25.05.2007)

 

WLCG High Level Milestones - 2007 (25.05.2007)

Legend: Done (green); Late < 1 month (orange); Late > 1 month (red)

Columns: ID | Date | Milestone, followed by one status cell per site: ASGC, CC IN2P3, CERN, FZK GridKa, INFN CNAF, NDGF, PIC, RAL, SARA NIKHEF, TRIUMF, BNL, FNAL. The status cells are colour-coded in the dashboard; cells that contain text are noted in brackets.

24x7 Support

WLCG-07-01 | Feb 2007 | 24x7 Support Definition: Definition of the levels of support and rules to follow, depending on the issue/alarm

WLCG-07-02 | Apr 2007 | 24x7 Support Tested: Support and operation scenarios tested via realistic alarms and situations

WLCG-07-03 | Jun 2007 | 24x7 Support in Operations: The sites provide 24x7 support to users as standard operations

VOBoxes Support

WLCG-07-04 | Apr 2007 | VOBoxes SLA Defined: Sites propose and agree with the VO the level of support (upgrade, backup, restore, etc) of VOBoxes

WLCG-07-05 | May 2007 | VOBoxes SLA Implemented: VOBoxes service implemented at the site, and tested by the Experiments

Job Priorities

WLCG-07-06 | Apr 2007 | Job Priorities Available at Site: Mapping of the Job priorities on the batch software of the site completed and information published

WLCG-07-07 | Jun 2007 | Job Priorities of the VOs Implemented at Site: Configuration and maintenance of the jobs priorities as defined by the VOs. Job Priorities in use by the VOs.

Accounting

WLCG-07-08 | Mar 2007 | Accounting Data published in the APEL Repository: The site is publishing the accounting data in APEL. Monthly reports extracted from the APEL Repository.

3D Services

WLCG-07-09 | Mar 2007 | 3D Oracle Service in Production: Oracle Service in production, and certified by the Experiments [FNAL: n/a]

WLCG-07-10 | May 2007 | 3D Conditions DB in Production: Conditions DB in operations for ATLAS, CMS, and LHCb. Tested by the Experiments.

 

4.2         Update from the Sites

 

ASGC

Not represented at the MB Meeting.

 

IN2P3 (F.Hernandez)

 

WLCG-07-04: VOBoxes SLA Defined.

Work not started yet.

 

WLCG-07-06: Job Priorities Available at the Site.

To be tested next week; the target date is 15 June 2007.

 

WLCG-07-10: 3D Conditions DB in Production.

Installed and available. Not known whether experiments have tested it.

 

D.Duellmann provided a summary of the 3D Milestones for all sites:

-          ATLAS has tested all its Tier-1 sites except PIC, which will be connected soon.

-          LHCb sites have been connected but not regularly tested yet.

-          CMS need not be followed, as they are setting up their sites (CMS is removed from the 3D milestones).

 

Ph.Charpentier added that LHCb would like to have streaming for the LFC database at all sites, like it is now set up at CNAF.

 

J.Templon noted that ATLAS had asked to stop deployment of Job Priorities at sites where this had not been done yet.

 

The next steps are that the installation will be done by YAIM in the standard way, and that documentation must be written for the sites not using YAIM. This should be done in a couple of weeks.

 

The MB agreed that the milestones related to Job Priorities should be suspended until the YAIM distribution is available and the documentation written. Sites will then have 2 weeks to install it locally and complete milestone WLCG-07-06.

 

CERN

 

Not represented at the MB Meeting.

 

FZK

 

Not represented at the MB Meeting.

 

Update via email:

-          WLCG-07-02:  not done due to mismatch with internal time scales (already reported during last report)

-          WLCG-07-04:  the SLA is now in its third iteration; in the meantime we discovered that we can't send this out without comments from our legal department; I will comment on this during the MB

-          WLCG-07-04:  not done

-          WLCG-07-06:  not available in production; we had problems installing the new WMS to test this in pre-production

-          WLCG-07-10:  I think this is done from our side, but I don't know whether experiments do have the conditions DBs in real production. Experiments should comment.

 

 

INFN (L.Dell’Agnello)

 

WLCG-07-02: 24x7 Support Tested

Level of support defined and some services are under 24x7 support. A few tests have implicitly been done because some real alarms have occurred.

 

WLCG-07-04: VOBoxes SLA Defined.

INFN met ATLAS recently and agreed the support level; the agreement still needs to be written down.

 

WLCG-07-08: Accounting Data published in the APEL Repository

INFN collects the data using DGAS and has a tool to export it to APEL. There were some problems last month with the DGAS sensors and RGMA, and the data was re-exported.

L.Dell’Agnello will confirm the status of the milestones via email.

 

NDGF (O.Smirnova)

 

WLCG-07-01: 24x7 Support Defined

24x7 Support is available for the SRM. Planned for June 2007. 

 

WLCG-07-04: VOBoxes SLA Defined.

Same for VOBoxes SLA Defined. Planned for June 2007. 

 

WLCG-07-09: 3D Oracle DB in Production.

The 3D setup is still temporary and is not yet at the site where it should be (Oslo), which involves a new team and installation. It is therefore not considered complete yet.

 

WLCG-07-08: Accounting Data published in the APEL Repository

The accounting interface to SGAS is scheduled for June 2007.

 

PIC (G.Merino)

 

WLCG-07-01: 24x7 Support Defined

No progress on this milestone. Discussing the administrative and technical issues.

 

WLCG-07-04: VOBoxes SLA Defined.

Experiments contacted but milestone not completed.

 

RAL (J.Gordon)

 

No changes. Will send an update offline.

 

Update via email:

-          WLCG-07-04: RAL has draft SLAs for ALICE, ATLAS, and LHCb. They are not yet formally agreed but are close.

-          WLCG-07-10: Conditions DB in production for ATLAS and LHCb. CMS use Frontier for this, so it isn't relevant and shouldn't stop RAL being Green.

 

SARA (J.Templon)

 

No progress.

 

TRIUMF (R.Tafirout)

 

WLCG-07-02: 24x7 Support Tested

No testing done yet, except when real alarms have occurred; what to do in each circumstance is documented.

 

WLCG-07-04: VOBoxes SLA Defined.

Should be completed in a week or two after approval from ATLAS.

 

BNL (M.Ernst)

 

24x7 Support is now in production: people are on call and the service is already being used.

 

WLCG-07-04: VOBoxes SLA Defined.

Defined, just waiting for ATLAS confirmation.

 

WLCG-07-06: Job Priorities Available at the Site.

Should be completed in the next couple of days.

 

Update via email:

-          WLCG-07-04 (VOBoxes SLA) was defined, proposed to and agreed by ATLAS

-          WLCG-07-06 (Job priorities available at site): job priorities are available and published in the information system

 

FNAL (I.Fisk)

 

All done except the Job Priority milestones.

 

Question from I.Fisk: CMS needs to turn off SL3. When is the SL4 release going to happen?

I.Bird replied that the WN has passed PPS testing and is awaiting confirmation from the Experiments. It should soon go to the Production System.

 

4.3         Updated HLM Dashboard AFTER the Roundtable

 

WLCG High Level Milestones - 2007 (25.05.2007)

Legend: Done (green); Late < 1 month (orange); Late > 1 month (red)

Columns: ID | Date | Milestone, followed by one status cell per site: ASGC, CC IN2P3, CERN, FZK GridKa, INFN CNAF, NDGF, PIC, RAL, SARA NIKHEF, TRIUMF, BNL, FNAL. The status cells are colour-coded in the dashboard; cells that contain text are noted in brackets.

24x7 Support

WLCG-07-01 | Feb 2007 | 24x7 Support Definition: Definition of the levels of support and rules to follow, depending on the issue/alarm [FZK GridKa: Sep 2007; NDGF: Jun 2007]

WLCG-07-02 | Apr 2007 | 24x7 Support Tested: Support and operation scenarios tested via realistic alarms and situations

WLCG-07-03 | Jun 2007 | 24x7 Support in Operations: The sites provide 24x7 support to users as standard operations

VOBoxes Support

WLCG-07-04 | Apr 2007 | VOBoxes SLA Defined: Sites propose and agree with the VO the level of support (upgrade, backup, restore, etc) of VOBoxes

WLCG-07-05 | May 2007 | VOBoxes SLA Implemented: VOBoxes service implemented at the site, and tested by the Experiments

Job Priorities

Milestones suspended until there is a YAIM Installation Package (and Documentation for Sites not using YAIM) 15.06.2007

WLCG-07-06 | Apr 2007 | Job Priorities Available at Site: Mapping of the Job priorities on the batch software of the site completed and information published

WLCG-07-07 | Jun 2007 | Job Priorities of the VOs Implemented at Site: Configuration and maintenance of the jobs priorities as defined by the VOs. Job Priorities in use by the VOs.

Accounting

WLCG-07-08 | Mar 2007 | Accounting Data published in the APEL Repository: The site is publishing the accounting data in APEL. Monthly reports extracted from the APEL Repository.

3D Services

WLCG-07-09 | Mar 2007 | 3D Oracle Service in Production: Oracle Service in production, and certified by the Experiments [FNAL: n/a]

WLCG-07-10 | May 2007 | 3D Conditions DB in Production: Conditions DB in operations for ATLAS and LHCb. Tested by the Experiments. [FNAL: n/a]

 

 

5.      AOB 

 

 

H.Marten asked that the Accounting Reports should be distributed one last time and agreed by the MB before sending them to the Overview Board.

 

L.Robertson agreed, but the sites should update their values in the database before the 8th of the month and must promptly return the input data sheet sent by F.Baud-Lavigne. The distribution to the MB will be only to spot errors, not for the sites to complete their accounting information.

 

 

6.      Summary of New Actions

 

 

 

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.