LCG Management Board

Date/Time

Tuesday 25 November 2008 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=45194

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 28.11.2008)

Participants

A.Aimar (notes), D.Barberis, I.Bird(chair), D.Britton, T.Cass, L.Dell’Agnello, F.Donno, M.Ernst, X.Espinal, I.Fisk, S.Foffano, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, M.Lamanna, P.Mato, S.Newhouse, A.Pace, R.Pordes, B.Panzer, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 2 December 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments on the minutes. The minutes of the previous MB meeting were then approved.

1.2      SRM V2 in the GridView Reports

The MB agreed last week to move to SRM V2 for the GridView Reports. This means that the SRM V2 results replace the previous SRM V1 and SE tests in the reports.

Those old tests can stay in SAM but will no longer be considered critical in the GridView reports and will not impact the calculation of site availability and reliability.
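
For illustration only (not part of the minutes), a minimal sketch of how critical-test results typically feed the availability and reliability figures, assuming the standard WLCG definitions (availability = time OK / total time; reliability = time OK / time outside scheduled downtime); the interval data and function name are hypothetical:

    # Hypothetical per-interval results for one site.
    # Each entry: (all_critical_tests_ok, in_scheduled_downtime)
    intervals = [
        (True, False), (True, False), (False, False),   # one unscheduled failure
        (False, True),                                   # scheduled downtime
    ]

    def availability_reliability(intervals):
        total = len(intervals)
        ok = sum(1 for passed, _ in intervals if passed)
        unscheduled = total - sum(1 for _, sched in intervals if sched)
        availability = ok / total                            # fraction of all time the site was OK
        reliability = ok / unscheduled if unscheduled else 0.0  # scheduled downtime excluded
        return availability, reliability

    print(availability_reliability(intervals))  # -> (0.5, 0.666...)

Tests removed from the critical set simply do not appear among the per-interval results and therefore cannot lower either figure.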

 

The proposal is to switch to the new tests by December.

 

R.Tafirout noted that Sites will need some configuration work and may therefore need a bit more time.

The discussion continued for a few minutes (and resumed at the end of the meeting), but moving to January was seen as merely delaying what can already be done now.

 

Decision:

The MB concluded that the Sites should be ready by 1 December. Sites can already check that their SRM V2 setup is correct, as the values are already being collected.

 

New Action:

30 Nov 2008 - Sites should configure their SRM V2 (if needed) before the end of November. The SRM V2 results are already available in the SAM tests and will be used for the December GridView Reports.

1.3      Connectivity between BNL and CNAF

A workaround has been found for the link between BNL and CNAF, but the general problem is still there.

 

I.Bird reported that it seems CNAF has an objection to supporting direct Tier-1 to Tier-1 traffic. The OPN should also be used for this purpose.

L.Dell’Agnello replied that the OPN was originally a Tier-0 to Tier-1 communication solution and is now being stretched to cover all Tier-1 to Tier-1 connections. But there is no coordination of these links in the OPN meetings: each site discusses separately with all the others. The result is that much of the traffic actually passes via CERN, which is not a sound architecture.

 

I.Bird reported that D.Foster said the configuration is such that Tier-0 to Tier-1 traffic always has precedence over any other transfer, and the 10 Gb bandwidth also seems sufficient for all Tier-1 to Tier-1 traffic.

L.Dell’Agnello added that CNAF is also looking at other backup links (via FZK) and a solution will be available in a few weeks.

 

2.   Action List Review (List of actions) 
 

  • SCAS Testing and Certification

M.Schulz reported that this is still ongoing. SCAS can be released without the fix to gLexec, but it will not work with gLexec in logging-only mode. SCAS is being certified and will be released.

J.Gordon reminded the MB that the security team is against logging-only mode.

  • F.Donno will distribute a document describing how the installed capacity accounting is collected; it should describe in detail the proposed mechanism for sites to publish their inhomogeneous clusters using the current GLUE 1.3 schema Cluster/SubCluster structures.

Done.

 

New Action:

7 Dec 2008 - Sites should comment on the proposal by F.Donno for reporting installed capacity.

  • VOBoxes SLAs:
    • Experiments should answer to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).

F.Hernandez reported that there is no news from CMS. M.Kasemann will follow up the issue.

    • NL-T1 and NDGF should complete their VOBoxes SLAs and send it to the Experiments for approval.

NL-T1 not completed yet.

NDGF is still modifying the SLA with ALICE.

  • The dCache team should report about the client tools: they should present estimated timelines and issues in porting to gcc 4.3.

  • HL Milestones and Metrics
    • The next steps will be to propose some new milestones and target dates.
    • Proposals for metrics are also needed from the Tier-1 Sites.

  • Sites request clarification of the data flows and rates from the Experiments. Ideally the information would be in the form of the Dataflow provided by LHCb.

Not done yet.

  • Experiments should verify whether the gcc 4.1 binaries work and report on issues and problems. P.Mato will report in a couple of weeks.

D.Barberis noted that ATLAS will not test gcc 4.3 for the moment.

M.Kasemann noted that CMS ships its own software and compiler; CMS will only try gcc 4.3 next year.

The question remains of who is then going to use and test gcc 4.3.

Done for ATLAS.

 

3.   LCG Operations Weekly Report (ASGC update; Slides) – J.Shiers

Summary of the status and progress of LCG Operations; it covers the last two weeks.

The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary of the Week

ASGC – continuing problems with the CASTOR service. The current issues are clearly with the Oracle service and configuration.

The current recommendation from CERN experts is to do a complete reinstallation of Oracle. The preferred option is a cluster configuration (RAC) with Oracle Automatic Storage Management (ASM) to manage the storage used for data – not OCFS2 (the Oracle Cluster File System, which will be abandoned) as currently used by ASGC.

This requires a certain level of Oracle experience, which in any case is commensurate with that required to run a service of this type.

A possible stop-gap solution – though one that would further increase diversity – would be a “single instance”, i.e. non-RAC, installation to re-establish at least some service.

An appropriately qualified, pro-active DBA (~1 FTE) is required as soon as possible, together with participation in the CASTOR external operations and Distributed Operations meetings.

 

L.Dell’Agnello asked whether it is acceptable that 1 FTE is needed to follow up Oracle.

J.Shiers replied that in the current period this is necessary, but hopefully it will not be the case once the procedures are automated and the databases at some Sites are more stable than now.

 

FZK – clarification of LHCb LFC streaming issues.

 

CERN - CASTOR ATLAS degradation at CERN. CASTOR database jobs were not running on the right cluster node. A full report is available here: https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20081120

 

CERN - Due to a cooling problem the batch system was shut down. Submission of new jobs is still accepted, but the jobs will not be started. A fix is expected by the end of the day.

The cooling of the Computer Centre failed in the morning at around 08:00 due to a major leak in some cooling pipe. In order to limit the temperature increase in the machine room, the batch system was quickly stopped. The ventilation units were also restarted as soon as possible by the cooling expert in order to cool down the room with fresh air only. The batch system will likely remain down for a few hours, until the cooling system is properly fixed.

The leak in the cooling installation was fixed at 11:40. The cooling system is being restarted but will only be fully operational at the end of the afternoon. Batch nodes will be progressively restarted in accordance with the available cooling capacity and with the importance of the nodes (nodes serving important queues will be restarted first).

 

OPN Network - There are repeated complaints about difficulties in, or lack of, communication between the various teams involved in providing and using network services. It is not clear how, or whether, these are addressed by the recent proposal on network operations presented at the November GDB. The recommendation is that the problem needs to be owned and pursued.

 

M.Ernst added that the situation should be considered unacceptable. The service providers do not monitor their systems but rely on feedback from the users. This was the case for the issues between BNL and CNAF, plus other issues at FNAL and BNL. He tried to contact US LHCnet and ESnet but no information ever comes from them. The transatlantic links involve both US and European providers, but there is no adequate monitoring or information.

 

J.Shiers also presented possible useful actions to undertake:

-       Regular (quarterly?) WLCG <-> Oracle technical reviews should be established to prioritize and review open Service Requests. We need to make sure that problems that affect (degrade or stop) the services get fixed on an acceptable timescale: days or weeks may be acceptable; months or years are not.

 

J.Gordon suggested that the meeting with Oracle could take place in the week of the CASTOR F2F at RAL.

J.Shiers supported the proposal.

 

-       T0 / T1 storage services: it is urgent to minimize complexity / diversity and to streamline operations / stabilize services. The dCache workshop being organized at FZK in January and the next CASTOR F2F workshop at RAL in February could be central to this.

-       Recommended configuration(s), routine operations, procedures and “best practices” need to be documented and shared.

 

J.Templon added that the current Site practices could be reviewed in another Tier-1 review, as was done in 2006.

I.Bird agreed but noted that this will not happen before the CASTOR and dCache workshops in early 2009. A review could be organized in March or April 2009, not before.

 

4.   Reporting Installed Capacity (Slides; Document) – F.Donno

 

F.Donno prepared a document explaining to the sites how to report installed CPU and Storage capacity via the GLUE information.  The slides attached summarize the proposal in the document.

 

The proposal is to count the number of machines at a Site (including those down or unavailable) and to multiply this number by the number of cores per machine.

Sites that also provide resources to other VOs should take the fair-share used into account.
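
For illustration only (not part of the proposal document), a minimal sketch of the arithmetic just described; the site numbers and the fair-share fraction are hypothetical assumptions:

    # Hypothetical site: installed CPU capacity as described above.
    machines = 120          # all machines, including those down or unavailable
    cores_per_machine = 8   # cores per machine
    wlcg_fair_share = 0.75  # assumed fraction of the cluster allocated to WLCG VOs

    installed_cores = machines * cores_per_machine            # 960 cores in total
    wlcg_installed_cores = installed_cores * wlcg_fair_share  # 720 cores counted for WLCG

    print(installed_cores, wlcg_installed_cores)  # -> 960 720.0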

 

L.Dell’Agnello asked how to deal with the number of cores versus the number of job slots (e.g. 8 cores and 10 job slots).

I.Bird and F.Donno replied that for accounting purposes it is the number of cores that counts, not the number of job slots. The number of job slots is stored in another database and used for other purposes.

 

The WN working group has proposed that non-homogeneous sub-clusters be split into homogeneous entities, each of which is reported separately for accounting. The proposed approaches are described in the document; a sketch of the idea is shown below.
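
For illustration only (not taken from the document), a minimal sketch of splitting a heterogeneous set of worker nodes into homogeneous groups; the node data and grouping key are hypothetical:

    from collections import defaultdict

    # Hypothetical worker nodes: (CPU model, cores, memory in GB)
    worker_nodes = [
        ("Xeon E5410", 8, 16), ("Xeon E5410", 8, 16),
        ("Opteron 2356", 8, 32), ("Opteron 2356", 8, 32), ("Opteron 2356", 8, 32),
    ]

    # Group nodes by their hardware description; each group becomes one
    # homogeneous sub-cluster that is published and accounted separately.
    sub_clusters = defaultdict(int)
    for node in worker_nodes:
        sub_clusters[node] += 1

    for (cpu, cores, mem_gb), count in sub_clusters.items():
        print(f"SubCluster: {cpu}, {count} nodes x {cores} cores, {mem_gb} GB -> {count * cores} cores")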

Three other attributes are discussed, as well as how to report the Storage Capacity at the Site.

 

All information should be collectable by the Information Provider and not set manually.

 

The document has been prepared with all parties involved: developers, Sites and the Experiments’ users.

 

Decision:

I.Bird proposed allowing 2 weeks for comments; if there are no major changes, the proposal should be implemented right after. Sites will be asked to verify and maintain the data that they publish.

 

J.Gordon asked whether there is a way to check that Sites implement this policy.

F.Donno replied that the verification procedure already implements these checks.

 

I.Fisk noted that these values will be scrutinized in detail and asked why another way of reporting them is needed.

I.Bird replied that the Tier-2 Sites do not report their accounting information in any other way, and the Experiments need to know this information in detail in order to plan their work.

 

M.Ernst added that integration with the current middleware makes it difficult to report the Storage information correctly. The information is very sensitive and should not be reported until it is certain to be correct; it should also not be maintained manually, otherwise it will always be difficult to keep up to date.

 

I.Fisk added that this information should be provided by the software used at the Tier-2 Sites (dCache, DPM, etc.) and then carefully validated before any distribution.

 

New Action:

10 Dec 2008 – Comments to the proposal for collecting Installed Capacity (F.Donno’s document) should be sent to the MB mailing list.

 

 

5.   Follow-up on New HL Milestones (Document)

 

The milestones were not discussed at the MB meeting but will be reviewed at the F2F in 2 weeks.

 

New Action:

10 Dec 2008 – Comments to the proposal of new High Level Milestones should be sent to the MB mailing list.

 

6.   Plans during the Xmas Period - Experiments Representatives

 

The Experiments presented their plans for the Xmas period.

 

ATLAS – Will operate on a best-effort basis but has enough people available to keep the system working. MC production and user access will be the main activities. Functional tests will be restarted after the period.

 

CMS – Will do MC production and will therefore rely on the infrastructure working (including the Tier-2 Sites).

 

ALICE – Will also do MC production, and the Tier-2 SEs should be working.

 

LHCb was not represented at the meeting.

 

7.   AOB

 

 

Y.Schutz asked about the feedback from the LHCC on the changes in resources requested by the Experiments.

I.Bird replied that there was no feedback in the public sessions. The proposed changes will only be analysed over the coming weeks.

 

8.    Summary of New Actions

 

New Action:

30 Nov 2008 - Sites should configure their SRM V2 (if needed) before the end of November. The SRM V2 results are already available in the SAM tests and will be used for the December GridView Reports.

 

New Action:

10 Dec 2008 – Comments to the proposal of new High Level Milestones should be sent to the MB mailing list.

 

New Action:

10 Dec 2008 – Comments to the proposal for collecting Installed Capacity (F.Donno’s document) should be sent to the MB mailing list.

 

New Action:

7 Dec 2008 - Sites should comment on the proposal by F.Donno for reporting installed capacity.