LCG Management Board

Date/Time

Tuesday 21 July 2009 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=62550

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 27.7.2009)

Participants

A.Aimar (notes), J.Bakken, I.Bird(chair), D.Britton, T.Cass, L.Dell’Agnello, M.Ernst, X.Espinal, S.Foffano, Qin Gang, A.Heiss, F.Hernandez, M.Kasemann, M.Lamanna, M.Litmaath, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, R.Tafirout

Invited

M.Girone

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 4 August 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes. The minutes of the previous MB meeting were approved.

1.2      Approval of Security Documents (Grid-UserJobAccountingData.pdf; SecurityIncidentResponse.pdf; VOPortals.pdf) – I.Bird

D.Kelsey had distributed the Security documents for approval. No comments were received.

 

L.Dell’Agnello asked who, according to the SIR document, acts as coordinator following an incident. He added that, in his opinion, the document can nonetheless be approved.

I.Bird replied that it is the Security Officer of the infrastructure (i.e. R.Wartel for EGEE).

 

Decision:

The MB approved the three Security documents.

1.3      SAM Availability Reports - June 2009 (T1_Summary_200906.pdf; Tier1_Reliab_200906.zip; Tier2_Reliab_200906.pdf) – A.Aimar

A.Aimar distributed for information the SAM report for June 2009.

 

2.   Action List Review (List of actions)

 

·         5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

L.Dell’Agnello stated that CNAF completed their internal tests and sent a report to R.Wartel.

The Italian ROC security manager also sent his report. It should be verified with R.Wartel whether the action is now complete.

 

·         Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar

 

Not done by: ES-PIC, FR-CCIN2P3, NDGF, NL-Tier-1, US-FNAL-CMS

Sites can provide what they have at the moment. See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics

Sites should send URLs to the information they already have until they can provide the required information.

DE-KIT is now reporting the data correctly, but ES-PIC has stopped. All Sites reported that they will probably not be able to provide the XML file before the end of the month.
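
For illustration only, the sketch below shows one way a site could generate such an XML file with a few tape metrics, to be published at the URL sent to A.Aimar. The element names and metric identifiers used here are assumptions for the example, not the agreed SLS schema or metric list (see the SLS page linked above for the actual requirements).

    # Illustrative only: write a small XML file with tape metrics for publication at a site URL.
    # Element names and metric identifiers below are assumptions, not the agreed SLS schema.
    from datetime import datetime
    from xml.etree import ElementTree as ET

    def build_tape_metrics_xml(site_id, metrics):
        """metrics: mapping of metric name to numeric value."""
        root = ET.Element("serviceupdate")
        ET.SubElement(root, "id").text = site_id
        ET.SubElement(root, "timestamp").text = datetime.utcnow().isoformat()
        data = ET.SubElement(root, "data")
        for name, value in metrics.items():
            ET.SubElement(data, "numericvalue", name=name).text = str(value)
        return ET.tostring(root, encoding="unicode")

    if __name__ == "__main__":
        xml = build_tape_metrics_xml(
            "Tier1_Tape_Metrics_EXAMPLE",
            {"tape_read_rate_MBs": 350, "tape_write_rate_MBs": 420, "drives_in_use": 12},
        )
        with open("tape_metrics.xml", "w") as f:   # the URL of this file is what is sent to A.Aimar
            f.write(xml)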

 

 

3.   LCG Operations Weekly Report (Slides) – M.Girone
 

 

Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.

All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

The report covers two weeks, 5-18 July, with good participation, including better reporting from FZK and ASGC.

 

No alarm tickets were submitted in the two weeks. The STEP’09 post-mortem workshop was held at the end of the first week (see the Agenda).

 

The main issues in this period were:

-       RAL move to new machine room successfully completed at beginning of this period. More info on Planet WLCG

-       NIKHEF cooling problem – 30% capacity off until move to new CC (foreseen for 10 – 21 August)

 

And the two incidents leading to post-mortem reports were:

-       ATLAS Central Catalogs Degradation. See https://twiki.cern.ch/twiki/bin/view/LCG/PostMortem13Jul09

-       CNAF LFC problem on 12th-13th July (report just received)

 

The table below summarizes the GGUS tickets submitted. As usual, ATLAS and LHCb sent more tickets than the other VOs.

 

VO       User   Team   Alarm   Total
ALICE       3      0       0       3
ATLAS      20     64       0      84
CMS        15      0       0      15
LHCb        1     41       0      42
Totals     39    105       0     144

3.2      Sites Availability

Slide 4 shows the matrix of the SAM availability tests results for each Site vs. VOs.

One can notice that:

-       ALICE no particular issues

-       ATLAS had problems with PIC, RAL and SARA

-       CMS problems with ASGC

-       LHCb problems with SAM tests for a couple of days.

 

In more detail:

ATLAS

-       RAL: 7th-8th July – SRM instabilities (recovering from a long downtime)

-       FZK: 7th-9th July – SRM instabilities

-       CNAF: 12th July – LFC DB problems (post-mortem?). Also some LFC instabilities on 13th due to network glitches

-       SARA: 17th-today – unscheduled downtime on one CE (job submission)

CMS

-       PIC: 7th-8th July – job submission failures due to batch system misconfiguration

-       ASGC: 7th-today – Castor SAM test timeouts (long queues in the Castor job scheduler under load)

LHCb

-       All T1s: 7th-8th July – SAM tests not running properly due to a misconfiguration in DIRAC

3.3      SIR: ATLAS Central Catalog

A DB service interruption, with the kill of several connected sessions, occurred on Sunday 12th at 10:26. Connectivity was restored at 10:28 and, after further interruptions, again at 10:32; full service connectivity was restored at 10:36.

 

The problem is understood: it was caused by a wrong DBA operation while increasing the recovery area (an alert is sent when usage reaches 85%). Details at: https://twiki.cern.ch/twiki/bin/view/LCG/PostMortem13Jul09
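
For context only, a threshold check of this kind can be sketched as below, querying the standard Oracle V$RECOVERY_FILE_DEST view. The connection parameters are placeholders and this is not the actual CERN DB monitoring script.

    # Illustrative sketch: warn when the Oracle recovery area usage exceeds 85%.
    # Connection parameters are placeholders; real alerting is done by the CERN DB monitoring.
    import cx_Oracle

    ALERT_THRESHOLD = 0.85

    def recovery_area_usage(connection):
        cur = connection.cursor()
        cur.execute("SELECT space_limit, space_used, space_reclaimable FROM v$recovery_file_dest")
        limit, used, reclaimable = cur.fetchone()
        return (used - reclaimable) / float(limit)   # fraction of the recovery area effectively used

    if __name__ == "__main__":
        conn = cx_Oracle.connect("monitor", "secret", "db_alias")   # placeholder credentials/DSN
        usage = recovery_area_usage(conn)
        if usage >= ALERT_THRESHOLD:
            print("ALERT: recovery area at %.0f%% used" % (usage * 100))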

3.4      SIR: CNAF LFC Problem

On Sunday 12 July at 01:13 the ATLAS LFC standby database in Roma became unreachable because of a storage problem. Moreover, at CNAF on Sunday afternoon, a not yet understood problem caused the loss of connectivity to the storage area network from several Oracle clusters, among which was the ATLAS LFC one.

 

Due to this connectivity problem, several clusters were automatically rebooted. After the reboot the connection between the LFC front-end and the back-end was automatically restored, but unfortunately the software was not functional. On Monday the 13th the database was hanging with an ORA-29702 error (error in cluster group service operation). About 100 connections were found on the database, while the usual number is 40.

 

The investigation of this problem is difficult because there is a hole in the LFC front-end logs between July 12 at 22:41 and July 13 at 10:19, probably because the lfcdaemon was hanging. As the database in Roma was unavailable, the failover did not succeed.

 

The service was restored on the evening of Monday the 13th, at both the CNAF and Roma sites.

 

L.Dell’Agnello added that, in addition, all logs were lost; therefore a detailed explanation is not possible.

 

Qin Gang noted that the SAM pool at ASGC is very small, which causes the tests to fail even when the services are running correctly. The size of the SAM pool at ASGC will be increased.

 

 

4.   WLCG Technical Forum (Slides) – M.Litmaath

 

 

M.Litmaath presented the goals and his ideas on the WLCG Technical Forum that he will chair.

4.1      Overview of the TF

The TF scope and mandate are:

-       Discuss issues for improvement between WLCG stakeholders. Provide input on a common WLCG position to EGEE, EGI, OSG, the Experiments, etc.

-       Also cover longer-term needs with respect to services and middleware, focusing on the sustainability and evolution of the existing middleware in the light of changing technologies and experience.

-       Look (again) at common solutions in areas where existing practice is weak?

-       It needs to represent all the stakeholders: Experiments, Sites and grid projects, who should bring in the appropriate experts depending on the topic.

-       It does not take decisions, but should produce clear documents for discussion in the GDB and potential agreement in the MB.

4.2      TF Mandate

1. Provide detailed technical feedback and requirements on grid middleware and services, providing the WLCG position on technical issues:

-       Functionality, stability, performance, administration, ease

-       Interfaces, architecture, suitability

-       Maintainability, portability, standards compliance

 

2. Advise the MB on grid middleware and service needs and problems.

 

3. Discuss the evolution and development of middleware and services, driving common solutions where possible.

 

4. Prioritize needs and requirements.

4.3      Membership

It is very important that the Members have the mandate to speak for their communities.

 

Each stakeholder should always have one representative plus one alternate (who can also attend when the first is present).

 

Experiments:

-       1+1 per experiment

Sites

-       1+1 for Tier 0

-       1+1 for Tier 1s

-       1+1 for Tier 2s

Infrastructures

-       1+1 from each of EGEE/EGI, OSG, ARC

Experts

-       Typically brought in as needed

4.4      Possible Topics

A few initial topics could be the following.

-       Data management: Review where we are with SRM vs. actual needs and look further in the future.

-       Analysis model topics

-       Support for virtualization

-       Pilot jobs support

 

The group should also provide technical input to EGI discussions and be explicit and clear about the WLCG needs in the near future:

-       Needs from the NGIs

-       Input to proposal for the SSC for HEP

-       Input to the middleware discussion

 

Data Management

Efficient, scalable data access by jobs (a STEP’09 outcome)

-       Local vs. remote

-       Protocols

-       Throttling

-       T3 farms vs. T2 load

Access Control:

-       ACLs

-       Quotas

Data Access

-       SRM

-       Xrootd

-       GPFS, Lustre file systems

-       NFSv4, Hadoop, REDDNet

-       File protocols

-       Data on clouds

Issues specific to some implementation(s)

-       BeStMan, CASTOR, dCache, DPM, StoRM

 

Job Management

A lot of issues here too.

-       CE: CREAM and WMS; issues with CREAM

-       ARC and Interaction with other middleware

-       Condor-G,-C, GT4

-       MyProxy failover

-       Pilot jobs (GlExec  and Frameworks)

-       Virtualization

-       Clouds

-       Shared SW area scalability (ALICE proposes to replace it with BitTorrent)

-       PROOF

 

Other Issues

-       Security: Vulnerabilities and consistency

-       Information system: Fail-over and GLUE 2.0

-       Monitoring: Jobs, consistency, consolidation

-       Accounting: Messaging system and storage

 

I.Bird noted that the MB should agree on the top issues for this Forum. Currently the most important ones seem to be those agreed at the STEP09 workshop. The current main problem is data access, for all VOs and Sites.

 

T.Cass added that the Virtualization workshop also raised issues about job execution on nodes whose local network addresses are not visible outside a Site.

 

L.Dell’Agnello proposed to have several task forces working in parallel on the many issues.

I.Bird agreed that different experts will discuss the different issues, but separate task forces would be difficult to handle.

 

L.Dell’Agnello added that, in addition, it is difficult for one or two persons to know the situation at all Tier-1 and Tier-2 Sites. These representatives should collect information from all Sites and represent the Sites’ diversity.

 

On how to get started, I.Bird proposed that the Forum set up an all-inclusive mailing list, including the right people depending on the topic.

M.Litmaath added that the great variety of Tier-2 Sites will have to be taken into account.

 

A.Pace added that testing of the analysis models is also a very important item to discuss, and one that was not done in STEP09.

T.Cass seconded the proposal.

 

 

5.   Follow-up to STEP09 Post Mortem Workshop (Agenda; Slides) – I.Bird

                                                                                                                                                

 

I.Bird presented a summary of the discussion at the STEP09 Workshop.

5.1      Tier-0 and Tier-1

The issues below should serve as a kind of action list, to be followed up one by one.

 

All Sites:

-       MSS metrics are needed, as well as real-time transfer metrics.

-       Need instant real-time monitoring of throughput (and a per-day overview) and the ability to view transfers per experiment (WAN in/out; LAN to the WNs). Tools for both site and grid.

NL-T1:

-       Communication and lack of SIRs

-       Lack of tape drives during STEP09 (now installed)

-       DMF tuning needed

-       Unexplained low performance

-       LAN b/w to WN too small

-       Shared SW area problems

-       Repeat tests

ASGC:

-       Is the installed Castor version the appropriate one to fix the BigID issue?

-       Job scheduling T2 vs. T1; ATLAS vs. CMS

-       Low efficiency for CMS reprocessing jobs

-       Repeat tests

FZK:

-       Improve communication to the outside world

-       SAN issues

-       Shared SW area problems

-       SRM server overload

-       dcap access problems

-       Too many lcg-cp calls overload gridftp

-       Repeat tests

NDGF:

-       No MSS performance monitoring

-       Low throughput

-       Analysis jobs overloaded network

-       No Panda pre-staging

-       What is the action to improve?

CNAF:

-       Shared SW area problems

-       Site visits – planned for FZK + NL-T1

5.2      Tier-2

-       Shared SW areas affect CPU efficiencies in some sites (also T1) and need to be investigated.

-       The ATLAS efficiencies differ between WMS and PanDA, likely due to different data access methods.

-       Data transfer timeouts are an issue (see analysis summary)

-       Intra-VO fair shares: how to manage them (the GDB should look into it)

-       VO communications. Need for VO-independent channel to reach all T2 admins

-       Network infrastructure not yet good enough for data rates.

5.3      Data Management

FTS

-       Timeouts for large files

-       Increase the number of concurrent jobs to increase bandwidth

-       Throttle/manage each VO bandwidth?

-       Deploy FTS 2.2?

LFC:

-       LHCb/Coral issues (Coral problem); should LHCb use distributed LFCs? (Not essential if the Coral problem is fixed.)

-       Deploy bulk methods (new additions)

 

Lcg-util – hangs at French sites (only?)

 

Dcap access for LHCb (root issue?)

 

dCache:

-       Need a clear strategy for Chimera (and risk analysis)

-       Explain the "golden" release vs. what we have (and risk analysis)

-       Site decisions, but must be based on understanding of risks

Data transfer timeouts

-       dcap/rfio problems at high (analysis) load

-       90% of (analysis) job failures are read failures

-       Are these the same issue? (What?  SRM?)

Better data access strategies???

-       Pre-stage vs. copy locally vs. Posix-IO

-       General investigation of data access methods and problems
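
To illustrate the comparison above, here is a rough Python sketch of the two job-side patterns (copy to the worker node first vs. direct POSIX-style read); pre-staging from tape to disk would happen before either. The lcg-cp command and the paths are examples only; the actual tools and protocols (dcap, rfio, xrootd, GPFS/Lustre mounts) depend on the site and storage implementation.

    # Rough illustration of job-side data access patterns; commands and paths are examples only.
    import subprocess

    def copy_then_read(grid_url, local_path):
        """Copy the file to the worker node first (here via lcg-cp), then read the local copy."""
        subprocess.check_call(["lcg-cp", grid_url, "file://" + local_path])
        with open(local_path, "rb") as f:
            return process(f)

    def direct_posix_read(path):
        """Read the file in place through a POSIX-like mount or protocol plugin; no local copy,
        but more sensitive to storage server load and to the timeouts discussed above."""
        with open(path, "rb") as f:
            return process(f)

    def process(f):
        # Stand-in for the real analysis code: just count the bytes read in 1 MB chunks.
        total = 0
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            total += len(chunk)
        return total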

5.4      Other Services

WMS:

-       Stability/scalability (DNS load balancing, multi-core, ...)

-       WMS 3.2 to deploy?  But critical bug (for at least LHCb)

LCG-CE problems

-       Timeouts

-       Zombies

-       Globus-gma (>32k jobs) – workaround to clean up

 

MyProxy – single point of failure

 

GlExec/SCAS – need experience

 

CE:

-       Matching requirements to WN properties (blah integration with batch systems)

-       CREAM bugs to be fixed

BDII:

-       V5 more robust? Fix tools that cannot handle lists

APEL:

-       Tomcat problems

-       R-GMA problems (to replace with MSG)

Shared sw areas (came up many times)

-       Problems at NL-T1, FZK, CNAF + Tier 2s

-       ALICE use of BitTorrent? Is this a solution?

Monitoring:

-       ATLAS needs more monitoring for analysis jobs (re-use CMS dashboard work?)

5.5      Summary

The many issues under work are:

-       Data access strategies need to be understood. Investigations are under way (ATLAS, CMS, Massimo, Maarten, DM group).

-       Shared software areas.

-       dCache strategies and risks

-       Monitoring: More real time on MSS transfers

-       MSS metrics

-       Improve dashboards – ATLAS vs. CMS

 

I.Bird added that in the next few weeks the MB should send him input on topics and priorities.

 

Action:

The MB sends input on topics and priorities for the Technical Forum.

 

Action:

Sites will be asked to report on the items resulting from the STEP09 Workshop.

 

J.Bakken asked what the proposal for real-time metrics is.

I.Bird replied that CASTOR will send a proposal, and one will see if it can be adopted by others.

T.Cass clarified that the statistics are taken from the firewall on the nodes.

 

M.Kasemann added that the motivation was that one cannot see the overall transfer metrics.

 

T.Cass noted that the current rates in SLS could be instantaneous values rather than values collected over a few hours.

 

Action:

A.Aimar will contact T.Bell for MSS real-time metrics.

 

 

6.    AOB

 

 

M.Schulz announced that OSG seems pleased with the glExec wrapper scripts and will distribute them.

 

Meetings in August will be on the 4th and on the 18th.

 

 

 

7.    Summary of New Actions

 

 

Action:

The MB sends input on topics and priorities for the Technical Forum.

 

Action:

Sites will be asked to report on the items resulting from the STEP09 Workshop.

 

 

Action:

A.Aimar will contact T.Bell for MSS real-time metrics.