LCG Management Board

Date/Time:

Tuesday 26 June 2007 16:00-17:00 - Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=13801

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 28.6.2007)

Participants:

A.Aimar (notes), I.Bird, T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, S.Foffano, D.Foster, F.Hernandez, C.Grandi, M.Kasemann, M.Lamanna, H.Marten, G.Merino, P.McBride, Di Quing, L.Robertson (chair), Y.Schutz, J.Shiers, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 3 July 2007 16:00-18:00 - F2F Meeting at CERN (513-1-027)

1.      Minutes and Matters arising (Minutes) 

 

1.1         Minutes of Previous Meeting

Comment received from C.Grandi:

 

Just one clarification on the VOMS Generic Attributes: the API call that retrieves the
GA is different from the one that retrieves the FQAN. So the existing middleware is not
affected by the introduction of GAs. This ensures full backward compatibility.
How to exploit GAs in the middleware is another story. An example is in the work done
for gLite-Shibboleth interoperability. I agree that the MWSG could be a proper forum for
further discussions.

 

The text above will be added to the minutes of last week. Minutes approved.

 

L.Dell’Agnello reported that the sites are doing their best to gain experience with replication using Oracle Streams to maintain copies of the VOMS database outside CERN. During the 3D workshop some replication tests are being performed and the results will be reported in the framework of 3D. L.Robertson clarified that in the previous MB meeting he had simply mentioned that, although Oracle Streams could also be used, the requirement was for a read-only copy of the VOMS database and simpler solutions could also be adopted.

1.2         Points from the Previous Meeting

 

Job Priorities Working Group

From the minutes of the previous MB meeting:

S.Campana will be asked to lead the reformed Working Group. Plans will be rolled back
while waiting for a new proposal. The Operations Meeting will define the rollback
instructions to sites. To be confirmed at the next MB meeting (26 June).

 

L.Robertson reported that some discussion had taken place during the week. Exactly how the working group is reformed is a decision for EGEE, but WLCG is concerned in view of the previous problems and should be kept informed.

 

New Action:

3 July 2007 - I.Bird and C.Grandi will report on the structure and leadership of the reformed Job Priorities Working Group.

 

ALICE Resources Planned at GridKa

From the minutes of the previous MB meeting:

Following the meeting, in an email sent to the MB, H.Marten communicated an additional exception: “Out of the total Alice resource increments planned at GridKa for 2008, about 25% are planned to be provided in April and 75% in October.”
This should be discussed at the meeting on 26 June.

 

Y.Schutz stated that the estimation of requirements at GridKa had been done assuming that there would be an ion run at the end of 2008; the requirements for 2008 will be re-evaluated once the LHC schedule is defined. The decision to stage the availability of resources at FZK was a local decision. L.Robertson said that, in order to avoid uncertainties, in the revision of requirements due in July ALICE should define the capacity required for protons and the additional capacity required for ions, with the availability date for the latter if it is different from 1 April. While at present it looks unlikely that there will be an ion run in 2008, we have to assume an ion run every year until it is explicitly excluded.

 

H.Marten stressed that changes cannot be requested from GridKa for 2007 and 2008. Any change in requirements must first be discussed with the GridKa Technical Advisory Board. The current resources are defined and agreed for the next 2 years; any modification for 2008 is difficult or impossible because the tenders for 2008 have already been launched.

1.3         Sites Availability Reports for May 2007 (Site Reports; Slides) - A.Aimar

A.Aimar presented a short summary of the Sites Availability Reports for May 2007 (see the completed Site Reports).

The table below shows the reliability values since January 2007.

 

Reliability >= 88%   (>= Target)

Reliability >= 79%   (>= 90% of Target)

Reliability < 79%   (< 90% of Target)

Site          Jan 07   Feb 07   Mar 07   Apr 07   May 07
CERN            99       91       97       96       90
GridKa/FZK      85       90       75       79       79
IN2P3           96       74       58       95       94
INFN/CNAF       75       93       76       93       87
RAL             80       82       80       87       87
SARA-NIKHEF     93       83       47       92       99
TRIUMF          79       88       70       73       95
ASGC            96       97       95       92       98
FNAL            84       67       90       85       77
PIC             86       86       96       95       77
BNL             90       57*       6*      89       98
NDGF           n/a      n/a      n/a      n/a      n/a

* BNL: LCG/gLite CE probed by SAM but not installed with the SL4 upgrade

 

The target of 88% was reached by only 6 sites, and only 8 sites were within 90% of the target.

The table below (not discussed at the meeting) summarizes the issues and solutions sent by the sites.

 

SITE: Problem → Solution

SRM/MSS/DB

CERN, INFN: CASTOR overload or instabilities
INFN: SRM server timeouts
FZK: SRM instability → SRM restarted
PIC: Problems with the STK robot and with dCache GridFTP doors
BNL: dCache problems with load balancing and choice of pools → rebalanced the dCache cost model
BNL: HPSS core server crashed → restarted HPSS; GridFTP powered off → restarted GridFTP
RAL: CASTOR overloaded → dCache used as temporary SE until CASTOR could recover
TRIUMF: SRM blocked, port 8443 not listening → restarted SRM
SARA: Problems with the disks of the Oracle server

BDII

FZK: BDII timeouts with CERN and with DNS → fixed the DNS information
IN2P3: Local top-level BDII timeouts → re-indexed the database

CE

CERN: Problems with LCG-CE stability and scalability → added 8 new CEs
FZK: CE locked up for several days, probably overloaded → investigating
RAL: CE overloaded → no solution yet
TRIUMF: CE gatekeeper connection problems → gatekeeper restarted
INFN: CE problems with the LDAP service → not understood; LDAP restarted

Operational Issues

FNAL: Configuration errors in the IS → fixed the configuration
FZK: Power cut of the administrative rack; CE locked for several days while the site admins were absent
PIC: Wrong manual CE configuration → fixed the configuration
PIC: Wrong patch to the top-level BDII → introduced the correct indexes in the top-level BDII database
RAL: /temp full caused SAM to fail
SARA: Migration of DNS servers caused several reverse lookup failures
TRIUMF: Grid Canada CRL expired; all users and services with GC credentials were stopped as a result

SAM

Several SAM unavailabilities on 28-31/05; replica-management tests failing
ASGC: replica-management test failed because of wrong ACL permissions
IN2P3: Problems with SAM; scheduled downtime not taken into account
PIC: replica-management tests failed because of bad CloseSE configuration

 

The problems are very similar to those of the previous month, with no considerable improvements or solutions found.

 

  • Only 6 sites > 88% (the target)
  • Only 8 sites (6+2) > 79% (90% of the target)

 

The global averages are good because a few sites with 94, 95, 98, 98 and 99% reliability raise the overall average, compensating for several other sites with low values.

  • Average 8 best sites: 94%  
  • Average all sites: 89%
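The counts and averages quoted above can be cross-checked against the May 2007 column of the reliability table (NDGF excluded, since no data was published). A minimal sketch, using only values taken from the table:

```python
# May 2007 reliability values from the table above (NDGF excluded: no data).
may = {
    "CERN": 90, "GridKa/FZK": 79, "IN2P3": 94, "INFN/CNAF": 87,
    "RAL": 87, "SARA-NIKHEF": 99, "TRIUMF": 95, "ASGC": 98,
    "FNAL": 77, "PIC": 77, "BNL": 98,
}

TARGET = 88   # reliability target
NEAR = 79     # 90% of the target, rounded

# Sites meeting the target, and sites strictly above 79% (as in the minutes).
above_target = [s for s, r in may.items() if r >= TARGET]
above_near = [s for s, r in may.items() if r > NEAR]

# Average of the 8 best sites, and of all reporting sites.
best8 = sorted(may.values(), reverse=True)[:8]
avg_best8 = sum(best8) / len(best8)
avg_all = sum(may.values()) / len(may)

print(len(above_target))   # 6 sites meet the 88% target
print(len(above_near))     # 8 sites above 79%
print(round(avg_best8))    # 94% average for the 8 best sites
print(round(avg_all))      # 89% average over all reporting sites
```

The recomputed figures agree with the 6 and 6+2 site counts and the 94%/89% averages reported above.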

 

The main issues for May 2007 were:

-          CE overload at several sites, requiring increased resources for the CE

-          Occasional CASTOR overloads

-          dCache GridFTP problems

-          SAM instabilities, which may have affected the results at the end of May

 

We will continue to collect reliability data at the Operations meeting.

The MB members are asked to pass the message to the representatives at the Operations meeting:

The weekly reliability reports have to be as clear and complete as possible and always include:

-          Day:

-          Reason:

-          Severity:

-          Solution:

 

 

H.Marten asked what is actually being done to cure the problems (CE, SRM, etc.) that are the same as in the past few months, and how this information is made available to the developers.

 

I.Bird replied that the only solution is to submit GGUS tickets so that the problems are defined and developers and coordinators can intervene. For instance, the problems with the CE performance require some CE experts to login and see what is happening on the site (analysing log files, process tables, etc). He also noted that it was known that CE problems could arise from overloaded CEs, and that sites should ensure that there are sufficient CEs to handle the load.

 

J.Templon noted that some sites with similar job loads do not have problems; NIKHEF can handle high loads using a single well-configured CE.

 

M.Lamanna pointed out that the analysis of the job failures (for which tools are available) can highlight clear patterns of the failures and helps the system manager at the site to find out what is happening.

 

Suggestions for the Operations Workshop

-          H.Marten suggested that at the Operations Workshop at CHEP the sites should present their top 5 issues for each service.

-          I.Fisk also suggested a session for feedback from the sites to the developers, in order to improve the services and the software developed.

-          R.Tafirout suggested that sites should share solutions; sometimes the information is known by others but not easily accessible.

 

I.Fisk asked about the correlation between “availability as seen via SAM” and the “real job success rates of the Experiments (via Crab, Ganga, etc.)”.

1.4         Trends in efficiency for Tier1 sites (document) - M.Lamanna (for P.Saiz)

This is material in preparation for the referees' meeting (a new section in the monthly report) showing, for each LHC Experiment, the job efficiency at each site (nine out of twelve). Page 11 shows the summary table for all sites and all experiments.

 

J.Templon asked how the sites could investigate the reasons for the failures and find the causes of the problems.

M.Lamanna replied that he will send the link to the page where this information is available.

 

Update: Link received after the meeting: https://twiki.cern.ch/twiki/bin/view/ArdaGrid/GridReliability

 

 

2.      Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

No action due.

 

3.      WLCG Service Interventions (Targetted Interventions; WLCG Intervention Logs; Document) - J.Shiers

 

 

J.Shiers presented the status of the daily and weekly logs of the WLCG service interventions.

The Document attached provides a written summary of the service interventions at the Tier-0 and Tier-1 sites.

 

There is almost one unscheduled or improperly scheduled intervention per day. In addition, scheduled interventions should always aim to be transparent and should not impact the running services.

3.1         Critical Issues

The critical issues that currently have a major impact on the overall services are:

-          Infrastructure issues (power, cooling and network problems), responsible in the worst cases for complete site downtime of several hours or even days;

-          Problems with storage and related services, e.g. CASTOR or dCache (by definition);

-          Problems with back-end database services, or the interaction between the application layer and the database, affecting services such as LFC, FTS, VOMS, SAM, etc.

3.2         Proposed Actions

The actions proposed to the sites are:

-          Infrastructure: each site should investigate its infrastructure problems and explain the reasons and solutions.

-          Storage Services: ensure there is enough redundancy to guarantee the service in case of interventions (SRM, FTS).

-          Database Interaction: there are well-known techniques for database load-balancing and fail-over; they need to be implemented at the sites.
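As an illustration of the database-interaction point, client-side load-balancing and fail-over for an Oracle back-end can be expressed in the client's tnsnames.ora. This is only a sketch: the alias, hostnames and service name below are hypothetical, not taken from any site's actual configuration.

```
# Hypothetical tnsnames.ora entry: connections are balanced across two
# database nodes, and in-flight SELECTs fail over if one node goes down.
LFC_DB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = db-node1.example.org)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = db-node2.example.org)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = lfc)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC))
    )
  )
```

Techniques of this kind keep services such as LFC or FTS reachable while one database node is taken down for an intervention, which is the transparency goal stated above.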

 

The goal is that interventions should more often become transparent to the users, and a periodic analysis of the interventions should show improvements over the next months.

 

J.Templon stated that the unreliability problem is that the middleware or the MSS stop working; those problems are difficult for the site to solve.

J.Shiers replied that every day, even when there is no middleware or MSS problem, there are scheduled or unscheduled interruptions that should become transparent, keeping the services fully running for the VOs and for the SAM tests.

 

J.Templon commented that receiving external input on what the problems are at each site would help the sites.

J.Shiers replied that sites should know the causes of the interruptions, but if needed a more detailed analysis will be prepared, with the frequency of problems at each site. For instance, power and cooling is a common issue that sites could already investigate.

 

H.Marten asked whether there is a way to freeze the middleware installation for the next year(s), except for occasional bug fixes.

I.Bird replied that it is not realistically going to happen. Upgrades of the middleware are going to be needed both for bug fixes and to add features that the experiments or the services still need (e.g. more scalable solutions, etc).

 

New Action:

31 July 2007 – J.Shiers – Analysis of site scheduled and unscheduled interruptions, showing relative impact at sites, frequency and duration of different causes, etc.

 

 

 

4.      High Level Milestones Update (Feedback Received; Milestones) - A.Aimar

 

 

A.Aimar summarized the feedback received on the High Level Milestones and asked the opinion of the MB about the changes proposed.

 

See the Feedback Received for details:

-          O.Smirnova:
Proposed that the middleware milestones be marked “n/a” for NDGF. Accepted.

-          H.Marten:
Asked that the BDII milestone (WLCG-07-22), about installing a local BDII, be moved to later than June 2007.
As the installation of a local top-level BDII has been discussed for months in the GDB and MB, milestone WLCG-07-22 stays unchanged.

WLCG-07-31 on SL4 WN installation refers to the installation of SL4 WN native binaries.

WLCG-07-34, on the installation of the UI at each site, was removed because not all sites have a central interactive facility for the grid users (e.g. the central “lxplus” cluster at CERN).

WLCG-07-41 (xrootd Interfaces Tested and Accepted by ALICE) is there to make sure that the xrootd interfaces suit ALICE’s requirements and that they have been tested. Support and installation will be dealt with separately and may vary depending on the SRM implementation (CASTOR, dCache or DPM) and on the site.

-          G.Merino:
Asked whether WLCG-07-20 (FTS 2.0 deployed in production at the sites) is suitable and not too late for CSA07. The reason for setting September as the deadline is that, after the release in July, August was considered too short for installing FTS 2.0 at all sites. But if sites can install it in August, so much the better.
P.McBride replied that CMS will use SRM 1.1 independently of whether it is provided via FTS 1 or 2.
I.Fisk added that CMS was expecting FTS installed before CSA07, but will be using the SRM 1.1 endpoints in any case.

-          J.Templon:
Asked that WLCG-07-23 (Run the CE Info Provider on the Site-Level BDII) not be mandatory because it is not needed by all sites.
The milestone was removed as a high-level milestone.

 

The High Level Milestones were updated after the discussion.

 

In RED are the new milestones due by June, which will be checked next week at the F2F meeting.

Updated WLCG High Level Milestones - 28.06.2007 (PDF, XLS)

 

 

5.      AOB 

 

 

Talk on VO box SLAs (Slides) by H.Marten postponed to next week.

5.1         April and May Accounting Report (Paper) - J.Gordon

APEL accounting data and the WLCG Report for the MB are compared for April and May.

 

There is agreement for 8 sites (i.e. the data in APEL is correct), while for the other sites the data either do not match or are not published in the APEL Repository.

 

The MB agreed that the report should be presented by J.Gordon and discussed at the next F2F MB meeting.

 

J.Templon was surprised that the SARA data is considered correct, because they had to manually correct the data extracted from APEL and received from F.Baud-Lavigne.

5.2         gLite Proxy Renewal

Y.Schutz raised the issue that ALICE is currently blocked in its use of the gLite WMS; the discussion with the security group has already been going on for 3 weeks.

ALICE has asked for a change to the lifetime of the VOMS proxy (currently 24h), requesting at least 72h for next year.

 

Ph.Charpentier supported this request. For jobs in the queue, when not using the gLite WMS, the proxy token is not renewed, and this is an issue for the Experiments.

 

I.Bird informed the MB that the ALICE request has just been accepted. It should, however, be considered a temporary solution; the experiment software should instead properly renew the VOMS proxy.

 

C.Grandi asked whether ALICE and LHCb will use the new gLite WMS when it becomes available.

Both experiments replied positively.

5.3         MB Alternate Representatives

J.Templon asked how one member can specify alternate contacts for the MB.

L.Robertson replied that one should contact F.Baud-Lavigne in order to modify the “alternate” mailing list and membership table accordingly.

5.4         Management of Stalled Jobs

T.Doyle will distribute a proposal on the “Policy on UK Sites Stopping Stalled Jobs”. It will be discussed at the F2F Meeting next week.

 

6.      Summary of New Actions

 

 

3 July 2007 - I.Bird and C.Grandi will report on the structure and leadership of the reformed Job Priorities Working Group.

 

31 July 2007 – J.Shiers – Analysis of site scheduled and unscheduled interruptions, showing relative impact at sites, frequency and duration of different causes, etc.

 

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.