LCG Management Board

Date/Time

Tuesday 18 November 2008 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=39180

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 22.11.2008)

Participants

A.Aimar (notes), D.Barberis, L.Betev, I.Bird(chair), T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, X.Espinal, I.Fisk, F.Giacomini, J.Gordon, M.Kasemann, M.Lamanna, U.Marconi, H.Marten, P.Mendez Lorenzo, P.Mato, A.Pace, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 25 November 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments on the minutes. The minutes of the previous MB meeting were then approved.

1.2      Scope of the User Analysis Working Group

M.Kasemann expressed, via email before the meeting, concern about the scope of the User Analysis working group. I.Bird read the email at the meeting.

 

[...]

I am concerned this is a WLCG effort to rework the experiments' analysis models and enforce some lowest-common-denominator services. As I understood, it was originally proposed as a documentation and requirements-gathering exercise,

[...]

 

 

I.Bird commented that the goal of the working group is not to interfere with the Experiments’ analysis models but to collect enough information for the Sites to be prepared for the Experiments’ expectations and the type of usage they will make.

M.Kasemann explained that his worry was motivated by the request of adding new members from the Experiments to the working group.

 

M.Schulz replied that the Experiments had asked to have more representatives from the Tier-2 Sites in order to understand the impact on Tier-1 and Tier-2 Sites.

I.Bird suggested that initially the information should be discussed among the current representatives and then proposed to a larger audience.

M.Kasemann proposed a two-step solution: (1) gather and document the requirements of the Experiments and then (2) see the impact on the Sites.

 

J.Templon asked what the impact will be if the models do not match what the Sites can actually provide.

 

I.Fisk replied that the Tier-2 Sites mostly support only one VO and could discuss directly with that Experiment; only four of the CMS Tier-2 Sites are shared with other VOs. Not all sites are impacted in the same way; it depends on which VOs they support. The requirements should be collected, but the impact on the Sites and where the effort is spent should also be discussed.

I.Bird replied that this statement is true for CMS, but the three other Experiments share many sites. The goal of the working group is not to generate any development but to agree on which of the current services should be installed and how they should be configured, taking into account the Experiments’ needs.

 

The MB agreed on this approach and will discuss the situation again once the Experiments’ requirements and models have been clarified by the UAWG.

 

2.   Action List Review (List of actions) 
 

  • F.Donno will distribute a document describing how the installed accounting information is collected; it should describe in detail the proposed mechanism for sites to publish their inhomogeneous clusters using the current Glue 1.3 schema Cluster/SubCluster structures.

Not done yet. Scheduled for next week.

  • Proposals for metrics are also needed from the Tier-1 Sites.

  • VOBoxes SLAs:
    • Experiments should answer to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).
    • NL-T1 and NDGF should complete their VOBoxes SLAs and send it to the Experiments for approval.

  • The DM and dCache teams should report about the client tools: they should present estimated timelines and issues in porting to gcc 4.3.

 

3.   LCG Operations Weekly Report (Slides) – J.Shiers

Summary of status and progress of the LCG Operations. It covers the last two weeks.

The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Overview

The agreed overall goal, from last week, is to move rapidly to a situation where weekly reporting is largely automatic.

 

This was a relatively smooth week – perhaps because many people were busy with numerous overlapping workshops and other events. Two problems were mentioned at the meeting:

-       ASGC – After a few weeks of malfunctioning, ASGC was back for a few days, but more problems started on Friday and by Monday ATLAS saw only 70% efficiency (CMS also sees a degradation); the expectation is that this will degrade further. We are in contact with ASGC – a con-call is possible tomorrow – but it is very difficult to obtain clear information.

-       FZK – Some confusion regarding the LFC r/o replica for LHCb. A simple test shows that entries do not appear in the FZK replica whereas they do in others (e.g. CNAF). This should be relatively low impact but needs to be followed up and resolved. It seems this problem has been present at FZK since May but was not reported.

-       It is not entirely clear how LHCb distinguishes between an LFC with stale data that is otherwise functioning normally and an LFC with current data.

 

In both cases there is no adequate communication from the Sites about the issues.

-       ASGC is again communicating very little at the Operations Meetings  

-       FZK should have reported this issue months ago if it has really been present since May. There are daily Operations meetings and there have been 3 workshops since May.

 

Important problems must be reported to the Daily Operations meetings, and the solutions described in detail. The daily meeting now has a new format and always checks:

1.    New/open GGUS tickets

2.    GOCDB intervention summary

3.    Baseline versions to be installed

 

Decision:

The MB agreed to ask for a report on the situation at ASGC.

 

4.   WLCG 2009 Data-Taking Readiness Planning Workshop (Slides, from slide 4 onwards) – J.Shiers

 

Over 90 people registered and at (many) times it was “standing room only” in the IT Amphitheatre (100 seats).

4.1      Few Comments

Attendance was pretty even throughout the entire two days – even early morning – with a slight dip after lunch.

Not really an event where major decisions were – or even could be – made; more an on-going operations event.

 

Probably the main point was about stable production services: On-going, essentially non-stop production usage from the experiments means on-going production service.

 

Overlapping of specific activities between (and within) VOs should be scheduled where possible.

 

J.Shiers expressed his personal concern that there are still too frequent and too long service/site degradations. Maybe this can be “tolerated” for the main production activities – but what will happen when the full analysis workload is added?

 

Compared to this time last year – when we were still “arguing” about the SRM v2.2 service roll-out – huge advances have been made and significant service improvement can be expected in the coming months/year. But remember: late testing (so far) has meant late surprises.

4.2      Actions

Some actions and follow-up were agreed at the workshop:

-       A dCache workshop – most likely hosted by FZK – is being organized for January 2009

 

-       Discussions on similar workshops for other main storage implementations with overlapping agendas:

1.    Summary of Experiment problems with the storage

          what does not work that should work because it is needed

          what does not work but you just have to deal with it

2.    Instabilities of the setup

          timeouts, single points of failure, VO reporting, site monitoring, known (site-specific) bottlenecks

3.    Instabilities inherent with the use of the storage MW at hand

          bugs/problems with SRM, FTS, gridFTP, used clients

 

The rest of the slides contain a summary of the workshop (with J.Shiers’ notes in red) but were not discussed at the MB.

 

J.Gordon added that at the workshop the Sites asked, on many occasions, for clarifications about the job rates and the type of data flows and rates expected by the Experiments.

 

The action about the Experiments describing their data flows at the Sites should be tried again; there is already an action on this from last week.

 

5.   SAM VO-specific test results (ALICE; ATLAS; LHCb; SAM_VO_Results, VO-Tests-Doc) -

 

The decision is to start publishing the VO SAM results for monthly reliability and availability, as has been done for the OPS tests since the beginning of 2007.

But before using these data for reporting they should be analysed in more detail, in order to verify that the tests adequately represent the status of the Site and that the results collected are a meaningful summary of the availability to each VO.

5.1      ALICE – P.Mendez Lorenzo

There have been several control improvements in the months since the last report. There are now redundant monitoring nodes – both monb002 and monb003 are in production. Their status is now monitored via Lemon sensors, avoiding any overload of those systems.

 

Concerning the results for October 2008: the user proxy issue at monb002 and monb003 prevented the launching of the SAM tests for several days:

-       The issue is independent of the sites and also of the Experiment

-       The issue was identified through Nagios and fixed

-       It coincided with an interruption in the ALICE production, and the SAM test results were not followed in detail at that time.

 

There are no plans to change the tests associated with the CE sensor. The tests cover job submission and the status of the software area.

 

Instead, ALICE is planning to implement a new test for the WMS sensor before the end of 2008. This is a pure ALICE test and is not exploitable by other VOs.

 

New WMS Tests

ALICE has fully migrated to the WMS submission mode. The «keep jobs» feature of the WMS is undesirable:

-       CASE 1 (CE down): the jobs are kept for further resubmissions

-       CASE 2 (WMS overloaded or in draining mode): same as above

These «pending» jobs are ghosts from the ALICE point of view:

-       They are counted neither by the LB nor by the IS

-       Once the CE or the WMS comes back into production, the bunch of pending jobs is submitted

The feature can be disabled in the ALICE-dedicated WMS but not in those WMS shared with other VOs.

 

ALICE has already implemented in its own submission system a WMS selection procedure similar to that proposed at CERN (with the AFS UI):

-       Use RB1 OR RB2 OR RB3. If all these WMS fail…

-       Use RB4 OR RB5 OR RB6

 

Once the «drain flag» is implemented, this procedure will prevent CASE 2 from happening.

However, until then, and also to avoid CASE 1, a WMS test is necessary to avoid an uncontrolled number of jobs being submitted to the sites and a potential overload of the site CEs.
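
As an illustration only, here is a minimal sketch (hypothetical names, not the actual ALICE code) of the tiered selection described above: every endpoint in the first group is tried, and the second group is used only when all endpoints of the first group fail.

    # Hypothetical sketch of the tiered WMS selection described above.
    PRIMARY = ["RB1", "RB2", "RB3"]       # first-choice group
    SECONDARY = ["RB4", "RB5", "RB6"]     # used only if the whole first group fails

    def submit_with_failover(job, submit):
        # 'submit' stands in for the real submission call; it returns True when
        # the given WMS accepts the job.
        for group in (PRIMARY, SECONDARY):
            for wms in group:
                try:
                    if submit(wms, job):
                        return wms        # job accepted by this endpoint
                except Exception:
                    pass                  # endpoint down or draining: try the next one
        raise RuntimeError("all configured WMS endpoints failed")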

 

Nagios should send an email in case of warnings.

 

I.Bird asked whether the list of tests is documented and whether the reported reliability corresponds to the effective one observed by ALICE.

P.Mendez replied that the CE tests are present but the VOBoxes tests are still missing. Only the CE tests represent the reliability of the Site for now. The WMS tests are not local to a Site and cannot be used for Site reliability. For the VOBoxes tests she will check with the Experiment. There are no ALICE tests associated with SRM, FTS or the SE.

5.2      ATLAS – D.Barberis

There is a text summarizing the situation (from an email of A.Di Girolamo).

 

ATLAS Critical Tests now

            SE (put, get, del): n.b. tests for the "old" srmv1 endpoints

            CE (JobSubmission, VOswdir, VO atlas lcgtag list)

            FTS (check channel list)

            LFC (lfc-ls, add an entry)

 

Critical Tests: to be updated with SRMv2 as soon as we have the green light from GridView (which currently cannot include Critical Tests for the SRMv2 services).

 

SRMv2 (put, get, del for each spacetoken):

These tests have already been running on the sites for more than two months, testing all the ATLAS spacetokens. The site result is the logical AND of all the results on the site’s spacetokens.
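
As an illustration, a minimal sketch (with made-up results, not the actual SAM code) of the aggregation just described: the site result is the logical AND over all spacetokens tested at the site and over the put/get/del tests on each of them.

    # Made-up results for one site; spacetoken and test outcomes are illustrative only.
    spacetoken_results = {
        "ATLASDATADISK": {"put": True,  "get": True, "del": True},
        "ATLASDATATAPE": {"put": False, "get": True, "del": True},
        "ATLASMCTAPE":   {"put": True,  "get": True, "del": True},
    }

    site_ok = all(all(tests.values()) for tests in spacetoken_results.values())
    # site_ok is False: a single failing test on a single spacetoken marks the whole
    # site as failed, which is exactly the granularity concern raised below.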

 

To Be Discussed:

Is it really impossible to improve the granularity of the SAM DB? The "nodename" may not be enough for the experiment. E.g. the ATLASDATATAPE spacetoken not working is highly critical (RAW data not saved on tape), while e.g. ATLASMCTAPE could be less critical, since production data should also be found on the Tier-2 that produced them. An algorithm mixing the results could be unreliable.

 

Other work in progress:

It was agreed with the SAM team (very good collaboration with them) to have a new SAM service (called "Analysis"). In this service it will be possible to include tests like gangarobot_lcg and gangarobot_panda, which currently cannot be included since the queues that they use are not in the BDII. The lists of "nodes" (queues) that should be tested will be provided by ATLAS (as ALICE is doing for the VOBoxes).

 

October Reliability results:

overall good results (~ 90%) link

 

Site details:

ASGC: storage problems (since ~the 24th of Oct, Castor problems, now solved)

FZK: storage problems (since 28th, dCache overloaded, solved with the new upgrade)

INFN: computing element problems (17-21 Oct)

RAL: storage problems (castor problems, 14-15, 18-20 Oct)

NDGF: storage problems (19-22 Oct)

 

 

The Critical Tests are those used in the GridView calculations; the reliability is computed on the Critical Tests only.

For the moment ATLAS needs the SRM V2 endpoints to be tested, not the V1 endpoints as is done now.

The results that show problems at some sites are correct; the failures reflected real issues.

 

J.Gordon asked why BNL is not reporting any SAM results in the ATLAS tests.

M.Ernst replied that he will investigate the issue but it is likely due to the fact that an LFC service is needed and BNL will migrate to the LFC in the near future.

 

D.Barberis confirmed that the total result, once the SRM V2 tests are included, will be adequate to represent the status of the Sites for ATLAS.

5.3      LHCb – Ph.Charpentier

Email distributed before the meeting by R.Santinelli and Ph.Charpentier.

 

 

October 2008 – Storage service analysis:

 

The SRM tests had been running since the 9th of October but were not publishing results, because no service was any longer advertised as SRMv1. Following the (internal) elog entry http://lblogbook.cern.ch/Operations/768 (for the contingent reasons explained there) we stopped running the SRM sensor suite altogether.

 

The SE sensor, on the other hand, was *not running* any test (no critical tests were defined for the SE sensor).

After the 9th of October, however, no critical tests were defined for the SRM either (the other sensor used by GridView for the site availability and reliability computation). We resolved this inconsistency (which resulted in a lack of test results for the crucial storage sensors) by converting all SRM tests into SE sensor tests and restarting them, *but* publishing them as SE test results. This happened on the 20th of October, as recorded in the elog http://lblogbook.cern.ch/Operations/831.

 

Please note that this was just a trick that re-enabled the publication of some test results and the evaluation of the storage service, since SRMv2 is still not in the list of crucial sensors for GridView.

This explains why from the 10th to the 20th of October there is a grey zone for both the SE and the SRM results.

Since then, for the storage service we have been smoothly running some LHCb-specific tests, documented in the usual TWiki. A fully exhaustive unit test from DIRAC is also now under test. Following our internal policy of being 100% sure before setting it as critical, we are monitoring these new sensors for a week.

 

October 2008 – Computing element analysis

-----------------------------------------

The SAM suite for the CE has been upgraded in two steps: the first step ensured that only very basic tests were critical and guaranteed to run at all CEs (this because of several subsequent bugs found in the LHCb application).

The full original suite was indeed too demanding, and in general it was not clear whether the problems and failures were due to the infrastructure or rather to the LHCb application.

 

However, for CERN the software installation test could not work either, because of a long-standing issue with the gsiklog utility and DIRAC proxies; we finally set as critical just a couple of tests of the original rich suite run via DIRAC: lhcb-os and LHCb-queue.

These same critical tests, however, did not always run everywhere (depending exclusively on internal DIRAC operations). At CERN they were simply not running at all (this explains the lack of results for the CERN CE during the last month and the first half of November).

Then (but only on the 12th of November) we decided to abruptly introduce new "old-fashioned" critical tests, for the reasons explained in the internal elog: http://lblogbook.cern.ch/Operations/911.

 

As also reported at the weekly OSG-EGEE meeting on Monday (https://cic.gridops.org/index.php?section=vo&page=weeklyreport&view_report=2140&view_week=2008-47&view_vo=3#rapport), we have now introduced (in parallel with the DIRAC-based suite for the CE) very basic infrastructure tests. These are disentangled from DIRAC and are guaranteed to run always and everywhere.

 

These are very basic tests inherited from OPS (shared-area test, CSH and JS) but running as LHCb production; for the Tier-1s we also have the ConditionDB access tests, which give a fair insight into the health status of a Tier-1 from the LHCb perspective.

So for the coming months LHCb will always provide test results for both CE and SE, while the SRM will have all critical tests unset (and will therefore be irrelevant for the site availability computation).

 

These tests are targeted more at the infrastructure than at the specific application, and having them fail seriously implies some critical problem with the site.

 

Final remark: the SAM portal reports green status for all Tier-1s (not failing these new critical tests plus the old basic tests from DIRAC).

 

LHCb also needs the SRM V2 tests to be introduced as critical. The tests on the V1 endpoints are no longer useful.

LHCb would like to have the possibility to test each spacetoken separately. The site result should not simply be an OR of the services, but should be defined by the VO as appropriate.

 

LHCb is still lacking tests on the CE for the moment. During the migration to DIRAC3 the tests were not all updated, and they were run by hand by someone who has since left. For the past week LHCb has been resurrecting some CE tests that are now going to be set as critical. In addition, SAM and GridView do not always give the same result when combining tests that are not all reporting.

 

The LHCb VOBoxes are under the responsibility of the Experiment, therefore LHCb would not want to include them in the critical calculations. They are used by the Experiment as sensors. Having sensors on the streaming would be useful for the Experiment.

In addition, where should the LFC tests belong? SAM tests can currently only be grouped by CE, SE and SRM.

 

M.Lamanna added that the LFC is one aspect of a more complicated issue concerning tests that do not fit the present schema. Tests for analysis are attached to the CE tests, but this is an arbitrary choice; instead the grouping should be expanded (or made user-definable).

 

Ph.Charpentier added that NL-T1 is seen as two sites (SARA and NIKHEF), and so part of the tests fail on each of the two sites.

J.Templon agreed with the statement, but the two sites have different domains and DNS, and fusing them is not a feasible operation. The SAM team had agreed to somehow support this kind of situation.

 

J.Gordon noted that there was also an APEL test that should be critical.

 

I.Bird agreed with the changes, but at some point the situation must become stable, otherwise different metrics are measured every month.

5.4      CMS – M.Kasemann

For CMS there is no change in the tests, and the existing ones represent well the availability of the Sites to CMS. CMS also requested that the SRM V2 tests become critical. The document describes the situation correctly.

 

 

6.   Follow-up on WN software on SL5 with gcc 4.3 (Slides) – M.Schulz

 

The Applications Area decided on 30th Oct that their new platform would be SL5/gcc4.3.2/Python2.5.

See  http://lcgapp.cern.ch/project/mgmt/AFMinutes20081030.html

 

Consequently, compatible client libraries must be made available, but gLite must continue to support the standard SL5/gcc4.1/Python2.4 configuration officially supported by EGEE, because EGEE must support the standard compiler of an OS (i.e. gcc 4.1 for SL5).

 

In addition, most services will stay on gcc 4.1; only the WN software needs to move urgently to gcc 4.3.

Support for gcc 3.4 (the SL4 reference compiler) needs to be clarified, as ALICE will run in compatibility mode on SL5 (with their gcc 3.4 binaries).

6.1      GLite-AA

Here are the packages supplied to the AA (gLite-AA):

– DPM-client

– DPM-interfaces

– lcg-dm-common

– LFC-client

– LFC-interfaces

– GFAL-client

– lcg_util

– CGSI_gSOAP_2.6

– lcg-infosites

– vdt_globus_essentials

– vdt_globus_sdk

– gLite-security-voms-api-c

– gLite-security-voms-clients

– gLite-security-voms-api-cpp

– myproxy

 

Note the absence of dCache (dcap). Some of these components (e.g. voms-clients) are invoked as executables, not linked against.

 

P.Mato added that dCache and Castor are not provided with the middleware; they are picked up directly from the projects.

6.2      Python 2.5

The following will be provided:

-       GFAL-client-py25

-       lcg_util-py25

-       (LFC-interfaces-py25)

-       A few more RPMs, still to be named.

Akos has informally provided this to ATLAS for evaluation:

http://egee-jra1-data.web.cern.ch/egee-jra1-data/python25-preview/

6.3      Gcc 4.3

With the exception of voms-api-cpp, everything on the list is C (or python).

-       The C ABI has not changed between 4.1 and 4.3 and there is no performance benefit from a rebuild

-       SA3 will supply the gcc 4.1 binaries initially. This avoids a VDT rebuild, which involves external partners.

-       What about dependencies?

 

It needs to be established whether anyone is actually linking against voms-api-cpp or only against the C API.

 

voms-api-cpp would require a rebuild. It also needs to be checked:

-       whether the VOMS runtime (e.g. voms-proxy-info) can still run properly with the gcc 4.3.2 tool chain

-       whether there are any other issues with C++ executables run directly in a gcc 4.3.2 environment

 

SA3 will provide ‘evaluation’ binaries to ensure everything works, without waiting for ETICS developments.

These will then be merged into the standard release once ETICS support has been introduced: it is understood how this should be done, but the work has not yet been triggered because the ETICS team has a few other things on their plate.

 

Action:

Experiments should verify whether the gcc 4.1 binaries work and report on issues and problems. P.Mato will report in a couple of weeks.

 

7.   Use of SRM 2 Tests in SAM reports (Mail J.Shade)

 

Email from J.Shade.

 

The SRMv2 tests were presented by Konstantin to the MB on Tuesday 9 September 2008.

 

(See the minutes at https://twiki.cern.ch/twiki/pub/LCG/MbMeetingsMinutes/LCG_Management_Board_2008_09_09.htm).

Test results are visible in both SAM and GridView.

 

In the meantime, COD alarms have been turned on, and both CMS and ATLAS have asked to have the tests included in the availability calculations.

 

To compare availability figures with and without SRMv2, the following page can be used (displays previous day's figures):

 

http://dashb-ops-sam.cern.ch/dashboard/request.py/metricvalues

 

The MB needs to give approval to have the SRMv2 tests added to the availability calculations:

        - either by simply adding another AND to the big four (CE, SE, SRM, sBDII) – see the sketch after this list

        - or by declaring that both SE and SRMv1 are deprecated and no longer supported

        - or by keeping the generic class SE (Storage Element – not the Classic SE that it currently stands for) and putting the SRMv2 tests under that; this might be easier from an SLA/MoU wording point of view
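
As an illustration of the first option, a minimal sketch (with made-up statuses) of the site availability computed with SRMv2 simply added as one more AND term:

    # Per-service statuses, each derived from that service's critical tests (made-up values).
    service_status = {
        "CE": True,
        "SE": True,
        "SRM": True,
        "sBDII": True,
        "SRMv2": False,       # the newly added term
    }

    site_available = all(service_status.values())
    # site_available is False: with SRMv2 in the AND, a failing SRMv2 service now
    # makes the site unavailable, exactly as a failing CE or SE already does.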

 

In terms of deciding which tests should remain critical (at the moment, they all are), the following dependencies might help:

 

Dependency tree for the tests in SRMv2 sensor:

 

         1:get-SURLs
          ├── 2:ls-dir
          └── 3:put
               ├── 4:ls
               ├── 5:gt
               ├── 6:get
               └── 7:del

 

ls-dir can be thought of as a ping operation. In rare cases an SRM is still usable even if that particular test fails, but the test provides useful debugging information. The SAM team recommends that all 7 tests remain critical.
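
A minimal sketch (one plausible reading of the tree above, not the actual SAM scheduler) of how the dependencies can be expressed: a test is only meaningful once the test it depends on has passed, which is why a get-SURLs failure leaves the remaining tests without results rather than producing six more failures.

    # Hypothetical encoding of the dependency tree above: child test -> parent test.
    DEPENDS_ON = {
        "ls-dir": "get-SURLs",
        "put":    "get-SURLs",
        "ls":     "put",
        "gt":     "put",
        "get":    "put",
        "del":    "put",
    }

    def can_run(test, results):
        # A test can run if it has no parent or its parent has already passed.
        parent = DEPENDS_ON.get(test)
        return parent is None or results.get(parent) is True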

 

 

 

M.Schulz noted that the results are almost identical, except for Sites that incorrectly claim to support SRM V2 endpoints but do not.

 

Ph.Charpentier asked why the reliability is often identical to the availability.

D.Barberis explained that this is because the example is calculated over one single day only.

A.Aimar will check the values for availability and reliability.
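
For reference, the monthly figures are, roughly, defined as follows (scheduled downtime is removed from the reliability denominator); over a short period with no scheduled downtime the two figures therefore coincide:

    Availability = T_up / T_total
    Reliability  = T_up / (T_total - T_scheduled_down)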

 

J.Gordon commented that the descriptions of the tests do not seem complete. Two Experiments seem to point to invalid information.

 

Action:

Experiments should update the information describing their SAM tests, at this wiki page: https://twiki.cern.ch/twiki/bin/view/LCG/SAMVOSpecificTests

 

Decision:

The MB agreed to move to the SRM V2 tests from the month of December.

The Experiments will add their VO tests for specific spacetokens.

 

8.   AOB

 

 

No AOB.

 

9.    Summary of New Actions

 

 

Action:

Experiments should verify whether the gcc 4.1 binaries work and report on issues and problems. P.Mato will report in a couple of weeks.

 

Action:

The Sites are asked to confirm that the information at http://dashb-ops-sam.cern.ch/dashboard/request.py/metricvalues is correct.

 

Action:

Experiments should update the information describing their SAM tests, at this wiki page: https://twiki.cern.ch/twiki/bin/view/LCG/SAMVOSpecificTests