LCG Management Board

Date/Time

Tuesday 05 August 2008, 16:00-17:00 - Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=33706

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 3 - 15.8.2008)

Participants

A.Aimar (notes), O.Barring, I.Bird(chair), D.Britton, Ph.Charpentier, L.Dell’Agnello, A.Di Girolamo, D.Duellmann, M.Ernst, I. Fisk, S.Foffano, J.Gordon, M.Kasemann, M.Lamanna, E.Laure, H.Marten, P.Mendez, G.Merino, Di Qing, H.Renshall, R.Santinelli, O.Smirnova, R.Tafirout

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 19 August 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

1.1      Minutes of Previous Meeting 

The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions)

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

Ongoing. It is installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should announce it.

The only information still missing is the CMS list of 4 user DNs that can post alarms to the sites’ email addresses.

Update 11.8.2008: A.Aimar asked M.Kasemann, P.McBride and D.Bonacorsi.

  • 19 Aug 2008 - New service related milestones should be introduced for VOMS and GridView.

To be discussed at the MB.

  • M.Schulz should present an updated list of SAM tests for instance testing SRM2 and not SRM1.
  • J.Shiers will ask SAM to review the MoU requirements and define SAM tests to verify them.

These actions above should be discussed with M.Schulz and J.Shiers.

  • For the ATLAS Job Priorities deployment the following actions should be performed:

-       A document describing the shares wanted by ATLAS

-       Selected sites should deploy it and someone should follow it up.

-       Someone from the Operations team must be nominated to follow these deployments end-to-end.

 

M.Lamanna reported that for the moment there is no progress.

 

J.Gordon noted that ATLAS had decided on a set of Italian Tier-2 sites.

 

I.Bird asked whether the documentation on the middleware is ready.
O.Keeble replied that the VOs should provide a written description of what they need, and the middleware team will document how to implement such requirements.

 

New Action:

19 Aug 2008 - Feedback from ATLAS on their Job Priorities installations and description of shares required at each Site.

  • 1 July 2008 - Experiments to provide their 2013 resource requirements by 01/07/08 to the LCG Office.

Done.

  • 8 August - WLCG Grid Deployment: all software required at Tier-1 and Tier-2 sites should be described in a wiki page.

Done, after the meeting. https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

 

J.Gordon asked that clear change control be adopted for the versions of the software to be deployed. Every change should be discussed beforehand at the GDB and, if needed, at the MB.

O.Keeble agreed that a clear procedure should be defined and followed.

 

J.Gordon noted that, for instance, moving to a new version should only be driven by clearly stated needs, not by the mere availability of a new version.

O.Keeble reminded the MB that some changes could cause conflicts between VOs: a new version could include improvements for one VO and cause problems for others.

 

3.   LCG Services Weekly Report (Slides) - H.Renshall

H.Renshall presented a summary of status and progress of the LCG Services. This report covers the last two weeks. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      GGUS Alarm Tickets Introduced

At the beginning of July the GGUS operator alarm ticket template was deployed, allowing 3-4 users per experiment, identified by their grid certificates, to submit GGUS tickets directly to Tier-1 operations via a local mechanism. The end points are given here: https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage

 

The mechanism was tested over the last few weeks and by 25 July only FNAL and CERN were still failing.

This was reported at CERN to the official GGUS contact list, grid-cern-prod-admins@cern.ch, with no reaction. In fact the source email address at GGUS simply had to be added as an owner of the CERN Simba lists <experiment>-operator-alarm.

 

A remaining issue is the site dependence of what may be reported in these tickets. The SA1 USAG group proposal states: Grid Partners, especially VOs, require a direct way to report urgent problems, which must be solved within hours, to the service experts/site responsible.

 

CERN, for example, restricts this to Data Management services (Castor, FTS, SRM, and LFC). Should there be a common expectation or simply a per-site definition (to be added into the Twiki)? It will be addressed at the GDB meeting.

 

Ph.Charpentier asked why CERN is restricting these GGUS alarms to the DM services only.

H.Renshall replied that all other services are already monitored and dealt with automatically, but it is felt that the DM alarms need to be dealt with manually. Ordinary tickets and the operator phone are always available for all issues.

3.2      Site Reports

CERN:

On Friday 24 July a replacement of NAS disk switches in front of the infrastructure Oracle databases stopped Twiki editing and indexing from 13.50 to 15.05. Both ATLAS and LHCb have since emphasised the importance of the Twiki service to them.

 

A post-mortem of last week’s failure of the ATLAS offline DB Streams replication to the Tier-1s, between Saturday 26 July and Wednesday (4 days), has been prepared (see https://twiki.cern.ch/twiki/bin/view/PSSGroup/StreamsPostMortem). The issue was caused by a gap in the archive log sequence propagated to the downstream capture database. Improvements in the monitoring will be needed to spot similar problems (already assigned to development).

 

There was a problem submitting jobs to the CERN T0-export service (FTS-T0-EXPORT) on 28 July between 18.06 and 20.15 CEST and the following day, 29 July, between 10.49 and 14.42 CEST. During these periods all FTS job submission attempts failed with an Oracle error. The root cause was an unexpected behaviour of Oracle Data Pump (export/import). The problem has been fixed and is documented in https://twiki.cern.ch/twiki/bin/view/FIOgroup/FtsPostMortemJul29

On 1 August a fibre-to-router converter at P5 failed at 17.42, cutting off the CMS pit. It was fixed by the network piquet at 19.27.

 

BNL:

On 30 July there was an unexpected change in the published BNL site name, for which a legacy value had previously been used. The change was made in OSG, propagated to WLCG, and stopped the FTS channel to BNL. CERN and the Tier-1 sites had to run a daily reconfiguration tool by hand (the following morning).

 

M.Ernst noted that this was supposed to be a trial change and was then removed, but it propagated immediately without BNL knowing.

He added that sites have to refresh their FTS caches manually in order to have an updated list of sites for the data transfers. This ought to be automated in the future.

 

Atlas Tier-2:

Site AGLT2 (Univ. of Michigan) – one of the 4 used for ATLAS muon calibration – disappeared from the WLCG BDII on 16 July. The cause was found to be a new version of the way the OSG-WLCG mapping is made: sites have to be marked as interoperable at several levels in order to be propagated. It was fixed on 30 July, a few days after the problem was noticed.

 

CNAF:

On 21 July CNAF installed LFC 1.6.10 and experienced repeated crashes (the recommended version is 1.6.8). A cron job was installed to check and restart the daemon, but it was written with inverted logic, so the daemon was restarted every 2 minutes. This has now been corrected, so the site is back to 1-2 days between crashes.
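
For illustration only (this is not CNAF's actual script): a watchdog of this kind should restart the LFC daemon only when a liveness check fails; the reported bug amounted to inverting that condition. A minimal Python sketch, run from cron, where the process/service name (lfcdaemon) and the 2-minute interval are assumptions:

#!/usr/bin/env python
# Hypothetical LFC watchdog; cron entry e.g. */2 * * * * /usr/local/sbin/check_lfcd.py
import subprocess, sys

def daemon_running(name="lfcdaemon"):
    # pgrep returns exit code 0 if a process with this exact name exists
    return subprocess.call(["pgrep", "-x", name],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

if daemon_running():
    sys.exit(0)  # correct logic: do nothing while the daemon is alive
# Restart only when the daemon is down; with the condition inverted (the bug),
# the restart would instead fire on every run, i.e. every 2 minutes.
subprocess.call(["service", "lfcdaemon", "restart"])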

 

NDGF:

The PNFS log filled up over the weekend of 26/27 July and stopped their dCache from working.

 

General:

LFC 1.6.11 (which fixes the periodic LFC crashes) entered the PPS in release 34 last Friday. The plan is to accelerate its passage through the PPS.

3.3      Experiment Reports

ALICE:

P.Mendez presented the results of the integration of the CREAM CE with ALIEN at a recent ALICE task force meeting. The FZK-PPS VOBox is fully configured with CREAM CE access, with 30 WNs behind it, being increased to 330.

 

Access to the local WMS is ensured, and submission was tested following two different approaches:

-       Submission through the WMS: ALICE submits from a VOBox with many delegations in the user proxy; going from the UI, entering the VOBox, passing through the WMS, through the CREAM CE and then through BLAH into the CE exceeds the current limit of 9 delegations. A bug report has been made. There has been no further testing through the WMS yet.

 

Ph.Charpentier added that some sites already have a few delegations in their proxy, so the limit of 9 delegations is always exceeded, making the site unusable. The limit should be raised above the current 9 delegations; OpenSSL should be rebuilt with a different parameter and VDT should be updated or patched.

 

I.Bird asked whether this issue is related only to the CREAM CE or also to the LCG CE.

P.Mendez replied that the CREAM CE adds one additional delegation and thereby just exceeds the limit of 9.
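
As background to the delegation counting (an illustration, not part of the middleware): each delegation appends one proxy certificate to the chain carried with the job, so the depth reached at any hop can be estimated by counting the certificates in the proxy file. A minimal Python sketch, assuming the proxy is a PEM file at the conventional path /tmp/x509up_u<uid>:

#!/usr/bin/env python
# Estimate the delegation depth of a grid proxy by counting certificates in its PEM chain.
import os

def proxy_chain_length(path=None):
    path = path or "/tmp/x509up_u%d" % os.getuid()  # conventional proxy location
    with open(path) as f:
        return f.read().count("-----BEGIN CERTIFICATE-----")

if __name__ == "__main__":
    n = proxy_chain_length()
    # The chain holds the user certificate plus one certificate per delegation,
    # i.e. roughly n - 1 delegations; a limit of 9 is exceeded once the chain grows past 10.
    print("certificates in chain: %d (approx. delegations: %d)" % (n, n - 1))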

 

-       Direct submission to the CREAM CE: job submission takes 3 seconds, compared with 10 seconds via the LCG RB. It needs the definition of a gridftp server at each site to return the output sandbox (currently the output is left on the CE, with messy recovery/cleanup; note that such servers were dropped with gLite 3.1).

 

I.Bird asked whether the CREAM CE performance is satisfactory for ALICE.

P.Mendez replied that the performance is very satisfactory, but the site must already have a gridftp server installed for it to work.

 

J.Gordon asked whether ALICE will ask sites to install the CREAM CE.

P.Mendez replied that it is the Experiment that has to ask for it officially. For now some development is still needed, and CREAM will have to be certified first.

O.Keeble added that the certification of the current version of the CREAM CE was done, but a new version will come and will need to be certified.

 

Ph.Charpentier noted that the CREAM CE could run in parallel with the other CE once it is installed. He reminded the MB that, if an upgrade is agreed, it should be implemented very early next year.

I.Bird agreed but added that, as CREAM can be run in parallel with the LCG CE, it will be installed and used progressively by the VOs wanting to do so.

 

LHCB:

There was heavy load on the LHCb CERN gLite 3 WMS around the 23rd, with 20000 jobs/WMS/day. A third WMS was borrowed from ATLAS, with a plan to convert one to gLite 3.1 (which can handle more jobs) for testing.

 

During testing, pilot jobs were found to exceed the limit of 10 proxy delegations in VDT 1.6 when going through the 3.1 WMS. A patch is said to be available; it will need a coordinated, prioritised rollout.

 

R.Santinelli added that LHCb was able to reproduce the issue and reported it.

O.Keeble added that the VDT patch fixing the delegation limitation is available and will be certified in about one month.

 

A proxy mix-up bug preventing multiple FQANs was found in the 3.1 WMS, occurring when a user wants to submit to the same instance with different VOMS roles.

 

Ph.Charpentier added that this is a very urgent issue for LHCb.

 

CMS:

CMS is continuing the pattern of cosmics runs on Wednesdays and Thursdays.

An Oracle security patch upgrade of devdb10 caused an unexpected interruption of the pit-to-Tier-0 testing. It will be followed up at the regular CMS meeting tomorrow.

 

D.Duellmann clarified that this issue was just on a development machine and not on a production system.

 

ATLAS:

Since the end of July ATLAS has been in quasi-continuous cosmics data-taking mode. System tests are done during the day, and there is combined running (with as many systems as possible) usually every night and over the weekends.

 

ATLAS processes these combined data at the Tier-0, which also includes registration with DDM. There is hence cosmics data flowing to the Tier-1s at any time, without prior dedicated announcement. Data rates and volumes are hard to predict; during last week, for example, there were ~9 TB of RAW and ~2 TB of ESD.

 

In addition, functional tests at 10% of the nominal rates continue. Last week ALL Tier-1s received 100% of the FT data from CERN (at the same time all cosmics data is also replicated), and there was no double registration for Tier-1s and Tier-2s (an old problem).

 

Ph.Charpentier noted that some tickets received replies saying that support was reduced because the site was in a holiday period. Holiday periods should not affect the functioning of the services and should not be used as a reason for not fixing issues. Sites should have knowledgeable support on a 24x7 basis, as agreed.

 

I.Bird agreed that User Support should cover all weeks of the year at an adequate level of expertise even when some particular persons are on holiday.

 

4.   Automatic Distribution of Middleware Clients - O.Keeble

 

O.Keeble presented a proposal for the distribution of the middleware client.

There is now a mechanism to distribute the WN middleware similar to the one used by the Experiments to distribute their applications.

 

There would no longer be a need for a roll-out phase, and roll-back would also be much easier.

 

The problem is that the OPS VO cannot do software installations; the Experiments' VOs could test these installations. Dteam is allowed to do software installations at some sites and could be considered.

 

J.Gordon noted that a paper describing the issues was due, and that the WN installation proposal should go to the GDB and be discussed with the sites.

 

New Action:

O.Keeble should distribute the proposal for the WN distribution (and changes needed at the sites) in writing to the MB and to the GDB.

 

Update:

O.Keeble distributed the proposal. Here is the link:

https://twiki.cern.ch/twiki/bin/view/EGEE/ClientDistributionProposal

 

5.   VO-Specific SAM tests - July 2008 (VO_SAM_200807)

 

 

I.Bird introduced the subject, proposing that the VO-specific SAM results should be reported and therefore reviewed monthly at the MB.

5.1      ALICE (Slides) - P.Mendez

ALICE VOBOX Tests

ALICE has a specific test suite, created by the experiment, for its VOBoxes. The tests are executed every 2 hours from monb002@cern using ALICE credentials. At this moment 5 tests are defined as critical. They are not (yet) included as a sensor in the site availability calculation.

 

There were no special issues to report during July 2008. The RB-related test has been removed from the list of the usual Critical Tests (CT) defined for ALICE.

Specific tests have to be defined for the WMS and for the RB from the same test suite.

 

The remaining tests are therefore five; they check access to the software area from the VOBox and check all the different machine and user proxies.
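
For illustration (this is not the actual ALICE test suite): the checks described above amount to verifying that the experiment software area is readable from the VOBox and that the machine/user proxies still have enough lifetime. A minimal Python sketch, where the software-area path, the VO_ALICE_SW_DIR fallback and the 6-hour threshold are assumptions:

#!/usr/bin/env python
# Hypothetical VOBox sanity check: software area reachable and proxy lifetime sufficient.
import os, subprocess, sys
from datetime import datetime

SOFTWARE_AREA = os.environ.get("VO_ALICE_SW_DIR", "/opt/exp_soft/alice")  # assumed path
PROXY_FILE = "/tmp/x509up_u%d" % os.getuid()
MIN_HOURS_LEFT = 6  # assumed threshold

def software_area_ok():
    return os.path.isdir(SOFTWARE_AREA) and os.access(SOFTWARE_AREA, os.R_OK)

def proxy_hours_left(path):
    # "openssl x509 -enddate -noout" prints e.g. "notAfter=Aug  6 16:00:00 2008 GMT"
    out = subprocess.check_output(["openssl", "x509", "-enddate", "-noout", "-in", path])
    end = datetime.strptime(out.decode().strip().split("=", 1)[1], "%b %d %H:%M:%S %Y %Z")
    return (end - datetime.utcnow()).total_seconds() / 3600.0

ok = software_area_ok() and proxy_hours_left(PROXY_FILE) > MIN_HOURS_LEFT
print("OK" if ok else "CRITICAL")
sys.exit(0 if ok else 2)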

 

The email notification procedure is still in place, based on site admin preference; e.g. it has been removed for NIKHEF and SARA.

 

All Tier-1 sites and CERN are up and running.

 

There are problems associated with several VOBoxes at Tier-2s, but they are under control and are being followed up with the site responsibles. Today 5 Tier-2s show problems in their VOBoxes, but this is a common variation which can be observed regularly and does not affect the total production (it only decreases the number of available resources).

 

CE:

The standard ops test suite is used for the CE. It is executed every 2 hours from monb002@cern using ALICE credentials, with monb003 used up to now as a backup solution. No new developments or changes are foreseen for this sensor.

Two Critical Tests are defined:

-       Job submission

-       Access to the software area

 

Although the 2 tests are critical for ALICE, the experiment also has its own methods to check the CEs. In many cases the pilot submission already gives the status of these tests. For the moment the experiment keeps both tests.

 

I.Bird noted that it is important that there are tests that are not only internal to the VO; otherwise they cannot be easily monitored by the sites and via general tools.

 

With this CE sensor, the experiment has suffered from an already well-known bug during this month: in the test suite submission phase, from time to time the log files are blocked, preventing the execution of the new test suites.

 

Only CCIN2P3 was below target for ALICE, but this was probably due to instabilities while migrating the CE to gLite 3.1.

 

SE and SRM:

No ALICE tests for SE and/or SRM

 

I.Bird asked whether the VOBOXES tests are Critical Tests and, if not, when they will become critical. The WLCG needs to start publishing the VO-specific tests.

P.Mendez replied that they are not CT but hopefully soon ALICE will decide to make them critical.

 

J.Gordon noted that the criticality of the VO specific tests should be described and agreed also with the sites not just internally by the VO. He reminded the MB that there is a wiki page that should be updated with the VO Test descriptions: https://twiki.cern.ch/twiki/bin/view/LCG/SAMVOSpecificTests

 

5.2      ATLAS (Slides) - A.Di Girolamo

 

BNL

BNL is not tested with the ATLAS-specific SAM storage element tests, since these are sent only to sites belonging to clouds that use an LFC in production. BNL will have an LFC in production soon.

 

NDGF

The same applies to NDGF, but for this cloud it is up to ATLAS to include the sites among the tested ones, since they have already put their LFC in production.

 

IN2P3-CC

IN2P3-CC reliability has mainly been affected by problems related to the storage elements. The problems were always reported to the site admins and often quickly solved by them. The SE endpoints are to be updated.

 

RAL

The storage element that was tested has been decommissioned, thus the ATLAS SE list should be updated.

 

SARA

Problems with their SE (“lcg-cr” often timing out) were solved around 9 July; since then their reliability has been good.

 

I.Bird asked whether the tests are representative of the sites’ situation for ATLAS and whether their results can be used for the reporting.

A.Di Girolamo replied that many improvements are still planned, but this is an initial, representative set of the ATLAS SAM tests.

5.3      CMS - M.Kasemann

The CMS tests are unchanged since the presentation by A.Sciabà about two months ago, and CMS follows the SAM test results in its weekly meetings.

Most sites are above target. CNAF had problems with the storage and with the network, while the CERN values are unclear.

 

I.Bird asked whether for CMS these tests are representing the situation and can be used in the reporting.

M.Kasemann agreed and added that CMS uses these tests regularly.

5.4      LHCb - Ph.Charpentier

The test results are what LHCb calculates, and they are overall correct. The SRM tests are severe and correctly represent the situation. For instance, at SARA it is difficult to obtain a reliable TURL because it is deleted before it is used; a job that needs 20 files does not get 20 TURLs kept long enough.

 

The CE test framework was changed and the CE tests were not available for more than one week. This has an influence on the reliability: if the CE is unavailable and the SRM fails one test, the whole day is considered negative. Sites should have a way to report OK while they are modifying a test or a setup.
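
To illustrate the arithmetic (a minimal sketch of the usual convention, not the actual SAM/GridView algorithm): if the site status in each time bin is the logical AND of its critical services, a CE reported unavailable all day drives the daily figure to zero even when every other service works.

#!/usr/bin/env python
# Daily availability as the fraction of time bins in which every critical service passed.
def daily_availability(bins):
    # bins: list of dicts mapping service name -> True/False for each time bin
    up = sum(1 for b in bins if all(b.values()))
    return up / float(len(bins))

# Hypothetical day in 24 hourly bins: SRM fine except one bad hour, CE down all day.
day = [{"CE": False, "SRM": hour != 12} for hour in range(24)]
print(daily_availability(day))  # -> 0.0: the unavailable CE makes the whole day count as negative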

 

I.Bird asked whether the tests are realistic and the result meaningful for LHCb.

Ph.Charpentier agreed but noted that not all VOs test at the same level of detail. The LHCb tests really try everything in detail: finding/copying files, requesting TURLs, etc.

 

M.Lamanna mentioned that there is a visualization dashboard for the VO-specific SAM tests and that he will distribute a link to the MB list.

 

Conclusion:

I.Bird concluded that the MB should review the values every month and that the tests should be documented properly, in order to know exactly what they check.

 

6.   AOB
 

6.1      Status of Fermilab KCA - is it used?

I.Bird asked whether the FNAL KCA - which is still distributed as an approved CA and is causing some confusion among sites - is still used or whether it can be removed from the standard distribution.

 

L.Dell’Agnello noted that the FNAL KCA is probably used by non-LHC VOs, e.g. CDF, and should perhaps still be supported by sites that support those VOs.

 

I.Fisk was no longer present at the meeting; the issue will therefore be followed up by I.Bird outside the MB meeting.

 

7.   Summary of New Actions

 

 

 

New Action:

19 Aug 2008 - Feedback from ATLAS on their Job Priorities installations and description of shares required at each Site.

 

New Action:

O.Keeble should distribute the proposal for the WN distribution (and changes needed at the sites) in writing to the MB and to the GDB.