LCG Management Board

Date/Time

Tuesday 24 June 2008, 16:00-17:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=33700

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 27.6.2008)

Participants  

A.Aimar (notes), D.Barberis, I.Bird (chair), T.Cass, J.Casey, Ph.Charpentier, D.Collados, L.Dell’Agnello, A.Di Meglio, F.Donno, M.Ernst, I.Fisk, S.Foffano, F.Giacomini, J.Gordon, A.Heiss, F.Hernandez, M.Lamanna, U.Marconi, A.Pace, R.Pordes, Di Qing, Y.Schutz, J.Shiers, R.Tafirout

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 8 July 2008 16:00-17:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      Follow-up to the LHCb Request of Using Pilot Jobs

As agreed during the previous MB meeting, Ph.Charpentier distributed a note about LHCb multi-user pilot jobs. Here is the text:

 

Request from LHCb to use multi-user pilot jobs for testing DIRAC

 

In order to further assess the ability of the DIRAC framework to use multi-user pilot jobs, and in the absence of a large deployment of gLexec, LHCb would like to run in such a mode, without gLexec, for a few weeks. Concretely this means:

 

-          Pilot jobs will be submitted with a specific FQAN (role=pilot) using a specific DN (e.g. that of the Computing Coordinator)

-          Jobs will be submitted to DIRAC using their own DN by either production managers (using role=lcgprod) or physicists (using no specific role, i.e. regular users)

-          Pilot jobs will execute any of the above payloads. All DM operations will be performed using the credentials of the user who submitted the job to DIRAC (i.e. not those of the pilot job).

-          Pilot jobs will possibly execute more than one payload in order to test the following: proper functioning of the "time left" utility; proper clean up of the job workspace; proper switch of credentials.

-          Pilot jobs will use a fake "gLexec" that will log all transitions to the central DIRAC logging system, in particular the payload owner identity, gLite id, local job id, job type, CPU/wall-clock time consumed, etc. The VO manager will be able to trace the history of a given gLite ID and transmit, on request, the identity of the owner of the job.

-          "User analysis" consists of running LHCb applications (i.e. Gaudi jobs) on Tier1's, accessing input data on the Tier1 SE and producing output stored on the LHCb_USER space token if required.

 

This mode of running will be enabled in DIRAC3 during a "CCRC-like" exercise to be held in July (exact dates to be determined), for a duration of a couple of weeks. LHCb will produce a detailed report on this activity after its completion. LHCb commits to providing the identity of the owner of a job at any given time, given the gLite id, should any problem be noticed by the sites. LHCb commits to banning, at the DIRAC authorisation level, any misbehaving user, upon request from the sites.

 

It is understood that in the very unlikely event of misbehaving jobs, the pilot job owner will be banned from the site as an immediate measure; the VO manager will be informed and will provide the information on the actual user identity as described above. An explanation will be requested from the user, who will be banned from submitting jobs to DIRAC if the misbehaviour was intentional. After that, the pilot jobs will be re-enabled at the site. A report will be transmitted to the site and to the WLCG MB.
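The logging described in the note for the fake "gLexec" can be sketched as a simple structured record. Everything below (field names, the JSON layout, the helper function) is a hypothetical illustration: the note only lists which pieces of information must be recorded, not a concrete format.

```python
import json
import time

# Hypothetical record layout; the LHCb note only specifies the
# information to be logged (payload owner, gLite id, local job id,
# job type, CPU/wall-clock consumed), not any particular schema.
def pilot_transition_record(payload_owner_dn, glite_id, local_job_id,
                            job_type, cpu_seconds, wall_seconds):
    """Build one log entry for a payload transition inside a pilot job."""
    return json.dumps({
        "timestamp": int(time.time()),       # when the transition happened
        "payload_owner": payload_owner_dn,   # DN of the user who submitted to DIRAC
        "glite_id": glite_id,                # grid id of the pilot job
        "local_job_id": local_job_id,        # batch-system id at the site
        "job_type": job_type,                # e.g. user analysis or production
        "cpu_s": cpu_seconds,                # CPU time consumed by the payload
        "wall_s": wall_seconds,              # wall-clock time consumed
    })
```

With records of this kind in the central DIRAC logging system, the VO manager can trace the history of a given gLite ID and report the payload owner on request, as the note promises.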

 

 

J.Gordon asked where LHCb plans to run their pilot jobs.

Ph.Charpentier replied that LHCb plans to run pilot jobs on all their Tier-1 Sites.

 

J.Gordon reminded the MB that the "Multi-User Pilot Jobs" security document should be approved by the MB in the next few weeks, as requested by the JSPG. Sites and Experiments agreed on the policy and should follow it.

I.Bird replied that this request is for 2 weeks in July only, and not a permanent change of policy. LHCb asked to do some time-limited testing.

 

Ph.Charpentier noted that the request concerns the Tier-1 Sites only, but all the Tier-1 Sites are needed for the test. He pointed out that one or two other VOs are already doing what LHCb is asking for, and nobody is actually complaining.

 

A.Heiss, for FZK, added that only a few German sites are running LHCb jobs and it should therefore not be a major issue.

F.Hernandez agreed that for the IN2P3 Tier-1 there is no problem; for activities on the French Tier-2 sites it is also likely to be accepted, but he will have to check and confirm.

L.Dell’Agnello stated that for the CNAF Tier-1 and for the Italian Tier-2 sites he will also have to verify the possibility. He will mail the answer to the MB list.

 

I.Bird recommended that, considering also that LHCb has always provided all the information requested, the WLCG MB should accept their request and the issue be resolved before next week.

 

New Action:

1 July 2008 - The MB should decide on the LHCb’s request to use multi-user pilot jobs for testing DIRAC in July.

 

2.   Action List Review (List of actions)

 

  • 30 Apr 2008 - Sites send to H.Renshall plans for the 2008 installations and what will be installed for May and when the full capacity for 2008 will be in place.

Will be discussed at the LHCC mini review next week.

  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm it.

The only information still missing is the CMS list of 4 user DNs that can post alarms to the sites’ email addresses.

  • H.Marten and Y.Schutz agreed to verify and re-derive the correct data and the normalization factors applied (at FZK, for instance), and to compare the local accounting, the APEL WLCG accounting and the ALICE MonALISA accounting.

Ongoing. H.Marten reported that the follow-up to the previous MB meeting regarding ALICE accounting has started.

  • A.Aimar will ask information about the CERN reliability data during the power cut in May.

A.Aimar asked J.Casey but has not received the information yet.

  • New service related milestones should be introduced for VOMS and GridView.
  • Tier-1 Accounting Report for May to be analysed, corrected and an explanatory email sent to the MB.
  • M.Schulz should present an updated list of SAM tests for instance testing SRM2 and not SRM1.
  • J.Shiers will ask SAM to review the MoU requirements and define SAM tests to verify them.

 

3.   LCG Services Weekly Report (Slides) - J.Shiers

J.Shiers presented a summary of the LCG Services activities.

Here is the link to the minutes of the last meeting: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek080616

 

Below are the versions of the MSS software, both released and not yet released.

 

Component    Version            Comments
-----------  -----------------  ------------------------------------------------
CASTOR core  2.1.7-10           Will be released this week. Tier-1s are
                                recommended to upgrade by mid-July.
             2.1.8              Will be released the first week of August.
                                Tier-0 will upgrade before the end of August;
                                the Tier-1s will follow.
CASTOR SRM   1.3-27 on SLC3
             2.7-1 on SLC4      As soon as released.
dCache       1.8.0-15p6         Fixes a bug with caching credentials produced
                                through grid-proxy-init.
             1.8.0-15p7         About to come out; fixes a problem with checksum
                                verification when copying a file in push mode
                                between two dCache sites.
StoRM        1.3.20 on SLC4
DPM          1.6.10 on SLC4

3.1      Sites Issues

A few sites seem to have problems with power and cooling.

There were two incidents this week; as agreed during CCRC, a post-mortem should be produced every time.

 

Site   Comments
-----  ---------------------------------------------------------------
IN2P3  Had a serious problem this weekend with the A/C; about 300 WNs
       had to be stopped. Waiting for action this week to repair the
       A/C unit. Information is kept posted on their website.
INFN   CNAF suffered a serious problem: the UPS was too heavy and the
       floor collapsed.

 

L.Dell’Agnello added that the power has been back since Sunday and the systems are being powered up. CNAF took the opportunity to perform a planned update of some equipment. The resources should all be back now and the UPS will be in place soon.

J.Shiers asked that the sites produce a short report to the MB in case of such incidents.

 

Site  Issue
----  -----------------------------------------------------------------
RAL   Information-gathering plug-ins timing out, resulting in "0" being
      published as the number of running jobs and the site therefore
      not being attractive; under investigation (Friday).
      A rogue CMS user submitted ~10K jobs and overloaded the batch
      system; the jobs were killed.
CERN  Availability currently at 0 (zero), failing gstat tests. No tests
      published since yesterday; the last tests failed.
      Publishing problem from gstat to SAM (is the limit 100?). Being
      investigated (Taipei).
      https://gus.fzk.de/pages/ticket_details.php?ticket=37710
GRIF  Raised the question of (possible) additional ATLAS space tokens
      at their Tier-2 (to be followed up).

 

J.Gordon clarified that so many jobs cause a timeout in the information provider; it is not a problem with the RAL batch system, because it also happened at other sites.

3.2      LCG Services Issues

Database Services - (see slide 5 for details) there were some problems with the cabling of the RAC6 Ethernet switches and with GridView. In addition, the SRM services for the different Experiments were moved to different RACs. In order to isolate the CASTOR SRM service from potential problems on the CASTOR stager back-end, the change consists in lowering the stager timeout. This should prevent the exhaustion of the SRM thread pool that was observed during CCRC.

 

GridView - some service instability with respect to the Oracle clusterware. The DBAs recreated the service definition and the situation looks better. A service request has been opened with Oracle. There are some extremely long transactions, and a review of the DB schema has been launched.

Monitoring / dashboard report: two issues after the GridView upgrade:

-       the TNS configuration was slightly out of date, so the services had to be restarted

-       a query stopped working and will be fixed tomorrow, so there will be a few hours during which the summary availability data will not be updated

CMS dashboard – performance/load problems. Working with the DBAs to identify slow/heavy queries. A specific query believed to have caused some of these problems is now using an index and runs much faster. A review has been proposed by the DB team and should be scheduled as soon as possible.

3.3      Other Services

VOMS service issues - FIO do not have a procedure, since it is basically impossible to measure when the service is unavailable: there is no monitoring agent that is a member of the VO, and it is hard to determine from outside the VO whether the service is working.

 

Number of threads for the LHCb LFC - increased to 60, taking advantage of a DB intervention.

 

LFC - VOBOXes of LHCb - to be recognized as a host certificate by the LFC, a VOBOX DN must end with CN=myhostname or CN=host/myhostname. The host certificate - just renewed - of the LHCb VOBOX at NIKHEF is therefore incorrect.
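As a minimal sketch, the DN rule above can be checked mechanically. The helper and the hostnames/DNs in the comments are made up for illustration; only the suffix rule itself comes from the report.

```python
import re

def vobox_dn_ok(dn, hostname):
    """Check that a VOBOX certificate DN ends with CN=<hostname>
    or CN=host/<hostname>, as the LFC requires."""
    return re.search(r"CN=(host/)?" + re.escape(hostname) + r"$", dn) is not None

# Hypothetical examples:
# vobox_dn_ok("/DC=ch/DC=cern/OU=computers/CN=host/vobox.example.org",
#             "vobox.example.org")   -> True  (CN=host/<hostname> suffix)
# vobox_dn_ok("/DC=ch/DC=cern/OU=computers/CN=vobox-service",
#             "vobox.example.org")   -> False (wrong CN suffix)
```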

3.4      Experiments Issues

CMS – fairly quiet week. Transfer problems to FZK reported on Tuesday; CMS SAM SE testing (CASTOR@CERN) failing with "write permission denied" (Friday); summary of CRUZET-2 (ended last Sunday) reported on Wednesday: 100% job success at Tier-0, some issues in the RECO step (unsuppressed zeroes in ECAL giving very large reconstructed files). iCSA08 production finished for all workflows; un-needed datasets are being cleaned up. HI production runs at Tier-1s. Some srm-cms ‘misbehaviour’ was seen well before the announced intervention time; all problems disappeared afterwards but it was not clear what happened.

 

ALICE – only occasional FTS testing; there is not enough Tier-1 disk space to fully stress sending custodial raw data during the commissioning exercise. First-pass ESD produced at CERN for transmission to the Tier-1s for eventual reprocessing.

 

LHCb – restarted CCRC activities with an ‘intermediate’ version of DIRAC. NIKHEF & IN2P3 were asked to support dcap instead of gsidcap. More information in the elog LHCb observation.

 

F.Hernandez added that the requested change also affects CMS. IN2P3 would like to understand why gsidcap is not working properly and is preparing some tests. After the tests they are ready to move to dcap, if necessary.

 

ATLAS – NDGF migrating from RLS to LFC (Wed). Clarification of the naming scheme to be used from now on; data are deleted after 8 days. The naming convention for the ATLAS Functional Test datasets is ccrc08_run2.0XXYYY.physics_blabla, where XX is the week number (this week, which started Monday at 0:00, is 25) and YYY is sequential. This is just to announce that, starting tomorrow, the ccrc08_run2.024yyy.blabla data will be deleted. For the sites this operation is transparent.
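The naming convention can be captured in a couple of small helpers. This is only an illustrative sketch of the scheme quoted above; the helper functions themselves are hypothetical.

```python
import re

def dataset_name(week, seq, suffix="physics_blabla"):
    """Build a Functional Test dataset name ccrc08_run2.0XXYYY.<suffix>,
    where XX is the week number and YYY a sequential number."""
    return "ccrc08_run2.0%02d%03d.%s" % (week, seq, suffix)

def week_of(name):
    """Extract the week number from a dataset name, or None if the
    name does not follow the convention."""
    m = re.match(r"ccrc08_run2\.0(\d{2})(\d{3})\.", name)
    return int(m.group(1)) if m else None

# Deleting the week-24 data then amounts to selecting the datasets
# for which week_of(name) == 24.
```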

 

J.Gordon asked that it be noted in the LHCb QR that the problem with the CA is not specific to RAL: it is a problem that depends on the procedures and on the way certificates can be changed in the middleware. It will happen at other sites too, because there is no alternative method.

 

4.   Preparation of the LHCC Review (HL Milestones) - I.Bird 

 

I.Bird asked for a final verification of some HL Milestones before the LHCC Review of next week. He asked that all sites reply to J.Gordon’s questionnaire, clarifying exactly the percentage of the pledges installed and when the installation will be completed.

 

24x7 Support and VO Boxes Support - these milestones are late and have not progressed for a long time.

 

FZK (A.Heiss) and IN2P3 (F.Hernandez) reported that they have administrative procedures to fulfil and could not complete them sooner.

T.Cass reported that for CERN the procedures are implemented (i.e. 07-05 is green) but the Experiments have not formally approved them (i.e. 07-05b is still red).

 

The Definition of the CAF by the Experiments should be completed.

ATLAS (D.Barberis) and LHCb (Ph.Charpentier) declared the CAF definition completed for their Experiments.

I.Fisk said that CMS is discussing it in their workshop this week, and will define it in the next few weeks.

 

This is the dashboard, updated after the meeting: 
https://twiki.cern.ch/twiki/pub/LCG/MilestonesPlans/WLCG_High_Level_Milestones_20080626.pdf

 

 

5.   Status of OSG RSV test and Equivalence to SAM (RSV_Crit_Tests; WLCG-OSG) - R.Pordes, D.Collados

D.Collados proposed a set of critical tests for OSG (RSV_Crit_Tests).

 

The existing critical tests are described at: https://twiki.cern.ch/twiki/bin/view/LCG/OSGCriticalProbes

5.1      OSG CE Tests

The list of critical tests (all OSG CE tests) is:

-       org.osg.certificates.crl-expiry (check if CRLs are still valid)

-       org.osg.general.osg-directories-CE-permissions

-       org.osg.general.osg-version (version of OSG CE running)

-       org.osg.globus.gridftp-simple (a test file is gridftp'ed to remote host, and back. Then, files are checked for consistency)

 

The proposed list of critical tests for OSG-CEs has 4 new tests and one removed:

-       org.osg.certificates.crl-expiry (check if CRLs are still valid)

-       org.osg.general.osg-directories-CE-permissions

-       org.osg.general.osg-version

-       org.osg.certificates.cacert-expiry: check if CA certs are still valid

-       org.osg.general.ping-host: check if CE responds to pings

-       org.osg.globus.gram-authentication: authenticate to remote CE using GSI

-       org.osg.batch.jobmanager-default-status: check the status of the default job manager. This one will only be available in early July.

 

These are equivalent to the EGEE tests in terms of functionality, but in the next OSG releases they will be executed from the WN (at present jobmanager-fork is used).

 

F.Hernandez asked whether the EGEE test checking that sites accept the recommended CAs is missing.

J.Casey agreed that this test is missing; he will check and report to the MB.

 

New action:

J.Casey to report whether the EGEE test checking that sites accept the recommended CA is critical and, if so, implement it for OSG.

5.2      OSG Gridftp Tests

Proposed list of OSG-Gridftp critical tests:

-       org.osg.globus.gridftp-simple: a test file is gridftp'ed to remote host, and back. Then, files are checked for consistency.

 

This is equivalent to the EGEE ‘CE-sft-lcg-rm-cr’ and ‘CE-sft-lcg-rm-cp’ tests. It will be even closer to the EGEE ‘SRMv2-lcg-cp (lcg-cp --nobdii)’ test that will be out in the coming weeks; in this respect the OSG test is ahead of the existing EGEE one.
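The round-trip consistency check at the heart of gridftp-simple can be sketched as follows. The transfer step is abstracted into callables: in the real probe it would be done with gridftp (e.g. globus-url-copy), while here a plain local copy can stand in for it, so everything beyond the checksum-and-compare idea is an assumption.

```python
import hashlib
import os
import tempfile

def sha1_of(path):
    """Checksum a file in fixed-size chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def roundtrip_consistent(src, transfer_out, transfer_back):
    """Copy src 'out' and 'back' via the given transfer callables and
    compare checksums, as the gridftp-simple probe does with its
    gridftp transfers."""
    with tempfile.TemporaryDirectory() as tmp:
        remote = os.path.join(tmp, "remote_copy")
        returned = os.path.join(tmp, "returned_copy")
        transfer_out(src, remote)      # file gridftp'ed to the remote host
        transfer_back(remote, returned)  # ... and back
        return sha1_of(src) == sha1_of(returned)
```

For a local smoke test one can pass shutil.copy for both transfer callables; the probe itself would wrap the actual gridftp invocations instead.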

5.3      OSG SRM Tests

Proposed list of OSG-SRMv1/v2 critical tests:

-       org.osg.srm.srmping: check if the SRM server responds to srm-ping

-       org.osg.srm.srmcp-readwrite: srmcp a file to and from a storage element using the SRM protocol

These are equivalent to the EGEE SRMv1 tests (SRM-put, SRM-get, SRM-advisory-delete).

5.4      Previously Found Issues

There is now a single source of information for OSG resources (the OIM list, http://oim.grid.iu.edu/publisher/get_osg_interop_monitoring_list.php) and all OSG services are now published and automatically stored in the SAM database.

 

The GridView database is ready to accept OSG downtimes in production mode. End-to-end tests are successful.

 

I.Bird concluded that, provided that the verification about the CA tests (mentioned above) is positive, the OSG RSV tests should be considered equivalent to the existing SAM tests.

5.5      Status of the OSG RSV Milestones

R.Pordes reported the status of the OSG RSV milestones present in the HL Milestones dashboard.

 

WLCG-08-01b Jun 2008 RSV:  Tier-2 SE Tests Equivalent to SAM; Successful WLCG verification of OSG test equivalence of RSV tests to WLCG SE tests

-       Work completed, pending approval by the WLCG board.

 

WLCG- 08-02 Jun 2008: OSG Tier-2 Reliability Reported: 

 

OSG RSV information published in SAM and GOCDB databases.

-       Work completed, pending approval by the WLCG board, from RSV through to GridView.

 

Reliability reports include OSG Tier-2 sites.

-       Excel output from GridView being checked by OSG against local RSV reports; it is being reviewed by US ATLAS and US CMS management.

 

I.Bird asked whether this validation of GridView is only for June.

R.Pordes replied that it is only a one-time verification, in order to replace the N/A entries with real reliability data in the future.

 

OSG also requested that:

-       The monthly WLCG report (PDF file) from GridView including the OSG Tier-2s is distributed to OSG as early as possible at the beginning of the month. They will verify it and reply whether it can be included in the official June report.

-       The OSG reliability data are also needed inside OSG. They have asked GridView for an API or an XML feed to retrieve the reliability data online.

 

6.   Dynamic Megatable Proposal (Slides) - F.Donno

 

Postponed to next meeting (F2F Meeting, 8 July 2008).

 

7.   AOB
 

 

No AOB.

 

 Summary of New Actions

 

 

New Action:

1 July 2008 - The MB should decide on the LHCb’s request to use multi-user pilot jobs for testing DIRAC in July.

 

New action:

J.Casey to report whether the EGEE test checking that sites accept the recommended CA is critical and, if so, implement it for OSG.