LCG Management Board

Date/Time:

Tuesday 30 October 2007 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=22184

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 2.11.2007)

Participants:

I.Bird (chair), K.Bos, T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, C.Grandi, F.Hernandez, M.Lamanna, E.Laure, U.Marconi, P.Mato, G.Merino, R.Pordes, Di Qing, H.Renshall, L.Robertson, Y.Schutz, R.Tafirout

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 6 November 2007 16:00-18:00 – F2F Meeting at CERN

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous meeting were approved.

1.2      Sites Names

A.Aimar is collecting the names of the sites in order to use a single name per site in the various reports, tables, etc.

 

Ph.Charpentier asked what the meaning of “DE-KIT” is.

I.Bird explained that the grid centre is now going to be called “Karlsruhe Institute of Technology”, as a merger of FZK and Karlsruhe University.

 

Information as of 2.11.2007; the confirmed names are shown in red.

 

GOCDB Id            Site           New Name
Taiwan-LCG2         ASGC           TW-ASGC
BNL-LCG2            BNL            US-T1-BNL
CERN-PROD           CERN           CH-CERN
INFN-T1             CNAF           IT-INFN-CNAF
USCMS-FNAL-WC1      FNAL           US-FNAL-CMS ?
FZK-LCG2            FZK            DE-KIT
IN2P3-CC            IN2P3          FR-CCIN2P3
NDGF-T1             NDGF           NDGF
pic                 PIC            ES-PIC
RAL-LCG2            RAL            UK-RAL ?
SARA-MATRIX         SARA-NIKHEF    NL-T1
TRIUMF-LCG2         TRIUMF         CA-TRIUMF

 

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

  • 18 Sept 2007 - Next week D.Liko will report a short update about the start of the tests in the JP working group.

 

  • 21 Sept 2007 - D.Liko sends to the MB mailing list an updated version of the JP document, including the latest feedback.

 

An update was given by I.Bird during this meeting (see Section 5).

 

  • 21 October 2007 - Sites should send to H.Renshall their resource acquisition plans for CPU, disk and tape until April 2008

 

Not done. Only one new plan received from ES-PIC.

The Sites that have sent their acquisition plans are: TW-ASGC, US-T1-BNL, DE-KIT, ES-PIC and FR-CCIN2P3.

The others should send them to H.Renshall as soon as possible.

 

  • 21 October 2007 - D.Barberis agreed to clarify with the Reviewers the kind of presentations and demos that they are expecting from the Experiments at the Comprehensive Review.

 

Ongoing. D.Barberis started the discussions with the Reviewers and with the other Computing Coordinators. He will send a summary via email in the next few days.

 

3.    SRM Update (Minutes) - H.Renshall

H.Renshall presented the weekly update on the SRM 2.2 deployment.

 

dCache Patches - The dCache Patch 23/gridftp problem reported last week was quickly resolved; it turned out to be a configuration problem.

Yesterday NDGF started upgrading their production SE to dCache 1.8.0-0 with SRM v2.2 and could have been the first Tier-1 to advertise SRM v2.2 in production.

However, NDGF have decided to not advertise SRM v2.2 in production but to continue to run SRM v1.1 as SRM interface to dCache 1.8.0-0.

F.Donno is going to talk to them in order to understand the reason for this decision. In fact, the high-level utilities such as FTS, gfal and lcg-utils still default to SRM v1, therefore there should be no problem in advertising SRM v2.2 as well. Most other dCache sites are at Patch level 26.

 

FZK is upgrading the SRM v2.2 test endpoint to 1.8.0-0 today and should be able to switch on space management.

SARA and IN2P3 have been invited to follow that example soon, since a problem discovered while using lcg-cp with the tape system at SARA is cured by this version. Essentially, dCache testing and upgrades are proceeding as planned.

 

LHCb Tests - LHCb has performed several FTS tests to SRM v2.2 at IN2P3 and these went smoothly. There were problems with FTS transfers between CERN (SRM v2.2) and CNAF (CASTOR SRM v2.2), but this seems to be due to a misconfiguration of the FTS channel.

 

LHCb also tried testing lcg-utils with the following results:

-       NIKHEF – OK

-       CNAF - OK (CASTOR)

-       IN2P3 - problems copying data from the WN; the site has been informed. They can, however, dccp from a CERN UI outside of Lyon without problems.

-       FZK - problems because the test endpoint was not published in the production BDII (although it is today) and because of a mismatch between the version of gLite (3.0.2) installed on the WNs at FZK and that used by the distribution of the latest release of gfal/lcg-utils that F.Donno made available as a tar file to LHCb.
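For illustration, the kind of lcg-cp copy exercised in these tests can be sketched as follows. This is a minimal sketch, not the actual LHCb test script: the endpoint, space token name, file paths and the exact lcg-utils options in use at the time are assumptions.

import subprocess

def srm2_copy_test(local_file, dest_surl, space_token, vo="lhcb"):
    """Copy a local file to an SRM v2.2 endpoint with lcg-cp, using a space token."""
    cmd = [
        "lcg-cp",
        "--vo", vo,            # VO on whose behalf the transfer is done
        "-b",                  # skip the BDII lookup and contact the endpoint directly
        "-D", "srmv2",         # request the SRM v2.2 interface (clients still default to SRM v1)
        "-S", space_token,     # destination space token (storage class)
        "file://" + local_file,
        dest_surl,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print("lcg-cp failed:", result.stderr.strip())
    return result.returncode == 0

if __name__ == "__main__":
    ok = srm2_copy_test(
        "/tmp/testfile.dat",
        "srm://srm.example-t1.org:8443/srm/managerv2?SFN=/lhcb/test/testfile.dat",
        "LHCb_RAW",   # hypothetical space token name
    )
    print("transfer OK" if ok else "transfer failed")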

 

GLUE Schema Information - There is now an agreement on what information is to be published in the GLUE schema.

An example is available in the GSSD pages: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDGLUEExample

 

A new set of SAM scripts is also available in order to verify that a site publishes information according to the example.

The SAM scripts are run several times per day and the results - with an explanation of the errors that occurred - are published at: http://lxdev25.cern.ch/s2test/bdii/s2_logs/

F.Donno has circulated this information to the sites and invited them to contact her in case of problems.
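As a rough illustration of what such a check involves (this is not the actual SAM test; the host, base DN and attribute set below are assumptions, and the authoritative list of what must be published is the GSSD example linked above), a site's published storage areas could be inspected with a short python-ldap query:

import ldap

BDII = "ldap://site-bdii.example.org:2170"   # hypothetical site BDII endpoint
BASE = "o=grid"                              # GLUE 1.3 information lives under o=grid

# Attributes one would expect each storage area (GlueSA) to publish; the definitive
# list is the GSSD example linked above.
REQUIRED_SA_ATTRS = ["GlueSARetentionPolicy", "GlueSAAccessLatency",
                     "GlueSATotalOnlineSize", "GlueSAAccessControlBaseRule"]

def check_storage_areas(vo="atlas"):
    """Print each storage area published for the VO and flag missing attributes."""
    conn = ldap.initialize(BDII)
    results = conn.search_s(
        BASE, ldap.SCOPE_SUBTREE,
        "(&(objectClass=GlueSA)(GlueSAAccessControlBaseRule=*%s*))" % vo,
        REQUIRED_SA_ATTRS + ["GlueSALocalID"])
    for dn, attrs in results:
        missing = [a for a in REQUIRED_SA_ATTRS if a not in attrs]
        sa_id = attrs.get("GlueSALocalID", [b"?"])[0].decode()
        print(sa_id, ":", "OK" if not missing else "missing " + ", ".join(missing))

if __name__ == "__main__":
    check_storage_areas()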

 

K.Bos asked whether SARA had updated to the latest version.

H.Renshall replied positively and confirmed that the latest version of dCache cures the problems observed at SARA with the tape system.

 

I.Bird asked whether the ATLAS tests that are being run are also checking the space tokens functions.

K.Bos replied that ATLAS has tested only the throughput performance between the Tier-0 site and the Tier-1 sites, not the space token functions. Explicit SRM v2.2 functionality testing will start after M5, in about 10 days.

 

4.    CCRC Update (Minutes) - H.Renshall

H.Renshall presented the weekly update about the CCRC 08 Planning activities.

 

Data Rates - Replies on data rates for the February run have now been received from all except ATLAS.

 

Storage Requirements - F.Donno has circulated to the experiments the questions of how much, if any, of the February data is to be kept, and whether the temporary and permanent storage for the run can be taken out of the existing experiment resource planning or needs to be added to it.

The fractions to be kept are not yet known but so far we know that:

-       CMS and ALICE will keep the detector parts of the February runs and

-       LHCb will keep none but needs extra resources during the run.

-       ATLAS information is missing.

 

ALICE (L.Betev) queried sites on how to separate real data from mock data while still using the same SRM endpoint. This still needs to be checked. At the Tier-0, MCS proposed that different directories with different 'file classes' should be used.

 

Planning - A.Aimar will distribute an updated version of the milestone plan. Feedback should be provided this week and will be reviewed at the pre-GDB face-to-face meeting.

 

Critical Services - Experiments are asked to provide a list of critical services following the CMS model - see twiki: https://twiki.cern.ch/twiki/bin/view/CMS/SWIntCMSServices

 

Next week there will be the pre-GDB face-to-face meeting, from 13.00 to 16.00 on 6 November, in the IT Auditorium. See the agenda at http://indico.cern.ch/conferenceDisplay.py?confId=22709. This meeting should finalise the important details of the February run.

 

I.Bird asked whether only CMS has sent some information on the critical services.

H.Renshall replied that, apart from CMS, the other experiments had not yet provided the information.

K.Bos added that ATLAS is working on defining exactly their needs for the FDRs. They will decouple the detector tests from the transfers to the Tier-1 sites; the data will be transferred to the sites from CASTOR. The plan, once completed, will include the critical services needed.

Ph.Charpentier said that the list had been provided in the past and will be provided again in a format similar to the CMS list.

H.Renshall also noted that the list of VO boxes is needed.

 

G.Merino asked whether the list is only about services at CERN or also at other sites.

H.Renshall replied that the current list is for the critical services needed at CERN only, but if the Experiments have critical services at other sites they should say so. For instance, CMS has specified PhEDEx as a critical service at their sites.

 

New Action:

The Experiments should provide to H.Renshall the list of critical services, at CERN and at other sites, needed for CCRC 08.

 

 

5.    Job Priorities Update (Document) - I.Bird

I.Bird summarized the document he had distributed to the MB mailing list.

 

Job Priorities - status

29 October, 2007

17:21

 

1.       Basic tests in certification, and changes in YAIM, have passed.  Testing moves to the pre-production system (PPS).

Work ongoing to implement YAIM changes also for SL4 version of LCG-CE.

Wiki page set up describing how to configure a site with this new version of YAIM: https://twiki.cern.ch/twiki/bin/view/EGEE/JPTest.  This describes the process for Torque/Maui; instructions for PBSPro and LSF are not yet there.

 

Comments from Simone Campana:

Progress from the “testers” side is the following: we tested the installation done by Di using YAIM on the certification testbed.  The test was successful, i.e. the middleware stack worked as expected. The WMS matches only one queue both with and without the production role, i.e. it understands the DENY tags. Also, the GLUE attributes considered for the matchmaking are those of the proper view, depending on which role one uses.  At the last TCG it was decided to move forward, which means installing the job priority mechanism at at least 3 PPS sites, of which at least 1 PBS and 1 LSF site; CERN must be one of them.  In the PPS we will run a similar test, but wider: 2 ATLAS and 2 CMS distinct FQANs should be used and, moreover, we should test that shares are strictly enforced over a long enough period of time. We should also test that the SGM gets higher priority with respect to other jobs from the same VO, irrespective of which share they belong to.
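As an aid to reading the above, the view selection behaviour can be illustrated with a toy sketch in Python. This is not the WMS matchmaker code; the view names, FQANs and the exact DENY syntax below are assumptions made for illustration only.

def fqan_matches(rule_fqan, user_fqan):
    """A rule FQAN matches a user FQAN if it is an exact match or a parent group."""
    return user_fqan == rule_fqan or user_fqan.startswith(rule_fqan + "/")

def select_views(user_fqan, views):
    """Return the names of the views whose access rules admit the user FQAN."""
    selected = []
    for name, acbrs in views.items():
        allowed = any(fqan_matches(r[len("VOMS:"):], user_fqan)
                      for r in acbrs if r.startswith("VOMS:"))
        denied = any(fqan_matches(r[len("DENY:"):], user_fqan)
                     for r in acbrs if r.startswith("DENY:"))
        if allowed and not denied:
            selected.append(name)
    return selected

# Hypothetical views for one CE queue: a general ATLAS share that denies the
# production role, and a dedicated production share.
views = {
    "atlas":            ["VOMS:/atlas", "DENY:/atlas/Role=production"],
    "atlas-production": ["VOMS:/atlas/Role=production"],
}

print(select_views("/atlas/Role=production", views))  # -> ['atlas-production']
print(select_views("/atlas", views))                  # -> ['atlas']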

 

PPS setup: the CERN site is being installed, awaiting preparation of the CE using the LCG-CE (on SL4); there were some problems publishing information to the site BDII.  Once tested at CERN, 2 other PPS sites will be added.

 

Note that the Job Priority changes have not been taken up by certification yet. This must wait until the testing in the PPS is concluded, in order to avoid passing something with an architectural problem to certification. Once the PPS tests have been passed successfully, the YAIM release with the JP changes will have to go to certification, hopefully being certified very quickly, and then go to the PPS (again) for a last round of tests (which we hope at that point will be extremely quick).

 

2.       For completeness the accounting system must be able to report by role and group (and by user DN).  The following is a summary of the status from Dave Kant.  There is some significant effort needed in this area.

 

o    APEL patch for processing the new accounting log, which encrypts the user DN and captures the user FQAN, is in the PPS.

o    Code to extract data from R-GMA, decrypt user DN and process the FQAN is written but not yet fully tested (resp. Dave Kant).

o    Need code to aggregate the data and create user-level and VOMS-level summaries.  Not started (resp. Dave Kant).

o    Portal - the CESGA portal has been updated to display the various views (Resource manager, user, etc.). Blocked until the previous item is ready.

o    Configuration:  Each site must change the installed configuration of APEL to enable the encryption and sending of the DN to the GOC.  This is a request to YAIM to allow sites to reconfigure APEL without necessarily touching the rest of the APEL site configuration. 

·         The YAIM team had not received such a request, but (yesterday) agreed that it can/should be done.

 

 

Basically the tests on the certification system have been passed. The PPS setup at CERN is on SL4 with the LCG CE; the issues encountered will be fixed in the next couple of days. After that 2 other PPS sites will be set up.

 

The accounting system must be adapted to deal with groups and roles in APEL accounting, in the CESGA portal, etc.

A request will be made to YAIM to allow sites to reconfigure APEL without necessarily touching the rest of the APEL site configuration.
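As an illustration of the missing aggregation step (a minimal sketch only, not the APEL implementation; the record fields are invented), the user-level and VOMS-level summaries essentially amount to grouping per-job records by user DN and by FQAN:

from collections import defaultdict

def summarise(records):
    """Aggregate job counts and CPU time by user DN and by VOMS FQAN."""
    by_user = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0})
    by_fqan = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0})
    for rec in records:
        for key, table in ((rec["user_dn"], by_user), (rec["fqan"], by_fqan)):
            table[key]["jobs"] += 1
            table[key]["cpu_hours"] += rec["cpu_hours"]
    return by_user, by_fqan

# Made-up example records of the kind extracted from the accounting logs.
records = [
    {"user_dn": "/DC=ch/DC=cern/CN=user one", "fqan": "/atlas/Role=production", "cpu_hours": 12.0},
    {"user_dn": "/DC=ch/DC=cern/CN=user two", "fqan": "/atlas", "cpu_hours": 3.5},
    {"user_dn": "/DC=ch/DC=cern/CN=user one", "fqan": "/atlas/Role=production", "cpu_hours": 7.0},
]

by_user, by_fqan = summarise(records)
print(dict(by_user))
print(dict(by_fqan))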

 

R.Pordes asked what the plans for interoperability with the OSG systems are. In the past it had been agreed that new PPS releases should have interoperability tests and verifications.

I.Bird replied that there are agreed PPS sites for testing cross-grid job submission.

 

R.Pordes added that OSG is currently not sending the user DN to the accounting. If that is needed, what is the due date?

I.Bird replied that this is a need for ATLAS, and they will ask for accounting at the level of roles, groups and user DNs.

R.Pordes stressed that a due date is needed because, for the moment, these changes have not been discussed as a top priority for OSG and Globus.

 

Action:

ATLAS should report on how it expects user accounting to be performed on OSG sites.

 

F.Hernandez asked what the planned time scale for getting to a Job Priority production version is.

I.Bird replied that the JP implementation could be ready by the end of 2007, but deployment will be discussed in detail with the sites and well in advance.

 

6.    ALICE Quarterly Report and Plans (Slides) - Y.Schutz

Y.Schutz presented the ALICE Quarterly Report for 2007Q3.

6.1      Physics Data Challenge 07

The PDC 2007 showed excellent stability of the central services. The sites delivered > 90% of pledged resources and new sites are joining (Wuhan, Hiroshima).

 

The plot below shows the number of jobs in the last 6 months. The trend is clearly towards stability.

 

 

6.2      CAF

The CAF cluster with PROOF is in production. Disk quotas, fair-share CPU targets for groups, and data staging with PROOF datasets are under development.

 

The plot below shows the usage of the CAF. Most of it is for generation of MC data.

 

 

6.3      ALICE Full Dress Rehearsal

The ALICE FDR is split into 3 phases.

 

Phase 1:  DEC-2007

Phase 1 will mostly consist of cosmic-ray data taking and calibration runs for detector commissioning (already started for some detectors, in the lab).

 

The steps to follow will be:

-       Registration in CASTOR2 + Grid File Catalog (OK so far)

-       Replication from Tier-0 to the Tier-1 sites, synchronously with data taking, using the FTD/FTS utilities (tested with FTS v.2)

-       Asynchronous replication to the CAF (OK so far)

-       Pass 1 reconstruction on the Grid at the Tier-0 site (OK so far)

-       Interactive expert analysis with PROOF on the CAF (OK so far)

 

As a reminder ALICE has no critical dependence on the SRM version.

 

The status of Phase 1 is:

-       Three detectors are already taking cosmic data on surface

-       The DAQ registration is working 100%, with no failure in one month

 

The plot below shows the increasing number of files sent from the DAQ to CASTOR by ALICE in the last two months.

 

 

The next steps are:

-       the commissioning exercise, which will start in situ in December

-       generated data fed into the DAQ data flow, to reach the nominal pp data rates

 

The continuous replication Tier-0 to Tier-1 is pending:

-       The current RAW data rate from the detectors is very low: 0.2 MB/s (target: 60 MB/s for pp)

-       ALICE needs to re-establish the tape storage at Tier-1 sites

-       The replication with nominal rate will now be done with injected RAW data

 

The Pass 1 reconstruction will require:

-       Reconstructions on the Grid driven by detector experts;

-       for the moment, rapid changes in the reconstruction software are required and this makes automatic processing difficult.

 

Phase II: From FEB-2008

Phase II will need all elements of Phase 1 and, in addition:

-       Second pass reconstruction at the Tier-1 sites

-       Collection and registration of conditions data from the DAQ

-       Detector ECS, DCS and HLT data during their commissioning (already being tested)

-       Online detector algorithms in DAQ/DCS/HLT

-       Data transit through the File Exchange Servers

-       The Shuttle registers condition objects and metadata in the Grid File Catalog

 

Phase III: From April-2008

Phase III will need all elements of Phases I and II and, in addition:

-       Gradual inclusion of the online Detector Algorithms and the Quality Assurance framework

6.4      ALICE Requirements for CCRC 08

The ALICE requirements are: 

-       Workload Management
Currently in hybrid mode (using both the LCG RB and the gLite WMS)
In order to use the gLite WMS, ALICE needs the gLite VO Box suite

-       FTS service only needed for Tier-0 to Tier-1 transfers
ALICE has no constraint on the SRM version

-       xrootd interfaced with all supported gLite SEs is necessary for ALICE

-       xrootd must be supported at the sites where it is running (a sketch of this uniform access is given after this list).
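To illustrate why these xrootd interfaces matter to ALICE (a sketch under assumed host names and paths, not an ALICE tool): with an xrootd door in front of each SE, the same PyROOT access works regardless of whether the backend is dCache, CASTOR2 or DPM.

import ROOT

urls = [
    "root://xrootd.example-dcache-t1.org//alice/data/run1234/AliESDs.root",
    "root://castor-xrootd.example.cern.ch//castor/cern.ch/alice/run1234/AliESDs.root",
]

for url in urls:
    f = ROOT.TFile.Open(url)   # TFile.Open dispatches on the root:// protocol
    if f and not f.IsZombie():
        print(url, "opened, size", f.GetSize(), "bytes")
        f.Close()
    else:
        print(url, "could not be opened")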

 

L.Robertson asked for clarification about the xrootd interfaces needed.

Y.Schutz replied that the interfaces to xrootd are necessary for ALICE: for dCache they are already in operation; for CASTOR2 and DPM they are in the testing phase but almost complete. In addition, xrootd must be supported at all sites where it is running, and this has been agreed directly by ALICE with most of the sites.

 

L.Dell’Agnello noted that at the CASTOR workshop it was not clear who will provide support for the xrootd interface to CASTOR.

T.Cass confirmed that the CASTOR interface has been developed at SLAC but there is no formal agreement about future support and maintenance.

I.Bird added that similar issues could be mentioned for DPM.

 

I.Bird asked how this is going to work at CNAF with the usage of STORM for disk-only storage.

L.Dell’Agnello suggested that ALICE contact the STORM developers in order to discuss the xrootd interface to STORM.

Y.Schutz agreed that ALICE should check whether a STORM interface is needed and, if so, contact the STORM developers at CNAF.

 

The current ALICE readiness is:

-       FTS/SRM have not been tested at the nominal transfer rates

-       Grid services for reconstruction and simulation at T0-T1-T2 are all available.

-       SE with xrootd
dCache in production (GSI, CCIN2P3, SARA, NDGF, FZK)
CASTOR2 with xrootd in an advanced testing phase
DPM prototype under test at CERN and Torino

6.5      Resources Issues

There has been some progress in using the CPU resources allocated to ALICE.

The storage in external SEs is gradually becoming operational.

 

The resource problem is not solved even with the new contributions, as shown in red in the table below.

 

 

L.Robertson asked whether ALICE still assumes that an ion run will take place in 2008.

Y.Schutz replied that ALICE assumes a one-week ion run with reduced luminosity in 2008. If an official statement changes the LHC plans then ALICE will also change their plans.

6.6      Milestones Status

 

MS-118 - Sep 07:  AliRoot and analysis package release for day 1 - postponed to May 2008

MS-119 - Oct 07:  AliRoot release for detector commissioning - done

MS-120 - Oct 07:  MC raw data for FDR - ongoing

MS-121 - Oct 07:  on line DA and shuttle integrated in DAQ - postponed to February 2008 (FDR Phase II)

MS-122 - Oct 07:   FDR Phase II - postponed to February 2008

MS-123 - Oct 07:  online analysis with CAF -  done

 

MS-124 - Feb.  08:  Start of FDR Phase II

MS-125 - Apr 08:  Start of FDR Phase III

MS-126 - Feb 08:  Ready for CCRC 08

MS-127 - Apr 08:  Ready for CCRC 08

 

7.    AOB

 

 

No AOB.

 

I.Bird reminded the sites to send the updated status of the HL Milestones to A.Aimar.

 

8.    Summary of New Actions

 

The full Action List, current and past items, will be on this wiki page before the next MB meeting.