LCG Management Board
Tuesday 30 October 2007 16:00-17:00 – Phone Meeting
(Version 1 - 2.11.2007)
I.Bird (chair), K.Bos, T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, C.Grandi, F.Hernandez, M.Lamanna, E.Laure, U.Marconi, P.Mato, G.Merino, R.Pordes, Di Qing, H.Renshall, L.Robertson, Y.Schutz, R.Tafirout
Mailing List Archive:
Tuesday 6 November 2007 16:00-18:00 – F2F Meeting at CERN
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous meeting were approved.
1.2 Sites Names
A.Aimar is collecting the names of the sites in order to use consistent names across the various reports, tables, etc.
Ph.Charpentier asked what the meaning of “DE-KIT" is.
I.Bird explained that the grid centre is now going to be called “Karlsruhe Institute of Technology”, the merger of FZK and the University of Karlsruhe.
Information as of 2.11.2007; the confirmed names are shown in red.
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
Update by I.Bird during this meeting.
Not done. Only one new plan received from ES-PIC.
The Sites that have sent their acquisition plans are: TW-ASGC, US-T1-BNL, DE-KIT, ES-PIC and FR-CCIN2P3.
The others should send them to H.Renshall as soon as possible.
· 21 October 2007 - D.Barberis agreed to clarify with the Reviewers the kind of presentations and demos that they are expecting from the Experiments at the Comprehensive Review.
Ongoing. D.Barberis started the discussions with the Reviewers and with the other Computing Coordinators. He will send a summary via email in the coming days.
3. SRM v2.2 Update
H.Renshall presented the weekly update on the SRM 2.2 deployment.
dCache Patches - The dCache patch 23/gridftp problem reported last week was quickly resolved; it turned out to be a configuration problem.
Yesterday NDGF started upgrading their production SE to dCache 1.8.0-0 with SRM v2.2 and could have been the first Tier-1 to advertise SRM v2.2 in production.
However, NDGF have decided not to advertise SRM v2.2 in production but to continue to run SRM v1.1 as the SRM interface to dCache 1.8.0-0.
F.Donno is going to talk to them in order to understand the reason for this decision. In fact, the high-level utilities such as FTS, gfal and lcg-utils still default to SRM v1, so there should be no problem in advertising SRM v2.2 as well. Most other dCache sites are on patch level 26.
FZK is upgrading the SRM v2.2 test endpoint to 1.8.0-0 today and should be able to switch on space management.
SARA and IN2P3 have been invited to follow that example soon, since a problem discovered while using lcg-cp and the tape system at SARA is cured by this version. Essentially dCache testing and upgrades are proceeding as planned.
LHCb Tests - LHCb has performed several FTS tests to SRM v2.2 at IN2P3 and these went smoothly. There were problems with FTS transfers between CERN (SRM v2.2) and CNAF (CASTOR SRM v2.2) but this seems due to a misconfiguration of the FTS channel.
LHCb also tried testing lcg-utils with the following results:
- NIKHEF – OK
- CNAF - OK (CASTOR)
- IN2P3 - problems copying data from the WN; site informed. They can, however, use dccp from a CERN UI outside Lyon without problems.
- FZK - problems because the test endpoint was not published in the production BDII (although it is today) and because of a mismatch between the version of gLite (3.0.2) installed on the WNs at FZK and that used by the distribution of the latest release of gfal/lcg-utils that F.Donno made available as a tar file to LHCb.
GLUE Schema Information - There is now an agreement on what information is to be published in the GLUE schema.
An example is available in the GSSD pages: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDGLUEExample
A new set of SAM scripts is also available to verify that a site publishes information in accordance with the example.
The SAM scripts are run several times per day and the results - with explanations of any errors that occurred - are published at: http://lxdev25.cern.ch/s2test/bdii/s2_logs/
F.Donno has circulated this information to the sites and invited them to contact her in case of problems.
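As an illustrative sketch of the kind of check the SAM scripts perform, a site could verify that its BDII publishes an SRM v2.2 service entry. The LDIF snippet and endpoint below are hypothetical (modelled on GLUE 1.3 `GlueService` attributes, not copied from the GSSD example); real output would come from an ldapsearch query against the site BDII.

```python
# Minimal sketch: check that a BDII LDIF dump advertises an SRM v2.2 service.
# The LDIF snippet and hostname below are hypothetical examples.
ldif = """\
dn: GlueServiceUniqueID=httpg://srm.example.org:8443/srm/managerv2,mds-vo-name=resource,o=grid
GlueServiceUniqueID: httpg://srm.example.org:8443/srm/managerv2
GlueServiceType: srm
GlueServiceVersion: 2.2.0
"""

def parse_ldif(text):
    """Parse simple 'Attribute: value' LDIF lines into a dict of lists."""
    entry = {}
    for line in text.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            entry.setdefault(key, []).append(value)
    return entry

entry = parse_ldif(ldif)
is_srm_v2 = (entry.get("GlueServiceType") == ["srm"]
             and entry["GlueServiceVersion"][0].startswith("2.2"))
print(is_srm_v2)  # True for the sample entry above
```

A real check would also validate the storage-area and space-token attributes against the GSSD example page linked above.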
K.Bos asked whether SARA had updated to the latest version.
H.Renshall replied positively and confirmed that the latest version of dCache cures the problems observed at SARA with the tape system.
I.Bird asked whether the ATLAS tests that are being run are also checking the space tokens functions.
K.Bos replied that ATLAS has tested only the throughput performance between the Tier-0 site and the Tier-1 sites, not the space token functions. Explicit SRM 2.2 functionality testing will start after M5, in about 10 days.
4. CCRC Update (Minutes)
H.Renshall presented the weekly update about the CCRC 08 Planning activities.
Data Rates - Replies on data rates for the February run have now been received from all except ATLAS.
Storage Requirements - F.Donno has circulated to the experiments the questions of how much, if any, of the February data is to be kept, and whether the temporary and permanent storage for the run can be taken out of the existing experiment resource planning or needs to be added to it.
The fractions to be kept are not yet known but so far we know that:
- CMS and ALICE will keep the detector parts of the February runs and
- LHCb will keep none but need extra resources during the run.
- ATLAS information is missing.
ALICE (L.Betev) queried sites on how to separate real data from mock data while still using the same SRM endpoint. This still needs to be checked. At the Tier-0, MCS proposed that different directories with different 'file classes' should be used.
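A minimal sketch of the directory-based separation proposed for the Tier-0: real and mock data share one SRM endpoint but land in different directory trees, each mapped to its own file class. The path prefixes and class names below are hypothetical, for illustration only.

```python
# Hypothetical mapping from directory prefix to a CASTOR-style 'file class';
# real and injected mock data use the same SRM endpoint but different trees.
FILE_CLASSES = {
    "/castor/cern.ch/alice/raw/": "alice_raw",       # real detector data
    "/castor/cern.ch/alice/mockraw/": "alice_mock",  # injected mock data
}

def file_class_for(path):
    """Return the file class for a path, based on its directory prefix."""
    for prefix, fclass in FILE_CLASSES.items():
        if path.startswith(prefix):
            return fclass
    raise ValueError("no file class configured for %s" % path)

print(file_class_for("/castor/cern.ch/alice/raw/run1234/file.root"))      # alice_raw
print(file_class_for("/castor/cern.ch/alice/mockraw/run0001/file.root"))  # alice_mock
```

With such a scheme, the mass-storage system can apply different migration and retention policies per class while the experiment keeps a single endpoint.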
Planning - A.Aimar will distribute an updated version of the milestone plan. Feedback should be provided this week and will be reviewed at the pre-GDB Face-to-Face meeting.
Critical Services - Experiments are asked to provide a list of critical services following the CMS model - see twiki: https://twiki.cern.ch/twiki/bin/view/CMS/SWIntCMSServices
Next week will be the pre-GDB Face-to-Face from 13.00 to 16.00 on 6 November in the IT Auditorium. See the agenda at http://indico.cern.ch/conferenceDisplay.py?confId=22709. This meeting should finalise the important details of the February run.
I.Bird asked whether only CMS has sent some information on the critical services.
H.Renshall replied that, except CMS, the other experiments had not yet provided the information.
K.Bos added that ATLAS is working on defining exactly the needs for their FDRs. They will de-couple the detector tests and the transfers to the Tier-1 sites. The data will be transferred to the sites from CASTOR. And the plan, once completed, will include the critical services needed.
Ph.Charpentier said that the list was provided in the past and that LHCb will provide it again in a format similar to the CMS list.
H.Renshall noted also that the list of the VO boxes is needed.
G.Merino asked whether the list is only about services at CERN or also at other sites.
H.Renshall replied that the current list is for the critical services needed at CERN only, but if the Experiments have critical services at other sites they should say so. For instance, CMS has specified PhEDEx as a critical service at their sites.
The Experiments provide the list of critical services at CERN and other sites for CCRC 08 to H.Renshall.
5. Job Priorities Update (Document) - I.Bird
I. Bird summarized the document he had distributed to the MB mailing list.
Basically the tests on the certification system have been passed. The PPS setup at CERN is on SL4 with the LCG CE; the issues encountered will be fixed in the next couple of days. After that 2 other PPS sites will be set up.
The accounting system must be adapted to deal with groups and roles in APEL accounting, in the CESGA portal, etc.
A request was made for YAIM to allow sites to reconfigure APEL without necessarily touching the rest of the APEL site configuration.
R.Pordes asked what the plans for interoperability with the OSG systems are. In the past it had been agreed that new PPS releases should have interoperability tests and verifications.
I.Bird replied that there are agreed PPS sites for testing cross-grid job submission.
R.Pordes added that OSG is currently not sending the user DN to the accounting. If that is needed what is the due date?
I.Bird replied that this is needed by ATLAS, who will ask for accounting at the level of roles, groups and user DN.
R.Pordes stressed that a due date is needed because, for the moment, these changes have not been discussed as a top priority for OSG and Globus.
ATLAS should report on how it expects user accounting to be performed on OSG sites.
F.Hernandez asked what the planned time scale for getting to a Job Priority production version is.
I.Bird replied that the JP implementation could be ready by the end of 2007, but deployment will be discussed in detail with the sites and well in advance.
6. ALICE Quarterly Report and Plans (Slides) - Y.Schutz
Y.Schutz presented the ALICE Quarterly Report for 2007Q3.
6.1 Physics Data Challenge 07
The PDC 2007 showed excellent stability of the central services. The sites delivered > 90% of pledged resources and new sites are joining (Wuhan, Hiroshima).
The plot below shows the number of jobs in the last 6 months. The trend is clearly towards stability.
6.2 CAF
The CAF cluster with PROOF is in production. Disk quotas, fair-share CPU targets for groups, and data staging with PROOF Datasets are under development.
The plot below shows the usage of the CAF. Most of it is for generation of MC data.
6.3 ALICE Full Dress Rehearsal
The ALICE FDR is split in 3 phases.
Phase 1: DEC-2007
Phase 1 will mostly consist of cosmic rays data taking and calibration runs for detector commissioning (started already for some detectors, in lab).
The steps to follow will be:
- the Registration in CASTOR2 + Grid File Catalog (OK so far)
- the Replication Tier-0 to Tier-1 sites synchronously with data taking using FTD/FTS utilities (tested with FTS v.2).
- the Asynchronous replication to CAF (OK so far)
- the Pass 1 reconstruction on Grid at the Tier-0 site (OK so far)
- Interactive expert analysis with PROOF on the CAF (OK so far)
As a reminder ALICE has no critical dependence on the SRM version.
The status of Phase 1 is:
- Three detectors are already taking cosmic data on surface
- DAQ registration is working 100%, with no failures in one month
The plot below shows the increasing amount of the files sent from the DAQ to CASTOR by ALICE in the last two months.
The next steps are:
- commissioning exercise, that will start in situ in December
- generated data fed into the DAQ data flow to reach nominal pp data rates
The continuous replication Tier-0 to Tier-1 is pending:
- The current RAW rate from detectors is very low: 0.2 MB/s (target 60 MB/s for pp)
- ALICE needs to re-establish the tape storage at Tier-1 sites
- The replication with nominal rate will now be done with injected RAW data
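The gap behind the plan to inject RAW data can be quantified directly from the figures above (0.2 MB/s today against a 60 MB/s pp target):

```python
# Figures from the status above: current RAW rate vs. the nominal pp target.
current_rate_mb_s = 0.2   # MB/s currently delivered by the detectors
target_rate_mb_s = 60.0   # MB/s nominal rate for pp running

# Factor by which the rate must grow; injected RAW data makes up the gap.
shortfall_factor = target_rate_mb_s / current_rate_mb_s
print(round(shortfall_factor))  # 300
```

In other words, the detector data alone covers only about 1/300 of the nominal rate, hence the replication exercise must rely on injected RAW data.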
The Pass 1 reconstruction will require:
- Reconstructions on the Grid driven by detector experts;
- for the moment rapid changes in the reconstruction software are required, and this makes automatic processing difficult.
Phase II: From FEB-2008
Phase II will need all elements of Phase 1 and, in addition:
- Second pass reconstruction at T1s
- Collection and registration of conditions data from DAQ
- Detectors ECS, DCS, HLT data during their commissioning (already being tested)
- On line detector algorithms in DAQ/DCS/HLT
- Data transit through File Exchange Servers
- Shuttle registers condition objects and metadata in Grid File Catalog
Phase III: From April-2008
All elements of Phase I and Phase II and:
- Gradual inclusion of online Detector Algorithm and Quality Assurance framework
6.4 ALICE Requirements for CCRC 08
The ALICE requirements are:
- Workload Management
- FTS service only needed for Tier-0 to Tier-1 transfers
- xrootd interfaced with all supported gLite SEs is necessary for ALICE
- xrootd supported at the sites where it is running
L.Robertson asked clarifications about the xrootd interfaces needed.
Y.Schutz replied that the interfaces to xrootd are necessary for ALICE: the dCache interface is already in operation, while those for CASTOR2 and DPM are in the testing phase but almost complete. In addition, xrootd must be supported at all sites where it is running; this has been agreed directly by ALICE with most of the sites.
L.Dell’Agnello noted that at the CASTOR workshop it was not clear who will provide support for the xrootd interface to CASTOR.
T.Cass confirmed that the CASTOR interface has been developed at SLAC but there is no formal agreement about future support and maintenance.
I.Bird added that similar issues could be mentioned for DPM.
I.Bird asked how this is going to work at CNAF, given its usage of STORM for disk-only storage.
L.Dell’Agnello suggested that ALICE contacts the STORM developers in order to discuss the xrootd interface to STORM.
Y.Schutz agreed that ALICE should check whether a STORM interface is needed and, if so, contact the STORM developers at CNAF.
The current ALICE readiness is:
- FTS/SRM have not been tested at the nominal transfer rates
- Grid services for reconstruction and simulation at T0-T1-T2 are all available
- SE with xrootd
6.5 Resources Issues
There has been some progress in using the CPU resources allocated to ALICE.
The storage in external SEs is gradually becoming operational.
The resources problem is not solved even with the new contributions, as shown in red in the table below.
L.Robertson asked whether ALICE still assumes that an ion run will take place in 2008.
Y.Schutz replied that ALICE assumes a one-week ion run with reduced luminosity in 2008. If an official statement changes the LHC plans, then ALICE will also change its plans.
6.6 Milestones Status
MS-118 - Sep 07: AliRoot and analysis package release for day 1 - postponed to May 2008
MS-119 - Oct 07: AliRoot release for detector commissioning - done
MS-120 - Oct 07: MC raw data for FDR - ongoing
MS-121 - Oct 07: on line DA and shuttle integrated in DAQ - postponed to February 2008 (FDR Phase II)
MS-122 - Oct 07: FDR Phase II - postponed to February 2008
MS-123 - Oct 07: online analysis with CAF - done
MS-124 - Feb. 08: Start of FDR Phase II
MS-125 - Apr 08: Start of FDR Phase III
MS-126 - Feb 08: Ready for CCRC 08
MS-127 - Apr 08: Ready for CCRC 08
I.Bird reminded the sites to send the updated status of the HL Milestones to A.Aimar.
8. Summary of New Actions
The full Action List, current and past items, will be in this wiki page before the next MB meeting.