LCG Management Board
Tuesday 17 June 2008, 16:00-17:00
(Version 1 - 20.6.2008)
I.Bird (chair), K.Bos, D.Britton, T. Cass, Ph.Charpentier, L.Dell’Agnello, F. Donno, I. Fisk, S.Foffano (notes), A.Heiss, M.Kasemann, D.Kelsey, H.Marten, P.Mato, P. McBride, G.Merino, A.Pace, B.Panzer, Di Qing, M. Schulz, Y.Schutz, J.Shiers, O.Smirnova
Mailing List Archive
Tuesday 24 June 2008 16:00-17:00 – Phone Meeting
Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
1.2 GridKA announcement - H.Marten
H.Marten announced a change in the personnel at GridKA. After 7 years the Project Manager job will pass from H. Marten to A.Heiss with H. Marten acting as deputy. I. Bird thanked H. Marten for his contribution and welcomed A.Heiss.
1.3 T1 accounting – I. Bird
Following the release of the T1 accounting report for May, I. Bird reported that problems with the report have been signalled by H. Marten and J. Gordon, and questions were raised by R.Pordes. The report is being analysed and will be corrected and re-distributed as soon as possible. I. Bird will summarise the process and reply to the questions raised in an email to the MB.
1.4 Requirements and Pledges for 2013 – I. Bird
I. Bird announced that the 5 year outlook for pledge data raised at the end of the last meeting by M.Kasemann was discussed with J.Engelen. It was agreed that this is an issue to be raised at the C-RRB meeting in November and no change can take place before C-RRB agreement. I. Bird therefore requested that the experiments provide their resource requirements for 2013 by 01/07/08 to enable the sites to provide their confirmed pledge data for 2009, and planned pledge data from 2010-2013 inclusive for the next C-RRB meeting.
Action List Review (List of actions)
I. Bird reviewed the only relevant outstanding action:
The only information still missing is:
- ATLAS Alarm Email Address
- CMS list of 4 user DNs that can post alarms to the sites.
For ATLAS, K.Bos confirmed this was in the hands of D. Barberis. I. Bird requested D. Barberis to urgently supply the ATLAS alarm email address. P. McBride committed to posting the 4 CMS user names on the wiki.
Updated mandate for the Joint Security Policy Group (Mandate) – D.Kelsey
I. Bird mentioned that the JSPG mandate was presented at the last GDB meeting and was being presented to the MB for approval.
D. Kelsey reminded members that the aim was to make the JSPG mandate clearer with respect to the primary stakeholders WLCG and EGEE, and to clarify the relationship with the OSG and other non-EGEE grids contributing to the WLCG project.
As there were no comments or questions, I. Bird declared the updated mandate approved.
4. CCRC08 Post Mortem Workshop (Agenda; Workshop Summary; Slides) - J.Shiers
J.Shiers mentioned that relevant information including the presentation given to the OPN is available from the agenda page and his brief intervention would concentrate on 2 main issues:
4.1 Storage Services
The storage services are still not stable and it is not always clear which versions and patches are required. Although this is discussed regularly, the information is not always easy to find. J. Shiers suggested that, in addition to the existing websites, there be a weekly standing item on the Operations Meeting agenda so that people can easily obtain the latest information.
4.2 The service itself
The service is complex, with many dependencies. The number of interventions is limited by the available human resources and the ability to communicate about the dependencies; only 1 or at most 2 interventions a week should be handled. Some interventions have not been fully discussed in advance, and some services still cannot perform the basic recovery required without human intervention.
I. Bird asked to have milestones to address the individual problems. This was agreed starting with VOMS then GridView.
J. Shiers concluded by reminding members that the service is still the challenge.
I. Bird introduced the SRM v2.2 usage agreement as the additional set of functionality regarded as important now. F. Donno stated that her presentation was mainly to outline the proposal and obtain the approval of the Management Board. Version 1.4 of the document is on the CCRC wiki. It has been agreed by service developers, client developers and experiments, mainly ATLAS, CMS and LHCb. The document covers 2 parts – a detailed description of an implementation-independent full solution and an implementation-specific (with limited capabilities) short-term solution that can be made available by the end of 2008.
F. Donno explained that the priority is addressing bugs and performance issues while working on the short-term solution as a way to understand what is really needed in the longer-term and therefore the feasibility of the longer-term proposal. In case the short-term solution demonstrates the ability to adequately address the requirements and use cases of the experiments, then the long-term proposal may be revised.
The short-term solution covers the two main items needed for the experiments: space protection and space selection, together with the space types needed. A summary was given for CASTOR, dCache, DPM and StoRM highlighting what is already available and what will be implemented between now and the end of 2008, with estimated timescales.
I. Bird summarised that the top priority is fixing bugs and performance issues in what already exists. The second priority is the implementation of the short-term solution outlined in F.Donno’s presentation, which addresses the ATLAS problem for processing. The longer-term solution will not start to be implemented until the short-term proposal has been reviewed with real data. K.Bos commented that the proposal has no immediate impact for ATLAS but did not oppose it.
I. Bird asked the other experiments to comment, as this is the last set of functionality which can be added in a short-term timescale. M.Kasemann agreed to the proposal and there were no objections from ALICE and LHCb; therefore it was adopted.
LHCb Quarterly Report (Slides)
Ph.Charpentier presented the LHCb Quarterly Report (March-May 2008). See the Slides for all details on the data presented.
6.1 Overview of the Quarter
The main LHCb activities since February 2008 were focused in the following areas:
- Applications and Core Software: Preparation of applications for real data. Simulation with real geometry (from survey). Certification of GEANT4 9.1. Alignment and calibration procedures are now in place.
- Production activities (aka DC06): New simulations on demand. Continue stripping and re-processing, dominated by data access problems (and how to deal with them).
- Core Computing: Building on CCRC08 phase 1. Improved WMS and DMS. Introduce error recovery, failover mechanism etc. Commission DIRAC3 for simulation and analysis. Only DM and reconstruction exercised in February
6.2 Sites Configuration
The Conditions DB and LFC are replicated (3D) to all LHCb Tier-1 sites. A read-only LFC mirror is available at all Tier-1s.
Sites SE migration is progressing:
- RAL migration from dCache to Castor2. Very difficult exercise (bad pool configurations). Took over 8 months to complete, but it is over.
- PIC migration from Castor1 to dCache. Fully operational, Castor decommissioned since March.
- CNAF migration of T0D1 and T1D1 to StoRM. Went very smoothly for migrating existing files from Castor2.
SRM v2 spaces: All spaces needed for CCRC were well in place. SRM v2 still not used for DC06 production (see later)
6.3 DC06 Production Issues
DC06 was still using DIRAC2, i.e. SRM v1. There are no plans to back port the usage of SRM v2 to DIRAC2.
LHCb could update the LFC entries to srm-lhcb.cern.ch (v2) for reading but still needs srm-durable-lhcb.cern.ch for T0D1 upload and needs a srm-get-metadata that works for SRM v2. When SRM v2 and v1 share the same end-point there is no problem with DIRAC2, which was also successfully checked against StoRM endpoints.
File access problems were significant and dominated the re-processing activities in CCRC.
Castor sites: They worked adequately. LHCb would like to have rootd at RAL (problems known for 4 years with the rfio plug-in in ROOT, alleviated with rootd).
dCache sites: No problems when using the dcap protocol (PIC, GridKa), many problems with gsidcap (IN2P3,
6.4 CCRC08 Summary
The planned CCRC08 May activities for LHCb were to maintain the equivalent of 1 month of data taking, assuming a 50% machine cycle efficiency, and to run fake analysis activity in parallel with production-type activities.
Analysis-type jobs were used for debugging throughout the period, and GANGA testing ran at a low level during the last weeks.
Below are the shares of CPU activities among Tier-1 sites. The dominant sites are IN2P3 and NL-T1.
Pit to Tier-0 Transfers - The first exercise was the transfer of data from the LHCb pit to the Tier-0.
rfcp was used to copy data from the pit to CASTOR (rfcp is the approach recommended by IT). One file was sent every ~30 sec, and the data remained on online disk until CASTOR migration.
The rate to CASTOR was ~70 MB/s.
Below is the data flow in and out of CASTOR in May. The gaps are due to problems in the online area that required the interruption of all data transfers.
Tier-0 to Tier-1 Sites - FTS from CERN to Tier-1 centres with the shares as in the table above. Transfer of RAW will only occur once data has migrated to tape and the checksum is verified.
The rate out of CERN was about 35 MB/s averaged over the period. The peak rate reachable is far in excess of the LHCb requirement. In smooth running all sites matched LHCb requirements. The table below shows the efficiency of the transfers on the first transfer attempt (in the end all files were successfully transferred).
One can see the IN2P3 and CERN problems due to SRM endpoints issues and the CERN power outage at the end of May.
Reconstruction - Only SRM 2.2 SEs were used. The LHCb space tokens are LHCb_RAW (T1D0) and LHCb_RDST (T1D0). The data shares need to be preserved at the sites. The input is 1 RAW file and the output is 1 rDST file (1.6 GB each).
The number of events was reduced from 50k to 25k so that a job lasts about 12 hours on 2.8 kSI2k machines and fits within the available queues. LHCb needs queues at all sites that match its processing time.
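As a rough illustration of the scaling behind this choice, the sketch below assumes that job duration grows linearly with the number of events and inversely with CPU power. The function names are hypothetical, and the per-event cost is derived only from the figures quoted above (25k events, about 12 hours, 2.8 kSI2k).

```python
def cost_per_event(hours, power_ksi2k, events):
    """Return the CPU cost of one event in kSI2k-seconds (linear-scaling assumption)."""
    return hours * 3600 * power_ksi2k / events


def job_hours(events, power_ksi2k, cost):
    """Return the estimated job duration in hours for a given per-event cost."""
    return events * cost / power_ksi2k / 3600


# Figures quoted in the minutes: 25k events take about 12 hours on a 2.8 kSI2k machine.
cost = cost_per_event(hours=12, power_ksi2k=2.8, events=25_000)
print(round(cost, 2))  # 4.84 kSI2k-seconds per event

# Under the same assumption, the original 50k-event job would take about twice as long.
print(round(job_hours(50_000, 2.8, cost), 1))  # 24.0
```

Under this assumption the original 50k-event jobs would have needed roughly 24 hours on the same machines.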
After the file transfer the file should be online, as jobs are submitted immediately. LHCb pre-stages files and then checks the status of each file before submitting the pilot job. Pre-staging should ensure that the file can be accessed from cache.
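The pre-stage, check, then submit sequence described above can be sketched as follows. This is a toy model only: `FakeStorage`, `submit_ready` and the file names are hypothetical, the storage is simulated in memory, and real staging is asynchronous. It is not the DIRAC or SRM client interface.

```python
ONLINE = "ONLINE"      # file is available from disk cache
NEARLINE = "NEARLINE"  # file is on tape only


class FakeStorage:
    """In-memory stand-in for a storage element (illustrative only)."""

    def __init__(self, status, stuck=()):
        self.status = dict(status)
        self.stuck = set(stuck)  # files whose staging never completes

    def prestage(self, path):
        # A real pre-stage request is asynchronous; here it completes at once
        # unless the file is marked as stuck.
        if path not in self.stuck:
            self.status[path] = ONLINE

    def file_status(self, path):
        return self.status[path]


def submit_ready(storage, paths):
    """Pre-stage each file and return only those reported online afterwards."""
    ready = []
    for path in paths:
        storage.prestage(path)
        if storage.file_status(path) == ONLINE:
            ready.append(path)  # a pilot job would be submitted for this file
    return ready


storage = FakeStorage({"raw_001": NEARLINE, "raw_002": NEARLINE}, stuck={"raw_002"})
print(submit_ready(storage, ["raw_001", "raw_002"]))  # ['raw_001']
```

The point of checking the status after the pre-stage request is that a job is only submitted for files the storage actually reports as online.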
The jobs submitted are summarized in the picture below:
- 41.2k reconstruction jobs submitted
- 27.6k jobs proceeded to done state and processed all events and generated the output file.
- 21.2k jobs processed all 25k events; the others failed because dCache problems returned EOF to the application.
G.Merino noted that other sites not using dCache sometimes have the same problem.
Ph.Charpentier replied that the issue is being investigated.
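For orientation, the reconstruction job counts listed above can be turned into fractions. The sketch below uses only the rounded figures from the minutes (41.2k, 27.6k, 21.2k), so the percentages are approximate; the variable and function names are illustrative.

```python
submitted = 41_200   # reconstruction jobs submitted
done = 27_600        # jobs that reached the done state
full = 21_200        # jobs that processed all 25k events


def pct(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


print(pct(done, submitted))  # 67.0
print(pct(full, submitted))  # 51.5
print(pct(full, done))       # 76.8
```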
The main issue was at NL-T1, with the reporting of file status. This was discussed and solved last week during the Storage session (dCache version). If the jobs were executed now they would succeed.
At IN2P3 the problem of opening files is understood but the site will have to fix it.
At NL-T1 dcap cannot access the files directly across sites (SARA and NIKHEF) and the files are copied on the local disk of the WN (sometimes from CERN).
Below is the CPU efficiency at each site. The problems at CNAF were due to access to the central software area, so the binaries had to be installed locally every time. RAL and IN2P3 require downloading the data, which makes the application wait up to 4 hours. These issues are understood and fixed now.
6.5 DCache Observations
The official LCG recommended version was 1.8.0-15p3. LHCb ran smoothly at half of the T1 dCache sites, only those configured in unsecure mode.
- PIC OK - version 1.8.0-12p6 (unsecure)
- GridKa OK - version 1.8.0-15p2 (unsecure)
- IN2P3 - problematic - version 1.8.0-12p6 (secure)
Segmentation faults - needed to ship version of GFAL to run
Could explain CGSI-gSOAP problem.
- NL-T1 - problematic (secure)
Many versions during CCRC to solve number of issues
1.8.0-14 -> 1.8.0-15p3->1.8.0-15p4
“Failure to put data - empty file” then “missing space token” problem then “incorrect metadata returned”, NEARLINE issue
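Since it is not always obvious which sites run the recommended version, the version strings quoted above can be compared programmatically. The sketch below assumes that a string such as 1.8.0-15p3 orders as (major, minor, micro, release, patch) and that a missing "p" suffix means patch level 0. This ordering is an assumption for illustration, not a statement of the dCache versioning policy, and the NL-T1 entry uses the last version listed above.

```python
import re

# Assumed layout: major.minor.micro-release[pPATCH], e.g. "1.8.0-15p3".
_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)(?:p(\d+))?$")


def parse_version(text):
    """Turn a version string into a tuple of integers that compares naturally."""
    match = _VERSION.match(text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch suffix is treated as patch level 0 (assumption).
    return tuple(int(part) if part is not None else 0 for part in match.groups())


RECOMMENDED = parse_version("1.8.0-15p3")

site_versions = {
    "PIC": "1.8.0-12p6",
    "GridKa": "1.8.0-15p2",
    "IN2P3": "1.8.0-12p6",
    "NL-T1": "1.8.0-15p4",
}

below = sorted(
    site for site, version in site_versions.items()
    if parse_version(version) < RECOMMENDED
)
print(below)  # ['GridKa', 'IN2P3', 'PIC']
```

Note that, according to the observations above, the sites that ran smoothly were distinguished by their unsecure configuration rather than by their version.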
Stripping runs on rDST files: each job reads 1 rDST file and the associated RAW file to produce DST files and an ETC (Event Tag Collection), which are stored locally on T1D1 during the process. The DST and ETC files are then distributed to all other computing centres on T0D1 (except CERN T1D1).
A total of 31.8k stripping jobs were submitted, 9.3k jobs ran to “Done” and the major issues were with LHCb book-keeping.
6.7 Lessons Learned and Outlook for DIRAC 3
The main areas for improvement are:
- Error reporting in workflow and pilot logs. Careful checking of log files was required for detailed analysis.
- Full failover mechanism is in place but not yet deployed; only CERN was used for CCRC08.
- Alternative forms of data access. Minor tuning of the timeout for downloading input data was required.
- Continue CCRC-like exercise for testing new releases of DIRAC3. One or two 6-hour runs at a time
- Adapt Ganga for DIRAC3 submission. Delayed due to an accident of the developer.
- Commencing tests with xrootd with CASTOR and dCache for future possible support.
LHCb would like to test a “generic” pilot agent mode of running, even in absence of gLExec.
LHCb has developed a “time left” utility for all sites and batch systems. LHCb can restrict this mode to running LHCb applications only (no user scripts), and this does not cause any security risk higher than for production jobs (for a limited duration of time).
Ph.Charpentier asked that LHCb be allowed to run pilot jobs in the absence of gLExec for a limited amount of time. He will send a memo asking that this be agreed by the WLCG.
I.Bird asked whether the request is the support of the xrootd protocol only.
Ph.Charpentier confirmed that the interest is in having the xrootd protocol supported in read mode, as it seems to provide faster data access to ROOT files.
J.Shiers asked whether LHCb is going to use AMGA to access their databases.
Ph.Charpentier replied that accessing directly ORACLE in SQL is more efficient. AMGA is a generic interface but is not really needed by LHCb.
I. Bird reminded the Tier 1 sites that their 2008 resource installation status is required for the reporting at the LHCC July review in addition to the other answers to the questions sent by J. Gordon. Some replies have been received but the others are urgently required.
Summary of New Actions
Tier-1 Accounting Report for May to be analysed, corrected and an explanatory email sent to the MB.
Experiments to provide their 2013 resource requirements by 01/07/08 to email@example.com
New service related milestones should be introduced for VOMS and GridView.