LCG Management Board
Tuesday 30 June 2009 16:00-17:00 – Phone Meeting
(Version 1 – 2.7.2009)
A.Aimar (notes), D.Barberis, O.Barring, I.Bird(chair), K.Bos, D.Britton, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, Qin Gang, J.Gordon, F.Hernandez, M.Kasemann, H.Marten, G.Merino, Y.Schutz, R.Tafirout
Mailing List Archive
Tuesday 7 July 2009 16:00-18:00 – F2F Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments received about the minutes. The minutes of the previous MB meeting were approved.
1.2 LHCC Referees Meeting (Agenda)
I.Bird asked that:
- Each Tier-1 Site should send one slide about achievements and issues during STEP09.
- Each Tier-1 Site should send a status report (a few slides) on the current resources and on the new ones to be procured by the end of 2009.
- Each Experiment should send a few slides about the achievements and issues of STEP09 and on the status of its Tier-1 and Tier-2 Sites.
I.Bird added that he will send an email summarizing the requests above.
G.Merino added that he had also asked the Sites for information in preparation for the Post-mortem Workshop.
2. Action List Review (List of actions)
ALICE has approved the NDGF SLA.
The action is considered done for CMS and ALICE, and the SLAs are approved. If issues arise, they will not be the responsibility of the Sites.
· 5 May 2009 – CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.
L.Dell’Agnello stated that CNAF completed their internal tests and will send a report to R.Wartel. The Italian ROC security manager will also send his report next week.
· Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar
Not done by: DE-KIT, FR-CCIN2P3, NDGF, NL-T1, US-FNAL-CMS
Sites can provide what they have at the moment. See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics
Until Sites can provide the required information, they should send URLs to the information that already exists.
I.Fisk noted that the file is available and will be sent. No progress from the other Sites.
A.Aimar noted that just the URL of the XML file is needed.
· A.Aimar to find out how to display SLS information from all Sites directly, without using the SLS interface, for July’s F2F Meeting, and also which metrics Sites are currently displaying.
Will be shown at the F2F.
· 30 Jun 2009 - Sites comment on the ALICE dataflow and rates.
Done. No comments from the Sites.
· M.Schulz should report about the status of the gLExec patch on passing the environment
Will be done at the F2F next week.
3. Operations Weekly Report (Slides)
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.
All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
It was a quiet week with no alarm tickets.
There was only some instability in the Experiments’ frameworks:
- ATLAS post-mortem: Degraded PanDA service, impact on other offline DB services on ATLR
- LHCb reported some issues with DIRAC
From the Sites a few issues too:
- RAL scheduled downtime for move to new Data Centre
- ASGC not stable, with several unscheduled downtimes
- Several CE problems at CERN
Slide 4 shows the summary of the SAM VO tests results. One can notice:
- RAL downtime
- ASGC instabilities
- The INFN report is not corrected in the ATLAS plot.
3.2 Service Incident Reports and Issues
Degraded PanDA Service
Slides 5, 6 and 7 show the post-mortem received
Incident Post Mortem for ATLAS PanDA Monitor, June 23
- At the moment these connections are not pooled, but development of pooling code is being undertaken to improve response times and reduce load on the database. To take proper advantage of this code the http daemon running the service needs to be run in multi-threaded mode (a.k.a. MPM-worker).
- The panda monitor service runs on 3 machines.
- A review of PanDA monitor code development and deployment is being undertaken to ensure no repeat of the above incident occurs. Expansion of the panda monitor test suite to encompass use of the logger service is being done.
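The pooling idea mentioned in the post-mortem can be illustrated with a minimal sketch. This is not the PanDA monitor code: the `ConnectionPool` class and the `fake_connect` factory are hypothetical names, and a plain Python object stands in for a real database connection. The point is only that a fixed set of connections is opened once and then shared, which is why a multi-threaded server process benefits from it.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: connections are opened once and reused."""

    def __init__(self, factory, size):
        # queue.Queue is thread-safe, so worker threads can share the pool.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free (or raises queue.Empty on timeout).
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the request failed.
            self._idle.put(conn)


created = []


def fake_connect():
    """Stand-in for opening a database connection (assumption, not a real driver)."""
    conn = object()
    created.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=2)
for _ in range(5):
    with pool.connection() as conn:
        pass  # a request handler would run its query here
```

Five requests are served here while only two connections are ever opened, which is the load reduction on the database that pooling is meant to give.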
DIRAC issues reported by LHCb
All MC physics production was temporarily halted because of inconsistencies in the DIRAC bookkeeping. This was reported fixed on Friday.
There was an issue with DIRAC task queue scheduling, causing some destructive interference between the 1 billion MC production and other (user) jobs in the system. Is this still an issue?
RAL scheduled downtime for DC move
An interesting example of how to cope with a long downtime of a Tier-1.
ATLAS elected Glasgow as stand-in UK Tier-1, taking over the main transfer and data distribution role while RAL was down.
D.Barberis added that Glasgow acted as the Tier-1 Site for data distribution to the Tier-2 Sites and to other Tier-1 Sites, but not as a full Tier-1, as obviously it has no tape storage. The data will not need to be copied back to RAL.
M.Kasemann asked why ATLAS@RAL is reported green even though RAL is down for all Experiments.
D.Barberis said he will explain it in his report later.
This limited the impact on cosmic data taking to ~24 hours of downtime of the LFC at RAL.
Despite presumably hectic activity with equipment movements, RAL continued to attend the daily conference call.
The move seems to progress according to the original schedule. CASTOR and Batch will be down until next Monday (6/7).
Planning and detailed progress reported at http://www.gridpp.rl.ac.uk/blog/2009/06/23/r89-migration-tuesday-23rd-june/
ASGC site services are not yet stable and fully usable from the VOs’ perspective. ASGC was split from the ATLAS Streams replication because the site was down and space was running out; replication was resumed within the recovery window.
ATLAS reports several unscheduled or extended downtimes; ASGC was taken out of data distribution on several occasions.
CMS decided to give this week as a grace period for ASGC to become fully functional.
No new tickets are opened and open tickets are put on hold, to be resumed on Monday 6/7.
Qin Gang added that there were some problems; all machines have been moved back, but not all CEs are running yet. The timetable to have the whole Tier-1 running should be ready by the end of the week.
I.Bird asked for comments from ATLAS and CMS.
CMS continues the testing. ATLAS is testing the Tier-1 but not the tape system.
CE Instabilities at CERN
Several problems with the CREAM CE were reported by ALICE. Configuration problems and bugs were found (https://savannah.cern.ch/bugs/?48144, https://savannah.cern.ch/bugs/?52392). The service is back, but the underlying cause(s) are not yet fully understood.
Also a service problem with some of the production LCG CEs: hardware problems on ce124. It had been put in scheduled downtime in the GOCDB, but unfortunately the service had not been stopped, and this caused a ‘black hole’ for LHCb. A known bug (globus_gma process pile-up, https://savannah.cern.ch/bugs/?48588) affected two of the production CEs when coming back after the power cut the week before.
Slide 12 shows the black hole effect of a CE on the LHCb pilot submissions. The CE was marked as down in the GOCDB, but LHCb continued to submit jobs to it.
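One way to avoid this kind of black hole on the submitter side is to drop any CE with a declared downtime before choosing targets. The sketch below is purely illustrative and is not how DIRAC or the GOCDB work: the function names, the `declared_downtimes` structure, the downtime dates and the names ce123 and ce125 are all assumptions made for the example.

```python
from datetime import datetime


def in_scheduled_downtime(ce, downtimes, now):
    """True if `now` falls inside any downtime window declared for `ce`."""
    return any(start <= now < end for start, end in downtimes.get(ce, []))


def submittable_ces(ces, downtimes, now):
    """Keep only the CEs that have no declared downtime covering `now`."""
    return [ce for ce in ces if not in_scheduled_downtime(ce, downtimes, now)]


# Hypothetical data: only the name ce124 comes from the report;
# the window and the other CE names are invented for illustration.
declared_downtimes = {
    "ce124": [(datetime(2009, 6, 24), datetime(2009, 6, 27))],
}
candidate_ces = ["ce123", "ce124", "ce125"]

targets = submittable_ces(
    candidate_ces, declared_downtimes, datetime(2009, 6, 25, 12, 0)
)
```

With these inputs ce124 is filtered out, so no pilots would be sent to it even though its service is still answering.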
4. ALICE QR Report 2009Q2 (Slides) – Y.Schutz
Y.Schutz presented a summary of the ALICE activities during the second quarter of 2009.
4.1 Data Taking
Data taking stopped in October 2008 because of a planned intervention on the experiment (cabling modification, installation of additional detectors). However, all Tier-0 tasks were run continuously, except the export to Tier-1 Sites, in order to save storage.
The online calculation of condition parameters (DAQ, HLT, DCS) was also performed routinely, and online reconstruction of a sampled set of data is run synchronously with data taking. Online monitoring and QA are partly ready.
The general ALICE framework is operational and the detector implementation is in progress. Data taking of cosmic data with the complete detector will resume in August 2009.
4.2 Data Processing
All cosmic data have been reconstructed in a first pass, while additional reconstruction passes of a selected sample of cosmic data were done with an updated reconstruction algorithm and condition parameters at the Tier-0 and Tier-1 Sites.
Reconstructed data were analyzed on the two Analysis Facilities, at CERN and GSI, and on the GRID. As for data taking, data processing of cosmic data with the complete detector will resume in August 2009.
During STEP09 ALICE performed data replication from the Tier-0 to 6 Tier-1 Sites and sustained a 300 MB/s rate. This meets ALICE’s requirements for Heavy Ions data transfers.
The data transfer from the LHC P2 to the Tier-0 was also exercised: it sustained an xrootd copy at 1.25 GB/s from P2 to the CASTOR T0D1 storage.
It also sustained 1.25 GB/s from CASTOR T0D1 to CASTOR T1D0 (i.e. tapes), and the test at 500 MB/s is still ongoing, compatible with the presently deployed disk pool resources.
Data processing was also tested during STEP09, but first-pass reconstruction at the Tier-0 is still pending until cosmic data become available.
Additional reconstruction passes at the Tier-1 Sites are also pending until cosmic data become available.
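To give a feeling for the STEP09 rates quoted above, the following sketch converts a sustained rate into a daily volume. It is simple arithmetic, assuming decimal units (1 TB = 1,000,000 MB) and a rate held constant for a full 24 hours; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day(rate_mb_per_s):
    """Daily volume in TB for a constant rate in MB/s (decimal units assumed)."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


# Rates taken from the report: Tier-0 to Tier-1 replication, the ongoing
# tape test, and the P2 to CASTOR copy (1.25 GB/s = 1250 MB/s).
for label, rate in [
    ("T0 -> T1 replication", 300),
    ("T0D1 -> T1D0 test", 500),
    ("P2 -> CASTOR T0D1", 1250),
]:
    print(f"{label}: {rate} MB/s is about {tb_per_day(rate):.1f} TB/day")
```

Under these assumptions 300 MB/s corresponds to roughly 26 TB per day and 1.25 GB/s to roughly 108 TB per day.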
4.4 MC Production
MC production was only run if needed. ALICE does not have a “keeping processors warm” strategy. In this period a large production for the EMCAL PPR is in progress.
Concerning end-user analysis, studies of various types of SE performance and tuning of SE parameters per site were done. The number of users is clearly increasing, with about 80 users regularly running on the GRID.
MC production will be continued over the whole year with large pp Min Bias productions to be started as well as several smaller first physics productions and AA productions depending on 2009/2010 LHC plans.
4.5 ALICE Software and Services
ALICE Software (AliRoot) underwent a significant new design and implementation of the raw data format, design of the offline calibration and alignment framework and finalization of the online-offline QA framework.
A strict MC data validation strategy was introduced as well as the finalization of PROOF based parallel reconstruction and a new implementation of the realistic trigger simulation.
A new version (v2.17) of the AliEn Services is in preparation and will be deployed in mid-July 2009.
- xrd3cp: data replication between CASTOR pools, critical for ALICE CASTOR2 operation
- xrootd: improvements in the server and client; improved resilience of connection over WAN
- Streamlined CREAM-CE submission
- Simplified transfer mechanism exercised successfully during STEP09
- Improved catalogue structure with faster access and increased capacity
- Introduction of quotas per user for jobs and storage space
Job submission now uses only the WMS, and the submission status has become stable.
CREAM CE deployment is ongoing but slow, and has reached only 50% of the ALICE Sites.
There were several improvements to the analysis framework and new end-user functionalities for grid and PROOF analysis, such as extended merging options and a new submission policy.
The ALICE analysis train was also significantly improved:
- Simplified method to add wagons; wagons from Physics WGs now adopt a uniform style
- Configurations are now saved for further replay
- Higher tag and release frequency for analysis software
55M events were processed in one single trip, producing 2750 AODs (~1 TB). The success rate is 50-80% without re-submission, which clearly calls for better stability of the SEs at Tier-1 and Tier-2 Sites.
The CAF, PROOF based reconstruction and analysis, is in production state, and was adopted by many (121) users with more than 20 concurrent users; datasets 200 – 1000 GB, up to 2M events per data set.
Below is the summary of the current and coming ALICE milestones:
- MS-129 Mar 09: Analysis train operational: done
- MS-130 Jun 09: CREAM CE deployed at all sites: 50%
- MS-131 26 Jun 09: AliRoot release ready for data taking: postponed to July
- MS-132 14 July 09: release of AliEn v2-17
- MS-133 30 July 09: deployment of AliEn v2
5. ATLAS QR Report 2009Q2 (Slides) – D.Barberis
D.Barberis presented a summary of the ATLAS activities during the second quarter of 2009.
5.1 Tier-0 and Data taking activities
The ATLAS partial read-out tests ("slice weeks") started in March and continued until mid-June 2009. These were mostly DAQ tests, turning into more complete tests, including Trigger, during May.
Global cosmic data-taking runs restarted last week, with different detector, field and trigger configurations. There is no data transfer to RAL and ASGC.
The STEP'09 exercise took place in the first half of June. There will be a gap in July-August; activity will restart with global cosmics in September, and ATLAS will be ready for collisions in Autumn 2009.
The peak rate for RAW data writing was 600 MB/sec, as shown below, with the average always at 300-350 MB/sec.
And the distribution to the active ATLAS Tier-1 Sites.
5.2 Data Reprocessing
During the Easter period ATLAS ran a reprocessing campaign for single-beam and cosmic data taken in August-November 2008.
500 TB of raw data were processed, mostly from tape, at all 10 Tier-1 sites.
Reconstructed data were merged and distributed to other Tier-1/2 Sites. A third reprocessing campaign will take place at the end of July starting this time from ESDs. In the meantime there will be a "fast reprocessing" exercise with the current cosmics.
Below is the number of jobs executed in April and May 2009.
5.3 Data Export Functional Tests
ATLAS continues running data export functional tests at low level to keep checking the status of the whole system. Site contacts are promptly notified of problems and they follow up all troubles together.
SAM tests keep overestimating the site availability. When sites are in downtime and remove services from the BDII, these services are not tested and are therefore left out of the total availability calculation.
For instance RAL this week:
- CE and SRM are removed from BDII (white in SAM tests)
- FTS and LFC work OK (green)
The total availability is calculated as 100% from LFC and FTS only.
One would not call this an "available Tier-1 site". There is no concept of which services should be there.
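The effect described for RAL can be reproduced with a small sketch. This is not the actual SAM availability algorithm, only a simplified model under stated assumptions: each service is either passing (`True`), failing (`False`) or untested (`None`), and `REQUIRED_SERVICES` is a hypothetical list of what a Tier-1 should offer (the minutes note that no such list currently exists).

```python
# Hypothetical definition of the services a Tier-1 is expected to provide.
REQUIRED_SERVICES = ("CE", "SRM", "FTS", "LFC")


def availability_tested_only(results):
    """Average over tested services only; untested ones (None) are ignored."""
    tested = [ok for ok in results.values() if ok is not None]
    if not tested:
        return 0.0
    return sum(tested) / len(tested)


def availability_required(results, required=REQUIRED_SERVICES):
    """Average over all required services; an untested service counts as down."""
    passed = sum(1 for service in required if results.get(service) is True)
    return passed / len(required)


# RAL this week: CE and SRM removed from the BDII (untested), FTS and LFC OK.
ral = {"CE": None, "SRM": None, "FTS": True, "LFC": True}

print(f"tested services only: {availability_tested_only(ral):.0%}")
print(f"required services:    {availability_required(ral):.0%}")
```

In this model the first calculation reports the site as fully available, while the second reports it as half available, because the two services taken out of the BDII are counted as missing.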
5.4 MC Production
Simulation production continues in the background all the time. But it is limited to physics requests and by availability of disk space for data output storage.
5.5 HammerCloud Tests
The ATLAS HammerCloud tests now run on all Grid middleware flavours and back-ends:
- Panda on OSG
- ARC on NorduGrid
- Panda and WMS on EGEE
They use a mixture of real analysis jobs running on AODs and DPDs produced by simulation and cosmics. More database intensive tasks will be added in the near future. Lots of jobs were run as part of STEP'09.
Below is an example of HC running jobs during 24 hours on 10-11 June, in STEP'09.
Planned software releases:
- Release 15.3.0: June 2009, base release for summer 2009 operations.
- Releases 15.X.0: Once/month (or 6 weeks). Incremental code improvements.
ATLAS is studying the results of STEP09 and will be ready for collision data in Autumn 2009. The resource needs have been re-evaluated and are now being discussed with the C-RRB Scrutiny Group referees and with the LHCC.
6. LHCb QR Report 2009Q2 (Slides) – Ph.Charpentier
Ph.Charpentier presented a summary of the LHCb activities during the second quarter of 2009.
All LHCb applications moved to the latest LCG Applications Area (LCG-AA) releases without problems with SLC4 compatibility or with ROOT. Older versions are deprecated and will not be ported to SL5. Since May, all application releases are also built for SL5, but only in 64-bit mode. The work with SPI to identify compatibility libraries for running SLC4-built applications on SL5 is a really heavy burden.
The software distribution and environment setting tools underwent a major re-engineering and are now also used for the distribution of the DIRAC client. This fully relies on the LCG-AA deployment of middleware.
6.2 DIRAC and Production System
Many releases of DIRAC were performed, focusing on the optimisation of pilot job submission and on the interface to the production system.
The production system has been improved with new scripts to generate complex productions automatically and with systematic merging of output data from simulation. Merging is performed at the Tier-1 Sites, from data stored temporarily in T0D1, and the distribution policy is applied to the merged files of 5 GB; some are even larger, up to 15 GB.
There is now a new proposal for DIRAC central services HW implementation with better load balancing, failover on central DB and a new certification service. Provision of adequate hardware is currently being discussed with the IT Department at CERN.
6.3 Activities and Main Issues in 2009Q2
The main activities of LHCb in 2009Q2 were:
- Commissioning of MC production: Physics application software. Geant4 tuning and generator and decay settings tuning
- MC09 simulation production. Large samples for preparing 2009-10 data taking.
- FEST/STEP’09. Data transfers were successful with minor problems with Tier-1 transfers. Data reconstruction was reduced by CondDB access: bad usage of LFC in CORAL.
- Re-processing: Data was transferred during the first week of June, removed from cache, and re-processing (with staging) was launched on Monday 8th. Staging went fine; reconstruction was hit by the CondDB problem.
- TED runs: These are LHCb’s data, from the SPS transfer line. The run just before STEP (6-7 June) was very successful.
Several issues were encountered by LHCb during the quarter.
- Data management problems. File locality at dCache sites and SRM overloads. Gsidcap access problem (incompatibility with ROOT plugin). SRM spaces configuration problems
- Massive file loss at CERN: 7,000 files definitely lost (no replicas anywhere else, Dirac problem). Others could be located and replicated back to CERN.
- DIRAC scalability is currently limited to ~10,000 concurrent jobs (about 25,000 per day). Work is ongoing (with IT) on defining and implementing a scalable and redundant infrastructure for the central services.
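The two DIRAC figures quoted above can be related with Little's law (jobs in the system = throughput × average time in the system). The sketch below is only a rough steady-state estimate: it assumes that the ~10,000 concurrent jobs and the ~25,000 jobs per day refer to the same sustained regime, which the report does not state explicitly.

```python
def mean_job_hours(concurrent_jobs, jobs_per_day):
    """Average job duration in hours from Little's law, assuming steady state."""
    days_per_job = concurrent_jobs / jobs_per_day
    return days_per_job * 24


# Figures from the report, taken at face value.
hours = mean_job_hours(concurrent_jobs=10_000, jobs_per_day=25_000)
print(f"implied average job duration: about {hours:.1f} hours")
```

Under that assumption the average job would occupy a slot for a little under ten hours.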
6.4 Production and User Jobs
Below is the number of jobs run during the quarter: production jobs in red, user jobs in yellow.
This is the distribution over the Sites, for a total of 115 Sites.
And the same plot by country, over 23 countries.
The cumulative plot of the jobs run during the quarter, by country, exceeds 1 million jobs.
7.1 WLCG demos at the EGEE Conference in Barcelona
I.Bird reported that PIC is proposing to have a display at the EGEE Conference, 21-25 September 2009. This would be a good way to show how the WLCG is working on a large scale. It could showcase one Experiment per day and the dashboard that is now running in the Computer Centre. P.Mendez and C.Noble will organize something with G.Merino.
Experiments should provide their contact persons.
8. Summary of New Actions