LCG Management Board
Tuesday 12 January 2010 16:00-18:00 – F2F Meeting
(Version 1 – 20.1.2010)
A.Aimar (notes), J.Bakken, D.Barberis, J.-Ph.Baud, I.Bird (chair), D.Bonacorsi, K.Bos, M.Bouwhuis, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, Qin Gang, J.Gordon, D.Heagerty, A.Heiss, F.Hernandez, M.Litmaath, P.Mato, H.Meinhard, G.Merino, A.Pace, R.Pordes, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon
Mailing List Archive
Next Meeting: Tuesday 26 January 2010 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments received. The minutes of the previous meeting were approved by the WLCG MB.
1.2 Changes in the MB Membership – I.Bird
I.Bird reported the changes in the IT Department at CERN that will have an impact on the WLCG MB membership:
- Helge Meinhard is now the contact person for the Tier-0, replacing T.Cass
- T.Cass will continue to participate in his new role as leader of the DB group
- Service Coordination issues will go via J.Shiers
A.Aimar will stop acting as Secretary for the MB Meeting and will be replaced by Denise Heagerty.
M.Kasemann will be replaced by I.Fisk as representative for CMS.
I.Bird thanked them both for their participation over the last few years.
2. Action List Review (List of actions)
OPN Mandate: Experiments and Tier-1 Sites should provide names for working with the OPN on the needs and actions needed on Tier-1 to Tier-2 links.
To be done. Already discussed as matter arising earlier.
3. Operations Weekly Report (Slides)
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
This report covers the weeks from 14th December 2009 to 10th January 2010.
Running was quite smooth during the end of data taking and the Christmas break, with a mixture of problems.
A few incidents led to service incident reports:
- Cooling incident at PIC
- Batch system database server overload at IN2P3
The attendance during the last week was as below.
3.2 Alarms and Availability Data
The GGUS ticket rate was as usual. Below is the summary for the 4 weeks, with 3 alarm tickets in total:
- Two alarm tickets submitted by ATLAS: Test CERN-PROD on 14th and 15th December
- One alarm ticket submitted by LHCb: Test FZK on 15th December
Slide 6 shows that availability was mostly high (mostly green).
D.Britton commented that the unavailability of RAL for LHCb was due to an error in the LHCb SAM tests.
A.Heiss commented that for DE-KIT there was a problem with the tests of the space tokens with dCache.
3.3 Service Incidents Reports
PIC - Cooling Problem
- Failure in the cooling system on the morning of 19th December (cause not known)
- Ordered fast shutdown of critical services: 3D, CE, SE
- Cooling problem fixed around 11:00
- Systems fully restarted at 15:00 (LFC replication via Streams excepted)
- Procedures being improved and automated
IN2P3 - Batch System DB Server Overload
- Local batch management system (BQS) failure due to database server overload in the afternoon of 4th January
- Problem due to a user requesting historical information for December (2 million entries)
- User cancelled his request and resubmitted it with wrong parameters (24 million entries), then cancelled again
- Neither job submission nor job status queries were possible
- MySQL restarted
- No other fix as BQS will probably be replaced by another batch system.
3.4 Miscellaneous Information
- LHC produced collisions at 2.36 TeV (world record)
- ATLAS collected 150 TB of data with beam on; a significant fraction will be reprocessed
- ALICE decided to blacklist sites not running SL5
- IN2P3 added 60 GB of disk for CMS to store software releases
- BIGID problem for CASTOR at RAL
- FS probe problem at RAL on one of the LHCb disk servers
- Transfer timeouts between PIC and SARA as well as between Weizmann T2 and SARA due to large MTU values used
M.Bouwhuis commented that the issue was due to an OPN problem between CERN and PIC.
- Job submission failures for CREAM-CE at IN2P3
- LAN upgrade and 700 TB added at BNL; the upgrade was transparent, without stopping the service
- CNAF plans to move LHCb data from CASTOR to GPFS and TSM as already done for CMS
L.Dell’Agnello added that CNAF also plans to move ATLAS from CASTOR to TSM.
- FTS problem at CERN for ATLAS and CMS: “could not load client credentials”: is it FTS 2.2 specific or is it a deployment issue?
A.Heiss reported that the same problem was also encountered at DE-KIT.
J.Gordon asked whether FTS 2.2 is the “production” version.
J.-Ph.Baud replied that it is the production version but the upgrade is not recommended until these issues are solved.
- CERN network instabilities on Friday 8th January (problems in the core routers)
4. Update on Accounting (HEPSPEC, Storage, Installed cap.) (Slides) – J.Gordon
All sites were asked to benchmark their current hardware and publish:
- 89 sites publishing from 21 countries, Gstat2 knows about 349 sites.
Most countries only publish a few (see table below).
Sites measuring HEPSPEC06 convert to SI2K to publish. All APEL numbers are normalised in SI2K.
The CESGA Portal was asked to optionally convert to HEPSPEC06 when displaying. This is on their roadmap, but they have no plans yet for a mass conversion of raw data. They could convert it in the monthly reports.
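The conversion between the two units mentioned above can be sketched as follows. This is only an illustration: the factor of 250 SI2K per HEPSPEC06 (4 HEPSPEC06 = 1 kSI2K) is the commonly quoted WLCG value and is assumed here, not stated in these minutes, and the function names are hypothetical.

```python
# Assumed conversion factor (not given in the minutes):
# 4 HEPSPEC06 = 1 kSI2K, i.e. 250 SI2K per HEPSPEC06.
SI2K_PER_HS06 = 250.0


def hs06_to_si2k(hs06: float) -> float:
    """Convert a HEPSPEC06 value to SI2K, as a site would before publishing."""
    return hs06 * SI2K_PER_HS06


def si2k_to_hs06(si2k: float) -> float:
    """Convert a published SI2K value back to HEPSPEC06, e.g. for display."""
    return si2k / SI2K_PER_HS06


# Example: a core benchmarked at 8 HEPSPEC06 would be published as 2000 SI2K.
print(hs06_to_si2k(8.0))
```

Converting only at display time, as the portal was asked to do, leaves the raw SI2K records untouched.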
A new version of the APEL client, which uses ActiveMQ to transfer data to the central repository, is under certification. A new version of the central server, which accepts data from ActiveMQ as well as R-GMA, is being readied for production.
Once these two tasks are complete alternative publishers can be developed (OSG, DGAS, etc).
Regional repositories will not be available before EGI.
4.3 Storage Accounting
Gstat2 has developed collectors for the new storage schema. A new RAL staff member started to work on storage accounting in December and will present specifications at February’s GDB for comments.
It will harvest data from BDII so Nagios sanity checks are also relevant here. One can check the results now in gstat2 at this URL: http://gstat-prod.cern.ch
4.4 Installed Capacity
These values were sanity checked:
- Installed CPU numbers are known to gstat2
- SI2K totals known to Gridview for the reliability reports
- Online and Nearline site storage totals are available to gstat2
Regular reports for T1 and T2 are on the gstat2 work plan, but to start with they will be manually merged with the CPU reports. No dates are known for the implementation.
The patch for the APEL client to use the new normalisation value is in certification. It will allow sites that normalise CPU in their batch system to publish the correct CPU capacity.
4.5 User DN Publication
The CESGA portal shows the Sites not publishing the User DN; they should be reminded that this is not acceptable.
I.Bird asked that regular reports be produced on the different metrics mentioned in the talk. Only with regular reporting will the Sites act.
5. Multi-User Pilot Jobs (MUPJ) (Slides)
5.1 Technical Forum Working Group
The Tech Forum started right after the December GDB. There are now 59 members, and others are welcome.
Here is the group: http://groups.google.ch/group/wlcg-tf-pilot-jobs with almost 200 messages so far.
A Summary Wiki has gone through 6 versions: https://wlcg-tf.hep.ac.uk/wiki/Multi_User_Pilot_Jobs
To view that page the browser needs an IGTF cert loaded. The questionnaire was sent to T1+T0 representatives yesterday and the majority of responses are expected in 2 weeks.
The next activity foreseen in the WG is to produce recommendations.
5.2 Summary Wiki
- What are pilot jobs?
- Single- vs. multi-user pilots
- What is gLexec?
2. Boundary conditions
- Mainly JSPG policies http://www.jspg.org/wiki/JSPG_Docs
- Some adjustments could turn out to be desirable, as some wording does not match the policy
3. Benefits of pilot jobs compared to "classic" jobs
- Also single- vs. multi-user pilots
4. Issues for efficient/correct scheduling of pilot jobs
- A single class of pilot jobs may not be a panacea
5. Drawbacks of multi-user pilots
- Mainly issues surrounding gLexec
6. Multi-user pilot jobs with identity change
- Pro: complete separation of users
- Con: setuid complications
7. Multi-user pilot jobs without identity change
- Pro: no setuid complications
- Con: incomplete separation of users
8. Legal considerations
- Some sites may have more constraints than others
9. Virtual machines (one VM per job)
- Will simplify matters
A short discussion followed but M.Litmaath suggested checking the explanations in the Summary Wiki.
A questionnaire was sent to all Sites in order to understand clearly the situation on MUPJ.
1. Does your site policy allow the use of multi-user pilot jobs by the LHC experiments you support? (no/depends/yes)
- If no, why?
2. Does your site policy support the use of gLexec in setuid mode? (no/allow/require)
- If no, why?
3. Does your site policy support the use of gLexec in log-only mode? (no/allow/require)
- If no, why?
4. When gLexec returns an internal error (e.g. SCAS/Argus/GUMS temporarily unavailable), does your site policy allow the pilot to continue and run the payload itself? (no/depends/yes)
- If depends, on what?
I.Fisk noted that the reliability of GUMS and SCAS/gLexec is relevant whether or not one uses MUPJ, and that it should be evaluated.
M.Schulz replied that for the moment the SCAS solution is not as reliable as it will be, but it will improve. For the moment SCAS may be a bottleneck for a Site.
J.Templon noted that Experiments could exercise SCAS for a fraction of their jobs and one could evaluate the reliability.
Ph.Charpentier replied that the Experiments do not have time for those tests.
Tier-1 Sites should collect the answers to the MUPJ questionnaire from their Tier-2 Sites. They are expected to report on the answers from their countries.
Ph.Charpentier noted that there could also be a question of MUPJ without gLexec.
M.Litmaath replied that it is implicit and Sites can add their own comments. He reminded the MB that in case of legal issues one must show that one has done all that is possible to correctly collect authorization information.
J.Templon noted that it would be better to have some external body assessing/certifying the security situation.
I.Bird noted that the situation is good and that the necessary due diligence has been done on the matter. He also recalled the advice given by the experts of the policy group, which was to use, for example, gLexec.
D.Barberis noted that some Tier-2 Sites will not have a reference Tier-1.
I.Bird suggested sending the questionnaire to the Collaboration Board mailing list even if the Tier-1 will report on their Tier-2 Sites.
6. ALICE Quarterly Report (2009Q4) (Slides) – Y.Schutz
Y.Schutz presented the Quarterly Report 2009Q4 for ALICE.
6.1 Data Taking
Data were taken with all installed detectors from the first collision onwards:
- 1 million collision events
- 365 GB RAW at the Tier-0
- Replicated two times in external Tier-1 Sites; but only after end of data taking.
The data migration strategy was changed. During data taking data are migrated from the DAQ disk buffer to the ALICE CASTOR disk pool (alicedisk) for temporary storage. The data are then optionally migrated to the CASTOR permanent data storage (t0alice).
CASTOR v.2.1.8 was extremely stable throughout the data taking.
6.2 Data Processing
Pilot reconstruction and analysis is performed for a fraction of the run, typically a few thousand events, on the CAF as soon as data are transferred to CASTOR and registered in the AliEn file catalogue. This method provides quick feedback to run coordination on data quality.
Data reconstruction (first pass) is automatically launched at T0 at the end of the run, and ESDs are available for analysis a few hours later in 3 SEs. The first pass reconstruction success rate was ~96%. Second pass reconstruction was run during the Christmas break at T0+T1s, with a success rate of ~98%. The analysis trains have run several times over the entire set of reconstructed data of pass1 and pass2. MC production was executed at all sites with several ‘Early physics’ production runs with RAW conditions data.
On average 2600 jobs were running concurrently.
6.3 ALICE Software
A number of fixes to the algorithms were needed when they were confronted for the first time with real collision data.
The main issue found was high memory usage (~4 GB) for RAW reconstruction. The main effort is concentrated on reducing the memory usage; the aim is 2 GB, after calibrating the detectors with the collected data sample. A new release was done on January 15.
6.4 Services: SL5 and CREAM
Priority was given to SL5 migration and all Tier-1 Sites and most of the T2s have migrated. Four T2s are still blocked (Athens, PNPI, UNAM, Madrid) because they do not run SL5. All is expected to be done by the end of this month.
CREAM CE deployment has reached 50% of the ALICE sites, with no recent progress. Dual submission (CREAM/WMS) is still the norm; this is not the setup desired by ALICE, which continues to work with the sites on the deployment of CREAM.
The ALICE updated milestones are:
- MS-130 15 Feb 10: CREAM CE deployed at all sites
- MS-131 15 Jan 10: AliRoot release ready for data taking
The first data taking period has been a full success for ALICE in general and for ALICE computing in particular: data flow and data processing went as planned in the Computing Model. The Grid operation has been smooth and the sites in general delivered what they had pledged.
Two main concerns remain: the excessive memory usage, which prevents ALICE from running efficiently at all T1 sites, and the need to achieve uniformity of the submission system (i.e. CREAM at all Sites) before the start of data taking.
F.Hernandez asked whether the MC jobs still require 2 GB and whether this is specified in the job descriptions for the CE.
Y.Schutz answered that only the reconstruction jobs require 4 GB, temporarily, and that they cannot run at IN2P3 for the moment.
7. ATLAS Quarterly Report (2009Q4) (Slides) – D.Barberis
7.1 Tier-0 and data-taking activities
In October 2009 ATLAS took global cosmics.
In mid-November, at the start of LHC data, ATLAS was ready, with open trigger, low thresholds and the full calorimeter read-out. Events were 5 MB on average.
The instantaneous rate was limited to 800 MB/s, but the average event rate was very low anyway.
This produced large RAW but small ESD.
Cosmics runs are interleaved with LHC runs when the machine is off. They are needed (together with beam halo) for detector alignment to constrain the weak distortion modes that cannot be constrained by tracks originating from the collision point.
I.Bird asked whether the storage of cosmics data had been planned originally and was required.
D.Barberis replied that this is within the existing requests of ATLAS.
ATLAS has accumulated almost 1 PB including replicas. All data was processed in real time at Tier-0 and there were no surprises with respect to MC events in terms of CPU and memory.
7.2 Data Distribution Pattern and Performance
All RAW data was sent to disk and tape in each Tier-1 by Tier-1 share. Moreover, all RAW goes to disk at BNL, Lyon and SARA. The normal pattern is tape in each Tier-1 by Tier-1 share and no extra RAW data to disk at CERN except for the CAF.
All ESD went to disk in each Tier-1. The normal pattern is two copies distributed over all Tier-1s, with a full ESD copy to disk at CERN and ESD data to disk in Tier-2s by Tier-2 share.
AOD and dESD (skimmed data) went to disk in all Tier-1s. The normal pattern is 2 copies kept in all Tier-1s only.
They were also copied to disk in Tier-2s by Tier-2 share (total ~18 copies). The normal pattern is 10 copies in the Tier-2s only.
Additional copies will be reduced dynamically to make room for 2010 data.
Below is the total data throughput over a month in November and December.
All data were delivered to Tier-1s and Tier-2s using open datasets. RAW was delivered during data-taking (run in progress), while ESD etc. were delivered during Tier-0 processing, as soon as outputs were available.
Data were available for analysis at Tier-2s on average 4 hours after data-taking, including the time for Tier-0 processing.
7.3 Data Reprocessing
An "ultra-fast" reprocessing campaign was run on 21-31 December 2009 using the last Tier-0 software cache plus a few last-minute bug fixes (release 126.96.36.199) and most up-to-date calibrations and alignments for the whole period.
Only 22 RAW->ESD jobs failed out of 130148, and 27 ESD->AOD jobs out of 10001.
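To put these counts in perspective, the failure fractions can be computed directly from the numbers quoted above. The sketch below is plain arithmetic; only the helper name is invented for illustration.

```python
def failure_rate_percent(failed: int, total: int) -> float:
    """Return the percentage of failed jobs out of the total submitted."""
    return 100.0 * failed / total


# Job counts quoted in the report.
raw_to_esd = failure_rate_percent(22, 130148)
esd_to_aod = failure_rate_percent(27, 10001)

print(f"RAW->ESD failure rate: {raw_to_esd:.3f}%")
print(f"ESD->AOD failure rate: {esd_to_aod:.2f}%")
```

This gives roughly 0.02% for RAW->ESD and 0.27% for ESD->AOD.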
A few software bugs affecting beam splash events are being followed up. The next reprocessing round will take place in February using release 15.6.3.X, which is being built now. It will also be a test of the 15.6.X.Y releases to be used at Tier-0 next month. All was on SLC5/gcc4.3 only.
7.4 Simulation Production
Simulation production continues in the background all the time. It is only limited by physics requests and the availability of disk space for the output.
A parallel effort is underway for MC reconstruction and reprocessing. It includes reprocessing of MC09 900 GeV and 2.36 TeV samples with AtlasTier0 188.8.131.52 reconstruction, the same release as for data reprocessing.
7.5 Analysis Data Access
Data were analysed on the Grid already from the first days of data-taking as shown below.
Several "single" users submitted event selection and/or analysis jobs on behalf of their performance or physics working group. The output consists of ntuples that are copied to group space and then downloaded by end users.
In this work model the number of real Grid users is somewhat underestimated.
ATLAS restarts data-taking with separate detector runs during January, with no data export.
ATLAS will start a global cosmics run in the first week of February, with the start of Tier-0 data processing and export.
ATLAS will be ready for LHC beams in mid-February.
J.Gordon asked whether ATLAS expects more Sites to move to SL5.
D.Barberis replied that the current situation is fine for ATLAS. ATLAS expects all sites to have moved to SL(C)5 by now. The software release that is in validation now, to be used at the re-start of data-taking, will be distributed only for the SLC5 (and compatible) platform.
F.Hernandez asked about ATLAS testing the CREAM CE.
D.Barberis replied that ATLAS will do its CREAM tests when it is tested by the other Experiments.
8. CMS Quarterly Report (2009Q4) (Slides) – I.Fisk
The CMS Distributed Computing System generally performed well with the addition of collision data. The data rates and sample sizes are still quite low and the system was not resource constrained during this early period.
The workflows and activities were generally what was expected from the computing model, but the workflows could be executed much more frequently:
- Data Multiply Subscribed. More T1 (four) and T2 (about 13) subscriptions than would happen with more data.
- Re-processing occurred every 2-3 days instead of every couple of months.
Data Reconstruction, Skimming, Re-reconstruction at Tier-1s could run in parallel with distributed user analysis and MC production at Tier-2s.
8.2 Data Collection Infrastructure
Tier-0 and Tier-1 Re-reco and Data Distribution Systems functioned with early collisions. Events were reconstructed and exported to Tier-1 sites, the express stream latency was at target levels and data was re-reconstructed using Tier-1 centers. In addition the Prompt Skimming system was moved into production.
The Tier-0 Facility had been routinely exercised with cosmic data taking and simulated event samples and was performing stably with Cosmics data with very few failures as shown below.
Below is the same with collisions, with the failures concentrated in the setup phase.
The ~3000 cores at CERN were all used, with local submission to the farm and multiple workflows.
Overall there was very good stability and performance of the CMS software.
CMS also received confirmation from CERN of the T0+CAF pledge for 2010.
8.3 Data Distribution and Access
Below is the data distribution from CERN or a Tier-1 going to destination at another Tier-1.
Below instead is with source a Tier-1 going to a destination Tier-2.
8.4 Site Stability
Tier-1 readiness in November and December is shown below, with the 7 CMS Tier-1 Sites mostly ready.
Readiness is defined as passing the CMS, SAM, Job Robot, and Transfer tests for a high percentage of a time window.
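The readiness definition above can be illustrated with a small sketch. The minutes do not give the threshold, the window length or the exact test names, so the 90% fraction, the per-day records and the keys used here are all assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical test names; the minutes list SAM, Job Robot and Transfer tests.
TESTS = ("sam", "job_robot", "transfer")


def is_ready(days: List[Dict[str, bool]], min_fraction: float = 0.9) -> bool:
    """Return True if all tests passed on at least min_fraction of the days.

    min_fraction is an assumed stand-in for the "high percentage" in the text.
    """
    if not days:
        return False
    good_days = sum(1 for day in days if all(day.get(t, False) for t in TESTS))
    return good_days / len(days) >= min_fraction


# Example: a 10-day window with one day on which the transfer test failed.
window = [{"sam": True, "job_robot": True, "transfer": True}] * 9
window.append({"sam": True, "job_robot": True, "transfer": False})
print(is_ready(window))
```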
About 40 Tier-2 Sites are constantly available, but this number has not improved much over the two months.
8.5 Activities over the end 2009 Break
Data Processing Activities during the break were the following:
Re-processing and skimming of all good runs finished on 12/24 for the two large physics datasets:
- ZeroBias 22M RAW events, 1019 files processed.
- MinimumBias RAW 21.5M events, 1207 files
Almost problem-free processing of high-quality data.
With the latest CMSSW version only one of >2000 jobs failed due to memory consumption. All was done within 4-5 days.
8.6 MC Production
MC production was smooth over the break, with 120M events produced (RAW, RECO, AOD), including special MinBias samples for comparison with 900 GeV and 2.36 TeV data. Most was with Full Simulation, some with Fast Simulation.
8.7 Current Activities and Improvements
CMS operated in an environment without resource constraints. The data rate and complexity are lower than expected in the final system; this allows many more passes and caused some complaints about lack of utilization. The number of users is also lower.
CMS has seen that it is able to replicate data to Tier-2s, but it is taking advantage of the oversubscription. Good Tier-1 to Tier-2 performance will need to be achieved when the data is accessible from fewer places.
The CMS Computing TDR defines the burst rate Tier-1 to Tier-2 as 50MB/s for slower links up to 500MB/s for the best connected sites. We have seen a full spectrum of achieved transfer rates. The Average Observed Daily Max peaks at the lower end.
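As a rough illustration of what the two ends of this range mean in practice, the sketch below computes how long a dataset would take to move at a sustained rate. The 1 TB dataset size and the use of decimal units are assumptions chosen for the example; only the 50 MB/s and 500 MB/s rates come from the text.

```python
def transfer_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate in MB/s.

    Decimal units are assumed: 1 TB = 1,000,000 MB.
    """
    size_mb = size_tb * 1_000_000
    return size_mb / rate_mb_per_s / 3600.0


# Hypothetical 1 TB dataset at the two burst rates quoted from the TDR.
for rate in (50, 500):
    print(f"{rate} MB/s: {transfer_hours(1.0, rate):.2f} hours")
```

Under these assumptions a 1 TB sample takes about 5.6 hours on a slower link and about 0.6 hours to a well connected site.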
From the size of the facilities and the amount of data hosted, CMS has planning estimates for how much export bandwidth should be achievable at a particular Tier-1. No Tier-1 has been observed to hit the planning numbers (though a couple have approached them).
CMS would like to organize a concerted effort to exercise the export capability. This requires work with site reps, CMS experts, FTS and network experts, and is an area for collaboration.
In conclusion, the Distributed Computing worked well during the opening collision data for CMS. CMS thanked CERN and the Tier-1 sites for keeping things working. Some items are to be followed up because they are not yet working at the rates anticipated in the planning.
9. LHCb Quarterly Report (2009Q4) (Slides) – Ph.Charpentier
9.1 Activities in 2009Q4
This is a summary of the LHCb activities:
- Core Software: Stable versions of Gaudi and LCG-AA
- Applications: Stable as of September for real data. Fast minor releases to cope with reality of life.
- Monte-Carlo: Some MC09 channel simulation. Few events in foreseen 2009 configuration. Minimum bias MC09 stripping
- Real data reconstruction: As of November 20th
Below are the Number of Jobs per day and CPU Usage by Job Type.
9.2 Experience with Real Data
Very low crossing rate with a maximum of 8 bunches colliding (88 kHz crossing) with very low luminosity.
Minimum bias trigger rate: from 0.1 to 10 Hz. Data was taken with single beam and with collisions. Only 217 GB.
9.3 Real Data Processing
Iterative process with small changes in reconstruction application and improved alignment. In total 5 sets of processing conditions and only the last files were all processed twice.
Automatic job creation and submission after:
- File is successfully migrated in Castor
- File is successfully replicated at Tier1
If a job fails for a reason other than an application crash:
- The file is reset as “to be processed”
- New job is created / submitted
Processing more efficient at CERN (see later)
- Eventually, after a few trials at Tier1, the file is processed at CERN
- DST files distributed to all Tier1s for analysis
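The resubmission policy described above can be sketched as follows. This is not DIRAC code: the class, function names and status strings are hypothetical, and the limit of 3 Tier1 trials is an assumption, since the text only says "a few trials".

```python
from dataclasses import dataclass

# Assumed limit; the minutes only say "a few trials" at Tier1.
MAX_TIER1_TRIALS = 3


@dataclass
class FileState:
    name: str
    tier1_trials: int = 0
    status: str = "to be processed"


def next_site(f: FileState, tier1: str) -> str:
    """Choose where the next job for this file runs: Tier1 first, then CERN."""
    return tier1 if f.tier1_trials < MAX_TIER1_TRIALS else "CERN"


def handle_job_result(f: FileState, site: str, ok: bool,
                      application_crash: bool = False) -> str:
    """Update the file state after a job finishes and return the new status."""
    if ok:
        f.status = "processed"
    elif application_crash:
        # Application crashes are not retried automatically.
        f.status = "failed"
    else:
        # Any other failure: reset the file so that a new job is created.
        if site != "CERN":
            f.tier1_trials += 1
        f.status = "to be processed"
    return f.status
```

Counting trials per file rather than per job keeps the fallback to CERN independent of which Tier1 the earlier attempts ran at.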
9.4 Issues Encountered
A few issues were encountered by LHCb:
- Castor migration: Very low rate: had to change the migration algorithm for more frequent migration
- Issue with large files (above 2 GB): Real data files are not ROOT files but are opened by ROOT. There was an issue with a compatibility library for slc4-32 bit on slc5 nodes. Fixed within a day
- Wrong magnetic field sign: Due to different coordinate systems for LHCb and LHC. Fixed within hours
- Data access problem when accessing by protocol, directly from server: dCache issue at IN2P3 and NIKHEF. dCache experts are working on it. Moved to copy mode paradigm for reconstruction. Still a problem for user jobs, and some sites have been banned for analysis
9.5 Transfers and Latency
No problem was observed during file transfers. Files were randomly distributed to Tier1s, and LHCb will move to distribution by runs (a few hundred files). For 2009, runs were not longer than 4-5 files.
Grid latency, in terms of the time between submission and jobs starting to run, was very good.
9.6 Usage of MUPJ in LHCb (not presented, copied from the slide)
The so-called Multi-User Pilot Jobs (MUPJ) are used by DIRAC on all sites that accept role=Pilot
- They are just regular jobs!
MUPJs match any Dirac job in the central queue
- Production or User analysis (single queue)
- Each PJ can execute sequentially up to 5 jobs
  - If remaining capabilities allow (e.g. CPU time left)
  - MUPJ has 5 tokens for matching jobs
- role=Pilot proxy can only retrieve jobs, limited to 5
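The token limit described in the bullets above can be illustrated with a minimal sketch. It is not DIRAC code: the function name, the queue representation and the capacity check are all hypothetical, and only the limit of 5 jobs per pilot comes from the slide.

```python
from collections import deque
from typing import Callable, Deque, List


def run_pilot(central_queue: Deque[str], tokens: int = 5,
              has_capacity: Callable[[], bool] = lambda: True) -> List[str]:
    """Pull jobs one at a time until tokens, jobs or capacity run out."""
    executed = []
    while tokens > 0 and central_queue and has_capacity():
        job = central_queue.popleft()  # match the next job in the single queue
        executed.append(job)           # stand-in for running the payload
        tokens -= 1                    # each matched job consumes one token
    return executed


# Example: 8 queued jobs, one pilot; at most 5 are taken.
queue = deque(f"job{i}" for i in range(8))
print(run_pilot(queue), len(queue))
```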
A limited user proxy is used for DM operations of the payload
- Cannot be used for job submission
Proxies can be hidden when not needed
DIRAC is instrumented for using gLexec (in any mode)
- Problems are not with gLexec but with SCAS configuration
LHCb is not willing to lose efficiency due to the introduction of badly configured gLexec
- Yet another point of failure!
- Cannot afford testing individually all sites at once
This topic has been lasting over 3 years now
- Where is the emergency? Why did it take so long if so important?
Propose to reconsider pragmatically the policy
- They were defined when the frameworks had not been evaluated, and VOs had to swallow the bullet
- Re-assessing the risks was not really done in the TF (yet)
leaves decisions to sites
- MUPJs are just jobs for which their owner is responsible
- VOs should assess the risk on their side
LHCb concentrated on real data, even though there was very little of it (200 GB), but it was a very important learning exercise.
A few improvements were identified for the 2010 running:
- Run distribution (rather than files)
- Conditions DB synchronization check
- Make sure Online Conditions are up-to-date
There are still some MC productions, with feedback from the first real data, e.g. the final position of the VeLo (15 mm from the beam).
The first analysis of 2009 data was made on the Grid, and LHCb foresees a stripping phase for V0 physics publications.
LHCb definitely wants to continue using MUPJs.
F.Hernandez reported that IN2P3 observed problems in accessing files using GSI DCAP, like SARA. A ticket was submitted to the dCache team, who reproduced the problem. A patch is available for test, but IN2P3 cannot reproduce the problem. He asked whether LHCb is still banning IN2P3.
Ph.Charpentier replied that he will have to verify offline.
I.Bird reminded the MB that there is the LHCC Review on the 16th February. Sites and Experiments should check the agenda.
The issue of resources for 2011 was raised; without further news, 2011 will be assumed to be a nominal year.
The 2010 calendar of the MB meeting should be fixed for the whole year.
11. Summary of New Actions