LCG Management Board


Tuesday 10 June 2008, 16:00-18:00 – F2F Meeting




(Version 1 - 13.6.2008)


A.Aimar (notes), D.Barberis, O.Barring, I.Bird (chair), D.Britton, Ph.Charpentier, L.Dell’Agnello, I.Fisk, P.Flix, S.Foffano, F.Hernandez, M.Kasemann, M.Lamanna, E.Laure, H.Marten, G.Merino, A.Pace, B.Panzer, R.Pordes, Di Qing, Y.Schutz, J.Shiers, O.Smirnova

Action List

Mailing List Archive

Next Meeting

Tuesday 17 June 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)


1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.


F.Hernandez sent a correction (already added) to the previous minutes, in the AOB section, about LHCb and dCache at IN2P3.


I.Bird asked why the issues at IN2P3 involve only LHCb and not the other Experiments.

Ph.Charpentier replied that 80% of the time there are failures when opening a file in dCache via the GSI/dcap doors. The other Experiments may use other mechanisms. Because of these problems at IN2P3, LHCb copies the file to the Worker Node, fetching it from CERN and not from the IN2P3 MSS.

J.Shiers added that this will be discussed at the P-M Workshop on Thursday and Friday.

1.2      ALICE Accounting - H.Marten

H.Marten reported that, as a follow-up to the previous MB meeting, the study of ALICE accounting has started.

DE-KIT (M.Alef) and ALICE are checking the results obtained from MonALISA for ALICE against the DE-KIT monitoring values.

They will report to the MB in the next few weeks.


2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.


  • 30 Apr 2008 – Sites send to H.Renshall plans for the 2008 installations and what will be installed for May and when the full capacity for 2008 will be in place.


A.Aimar will ask for an update to be given to the MB in the coming weeks.


  • 9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).


Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it.

Other sites that want to install it should confirm.

Done. The agenda above is considered as accepted.

  • 31 May 2008 - OSG Tier-2 should report availability and reliability data into SAM and GOCDB; so that the monthly Tier-2 report can include their information.

To be removed. New milestones have already been added to the HL Milestones dashboard.

R.Pordes added that OSG would like the drafts about Tier-1/Tier-2 reliability and availability to be forwarded first to an OSG list for verification. She also asked GridView to produce the report for the OSG sites before the reports are distributed via the WLCG.

I.Bird agreed that the OSG list should be added to the current set of lists that receive the drafts.


Update 14 June: The list for OSG is:

Missing information from INFN, ATLAS, CMS and LHCb.

CMS: Updated the mailing list. Still missing are the DNs of the 4 users who can post to the sites' alarm lists.

ATLAS, LHCb: Neither mailing list nor user DNs specified yet.


D.Barberis noted that ATLAS would like the person on shift to be able to raise such alarms, not always the same 4 people. A generic address (e.g. of the people on shift) should be allowed.

CERN mailing lists cannot filter on signed mail messages; posting can only be restricted to the members of that same list or to addresses registered in CERN/NICE.

Ph.Charpentier added that the LHCb mailing list is ready and the 4 users will be specified soon.


Update 14 June 2008

The only information still missing is:

-       ATLAS Alarm Email Address

-       CMS list of 4 users DNs that can post alarms to the sites.

  • H.Marten and Y.Schutz agreed to verify and re-derive the correct data and the normalization factors applied (at FZK for instance), and to compare the local accounting, the APEL WLCG accounting and the ALICE MonALISA accounting.

Ongoing. H.Marten reported that the follow-up to the previous MB meeting, about ALICE accounting, has started.


3.   CCRC08 Update (CCRC'08 and Data Taking Readiness; Minutes of daily con-calls (week 23); Slides) - J.Shiers 


J.Shiers presented the last post-CCRC08 summary. From now on it will be the “LCG Weekly Services Report”.


Additional information will be presented at the GDB meeting and at the P-M Workshop later this week.


The daily operations calls will continue at 15:00, with a weekly summary, until it is decided otherwise. Notes from the daily meetings continue to be made available shortly after the conf-call is over.


The meetings are now in 513 R-068 for those at CERN. Remote joining details remain unchanged: x76000 code 0119168.

3.1      Core Services

The main issue this week was the CERN CASTOR SRM failure following the scheduled upgrade: a detailed post-mortem is available here.

-       The incident started after 9.30, when the SRM2 upgrades created a confusing situation.

-       At 13.00 the developer understood that a bug had been introduced for requests involving tape recalls and it was decided to roll back the software.

The lessons learned from this incident are:

-       Even straightforward, well understood and exercised upgrades may not be as transparent as we think. Therefore they should be announced better.

-       Testing of Castor/SRM software has deficiencies.


Until these are addressed, we should adopt a more conservative upgrade strategy:

-       Day-1: upgrade one endpoint, and verify it carefully

-       Day-2: upgrade the others
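The two-day strategy above can be written down as a small planning helper. This is only an illustrative sketch: the function name and the endpoint names are hypothetical, and the careful verification of the first endpoint on Day-1 remains a manual step that the code does not model.

```python
def plan_staged_upgrade(endpoints):
    """Return a day-by-day plan: one endpoint on day 1, all others on day 2."""
    if not endpoints:
        return {}
    # Day 1: a single endpoint, to be verified carefully before going further.
    plan = {1: [endpoints[0]]}
    # Day 2: the remaining endpoints, only if there are any.
    if len(endpoints) > 1:
        plan[2] = list(endpoints[1:])
    return plan


# Hypothetical endpoint names, for illustration only.
plan = plan_staged_upgrade(["srm-endpoint-a", "srm-endpoint-b", "srm-endpoint-c"])
for day, targets in sorted(plan.items()):
    print(f"Day-{day}: upgrade {', '.join(targets)}")
```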


Other issues were encountered in the preparation of FTS data for the analysis of the May run.

3.2      Monitoring - FTM not installed at many sites

Only a few Tier-1 sites have installed the FTS monitoring component (FTM) that reports Tier-1 to Tier-1 transfers back to the central GridView repository. FTM is distinct from the internal monitoring of FTS and is needed in addition to it. The FTS team will circulate advice to the Tier-1 sites that are not publishing their T1 to T1 traffic to GridView.


M/W component    | Patch #      | Status
                 | Patch #1752  | Released gLite 3.1 Update 20
FTS (T0)         | Patch #1740  | Released gLite 3.0 Update 42
FTS (T1)         | Patch #1671  | Released gLite 3.0 Update 41
                 | Patch #1458  | Released gLite 3.1 Update 10
                 | Patch #1738  | Released gLite 3.1 Update 20
DPM 1.6.7-4      | Patch #1706  | Released gLite 3.1 Update 18

3.3      DB Services

The issues with some DB services during the power cut have been analyzed and traced to the Ethernet switches of RAC6, which were on physics power instead of critical power because the power bars were wrongly connected (although they had the correct “yellow label”). A new bar connected to critical power was installed as a stop-gap and an intervention to clean up the setup is being planned.


The upgrade of production databases to Oracle version has been scheduled with the Experiments:

-       LHCb 5.June (done), LCG 16.June, Atlas 18.June, CMS 24.June.

-       These upgrades are expected to require 2 hours of downtime each.


An upgrade of the downstream databases, ATLDSC and LHCBDSC, to Oracle has been performed on Thursday 5th June.


More details of CERN interventions are always available on the IT status board.


New database clusters at PIC for ATLAS joined the Streams environment for ATLAS on Monday 2nd June, using a new procedure based on Transportable Tablespaces in order to reduce the re-instantiation time.

3.4      Sites and Experiments

The sites report (slide 6) focuses on BNL and NL-T1. Is this because the other sites do not report their issues, or because they do not have any?


The Experiments have the information available in their logbooks, but it is not yet clear what is worth reporting to the MB.

-       LHCb: elog summary of weekly activities expected Monday afternoon (after weekly meeting). Preliminary conclusions from May run:'08+Observations/106

-       CMS: Phase 2 operations elog:

-       ATLAS: Plans for last week:'08+Observations/104

-       ALICE: The VO-boxes at all ALICE sites have been updated with SLC4, gLite 3.1 and the new AliEn services, and have been in production for one week. Data taking: detector calibration data since mid-May, replicated quasi-online to the T1s at ~50 MB/sec (see GridView plots); data taking with the full detector complement starts this week (alignment, calibration). Processing: calibration data are processed and put in the conditions database; user and detector expert analysis is ongoing on the Grid and the CAF.


I.Bird commented that what is relevant to the MB should be reported and that in his opinion the current weekly reports correspond well to what is needed by the MB.

3.5      Changes to Mailing Lists and Wiki

Mailing List - The CCRC08 mailing list will be renamed to become the WLCG Operation mailing list.


Update 14 June 2008: the CCRC list has been renamed as


Wiki - A new wiki focussing on on-going service issues (alone) will be prepared in coming weeks, summarizing the key links / pages that we have learned we need to run and monitor the service.

3.6      Conclusions

The service runs smoothly, most of the time. Problems are typically handled rather rapidly, with a decreasing number that require escalation. Most importantly, we have a well-proven “Service Model” that allows us to handle anything from “Steady State” to “Crisis” situations.


We have repeatedly proven that we can, typically rather rapidly, work through even the most challenging “Crisis Situation”. Typically, this involves short-term work-arounds followed by longer term solutions. It is essential that we all follow the (rather soft) “rules” of this service model, which has proven effective.


I.Fisk asked how May reliability for CERN could be 100% even if there was a power cut at CERN.


New Action:

A.Aimar will ask for information about CERN’s reliability data during the power cut.


F.Hernandez commented that IN2P3 cannot have the same person participating in all daily meetings. Are these daily meetings going to continue?

J.Shiers replied that the daily meeting will continue and be a place where everybody can find and distribute updated information. The minutes should be read by everybody because they contain concise and useful information. Participation is not mandatory but the daily meeting is the standard time to collect and distribute information.


4.   ATLAS QR Presentation (Slides) - D.Barberis


D.Barberis presented the Quarterly Report (March-May 2008) for ATLAS.

4.1      Software Releases

The ATLAS Software has progressed and produced several releases during the quarter.


Release 14.0.10 was deployed on 9 April and focused on the M7 Cosmics (May) data-taking period.


Release 14.1.0 was deployed on 1 May.

-       The release for FDR-2 Tier-0 processing and monitoring. Using validated 14.1.0.Y (bug-fix cache built daily for rapid turn-around)

-       Used also for global cosmics and initial single beam running

-       Baseline turn-on simulation production


Release 14.2.0 was built last week.

-       Intended for initial colliding beam running in the summer.

-       Essential to focus on robustness and technical performance (cpu/memory) within the 14.2.X branch

-       Completion release 14.2.10 will be built in ~2 weeks after the first round of validation

-       Full deployment in mid-July, to be ready for LHC start-up.


Release 15.0.0 (baseline for 2009 running) foreseen for late Autumn 2008


The general strategy is to always have 3 open bug-fix projects in TagCollector:

-       AtlasP1HLT for online bug-fix patches, built if/when needed

-       AtlasPoint1 for Tier-0 reconstruction and monitoring bug-fixes

-       AtlasProduction for simulation production on the Grid

All patches to the current release are included (if applicable) into the future release development.

4.2      Data Distribution Tests

The Throughput Tests (TT) continue (a few days/month) until all data paths are shown to perform at nominal rates. This includes:

a) Tier-0 to Tier-1s to Tier-2s for real data distribution. This is working almost everywhere, and only with SRM2 endpoints.

b) Tier-2 to Tier-1 to Tier-1s → Tier-2s for simulation production. This has been part of simulation production for a long time.

c) Tier-1 to/from Tier-1 for reprocessing output data. It was run in all combinations last month.


The Functional Test (FT) is also run automatically, approximately once/month. The FT consists of low-rate tests of all data flows, including performance measurements of the completion of dataset subscriptions. It runs in the background, without requiring any special attention from the site managers, and is useful because it checks the response of the ATLAS DDM and Grid m/w components as experienced by most end users.

4.3      Tier-1 to Tier-1 Transfers

ATLAS is actively testing all the links among Tier-1 sites. Below is the cumulative data transfer among all Tier-1 ATLAS sites and on SRM2 endpoints.



Below one can also see the delay of the transfers between Tier-1 sites.

Fraction of delayed (>24 hours) datasets during transfer between all ATLAS Tier-1s.


The total sample to be transferred consisted of 629 datasets:

Source Dataset Distribution:

-       INFN-T1_DATADISK:                        35

-       TAIWAN-LCG2_DATADISK:             28

-       IN2P3-CC_DATADISK:                     78

-       FZK-LCG2_DATADISK:                    71

-       PIC_DATADISK:                                36

-       BNL-OSG2_DATADISK:                   174

-       TRIUMF-LCG2_DATADISK:             28

-       NDGF-T1_DATADISK:                      26

-       SARA-MATRIX_DATADISK:             101

-       RAL-LCG2_DATADISK:                    52
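As a quick consistency check, the per-site counts listed above can be summed and turned into shares of the sample. The sketch below only re-uses the numbers from the list; the variable names are illustrative.

```python
# Source dataset counts per Tier-1 endpoint, copied from the list above.
source_datasets = {
    "INFN-T1_DATADISK": 35,
    "TAIWAN-LCG2_DATADISK": 28,
    "IN2P3-CC_DATADISK": 78,
    "FZK-LCG2_DATADISK": 71,
    "PIC_DATADISK": 36,
    "BNL-OSG2_DATADISK": 174,
    "TRIUMF-LCG2_DATADISK": 28,
    "NDGF-T1_DATADISK": 26,
    "SARA-MATRIX_DATADISK": 101,
    "RAL-LCG2_DATADISK": 52,
}

total = sum(source_datasets.values())
# Share of the whole sample originating at each site, in percent.
shares = {site: 100.0 * count / total for site, count in source_datasets.items()}

print(f"Total datasets: {total}")
for site, count in sorted(source_datasets.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{site:<24}{count:>5}{shares[site]:>7.1f}%")
```

The counts add up to the quoted 629 datasets, with BNL being the source of roughly 28% of the sample.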


The delays noted concern datasets delayed by more than 24 hours and are being investigated. Some sites are better at sending than at receiving data because they use different systems for the “to” and “from” transfer functions.

4.4      Distributed Production Simulation

Simulation production (slides 6 and 7) continues all the time on the 3 Grids (EGEE, OSG and NorduGrid). The rate is limited by the needs and by the availability of data storage more than by disk resources. Validation of simulation and reconstruction with release 14 is still in progress.


The picture below shows the inter-cloud transfers (a cloud being a Tier-1 and all the Tier-2 sites associated with it).



The picture below shows the number of jobs running in the ATLAS system. Typically there are about 18,000-20,000 at any moment, using 80% of the Tier-1 capacity.


4.5      Tier-0 and CAF organization

The picture shows the architecture of the ATLAS system at the Tier-0 and the CAF. Priority is given to groups rather than to single users, and currently users cannot obtain files directly from tape.



4.6      Storage Classes at a Typical Tier-1 Site

The picture below was not commented on but is included for reference only. This original design is actually not implementable with the current SRM implementations.


Note: For discussion only; all numbers are indicative and not to be used as reference.




4.7      Evolution of the Storage Model

The definition of ATLAS storage classes for 2008 is still in progress. ATLAS tried to implement a reasonable set of storage classes (and storage tokens) for CCRC’08-2 and FDR-2, using only SRM v2.2 end points. Not all of the original model can be implemented, as the T1D1 class does not work as in the SRM 2.2 specifications, neither in dCache nor in Castor.


ATLAS are therefore moving towards a more conservative approach:

-       Treat T1D0 and T0D1 as completely separate storage instances

-       Copy data twice in case they have to go to disk and tape at destination, or trigger an internal copy from disk to tape. This is the case of samples of RAW and ESD produced at Tier-0, and all simulation output from Tier-2s.


In this way ATLAS lose the T1D1 functionality but retain control of which datasets are on disk. More control comes with more active data management on ATLAS’ side.

4.8      Plans

The coming Software releases are:

-       14.X.Y releases will include bug fixes only for HLT/Tier-0 and Grid operations

-       Release 15.0.0, in late autumn 2008, will include feedback from 2008 running and be the base release for 2009 operations.


Cosmic runs:

-       M7-M8. Continuing in June-July 2008.

-       Continuous mode will then be started and run in parallel with initial LHC operations


For collision data, ATLAS Software and Computing will be ready to go from mid-July.


I.Bird asked whether the overall situation is positive, except for the T1D1 issues that will be managed by ATLAS by duplicating and controlling the data location.

D.Barberis replied that the main issues are in the distributed computing area (T1D1 in particular) but the global situation seems positive.


5.   CMS QR Presentation (Slides) - M.Kasemann

CMS had 2 parallel important activities during the quarter:

-       CSA, the Computing and Software Analysis challenge, to provide simulation data and prepare the detector setup, calibration, etc. The CSA goals are those of a CMS-specific exercise strongly tied to the Commissioning and Physics schedule. It exercises the first 3 months of data taking and is completely driven by the requirements of CMS Commissioning and Physics.

-       CCRC08, to participate in the verification of the CMS and WLCG services and infrastructure. The CCRC goals are those of a multi-VO computing scale test. CCRC activities were performed to augment the CSA loads with additional computing tests and to ensure that all systems are stress tested end-to-end.


The picture below shows the work performed for each week of May 2008 at the Tier-0/1/2 and the CAF.



Below is a reminder of the CMS data flow from the detector to the MSS and to the Tier-1 and Tier-2 sites. The integral of the data flow reached 600 MB/s.


5.1      CSA08: Simulating Data

One important activity was the production of MC samples with CMSSW 2.0. The focus was on 2 particular scenarios expected in 2008, each corresponding to 7 days of total running:

-       “S43”: with 43 x 43 bunches, L~2x10^30, 1 pb^-1, 120M events

-       “S156”: with 156 x 156 bunches, L~2x10^31, 10 pb^-1, 120M events

The samples overlap substantially; the total production was ~150M events.
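The integrated luminosities quoted for the two scenarios can be cross-checked against 7 days of running. The sketch below assumes that the luminosities are 2x10^30 and 2x10^31 in units of cm^-2 s^-1 (the minutes give only the numbers) and that the machine delivers collisions 100% of the time, which is an idealisation; it uses 1 pb^-1 = 10^36 cm^-2.

```python
SECONDS_PER_DAY = 86_400
CM2_PER_INVERSE_PB = 1e36  # 1 pb^-1 corresponds to 1e36 cm^-2


def integrated_luminosity_pb(inst_lumi_cm2_s, days):
    """Integrated luminosity in pb^-1 for a constant instantaneous luminosity.

    Assumes continuous running (100% live time) over the whole period.
    """
    return inst_lumi_cm2_s * days * SECONDS_PER_DAY / CM2_PER_INVERSE_PB


# Scenario luminosities; the cm^-2 s^-1 unit is an assumption of this sketch.
for name, lumi in (("S43", 2e30), ("S156", 2e31)):
    print(f"{name}: {integrated_luminosity_pb(lumi, 7):.2f} pb^-1")
```

Under these assumptions the result is about 1.2 pb^-1 for S43 and about 12 pb^-1 for S156, the same order as the rounded 1 pb^-1 and 10 pb^-1 quoted above.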


The conditions selected were: no pile-up, a complete detector, and zero suppression.

The Generator Streams were:

-       Physics samples (minbias, jets, leptons) + technical samples cosmics

-       Assume Storage Manager bandwidth can sustain 300 MB/s

-       Primary Datasets defined by generator

5.2      CSA08: Alignment and Calibration

The intention was to have a situation similar to real-data start-up, with the full complexity of almost 20 concurrent alignment and calibration workflows (with interdependencies).

AlCaReco streams with realistic skim filters were used, partially with event content reduction at HLT.

All workflows executed on the CAF by seven teams from DPGs, with more than 30 persons directly involved.

5.3      CSA08: Production

Production of samples started with a delay of 11 days. Events were produced at T0, T1 and T2 and up to 14000 CPU-slots were used. In total 150M events were produced in about 2 weeks.


CSA08 processing started on May 5 (as planned). Production was mostly back on schedule 1 week into the schedule and later completely back on schedule.



Below is the summary for the S43 and S156 events produced, a total of about 150M in just 2 weeks.




After the challenge there is now the clean up to perform:

-       Cleanup PreCSA08 pre-production from tape at T1 sites and CERN

-       Cleanup of all (tape) samples at CERN

-       Provide opportunity to T1 sites to cleanup for non-custodial samples

5.4      CSA08: CAF Usage

The CAF was heavily used during CSA08: more than 500 TB of data were produced, as shown in the table below, and there were still 300 TB of disk free at the end of the activity. Only 3 activities were allowed and coordinated: Alignment and Calibration, early Physics Analysis, and DPG Quality, for a total of 283 registered users, of which 81 were active in May.


5.5      Tier-0 Workflows

Performance was very good for CSA and CRUZET (CMS cosmics runs) processing/reprocessing, which produced about 40 TB of data.

Reconstruction as such seems fine, and the job submission and WM systems hold up.


CMS still wants to perform an end-to-end test, to be scheduled depending on the availability of experts, in order to “re-play” CRUZET data through the re-packer and prompt reconstruction and move the data out to the Tier-1s. The test should also assess the latency from “taking data” through prompt processing to primary datasets being available at the Tier-1s.


Also still open is a test of a rate of 300 MB/sec from Pit 5 for 1 full week, which should be scheduled during June, preferably before the WLCG Workshop (June 13/14).

5.6      Transfers from Tier-0 to Tier-1 Sites

CMS had defined specific targets (total nominal 600 MB/s) for each Tier-0 to Tier-1 transfer, with an “Acceptable” threshold of 425 MB/s in total and an “Exceptional” target of 850 MB/s, the latter being very useful when catch-up transfers have to be performed.
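The three total-rate levels can be read as a simple classification of a measured aggregate Tier-0 export rate. The sketch below is illustrative only: the function name, the returned labels and the sample rates are hypothetical, and it assumes that all three thresholds refer to the total rate summed over the Tier-1 sites.

```python
# Total Tier-0 to Tier-1 rate levels in MB/s, as quoted in the minutes.
ACCEPTABLE_MB_S = 425
NOMINAL_MB_S = 600
EXCEPTIONAL_MB_S = 850


def classify_total_rate(rate_mb_s):
    """Map a measured total export rate (MB/s) to the level it reaches."""
    if rate_mb_s >= EXCEPTIONAL_MB_S:
        return "exceptional"
    if rate_mb_s >= NOMINAL_MB_S:
        return "nominal"
    if rate_mb_s >= ACCEPTABLE_MB_S:
        return "acceptable"
    return "below acceptable"


# Hypothetical sample rates, not measurements from the challenge.
for rate in (300, 450, 620, 900):
    print(f"{rate} MB/s -> {classify_total_rate(rate)}")
```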


Below is the break-down for each site.


It shows the performance measured and the result vs. the target. Each site should reach the target on at least 3 consecutive days.

The total was reached on 16-18 May.


Many sites reached the extra targets (ASGC and PIC in particular) but a few sites just managed to reach the targets and had difficulties (FZK and FNAL).


5.7      Transfers Tier-1 to Tier-1 Sites

The use case was to check whether a sample of about 30 TB of AOD can be redistributed within 4 days after re-reco.

The chosen sample was 28.6 TB in size, about 250M AOD events, which scales to 3 days at nominal rates.
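The aggregate rate implied by these figures follows from dividing the sample size by the time window. The sketch below assumes decimal units (1 TB = 10^6 MB) and a constant rate over the whole window; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1e6  # decimal units assumed: 1 TB = 1e6 MB


def average_rate_mb_s(volume_tb, days):
    """Average aggregate rate in MB/s needed to move volume_tb in the given days."""
    return volume_tb * MB_PER_TB / (days * SECONDS_PER_DAY)


print(f"28.6 TB in 3 days: {average_rate_mb_s(28.6, 3):.1f} MB/s")
print(f"28.6 TB in 4 days: {average_rate_mb_s(28.6, 4):.1f} MB/s")
```

Under these assumptions, moving the 28.6 TB sample in 3 days corresponds to an aggregate of roughly 110 MB/s, and the 4-day target to roughly 83 MB/s.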


All sites passed the metric, including sending/receiving up to 14 TB (FNAL).

The picture below shows the time between first and last transfer to be ready (e.g. ASGC: transfers took between 0.5 and 2.5 days).



CMS now wants to repeat it, moving data from T1 sites to the CERN-CAF.

5.8      Transfers from Tier-1 to Tier-2 Sites

The goal was to exercise the full matrix of regional and non-regional transfers in Prod.

-       Regional and non-regional T2s enter in the same way - with different target rates though.

-       Started May 15; 2 cycles planned

-       The transfer metrics are: latency, participation, rate


The exercise was successful and 178/193 links could be tested. Details will be presented at the WLCG Workshop later this week.

5.9      CMS Job Submission

CMS submitted (from/to Tier-0/1/2 sites) about 100K jobs per day, with a peak of 200K in mid-May, as shown below.


Slide 14 shows the break-down of the jobs run per day by CMS.

5.10    Re-reconstruction and skimming at the Tier-1 Sites

CSA re-processing was a great success, with high throughput and great performance of the Tier-1 sites: 127M and 106M events within ~5 days each (1M events is 1h of data@300Hz). Nominally 6300 slots were available, but more could be used “opportunistically”.
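The rule of thumb "1M events is 1h of data@300Hz" makes it possible to express the re-processed samples as an equivalent data-taking time. The sketch below assumes a constant 300 Hz event rate with no dead time; the function name is illustrative.

```python
def hours_of_data(n_events, rate_hz=300):
    """Data-taking time in hours needed to record n_events at a constant rate."""
    return n_events / rate_hz / 3600


print(f"1M events:   {hours_of_data(1e6):.2f} h")
for n_events in (127e6, 106e6):
    days = hours_of_data(n_events) / 24
    print(f"{n_events / 1e6:.0f}M events: {days:.1f} days of data taking")
```

At exactly 300 Hz, 1M events corresponds to about 0.93 h, and the 127M and 106M samples to about 4.9 and 4.1 days of data taking, which is comparable to the ~5 days each re-processing pass took.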


CCRC Skimming of RECO and AOD data sets. The skimming exercise was useful and uncovered interesting “features”; more skimming exercises (FEVT…) will follow this week under “controlled” conditions.


Custodial storage and organizing the CMS file catalogue (TFC) and Tape Families. The tools work and provide the required flexibility.

5.11    Tier-2 Analysis

CMS tested the readiness of their Tier-2 sites. The chaotic submission exercise is still proceeding, possibly into this week. It is driven by many people, including T2 managers.

Last week T2 MC production was ramped up again to perform tests in a “realistic” scenario of MC load at T2s.

The results will be summarized at the WLCG Workshop.


In addition, centrally organized activities took place for the Tier-2 sites in order to measure the success rate for “typical” physics analysis jobs.


The performance exercise and the results were:

-       Exercise 1: modest IO.
Failure rate ~0.3 to 0.6%

-       Exercise 2: nominal IO.
Failure rate ~9%
90% of the failures are in just 3 sites where the SE failed

-       Exercise 3: local stage-out
(May 26th 10:00 - ongoing)

5.12    CCRC CMS Experience

The regular, daily operational meetings organized by WLCG (J.Shiers) were very useful:

-       Attended by experiments, T0, T1 and T2 sites

-       Experiments status and plans presented (for CMS: D.Bonaccorsi)

-       Mostly operational issues discussed


CCRC was very effective in solving operational issues but had little “cross-experiment” coordination and scheduling. Each experiment had a full schedule of activities. This should be reviewed, as there was not a high degree of planned concurrency.

Some correlated tests were planned at sites (collecting the information now).


5.13    CSA08 Conclusions

CSA08 has been very successful. Production, calibration and alignment were completed. Physics analysis is still ongoing, with no show stopper.


Calibration and alignment in CSA08 worked very well

-       S43 & S156 exercises completed on time by all sub-detectors

-       All required constants uploaded to the production database

-       Re-reconstruction could proceed on schedule


Various problems were encountered but quickly solved, and the overall time schedule for CSA08 did not slip.

Organizational challenges were mastered properly:

-       complexity of a large number of workflows

-       inter-dependences between workflows

-       management of database conditions

5.14    CCRC08 Conclusions

For CMS CCRC08 was very successful and demonstrated all key use case performances of T0, CAF, T1, T2 infrastructure.


It demonstrated simultaneity with major DPG/ALCA/Physics activities and, at the same time, stress tested the computing infrastructure with real and artificial load.


Some tests will have to happen in the coming weeks:

-       T0 end-to-end test

-       Some coordinated tests with ATLAS at T1 centres

-       Data deletion exercise

-       Review T0-T1 performance, re-test where required


The lessons learned are being compiled now and will be reviewed at computing management meetings, June 18-20 and at the WLCG CCRC workshop scheduled for June 12/13.


I.Bird asked whether the sites that had problems also had the same issues in the past or whether these are only occasional problems.

M.Kasemann replied that in May the measurements were more systematic and showed issues with some sites in load situations (e.g. FZK).

H.Marten clarified that at FZK it was the CMS representative who decided to switch off the load tests that would have tested the transfers. It was not a site problem or decision; it was a CMS decision.


Ph.Charpentier asked whether the data was accessed remotely or transferred to the WN.

M.Kasemann replied that the data is not moved locally to the WN.


M.Kasemann concluded that his main worry is about having concurrent experiments running at full data rates when real data will arrive.


6.   Job Priorities Deployment

Markus Schulz commented that the Job Priorities implementation has been certified and released to PPS and then to production.

The configuration and installation now depend on the VOs and the Sites.


I.Bird noted that someone has to follow the installation at the ATLAS sites.

D.Barberis replied that the sites know already and should be followed up.


New Actions:

For the Job Priorities deployment the following actions should be performed:

-       A document describing the shares wanted by ATLAS

-       Selected sites should deploy it and someone should follow it up.

-       Someone from the Operations team must be nominated to follow these deployments end-to-end


7.   AOB

M.Kasemann noted that S.Foffano had asked for the resources needed by the VO until 2013. Should they be specified once some real data is seen?

S.Foffano replied that the RRB meeting had decided to ask for the requirements until 2013 and that this should be completed. The figures will be updated later, but they are required by the MoU. This 5-year outlook is needed by the funding agencies.


I.Bird agreed that maybe 3 years would be sufficient, but this should be agreed by the RRB and the Scrutiny Group. It is not an MB decision.



8.   Summary of New Actions



The full Action List, current and past items, will be in this wiki page before next MB meeting.


New Action:

A.Aimar will ask for information about CERN’s reliability data during the power cut.


New Actions:

For the Job Priorities deployment the following actions should be performed:

-       A document describing the shares wanted by ATLAS

-       Selected sites should deploy it and someone should follow it up.

-       Someone from the Operations team must be nominated to follow these deployments end-to-end