LCG Management Board

Date/Time:

Tuesday 12 February 16:00-17:00 – Phone Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=27469

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 15.2.2008)

Participants:

A.Aimar (notes), D.Barberis, I.Bird (chair), Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, J.Gordon, F.Hernandez, M.Kasemann, M.Lamanna, H.Marten, G.Merino, A.Pace, B.Panzer, R.Pordes, M.Schulz, J.Shiers, O.Smirnova

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 19 February 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      Tier-1 Availability and Reliability Data - 200801 (OPS_Data)

The Tier-1 Reliability data is now available. A.Aimar will ask the sites to complete the site reports that were not filled in at the weekly Operations meeting.

 

Update: The LCG Office has distributed the monthly Report (PDF).

 

New Actions:

13 Feb 2008 - A.Aimar will ask for the Site Reliability Reports for January 2008. Sites should send them by the end of the week.

1.3      LHCC Referees Meetings (Agenda)

The meeting agenda is now fixed. The Experiments will individually present their status and progress.

 

D.Barberis asked whether the presentations should focus on any particular aspects.

I.Bird replied that the Referees expect a summary of the Experiments’ CCRC preparation and initial experience.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

·         12 Feb 2008 – Sites should start publishing their tape efficiency data in their wiki page (see https://cern.ch/twiki/bin/view/LCG/MssEfficiency)

 

Not yet done by all sites. CERN, CA-TRIUMF, ES-PIC, UK-T1-RAL and US-T1-BNL have published the weekly values. The other sites should replace the examples with their real data, where available.

 

3.   CCRC08 Update (CCRC08 Calendar; General Observations; GridView; LogBook; Slides) - J.Shiers

 

 

J.Shiers presented a summary of the initial days of CCRC08-Feb. See the Slides for all details.

3.1      Overview

The February CCRC08 run started on Monday 4 February but immediately required many face-to-face meetings; it actually went into a ‘routine schedule’ only on Thursday 7 February.

 

These first few days show that:

-       The Daily meetings seem to work well for sharing information.

-       Contacts with the Tier-2 sites and with the DB support are needed. For this purpose the “wlcg-{tier2,db}-contacts” mailing lists have been established.

 

The e-logbooks seem to be a useful way to report and track problems. The overlap with other reporting and support systems, e.g. GGUS, has to be reviewed. Nonetheless, this is a much more powerful and convenient system than a ‘diary’ in a Wiki (Twiki response time at CERN has recently been poor).

 

The services clearly still need some tuning and/or configuration, and there are additional bugs (e.g. GFAL fails for VO names longer than 15 characters). This does not affect the HEP VOs, but some sites also support other VOs, so the problem is still relevant.

 

There was a site outage (at RAL) on Thursday, with a long recovery time for some services; this highlights the need to revise the ‘site offline’ procedure.

 

J.Gordon clarified that RAL is planning to use DNS aliasing and also to have an alternative database at CNAF in order to avoid similar issues.

 

There was a high load on the CERN CEs – what is the root cause? Are we ready to handle the load of all 4 experiments concurrently?

 

M.Schulz added that the current LCG CE does not scale properly with the number of users (20 or 30 users seems to be the limit). Using pilot jobs (i.e. jobs submitted via a single user) would reduce the number of distinct users and therefore improve the situation.
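
As a purely illustrative sketch of this point (not the actual LCG CE or any experiment framework; all names and numbers below are invented), the pilot-job model funnels the payloads of many end users through a single submitting identity, so the CE only ever sees one user:

    # Minimal sketch of the pilot-job idea: many users' payloads are queued
    # centrally and executed by pilots that the CE sees as a single identity.
    # All names and numbers here are hypothetical.
    from queue import Queue, Empty

    payload_queue = Queue()  # central task queue filled by many users

    def submit_payload(user, command):
        """Called by any number of end users; no grid submission happens here."""
        payload_queue.put((user, command))

    def run_pilot(pilot_id):
        """A pilot job, submitted to the CE under ONE grid identity, pulls and
        runs payloads from arbitrary users until the queue is drained."""
        while True:
            try:
                user, command = payload_queue.get(timeout=1)
            except Empty:
                return  # nothing left to do; the pilot exits
            print(f"pilot {pilot_id}: running payload for {user}: {command}")

    # 30 users submit work, but the CE only ever saw 3 pilots from one account.
    for i in range(30):
        submit_payload(f"user{i:02d}", "run analysis")
    for p in range(3):
        run_pilot(p)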

 

There is also a suspicion of some coupling to AFS – numerous problems are ‘synchronized’ with AFS events.

 

The Service maps are very useful, but a lot of follow-up work is required to get all services green (http://gridmap.cern.ch/ccrc08/servicemap.html).

 

The most recent news is:

-       Quite a few (~12) transfer-related problems reported today.

-       A couple of patches are in the pipeline for dCache and also for FTS.

 

News from the Experiments:

-       ALICE: files failing to migrate due to an unclear file size mismatch.

-       ATLAS: Expecting larger data exports starting this week and continuing next.

-       CMS: running at 400MB/s ± 100MB/s.

-       LHCb: pit-to-Tier-0 transfers working well; not ready for the re-processing activity (waiting for the Brunel release to be deployed).

 

-       High load is regularly seen on the CERN VO boxes (ALICE, LHCb) as well as on the CEs.

 

Ph.Charpentier added that LHCb is now starting their “Tier-0 to Tier-1” transfers in order to then be able to proceed with the reprocessing at the Tier-1 sites.

3.2      Storage

The aim is to identify and resolve the key issues around storage in a pragmatic and timely manner:

-       There is extremely little time prior to the May run

-       There is also limited effort available (e.g. due to the recent FNAL budget cuts).

 

There is now a draft “Addendum” to be agreed on for the SRM v2.2 MoU (see http://tinyurl.com/2bpkje). This document is an addendum to the WLCG SRM v2.2 MoU (also referred to as the “Usage Agreement”):

-       It details changes in methods, their signatures and/or behaviour required for proper operation of the SRM v2.2 implementations under production conditions.

-       Once approved by the WLCG Management Board, this document has precedence over the MoU for the described methods and associated behaviour.

-       FNAL asked that support issues should also be covered by the addendum.

 

The list of key issues was discussed in a dedicated meeting on Monday 11 February. None of the issues is new; however, some progress has been made since the previous discussion.

 

The Experiments should provide their feedback on the list of issues and their relative priority by Friday 22 February. Further meetings will then be scheduled during the following week to discuss how – and on what timescale – these issues can be addressed.

 

There will be a re-iteration on this list during the March F2F meetings and possibly also during the WLCG Collaboration workshop.

In the immediate future, work-arounds (sometimes manpower intensive) are to be expected.

 

These are the current issues; their priorities should be discussed and agreed, and the Experiments should provide feedback by the end of next week:

 

-       Protecting spaces from (mis-)usage by generic users.
Concerns dCache, CASTOR

-        Tokens for PrepareToGet/BringOnline/SRMCopy (input)
Concerns dCache, DPM, StoRM

-        Implementations fully VOMS-aware
Concerns dCache, CASTOR

-       Correct implementation of GetSpaceMetaData
Concerns dCache, CASTOR. Correct size to be returned at least for T1D1

-       Selecting tape sets
Concerns dCache, CASTOR, StoRM. By means of tokens, directory paths?

 

New Action:

22 Feb 2008 - Experiments should provide feedback on the priorities of the current storage issues. See slide 6 of http://tinyurl.com/2yrc6b.

3.3      Databases

As reported last week, there have been quite a few interventions around the database services, both at CERN and at other sites.

This may well be due to the combined effect of the increasing usage of these services for CCRC’08-related activities, together with the additional visibility that this brings.

 

“Stand-by” databases potentially offer reliability and protection at low cost – the consequences of prolonged downtimes are significant. As last year’s “Critical Services” review tells us, the key services – as ranked by the experiments – concentrate on those related to data(-base) access.

 

The extension of stand-by (on-call) services to cover the databases is also required for CASTOR, LFC, FTS, etc., and is being studied, as is the establishment of more than 8x5 coverage for Streams.

3.4      Weekly Operations Review

As described the previous week, the weekly operations review should be based on the 3 agreed metrics:

-       Experiments' scaling factors for functional blocks exercised

-       Experiments' critical services lists

-       MoU targets

 

For the moment some information is still missing (as shown in slides 8-10).

 

Some metrics from the Experiments still need to be integrated, and the Experiments’ help is needed for this.

3.5      Critical Service Follow-up

Targets (not commitments) have been proposed for the Tier0 services; similar targets are now requested also for Tier1s/Tier2s.

 

Experience from the first week of CCRC’08 suggests that the targets for problem resolution should not be set too high, if they are to be (approximately) achievable.

 

The MoU lists targets for responding to problems; the targets for solving them could be the following:

-       Tier1s: 95% of problems resolved <1 working day?

-       Tier2s: 90% of problems resolved < 1 working day?

 

Time Interval | Issue (Tier0 Services)                       | Target
End 2008      | Consistent use of all WLCG Service Standards | 100%
30’           | Operator response to alarm / call to x5011   | 99%
1 hour        | Operator response to alarm / call to x5011   | 100%
4 hours       | Expert intervention in response to above     | 95%
8 hours       | Problem resolved                             | 90%
24 hours      | Problem resolved                             | 99%
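
As an illustration only (a minimal sketch; the ticket data and the 8-hour working day are invented assumptions, not part of the proposed targets), checking a site against a resolution target such as “95% of problems resolved in less than 1 working day” reduces to a simple fraction over the resolution times recorded in the ticketing system:

    # Minimal sketch: check a proposed target such as "95% of problems
    # resolved in < 1 working day" against per-ticket resolution times.
    # The data and the 8-hour working day are assumptions for illustration.
    resolution_hours = [2, 5, 7, 1, 30, 4, 6, 3, 8, 2]  # hours per ticket (invented)

    WORKING_DAY_HOURS = 8      # assumption: 1 working day = 8 hours
    TARGET_FRACTION = 0.95     # proposed Tier-1 target

    within = sum(1 for h in resolution_hours if h < WORKING_DAY_HOURS)
    fraction = within / len(resolution_hours)

    print(f"{fraction:.0%} resolved within 1 working day "
          f"(target {TARGET_FRACTION:.0%}): "
          f"{'met' if fraction >= TARGET_FRACTION else 'not met'}")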

3.6      Conclusions

An interim report on CCRC’08 is due for next week’s meeting with the LHCC Referees.

 

The preparations for this challenge have proceeded (largely) smoothly – we have both learnt and advanced a lot simply through these combined efforts. We are also learning a lot about how to run smooth production services in a more sustainable manner than in previous challenges. But it is still very manpower intensive and the schedules remain extremely tight. More reliable – as well as automated – reporting is really needed.

 

 

All should try to maximize the use of the up-coming F2F meetings (March, April) as well as of the WLCG Collaboration workshop in order to fully profit from these exercises.

 

And, from June onwards, WLCG will be in continuous production mode (all experiments, all sites), including tracking and fixing problems as they occur.

 

In addition, J.Shiers proposed that in the coming years similar CCRC activities should be planned every year, in preparation for the yearly start-up of data taking. There will always be new features and new target rates to test.

 

The MB agreed with this idea: CCRC-like tests of the services will be prepared every year before the restart of the data-taking and distribution periods.

 

4.   ATLAS QR Report (Slides) – D.Barberis

 

D.Barberis presented the ATLAS QR report, an update on the information presented in November.

4.1      Data Distribution Tests

The ATLAS throughput tests will continue, a few days per month, until all data paths are shown to perform at nominal rates. These include:

-       Tier-0 to Tier-1s to Tier-2s for real data distribution.
Now working almost everywhere. Run again in January (but with SRM v1 end-points, see later slide)

-       Tier-2 to Tier-1 to Tier-1s to Tier-2s for simulation production.
Has been part of the simulation production for a long time

-       Tier-1 to/from Tier-1 for reprocessing output data.
Started with the BNL-IN2P3CC-FZK combination

 

The Functional Test will also be run in the background approximately once/month, in an automatic way:

-       Consists of low-rate tests of all data flows, including performance measurements of the completion of dataset subscriptions (a toy sketch of such a measurement follows this list)

-       Is run in the background, without requiring any special attention from site managers

-       Checks the response of the ATLAS DDM and Grid m/w components as experienced by most end users
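
As a toy illustration only (the subscription records below are invented, and this is not the ATLAS DDM interface), the completion measurement amounts to comparing subscription and completion timestamps for each dataset and summarising the delays:

    # Toy sketch: summarise how long dataset subscriptions took to complete.
    # Timestamps are invented; real values would come from the DDM bookkeeping.
    from datetime import datetime

    subscriptions = [
        # (dataset, subscribed_at, completed_at) - hypothetical records
        ("dataset_A", datetime(2008, 1, 10, 9, 0), datetime(2008, 1, 10, 15, 0)),
        ("dataset_B", datetime(2008, 1, 10, 9, 0), datetime(2008, 1, 11, 1, 0)),
        ("dataset_C", datetime(2008, 1, 11, 9, 0), datetime(2008, 1, 11, 12, 30)),
    ]

    delays_h = [(done - sub).total_seconds() / 3600 for _, sub, done in subscriptions]
    delays_h.sort()

    print(f"datasets: {len(delays_h)}, "
          f"median completion: {delays_h[len(delays_h) // 2]:.1f} h, "
          f"worst: {max(delays_h):.1f} h")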

 

Below is an example of transfers executed in January 2008.

-       Files are sent from the Tier-0 to BNL

-       From BNL they are moved to FZK and to LYON

-       The FTS channels still need to be tuned in order to reach better data rates.

[Figure: January 2008 throughput test – transfer rates from the Tier-0 to BNL and on to FZK and LYON]

 

F.Hernandez asked what happens if VOs want different FTS parameters on the same site.

D.Barberis replied that this situation can be configured, but it will take some time to find values suitable for all VOs.
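
Purely as an illustration of the kind of compromise involved (a hedged sketch: the channel name, VO names, parameter names and values below are invented, and this is not the actual FTS configuration interface), one naive policy is to take, for each tunable parameter on a shared channel, the most conservative value requested by any VO:

    # Illustrative sketch only: per-VO wishes for transfer-channel parameters
    # and one naive compromise (the most conservative value per parameter).
    # Channel/VO names, parameter names and numbers are all invented.
    requests = {
        "CERN-BNL": {
            "atlas": {"concurrent_files": 30, "tcp_streams": 10},
            "cms":   {"concurrent_files": 20, "tcp_streams": 5},
        },
    }

    def compromise(channel):
        """Take the minimum requested value of each parameter across VOs."""
        per_vo = list(requests[channel].values())
        params = {name for vo in per_vo for name in vo}
        return {p: min(vo[p] for vo in per_vo) for p in params}

    print("CERN-BNL settings:", compromise("CERN-BNL"))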

4.2      Distributed Simulation production

The simulation production continues all the time on the 3 Grids (EGEE, OSG and NorduGrid) and has reached 1M events/day. The rate is limited by the needs and by the availability of data storage more than by the CPU resources.

 

The validation of simulation and reconstruction with ATLAS SW release 13 is still in progress.

 

The graphs below show the activity over the last three months.

 

Simulated data must be transferred to other sites, including the RDOs to CERN for the FDR-1 data preparation exercise.

 

[Figure: simulation production and data transfer activity over the last three months]

4.3      Global schedule: M*, FDR & CCRC’08

FDR must test the full ATLAS data flow system, end to end:

-       SFO to Tier-0 to calib/align/recon to Tier-1s to Tier-2s to analyse

-       Stage-in (Tier-1s) to reprocess to Tier-2s to analyse

-       Simulate (Tier-2s) to Tier-1s to Tier-2s to analyse

 

The SFO to Tier-0 tests interfere with the cosmic data-taking. ATLAS will decouple these tests from the global data distribution and distributed operation tests as much as possible.

 

The CCRC’08 must test the full distributed operations at the same time for all LHC experiments. This is also requested by Tier-1 centres to check their own infrastructure.

 

The ATLAS proposal is to decouple CCRC’08 from M* and FDR.

-       CCRC’08 has to have fixed timescales as many people are involved

-       CCRC’08 can use any datasets prepared for the FDR, starting from Tier-0 disks

-       CCRC’08 can then run in parallel with cosmic data-taking

-       Possible interference with the Tier-0 and the total load have to be checked

-       Cosmic data distribution can be done in parallel as data flow is irregular and on average much lower than nominal rates

4.4      ATLAS FDR-1

The original aim was to prepare 10 hours of run at a luminosity of 10^31 and one hour at 10^32, using release 12 for the simulation and release 13 for the trigger and byte-stream generation code. This would have implied using:

-       5 physics streams (e.m., muons/B-phys, jets, taus, min_bias)

-       Express stream (10% of nominal rate)

-       Inner detector alignment stream (as example of calib/align data streams)

 

Getting the trigger and mixing code to work took longer than anticipated. Event mixing requires a large amount of simulation output files from different physics channels in the same location (CASTOR at CERN); this was not easy given the current general disk space crisis.

Most of the trigger selection code was delivered at the last minute and was only superficially validated. The trigger rates turned out to be a factor 3-4 lower than wanted.

 

In the end ATLAS had far fewer events than anticipated, and many small files, even though they had doubled the luminosity block size precisely to avoid small files.

 

Most SRM 2.2 end-points, space tokens and storage areas were really set up and configured only at the end of January. ATLAS could start testing them in earnest only last week (the FDR-1 week).

 

Below are some examples of workflows for calibration and reconstruction respectively.

 

 

 

 

[Figure: Calibration workflow]

[Figure: Reconstruction workflow]

 

4.5      Detailed FDR-1 Daily Log

Day 1: Mon 4th Feb

-       Decided to continue event mixing using the Tier-0 farm to have a more reasonable event sample to work with. Start of FDR delayed.

Day 2: Tue 5th Feb

-       Run started at 9 am. 8 runs in total, 1 hour each.

-       Processing of express stream on Tier-0 started to produce monitoring and calibration data.

Day 3: Wed 6th Feb

-       New run started at 9 am. Same events, new run numbers.

-       Processing of express stream as before.

-       4 pm: sign-off by Data Quality group of Tuesday data; start of bulk reconstruction.

-       More testing of SRM 2.2 end points and storage areas. No transfer yet…

Day 4: Thu 7th Feb

-       New run started at 9 am. Same events, new run numbers.

-       Processing of express stream as before.

-       4 pm: sign-off by Data Quality group of Wednesday data; start of bulk reconstruction.

-       Processing of Tuesday bulk completed.

-       More testing of SRM 2.2 end points and storage areas. NIKHEF problem. RAL power cut.

Day 5: Fri 8th Feb

-       4 pm: NO sign-off by Data Quality group of Thursday data as there was a mix-up with updated Inner Detector alignment constants. Express stream processing restarted. Bulk reconstruction started later on.

-       Processing of Wednesday bulk completed.

-       More testing of SRM 2.2 end points and storage areas. NIKHEF problem. RAL power cut.

Day 6-7: Sat-Sun 9-10th Feb

-       Tier-0 processing completed

-       Should have finally started the data transfer to Tier-1s, but more configuration problems were hit

Day 8: Mon 11th Feb

-       Data transferred to Tier-1s. So little data that it took only one hour (not a stress test!)

 

Several post-mortem meetings are taking place to analyse the results and the issues discovered.

-       Mon 11 Feb: data preparation steps

-       Wed 13 Feb: Tier-0 processing and data export operations

-       Tue 19 Feb: data quality assessment and sign-off procedures

 

Slides 14 and 15 show the kind of tools and displays used to monitor the FDR-1 execution.

4.6      ATLAS Plans

Software releases:

-       13.2.0: released last week; targeted at the M6 run in March.

-       14.0.0: base release available end February 2008. Includes LCG_54, new tdaq-common, new HepMC, completion of the EDM work for Trigger records and optimisation of the persistent representation.

-       14.X.0 releases: controlled production releases every 4-6 weeks.

-       14.X.Y releases: bug fixes only, for HLT/Tier-0 and Grid operations.

Cosmic runs:

-       M6: beginning of March 2008.

-       Continuous mode: start immediately with detector-DAQ integration and commissioning weeks.

FDR:

-       Phase II: early May 2008 (to be discussed, possibly before the start of the continuous data-taking mode with the complete detector).

CCRC’08:

-       Phase I: February 2008 (after FDR-1). Test SRM 2.2 everywhere in earnest, using realistic loads and file sizes.

-       Phase II

4.7      Conclusions

ATLAS considers the FDR-1 exercise very useful and has learned important lessons about:

-       Data concentration at CERN

-       Event mixing (jobs with many input files)

-       Late delivery of crucial software components is not a good idea, both for the ATLAS software and for the SRM 2.2.

 

-       The data quality loop was tried for the first time and needs some adjustment but basically works

-       The calibration procedures were also attempted for the first time but they need much more thinking and testing

-       Tier-0 internals are not a worry, except for operations manpower (shifts not yet tried)

 

They are looking forward to CCRC’08-1 starting now!

 

 

5.   VOBOXes and 24x7 Support at the Sites (HLM_20080205; HLM_20080212; Reports Received) - Sites Roundtable

 

Some Sites had sent their reports (Reports Received) about the status of the 24x7 and VOBoxes support milestones.

 

The others were asked by I.Bird to send their written statements to A.Aimar before the end of the week:

-       ASGC:
Not represented at the meeting.

-       INFN:
VOBoxes support: INFN proposed an SLA document a few weeks ago, which was approved by the Experiments’ representatives at CNAF.
24x7 support: available for critical issues (power, cooling, etc.) but not for the specific storage and grid services.

-       NDGF:
24x7: the support is defined in a document, but the staff is not yet in place to implement it 24x7.
VOBoxes: The SLA is not fully agreed yet.

-       RAL:
RAL is preparing a written explanation. 24x7: they are checking their alarms. VOBoxes: the milestones are done.

 

6.   AOB

 

No AOB.

 

7.   Summary of New Actions

 

The full Action List, current and past items, will be in this wiki page before the next MB meeting.

 

New Actions:

13 Feb 2008 - A.Aimar will ask for the Site Reliability Reports for January 2008. Sites should send them by the end of the week.

 

New Action:

22 Feb 2008 - Experiments should provide feedback on the priorities of the current storage issues. See slide 6 of http://tinyurl.com/2yrc6b.