LCG Management Board

Date/Time

Tuesday 6 October 2009 16:00-17:00 – Phone Meeting 

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=62561

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 12.10.2009)

Participants

A.Aimar (notes), I.Bird (chair), M.Bouwhuis, D.Britton, N.Brook, T.Cass, L.Dell’Agnello, D.Duellmann, M.Ernst, I.Fisk, S.Foffano, Qin Gang, M.Girone, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, M.Lamanna, P.Mato, G.Merino, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout 

Invited

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 13 October 2009 16:00-18:00 – F2F Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes of the previous meeting.

1.2      Pledges - prep for RRB (Paper) – I.Bird

The table with the latest pledges is available (Link).

It was difficult to obtain the data from the Sites, but the data is needed because it will be used for the RRB. Sites that do not have final data yet should inform S.Foffano, and a footnote should be added to explain the situation.

 

A.Heiss reported that DE-KIT will be able to give more precise data only for the Overview Board in November, but not for the RRB next week.

S.Foffano replied that she should be informed asap, in writing, if any clarification note is needed in the document.

1.3      SAM Reports 200909 (Tier1_Reliab_200909.zip; Tier2_Reliab_200909.pdf) – A.Aimar

A.Aimar distributed the Tier-2 SAM reports to the Collaboration Board and GDB and the Tier-1 Report to the MB for comments.

 

2.   Action List Review (List of actions)

 

  • 5 May 2009 - CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

Removed. No news but R.Wartel agreed to remove the item and wait for the next security tests in the near future.

  •  Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar.

Not done by: FR-CCIN2P3 and NDGF
Sites can provide what they have at the moment.

See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics

Sites should send URLs pointing to the information they currently have, until they can provide the required information.

NL-Tier-1 was added to the SLS metrics page on the following day. Only CC-IN2P3 and NDGF are left.

NDGF and FR-IN2P3 reported that they will send an update with the estimated dates for completion.
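For illustration only, a minimal sketch of how a Site could fetch and inspect such a tape-metrics XML file before sending its URL is shown below; the URL is a placeholder and no particular SLS schema is assumed.

# Minimal sketch (Python): fetch a tape-metrics XML file and list its elements
# so the content can be checked before the URL is sent to A.Aimar.
import urllib.request
import xml.etree.ElementTree as ET

url = "https://example-site.org/wlcg/tape_metrics.xml"  # placeholder URL

with urllib.request.urlopen(url, timeout=30) as response:
    tree = ET.parse(response)

# Print every element tag together with its text, if any
for element in tree.getroot().iter():
    print(element.tag, (element.text or "").strip())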

 

3.   LCG Operations Weekly Report (Slides) – M.Girone
 

 

Summary of the status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

With the start of October (LHC cool-down, experiments shifting to data-taking / analysis-challenge mode, etc.), some small changes to “SCOD procedures” are being implemented:

-       More formal handover from SCOD-to-SCOD on Monday of each week with summary of key issues including those to be followed up.

-       Closer follow-up of LHC operations and startup via proposed participation in LHC weekly operations planning (LHC Programme Coordination)

-       Monitoring of meeting participation, to record who really participates in the meeting and who never shows up – are there trends?

-       A WLCG SCOD e-log or blog?

-       Reminder that during the week(s) on duty the SCOD is expected to devote a large fraction of their time to these activities.

 

Alarm tickets were tested during this week – there are still some problems:

-       An SIR is requested from ASGC for DB-related problems. The database needed a point-in-time recovery but the data still seem to be corrupted. Streams replication is stopped. This followed a scheduled intervention on Sunday 27 September. No entry was found in GOCDB. Under investigation.

-       VOMS problems, with the system restarting many times from the end of September to October 4th. Now fixed, but the voms-proxy-init time has increased by a factor of 3. Under investigation.

 

Qin Gang reported that the problem is that their tablespace is not synchronized with the other Tier-1 Sites.

M.Girone clarified that there are further problems beyond the tablespace. ASGC seems unable to recover the data and will need to be resynchronized from another ATLAS Site.

J.Shiers remarked that the 1 FTE needed and promised at ASGC is not yet available.

 

The other main news items of this week are:

-       RAL upgraded to SRM 2.8-1. CERN still runs SRM 2.8-0 for LHCb, SRM 2.7-18 for others

-       Scheduled intervention at IN2P3 for the upgrade from pnfs to Chimera

-       dCache “golden release” has been announced

-       New ROC for Latin-America created

 

I.Bird asked for an update on the migration to Chimera at IN2P3.
F.Hernandez replied that the upgrade to Chimera was successful, and the bug found at IN2P3 was fixed immediately by the dCache developer who was at IN2P3 to help.

 

The Operations meeting will start keeping an attendance summary, as too many Sites still do not participate regularly.

 

Site        M   T   W   T   F
CERN        Y   Y   Y   Y   Y
ASGC        Y   Y   Y   Y   Y
BNL         Y   Y   Y   Y   Y
CNAF        -   -   -   -   -
FNAL        -   -   -   -   -
FZK         Y   Y   Y   Y   Y
IN2P3       -   -   -   -   -
NDGF        -   -   -   -   -
NL-Tier-1   Y   Y   Y   Y   -
PIC         -   -   -   -   -
RAL         Y   Y   Y   Y   Y
TRIUMF      -   -   -   -   -

 

R.Tafirout recalled that there is an agreement that TRIUMF connects only if there are important issues.

 

There were many Test Alarm tickets in the GGUS summary below.

 

VO       User   Team   Alarm   Total
ALICE       0      1       6       7
ATLAS      11     43       0      54
CMS         2      1       8      11
LHCb        1     23       8      32
Totals     14     68      22     104

 

The Test Alarms were not properly received by:

-        RAL: missing entries in GOCDB – once fixed the CMS test alarm was received

-        FZK: being investigated

-        CERN: did not receive the e-mail (being investigated)

-        CNAF: ALICE and LHCb: sent email to IT-INFN-CNAF instead of INFN-Tier-1

 

The tests for ATLAS, CMS and LHCb will be (re)done this week.

 

L.Dell’Agnello reported that INFN-Tier-1 is the name in GOCDB and that CNAF tried but could not reply to the test. The reason is unclear and will be investigated.

J.Gordon reported the same problem for RAL with the GOCDB Site name.

G.Merino asked whether the “Emergency Email” in GOCDB is the address used for Tier-1 alarms. This was never clearly announced.

J.Shiers agreed that this point should be clarified.

3.2      VO SAM Availability

Slide 6 shows the usual summary of the VO SAM tests for the four LHC Experiments.

 

One can notice that:

-       IN2P3: Chimera upgraded on Tuesday / Wednesday.
From F.Hernandez: “This short note is just to let you know that the SRM service at CC-IN2P3 is back again. We are now running dCache 1.9.4-3/Chimera. The FTS channels are open again. The scheduled shutdown is finished.”

-       FZK: problems with tests seen by CMS to be followed up

-       PIC: problems with tests seen mid-week by both ATLAS and CMS.

-       RAL: problems seen by ATLAS at w/e with CASTOR DB (multiple h/w failures). Might need a point-in-time recovery.

3.3      Outlook

The LHCb FEST foreseen for last week was moved by two weeks due to a DaVinci options file problem.

 

We expect WLCG operations to ramp up to something like “STEP-09” level – but it will last for many months and not just two weeks.

-       ATLAS throughput tests foreseen to start this week (approximate sustained rates are noted after this list):
Mon-Tue: 5 TB/day/Tier-1
Wed-Thu: 50 TB/day/Tier-1

-       CMS analysis focused October exercise also this week (affects mostly Tier-2s)
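For scale (an approximate conversion, not from the report): 5 TB/day per Tier-1 corresponds to a sustained average of roughly 58 MB/s, and 50 TB/day per Tier-1 to roughly 580 MB/s.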

 

ATLAS User Analysis Tests from 21st-23rd October.

 

It is important to have good attendance from the Tier-0 and Tier-1 sites plus all Experiments at the Operations meeting. ALICE is often missing. Reports should always be sent and can be added directly to the Wiki or e-mailed to wlcg-scod@cern.ch.

 

CNAF started testing TSM to replace CASTOR as the tape back-end.

 

R.Pordes asked whether Sites not up to date with their security patches are going to be banned or sanctioned.

I.Bird replied that the proposal was sent to EGEE by R.Wartel but that it is an EGEE-only decision. The WLCG MB last week endorsed the proposal of monitoring and warning the Sites to urgently apply the security patches, but there has not been any WLCG decision (yet) on a banning policy. He suggested that OSG contact R.Wartel for details on the EGEE policies.

 

 

4.   ALICE Quarterly Report 2009Q3 (Slides) – Y.Schutz

 

 

Y.Schutz presented the 2009Q3 Quarterly Report for ALICE.

4.1      Data Taking

ALICE data taking with cosmics and the complete detector resumed in July 2009.

Injection tests were also performed:

-       11-12 July: TI2 transfer line test

-       25-28 September: TI2-TI8 transfer line test

 

All Tier-0 and CAF tasks are continuously exercised, except export to Tier-1 and Pass-2 reconstruction, in order to save storage.

 

 

All data are reconstructed in a first pass at Tier-0. Reconstructed data are analyzed on the two Analysis Farms, at CERN and GSI, and on the Grid.

4.2      MC Production

ALICE runs MC at all sites, with data replicated to 3 different SEs, all of which must have xrootd enabled. ALICE only runs MC when needed (i.e. no “keeping processors warm” strategy).

 

Mainly “first physics” type MC production, with raw data reconstruction (cosmics, calibration, beam tests) done exclusively at Tier-0, in several passes. Selected data are used on the CERN AF.

 

The Analysis Train runs on the Grid for all major productions in 2009; end-user analysis runs on the CAF and the Grid, where AOD is the preferred input.

 

The graph below shows the profile of jobs running on the Grid over the last year, on average 7500 jobs per day.

 

 

CAF usage will be increased by a factor of 3 by November 2009. Below is the current usage by user, with about 200 users on average.

 

4.3      Analysis Train

Several trains have been running during the quarter.

 

4.4      AliRoot Software

The online-offline QA framework and the PROOF-based parallel reconstruction have been deployed and are fully operational.

Focus was on the validation and tuning of the offline framework with cosmic data and on the optimization of memory and CPU consumption.

 

Studies are ongoing on the design of multi-threaded reconstruction applications and on the validation of the GEANT4 and FLUKA transport models against GEANT3.

4.5      Migration to CREAM and SL5

The CREAM CE is now deployed at 50% of the ALICE sites; therefore ALICE still needs dual CREAM/WMS submission.

All ALICE software has been validated for SL(C)5. Production (RAW and MC) and user analysis are already running on SL(C)5 resources wherever available. Some Sites have SL5 WNs but SL4 VOBoxes.

 

-       CERN: 2 VOBOXes in SLC4 and one in SLC5, submitting to both types of WNs
-       KISTI: 1 VOBOX in SL4 and 1 in SL5 (CREAM VOBOX), submitting to both types of WNs
-       SUBATECH: 2 VOBOXes in SL4, small cluster available in SL5 (behind CREAM)
-       RAL: 2 VOBOXes in SL4, cluster available in SL5 (behind LCG-CE)
-       ITEP and SPBSu: full site migrated to SL5, but only one VOBOX available
-       KOLKATA: 1 VOBOX in SL4 and 1 in SL5 (CREAM VOBOX), submitting to both types of WNs
-       IPNL, IHEP, RRC-KI and GRIF_DAPNIA: full site migrated to SL5 except the VOBOX (still SL4)
-       IPNO: VOBOX migrated to SL5, but not the WNs

4.6      Milestones

New milestones for ALICE are:

-       MS-130 Jun 09: CREAM CE deployed at all sites: 50%

-       MS-131 26 Jun 09: AliRoot release ready for data taking: done

-       MS-132 14 July 09: release and deployment of AliEn v2-17: done

 

F.Hernandez asked what the issues are with using an SL4 VOBOX with a CREAM CE and WNs on SL5. The VOBOX is not yet certified for SL5, which is why it is still running on SL4 at IN2P3.

Y.Schutz replied that the certification is probably not complete and that ALICE would like to reduce the number of different configurations.

 

 

5.   CMS Quarterly Report 2009Q3 (Slides) – M.Kasemann

 

 

M.Kasemann presented the 2009Q3 Quarterly Report for CMS.

5.1      Computing Organization in 2010

Below is the organization of CMS computing for 2010, as approved by the CMS Collaboration Board. There are no changes in structure, but there are changes in names; the newly appointed people are shown in red.

 

5.2      CMS Software Releases

As announced in February 2009, CMS continued to work on adding/improving functionality in all CMSSW areas, preparing for tuning with real data.

 

There was a long development period lasting 3 months, with more than 500 package changes in the last development pre-release of CMSSW version 3:

-       CMSSW_3_1_0 was released on July 1st. Release targeted at MC production goals. Validation using 52 different pre-production samples

-       3_1_1 (July 7th): first pre-production round (~15M events)

-       3_1_2 (July 24th): second pre-production round. MC production started using 3_1_2 on July 29th

-       CMSSW_3_2_0 released on July 19th. Release targeted at CRAFT-09 data-taking, processing and re-processing.

 

The software performance improved considerably between Version 2 and Version 3 and was maintained within Version 3, as shown below.

 

5.3      Software Status

Since June the focus has been on integration and validation of CMSSW Releases (3_1_x/3_2_x) for use in MC production and CRAFT-09. Important steps have been taken to improve validation procedures (PVT).

 

The CRAFT exercise has been very valuable for adapting the software to run in data-taking conditions. The Tier-0 workflows ran very reliably, with very few failures in prompt reconstruction. The prompt calibration loop was exercised at the CAF for the first time, and online/offline patches were deployed in operation.

 

Further tests of the Tier-0/Tier-1 production systems using Monte Carlo samples are in progress. The automation of prompt calibration workflows using CRAB is also a high priority. Validation of SLC5/gcc432 is planned for CMSSW_3_3_0.

5.4      Site Readiness

Site readiness is closely monitored in CMS, with reports and follow-up during the weekly Facility Operations meetings and with additional meetings focusing on Asian, Russian and Turkish sites.

 

Substantial improvement is observed for a large number of sites, but Tier-1 site readiness is a concern, as shown in the graph below. CMS plans expert visits to improve the situation at their Tier-1 Sites.

 

Tier-2 readiness improved significantly over the last year, except during the summer. There is an improvement, but several Sites still need to be helped to improve. In particular, one can see that during the summer there was no improvement.

 

5.5      Cosmics Run

CRAFT Tier-0

The CRAFT rate of prompt reconstruction is higher than that expected for collision data. Processing and data movement are smaller because of the reduced event size and complexity. An MC-based Tier-0 test is being set up to exercise the other elements.

 

CRAFT data do not fully utilize the Tier-0 reconstruction farm, and I/O rates are also lower. Collision data will utilize the full resources. Prompt Reconstruction was introduced (for runs above 110500).

 

CRAFT Tier-1 Sites

Tier-1 traffic during CRAFT consisted of custodial transfers to RAL and CNAF, with an additional copy of the data to FNAL. It was generally successful, with an interesting exercise in recovery after the RAL cooling failure.

 

Preparation is under way for the re-reconstruction of CRAFT-08 and CRAFT-09 data at Tier-1 sites. Not all Tier-1s are equally ready to accept collision data. CMS Computing is preparing visits (FZK, IN2P3, and ASGC) and exercises during the fall, repeating STEP09 tests at some sites.

5.6      MC Production

Pre-production MC generation took place in July, with about 15M events produced within a few days.

 

Summer09 MC production started on July 29, when pre-production validation finished.

 

Production status is the following:

-       Events requested (10TeV + 7TeV): > 550 M

-       Events produced: > 500 M

 

A large number of resources could be used efficiently, with about 15k slots pledged during STEP09. A considerable amount of opportunistic resources beyond the CMS Tier-2 pledges was also used (T3, beyond-pledge, non-CMS).

5.7      Tier-2 Resource usage

Tier-2 resource usage is reported to WLCG on a monthly basis. The installed resources were queried by CMS in June 09.

 

Since May 09 CMS uses ~all Tier-2 resources (plus opportunistic resources), mostly for MC production, but analysis uses ~40% of the available slots. Tier-2 resources can be used effectively for MC and Analysis, as shown below.

 

 

5.8      LHCC and CRSG Reviews

The LHCC Mini Review of Computing Resources (July) stated “…important investment in the construction of the LHC and the detectors… physics outcome using very first LHC data should be maximized and not limited by computing resources. … current estimates suffer from large uncertainties … not an appropriate time to cut back substantially on computing resources.”

 

RRB Computing Resources Scrutiny Group: “Generally speaking the resources are well justified.”

 

Below is the comparison between the C-RSG estimates and the CMS requests:

 

5.9      Computing Support of Analysis

An Analysis Operations Task started in June (S.Belforte, F.Würthwein and J.d’Hondt as liaison) with the mandate to support the analysis of the Physics groups.

 

Data Placement Operations are part of the tasks of this group:

-       Update Tier-2 Associations.

-       Management of central space at the Tier-2s.

-       Group Skim Transfer Service.

-       Increase group space allocation to 50 TB.

 

“Analysis Operations” supports data transfer and analysis, operates the CRAB-submission servers to ease job-failure analysis, and performs the data transfers and management of “central data samples”. It also supports physics group production of “group data samples” (concept of one priority-user per group), supports the registration and transfers of “group data samples”, and provides tailored documentation and training.

 

October Physics Exercise (5-18 October 2009)

All Physics groups are asked to repeat an analysis with the latest MC results. “Data Operations” produced “secondary datasets” from MC, skimming on trigger quantities and pre-scales at the Tier-1s, and distributes them to the Tier-2s.

 

There are October-exercise contacts for quick response to operational questions and active participation in the October-Exercise meetings (weekly, daily).

 

5.10    Outlook for Computing

Analysis operations support is getting very demanding and will focus on the October Exercise, data management, and the analysis job success rate.

 

The program until the LHC start-up is the following:

-       Finalize data distribution of RAW, RECO and AOD to CAF, Tier-1 and Tier-2 centres

-       Tier-0: repeat scale tests using simulated collision-like events

-       Tier-1: STEP09 tape and processing exercises where needed; with Tier-1 visits scheduled: GridKa, IN2P3, ASGC (at CERN).

-       Tier-2: Support and improve distributed analysis efficiency (Analysis Operations)

-       Review Critical Services coverage

-       Fine tune Computing Shifts procedures

-       Make sure 2010 resources pledged are available

 

 

 

6.    AOB

 

 

6.1      Moving Procurement Deadline to 1 June?

G.Merino asked whether the recommendation of moving the deadline for the pledged resources from 1 April to 1 June of every year is going to be implemented by the WLCG.

 

I.Bird stated that the best approach is always that the ramp-up is agreed with the Experiments. Two months should not make a big difference.

G.Merino replied that 1 June would be much better for PIC in this case.

F.Hernandez added that June would help the IN2P3 disk purchase, and he proposed moving to June.

M.Kasemann agreed, provided there is a good ramp-up towards June.

J.Gordon added that a couple of months later may mean having WNs with 2 TB internal disks.

M.Bouwhuis added that NL-T1 is already receiving 2 TB disks.

 

I.Bird concluded that Experiments should provide ramp-up profiles and agree them with the Sites.

 

6.2      CMS SAM Tests at FZK

A.Heiss clarified that the problems with the CMS SAM tests are not due to the Sites but to CMS not testing the correct Site.

6.3      ATLAS SAM Tests at BNL

The ATLAS SAM tests still fail at BNL for the moment and need to be adapted.

 

 

7.    Summary of New Actions

 

 

No new actions.