LCG Management Board

Date/Time

Tuesday 1 April 16:00-18:00 – F2F Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=27476

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 6.4.2008)

Participants  

A.Aimar (notes), D.Barberis, I.Bird (chair), D.Britton, J.Casey, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, J.Gordon, F.Hernandez, M.Kasemann, M.Lamanna, U.Marconi, H.Marten, P.Mato, G.Merino, B.Panzer, R.Pordes, R.Quick, Di Qing, M.Schulz, Y.Schutz, C.Sehgal, J.Shade, J.Shiers, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 15 April 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

-       26 Feb 2008 - The Sites and Experiments should confirm to A.Aimar that they have updated the list of their contacts (correct emails, grid operators’ phones, etc). Here is the current contact information: https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails

 

Done. This information is going to be checked in this meeting.

 

-       29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.

Not done. The request was made to the GridView team, but the recalculation still needs to be implemented.

-       29 Feb 2008 - A.Aimar will verify why the reliability values for the Tier-2 sites seem incorrect (being lower than availability).

 

Done.

Reliability was lower than availability because the availability calculation treated the “unknown” status as “available”, so as not to penalize the sites when SAM has problems. The algorithm is going to be modified to fix this inconsistency: in the availability calculation, an “unknown” period that falls between “downtime” periods will be treated as “downtime” and not as “available”.
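
The modified rule can be illustrated with a minimal Python sketch. It assumes that the site status is sampled at regular intervals as one of the strings "available", "downtime" or "unknown" (hypothetical labels, not the actual SAM or GridView data model), and that an "unknown" period not enclosed by downtime on both sides keeps the previous treatment as "available".

```python
from itertools import groupby


def resolve_unknown(statuses):
    """Replace each run of "unknown" samples.

    A run enclosed by "downtime" on both sides becomes "downtime";
    any other run keeps the old treatment and becomes "available"
    (assumption: the minutes only describe the enclosed case).
    """
    runs = [(status, len(list(group))) for status, group in groupby(statuses)]
    resolved = []
    for i, (status, length) in enumerate(runs):
        if status == "unknown":
            before = runs[i - 1][0] if i > 0 else None
            after = runs[i + 1][0] if i + 1 < len(runs) else None
            enclosed = before == "downtime" and after == "downtime"
            status = "downtime" if enclosed else "available"
        resolved.extend([status] * length)
    return resolved


def availability(statuses):
    """Fraction of samples counted as available after resolving "unknown"."""
    resolved = resolve_unknown(statuses)
    return resolved.count("available") / len(resolved)


# Hypothetical sequence of eight equally spaced samples for one site.
samples = [
    "available", "downtime", "unknown", "unknown",
    "downtime", "unknown", "available", "available",
]
print(resolve_unknown(samples))
print(availability(samples))
```

In this example the first "unknown" run is enclosed by downtime and is counted as downtime, while the second one borders an "available" period and is still counted as available.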

 

-       18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.
Experiments should provide the read and write rates that they expect to reach, in terms of clear values (MB/sec, files/sec, etc) and including all phases of processing and re-processing.

 

Not Done.

J.Templon suggested that the Experiments provide the same information in the format used by LHCb.

 

-       31 March 2008 - OSG should prepare site monitoring tests equivalent to those included in the SAM testing suite. J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.

 

Ongoing. Equivalence of the OSG tests not yet officially confirmed.

 

3.   CCRC08 Update (Slides) - J.Shiers
 

J.Shiers presented an update of the CCRC08 activities.

 

The semi-automatic reporting strategy continues successfully, with different degrees of completeness depending on the Experiment.

The current activities, by service and by VO, are summarized in the tables below.

 

Service

General Activities/Issues

DBs

Two major activities.

1) Preparation of the new RAC 5 and 6 hardware (both now on critical power) to migrate experiment services next week.

2) Upgrade of integration RACs to Oracle 10.2.0.4. This version will probably not be used for CCRC'08 because there is not enough time for testing.

3D

CNAF was down for the entire week due to maintenance.

Streams replication to CNAF has been split and stopped; it will be resumed this week. 

 

The ATLAS streams replication between the offline and online databases stopped on Good Friday due to an error from the streams apply process, caused by a user application misconfiguration.

The replication was resumed after a few hours.

CASTOR

The new CASTOR version (2.1.7) will be ready for pre-production certification on the 1st of April as originally planned. All tests and stress tests are successful.

SRM

A bug has been found in the CERN SRM v2 databases. It is already fixed in a rollout that was due on Monday, so the rollout has been brought forward to the previous Friday.

FTS

Some Apache patches are in the pipeline for FTS servers.

CE

The fixes deployed at CERN to the LCG CE that reduce their load by an order of magnitude are being packaged for external distribution and should be ready next week.

This patch is important for outside sites.

DM

Patches to the 1.6.7 version of lcgutils should also be ready for distribution next week.

 

Each Experiment reports the activities and issues every week. See the slide notes for additional information from the Experiments.

 

Day

ATLAS

Wed

Throughput testing with junk data. Build up T0 to T1 transfers to 150% of nominal, adding 50% each day (150% over the weekend). CNAF has a downtime, so its share is going to BNL. NDGF has requested a reduced rate due to insufficient resources. SRMv2 is being used at CERN.

Thu

Some Castor fixes at the end of the day. Try to reach 100% of the rate today then 200% on Friday then throttle back to 100% for the weekend. A current problem is they no longer see BNL on the ATLAS dashboard.

Fri

1) Yesterday the export rate was supposed to be increased to 100% of nominal. There was a problem in the ATLAS T0 machinery and not enough LSF jobs were submitted.

2) Display problem with the dashboard. Some entries end up in the production dashboard but should be in the T0 dashboard and vice versa.

3) SARA is not getting data on disk since disk is full (being cleaned up now). NDGF does not get data on disk since disk is full (cannot be cleaned up centrally since NDGF does not use LFC - the only supported catalogue for the central deletion tools). NDGF people should do the cleaning (alerted).

LFC

ATLAS are planning a bulk CERN site change to about 60K LFC entries. The consensus was that this is a relatively simple operation that should only take a few tens of minutes of real time. To be scheduled!

 

 

Day

CMS

Mon

 

Tue

 

Wed

Will continue their T0 to T1 exports to overlap with ATLAS but there is a downtime coming up at FZK.

Thu

CMS asked how to get actions from IT on non-urgent issues during long holiday periods such as Easter. The use case was to allow some DataOps team members to access the DAQ worker nodes to look at LSF issues. These requests are non-urgent but cannot wait for weeks. It was agreed to discuss this at the next F2F meeting.

Fri

 

The request of CMS to resolve non-urgent issues during non-working hours, holidays, etc. cannot be taken into account.

How to handle issues that reduce performance but are not serious enough for an urgent call should be discussed.

 

T.Cass added that often issues are resolved because people follow their email by individual initiative but one needs to define what can be supported during the weekend and holidays in this period.

 

Day

ALICE

 

Main activities for this week:

-       Pass 1 reconstruction of RAW data taken during February/March commissioning exercise (CERN T0)

-       Replication of specific RAW datasets to named T2s for fast detector-specific analysis (calibration/software tuning)

-       MC production ongoing at all centres

 

 

Day

LHCb

Tue

Testing stripping workflow.

Thu

LHCb are evaluating the requirements for local worker node disk space in case they switch to copying production input data to local WN disk rather than reading it via remote I/O. The current estimate is 5 GB. They will be running some production to exercise the manpower of their stripping workflow.

 

 

4.   Update on OSG/SAM milestones (Slides) - R.Quick
 

R.Quick presented the planning for the next few months in the implementation of the SAM tests for the OSG Tier-2 sites.

 

 

CE Availability

-       Probe Development Completed: Completed - Nov 2007

-       Probes Deployed to Sites: 50% of sites reporting; working to get a completion commitment from US-ATLAS and US-CMS for the remaining 50% of sites

-       Publish Data to WLCG: Completed - Jan 2008

CE Reliability (Planned downtime reporting)

-       Probe Development Completed: 15-May-08

-       Probes Deployed to Sites: 22-May-08

-       Publish Data to WLCG: 01-Jun-08 (depends on transport mechanism from SAM developers)

SE Availability

-       Probe Development Completed: 07-Apr-08 (SRM)

-       Probes Deployed to Sites: Start deployment end of May 2008 with OSG 1.0; need to work with US-CMS and US-ATLAS for deployment commitment

-       Publish Data to WLCG: Data available within one week of each site deployment completion

SE Reliability

-       Probe Development Completed: 15-May-08

-       Probes Deployed to Sites: Start deployment end of May 2008 with OSG 1.0; need to work with US-CMS and US-ATLAS for deployment commitment

-       Publish Data to WLCG: Data available within one week of each site deployment completion

 

They will be working with SAM developers on defining the transport mechanisms to get scheduled downtime into SAM.

Design sessions and discussions are planned at the next WLCG Collaboration Meetings on April 21-24 and during May’s meetings in Madison, Wisconsin.

 

J.Casey added that a new message format will be defined in order to allow the OSG reliability information (scheduled downtime, etc) to be sent directly to SAM, without going via GOCDB like the EGEE sites. This will be done in the next few weeks during the planned meetings.

 

I.Bird asked whether there is a problem with the deployment of the CE tests at 50% of the sites (row 1 in the table above).

R.Quick replied that the sites are invited to do so and the deployment should be done by May 2008, but this depends on the individual sites.

I.Bird asked that the US Tier-2 coordinator(s) put pressure on the Tier-2 sites to proceed with this installation.

 

I.Fisk replied that the CMS sites have been contacted and, if it is really so urgent, the deployment could be completed before the end of April.

 

R.Pordes asked how some US ATLAS Tier-2 sites that do not have an SRM interface (they provide xrootd only) are taken into account by SAM.

I.Bird noted that all MoU sites should provide an SRM interface. Until this issue is clarified these sites will appear as not reporting availability data. This should be followed up and the status of these US ATLAS sites should be clarified.

 

New Action:

18 Apr 2008 - M.Ernst should clarify the situation with the ATLAS sites that are not providing an SRM interface, and how their availability and reliability are reported.

 

1.   Summary of the Overview Board (slides; document) - I.Bird

The Overview Board met on the day before the MB Meeting. I.Bird presented a brief summary of the discussion.

Attached are the document submitted to the OB and the slides presented at the meeting (document, slides).

 

The main points raised during the OB Meeting were:

-       LHC Schedule.
Cooling should be completed by mid-June. Experiments should be ready by the end of June and beam should start in July.
Some magnets need to be retrained or stay at lower energies (10-11 TeV). This will be discussed during April with the Experiments.

-       CERN Power.
The technical problem is recognized but the solutions to adopt are not yet agreed. The Experiments stated that their resource requirements cannot be changed. CERN confirmed that it will fulfil its MoU commitments.

-       Resource specification for the RRB board.
It should include the resources available and planned, and also whether all those computing resources have a matching power plan.

-       CCRC Report.
There were not many comments, but the general feedback on CCRC and on the performance of CASTOR (an issue in the past) was very positive.

-       Follow-up to EGEE 3.
All European sites present (NDGF not present) replied that they would fulfil their commitment until ~2012, after EGEE3. Manpower could be an issue.
The European Tier-1 sites should get involved with their NGI in the EGI discussions that are taking place.
In the US the budget cycle is yearly, but the WLCG resources should be planned for the next ~4 years.
In Canada WLCG is budgeted until 2011 and they are extending the commitment.

-       Follow-up to WLCG Phase 2
There will be a one-year extension of WLCG covering the first year of data taking and planning for the next steps.

 

2.   Contact Information of the Tier-1 and Experiments (Wiki page) Roundtable

In order to conclude the long-standing action about having a confirmed list of contacts for each site and experiment a round table took place in order to verify the information in the official wiki page https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails.

 

The Tier-1 sites and Experiments (CMS provided it after the meeting) confirmed that the values in the table are up to date.

 

J.Templon noted that some Experiments’ lists are protected and sites cannot post alarms to them.

 

New Action:

15 Apr 2008 - Experiments should confirm that the alert/contact mailing lists are open to posts submitted from the sites.

3.   Brainstorming on Future Metrics and Milestones (Milestones 27.3.2008; Slides)

I.Bird noted that most of the milestones in the HLM Dashboard (link) are now past, therefore new ones are needed.

 

3.1      Existing Milestones

 

VO-specific SAM testing

This needs to be restarted and milestones need to be proposed.
In two weeks they need to be reviewed at the MB.

New Action

15 Apr 2008 - A.Aimar will distribute the VO-specific milestones for March 2008.

 

J.Templon noted that the MoU critical tests for the sites should be highlighted differently. The VO-specific tests should be split into those that are failing because of the site and those that are failing because of the VO application or framework.

I.Bird replied that the critical tests are used for site reliability; therefore they should only be those that are needed to measure site reliability. The whole set of VO-specific tests must be reviewed and agreed at the MB because they will be the main reliability metrics in the future.

 

R.Pordes and I.Fisk reported that the US Tier-2 sites are not fully deployed yet (until May) and therefore they should not appear in the Reliability Report for the Tier-2 sites.

A.Aimar explained that the report was distributed for comments and also to highlight that the values for some sites are missing. The report is automated and excluding sites is difficult, but a note can be added on each page of the Tier-2 Reliability report.

 

New Action:

15 Apr 2008 - A.Aimar will add a note to the Tier-2 Reliability reports indicating that the US sites are not yet reporting availability and reliability data.

 

R.Pordes asked that the Tier-2 reliability figures be approved by the US CMS representatives before being distributed.

 

Job priorities

The work on Job Priorities (JP) started a long time ago and the progress should be verified.

 

M.Schulz clarified that the Experiments should specify what exact configuration is needed at each site.

J.Templon added that the default at the sites is JP off and sites need to configure it.

J.Gordon added that the issue will be discussed the day after at the GDB.

 

I.Bird added that a milestone about it should be added for the Experiments and for the sites. If instead it is not important, it should be removed as a requirement. D.Barberis will clarify this at the GDB on the following day.

 

R.Pordes added that OSG has not received any requirement from ATLAS and CMS at their US sites.

I.Bird asked why this feature is only needed at the EGEE sites. M.Ernst clarified that US ATLAS implements the priorities locally at the Tier-1 and this does not involve OSG. BNL is not expecting the feature from OSG.

I.Fisk added that the same is done by US CMS at FNAL.

 

24x7 support and VOBox SLA

The milestones are defined in the HLM dashboard and should be checked almost weekly from now on.

 

CAF definition

Experiments confirmed that they will provide the definition of the CAF by May 2008.

 

Reliability Targets

New targets should be defined for Tier-1 and for Tier-2 sites (e.g. % of T2 resources above 95% reliability).

The Tier-2 sites provide very different amounts of resources. Maybe some kind of “weighted” reliability should be introduced (using the site pledge as weight).
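
A possible reading of the “weighted” reliability idea is sketched below in Python. The site names, pledges and reliability values are invented for illustration, and the choice of a pledge-weighted mean (together with the fraction of pledged resources above a threshold, as in the 95% example) is an assumption; no formula was agreed at the meeting.

```python
def mean_reliability(sites):
    """Plain average: every site counts the same, whatever its size."""
    return sum(rel for _, rel in sites.values()) / len(sites)


def weighted_reliability(sites):
    """Average weighted by the pledge of each site."""
    total_pledge = sum(pledge for pledge, _ in sites.values())
    return sum(pledge * rel for pledge, rel in sites.values()) / total_pledge


def pledge_fraction_above(sites, threshold=0.95):
    """Fraction of the pledged resources hosted at sites above the threshold."""
    total_pledge = sum(pledge for pledge, _ in sites.values())
    good = sum(pledge for pledge, rel in sites.values() if rel > threshold)
    return good / total_pledge


# Hypothetical Tier-2 sites: name -> (pledge in arbitrary units, reliability).
sites = {
    "T2_A": (1000, 0.98),
    "T2_B": (200, 0.80),
    "T2_C": (800, 0.96),
}

print(round(mean_reliability(sites), 3))
print(round(weighted_reliability(sites), 3))
print(pledge_fraction_above(sites))
```

With these invented numbers the small, less reliable site lowers the plain average much more than the weighted one, which is the effect the weighting is meant to capture.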

 

F.Hernandez noted that the Tier-2 sites are not represented at the MB and asked whether each Tier-1 is supposed to monitor its Tier-2 sites. However, depending on the VO, a Tier-2 can be connected to different Tier-1 sites.

3.2      New Milestones

 

Accounting

The installed capacity (at Tier-2 sites) will be published.

J.Gordon suggested that the new CPU benchmarking unit should be defined before publishing the details of the CPU resources in the Information System.

 

Storage accounting

What should be reported and how to verify the accuracy of the data published?

J.Gordon noted that the data is recorded from the Information System but the resources change during the month. Periodic snapshots are taken and an average can be calculated.

I.Bird suggested that the value available at the end of each month should be published soon. In this way sites will start verifying the data they publish in the Information System.
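
The two options discussed (an average over periodic snapshots, or the value at the end of the month) can be compared with a short Python sketch. The snapshot days and capacities are invented, and the Information System is not queried here; the snapshots are assumed to be already available as (day of month, installed capacity in TB) pairs.

```python
from statistics import mean


def monthly_average(snapshots):
    """Average of the capacities seen in the periodic snapshots."""
    return mean(capacity for _, capacity in snapshots)


def end_of_month_value(snapshots):
    """Capacity in the latest snapshot of the month."""
    latest = max(snapshots, key=lambda snapshot: snapshot[0])
    return latest[1]


# Hypothetical weekly snapshots for one site: (day of month, capacity in TB).
snapshots = [(1, 400.0), (8, 400.0), (15, 520.0), (22, 520.0), (29, 560.0)]

print(monthly_average(snapshots))
print(end_of_month_value(snapshots))
```

When capacity grows during the month, as in this example, the end-of-month value is higher than the average, so the published figure depends on which convention is chosen.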

 

VOMS Accounting: User, Group and Role Accounting Reporting

The policy should be finalized and the data be published.

J.Gordon explained that the data for ATLAS is available, but sites and ATLAS should agree on what can be published.

 

New Action:

The ATLAS Tier-1 sites should publish the VOMS group and role accounting data.

 

H.Marten reported that FZK is already publishing this data.

J.Templon added that these values are useful to check whether the Job Priority works and whether the resources are split according to the configuration at the site.

 

F.Hernandez added that for IN2P3 there are legal issues of privacy and this may result in requiring specific IN2P3 solutions.

I.Bird added that if VOMS is not used or the data cannot be legally collected, then these measurements will obviously not be needed. If so, WLCG should abandon the issue and EGEE will follow these issues in due time.

 

Job Reliability and Efficiency

These metrics should be restarted and data collected and reviewed every month.

 

M.Schulz added that the usage of job wrappers should be also re-evaluated.

I.Bird clarified that the job wrappers can also be useful to collect efficiency data on the jobs, not only on the status. If it is a heavy task, it could be done on a sample of the jobs.

 

R.Pordes asked that, if a discussion on job reliability takes place, OSG be involved in order to participate in the discussions and in the solutions adopted.

 

Pilot Jobs

The working group should finish its work, and milestones for the sites on gLExec installations and configurations should be added.

 

M.Schulz added that the solution should be certified and then go into production. He recalled that SCAS is absolutely needed for a scalable solution.

I.Fisk asked why the testing cannot be done using GUMS that is already available and does not use a shared file system.

I.Bird agreed that if SCAS is not ready in due time other solutions have to be evaluated. But this should be discussed outside this meeting.

 

SL5 Installations

There is a need for a version of the WN middleware running on SL5 with the latest compiler supported by the Experiments. New hardware will support RH5 and therefore an SL5 version is needed. A milestone should be defined for the Experiments and for the WN Middleware.

3.3      Risk Register

A risk register was compiled a couple of years ago and should be updated.

 

The metrics that we should be regularly following and reporting on are:

-       Accounting

-       Site Reliability

-       Tape metrics, MSS metrics

-       Job reliability

 

I.Bird asked what metrics could be defined in order to monitor global Data Transfer performance.

M.Kasemann added that the Data Transfer reliability should also be measured (i.e. number of retries for each link vs. the data rate).

T.Cass added that the transfer queues also provide a useful measure of the performance of the link vs. the needs of the Experiments.
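
As an illustration only, the per-link quantities mentioned (retries and data rate) could be derived from transfer counters as in the Python sketch below. The counter names, the link name and the numbers are hypothetical, and counting every failed attempt as a retry is an assumption; no metric definition was agreed at the meeting.

```python
from dataclasses import dataclass


@dataclass
class LinkStats:
    """Hypothetical counters collected for one link over a time window."""
    attempts: int       # transfer attempts, including failed ones
    successes: int      # files transferred successfully
    bytes_moved: int    # bytes of the successful transfers
    seconds: float      # length of the time window


def retries_per_file(stats):
    """Failed attempts per successfully transferred file."""
    return (stats.attempts - stats.successes) / stats.successes


def rate_mb_per_s(stats):
    """Average data rate over the window, in MB/s (1 MB = 10**6 bytes)."""
    return stats.bytes_moved / 1e6 / stats.seconds


# Hypothetical one-hour window on a single T0 to T1 link.
links = {
    "CERN-SITE_A": LinkStats(
        attempts=1200, successes=1000,
        bytes_moved=360_000_000_000, seconds=3600.0,
    ),
}

for name, stats in links.items():
    print(name, retries_per_file(stats), rate_mb_per_s(stats))
```

Reporting both values side by side shows whether a link reaches its rate cleanly or only through many retries.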

 

New Action:

I.Bird and A.Aimar will propose new milestones to the Management Board.

4.   AOB

 

No AOB.

 

5.   Summary of New Actions

 

The full Action List, current and past items, will be in this wiki page before the next MB meeting.