LCG Management Board

Date/Time:

Tuesday 27 June 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a061506

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 3 - 05.07.2006)

Participants:

A.Aimar (notes), D.Barberis (from 16:45), I.Bird, S.Belforte, K.Bos, Ph.Charpentier, D.Foster, B.Gibbard, J.Gordon, I.Fisk, F.Hernandez, M.Lamanna, H.Marten, M.Mazzucato, G.Merino, H.Renshall, L.Robertson (chair), M.Schulz, Y.Schutz, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 4 July 2006 at 16:00 – Face-to-face meeting at CERN

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of the Previous Meeting

 

About the action below in the minutes:

Action:

27 Jun 2006 - FZK sets up the Oracle Streams to CERN for the LCG 3D project.

 

H.Marten noted that he had said that a person in charge from FZK would contact D.Duellmann in order to discuss the next steps and time scales. This had been done (by Doris Wochele) on Tuesday or Wednesday of last week, the 3D project had a meeting on Monday, and the setting up of the streams at FZK is in progress; it should be followed up by the 3D project, not by the MB.

L.Robertson answered that, because the setup was supposed to be done by the end of May, the MB had set this as a watch point at a higher level.

 

M.Mazzucato noted that he was not sure that the move to GT4 was necessary. I.Bird explained that in order to provide 64-bit support and port the middleware to SL4 (which will be the new standard platform) one will have to use VDT 1.3, which implies GT4 (even though it will be used in “GT2 mode” for backward compatibility).

 

After agreement on the comments above, the minutes were approved.

1.2         2006Q2 QR Reports to Prepare (schedule)

A.Aimar presented the schedule for the preparation of the Quarterly Reports for 2006Q2 and the Executive Summary for the WLCG Overview Board.

 

July 3: Distribute the 2006Q2 reports to be filled in by each site, experiment and project, with all due milestones to be commented on.

July 10: The filled-in QR reports should be sent back to A.Aimar.

July 17: A review document is sent to the authors of the QR reports, with the reviewers' comments and the changes and clarifications needed.

July 24: Authors send back the modified and completed reports.

July 31: The QR reports and an Executive Summary will be sent to the Overview Board.

 

 

2.      Action List Review (list of actions)

 

Note: Actions in RED are due.

 

 

  • 23 May 06 – Tier-1 sites should confirm via email to J.Shiers that they have set up and tested their FTS channel configuration for transfers from all Tier-1 and to/from Tier-2 sites. It is not sufficient to set up the channels; the action requires confirmation via email that transfers from all Tier-1 sites and to/from the "known" Tier-2 sites have been tested.

Not done: ASGC, BNL, FNAL, INFN, IN2P3, NDGF, SARA-NIKHEF, and TRIUMF.
Done: RAL, FZK and PIC.

IN2P3 and TRIUMF had done it, but did not send the mail to J.Shiers.

SARA and INFN: confirmed that they have sent the email to J.Shiers.

FNAL: the FTS server is set up and they have tested a few channels. A meeting with G.McCance is still needed to agree on the required set-up.

No information from ASGC, BNL, NDGF.

 

  • 30 May 06 - ALICE and LHCb will send to J.Shiers the list of the Tier-2 sites to monitor in SC4.

SC4 for LHCb does not include any Tier-2 sites.

ALICE will send the list very soon.

  • 31 May 06 – K.Bos should start a discussion forum to share experience and tools for monitoring rates and capacities and to provide information as needed by the VOs. The goal is then to make possible a central repository for storing effective tape-throughput monitoring information.

Not done.

  • 13 Jun 06 – D.Liko to distribute the Job Priority WG report to the MB.

Not done.

 

  • 27 Jun 06 - K.Bos proposes a group that will look into the implications of the move to SRM 2 for Tier-1 and Tier-2 sites.

Update from K.Bos’ email to the GDB.

 

The names of people that offered to take part in this group are:

Jan van Eldik (CERN)
Mark van de Sanden (SARA)
Artem Trunov (CMS and Alice, IN2P3)
Lionel Schwarz (IN2P3)
Adrià Casajús (PIC)

Jos van Wezel (FZK/GridKa)
Ruth Pordes as a placeholder for possibly Eileen Berman and/or Frank Wuerthwein (FNAL)

 

3.      Summary of Experiments' Requirements for SC4 (transparencies) H.Renshall

 

 

This is a report of a one-day meeting that was held at CERN on 21 June to look at technical issues of the current service challenge program. There were 30 registered attendees representing most Tier-1 sites and some UK and German Tier-2 sites, and several others joined via VRVS.

 

The meeting was divided into two sessions:

-          AM: site issues and middleware 

-          PM: experiments' plans and requirements

3.1         Report on Site Issues and Middleware

 

Tier 1 sites were requested to send in advance their issues, missing features and other comments regarding all aspects of the recent activities (disk-disk and disk-tape transfers, gLite installation, services preparation, operations, etc).

 

All input received was grouped and presented in the morning session:

 

-          Understanding Disk - Disk and Disk - Tape Results 

-          Problems in setting up basic services

-          FTS failure handling

-          Operational Requirements for Core Services 

-          Discussion on moving from here to full production services and data rates

 

Disk - Disk and Disk - Tape Results (M.Litmaath)

 

In April the full rate of 1.6 GB/sec was reached for only one day, with very variable ramp-up and stability across the T1 sites.

 

To discover and solve problems the FTS log files are vital, and accessing them via “remote login” is very impractical: web access to the log files is needed. srmCopy operations are difficult to debug because remote logging information is not sent to FTS.

 

Many problems could be detected by sensors at T1 sites and there was a significant failure rate for SRM or gridftp requests.

 

Most channels need too many parallel transfers and/or too many streams. This is not compatible with the real experiment use cases and so we need to improve the rate per stream.

 

Network problems and interventions are not always announced; sometimes changes in firewall configurations block/perturb the transfers.

 

GridView can only publish data for Castor and DPM sites; a publisher for dCache transfer statistics needs to be implemented.

 

 

Setting up (gLite) basic services (G.McCance)

 

Most T1 sites are now upgraded. There was some confusion about which gLite and LCG parts qualified as an upgrade, and there were some configuration issues (e.g. changes in the CE and FTS).

 

Concerning resources and planning, the sites reported that they had difficulties because the releases came at short notice and sometimes in parallel with the PPS, both requiring the support of key staff.

 

I.Bird and M.Schulz noted that:

-          the releases of gLite were not at short notice (they were announced in January)

-          the release date was on schedule (only two days late)

-          resources are funded by EGEE to run PPS in parallel (therefore this should not have been a problem)

 

A lack of documentation, or inconsistency in it, was reported, and more configuration examples would have been useful.

 

The operational procedures are often not adequate, there is limited sharing of tools and experience among sites, and monitoring is insufficient.

 

FTS failure handling (G.McCance)

 

The FTS channels can start dropping files. The default is to do 3 retries per file with 10 minutes between them.

There are transient problems: for instance the SRM stops working – maybe 10,000 files are then dropped – and then it works again 30 minutes later without intervention.

 

It is recommended that sites make use of the FTS “Hold” state, so that failed transfers go into the hold state and can be retried once the problem has been resolved. The site should check for held transfers several times per day, fix the problem, and then reset the transfers.
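
As an illustration of this recommended daily check, here is a minimal sketch in Python; the two helper functions and the channel name are hypothetical stand-ins for whatever FTS administration commands or APIs a site actually uses, and operator intervention to fix the underlying problem is assumed before the reset.

```python
# Illustrative sketch only: list_held_transfers() and resubmit() are
# hypothetical stand-ins for the site's real FTS administration tools.

def list_held_transfers(channel):
    """Return identifiers of transfers currently in the Hold state
    on the given channel (always empty in this stub)."""
    return []

def resubmit(channel, transfer_id):
    """Put a held transfer back into the queue once the underlying
    problem (e.g. a transient SRM failure) has been fixed."""
    print("resubmitting", transfer_id, "on channel", channel)

def check_channel(channel):
    held = list_held_transfers(channel)
    if held:
        # In reality an operator fixes the problem first, then resets.
        for transfer_id in held:
            resubmit(channel, transfer_id)

if __name__ == "__main__":
    # Run a few times per day (e.g. from cron) over the site's channels.
    for channel in ["CERN-EXAMPLESITE"]:   # hypothetical channel name
        check_channel(channel)
```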

 

The retry and hold policies are configurable per VO because some VOs do not want this “hold” feature.

 

When there are repeated failures the channel should be halted.

 

Work is in progress to make FTS information (performance figures, aggregate failure classes, log files, error messages) more accessible.

 

The FTS server at CERN has a web page with readable alarms that local monitoring tools could poll.
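
A minimal sketch of such a local poller follows; the URL and the plain-text page format are assumptions made for illustration, not the real CERN FTS alarm page.

```python
# Illustrative sketch only: the alarm page URL and its format are assumed.
import urllib.request

ALARM_URL = "https://fts.example.cern.ch/alarms"   # hypothetical endpoint

def poll_alarms():
    """Fetch the alarm page and pass any non-empty lines to the local
    monitoring system (here they are simply printed)."""
    with urllib.request.urlopen(ALARM_URL) as response:
        text = response.read().decode("utf-8", "replace")
    for line in text.splitlines():
        if line.strip():
            print("FTS alarm:", line.strip())

if __name__ == "__main__":
    poll_alarms()   # e.g. run every few minutes from the site's monitoring
```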

 

L.Robertson summarised that the two issues to follow up on in particular are:

-          Availability of srmCopy logging information in FTS

-          GridView needs a publisher for dCache

 

I.Bird asked for more information about the context of the transient FTS failures (which site, which MSS was involved, etc.).

 

Operational Requirements for core services (J.Casey)

 

These are the top 5 requirements from the sites:

-          Better logging.
A lot of information is missing (e.g. the DN of the initiating user in the transfer log) and it is too hard to understand the log information.

-          Better diagnostics tools.
One needs to be able to verify that the configuration is correct and functional for all VOs. The SAM monitoring should be part of the solution and be available at the site in order to check a configuration

-          Detailed step-by-step troubleshooting guides.
The need is acknowledged and guides could be developed starting with the recent T2 workshop tutorials.

-          Better error messages from tools

-          Improved monitoring.
Interfaces are needed to allow central/remote components to be connected to the sites' local monitoring systems. The GridView level is not enough; other tools are being looked at.

 

M.Schulz noted that any operation must be traceable to a DN, and therefore if the DN is missing from the log this is a serious bug.

 

How to move to production quality (Discussion, H.Renshall)

 

A summary of the questions and observations made during the workshop follows:

 

  1. Sites should maintain at least two SRM end points per experiment: one for data going to tape (disk0tape1 class) and one for disk-only data (disk1tape0). There were no particular comments. To be followed up by the SC team.

  2. Are the resources needed for the upcoming data challenges available? The buffer sizes for 24-hour recycling for ATLAS had been sent round. No comments were received except from RAL, which remarked that ATLAS has a fixed 10 TB of disk space that has to be cleaned up first to make buffer space.

  3. The challenges should be treated as ‘accelerator on’ so sites should try to check services during weekends.
    Tier-1s are only required to support out-of-hours operations from the beginning of 2007, so for now all is ‘best effort’.
    No comment from the sites.

  4. Are the experiment plans for 2007 resources (disk, tape and cpu) well established, including the consequences of the new LHC schedule?
    I.Fisk said that one should expect the full data rate from CMS. This is an excellent chance to calibrate the detectors.
    To be followed up by the SC team (proposed for next technical meeting).

  5. Experiment data of the same type (RAW, ESD, AOD) should be on separate pools of tapes per experiment. This is an efficiency issue for staging in such data.
    Sites expressed surprise at this requirement. For the current ATLAS SC it has been relaxed.
    SC team to check per experiment if this is a long term requirement.

 

S.Belforte noted that there should be a mechanism to communicate to the sites which data is better kept on the same set of tapes (name space, pool, etc.). L.Robertson said that this is an issue for the group looking into the implications of the SRM 2 implementations on the sites.

 

3.2         Experiments' Plans and Requirements

Talks were requested from each experiment to address:

-          What they want to achieve over the next few months with details of the specific tests and production runs.

-          Specific actions, timelines, sites involved.

-          If they have had bad experiences with specific sites then this should be discussed and resolved.

 

ATLAS SC plans/requirements

 

ATLAS is running now until 7 July to demonstrate the complete ATLAS DAQ and first-pass processing, with distribution of raw and processed data to Tier-1 sites at the full nominal rates.

This will also include data flow to some Tier-2 sites and full usage of the ATLAS Distributed Data Management system, DQ2.

 

Raw data go to tape, processed data to disk only. Sites are to delete from disk and tape, but the tape must actually be written.

 

After the summer ATLAS will investigate scenarios for recovery from failing Tier-1 sites and deploy clean-up of pools at the Tier-0.
Later ATLAS will test their distributed production, analysis and reprocessing.

 

J.Templon noted that ATLAS has not asked for resources for reprocessing. H.Renshall agreed to clarify the issue and update the SC4 wiki with the latest ATLAS requests.

 

He pointed out that DQ2 has a central role with respect to the ATLAS Grid tools:

-          ATLAS will install local DQ2 catalogues and services at Tier 1 centers

-          ATLAS defines a “region” as a Tier-1 plus the sites with good network connections to it; these sites will depend on the Tier-1's DQ2 catalogue.

-          Some (volunteer) Tier-2s are expected to join the SC when the T0/T1 part runs stably

-          ATLAS will delete DQ2 catalogue entries

ATLAS will require:

-          VO box per Tier 0 and Tier 1 – done

-          LFC server per Tier 1 – done, must be monitored

-          FTS server and validated channels per Tier 0 and Tier 1 – almost complete

-          ‘Durable’ MSS disk area at Tier 1 – few sites have it. To be followed up by Atlas and SC team.
(disk0tape1, disk can be reused, tape must stay at least 24 h)

 

ATLAS would like their Tier-1 sites to attend (via VRVS) their weekly SC review meeting (Wednesdays at 14:00) during this running phase.
No commitments were made by the sites. RAL expressed its support for the participation of sites in the experiments' meetings.

 

L.Robertson asked if the sites have problems with this list of requirements.

-          SARA-NIKHEF said they have fewer tapes so will recycle them more often.

-          TRIUMF will recycle disk space every 16 hours because they do not have enough capacity at the moment.

 

L.Robertson said that he believed the experiment-site meetings are important because there are many issues to sort out quickly with the sites at the start of the service challenge, and it has often been heard that sites have difficulty understanding what an experiment needs.

The discussion that followed concluded that if the first part of the experiment meeting is devoted to “site issues” then the sites should participate in that part. H.Renshall agreed to ask ATLAS to organize the meeting so that it starts with “site issues”.

 

ALICE SC Plans and Requirements

Their validation of the LCG/gLite workload management services is ongoing and they stressed that stability of the services is fundamental for the entire duration of the exercise.

 

Validation of the data transfer and storage services:

-          2nd phase: end July/August T0 to T1 (recyclable tape) at 300 MB/sec

-          The stability and support of the services have to be assured during and beyond these throughput tests

 

In August/September they will perform the validation of the ALICE distributed reconstruction and calibration model, and do reconstruction at their Tier-1 sites.

 

The integration of all Grid resources within one single interface to the different Grids (LCG, OSG, NDGF) will be done by ALICE.

 

End-user data analysis is scheduled for September/October.

 

ALICE Requirements and Issues

 

ALICE has deployed a VO box at all their T0, T1 and T2 sites, installed and maintained by ALICE, but site-related problems should be handled by the site administrators.

ALICE requires:

-          FTS services as plugin to AliEn File Transfer Daemon

-          LFC at all ALICE sites. Used as a local catalogue for the site SE. ALICE will take care of the LFC catalogue entries.

-          FTS endpoints at the T0 and T1 with SRM enabled storage to tape (tapes must be written) and automatic data deletion (by the sites) for the 300 MB/sec throughput test (24 to 30 July).

-          Site support during the whole of the tests and beyond.

 

Question from ALICE: What are the site contacts for the central and distributed support teams, or does everything go through GGUS?

Answer: All problems are to be reported to GGUS. The SC team will check the possibility of out-of-hours action.

 

Question from ALICE: Will the SC team set up and test this before handing it over to ALICE?

Answer: The SC team will follow up on the enabled storage, but it is up to ALICE to test the setup.

 

J.Gordon noted that sites do not have any automatic mechanism to delete tapes, and manual operations will be done only when needed by the sites. Experiments had previously stated that only they would be responsible for deleting files.

 

CMS SC Plans

 

In September/October CMS runs CSA06, a 50 million event exercise to test the workflow and dataflow associated with the data handling and data access model of CMS.

 

Now till end June

-          Continue to try to improve file transfer efficiency. Low rates and many errors now.

-          Attempt to ramp up to 25k batch jobs per day and increase the number and reliability of sites, aiming to obtain 90% efficiency for job completion

 

July

-          Demonstrate the CMS analysis submitter in bulk mode with the gLite RB

 

July and August

-          25M events per month with the production systems

-          In the second half of July, participate in multi-experiment FTS Tier-0 to Tier-1 transfers at 150 MB/sec out of CERN

-          Continue through August with transfers - this will overlap with the ALICE tests.

 

CMS Requirements

 

Improve Tier-1 to Tier-2 transfers and the reliability of the FTS channels.

 

CMS are exercising the channels available to them, but there are still issues with site preparation and reliability: the majority of sites are responsive, but there is a lot of work for this summer.

 

CMS requires the deployment of the LCG-3D infrastructure: from late June, Frontier + Squid caches must be deployed.

 

All participating sites should be able to complete the CMS workflow and metrics (as defined in the CSA06 documentation).

 

S.Belforte noted that the 25M events/month will have to be copied to CERN for further work during the year and do not need to be stored long term on the Tier-1 sites.

 

 

LHCb SC Plans/Requirements

 

LHCb will start the DC06 challenge at the beginning of July, using the LCG production services, and run until the end of August:

-          Distribution of raw data from CERN to Tier 1s at 23 MB/sec

-          Reconstruction/stripping at Tier 0 and Tier 1

-          DST distribution to CERN and Tier 1s

-          Job prioritization will be dealt with by LHCb (via DIRAC) but it is important that jobs are not delayed by other VOs' activities

 

Preproduction for this is ongoing with 125 TB of MC data at CERN.

Production will go on throughout the year for an LHCb physics book due in 2007.

 

LHCb requires:

-          SRM 1.1-based SEs, separated for disk (disk1tapex) and MSS (disk0tape1), at all Tier-1s as agreed in Mumbai, and FTS channels for all CERN-T1 pairs

-          Data access directly from the SE to ROOT/POOL (not just GridFTP/srmcp).
For NIKHEF/SARA (firewall issue) this could perhaps be done via GFAL.

-          VO boxes at Tier-1s – so far at CERN, IN2P3, PIC and RAL. Still needed at CNAF, NIKHEF and GridKa

-          Central LFC catalogue at CERN and read-only copy at certain T1s (currently setting up at CNAF)

 

DC06-2 in Oct/Nov requires the Tier-1s to run COOL and 3D database services.

 

L.Robertson noted that the firewall issue at SARA should be clarified by an expert. H.Renshall will follow it up.

 

The GFAL plugin in ROOT is not working for now.
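
For illustration, a minimal PyROOT sketch follows of what “data access directly from the SE to ROOT/POOL” means, as opposed to copying the file first with GridFTP/srmcp; it assumes a ROOT installation with the relevant I/O plugin, and the protocol, door host and file path are hypothetical (a working GFAL plugin would be used in the same way with a gfal: URL).

```python
# Illustrative sketch only: requires ROOT with PyROOT and the dcap I/O plugin;
# the door host and file path below are hypothetical.
import ROOT

# Open the file directly on the storage element instead of copying it locally.
f = ROOT.TFile.Open(
    "dcap://door.example-t1.org:22125/pnfs/example-t1.org/data/lhcb/test.root")
if f and not f.IsZombie():
    f.ls()       # list the file contents as a basic access check
    f.Close()
else:
    print("could not open the file directly from the SE")
```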

 

Summary of Main Experiments Requirements

 

ATLAS requires:

-          ‘durable’ disk storage end points for ATLAS must be verified

-          ATLAS LFC instances at the Tier 1’s must be monitored

-          T1 sites to attend weekly SC review during current run

 

ALICE requires:

-          tape storage end points for ALICE

-          LFC available for ALICE at all their sites

-          Good site support emphasised by ALICE.

-          Can any best-effort out-of-hours support be offered?

 

CMS requires:

-          improved performance and reliability of file transfers

-          LCG-3D (squid) infrastructure from July

-          sites complete their CSA06 metrics

 

LHCb requires:

-          separate disk and MSS (tape) storage classes

-          direct ROOT/POOL data access to SE’s

-          VO boxes at CNAF, NIKHEF and GridKa

-          COOL and 3D database services  (October)

 

Next Technical Meeting

 

H.Renshall has tentatively scheduled another technical meeting for 15 September.
The goal is to review the June to August challenges and the experiments' plans for the rest of 2006 and for the 2007 production services.

 

 

4.      gLite 3.0 status at the sites (transparencies) M.Schulz

 

 

Postponed to next MB meeting.

 

5.      AOB

 

 

J.Templon noted that the plans for 2007 and 2008 should be discussed in view of the changes to the LHC schedule.

L.Robertson will meet J.Engelen to find out what additional information about the accelerator planning can be made available, and will report back to the MB.

 

6.      Summary of New Actions

 

 

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.