LCG Management Board

Date/Time:

Tuesday 8 August 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a063096

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 10.8.2006)

Participants:

A.Aimar (notes), L.Bauerdick, S.Belforte, I.Bird, F.Carminati, T.Cass, L.Dell’Agnello, T.Doyle, I.Fisk, B.Gibbard, A.Heiss, J.Knobloch, G.Merino, B.Panzer, G.Poulard, Di Quing, H.Renshall, L.Robertson (chair), O.Smirnova, R.Tafirout 

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 22 August from 16:00 to 17:00 (no meeting on 15 August)

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

N.Brook asked to add that LHCb will phone in to the GDB and MB meetings at BNL but will not participate in person. Done.

 

The minutes of the previous meeting were approved.

1.2         Update on the LHCC Review Preparation - L.Robertson

Many of the people contacted are probably on holiday.

-          V.Guelzow (Tier-1 status from the internal service review) and M.Jouvin (Overview of Tier-2 readiness) have not yet confirmed their presentations (see the minutes of the previous meeting).

-          M.Lokajicek agreed to prepare the presentation on “Tier-2 site from a country without a Tier-1”.

-          M.Lamanna is in contact with F.Forti to agree on the reviewers' expectations and on tuning the experiments' demonstrations of their grid usage.

-          R.Pordes will give the OSG middleware presentation and will discuss the OSG operations presentation with I.Bird.

1.3         Feedback on the Resources Available/Required - H.Renshall

Sites and experiments were asked to check the values in the table of Resources Available at Sites vs. Required by the Experiments.

 

H.Renshall asked for a few clarifications about the data in the SC4 Resources table presented at the previous MB.

-          FZK announced that new capacity is coming; details are needed.
A.Heiss stated that he will provide H.Renshall with updated exact values for the FZK capacity.

-          PIC announced that they will install 60 TB in dCache; more details are needed.
G.Merino stated that he will provide updated information to H.Renshall.

-          Updates from SARA on CPU acquisitions are needed.

-          Clarification from FNAL is needed on the disk installations that are due in a few days.

 

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

  • 23 May 06 - Tier-1 sites should confirm via email to J.Shiers that they have set up and tested their FTS channel configuration for transfers from all Tier-1 sites and to/from Tier-2 sites. It is not sufficient to set up the channel; the action requires confirmation via email that transfers from all Tier-1 sites and to/from the "known" Tier-2 sites have been tested.

 

To be done: BNL, FNAL, NDGF and TRIUMF.

These sites were asked individually to complete this action or send a report.

 

Email from R.Tafirout (TRIUMF):

Please consider this action item as "done" for TRIUMF. Arguments/reasons below:

We configured our FTS service weeks/months ago. All the T1-T2 channels (between TRIUMF and the Canadian T2s) have been successfully tested and operated for ATLAS data movements, in both directions.

Regarding the T1-T1 channels and tests (from the other T1s to TRIUMF), they are all configured (except the NDGF-TRIUMF channel: FTS needs an officially registered site in GOCDB and there are still some issues with NDGF). So far only IN2P3-TRIUMF and CNAF-TRIUMF have been successfully tested. There are still some issues with the other sites which need to be resolved:

SARA, BNL, FZK, and Taiwan do not accept my certificate DN, which has "Email" in it (this is the case for several certification authorities). This is a known issue with dCache sites, which need to "hack" the dCache password file (dCache translates "Email" to "EMAIL"). Maarten sent a script (grid-mapfile2dcache-kpwd) a few months ago which fixes the problem, so I don't think that these sites have applied it properly. I will contact these sites.

Regarding PIC and RAL, there are problems in transferring a file to their SRM in order to get it back via the FTS channels PIC-TRIUMF and RAL-TRIUMF. I will try again later.

I really think that our FTS service is fully/properly configured...the other T1 are really the show stoppers... ;)

Thanks & Regards,

Reda
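
As an illustration of the DN rewriting involved in the "Email" vs "EMAIL" mismatch described above (this snippet is not taken from the email, and the actual grid-mapfile2dcache-kpwd script may behave differently), a minimal sketch in Python:

    import re

    def normalise_dn(dn: str) -> str:
        """Rewrite the 'Email=' attribute of a certificate DN into the
        'EMAIL=' form that the dCache password file expects."""
        return re.sub(r"/Email=", "/EMAIL=", dn)

    # Hypothetical DN containing the problematic attribute name.
    dn = "/C=CA/O=Grid/OU=example.org/CN=Some User/Email=user@example.org"
    print(normalise_dn(dn))   # .../CN=Some User/EMAIL=user@example.org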


Email from B.Gibbard (BNL):

This is the status update to FTS channel setup at BNL.

For Tier1-Tier1 channels, we have all channels in FTS.  All channels except CNAF are tested and functioning.  I was told that CNAF castor is not running currently.

For US Tier2, all channels are functioning and being used by the production service (PANDA) daily.

 

Email from I.Fisk (FNAL):

We have verified the channels work individually, but we are seeing interference with multiple channels active on FTS itself.    When the person who installed it initially is back from vacation on Thursday we will continue debugging.

 

 

  • 30 Jun 06 - J.Gordon reports on the defined use cases and policies for user-level accounting, in agreement with the security policy working group, independently of the tools and technology used to implement them.

 

Not done.

J.Gordon presented this to the GDB in June and asked for feedback, but did not receive any.

As he has not received any use cases, J.Gordon will propose some himself at the GDB in September.

 

  • 31 Jul 06 - Sites and experiments should define a clear structure for providing clearer information and unique liaisons and information contacts.

 

Not done.

The Service Coordination (J.Shiers) will set up a white board where the sites and experiments can publish their contact information.

 


  • 31 Jul 06 - Sites should exchange more information about monitoring, alarming and 24x7 support in the framework of HEPIX.

 

Not done.

I.Bird proposed to HEPIX the organization of a workshop during the next HEPIX meeting in October. For the moment it is not clear whether this will be possible. More news is expected from I.Bird in the coming weeks.

 

  • 31 Jul 06 - Experiments should express what they really need (not “nice to have” requests) in terms of interoperability.

 

Not done.

Experiments agreed to send information to J.Shiers.

 

 

3.      Status and Progress of the Service Challenge activities (transparencies) - H.Renshall

SC4 Weekly Report and upcoming activities (more information)

 

This is a snapshot of the status and progress of the Service Challenge activities.

 

Weekly Report to the Operations Meeting

 

A weekly report is presented at the Operations meeting by H.Renshall. The report includes information from:

-          Reports and issues from the Experiment Integration and Support (EIS) teams collected at the weekly LCG Service Coordination meeting (Wednesday mornings)

-          IT group reports to the weekly IT C5 internal meeting (Friday mornings)

-          Reports and issues from the experiments' service challenge teams from the weekly LCG Resource Scheduling meeting (Monday afternoons)

 

As an example, the latest weekly report is available here.

 

ALICE SC Status

 

Since the last week of July ALICE has been ramping up to 300 MB/s, exporting data from the Tier-0 to 6 Tier-1 sites (CNAF, IN2P3, RAL, FZK, SARA and a US site).

 

For a short period they reached a peak of 150 MB/s. They are now well below their target rate (less than 50 MB/s aggregate) and will continue trying to reach it.

 

Specific Issues:

-          No endpoints are available to ALICE at RAL or NDGF. They are waiting for the sites to set them up.

-          They had some “srm_put” problems transferring to SARA.

-          IN2P3 is currently giving FTS preference to CMS.

 

Main issues:

-          Site instabilities

-          Poor feedback when submitting individual problem reports to GGUS.
ALICE switched to using its internal Task Force list but will now also include GGUS again.

 

ATLAS SC Status

 

ATLAS has just passed the milestone of transferring 1 PB of data in the last 50 days.

 

They held post-mortem meetings on their Tier-0/Tier-1 first-pass processing and on the data export exercise. The overall assessment is positive, with no major issues for CERN but problems with some Tier-1 sites.

 

ATLAS is currently continuing data export at 400-500 MB/s, but plans to repeat the whole exercise at the full rate of 770 MB/s in the last two weeks of September. This will overlap with the start of CMS CSA06 and a more intensive Monte Carlo campaign by ATLAS. ATLAS is going to ask whether this can be moved forward.
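
As a rough cross-check (not part of the ATLAS report, and assuming decimal units, 1 PB = 10^15 bytes), 1 PB transferred in 50 days corresponds to an average sustained rate of about 230 MB/s, which puts the current 400-500 MB/s and the 770 MB/s target in context:

    # Back-of-the-envelope: average rate for 1 PB moved in 50 days (decimal units assumed).
    petabyte_bytes = 1e15
    seconds = 50 * 24 * 3600
    average_mb_per_s = petabyte_bytes / seconds / 1e6
    print(round(average_mb_per_s))   # ~231 MB/s sustained average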

 

CMS SC Status

 

Their planned disk-to-tape data export exercise at 150 MB/s was delayed; meanwhile they are trying a higher rate for their “Tier-0 to Tier-1 disk-to-disk” exercise.

 

The target for CMS is 500 MB/s for 1 week; the minimum acceptable is 300 MB/s for 3 days.

 

The exercise started badly, with the effects of the CERN power cut on Monday and then an expired certificate for the FTS proxy server on Tuesday (this will be better managed when the service moves to FIO on new high-reliability hardware in September).

 

After Tuesday the work was progressing more smoothly:

-          Reached 350 MB/s on Wednesday, but the rate dropped overnight and then recovered; the cause is not known.

-          Stopped for the CERN Oracle database upgrade on Thursday.

-          Reached 300 MB/s on Thursday, dropped overnight and effectively stopped on Friday morning. Unexpected status replies from Castor caused PhEDEx to issue 80,000 ‘prepare-to-get’ requests, which took hours to process.

-          Reached 350 MB/s on Friday afternoon

-          And 350 MB/s has been maintained since then.

 

Specific problems: low rates to RAL, and CNAF not being used due to their Castor2 migration. At the Operations meeting CNAF said that CMS could restart. After some problems overnight, since 09:00 CET on Monday CNAF has been taking data at about 100 MB/s and CMS has reached their 500 MB/s target.

 

The plan is to continue this week, or until they have seen sufficiently stable running. A 30-minute Castor stop at CERN on Wednesday will be useful for testing their recovery procedures.

 

L.Bauerdick stated that the improvements achieved and the effort spent are very visible and appreciated in CMS.

 

LHCb SC Status

 

LHCb planned to start raw data distribution, reconstruction and stripping at Tier-1 sites in July. In parallel it continues Monte Carlo event generation at Tier-1 and Tier-2 sites, with events stored back to CERN. Last Tuesday CERN changed the castorsrm.cern.ch endpoint from castorgrid to “srm.cern.ch” to fix a GFAL bug affecting LHCb transfers back to CERN.

 

“Log Files Transfer problem” - LHCb stores batch job logs back to a central CERN classic SE (run as a VO-box service) using lcg-cp (which runs a gridftpd on the SE). This ran well during July, then started causing process avalanches and crashing the box. This was traced to exceeding a 32000 directory-entry limit, which the lcg-cp/gridftp chain did not handle well. This was fixed, and LHCb also reduced the number of individual files sent; the system is running well again.
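
A directory-sharding layout is one standard way to avoid this kind of per-directory entry limit; the sketch below is purely illustrative (the paths and helper are hypothetical, not LHCb's actual fix):

    import hashlib
    from pathlib import Path

    def shard_path(base_dir: str, job_id: str, levels: int = 2, width: int = 2) -> Path:
        """Map a job identifier to a hashed subdirectory (e.g. base/ab/cd/<job>.log)
        so that no single directory accumulates tens of thousands of entries."""
        digest = hashlib.md5(job_id.encode()).hexdigest()
        parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
        target_dir = Path(base_dir).joinpath(*parts)
        target_dir.mkdir(parents=True, exist_ok=True)
        return target_dir / f"{job_id}.log"

    # Hypothetical base path: 100000 logs spread over 256*256 buckets stay well
    # below any per-directory limit.
    print(shard_path("/storage/lhcb/joblogs", "job_0042"))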

 

IT remains worried about the scalability of a central gridftp file server for job logs. ATLAS currently runs a similar service at lower rates. Improvements are being studied (e.g. using Castor2 or GridFTP2), but a proper grid architecture solution would be desirable.

 

LHCb observes intermittent (very) slow performance when transferring some files from remote worker nodes to CERN. This happens with both Castor2 and a classic SE, so it is not thought to be a Castor2 problem. As a workaround LHCb has increased its transfer timeout to 1000 seconds.

 

Issues from Last Week and from the SC Action List

 

From last week:

-          ALICE requests a disk0/tape1 endpoint from RAL.

-          ALICE requests a disk0/tape1 endpoint from NDGF.

-          CMS transfers to RAL run at a low rate (10 MB/s compared to target 50 MB/s).

-          Medium and long-term solutions needed for the collection and access by end users of batch job logs.

-          Understand limitations/usability of direct file transfer from worker nodes to CERN.

-          All sites should be able to provide long term stability at their MoU performance levels.

 

General actions:

-          Site monitoring of local services - to be discussed in detail at the SC Tech Day on 15 September.

-          Tools to monitor transfer activities on FTS channels; provide access to FTS logs (a temporary solution is being tested).

-          Improve the performance and reliability of the gLite Resource Broker (critical path for CMS CSA06)

-          Support multiple priority levels for grid batch jobs (critical path for LHCb distributed analysis)

 

Experiments actions:

-          ALICE: report on its T1-T2 end points.

-          LHCb: ROOT/POOL data access to SEs is needed.

 

Significant Points discussed at the MB

 

End-points for ALICE at RAL

T.Doyle, for RAL, will collect information about the status of the SRM end-points and report to the MB.

 

CMS transfer rate to RAL

T.Doyle said that this problem was being worked on, and was mainly due to the small disk cache being used, a result of problems with the most recent delivery of disk storage.

 

NDGF Status and requests from ALICE and ATLAS

O.Smirnova reported that NDGF is not participating much in the SC and will not be able to provide more services before September.
F.Carminati reported that ALICE is, for the moment, trying to work with specific individual Nordic centers rather than with NDGF as a single entity.

 

Transfers of Job Log back to CERN

L.Robertson asked if there is any more scalable and reliable solution for transferring many small log files.
I.Bird replied that a solution may be found by looking at existing services or publicly available tools. H.Renshall said that even a mail service could be sufficient.

 

The goal is not to find a general mechanism to move large files asynchronously (that is provided by FTS), nor to provide a general-purpose end-user grid file service (as discussed at Mumbai, which has many complicated implications), but to focus on reliably moving small job log files.

 

Data Transfers from the WN to CERN (LHCb)

Transfers by LHCb to CERN of MC data files directly from remote worker nodes had problems last week. It is not clear why this does not use the FTS service. This will be discussed when LHCb is present.

 

Progress on these points will be reviewed by the MB at the end of the month.

 

4.      Other Business

 

4.1         CERN Power-cut - T.Cass

The power cut was caused by a short circuit in a substation that receives a 400 kV feed. This caused a fire at 7:45 AM on 29 July.

 

All major services were stopped, including the Physics Services and all critical servers (web, mail, etc.). It took a long time to resolve the electrical problems safely, and the services were then restarted during the day. The Physics Services were back at midnight.

 

This was the longest power cut in the last 30 years. The DG launched an investigation to find out exactly what happened and whether such an event could recur.

 

The IT computer center infrastructure must be verified in order to make sure that IT can rely on the diesel generators in the future. A report will be distributed later, probably during September.

 

L.Robertson remarked that the criticality of the CERN services that are central to LCG operation should be re-assessed looking at:

-          critical services needed to guarantee reliable operations for the grid and the rest of the sites – I.Bird will look into this

-          database services that are crucial for applications running outside of CERN, and how they should be made redundant in case of another CERN outage – J.Knobloch will look into this

4.2         Reliability Measurement (more information) - L.Robertson

L.Robertson distributed the new reliability measurements for July.

 

Note: The stoppage at CERN did not allow the collection of SAM data for 29-31 July. Therefore all other Tier-1 sites will be considered 100% available for the period from the start of the power cut until midnight on 31 July.

 

Action:

31 Aug 06 - Sites should send feedback and comment in detail on all their periods of unavailability. Mail to the MB list.

 

L.Robertson noted that performance has not been improving over the quarter; therefore either the target values must be reviewed or actions must be taken so that all sites achieve the agreed target availability.

The target was to have 88% availability by the end of September.
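
For orientation (this calculation is not from the minutes), an 88% availability target still allows a substantial amount of downtime; a quick estimate, assuming a 30-day month:

    # Maximum downtime compatible with a given availability target (30-day month assumed).
    HOURS_PER_MONTH = 30 * 24

    def allowed_downtime_hours(target_availability: float, hours: float = HOURS_PER_MONTH) -> float:
        return (1.0 - target_availability) * hours

    print(allowed_downtime_hours(0.88))   # ~86 hours, i.e. roughly 3.6 days per month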

 

This will be discussed at the MB meeting at BNL (5 September). Before then, sites should justify/explain all downtimes of the last quarter.

 

FNAL unavailability is due to a single test failing (“SRM advisory delete” fails in dCache). The problem is fixed in the new version of the dCache SRM, but the update will only happen on the 20th. All other attempts to patch it have failed.

4.3         NDGF Status

 

L.Robertson expressed his concern about NDGF's lack of formal participation in the planning, testing, availability and other LCG activities.

 

A meeting between L.Robertson and the NDGF director will be arranged in the next few days by O.Smirnova.

 

 

5.      AOB 

 

 

No MB meeting next week.

 

6.      Summary of New Actions 

 

 

Action:

31 Aug 06 - Sites should send feedback and comment in detail on all their periods of unavailability. Mail to the MB list.

 

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.