LCG Management Board
Tuesday 8 August 2006 at 16:00
(Version 1 - 10.8.2006)
A.Aimar (notes), L.Bauerdick, S.Belforte, I.Bird, F.Carminati, T.Cass, L.Dell’Agnello, T.Doyle, I.Fisk, B.Gibbard, A.Heiss, J.Knobloch, G.Merino, B.Panzer, G.Poulard, D.Qing, H.Renshall, L.Robertson (chair), O.Smirnova, R.Tafirout
Next Meeting: Tuesday 22 August from 16:00 to 17:00 (no meeting on 15 August)
1. Minutes and Matters arising (minutes)
1.1 Minutes of Previous Meeting
N.Brook asked to add that LHCb will phone in to the GDB and MB meetings at BNL but will not participate in person. Done.
The minutes of the previous meeting were approved.
1.2 Update on the LHCC Review Preparation - L.Robertson
Many of the people contacted are probably on holiday.
- V.Guelzow (Tier-1 status from the internal service review) and M.Jouvin (Overview of Tier-2 readiness) have not yet confirmed their presentations.
- M.Lokajicek agreed to prepare the presentation about “Tier-2 site from a country without a Tier-1”.
- M.Lamanna is in contact with F.Forti to agree on the expectations of the reviewers and on tuning the demonstrations from the experiments about their usage of the grid.
- R.Pordes will give the OSG middleware presentation and will discuss the OSG operations presentation with I.Bird.
1.3 Feedback on the Resources Available/Required - H.Renshall
Sites and experiments were asked to check the values in the table of Resources Available at Sites vs. Required by the Experiments.
H.Renshall asked for a few clarifications about the data in the SC4 Resources table, presented at the previous MB:
- FZK announced new capacity coming; details are needed.
- PIC announced that they will install 60 TB in dCache; more details are needed.
- Updates on CPU acquisitions are needed.
- FNAL: clarification on the disk installations that are due in a few days.
2. Action List Review (list of actions)
Actions that are late are highlighted in RED.
To be done: BNL, FNAL, NDGF and TRIUMF.
These sites were asked individually to complete this action or send a report.
Email from R.Tafirout (TRIUMF):
Please consider this action item as "done" for TRIUMF. Arguments/reasons: we configured our FTS service weeks/months ago. All the T1-T2 channels (between TRIUMF and the Canadian T2s) have been successfully tested and operated for ATLAS data movements, in both directions.
Concerning the T1-T1 channels and tests (from the other T1s to TRIUMF), they are all configured (except the NDGF-TRIUMF channel: FTS needs an officially registered site in GOCDB and there are still some issues with NDGF). So far only IN2P3-TRIUMF and CNAF-TRIUMF have been successfully tested. There are still some issues with the other sites which need to be resolved: for PIC and RAL, there are problems in transferring a file to their SRM in order to get it back via the FTS channels PIC-TRIUMF and RAL-TRIUMF. I will try again later.
I think that our FTS service is fully/properly configured; the other T1s are really the show stoppers... ;)
Email from BNL:
Here is the status update on the FTS channel setup at BNL.
For the Tier-1-Tier-1 channels, we have all channels in FTS. All channels except CNAF are tested and functioning; I was told that the CNAF Castor is not yet ready.
For the US Tier-2s, all channels are functioning and being used daily by the production service (PANDA).
Email from I.Fisk (FNAL):
We have verified the channels work individually, but we are seeing interference with multiple channels active on FTS itself. When the person who installed it initially is back from vacation on Thursday we will continue debugging.
J.Gordon presented this to the GDB in June and asked for feedback, but did not receive any.
If J.Gordon does not receive any use cases, he will propose some himself at the GDB in September.
The Service Coordination (J.Shiers) will set up a white board where the sites and experiments can publish their contact information.
I.Bird proposed to HEPIX the organization of a workshop during next HEPIX meeting in October. For the moment it is not clear whether it will be possible. More news from I.Bird in the coming weeks.
Experiments agreed to send information to J.Shiers.
3. Status and Progress of the Service Challenge activities (transparencies) - H.Renshall
SC4 Weekly Report and upcoming activities (more information)
This is a snapshot on the status and progress of the Service Challenge activities.
Weekly Report to the Operations Meeting
A weekly report is presented at the Operations meeting by H.Renshall. The report includes information from:
- Reports and issues from the Experiment Integration and Support (EIS) teams collected at the weekly LCG Service Coordination meeting (Wednesday mornings)
- IT group reports to the weekly IT C5 internal meeting (Friday mornings)
- Reports and issues from the experiments service challenge teams from the weekly LCG Resource Scheduling meeting (Monday afternoons)
As an example, the latest weekly report is available here.
ALICE SC Status
From the last week of July ALICE has been ramping up to 300 MB/s, exporting data from the Tier-0 to 6 Tier-1 sites (CNAF, IN2P3, RAL, FZK, ...). For a short period they reached a peak of 150 MB/s. They are now well below their target rate (less than 50 MB/s aggregate) and will continue trying to reach it.
- Had some “srm_put” problems to SARA.
- IN2P3 is currently giving FTS preference to CMS.
- Site instabilities and little feedback from submitting individual problem reports to GGUS.
ATLAS SC Status
ATLAS has just passed the milestone of transferring 1 PB of data in the last 50 days.
They had post-mortem meetings on their Tier 0/Tier 1 first pass processing and on the data export exercise. The overall assessment is positive, with no major issues for CERN but problems with some T1 sites.
They are currently continuing data export at 400-500 MB/s but plan to repeat the whole exercise at the full rate of 770 MB/s in the last two weeks of September. This will overlap with the start of CMS CSA06 and a period of more intensive activity.
CMS SC Status
Their planned disk-to-tape data export exercise at 150 MB/s was delayed; meanwhile they are trying a higher rate for their “Tier 0 to Tier 1 disk to disk” exercise.
The targets for CMS are: 500 MB/s for 1 week; the minimum acceptable is 300 MB/s for 3 days.
The exercise started badly, with the effects of the CERN power cut (Monday) and then an expired certificate for the FTS proxy server on Tuesday (this will be better managed when the service moves to FIO on new high-reliability hardware in September).
After Tuesday the work was progressing more smoothly:
- Reached 350 MB/s Wednesday but dropped overnight then recovered – not known why.
- Stopped for CERN Oracle data base upgrade on Thursday.
- Reached 300 MB/s Thursday, dropped overnight and effectively stopped Friday morning. Unexpected status replies from Castor caused PhEDEx to issue 80000 ‘prepare-to-get’ requests, which took hours to process.
- Reached 350 MB/s on Friday afternoon, and 350 MB/s has been maintained since then.
Specific problems: low rates to RAL, and CNAF not being used due to their Castor2 migration. At the Operations meeting CNAF said that transfers could restart. After some problems overnight, since 09:00 CET Monday CNAF has been taking data at about the 100 MB/s rate and CMS has reached their 500 MB/s target.
The plan is to continue this week, or until they have seen sufficiently stable running. A 30-minute Castor stop at CERN on Wednesday will be useful for testing their recovery procedures.
L.Bauerdick stated that the improvements achieved and the effort spent are very visible and appreciated in CMS.
LHCb SC Status
LHCb planned to start raw data distribution, reconstruction and stripping at the Tier-1 sites in July. In parallel it continues its Monte Carlo production.
“Log Files Transfers problem” - LHCb stores batch job logs back to a central CERN classic SE (run as a VO-box service) using lcg-cp (which runs a gridftpd on the SE). This ran well during July, then started having process avalanches that crashed the box. This was traced to exceeding a 32000 directory entry limit, which the lcg-cp/gridftp chain did not handle well. This was fixed, and LHCb also reduced the number of individual files sent; the system restarted and is running well.
IT remains worried about the scalability of a central gridftp file server for job logs. ATLAS runs a similar service, currently at lower rates. Improvements are being studied, e.g. using Castor2 or GridFTP2, but a proper grid architecture solution would be desirable.
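The 32000 directory-entry limit mentioned above is a filesystem constraint (ext2/ext3, for example, cap a directory at roughly 32000 subdirectories); a common mitigation, independent of any particular LCG tool, is to spread files over hashed subdirectories so that no single directory grows too large. A minimal Python sketch, with hypothetical paths and job identifiers:

```python
import hashlib
import os.path

def hashed_log_path(base_dir, job_id, filename):
    """Build a storage path that spreads job logs over 256 bucket
    directories, keeping every directory far below the ~32000-entry
    limit that caused the process avalanches described above."""
    # Two hex characters of an MD5 digest give 256 evenly-used buckets.
    bucket = hashlib.md5(str(job_id).encode()).hexdigest()[:2]
    return os.path.join(base_dir, bucket, str(job_id), filename)

# Hypothetical example: logs for job 123456 land in bucket "e1".
print(hashed_log_path("/storage/lhcb/logs", 123456, "job.log"))
# -> /storage/lhcb/logs/e1/123456/job.log
```

With one extra path level, a million jobs produce roughly 3900 entries per bucket directory instead of a million entries in one directory.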
LHCb observes intermittent (very) slow performance when transferring some files from remote worker nodes to CERN. This happens with both Castor2 and a classic SE, so it is not thought to be a Castor2 problem. As a workaround LHCb has increased their transfer timeout to 1000 seconds.
Issues from Last Week and from the SC Action List
From last week:
- CMS transfers to RAL run at a low rate (10 MB/s compared to target 50 MB/s).
- Medium and long-term solutions needed for the collection and access by end users of batch job logs.
- Understand limitations/usability of direct file transfer from worker nodes to CERN.
- All sites should be able to provide long term stability at their MoU performance levels.
- Site monitoring of local services - to be discussed in detail at September 15 SC Tech Day
- Tools to monitor transfer activities on FTS channels; provide access to FTS logs (temporary solution being tested)
- Improve the performance and reliability of the gLite Resource Broker (critical path for CMS CSA06)
- Support multiple priority levels for grid batch jobs (critical path for LHCb distributed analysis)
- LHCB: Root/Pool data access to SEs is needed
Significant Points discussed at the MB
T.Doyle, for RAL, will collect information about the status of the SRM end-points and report to the MB.
CMS transfer rate to RAL
T.Doyle said that this problem was being worked on, and was mainly due to the small disk cache being used, a result of problems with the most recent delivery of disk storage.
NDGF Status and requests from ALICE and ATLAS
O.Smirnova reported that NDGF is not participating much in the SC and will not be able to provide more services before September.
Transfers of Job Log back to CERN
It was asked whether there is a more scalable and reliable solution for transferring the many small log files.
The goal is not to find a general mechanism to move large files asynchronously (that is provided by FTS) or to provide a general purpose end-user grid file service (as discussed at Mumbai, but which has many complicated implications), but to focus on reliably moving small job log files.
Data Transfers from the WN to CERN (LHCb)
Transfers by LHCb to CERN of MC data files directly from remote worker nodes had problems last week. It is not clear why this does not use the FTS service. This will be discussed when LHCb is present.
Progress on these points will be reviewed by the MB at the end of the month.
3.1 CERN Power-cut - T.Cass
The power cut was caused by a short circuit in a substation that receives a 400 kV feed. This caused a fire at 07:45 on 29 July.
All major services were stopped, including Physics Services and all critical servers (web, mail, etc). It took a long time to safely solve the electrical problems and then the services were restarted during the day. The Physics Services were back at midnight.
This was the longest power cut in the last 30 years. The DG launched an investigation to find out exactly what happened and whether such an event may recur.
The IT computer center infrastructure must be verified in order to make sure that IT can rely on the diesel generators in the future. A report will be distributed later, probably during September.
L.Robertson remarked that the criticality of the CERN services that are central to LCG operation should be re-assessed looking at:
- critical services to guarantee reliable operations for the grid and the rest of the sites – Ian Bird will look into this
- databases services that are crucial for applications running outside of CERN and how should they be made redundant in case of another CERN outage – Jürgen Knobloch will look into this
3.2 Reliability Measurement (more information) - L.Robertson
L.Robertson distributed the new reliability measurements for July.
Note: The CERN outage did not allow the collection of the SAM data for 29-31 July. Therefore all Tier-1 sites will be considered 100% available in the period from the start of the power cut until midnight on 31 July.
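To illustrate how that masking affects the figures, a site's availability can be viewed as the fraction of sampled periods in which the SAM tests passed, with the masked 29-31 July window counted as available. A simplified sketch (the real SAM computation aggregates many tests per site; the daily samples below are invented):

```python
def monthly_availability(samples, masked_days):
    """samples: list of (day, passed) SAM results for one site;
    masked_days: days counted as available because no data exists."""
    ok = sum(1 for day, passed in samples if passed or day in masked_days)
    return ok / len(samples)

# Toy month of 31 daily samples with failures on days 10 and 30;
# days 29-31 are masked (power cut), so day 30 counts as available.
samples = [(day, day not in (10, 30)) for day in range(1, 32)]
print(round(monthly_availability(samples, {29, 30, 31}), 3))  # -> 0.968
```

In this toy example only the day-10 failure counts against the site, giving 30/31, i.e. about 97% against the 88% end-of-September target.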
31 Aug 06 - Sites should send feedback and comment in detail all the periods of unavailability of the site. Mail to the MB list.
L.Robertson noted that the performance did not improve over the whole quarter; therefore either the target values must be reviewed or actions must be taken in order to achieve the agreed target availability at all sites.
The target was to have 88% availability by the end of September.
This will be discussed at the MB in BNL (5 September). Before that sites should justify/explain all downtimes of last quarter.
FNAL’s unavailability is due to a single test failing (“SRM advisory delete” fails in dCache). The problem is fixed in the new version of the dCache SRM, but the update will only happen on the 20th. All other attempts to patch it have failed.
3.3 NDGF Status
L.Robertson expressed his concern about the lack of formal NDGF participation in the planning, testing, availability measurement and other LCG activities.
A meeting between L.Robertson and the NDGF director will be arranged in the next few days by O.Smirnova.
4. Summary of New Actions
31 Aug 06 - Sites should send feedback and comment in detail all the periods of unavailability of the site. Mail to the MB list.
The full Action List, current and past items, will be in this wiki page before next MB meeting.