LCG Management Board

Date/Time: Tuesday 16 September 2008, 16:00-17:00 - Phone Meeting

Agenda

Members

(Version 1 - 25.9.2008)

Participants: A.Aimar (notes), I.Bird (chair), K.Bos, D.Britton, S.Campana, T.Cass, Ph.Charpentier, L.Dell'Agnello, M.Ernst, I.Fisk, F.Giacomini, A.Heiss, F.Hernandez, M.Kasemann, H.Marten, A.Pace, B.Panzer, R.Pordes, Di Qing, H.Renshall, M.Schulz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

Mailing List Archive: https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting: Tuesday 30 September 2008, 16:00-18:00 - Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting

The minutes of the previous MB meeting were approved. FZK sent a modification to the minutes of the previous week, concerning Tier-1 2008 procurement: DE-KIT, according to the MoU, has a second milestone in October 2008, mainly to increase the ALICE resources. There are no problems in fulfilling this milestone; all hardware is on site, running in test mode, and will be provided on 1 October.

L.Dell'Agnello noted that site LFC servers and read-only replicas are not distinguished in the SAM tests: an LFC read-only replica is considered unavailable if it is tested for writing. He suggested that this information could be put in the GOCDB. M.Schulz replied that currently this information (whether an LFC is a read-only replica or not) is not available in the Information System (IS); the IS is being modified to store it (see the sketch at the end of this section).

H.Renshall asked FNAL for an update on their 2008 resources. I.Fisk replied that the 2008 pledges are complete for both CPU and disk.
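As a minimal sketch only of the point discussed above: the snippet below lists the LFC endpoints published in the information system, assuming the usual GLUE 1.3 publication, the top-level BDII at lcg-bdii.cern.ch and the lcg-local-file-catalog service type (all assumptions, not quoted from the meeting). The "GlueServiceReadOnly" attribute at the end is purely hypothetical; it stands in for the read-only flag that the IS does not yet carry.

```python
# Sketch only: list the LFC endpoints published in the top-level BDII.
# The BDII host, base DN and service-type string are assumptions based on
# the usual GLUE 1.3 publication; "GlueServiceReadOnly" is a purely
# hypothetical attribute standing in for the flag the IS is being modified
# to provide.
import ldap  # python-ldap

BDII_URI = "ldap://lcg-bdii.cern.ch:2170"   # assumed top-level BDII
BASE_DN = "o=grid"                          # GLUE 1.3 base DN
LFC_FILTER = "(&(objectClass=GlueService)(GlueServiceType=lcg-local-file-catalog))"

conn = ldap.initialize(BDII_URI)
entries = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, LFC_FILTER,
                        ["GlueServiceEndpoint", "GlueServiceReadOnly"])

for dn, attrs in entries:
    endpoint = attrs.get("GlueServiceEndpoint", [b"?"])[0].decode()
    # Hypothetical flag: if present, a SAM probe could skip its write test.
    flags = [v.decode().lower() for v in attrs.get("GlueServiceReadOnly", [])]
    read_only = "true" in flags or "1" in flags
    print(f"{endpoint}  read-only replica: {read_only}")
```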
1.2 QR Preparation (June-September)

Proposed calendar: ALICE: 23/9, CMS: 30/9, LHCb: 7/10, ATLAS: 14/10. M.Kasemann and Ph.Charpentier agreed on the calendar. Y.Schutz was not present at the meeting but also agreed to the 30/9 (the 23/9 slot has since been cancelled).

New Action: Experiments' QR presentations at the MB. Agreed calendar: ALICE: 30/9, CMS: 30/9, LHCb: 7/10, ATLAS: 14/10.

1.3 2009 Procurement (HLM_20080909)

FNAL: On target for all installations to be in place by April 2009. Will verify that the disk order is in place.

TRIUMF: Tender to be released in early October; the material will be received by February 2009. Therefore April 2009 seems reasonable.

IN2P3: Installed a second tape library that will provide 10 PB of tape storage. Received 4 PB of disk to complete the 2008 pledges and half of the 2009 pledges. Launching a framework tender for CPU purchasing and will receive 30-40% of the 2009 pledges by December. For budget reasons, the rest can only be ordered in early 2009 and be in place by May.

FZK: Expect no delays for CPU, disk and tape. All pledges will be installed by 1 April 2009. The disk tender is finished and orders are being placed. The tape infrastructure is sufficient and future upgrades are being studied. Details are in an email sent to the MB mailing list.

CNAF: The tape library has 10 PB and the tapes for 2009 are already purchased. Disk and CPU tenders will be sent by the end of September and the material will be delivered by the end of February, probably only partially by 1 April.

NL-T1: Power/cooling issues could cause problems. Both SARA and NIKHEF are in the process of acquiring space/power/cooling additions. The NIKHEF addition is scheduled to be completed by April and is being handled by an external company. The tender process is being re-evaluated and new tenders will be sent in October.

NDGF: CPU pledges will be fulfilled early. Disk and tape resources will also likely be ready by April.

ASGC: Funding for 2009 will only become available in 2009 and therefore purchases could be delayed; they will send information after the meeting.

RAL: The disk tender is being completed; the material will arrive by January and be installed by April. The CPU tender is a month later and the material will arrive by the end of January. The second tape robot will also arrive in time. The new building at RAL should be delivered by December; if this is late it could delay the installations at RAL.

BNL: The tender is ready but the exact dates for spending are still being discussed in detail.

CERN: The tape robot expansion has been ordered. For CPU and disk the procurement is split in two parts:
- First tenders: CPU orders are ready, will be delivered by November and installed by early 2009. Disk servers will be delivered in December and in service by the end of February.
- Second part of the tenders: responses will be evaluated and approved in December, delivery is expected in February, and the resources will be ready by April 2009.
2. Action List Review (List of actions)
About LCAS: Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm it.
About SCAS: The SCAS server seems to be ready and “certifiable” in a week. The client is still incomplete.
No news this week about LCAS and SCAS.

- DONE. A document describing the shares wanted by ATLAS.
- DONE. Selected sites should deploy it and someone should follow it up.
- ONGOING. Someone from the Operations team must be nominated to follow these deployments end-to-end.

ATLAS will report on the status of the tests. No news at the F2F meeting. S.Campana reported that job priorities have been deployed in Milan but not yet in Naples. The instructions written after the PPS experience worked well. The open issues are:
- What to do for sites using LSF (e.g. Rome)?
- In Italy they tested the gLite “repackaged” by INFN-grid; it would be good to test it on other sites (NL-T1 and Edinburgh will do it).

S.Campana will distribute the instructions on job priorities to the test sites.
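Related to the job-priority deployments just discussed, the following is a minimal sketch, under stated assumptions, of how a site or experiment could verify that the per-FQAN shares are visible in the information system. The GLUE 1.3 GlueVOView publication, the top-level BDII host, the base DN and the ATLAS FQAN filter are all examples/assumptions, not something prescribed at the meeting.

```python
# Sketch only (not the official validation procedure): check whether CEs
# publish per-FQAN GlueVOView entries, which is how the gLite job-priority
# shares become visible in the information system.
import ldap  # python-ldap

BDII_URI = "ldap://lcg-bdii.cern.ch:2170"   # assumed top-level BDII
BASE_DN = "o=grid"                          # GLUE 1.3 base DN
VOVIEW_FILTER = "(&(objectClass=GlueVOView)(GlueCEAccessControlBaseRule=VOMS:/atlas/*))"

conn = ldap.initialize(BDII_URI)
results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, VOVIEW_FILTER,
                        ["GlueCEAccessControlBaseRule",
                         "GlueCEStateRunningJobs",
                         "GlueCEStateWaitingJobs"])

for dn, attrs in results:
    rule = attrs["GlueCEAccessControlBaseRule"][0].decode()
    running = attrs.get("GlueCEStateRunningJobs", [b"0"])[0].decode()
    waiting = attrs.get("GlueCEStateWaitingJobs", [b"0"])[0].decode()
    # One line per published VO view: FQAN plus the current job counts.
    print(f"{dn}\n  {rule}: running={running}, waiting={waiting}")
```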
3. LCG Operations Weekly Report (Slides) - J.Shiers
Summary of the status and progress of LCG Operations. The daily meeting summaries are always available at: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1 Sites Reports

The Oracle problems were solved, but the report is not sufficiently clear. Reports of corrections must be detailed in order to be useful in the future and to other sites.

3.2 First Beam Day

The pertinent question now is “were we ready?” It is clear that some of the services (even key ones) still need further hardening, and more besides. This is also true for some procedures, e.g. alarm mail handling, which were put in place too late to be fully debugged during CCRC’08.

3.3 Service Load Problems

Service load problems have been seen on a number of occasions, including the ATLAS conditions DB at various sites, as well as the ATLAS CASTOR instance at CERN (post-mortem, including discussion of the GGUS alarm ticket follow-up):
- 10' for the CASTOR expert to be called
- 10' for the intervention to start
- 18' for the problem to be identified
- < 3 hours total from the start of the problem to confirmation of the resolution

As the post-mortem shows, the reaction time was very satisfactory. These problems are likely to persist for at least weeks (months?); we should understand which usage patterns cause them, and why they were not included in the CCRC’08 tests. Why were these usage patterns not tested, or why were they changed?
3.4 Network Problems

BNL has often reported network-related problems. The primary OPN link failed on Thursday night when a fibre bundle was cut on Long Island; there was a manual failover to the secondary link. Such failovers need to be automated, and the (relatively) high rate of problems seen with this link needs continued follow-up.

M.Ernst added that BNL, CERN and ESnet are looking for a clear collaborative solution to the issues.

I.Bird asked why the automatic fail-over did not work properly. M.Ernst replied that the solution requires communication between the sites and the network infrastructure; a different, more automated solution is being studied.

A network problem at CERN on Monday caused a 3.5-hour degradation affecting CASTOR.
T.Cass added that the causes are being studied with the network support at CERN.

3.5 Database Service Enhancements

Support (on a best-effort basis) for the CMS and LHCb online databases was added to the service team's responsibility.

Oracle Data Guard stand-by databases are in production for all the LHC experiments' production databases, using hardware going out of warranty by the end of the year. They provide additional protection against:
- human errors
- disasters (disaster recovery)
- security attacks
(A minimal sketch of how such a stand-by role can be checked is given at the end of this section.)

3.6 WLCG at EGEE08

The idea is to have a panel/discussion session with three main themes:
- Lessons learned from the formal CCRC’08 exercise and from production activities;
- Immediate needs and short-term goals: the LHC start-up, first data taking and (re-)processing of 2008 data;
- Preparation for 2009, including the CCRC’09 planning workshop.

In each case the topic will be introduced with a few issues, followed by a wider discussion also involving people from the floor. The session is not looking for ‘official statements’; the opinions and experience of all are valid and important. These panels have worked well at previous events (WLCG workshops, GridPP, INFN, etc.) and do not require extensive preparation. It is probably useful to write down a few key points/issues in a slide or two (not a formal presentation). It is also an opportunity to focus on some important issues that may not have been fully discussed at previous events.

3.7 Post Mortem Reports

The participants in the Operations meeting are now preparing timely and detailed post-mortems. But what happens next? For example, both the CASTOR/ATLAS and CASTOR/RAL post-mortems propose actions and other follow-up. Without inventing a complex procedure, how do we ensure that this happens? When should an item go to the MB action list?

I.Bird replied that the MB should be involved only when the parties involved do not follow up and solve the issue; items should not be inserted in the MB list from the beginning.

J.Templon added that a Savannah tracker can be used to follow the open issues of the daily Operations meeting (as is done at the TMB).

3.8 Security Issues

Security at the WLCG sites was discussed briefly at the MB. The MB concluded by warning all Sites and Experiments to give maximum attention to the Security initiatives that are being discussed outside the MB, through the appropriate channels.
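As referenced in section 3.5, the following is a minimal sketch, for illustration only, of how an operator might confirm that a stand-by database is running in the Data Guard physical-standby role and check its lag. It assumes cx_Oracle and the standard v$database and v$dataguard_stats views; the connection details are placeholders, and this is not part of the procedure reported at the meeting.

```python
# Illustrative check of a Data Guard stand-by database (not the actual
# CERN procedure). Connection details are placeholders; requires the
# cx_Oracle client libraries and read access to the v$ views.
import cx_Oracle

# Hypothetical DSN for a stand-by instance of an experiment database.
conn = cx_Oracle.connect("monitor", "secret", "standby-db.example.org/ORCL")
cur = conn.cursor()

# Role and open mode: a Data Guard stand-by should report 'PHYSICAL STANDBY'.
cur.execute("SELECT database_role, open_mode FROM v$database")
role, open_mode = cur.fetchone()
print(f"role={role}, open_mode={open_mode}")

# Apply/transport lag, i.e. how far the stand-by is behind the primary.
cur.execute("SELECT name, value FROM v$dataguard_stats "
            "WHERE name IN ('apply lag', 'transport lag')")
for name, value in cur.fetchall():
    print(f"{name}: {value}")

cur.close()
conn.close()
```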
4. GDB Summary (Report) - J.Gordon
The MB did not discuss the report, but the action mentioned in it should be followed up at the next meetings.
5. Update on End User Analysis (Wiki_page) - B.Panzer
B.Panzer organized meetings and, with a working group on end-user analysis at CERN, prepared a proposal. CERN IT will provide a CASTOR instance for user analysis from 1 October; it will provide a pool of 100 TB for ATLAS and CMS. The exact functionality will be announced later in the week.

A.Pace added that the future new version will prevent users from staging files from tape and will have strong authentication. If this new version cannot be deployed, it will be the current version with only the xrootd interface added.

Ph.Charpentier asked why ALICE and LHCb are not covered; LHCb has its pools for analysis in its current CASTOR setup. B.Panzer replied that this is set up for end-user analysis, so that a few hundred users have about 1 TB each.

I.Fisk noted that the wiki page does not explain what the Experiments considered highly desirable. I.Bird replied that the Experiments' requirements are in the longer-term plan; this solution just provides an immediate short-term solution at CERN. No accounting or quota per user will be provided, but access to files is protected.

F.Hernandez asked whether the IN2P3 group can be an observer to the group at CERN. I.Bird replied that the general strategy for T1 and T2 sites needs to be discussed and a group should be started; the current group was set up to find a solution at CERN in the immediate term and its mandate can now be considered completed.

K.Bos added that the scenarios should be tested by representatives of the Experiments. A.Pace agreed that the participation and testing of the Experiments is necessary.

New Action: Form a working group for User Analysis with a strategy including T1 and T2 Sites.
6. AOB
No meeting next week. The next MB meeting will be on 30 September 2008.
7. Summary of New Actions
- Form a working group for User Analysis with a strategy including T1 and T2 Sites.
- Experiments' QR presentations at the MB. Agreed calendar: ALICE: 30/9, CMS: 30/9, LHCb: 7/10, ATLAS: 14/10.