LCG Management Board |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Date/Time |
Tuesday
27 October 2009 16:00-17:00 – Phone Meeting
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Agenda
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Members |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
(Version 2 – 9.11.2009) |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Participants |
A.Aimar (notes), D.Barberis, O.Barring, J-Ph.Baud, I.Bird (chair), K.Bos, M.Bouwhuis,
D.Britton, N.Brook, L.Dell’Agnello, D.Duellmann, I.Fisk, J.Gordon,
F.Hernandez, M.Kasemann, M.Lamanna, H.Marten, P.McBride, G.Merino, B.Panzer,
H.Renshall, M.Schulz, R.Tafirout |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Invited |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Action
List |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Mailing
List Archive |
https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/ |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Next Meeting |
Tuesday
10 November 2009 16:00-18:00 – F2F Meeting |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1. Minutes and Matters arising (Minutes)
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1.1 Minutes of Previous Meeting
Feedback was received from F.Hernandez about the minutes of the previous meeting. The minutes were updated (changes are in blue).
1.2 Pledges for CNAF
L.Dell’Agnello reported that the INFN pledges for 2009 will not be installed, but the resources pledged for 2010 will be installed in early 2010, all before the end of 2010Q2.
F.Hernandez added that the pledges for FR-CCIN2P3 are not those reported to the RRB; IN2P3 reported their pledges after the deadline for the RRB. He will inform the Experiments directly and ask S.Foffano to update the pledges table.
L.Dell’Agnello added that the Italian LHCb Tier-2 accounting seems to be missing from the accounting report. He will inform S.Foffano and C.Noble. J.Gordon replied that this was probably already reported and corrected.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2. Action List Review (List of actions)
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Not done by: FR-CCIN2P3 and NDGF. NDGF provided a URL, but SLS cannot access it because of certificate issues. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3. LCG Operations Weekly Report (Slides) – J-Ph.Baud
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.
All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
3.1 Summary
This summary covers the period from 12th to 25th October. There were a few incidents leading to SIR reports:
- RAL disk subsystem failures took down FTS, LFC and CASTOR from 4th to 9th October and eventually led to the loss of 10 days of data (200000 files) at RAL.
- ASGC ATLAS conditions database still not synchronized.
- ASGC CASTOR DB corrupted (21st October, not recovered yet).
Meeting Attendance Summary
GGUS Tickets Summary
L.Dell’Agnello noted that CMS often posts Site issues directly in Savannah instead of using GGUS tickets.
There were 3 alarm tickets in the week starting 12th October:
- CERN CASTOR stager stuck, reported by LHCb
- CERN CASTOR Name Server problem, reported by CMS
- CERN CASTOR Name Server problem, reported by LHCb
Problems with GGUS Tickets and OSG
The OIM view provided to GGUS should list only the ‘resource group’ name (BNL_ATLAS) with valid contact-email and emergency-email addresses. BNL is currently listed under 5 different names instead of a single one.
3.2 SAM results (see slide 7)
LHCb
LHCb had problems with their SAM tests; DIRAC was using an old version of the SAM tests for a few days.
RAL Disk Failures
ATLAS, CMS and LHCb SAM tests saw the RAL LFC, FTS and CASTOR downtimes (4 to 7 October for LFC and FTS, and up to 9 October for CASTOR) due to failing disk sub-systems. ALICE only tests their VOBoxes and saw an interruption from an SL4 to SL5 migration.
RAL CASTOR runs on a SAN with disk systems containing primary and mirrored databases. Hardware faults on the mirror since 10 September also hit the primary on 4 October, and CASTOR went down. The decision was to revert to older hardware and then revalidate the failing systems. The early suspicion was temperature problems.
- 8 October: CASTOR being restored without loss for ALICE and CMS, and losing a few hours’ transactions for ATLAS and LHCb, estimated at 10000 files. List of lost files being prepared for experiment decision.
- 9 October: CASTOR restored; experiments to recover lost files or to clean catalogues. Vendor working with RAL to understand the root cause of the failures.
- 14 October: Problem discovered with the database used following the restore. It resulted in the loss of around the last ten days of data added to CASTOR. The database restore had been OK; the problem arose when Oracle opened the database and picked up the ‘wrong’ disk array.
- 21 October: List of lost files (200000 for ATLAS) produced and LFC cleanup started.
D.Barberis commented (by email on the minutes) that ATLAS lost all data since 25 September, 200k files. As a matter of fact, this was 100% of the September reprocessing data that were produced in the last week of September. In total, RAL was off for ATLAS for about a month, as the clean-up finished only around 25th October. Actually only one dataset did not have another copy available at another site.
SIRs are available here:
- Hardware failures and loss of service: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091004
- Loss of data following restoration of services: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091009
ASGC DB Problems
Two major DB problems.
1) ATLAS Conditions DB: Has not been available for more than 4 weeks now. CERN DM group recommends performing a complete re-instantiation using transportable table spaces. BNL will be the source. Synchronization should happen tomorrow, 28th October (09:00 CET).
2) CASTOR DB: Has not been available for almost a week. All recovery attempts failed. Should the DB be reset? A phone conference will take place tomorrow, 28th October (09:15 CET).
Miscellaneous Reports
Several other smaller issues were found:
- CASTOR Name Server problem at CERN due to new CASTOR release (2.1.9).
- 180 files lost at NL-T1 (tape destroyed).
- Problem at CNAF installing CMS SW release (not understood).
- Instability of ATLAS Conditions Database at BNL due to high loads from Tier2s; solved by increasing memory.
- Problems with new SRM release at CERN (needed a rollback to the previous version).
- Problems with new BDII release at CERN (needed a rollback to the previous version).
- SRM problem at BNL last Friday due to a Java exception.
3.3 Conclusions
In summary the major issues were:
- Very long-standing problems at ASGC (CASTOR and Conditions Database).
- Serious disk hardware failures at RAL: 200000 files lost.
- A number of sites, including ASGC, have been unable to recover production databases from backups / recovery areas, with major downtimes occurring as a result.
A coordinated DB recovery validation exercise will take place on 26th November: RAL and ASGC are especially encouraged to participate.
D.Duellmann added that Sites should check their recovery capability and have all hardware available. In addition, the media for the archive and the recovery should be separated. This is valid for all databases at the Sites, 3D and others. All Tier-1 Sites should test their recovery at least once; M.Girone will coordinate the exercise.
D.Barberis noted that the loss at RAL was a major issue. One month of ATLAS work at RAL was lost and the Site was offline for 4 weeks.
D.Britton explained that it was a complex situation with multiple, correlated failures of new high-end hardware. The back-up and restore had, in fact, worked fine but, unfortunately, the content of the restored backup was incorrect due to the particular sequence of failures. RAL acknowledges that the issue is very serious and is investigating it with ORACLE. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4. Report on GDB Issues (Paper) – J.Gordon
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I.Bird also attached the action items from operations summaries (Slides).
J.Gordon summarized the issues discussed at the GDB meeting in October in a document.
GDB Issues for MB
John Gordon 27/10/2009
1. GDB Meeting Dates
The GDB is planned for the second Wednesday of each month in 2010. Problems arise in March, when there is a clash with the ISGC in Taipei (move to 3rd?), and in April, when the scheduled date of the 14th clashes with the week of the EGEE User Forum. The 7th is Easter Week, and the 21st is HEPiX and anyway closer to May. One solution would be to move the March meeting to the 24th and just have one meeting for March/April.
It is planned to hold a couple of meetings outside CERN.
Volunteers are sought.
J.Gordon added that FNAL, IN2P3 and NIKHEF have volunteered to host a GDB meeting outside CERN in 2010.
2. CREAM
The CREAM CE has been deployed at a range of T1 and T2 sites and tested extensively by ALICE from their VO Box. Job submission via WMS is now available. The functionality metrics have all been met, but the performance and reliability metrics need sustained testing with a wider variety of sites and experiments.
The MB should recommend that most if not all sites deploy a CREAM CE in parallel with their existing LCG-CE in order to allow testing of the widest range of use cases. The SAM tests should from now on also check the CREAM CE installations at the Sites.
O.Barring asked whether the LCG CE and the CREAM CE will have to run in parallel for a whole year.
J.Gordon replied that this is planned in order to test the CREAM CE while some Experiments still use the LCG CE.
M.Schulz added that the Condor interface to the CREAM CE is still missing, but it is not blocking the deployment. Some ATLAS use cases still need to use the LCG CE.
D.Barberis added that, for its tests, ATLAS needs the CREAM CE with the Condor interface.
M.Schulz agreed but added that this should not block the installation, so that all the other CREAM features can be tested while the Condor interface is developed.
Decision: The MB endorsed the request that all Sites proceed to the installation of the CREAM CE.
3. Pilot Jobs
The Experiments have an outstanding requirement to run multi-user pilot jobs at sites running analysis work. After a strong recommendation and a policy document from our own security experts, WLCG mandated that a pilot job workload should reflect the identity of the original user. The prerequisites for this (glExec, SCAS, analysis of frameworks) are now all in place, so the MB should ask sites to implement glExec with SCAS to allow the pilot job frameworks of the experiments to change identity. There is a residual holdup with the WN version of glExec not yet being certified for SL5, but this is days away, so sites should draw up their deployment plans for glExec and SCAS.
Multi-user pilot jobs which do not change identity are considered to be a risk to the infrastructure through lack of traceability. Sites not prepared to do this should not run multi-user pilot jobs.
M.Schulz added that the packages had to be re-certified and re-packaged. Now they can be installed.
M.Lamanna noted that the last GDB concluded that Sites are encouraged to move to glExec and SCAS; if they do not, Experiments can continue to run analysis with the current single generic credentials.
J.Gordon replied that Experiments should be allowed to run pilot jobs but not multi-user pilot jobs. Therefore Sites should install as soon as possible and, for instance, should not wait for any future release of Argus.
I.Bird replied that Experiments are already running multi-user pilot jobs; Sites should proceed with the installations in order to avoid security and identity issues. It is in the interest of the Sites to upgrade.
Decision: The MB demands that all Sites install glExec and SCAS and that multi-user pilot jobs perform an identity change for each job executed.
L.Dell’Agnello asked for the deadline.
I.Bird replied that it is the Site that should be worried about the threats and upgrade as soon as possible. Data-taking time was chosen as the deadline.
4. FTS 2.2
Version 2.2 of FTS is required to improve throughput by splitting GridFTP and SRM, but initial tests did not go well. This version is a major change, so there are doubts about whether it can be deployed before data taking. A plan for further testing and migration during data taking will be drawn up.
ATLAS and LHCb want the checksum features in FTS. But the 2.2 flow splitting does not allow keeping the rate of FTS 2.1 for the moment. Either the checksum is ported back to 2.1 or one will have to wait for FTS 2.2 to be fixed.
5. Pledges
Sites should all have had their 2009 pledges deployed by the end of September. The experiments were not convinced this had all been done.
While the installed capacity work has made good progress, it was not yet reliable enough to demonstrate the level of pledges at all sites. Discussed later in the AOB section.
6. Experiment Operations
The October meeting gave a platform to the experiments to highlight their operational issues for the sites. This will be repeated quarterly.
Other issues
L.Dell’Agnello asked whether the next GDB will tackle the issues of virtualization. J.Gordon replied that this would be at the GDB in November or December. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7. AOB
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7.1 DPM Support
DPM support will be provided to the WLCG at the level of one FTE. If EMI starts there would be more, but this is not guaranteed at the moment.
D.Britton noted that there are more than 150 end-points using DPM and, judging from the number of open tickets, the current DPM support is clearly overloaded.
I.Bird noted that he had already proposed that DPM support become a community effort, but nobody had replied and offered help.
D.Britton replied that the UK would be glad to contribute to a user forum where common tools are shared and developed.
M.Schulz added that the DPM product team in an EMI context (next year) could allow new participants to help.
D.Duellmann added that a DPM user forum was already proposed in the Pre-GDB, but more follow-up is needed. He will follow it up.
D.Britton added that the reason for his request was to highlight that DPM is an important tool that needs support.
I.Bird replied that this is a good issue to mention at the coming Collaboration Board.
7.2 Staged deployment of resources
F.Hernandez reminded the MB that the Experiments should have given the profile for the ramp-up of resources. ATLAS did it; what about the other Experiments?
M.Kasemann commented that CMS will also send a profile to the Sites in the next 2 weeks.
D.Barberis added that ATLAS will send an updated profile too.
7.3 Agreement on 2010 pledge installations - June
2009: All Experiments need the 2009 requirements installed by now.
2010: By June 2010, instead of April, all 2010 resources should be in place. This was agreed with the Scrutiny group and they expect an answer for next year too.
L.Dell’Agnello repeated that CNAF will have a large fraction of the 2010 pledges by 2010Q1 and all of them by 2010Q2.
F.Hernandez asked about the update of the pledges in the table published to the RRB.
I.Bird replied that he will follow up the issue with S.Foffano.
7.4 Response to CRSG timetable
A note will be sent commenting on the request for the estimates by March, which seems too early.
7.5 Planning for Christmas/new year - experiment plans
The accelerator will not be running, but the Experiments will continue running production and analysis. IT will be running on piquet for important services and on best effort for many others.
ATLAS and CMS reported that both will collect little cosmic data but will take calibration runs, plus MC production and analysis activities.
7.6 Timely responses to Tier 1 accounting requests
C.Noble asks that the Sites reply to the accounting requests in a timely manner, not at the last moment. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8. Summary of New Actions |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
No new actions for the MB. |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||