LCG Management Board
Tuesday 27 October 2009 16:00-17:00 – Phone Meeting
(Version 2 – 9.11.2009)
A.Aimar (notes), D.Barberis, O.Barring, J-Ph.Baud, I.Bird (chair), K.Bos, M.Bouwhuis, D.Britton, N.Brook, L.Dell’Agnello, D.Duellmann, I.Fisk, J.Gordon, F.Hernandez, M.Kasemann, M.Lamanna, H.Marten, P.McBride, G.Merino, B.Panzer, H.Renshall, M.Schulz, R.Tafirout
Mailing List Archive
Tuesday 10 November 2009 16:00-18:00 – F2F Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
Feedback received from F.Hernandez about the minutes of the previous meeting.
Minutes updated (changes are in blue).
1.2 Pledges for CNAF
L.Dell’Agnello reported that the INFN pledges for 2009 will not be installed; the resources pledged for 2010 will be installed starting in early 2010 and all before the end of 2010Q2.
F.Hernandez added that the pledges for FR-CCIN2P3 are not those reported to the RRB, because IN2P3 reported their pledges after the RRB deadline. He will inform the Experiments directly and ask S.Foffano to update the pledges table.
L.Dell’Agnello added that the Italian LHCb Tier-2 accounting seems to be missing from the accounting report. He will inform S.Foffano and C.Noble.
J.Gordon replied that this was probably already reported and corrected.
2. Action List Review (List of actions)
Not done by FR-CCIN2P3 and NDGF; a URL was provided but SLS cannot access it because of certificate issues.
Operations Weekly Report (Slides)
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
This summary covers the period from 12th to 25th October. There were a few incidents leading to SIR reports:
- RAL disk subsystem failures took down FTS, LFC and CASTOR from 4th to 9th October and eventually led to the loss of 10 days’ data (200000 files) at RAL.
- ASGC ATLAS conditions database still not synchronized.
- ASGC CASTOR DB corrupted (21st October, not recovered yet)
Meeting Attendance Summary
GGUS Tickets Summary
L.Dell’Agnello noted that CMS often posts issues with a Site directly in Savannah rather than using GGUS tickets.
There were 3 alarm tickets in the week starting 12th October:
- CERN CASTOR stager stuck reported by LHCb
- CERN CASTOR Name Server problem reported by CMS
- CERN CASTOR Name Server problem reported by LHCb
Problems with GGUS Tickets and OSG
The OIM view provided to GGUS should list only the ‘resource group’ name (BNL_ATLAS) with valid contact-email and emergency-email addresses. BNL is currently listed under 5 different names instead of one name only.
3.2 SAM results (see slide 7)
LHCb had problems with their SAM tests; DIRAC was using an old version of the SAM tests for a few days.
RAL Disk Failures
ATLAS, CMS and LHCb SAM tests saw the RAL LFC, FTS and CASTOR downtimes (4 to 7 October for LFC and FTS and up to 9 October for CASTOR) due to failing disk sub-systems. ALICE only tests their VOBoxes and saw an interruption due to the SL4 to SL5 migration.
RAL CASTOR runs on a SAN with disk systems containing primary and mirrored databases. Hardware faults, present on the mirror since 10 September, also hit the primary on 4 October and CASTOR went down.
The decision was to revert to older hardware and then revalidate the failing systems. The early suspicion was temperature problems.
- 8 October CASTOR being restored without loss for ALICE and CMS and losing a few hours’ transactions for ATLAS and LHCb – estimated at 10000 files. List of lost files being prepared for experiment decision.
- 9 October CASTOR restored – experiments to recover lost files or to clean catalogues. Vendor working with RAL to understand root cause of failures.
- 14 October Discovered a problem with the database used following the restore, resulting in the loss of around the last ten days of data added to CASTOR. The database restore itself had been OK; the problem arose when Oracle opened the database and picked up the ‘wrong’ disk array.
- 21 October List of lost files (200000 for Atlas) produced and LFC cleanup started.
D.Barberis commented (by email on the minutes) that ATLAS lost all data since 25 September, 200k files. As a matter of fact, this was 100% of the September reprocessing data that were produced the last week of September. In total, RAL was off for ATLAS for about a month, as the clean-up finished only around 25th October.
Only one dataset, however, did not have another copy available at another site.
SIRs are available here:
Hardware failures and loss of service. http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091004
Loss of data following restoration of services: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091009
ASGC DB Problems
Two major DB problems.
1) Atlas Condition DB
Has not been available for more than 4 weeks now. CERN DM group recommends performing a complete re-instantiation using transportable table spaces. BNL will be the source.
Synchronization should happen tomorrow 28th October (09:00 CET).
2) CASTOR DB:
Has not been available for almost a week. All recovery attempts failed. Should the DB be reset?
A phone conference will take place tomorrow 28th October (09:15 CET)
Several other smaller issues were found:
- CASTOR Name Server problem at CERN due to new CASTOR release (2.1.9).
- 180 files lost at NL-T1 (tape destroyed).
- Problem at CNAF installing CMS SW release (not understood).
- Instability of ATLAS Condition Database at BNL due to high loads from Tier2s; solved by increasing memory.
- Problems with new SRM release at CERN (needed a rollback to the previous version).
- Problems with new BDII release at CERN (needed a rollback to the previous version).
- SRM problem at BNL last Friday due to a Java exception.
In summary the major issues were:
- Very long standing problems at ASGC (CASTOR and Condition Database).
- Serious disk hardware failures at RAL: 200000 files lost.
- A number of sites, including ASGC, have been unable to recover production databases from backups / recovery areas, with major downtimes occurring as a result. A coordinated DB recovery validation exercise will take place on 26th November: RAL and ASGC are especially encouraged to participate.
D.Duellmann added that Sites should check their recovery capability and have all hardware available. In addition, the media for the archive and the recovery should be separated. This is valid for all databases at the Sites, 3D and others. All Tier-1 Sites should test their recovery at least once; M.Girone will coordinate the exercise.
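For context, the recovery validation being asked of the Sites amounts to verifying that backups can actually be read back before they are needed. For Sites running Oracle, a minimal readability check can be sketched with RMAN’s validate commands (a sketch only; the exact commands, catalog connection and channel configuration depend on each Site’s setup):

```
RMAN> RESTORE DATABASE VALIDATE;
RMAN> RESTORE ARCHIVELOG ALL VALIDATE;
```

These commands read the backup pieces and archived logs without touching the live database. A full exercise would restore to separate hardware, which is why D.Duellmann’s point about separating archive and recovery media matters.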
D.Barberis noted that the loss at RAL was a major issue. It was one month of work for ATLAS at RAL lost and the Site was offline for 4 weeks.
D.Britton explained that it was a complex situation with multiple, correlated failures of new high-end hardware. The back-up and restore had, in fact, worked fine but, unfortunately, the content of the restored backup was incorrect due to the particular sequence of failures. RAL acknowledges that the issue is very serious and is investigating it with Oracle.
4. Report on GDB Issues (Paper) – J.Gordon
I.Bird also attached the action items from the operations summaries (Slides).
J.Gordon summarized the issues discussed at the GDB meeting in October in a document.
GDB Issues for MB
John Gordon 27/10/2009
1. GDB Meeting Dates
GDB planned for second Wednesday of each month in 2010. Problems arise in March when there is a clash with the ISGC in Taipei (move to 3rd?) and April when the scheduled date of 14th clashes with the week of the EGEE User Forum. The 7th is Easter Week and the 21st is HEPiX and anyway closer to May. One solution would be to move the March meeting until 24th and just have one meeting for March/April.
It is planned to hold a couple of meetings outside CERN. Volunteers are sought.
J.Gordon added that FNAL, IN2P3 and NIKHEF have volunteered to host a GDB meeting outside CERN in 2010.
2. CREAM CE
The CREAM CE has been deployed at a range of Tier-1 and Tier-2 sites and tested extensively by ALICE from their VO Box. Job submission via WMS is now available. The functionality metrics have all been met but the performance and reliability metrics need sustained testing with a wider variety of sites and experiments. The MB should recommend that most if not all sites deploy a CREAM CE in parallel with their existing LCG-CE in order to allow testing of the widest range of use cases.
The SAM tests should from now on also check the CREAM CE installations at the Sites.
O.Barring asked whether for a whole year the LCG and the CREAM CE will have to run in parallel.
J.Gordon replied that this is planned in order to test the CREAM CE while some Experiments still use the LCG CE.
M.Schulz added that the Condor interface to CREAM CE is still missing; but it is not blocking the deployment. Some ATLAS use cases still need to use the LCG CE.
D.Barberis added that to test ATLAS needs the CREAM CE with the Condor interface.
M.Schulz agreed but added that this should not block the installation so that all the other CREAM features can be tested while the Condor interface is developed.
The MB endorsed the request that all Sites proceed to the installation of the CREAM CE.
3. Pilot Jobs
The Experiments have an outstanding requirement to run multi-user pilot jobs at sites running analysis work. After a strong recommendation and a policy doc from our own security experts WLCG mandated that a pilot job workload should reflect the identity of the original user. The prerequisites for this (glExec, SCAS, analysis of frameworks) are now all in place so the MB should ask sites to implement glExec with SCAS to allow the pilot job frameworks of the experiments to change identity. There is a residual holdup with the WN version of glExec not yet being certified for SL5 but this is days away so sites should draw up their deployment plans for glExec and SCAS. Multi-user pilot jobs which do not change identity are considered to be a risk to the infrastructure through lack of traceability. Sites not prepared to do this should not run multi-user pilot jobs.
M.Schulz added that the packages had to be re-certified and re-packaged. Now they can be installed.
M.Lamanna noted that the last GDB concluded that Sites are encouraged to move to glExec and SCAS; if they do not, Experiments can continue to run analysis with the current single generic credentials.
J.Gordon replied that Experiments should be allowed to run pilot jobs but not multi-user pilot jobs. Therefore Sites should install asap and, for instance, they should not wait for any future release of Argus.
I.Bird replied that Experiments are already running multi-user pilot jobs; Sites should proceed with the installations in order to avoid security and identity issues. It is in the interest of the Sites to upgrade.
The MB demands that all Sites install glExec and SCAS and that multi-user pilot jobs perform an identity change for each job executed.
L.Dell’Agnello asked for the deadline.
I.Bird replied that it is the Sites that should be worried about the threats and should upgrade asap. Data-taking time was chosen as the deadline.
4. FTS 2.2
Version 2.2 of FTS is required to improve throughput by splitting GridFTP and SRM, but initial tests did not go well. This version is a major change, so there are doubts about whether it can be deployed before data taking. A plan for further testing and migration during data taking will be drawn up.
ATLAS and LHCb want the checksum features in FTS, but the 2.2 flow splitting does not, for the moment, allow keeping the transfer rate of FTS 2.1. Either the checksum feature is back-ported to 2.1 or one will have to wait for FTS 2.2 to be fixed.
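For context, the checksum feature under discussion verifies file integrity after a transfer by comparing a checksum computed at the destination against the value stored in the catalogue; Adler32 is the algorithm commonly used by the LCG storage tools. A minimal sketch of the destination-side check (the file path and catalogue value shown in the comment are hypothetical):

```python
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the Adler32 checksum of a file, reading it in 1 MB chunks."""
    checksum = 1  # Adler32 is seeded with 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    # Return the usual 8-hex-digit representation
    return format(checksum & 0xFFFFFFFF, "08x")

# Hypothetical usage after a transfer: compare against the catalogue value
# catalogue_checksum = "0d1a4e23"
# assert adler32_of_file("/data/somefile.root") == catalogue_checksum
```

Chunked reading matters here because the files being transferred are large; feeding the running checksum back into `zlib.adler32` gives the same result as checksumming the whole file at once.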
5. 2009 Pledges
Sites should all have had their 2009 pledges deployed by the end of September. The experiments were not convinced this had all been done. While the installed capacity work has made good progress, it was not yet reliable enough to demonstrate the level of pledges at all sites.
Discussed later in the AOB section.
6. Experiment Operations
The October meeting gave a platform to the experiments to highlight their operational issues for the sites. This will be repeated quarterly.
L.Dell’Agnello asked whether next GDB will tackle the issues of virtualization.
J.Gordon replied that would be in GDB in November or December.
7. AOB
7.1 DPM Support
DPM support will be provided to the WLCG at the level of one FTE. If EMI starts there would be more, but this is not guaranteed at the moment.
D.Britton noted that there are more than 150 end-points using DPM and from the number of open tickets the current DPM support is clearly overloaded.
I.Bird noted that he had already proposed that the DPM support becomes a community effort, but nobody had replied and offered help.
D.Britton replied that the UK would be glad to contribute to a user forum where common tools are shared and developed.
M.Schulz added that the DPM product team in an EMI context (next year) could allow new participants to help.
D.Duellmann added that a DPM user forum was already proposed in the Pre-GDB but more follow-up is needed. He will follow it up.
D.Britton added that the reason for his request was to highlight that DPM is an important tool that needs support.
I.Bird replied that this is a good issue to mention at the coming Collaboration Board.
7.2 Staged deployment of resources
F.Hernandez reminded the MB that the Experiments should have given the profile for the ramp-up of resources. ATLAS did it; what about the other Experiments?
M.Kasemann commented that CMS will also send a profile to the Sites in the next 2 weeks.
D.Barberis added that ATLAS will send an updated profile too.
7.3 Agreement on 2010 pledge installations - June
2009: All Experiments need the 2009 requirements installed by now.
2010: By June 2010, instead of April, all 2010 resources should be in place. This was agreed with the Scrutiny group and they expect an answer for next year too.
L.Dell’Agnello repeated that CNAF will have by 2010Q1 a large fraction of the 2010 pledges and all by 2010Q2.
F.Hernandez asked about the update of the pledges on the table published to the RRB.
I.Bird replied that he will follow up the issue with S.Foffano.
7.4 Response to CRSG timetable
A note will be sent commenting on the request for the estimations by March, which seems too early.
7.5 Planning for Christmas/new year - experiment plans
There will be no running of the accelerator, but the Experiments will continue running production and analysis.
IT will be running on piquet (on-call) for important services and on best effort for many others.
ATLAS and CMS reported that both will not collect much cosmic data but will do calibration runs, plus MC production and analysis activities.
7.6 Timely responses to Tier 1 accounting requests
C.Noble asks that the Sites reply to the accounting requests in a timely manner, not at the last moment.
8. Summary of New Actions
No new actions for the MB.