LCG Management Board
Tuesday 20 January 2009 - 16:00-17:00
(Version 1 – 23.1.2009)
A.Aimar (notes), L.Betev, I.Bird (chair), D.Britton, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, I.Fisk, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, M.Lamanna, P.McBride, P.Mato, S.Newhouse, A.Pace, R.Pordes, Di Qing, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout
Mailing List Archive
Next meeting: Tuesday 27 January 2009 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
J.Templon had commented on the minutes of the previous week, asking for a change to the record of what had been discussed.
The MB decided that if NL-T1 officially requests changes because the minutes do not match what was said, it should state the requested changes in detail. If what is written correctly summarizes what was said, it should not be changed.
No other comments. The minutes of the previous MB meeting were approved, pending the NL-T1 request.
1.2 Agenda of the Mini-Review (Agenda)
I.Bird updated the agenda for the mini LHCC Review. Only the names of the representatives of the Experiments are missing. Unless communicated otherwise the MB representatives are considered the speakers.
1.3 Tier-1 Reliability - December 2008 (Report)
A.Aimar distributed the OPS and VO-specific reliability reports for December 2008.
The Experiments' comments on the VO-specific SAM results will follow next week.
2. Action List Review (List of actions)
20.1.2009: a new patch was sent to SA3. There was a configuration problem; NIKHEF is working on the missing part and a new version will arrive soon, because it is too complicated to configure manually.
CERN: T.Cass reported that ALICE and ATLAS confirmed that they agree with the proposed SLA. Ph.Charpentier confirmed that LHCb also agrees with the SLA. M.Kasemann will check within CMS and update the MB next week.
Not done yet. I.Bird proposed that someone be appointed to complete this action.
Removed and will be followed by J.Gordon and discussed at the GDB.
Removed. Will be covered by the Architect Forum task on SL5 and gcc 4.3.
Discussed on the 9 Dec 2008. OSG asked for one more month, until end of January.
Removed. Overlaps with another action.
Will be presented on the 27 January.
3. Operations Weekly Report (Slides)
Summary of status and progress of the LCG Operations since last MB meeting. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
3.1 New Services Incidents and GGUS Tickets
“Service incidents” happen regularly – in most cases the service is returned “to normal” relatively quickly (hours to days) and presumably sufficiently rapidly for 2009 pp data taking (but painfully for the support team).
There were a number of significant incidents this week:
- Several “DB-related” problems – at least one critical!
- dCache problems at FZK. Similar problems have been seen at other sites: PIC and now Lyon.
- Outstanding problem exchanging (ATLAS) data between FZK and NDGF, but it is not clear where it belongs.
- ATLAS stager DB at CERN blocked on Saturday: Savannah ticket (degradation of ~1 hour)
- FTS transfers blocked for 16h also on Saturday: see “SIR”
- Oracle problems with CASTOR/SRM DB: “big ID” issue seen now at CERN (PPS) and ASGC (probably)
After it had been agreed that the format of the GGUS summary was adequate, it was changed without prior warning. It is now split into “submitted by”, “assigned to” and “affecting”. The format has also changed slightly.
The table below shows the tickets submitted by the VOs.
3.2 Alarms Tickets “outside” GGUS
There was no GGUS alarm ticket during this period, but ATLAS attempted to use the alarm flow for the FTS problem on Saturday:
- Sat 06:25 K.Bos’ delegated proxy gets corrupted, and transfers start failing
- Sat 11:16 GGUS team ticket opened by Stephane Jezequel
- Sat 21:28 Stephane sends an e-mail to atlas-operator-alarm, and the CERN CC operator calls the FIO Data Services piquet (Jan) at 21:55.
- Sat 22:30 G.McCance (called by Jan) fixes the problem, and informs ATLAS
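The delays in the timeline above can be checked with simple arithmetic. The following is a minimal Python sketch, assuming that all the listed times fall on the same Saturday; the event labels and function names are illustrative only and do not come from any WLCG tool.

```python
from datetime import datetime

# Times copied from the timeline above; the labels are illustrative only.
EVENTS = [
    ("proxy corrupted, transfers start failing", "06:25"),
    ("GGUS team ticket opened", "11:16"),
    ("e-mail sent to atlas-operator-alarm", "21:28"),
    ("operator calls the piquet", "21:55"),
    ("problem fixed", "22:30"),
]


def minutes_between(start, end):
    """Return the whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def step_delays(events):
    """Return (label, minutes since the previous event) for each later event."""
    return [
        (later[0], minutes_between(earlier[1], later[1]))
        for earlier, later in zip(events, events[1:])
    ]


def total_minutes(events):
    """Return the minutes from the first event to the last one."""
    return minutes_between(events[0][1], events[-1][1])


if __name__ == "__main__":
    for label, minutes in step_delays(EVENTS):
        print(f"{minutes:4d} min until: {label}")
    hours, mins = divmod(total_minutes(EVENTS), 60)
    print(f"total: {hours} h {mins:02d} min")
```

Under these assumptions the total is 16 hours and 5 minutes, consistent with the 16 h blockage reported for the FTS transfers. About 15 hours had passed before the e-mail to atlas-operator-alarm was sent.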
M.Schulz will tell the EMT about the importance of fixing the long-standing bug 33641 (in Savannah).
- [DM group] Increase priority of bug opened Feb 19, 2008.
- [G.McCance] Reinstate temporary (huh!) workaround
- [Jan] verify alarm flow: no SMS received by SMoD's. Understood: mail2sms gateway cannot handle signed e-mails. Miguel will follow this up [ new gateway later this month will at least send the subject line, which is ok ]
- No e-mail received by SMoDs. Understood: missing posting permissions on email@example.com, unclear how/when they disappeared.
3.3 DB Services
There were some DB Service issues:
- Online conditions data replication using Streams between the ATLAS offline database and the Tier-1 site databases was restored on Thursday 15.01. This action had been pending since the problem that affected the ATLAS Streams setup before the CERN Annual Closure (it took just over one month in total).
- There are still problems with CMS streaming from online to offline. The replication usually works fine, but the capture process fails on a regular basis (once a day on average). The problem seems to be related to incompatibilities between the Oracle Streams and Oracle Change Notification features. We are in contact with Oracle Support on this, and in parallel we are looking for ways to stop using the Change Notification functionality.
- The Tier-1 ASGC (Taiwan) has been removed from the ATLAS (conditions) Streams setup. This database has been down since Sunday 04.01 due to a block corruption in one table space and cannot be restored because there is no good backup. A new h/w setup is being established which will use ASM (as is standard across “3D” installations) and will hopefully be in production relatively soon.
The missing DBA expertise at ASGC is still an area of concern and was re-discussed with S.Lee last Thursday.
3.4 DM Services
From GridKA: We had massive problems with SRM last weekend, which we solved together with the dCache developers by tuning SRM database parameters, and the system worked fine again from Sunday until early Wednesday morning. Then we ran into the same problem again. There are several causes for these problems; the most important is the massive number of SRMls requests, which hammer the SRM. If the performance of SRM is OK, then the load is too high for the PNFS database. So tuning SRM only moves the problem to PNFS and vice versa: as soon as we solve one problem we run into the next.
Furthermore, we had some trouble because one experiment tried to write into full pools. This was a kind of "denial of service" attack, since many worker nodes kept repeating this request and the SRM SpaceManager had to refuse it. After detecting this problem we worked together with the experiment representative to free up some space, which eased the situation.
However, the current situation is that we have reduced the number of SRM threads running at a time, to give the ones already accessing the system a chance to finish. This seems to have eased the situation; we still have transfers failing because we run into the limit and reject these transfers, but this seems to be the only way to protect us from failing completely. In my opinion we should learn to interpret the SAM tests: in our case a failing SAM test currently does not mean that we are down, but that we are fully loaded. Next week we will try to optimise some of our parameters to see whether we can cope with more requests, but currently it seems better to let the system relax and finish the piled-up jobs.
The request, which comes from GridKA but is valid for all Sites, is:
We (experiments, developers and administrators) should all work together to find a solution that does not stress the storage system unnecessarily; tuning by itself does not solve this problem in the long run.
A.Heiss reported that for the moment FZK will reduce the number of CMS jobs that can run concurrently, while CMS modifies some of its applications. FZK allows only 40 concurrent SRM requests and 400 queued requests; any further requests are rejected. With these limits the system is running properly, but its measured availability will go down even though the system is fully used.
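The limits reported by FZK (40 concurrent and 400 queued requests, anything beyond that rejected) amount to a bounded admission policy. The following is a minimal Python sketch of such a policy. It is an illustration only and not the dCache or SRM implementation; the class name `RequestGate` and its methods are hypothetical.

```python
import threading


class RequestGate:
    """Admit up to max_active requests, queue up to max_queued, reject the rest.

    Hypothetical model of the policy described above, not dCache code.
    """

    def __init__(self, max_active=40, max_queued=400):
        self.max_active = max_active
        self.max_queued = max_queued
        self.active = 0
        self.queued = 0
        self.rejected = 0
        self._lock = threading.Lock()

    def submit(self):
        """Classify a new request as 'active', 'queued' or 'rejected'."""
        with self._lock:
            if self.active < self.max_active:
                self.active += 1
                return "active"
            if self.queued < self.max_queued:
                self.queued += 1
                return "queued"
            self.rejected += 1
            return "rejected"

    def finish(self):
        """Complete one active request and promote a queued one, if any."""
        with self._lock:
            if self.active == 0:
                raise RuntimeError("no active request to finish")
            self.active -= 1
            if self.queued > 0:
                self.queued -= 1
                self.active += 1


if __name__ == "__main__":
    gate = RequestGate()
    outcomes = [gate.submit() for _ in range(500)]
    print({name: outcomes.count(name) for name in ("active", "queued", "rejected")})
```

In this model a burst of 500 requests leaves 60 rejected while the service itself stays healthy. This is the situation described above, in which a failing test indicates a fully loaded system rather than one that is down.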
F.Hernandez added that IN2P3 had the same issue in mid-December and that, if there are commands (e.g. srm-ls) that should not be used, they should not be available to the users, because they are often just a cause of problems.
I.Fisk reported that some other commands that are never used in the production activities were not tested enough and often do not work properly. For example, “srm-mv” disconnects the files but does not move them, leaving unreachable files in SRM.
M.Kasemann added that the Experiments need to do tests at higher rates before real data taking happens.
I.Bird added that these tests could be arranged at the Workshop in Prague before CHEP and also that some Tier-1 Sites and MSS should be reviewed.
“Chronic incidents” – ones that are not solved (for whatever reason) for > or >> 1 week still continue.
The proposal is:
After 1 week (2? More? MB to decide) of serious degradation or outage the site responsible for a service ranked as “very critical” or “critical”, by a VO using that site, should provide a written report to the MB on their action plan and timeline for restoring this service asap.
E.g. the proposal to perform a clean CASTOR and Oracle installation at ASGC was made in October 2008 and has still not been performed.
I.Bird proposed, and the MB accepted, that if a service is down for more than one week the Site should send a Service Restoration Plan to the MB list. Everything should still be tracked via a GGUS ticket.
Ph.Charpentier asked that Sites do not close tickets until the issue is really solved, rather than when the Restoration Plan is presented.
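The accepted rule can be expressed as a simple date check. The following is a minimal Python sketch, assuming that "more than one week" means strictly more than seven days and that the outage start date is known; the function name is hypothetical.

```python
from datetime import date, timedelta

# Threshold accepted by the MB: more than one week of downtime.
PLAN_THRESHOLD = timedelta(weeks=1)


def restoration_plan_due(outage_start, today, threshold=PLAN_THRESHOLD):
    """Return True if the outage has lasted strictly longer than the threshold."""
    return (today - outage_start) > threshold


if __name__ == "__main__":
    # Example: the ASGC conditions database, down since Sunday 04.01
    # (year 2009 assumed from the date of these minutes), checked on 20.1.2009.
    print(restoration_plan_due(date(2009, 1, 4), date(2009, 1, 20)))
```

With these assumed dates the outage has lasted 16 days, so a Service Restoration Plan would be due under the rule.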
4. ALICE Quarterly Report 2008Q4 (Slides) – L.Betev
4.1 Data Taking
ALICE had several activities in preparation for data taking. There was a planned intervention involving cabling modifications and the installation of additional detectors; data taking was therefore stopped in October 2008.
The total data volume acquired during Q4 was 100 TB and all Tier-0 tasks were run continuously (except RAW replicas to Tier-1 Sites, for which only some spot checks were done).
The on-line condition parameter gathering is working properly for DAQ, HLT, DCS, etc. The on-line reconstruction of a sampled set of data runs synchronously with data taking, and the on-line Monitoring and QA is partly ready: the general framework is operational and the detector implementation is in progress.
The data taking of cosmics with complete detector will resume in June 2009, according to the current preliminary schedule.
4.2 Data Processing and Transfers
All collected cosmic data was “Pass 1” reconstructed, and additional reconstruction passes on selected samples of cosmic data were done with updated reconstruction algorithms and condition parameters, both at the Tier-0 and at Tier-1 Sites.
The reconstructed ESDs are available on the Grid and at two Analysis Facilities (CERN and GSI) and the data processing of cosmic data with complete detector will resume in June 2009.
ALICE performed periodic tests of FTS/FTD from Tier-0 to Tier-1 Sites and the migration to the new FTS is now completed.
4.3 Monte Carlo Data
ALICE MC production is only run when needed, not all the time. A large production for the EMCAL PPR is in progress, together with several pp productions with various signals. It ran unattended over Christmas and proved sufficiently reliable.
The end-user analysis involved studies of various types of SE performance in order to tune the SE parameters at each Site. The number of users was the highest ever, with about 80 users doing analysis on the Analysis Farm at any time at the end of 2008.
Production of PbPb events, in two impact-parameter classes, produces a very large amount of data.
There will be continuous MC production over the whole year, with large pp Min Bias productions to be started as soon as the LHC plans are known and several smaller “first physics” productions depending on the 2009/2010 LHC plans.
4.4 ALIROOT Software
Significant code refactoring is progressing, with simplified data access strategies, the introduction of CMake and the corrected usage of polymorphic containers. There was also work to achieve overlap-free geometry. In addition, the code was ported and validated on many platforms, Linux flavours and compiler versions.
PROOF-based parallel reconstruction is progressing, together with further development of the ALICE analysis framework.
Calibration and alignment with cosmics data was performed.
The new version of AliEn was deployed in December 2008. The ALICE job submission now uses only the WMS and no longer the RB.
The tuning of submission parameters is ongoing, following the end of 2008 exercise.
Additional WMS instances are needed around the world – WMS is currently provided at CERN, NIKHEF, RDIG and GridKA, but more are needed.
The CREAM CE deployment is ongoing, but is slow. It is currently available at GridKA, Kolkata, Subatech and IHEP (RDIG).
The next ALICE milestones are:
- MS-129 Mar 09: ALICE analysis train operational
- MS-130 Jun 09: CREAM CE deployed at all ALICE sites
- MS-131 Jun 09: AliRoot release ready for data taking
M.Schulz asked for the list of Sites that need to have the CREAM CE in operation for ALICE.
L.Betev replied that the list will be sent to M.Schulz: there are about 80 Sites.
Ph.Charpentier asked whether the CREAM CE could be installed at CERN for testing purposes.
T.Cass replied that there was no plan to do it urgently but he will check and report to the MB. It could probably be installed on the PPS.
L.Betev replied that the PPS provides a testing environment too limited for a CE. An initial availability in production would help real testing.
M.Schulz noted that CERN and many Sites were waiting for proxy renewal to be implemented in the CREAM CE before it can be deployed everywhere. It should be run in parallel with the LCG CE, not replace it; in this way, if there are issues, the VOs can still use the LCG CE as a fall-back solution.
I.Bird noted that the CREAM CE had already been on the PPS for some time (unused), but it could also be installed on a part of the production system for wider testing. The Sites should be informed that they should install it even if an upgrade will be needed when proxy renewal is finally implemented.
J.Gordon reminded the MB that the Sites were under pressure from the production activities of the VOs and had no time to install the CREAM CE.
5. LHCb Quarterly Report 2008 Q4 (Slides) – Ph.Charpentier
5.1 Activities during the Quarter
LHCb is completing the commissioning of DIRAC3 with more production tools and complex workflows. Analysis integration in DIRAC3 is continuing: the Ganga backend can now submit to DIRAC3, and the migration of the first users to DIRAC3 has started (retirement of DIRAC2 on January 12th), in order to remove the dependency on the LCG-RB completely.
LHCb completed the full SRM v2 data migration in October 2008. All LFC entries have been migrated to SRM v2.2 endpoints. Ph.Charpentier thanked FIO group for the support.
Other activities included extensive pre-staging tests at all Tier-1 Sites and continuous MC simulation production at all sites. The graph below shows the number of concurrent jobs during the quarter; 39 of the 109 Sites in total are shown on the graph. On average there were 5000 jobs running concurrently.
The graph below splits the jobs by type of jobs and the throughput by site source.
LHCb exceeded 1 GB per second, but Sites reach a sufficient rate. The fluctuations are due to insufficient bandwidth.
5.2 Issues Encountered
There are crucial problems with the Software Repository at the Sites. The applications must run on all worker nodes and all start from the Software Repository. On some Sites there are permission problems. LHCb needs gssklog, but it is still not working at CERN and is installed manually there.
LHCb had scalability issues, possibly due to hardware limitations. For instance, CNAF had serious problems, which were reported in December: GGUS 44729, opened 11.12.08 (closed and re-opened twice).
L.Dell’Agnello replied that CNAF needs more information about the problems submitted. The ticket did not provide information and requests for clarification were not answered. Problems must be explained better, otherwise they cannot be solved promptly. New hardware will be installed for LHCb in one month.
Ph.Charpentier agreed but added that the issue had been mentioned in several meetings and that details had repeatedly been provided there.
The central Software Repository is a major investment. For example, at CERN there are four load-balanced read-only AFS servers. This is an essential requirement and was originally neglected by Sites and Experiments in their applications. For instance, using SQLite on NFS is impossible, and LHCb caused problems at some Sites because of its usage of SQLite. The new version copies the SQLite files locally onto the WN.
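The approach of copying SQLite files to the worker node before opening them can be sketched as follows. This is a minimal Python illustration using only the standard library; it is not the LHCb implementation, and the function name `open_local_copy` and the scratch-directory handling are assumptions.

```python
import os
import shutil
import sqlite3
import tempfile
from pathlib import Path


def open_local_copy(shared_db_path, scratch_dir=None):
    """Copy a SQLite file from a shared area to local disk and open the copy.

    Returns the read-only connection and the path of the local copy.
    """
    if scratch_dir is None:
        # Hypothetical choice: a fresh temporary directory on the local disk.
        scratch_dir = tempfile.mkdtemp(prefix="sqlite_local_")
    local_path = os.path.join(scratch_dir, os.path.basename(shared_db_path))
    shutil.copy2(shared_db_path, local_path)
    # Open the local copy read-only, so that all file locking happens on the
    # local file system and the job cannot modify its copy by accident.
    uri = Path(local_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True), local_path
```

A job would call `open_local_copy` once at start-up with the path of the file in the shared software area, run its queries on the returned connection, and remove the scratch directory when it finishes.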
Ph.Charpentier thanked SARA for solving the performance issues with transfers out of SARA.
5.3 Plans for 2009
Below are the plans for the first half of 2009:
March: MC 2009 production.
Starting now: Full Experiment System Test (FEST09).
5.4 Resources for 2009
The timeline document will be ready mid March 2009 to be presented at the WLCG workshop in Prague. The final version must be ready for mid April in order to present it to the C-RRB.
The 2009 LHCb analysis model is being redefined using the 2008 experience. The goal is to adapt the Computing Model to the new trigger strategy, with more inclusive triggers and increased importance of stripping. LHCb will probably introduce production activities organized by “physics group” (like the ALICE analysis train) in order to avoid different users accessing the same data sets many times.
LHCb will also review the number of DST replicas and revisit its figures for event sizes and for the CPU needs of simulation and reconstruction. It will also move to the new WLCG units (factor 4).
F.Hernandez asked how the plots above were produced and how they can be available to the Sites.
Ph.Charpentier will send the URL of the DIRAC plots but the same information should soon be also available in the dashboard for the Sites.
F.Hernandez added that IN2P3 is preparing replication of AFS volumes for the central software areas of all Experiments. When a new release is installed, however, the replication must be re-triggered. Currently this is a manual operation, but it will be automated or executed on a fixed schedule.
Ph.Charpentier noted that at CERN the Experiments have the rights to trigger the replication by themselves.
F.Hernandez added that it would be good if the same process were done at different Sites with similar procedures (although with different implementations).
7. Summary of New Actions
No new actions.