LCG Management Board
Tuesday 17 July 2007 16:00-17:00 - Phone Meeting
(Version 1 19.7.2007)
A.Aimar (notes), O.Barring, I.Bird, N.Brook, F.Carminati, L.Dell’Agnello, T.Doyle, M.Ernst, F.Hernandez, J.Gordon, C.Grandi, J.Knobloch, M.Lamanna, E.Laure, U.Marconi, G.Merino, R.Pordes, Di Quing, L.Robertson (chair), H.Renshall, J.Shiers, R.Tafirout
Mailing List Archive:
Tuesday 24 July 2007 16:00-17:00 - Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments received. Minutes of the previous meeting approved.
1.2 Requirements from the Experiments - L.Robertson
L.Robertson asked the LHC Experiments about progress on updating their resource requirements up to 2012.
ATLAS and CMS were not represented at the meeting.
LHCb - The Megatable is being updated and will be returned. The other resource requirements are being prepared up to 2010.
L.Robertson noted that the next 5 years should be covered, as that is what we ask the funding agencies to provide planning data for at the C-RRB. At the October 2007 meeting they will be asked to state the pledge for 2008 and their planning for 2009-2012. Without requirements from the experiments the funding agencies cannot provide this level of advance planning.
24 Jul 2007 - L.Robertson will ask the Experiments via email to provide their updated requirements up to 2012.
1.3 SRM 2.2 Weekly Update (Email) - J.Shiers
Below is the information distributed by J.Shiers via email.
From: Jamie Shiers
Sent: Tuesday, July 17, 2007 10:27
To: storage-classes-wg (Mailing list of the Grid Storage System
Subject: Status of SRM v2.2 milestones for MB report on Tuesday July 17th 2007
Here is the (brief) status for the MB heads-up today (updated with
SRM-01: done (FZK)
SRM-02: done (IN2P3)
SRM-03: done (BNL)
SRM-05: done (CERN)
SRM-09: done (
SRM-08: in progress (LAL)
Replies and more details in the web archive:
SRM-04 (SARA) (on target)
SRM-06 (NDGF) (done - published in IS)
SRM-07 (CNAF) (expect to be complete in ~1 week)
Summary of site milestones: good progress - most sites have / are meeting schedule
SRM-10 testing experiment scenarios with experiment certificates
- action on LHCb to assign lhcb/lcgprod role to Flavia
- Harry is creating SRM2 accounts for each VO for these testing activities
- Lana will send requested information today to LHCb so that they can
prepare for their testing
SRM-13 definition of tests, incl. SRM V1
- See ATLAS and LHCb plans on GSSD page:
The milestones for 11 and 18.7.2007 are on schedule and should be completed.
The sites should therefore be ready for LHCb’s testing in August and for ATLAS’s testing, which will start after CHEP in September.
The plans of the experiments’ tests are being defined by ATLAS and LHCb; the details are available from the GSSD wiki page.
ATLAS has now added a second phase to test data access from jobs, not simply network transfer testing, as previously planned.
L.Robertson quoted a mail from M.Mazzucato (Email, NICE login required) in which he asked for StoRM to be included in the SRM Roll-out plan.
L.Dell’Agnello added that CNAF has a StoRM installation ready for testing D1T0 transfers with CERN, but the tests have to be run by an experiment. The main contacts are with the Italian representatives of LHCb, but there are no formal contacts yet.
U.Marconi confirmed that LHCb’s plans and resources regarding StoRM are being defined.
L.Robertson asked CNAF and LHCb to specify additional milestones for StoRM to add to the SRM Roll-out plan (i.e. extending milestones SRM-7 and SRM-17 to include testing of StoRM).
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
Not done. L.Dell’Agnello will distribute a summary by end of July.
Not done. Waiting for a candidate reviewer to return from holiday.
C.Grandi reported that the YAIM version removing the wrong configuration is now implemented. It reintroduces the “deny” function and is being tested on the preview test bed. But in the working group there are doubts about that solution and about moving it towards its certification.
3. GDB Summary (Document) - J.Gordon
The SL4 issues were already presented last week. J.Gordon mentioned a few of the other points in his Summary document.
Accounting - The Tier-1 sites should verify that the Tier-2s associated with them publish the Tier-2 accounting data into the APEL database.
SL4 Upgrade - All sites were asked, at the Operations meeting, to proceed with the SL4 update before the end of August. An EGEE Broadcast will be sent to make sure the information reaches all sites.
Security - There is a need for coordination on how security and identity information is implemented in the middleware components. A proposal will be circulated to the GDB and other lists. A discussion with developers could take place to foster security.
Glexec - Sites would like to see that the VO is responsible for what is executed under the general VO accounts. A draft document by the JSPG is available and will soon be public for feedback.
G.Merino asked whether there is any news about the packages required by the VOs and whether there are conflicting packages required.
J.Gordon replied that LAL and RAL have installed the packages requested by the Experiments on the CIC portal’s VO-cards.
No conflicts on versions and dependencies were identified. He also pointed out that all sites should refer to the packages required by the VOs on the CIC Portal, which is the official place for such requirements.
GDB in TRIUMF - The next GDB in TRIUMF now has a registration page (http://indico.cern.ch/conferenceDisplay.py?confId=17747).
Registration is required in order to organize transportation between CHEP and TRIUMF.
4. Service Interventions Update (Slides) - J.Shiers
J.Shiers provided a summary of the status of service interventions at the sites.
4.1 Log Analysis
There are clear correlations between the service intervention log gleaned from EGEE broadcasts (and other sources) and the SAM results. This is hardly surprising – if e.g. the DB used by FTS/LFC is down, the SAM tests will show the database failures.
Now sites need to understand why things go down – what are the key causes – and fix them.
This requires clear and systematic reporting – in particular to the weekly Operations meeting.
4.2 Strengthening the Grid Services
In 2005, there was an “a priori” analysis of main services and their criticality. This led to a deployment plan, including h/w requirements, operations issues etc. (WLCG service dashboard). Some m/w issues were then identified, but never resolved.
Database-based services should have load-balanced servers and Oracle clusters, if needed. H/A Linux solutions are used only to a limited extent because they have some limitations (e.g. the H/A nodes must be on the same network switch, so a failure of that switch is again a risk).
Sites have now quite extensive experience running production services, and hence an “a posteriori” analysis is going to be very useful.
But this must be a coordinated, end-to-end analysis, and not just an analysis of individual (component) services (“if what the user sees is not working, it’s not working”). See also EGEE II recommendation 42, section 1.3, below:
1.3) For each one of these services :
1.3.1) Assess if the service is grid-wide (Network, BDII, VOMS, LFC, ...) or local (CE, SE, ...).
1.3.2) Assess how critical the service is, if high availability and transparent failover is necessary, if a downgraded service is acceptable, how long the service could stay unavailable, ...
1.3.3) Describe existing failure and performance monitoring for the service.
1.3.4) Assess the intrinsic robustness of the service (for example transparent grid-wide failover), and the possibility to improve its local robustness (for example with a cluster, redundant hot-plug power supplies, fans, disks and a daily backup).
1.3.5) Describe failure patterns, in particular the list of the different error messages that can appear, their meaning, and failure propagation.
1.3.6) Assess if monitoring is necessary for the whole service and for its failure patterns as snapshots, if failure and performance monitoring archives should be available on-line as time series, if existing monitoring should be improved, and if new monitoring should be developed.
4.3 Key Grid Services
Storage management services have been a cause of concern for some time and have had a corresponding impact on e.g. effective inter-site transfer throughput. From outside it is difficult to analyze specific site issues, but deployment complexity, single-points-of-failure and shortage of resources are believed to be key causes of problems at most sites.
Data management and other services typically include a database backend – these too require particular attention (appropriate resources, configuration, expertise, etc). A not insignificant number of problems are related to ‘database housekeeping’ issues – this is independent of the type of DB used (Oracle, PostgreSQL, MySQL, etc.).
Useful general database measures are:
- Table defragmentation, pruning, clean-out of old records etc.
- Oracle Certified Professional training for all Tier1 DBAs
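The housekeeping measures listed above (pruning and clean-out of old records, followed by reclaiming the freed space) can be illustrated with a minimal sketch. It is not taken from any WLCG service: SQLite is used only because it ships with Python and keeps the example self-contained, and the table name `transfer_log` and column `finished_at` used below are hypothetical. On Oracle, PostgreSQL or MySQL the equivalent maintenance commands differ.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def prune_old_records(conn, table, ts_column, max_age_days, now=None):
    """Delete rows older than max_age_days and return how many were removed.

    Timestamps are assumed to be stored as ISO 8601 strings in UTC, so that
    string comparison matches chronological order.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    # Table and column names cannot be bound as SQL parameters; they are
    # hypothetical here and must come from trusted configuration only.
    cur = conn.execute(f"DELETE FROM {table} WHERE {ts_column} < ?", (cutoff,))
    removed = cur.rowcount
    conn.commit()
    # VACUUM rebuilds the SQLite file and reclaims the space freed above.
    conn.execute("VACUUM")
    return removed
```

Run on a schedule, with a retention period agreed with the service users, a routine of this kind keeps tables from growing without bound.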
CERN plans to present the results of the review of CERN services – together with techniques for handling the main types of intervention – at the WLCG workshop (at the Operations BOF) before CHEP (FTS 2.0 wiki).
The proposal is that the key areas of data management and storage management be addressed systematically across at least Tier0 and Tier1 sites:
- The workshop before CHEP seems a good place to kick this off…
- Possibly a follow-up in a (small) workshop in November?
- Production deployment early 2008 (first round)
With the agreement on a roll-out schedule for SRM v2.2 services, the deployment phase of WLCG services for the initial running period of the LHC is close to completion.
The next main challenge is improving service quality to an acceptable and affordable level – essential if we are to optimally exploit the LHC.
Aside from infrastructure (power, cooling, network) failures, the primary causes of unscheduled interventions and service loss are:
- Storage management services;
- Data management and other database-related services.
Techniques for building robust services are mature and well understood and therefore should be implemented.
4.6 A Positive Final Remark
J.Shiers concluded on a positive note, thanking R.Trompert and highlighting that a recent event at SARA had been handled exactly according to the rules defined (set channel inactive, broadcast, fix, etc.), making use of the ability of the FTS service to retry the transfers if it is paused properly.
L.Robertson asked whether the workshop at CHEP will clearly identify and recommend good practices at the sites.
J.Shiers replied that yes, good practices for all sites are needed and should be applied; for instance, mirrored databases should be on different power supplies.
J.Gordon supported the proposal and added that most sites have done this kind of analysis. He also expressed the worry that in order to support 2008 resources the number of people assigned will not be sufficient, and that many additional server systems may be required.
I.Bird replied that the number of CEs, and other servers will have to be increased at all sites as there will be half a million jobs executed each day next year.
L.Robertson asked whether other sites (RAL?) would be ready to present their services at CHEP in addition to CERN. J.Gordon replied that RAL will see whether it is possible to find some adequate speaker who will be present at CHEP.
F.Hernandez stressed that, in addition to having trained DBAs at the sites, the development teams should also train their developers in safer programming and usage of the databases. Applications should be less “optimistic” and, for instance, should foresee and implement proper recovery and safe retries after database failures or time-outs.
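As an illustration of the “safe retries” mentioned above, a minimal sketch is given here. It is not code from any WLCG application: `TransientDatabaseError` is a hypothetical stand-in for whatever timeout or connection exception a real database driver raises, and the attempt count and delays are arbitrary.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for a driver's timeout or connection error."""


def run_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on a transient failure, wait and try again.

    The delay doubles after each failed attempt. The last failure is
    re-raised so that the caller can still react to a persistent outage.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A retry of this kind is only safe if the operation can be repeated without side effects, for example a read, or a write wrapped in a transaction that is rolled back on failure.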
J.Shiers agreed and mentioned that in the past a course based on the book Effective Oracle by Design was very beneficial to the database developers that attended it at CERN. The book recommends that DBAs and developers work together on addressing the “top 10 issues” with their database applications.
I.Bird pointed out that there is no time to rewrite any application within the next two years. The good practices and training should be used to improve what is currently available. F.Carminati strongly supported this opinion.
L.Robertson concluded that analyzing and improving the database applications is useful and should be used to find improvements and workarounds for current problems. Future solutions should be studied once the current applications are working adequately.
5. Sites Reliability June 2007 (Sites Reports; Slides) - A.Aimar
A.Aimar presented a summary of the Site Reliability from January to June 2007. All details are available in the Sites Reports.
As the table below shows, the number of sites above target (88% until May and 91% now) is not improving; the main issues should therefore be addressed and followed up more closely.
The table below summarises the problems encountered and the solutions adopted (when available) at the different sites.
Operational Issues and problems with the SAM testing infrastructure seem to be more relevant than in previous months.
F.Hernandez noted that SAM did not take into account the IN2P3 scheduled downtime announced in the GOCDB. It seems that the monitoring must be turned off at the moment of the shutdown, not when the shutdown is announced. This is very inconvenient.
O.Barring and I.Bird replied that monitoring can be paused when the intervention is booked, not at the moment of the intervention.
J.Gordon added that other sites have reported that their downtimes do not seem to be taken into account accurately in SAM; a clarification is needed.
24 Jul 2007 - A clarification with the SAM team is useful and a discussion at the MB will be prepared for next week.
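To show why the handling of scheduled downtime matters for the figures, a small worked sketch follows. It assumes the usual convention that availability is the time up divided by the total time, while reliability excludes scheduled downtime from the denominator; the exact SAM calculation is what the clarification above is meant to establish. The hours used in the example are invented for illustration and are not taken from any site report.

```python
def availability(up_hours, total_hours):
    """Fraction of the whole period during which the site passed the tests."""
    return up_hours / total_hours


def reliability(up_hours, total_hours, scheduled_down_hours):
    """Fraction of the period, excluding scheduled downtime, with the site up."""
    expected_hours = total_hours - scheduled_down_hours
    if expected_hours <= 0:
        raise ValueError("scheduled downtime covers the whole period")
    return up_hours / expected_hours


# Invented example: a 30-day month (720 h) with 48 h of scheduled and
# 24 h of unscheduled downtime, i.e. 648 h up.
month_availability = availability(648, 720)
month_reliability = reliability(648, 720, 48)
```

With these invented figures, the site is below a 91% target if the scheduled downtime is ignored and above it if the downtime is taken into account.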
raised the issue that several sites in the
J.Shiers replied that lower reliability would cause major issues for the experiments and require more resources for the sites (i.e. much higher buffer sizes, complex recovering operations, and unfair distribution of load on the most reliable sites).
L.Robertson added that VO-specific tests should be also considered and that the main issues should be solved by the sites.
I.Bird added that CMS has the SAM tests in production and the other experiments have a set of SAM tests being developed.
Note: The VO tests implement specific tests and also the general SAM tests that are relevant to the VO.
24 Jul 2007 - A.Aimar will distribute to the MB the summary of the VO specific SAM tests for June 2007.
Future site reliability reports (from July 2007) will include:
- SAM sites results
- VO-specific SAM results
- Job reliability results
A.Aimar reported that several operational issues were discussed with the Monitoring Working Group in order to define standard probes that should anticipate standard errors (e.g. file systems becoming full, certificate expiration, etc.).
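A probe of the kind described above could look like the following minimal sketch. It is not one of the probes defined by the Monitoring Working Group: the thresholds, the numeric status codes and the function names are assumptions chosen for illustration, and the certificate expiry date is taken as an input rather than read from a certificate file.

```python
import shutil
from datetime import datetime, timezone

# Status codes chosen here for illustration.
OK, WARNING, CRITICAL = 0, 1, 2


def usage_status(used, total, warn=0.80, crit=0.95):
    """Classify a file system by the fraction of space in use."""
    fraction = used / total
    if fraction >= crit:
        return CRITICAL
    if fraction >= warn:
        return WARNING
    return OK


def filesystem_probe(path, warn=0.80, crit=0.95):
    """Warn before the file system holding 'path' becomes full."""
    usage = shutil.disk_usage(path)
    return usage_status(usage.used, usage.total, warn, crit)


def certificate_probe(not_after, now=None, warn_days=30, crit_days=7):
    """Warn before a certificate with expiry time 'not_after' expires."""
    now = now or datetime.now(timezone.utc)
    days_left = (not_after - now).total_seconds() / 86400
    if days_left <= crit_days:
        return CRITICAL
    if days_left <= warn_days:
        return WARNING
    return OK
```

The point of such probes is to raise a warning while there is still time to act, rather than to report the failure after the service has stopped.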
7. Summary of New Actions
The full Action List, current and past items, will be in this wiki page before next MB meeting.