LCG Management Board
Tuesday 28 August 2007 16:00-17:00 - Phone Meeting
(Version 1 - 2.9.2007)
A.Aimar (notes), I.Bird, T.Cass, Ph.Charpentier, F.Donno, M.Ernst, X.Espinal, P.Fuhrmann, J.Gordon, C.Grandi, A.Heiss, J.Knobloch, E.Laure, D.Liko, P.Mato, P.McBride, D.Petravick, R.Pordes, L.Robertson (chair), R.Tafirout, J.Templon
Mailing List Archive:
Next Meeting: Tuesday 11 September 2007 16:00-17:00 - Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous meeting were approved.
1.2 SRM Management Meeting
L.Robertson reported that the SRM management meeting that he had proposed (see minutes of last week and action list below) will not take place. After discussions with the people concerned it will be replaced by separate dCache and CASTOR specific meetings (see emails in the MB Mailing List Archive, NICE login required).
1.3 Site Reliability Reports for July 2007 (Slides) – A.Aimar
A.Aimar distributed the summary of the Sites Reliability Report for July 2007 (see Slides).
1.3.1 Site Reports
July was considerably better than June 2007. As shown below, 7 sites were above target (>91% in green) and another 2 within 90% of the target (>82% in orange).
Most issues are related to MSS Systems and to operational problems at the sites (see slide 6 for details).
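The colour coding used in the report can be written down as a small rule. The following is only a sketch of that rule: the 91% target and the "90% of the target" band come from the text above, while the function name, the strict comparisons and the "red" label for everything below the orange band are assumptions made for illustration.

```python
TARGET_PCT = 91.0        # monthly reliability target quoted in the minutes
ORANGE_FRACTION = 0.90   # "within 90% of the target" (0.9 * 91 = 81.9, quoted as 82%)


def classify_reliability(reliability_pct, target_pct=TARGET_PCT):
    """Return 'green', 'orange' or 'red' for a monthly site reliability value.

    Hypothetical helper: the strict '>' comparisons and the 'red' label
    are assumptions, not taken from the report itself.
    """
    if reliability_pct > target_pct:
        return "green"
    if reliability_pct > ORANGE_FRACTION * target_pct:
        return "orange"
    return "red"
```

With this rule a site at 95% would be green and a site at 85% would be orange.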
1.3.2 VO-specific SAM tests
Starting this month, the VO-specific SAM test results are also collected.
The table below compares the general (OPS) SAM tests with those of each VO at each Tier-0 or Tier-1 site.
One can observe that:
- ALICE tests: The tests probably need to be completed or fixed.
- ATLAS at IN2P3: SAM tests are all failing.
- CMS at INFN: The SAM results are considerably different compared to the OPS tests and to the results of CMS at other sites.
ALICE, ATLAS, and CMS should verify the VO-specific test results above. An action is already in the MB Action List.
Overall, the average reliability of the 8 best sites was above the target, but the average over all sites was just below it.
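The two averages mentioned above can be computed as follows. This is a minimal sketch: the function name and the dictionary layout (site name mapped to reliability in percent) are assumptions, and no actual July values are reproduced here.

```python
def reliability_averages(site_reliability, n_best=8):
    """Return (average of the n_best most reliable sites, average of all sites).

    site_reliability is a hypothetical mapping of site name to monthly
    reliability in percent.
    """
    ranked = sorted(site_reliability.values(), reverse=True)
    best = ranked[:n_best]
    return sum(best) / len(best), sum(ranked) / len(ranked)
```

Both numbers would then be compared with the 91% target.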
The Quarterly Reports will include Site Reliability information from 2007Q3, as proposed later in the meeting.
J.Gordon asked whether it was clear why the July results were considerably better than those of June, and whether there had been any major improvement of any service to explain the change.
A.Aimar replied that there was no special reason for the improvement; all sites had already been above the target in past months, just never all during the same month.
L.Robertson added that, hopefully, the improvement is because the sites now verify and fix problems more promptly.
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
Not done. L.Dell’Agnello will distribute a summary by end of July.
Not done. Planned for this week but postponed to next MB meeting due to lack of time.
Not done. No representatives from the 3 experiments present at the MB meeting.
Removed. Proposal distributed by L.Robertson but not accepted. See Section 1.2 above.
3. SRM Update (SRM Plan) – F.Donno
F.Donno presented an update of the SRM Roll-Out plan status (see SRM Plan).
For convenience here is the text (in blue) of the summary extracted from the plan, together with the comments and discussions that followed.
With the help of the dCache development team last week almost all sites are configured for the LHCb and ATLAS exercises and pass the basic tests. Several configurations are being exercised. Tests with experiment scenarios and certificates are on-going.
A set of critical issues have been identified with SRM v2 server implementations and with the high-level tools.
Because of problems getting the dCache sites properly configured and with the high-level tools such as gfal and lcg-utils, the test slot reserved for LHCb has slipped by about 2 months. We are trying to keep up with the milestones described in this plan. However, we might need to review all targets after the 31st of August 2007.
Here we list the critical issues that need immediate attention in order to succeed with the roll-out plan.
1. Replica/Custodial-Online space reservation not implemented as foreseen by WLCG: space occupied by deleted files is not returned as available.
2. Clearer configuration and management documentation.
3. Management utilities (list associations between files, tokens, disk pools and tapes, space management and reconfiguration, etc.)
4. Some srm calls not conforming to the SRM v2.2 specification (April 2007).
5. Some srm calls not conforming to the SRM v2.2 specification (April 2007).
6. Some not-understood and not easily reproducible security problem that results in a "Current user does not own file" error.
7. The server reports "Too many threads" because of an internal bookkeeping error.
8. The performance of Get requests slows with time. It is not fully understood why.
9. Ls of big directories crashes the server (the fix seems to be already available but not fully tested yet)
10. Space reservation associated to tokens and independent of paths.
11. Some srm calls not conforming to the SRM v2.2 specification (April 2007).
High-level tools: gfal and lcg-utils
12. Light version of lcg-utils not fully independent of BDII (bugs to be fixed)
13. More robust directory existence checking
14. Correct handling of TURLs returned by the various implementations (CASTOR and rfio)
F.Donno also added that now the schedule for the Experiments’ tests is the following:
- LHCb from 25 September to end of October
- ATLAS from 10 September to 7 October
Therefore ATLAS sites and ATLAS needs should be addressed first.
L.Robertson asked whether these issues are really all critical and all stopping LHCb from running any test.
F.Donno replied that not all are show-stoppers but that, for instance, because of the different behaviour of some SRM calls (points 4, 5 and 11 above), some LHCb tests cannot be executed on all sites.
Referring to point 2, P.Fuhrmann stated that the current documentation was sufficient to configure some of the sites (e.g. FZK) in a few hours. F.Donno agreed but added that some management documentation is needed, e.g. for finding in which pool a file is stored. P.Fuhrmann replied that this documentation is not necessary for the execution of the LHCb tests. The dCache management documentation and tools (e.g. for locating a file) are not an SRM-related issue but a dCache feature request. He considered that only issues preventing LHCb from running should be addressed in this context.
F.Donno asked that the issues in points 4, 5 and 11 (i.e. SRM calls not conforming to the SRM specification) be solved because they are needed for the proper execution of lcg-utils and GFAL. She also asked that the issue in point 1 (about replica and custodial space) be fixed because it is needed by ATLAS.
L.Robertson said that the show-stoppers should be identified in sufficient detail. Each srm call that needs to be fixed should be identified separately. F.Donno clarified the situation saying that all issues are already listed in detail on the GSSD web, and the above text is just a summary. Every time there is a change she informs the SRM mailing list. P.Fuhrmann responded that some issues are present only at some sites and the information available is not always sufficient. In addition often those sites cannot be accessed by the dCache team.
L.Robertson stated that SRM 2.2 is considered to be essential by the experiments. It is therefore important to maintain priority on the current work to complete the testing and deployment of the four implementations. F.Donno should ensure that the critical issues are clearly identified and the status kept up to date on the GSSD web and the developers should use this to drive their priorities. The “show-stoppers” are the most critical and should be specified very clearly, and distinguished from less critical issues.
P.Fuhrmann said that he considers that WLCG should provide more support for dCache. I.Bird said that if by WLCG he meant CERN this is not feasible as CERN does not have the knowledge to provide dCache support; such support is needed from the dCache team. There is expertise on installing, configuring and operating dCache at several sites in Europe but there is no developer expertise outside the dCache team.
R.Pordes said that OSG has 1.5 FTE in VDT for support, and that includes support for dCache. She asked who is providing the support for the EGEE sites. I.Bird replied that there has been a major effort by CERN and EGEE in recent years to include dCache in the gLite distribution. The GD group even wrote the original installation documentation because none was available. In EGEE 2 one of DESY’s responsibilities is the support of dCache.
L.Robertson added that dCache is widely and very successfully used by many WLCG sites. Some of these sites now have considerable expertise gained through several years' experience of operating dCache. The issue under discussion here is the upgrade to version 1.8 and the problems encountered in the final stages of deployment.
I.Bird acknowledged the importance of the help provided by P.Fuhrmann in the previous week, but said that at this stage of the deployment there is a need for close support from the whole dCache development team as the problems emerge. D.Petravick replied that the developers’ effort at FNAL has been stretched in recent months because of vacations, support for Run 2 experiments, installations at BNL and FNAL, etc. A developer was sent to BNL to help with the installation. But he also stressed that support for the 1.8 upgrade is currently a second-priority item at FNAL.
I.Bird noted that the upgrade to 1.8 for all LCG sites was agreed as the top priority for the SRM providers, before any other fixes or new features. Progress was made at some sites because of the effort provided by P.Fuhrmann, but more help is needed from the developers in order to complete within the new schedule.
P.Fuhrmann and T.Perelmutov will be at CHEP. R.Pordes added that she, Gene Oleynik and Matt Crawford will be at CHEP and the dCache support issues could be discussed during the CHEP week. L.Robertson added that a phone conference can be set up in order to involve D.Petravick in the discussion if necessary.
L.Robertson asked that the 1.8 upgrade of all WLCG dCache sites be treated as a top priority by all SRM teams and MSS developers.
D.Petravick stated that a weekly phone meeting with F.Donno would be useful in order to align the issues found by the GSSD tests with the dCache team activities.
P.Fuhrmann asked that the dCache team at FNAL take over most of the support in solving the SRM 2.2 issues reported by GSSD.
During CHEP - Continued discussions about dCache support for the Version 1.8 / SRM 2.2 upgrade.
4.1 July accounting for CERN and Tier-1 sites – L.Robertson
L.Robertson had distributed the July accounting data to the MB (see emails in the MB Mailing List Archive, NICE login required).
No comments were received from the LCG MB; therefore those values are now official.
4.2 Automatic Accounting (Paper) – J.Gordon
J.Gordon reported on the correctness of the data in the APEL repository for the month of July. The attached paper shows the correctness, site by site, of the data inserted by the sites in the APEL repository and extracted by the automatic accounting procedure.
J.Gordon sent the following text (in blue below) for the minutes.
1. Tier1 Reporting
I have reviewed the differences between the accounting taken from the APEL Portal and the corrections reported by the sites in their manual reports for April and May. There is a definite improvement each month. In July only 7/12 sites were happy with their published APEL CPU numbers. The best month achieved was May, with 8/12. 10/12 sites have been green at one time. The remaining two are ASGC and NDGF. NDGF is starting tests now. ASGC reports, but the numbers are not correct. There is no fundamental problem in sites publishing correctly, but reliability is not complete, so sites still need to check and republish. Tests exist to highlight this but they are not yet in SAM.
R.Pordes added that OSG is reporting the Tier-1 data and is ready to report the Tier-2 accounting data and also to interface with the SAM tests verifying it.
She reminded the MB that the OSG contacts for any technical issues are Philippe Canal and Rob Quick.
The APEL portal now has a Tier2 Tree which shows, for each country, the Tier2 Federations from Sue Foffano's report. The federation structures are manually loaded from Sue's spreadsheet. Once this is stable an interface will be provided so that Sue can manage changes. Work still to be done includes a single automatic report of all Tier2s like Sue's manual one. This requires storage and use of the pledge information, which will also be loaded manually from the same spreadsheet.
3. Storage Accounting
The storage accounting portal has been improved so that it now shows Used and Allocated storage per VO per site, with aggregates up the tree. For some information providers there is still a problem with double counting of free space (allocated=used+free).
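The relation allocated=used+free suggests a simple consistency check on published records. The sketch below is one possible such check, not the portal's actual procedure; the record layout (dictionaries with "site", "vo", "used", "free" and "allocated" keys, all in the same unit) and the function name are assumptions.

```python
def find_inconsistent_records(records, tolerance=0.0):
    """Return (site, vo, discrepancy) for records where allocated != used + free.

    Hypothetical helper. A negative discrepancy means used + free exceeds
    the allocated space, which is one way double-counted free space could
    show up.
    """
    inconsistent = []
    for rec in records:
        discrepancy = rec["allocated"] - (rec["used"] + rec["free"])
        if abs(discrepancy) > tolerance:
            inconsistent.append((rec["site"], rec["vo"], discrepancy))
    return inconsistent
```

A non-zero tolerance can be passed to ignore rounding differences between information providers.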
The proposal is to start distributing the storage accounting information. J.Gordon will discuss it with the Tier-1 sites during CHEP.
5. Job Priorities Working Group (Slides) – D.Liko
Presentation postponed to next MB Meeting.
6. Simplification of future Quarterly Reports (Slides) – A.Aimar
A.Aimar proposed a simplification of the quarterly status and progress reporting activities.
6.1 Motivation for the Changes
The current QR reports were useful for following the preparation of the WLCG infrastructure, with specific milestones for each of the sites, during the beginning of LCG Phase 2. At the time (from 2005 to the beginning of 2007) no metrics were collected from the sites (reliability, accounting, etc).
Now the WLCG project is in a different phase: services are running, the initial infrastructure is being completed and the sites are equipped. The work is now more about commissioning and operating hardware and software, and milestones are followed in the High Level Milestones dashboard.
Currently the High Level Milestones monitor the general progress and several metrics are collected every month (about Accounting, Sites Reliability, Job Efficiency, etc). This information can now be used to monitor the overall progress, measure performance, explain problems and report every quarter.
6.2 Proposed Changes
Tier-0 and Tier-1 Sites
We can now use the different metrics that we collect to display the status and issues of the WLCG sites:
- High Level Milestones
- Accounting Installations vs. Pledges
- Site Reliability
- Job Efficiency
- FTS transfer rates
The proposal is that the sites:
- will not produce a quarterly report anymore
- will only be asked to explain, in writing for the Overview Board, each issue marked as “red” in the milestones or metrics collected.
Therefore, if a site is “green” under all metrics, it will not have to write anything for that quarter.
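The proposed rule for sites can be sketched as a filter over the collected metrics. This is an illustration only: the data layout (site name mapped to a dictionary of metric name and colour) and the function name are assumptions, and the metric names are the ones listed above.

```python
def issues_to_explain(site_status):
    """Return, per site, the metrics marked 'red' that need a written explanation.

    site_status is a hypothetical mapping: site name -> {metric name: colour}.
    Sites with no red metric are omitted, i.e. they have nothing to write.
    """
    to_explain = {}
    for site, metrics in site_status.items():
        red_metrics = [name for name, colour in metrics.items() if colour == "red"]
        if red_metrics:
            to_explain[site] = red_metrics
    return to_explain
```

A site that is green under every metric does not appear in the result at all.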
The current Quarterly Report will continue to be provided by main projects/activities (i.e. Applications Area, ARDA, 3D DB, SRM, and maybe others). We do not see an alternative way to monitor and report about the progress of the projects. Suggestions are welcome.
LCG Services and GDB Activities
LCG Services and GDB will no longer provide the Quarterly Report in the current format, but they should prepare a summary for the OB about achievements and issues during the quarter.
The LHC Experiments will no longer provide the Quarterly Report in the current format. Their progress will be summarized in a (short) presentation at the MB, by each Experiment, at the end of the quarter. The summary of those presentations will be in the MB minutes and, once approved by the Experiments, will be used for the quarterly report to the OB.
6.3 Reporting and Review Process
No changes. The material above will be collected and prepared at the end of the quarter. The reviewers will check the material and ask for additional information (as now). The final documents will be sent to the Overview Board, after adding an Executive Summary.
6.4 Actions To Do
In order to collect all information needed by the new reporting process a number of actions need to be completed:
If approved by the MB, the changes above will be proposed to the Overview Board and will be used starting with the next quarterly reports (2007Q3).
15 Sept 2007 - The MB members should send feedback and comments to A.Aimar about the changes to the QR report.
No MB meeting next week. The next MB meeting will be on 11 September 2007.
8. Summary of New Actions
The full Action List, with current and past items, will be in this wiki page before the next MB meeting.