LCG Management Board
Tuesday 15 May 2007 16:00-17:00
(Version 1 16.5.2007)
A.Aimar (notes), D.Barberis, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, X.Espinal, I.Fisk, S.Foffano, D.Foster, J.Gordon, F.Hernandez, J.Knobloch, H.Marten, R.Pordes, Di Qing, L.Robertson (chair), J.Shiers, O.Smirnova, R.Tafirout, J.Templon
Mailing List Archive:
Tuesday 22 May 2007 16:00-17:00
1. Minutes and Matters arising (minutes)
1.1 Minutes of Previous Meeting
Minutes of the 24 April 2007 approved.
The minutes of the 8 May 2007 will be distributed after the MB meeting.
2. Action List Review (list of actions)
Actions that are late are highlighted in RED.
Not done. J.Gordon said that he will send a proposal to the MB in a couple of weeks.
3. Site Reliability Reports (Reliability Data; Site Reports; Slides) - A.Aimar
The attached Slides show the April daily reliability values (slides 2 and 3). These are summarized and compared with the values since January in slide 4 and in the table below (sites ordered as in the Reliability Data document):
Reliability >= 88% (>= Target)
Reliability >= 79% (>= 90% of Target)
Reliability < 79% (< 90% Target)
The target of 88% for the best 8 sites was not reached, but 10 sites were within 90% of the target:
- 7 Sites > 88% (target)
- 7+3 Sites > 79% (90% of target)
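The banding used in these counts can be written down as a short sketch. It assumes only what the minutes state: a target of 88% and a second threshold at 90% of the target (0.9 x 88 = 79.2, quoted as 79%). The function name and the sample values are hypothetical and are not taken from the Reliability Data document.

```python
TARGET = 88       # percent, target stated in the minutes
NEAR_TARGET = 79  # percent, 90% of the target (0.9 * 88 = 79.2, quoted as 79)


def reliability_band(reliability_percent):
    """Return the band a monthly reliability value falls into."""
    if reliability_percent >= TARGET:
        return "target reached"
    if reliability_percent >= NEAR_TARGET:
        return "within 90% of target"
    return "below 90% of target"


# Hypothetical values for illustration only, not real site data.
sample = {"site-a": 95, "site-b": 83, "site-c": 60}
bands = {site: reliability_band(value) for site, value in sample.items()}
print(bands)
```

With this banding, the "7+3" figure above is the number of sites in the first band plus the number in the second.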
The table below summarizes the issues and solutions sent by the sites.
One can note that:
- dCache problems affecting several sites are still related to gridftp doors issues
- sam-bdii.cern.ch timeouts can be the cause of several issues and need to be investigated
- the rest are mostly sporadic alarms of operational issues
J.Gordon added that the sam-bdii problems cause the tests to be in an unknown state, and this should not be counted as a failure for the site. A monitoring session will take place at the next GDB, where these issues will be discussed.
In conclusion, the average values (excluding NDGF) are at or above the 88% target:
- Average 8 best sites: 92%
- Average all sites: 88%
Down-time is lower and therefore there are fewer issues such as dCache (GridFTP doors), CASTOR (pools resources), BDII timeouts, CE (unclear) and sam-bdii.cern.ch.
We have also started collecting reliability data at the Operations meeting every week. For now the reports from the sites are not sufficiently clear (to say the least), and they will be sent back to the authors. The MB members are invited to ask their representatives at the Operations meeting to fill in the weekly reports more carefully. If the weekly reports do not improve, the monthly reliability reports will continue for May 2007 and subsequent months until the weekly reports become adequate.
H.Marten noted that CMS is running their own site availability tests with different values although they are apparently using SAM for the tests.
I.Fisk explained that SAM is a framework and that CMS has added five VO-specific tests and removed some general SAM tests not relevant for CMS.
I.Fisk will circulate to the MB the description of the CMS tests.
F.Hernandez noted that (as written in the IN2P3 and BNL reports) there are differences between the Reliability Data distributed and the same data as seen in GridView. This issue will be followed up by verifying how the reliability data is generated by GridView.
Issue to follow:
The situation of the SAM tests should be followed up and a presentation organized at the F2F MB, covering the VO-specific tests and the comparison with GridView data.
1. Job Reliability Reports (document) - L.Robertson
A first proposal was discussed at the MB in
The Job Reliability reports (document) now include:
- CMS: CRAB user analysis jobs
- LHCb: pilot jobs
The ARDA team is also working with ATLAS to start providing similar data.
The reports will be generated monthly from now on. The MB is asked to check the values and send feedback and comments to L.Robertson.
L.Dell’Agnello asked whether it is possible to investigate further the reasons for the failures shown on the graphs, as one can now do with the site reliability reports.
L.Robertson replied that there are operations interfaces that enable the errors to be investigated, and he will ask M.Lamanna to contact L.Dell’Agnello to explain how to use these features.
The MB agreed to start distributing the Job Reliability reports every month and review the situation after a few months.
GDB Summary (document) - J.Gordon
1.1 Organizational Issues
The August 1st meeting is cancelled.
Proposal to move the October meeting from the 3rd, where it clashes with the EGEE Conference, to the 10th.
The LHC Experiments agreed to move the October GDB to the 10th.
Countries are encouraged to keep their GDB membership details up to date. Countries with a Tier1 should nominate a Tier2 representative too if appropriate.
1.2 Middleware Issues
See the document for more details.
SL4 - The WN has been through one round of testing in 32-bit mode on PPS; bugs have been fixed and it is due for release. UI and WN in 32-bit are the top priority. They are now in PPS and ready for release, and the experiments should test them urgently.
WMS - At the March MB Ian proposed some evaluation criteria for WMS and has now given a status report. gLite WMS 3.1 has achieved 15,000 jobs/day for 7 days (the criterion was 10k/day for 5 days), with 0.3% failures with no restart; all jobs completed when restarted. The Logging and Bookkeeping service is capable of much higher rates and is not a bottleneck. This is encouraging.
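As a rough illustration of the WMS figures, the sketch below checks a series of daily job counts against the stated criterion and works out what 0.3% failures means in absolute terms. It assumes that the criterion means at least 5 consecutive days at or above 10,000 jobs/day, and that the reported run was exactly 15,000 jobs on each of the 7 days; both are readings of the minutes, not statements from the evaluation itself. The function names are hypothetical.

```python
def longest_run_at_rate(daily_jobs, min_rate):
    """Length of the longest run of consecutive days at or above min_rate."""
    best = current = 0
    for jobs in daily_jobs:
        current = current + 1 if jobs >= min_rate else 0
        best = max(best, current)
    return best


def meets_criteria(daily_jobs, min_rate=10_000, min_days=5):
    """Assumed reading of the criterion: min_days consecutive days at min_rate."""
    return longest_run_at_rate(daily_jobs, min_rate) >= min_days


# Assumption: exactly 15,000 jobs on each of the 7 days reported.
reported = [15_000] * 7
total = sum(reported)

# 0.3% of all jobs, in integer arithmetic (3 per 1000).
failed_without_restart = total * 3 // 1000

print(meets_criteria(reported), total, failed_without_restart)
```

Under these assumptions the run comprises 105,000 jobs, of which about 315 failed before being restarted.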
CE - Current status of gLite CE. Close to 100% success of job submission – after resolving a number of timing issues with Condor. Submissions of 6,000 jobs to a CE (max ~3000 at any time). Several Condor issues were found – not yet clear on a timescale for resolving them. Not proven yet.
The fallback proposal could be:
- Keep the LCG-CE “as-is” - there is no effort to port to SL4 (which implies GT4 and potentially many issues)
- Deploy either on SL3 nodes or on SL4 with Xen/SL3. Contrary to previous reports, SL3 support will not stop in October 07 (SLC3 could also be continued for exceptional cases, like the CE). RHEL3 security patches will continue (until 2010), so it is feasible to continue with LCG-CE on SL3.
- Set up a CREAM instance in parallel and subject it to the same testing procedure because JRA1 effort is focusing on CREAM and not on improving the gLite CE.
J.Templon noted that a CREAM-only solution will not have a Globus interface and some sites need it because it is used by some of the VOs hosted at the sites.
1.3 Top 5 Issues
Technical summaries of groupings of issues were presented to GDB.
Castor - Tier0 Issues being addressed by a special task force. A new LSF Plug-in should address many of these issues. For Tier1s D1 storage classes are the highest priority. There is no firm plan for the remaining issues yet but it is being reviewed. Castor SRMv2.2 implementation is proceeding on track from a slow start. The Xrootd development is done by SLAC so negotiations are required.
Integration & testing of data and storage management components - The main outstanding issue that has not yet been addressed is multi-VO testing of Tier0-Tier1 transfers to demonstrate the nominal rates. This should become feasible soon, when ATLAS restarts bulk transfers. CMS are repeatedly transferring.
SRMv2.2 - The main implementations are tested for the functionalities requested. SRM v2.2 is available for the experiments to test. It is very important to have the experiments on the pre-production test-bed testing the environment as soon as possible in order to understand whether SRM v2.2 is ready for production.
Job Management - WMS 3.1 is making good progress. It addresses many of the issues. An outstanding issue (for a future GDB?) is the deployment of glexec, firstly on the CE and then on the Worker Nodes.
J.Templon added that nothing should prevent the installation of glexec at the sites (in two possible configurations, one limited to log-only mode) but this should be discussed in detail and explained to the sites at the GDB. Also the scalability should be tested.
and Data Storage Management - FTS version 2.0 is certified; a pilot service is used by the experiments for testing. It has interfaces to both SRM v1.1 and SRM v2.2 and includes VOMS-aware proxy renewal (
added that, like
F.Hernandez added that sites need control of the dataflow in and out of the site in order to tune it and limit it in case one VO blocks all others or when the site needs to reduce the incoming flow.
Without flow control such as that provided by FTS, the only solution will be to close the VO’s dataflow channels to the site.
J.Templon added that if transfers are executed using VO-specific tools and configurations the problems will be difficult to investigate.
N.Brook noted that LHCb needs communication from 80 T2 sites to all T1 sites; a complete FTS solution would require defining 80 FTS channels at each Tier-1 site.
L.Robertson stated that problems with non-standard solutions will be much more complicated to solve and will be lower priority for the sites than standard FTS transfers.
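The scale of the channel configuration N.Brook describes can be sketched as follows. The sketch assumes one channel per Tier-2/Tier-1 pair, as implied by the remark above. The site names and the number of Tier-1 sites are hypothetical placeholders, since the minutes do not state how many Tier-1 sites are involved.

```python
def enumerate_channels(tier2_sites, tier1_sites):
    """One (source Tier-2, destination Tier-1) pair per channel."""
    return [(t2, t1) for t1 in tier1_sites for t2 in tier2_sites]


# 80 Tier-2 sites, as quoted in the minutes; names are placeholders.
tier2 = [f"T2-{i:02d}" for i in range(1, 81)]

# Hypothetical Tier-1 list; the real number is not given in the minutes.
tier1 = ["T1-A", "T1-B", "T1-C"]

channels = enumerate_channels(tier2, tier1)
per_tier1 = len(channels) // len(tier1)
print(per_tier1, len(channels))
```

Each Tier-1 carries 80 channels whatever the number of Tier-1 sites, and the total grows linearly with that number.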
Top5 Summary - The experiments expressed satisfaction with the technical status reported but will await progress in the coming months. There will be future reports to the GDB, but the Management Board should consider and present a management plan (see next section).
1.4 Other Topics
Security - Agreed the final version of the Grid Site Operations Policy. Agreed a good draft version of the Grid Security Policy top-level document. To be sent to the MB for approval.
Grid Policy on Handling Logged Personal Information, which is relevant to user-level accounting privacy issues: this has not yet been discussed by JSPG. It is foreseen that OSG and EGEE have top-level documents.
Does WLCG need a user level privacy document for the sites neither in EGEE nor OSG?
L.Robertson replied that once the common EGEE and OSG document is available D.Kelsey should come to the MB with a proposal for the LCG, after having discussed it with EGEE and OSG.
File systems Working Group - M.Jouvin reported from the workshop held during HEPiX in April in DESY. A work plan has been agreed and evaluation started. Most Tier1s and many other sites are involved. The target is a final report for HEPiX in spring 2008.
2. Follow-up to Top 5 Issues from the Experiments (Slides) - L.Robertson
L.Robertson went through the main items and presented how the MB will follow them.
Task Force established to understand problems, prepare plan for addressing them
Report to MB once per month.
- Tier-0/CAF: first step LSF plug-in (delivered); next step is load balancing
- SRM 2.2: Being monitored as part of the SRM 2.2 activity. Basic and use case tests passed; stress tests starting.
- Support for Disk 1 (abort request when disk full)
- Improved Tier-1 deployment model
Longer term developments
- Review in September
- Access control/VOMS – current expectation 2Q08
- Quotas – current expectation 2009. Need to agree specification
Review at end of each quarter.
- Reliability issues
- Name experiment contacts needed
- Define problems; agree priorities with the experiments contacts
SRM 2.2: Being monitored as part of the SRM 2.2 activity. Basic and use case tests passed; stress tests under way.
As for Castor there are longer term issues: Access control/VOMS; quotas, etc.
Monthly review by MB already established. Available for experiment testing on PPS. Test and deployment plan being agreed with experiments, sites
- Access control/VOMS (request by CMS). Supported by DPM, in development for dCache. Need to agree that these are consistent and satisfy use cases; then deployment/development plans from dCache, Castor
- Quotas (request by ATLAS). First agree on requirements/feasibility
- File pinning (LHCb): This is not part of the agreed SRM 2.2 functionality
Organise GSSD/pre-GDB meeting(s) to agree use cases, functionality; then formal agreement in GDB
L.Dell’Agnello added that INFN will use StoRM, which should support VOMS as DPM does.
N.Brook stated that “file pinning” is part of SRM 2.2 but that CASTOR had agreed to implement it by allowing life-time extension on files.
T.Cass confirmed that file “life-time extension” is the approach chosen by CASTOR.
Sites should ensure that they are following deployment guidelines. I.Bird and M.Schulz should prepare proposals for a longer term solution. A new version using indices to provide significant speed-up expected soon. Separation of static and dynamic data should be considered
Reviewed monthly by MB. Agreed functionality in final stages of deployment
In final test prior to distribution in gLite release. In production at CERN. Functionality, performance as already agreed
Policy to be agreed at the GDB. Deployment of logging-only system may be acceptable to all sites
Full user switch is unlikely to be agreed.
The pilot project under way will be presented to the GDB for review and, if it satisfies the requirements, for agreement to proceed with the deployment plan.
LFC Bulk operations
Agreed bulk operations have been deployed. Additional functionality would have to be agreed prior to establishing development plan
File Management tools (ATLAS)
Basic tools are already in SRM 2.2.
Disk 1 management tools: Consistency of SE and experiment catalog is an application responsibility.
A utility to extract list of all files in SE is missing for all three MSS implementations
- CASTOR prototype/evaluation implementation: Developed by SLAC. Progress slow – long delays in testing, few SLAC resources to react to problems. Being tested by ALICE.
- DPM prototype: Available for testing by ALICE; specific build as it includes xrootd code and dependencies. Proposal for eliminating xrootd dependencies has been designed and is being implemented by
- dCache prototype: Independent of SLAC code and in test by
Need to establish SLAC commitment to support before including it in the WLCG planning.
Investigate possibility of US-ALICE providing resources
A.Aimar and L.Robertson will circulate follow-up milestones on the VOs Top 5 Issues to the MB.
L.Dell’Agnello asked for news about the HEPiX workgroup on benchmarking. INFN needs to launch some tenders and hardware providers do not publish SpecInt2k benchmarks anymore. What should sites do?
L.Robertson will distribute a proposal that had been prepared after the MB presentation on benchmarking in March.
A presentation on benchmarking will be scheduled for next F2F MB meeting.
4. Summary of New Actions
22 May 2007 - I.Fisk will circulate to the MB the description of the CMS SAM tests.
22 May 2007 - L.Robertson will distribute a proposal that had been prepared after the MB presentation on benchmarking in March.
29 May 2007 - A.Aimar and L.Robertson will circulate follow-up milestones on the VOs Top 5 Issues to the MB.
The full Action List, current and past items, will be in this wiki page before next MB meeting.