LCG Management Board
Tuesday 07 November 2006 at 16:00
(Version 1 - 10.11.2006)
A.Aimar (notes), L.Bauerdick, S.Belforte, L.Betev, K.Bos, N.Brook, T.Cass, Ph.Charpentier, L.Dell’Agnello, Di Quing, C.Eck, B.Gibbard, J.Gordon, D.Foster, F.Hernandez, M.Lamanna, M.Litmaath, J.Knobloch, H.Marten, M.Mazzucato, G.Merino, P.Nyczyk, B.Panzer, R.Pordes, G.Poulard, L.Robertson (chair), M.Schulz, J.Shiers, J.Templon
Next Meeting: Tuesday 14 November from 16:00 to 17:00, CERN time
1. Minutes and Matters arising (minutes)
1.1 Minutes of Previous Meeting
No comments. Minutes approved.
1.2 LHCC Referees Meeting
Will take place on Monday 13 November, 14:00-16:00 (see agenda).
Two referees will be absent (F.Forti and P.Dauncey). The topics proposed are:
- Report from CMS on the status of CSA06 (20’ + 10’ for questions)
- Report on the status and progress of the current
In addition to the already agreed:
- Summaries of the revised computing requirements for 2007-2010, by each experiment.
2. Action List Review (list of actions)
Actions that are late are highlighted in RED.
Done. B.Panzer distributed a document (document).
Comments and feedback are expected.
Done. The sites have included in their plans the milestones for the installation of the 3D databases.
Ongoing. H.Renshall will present the situation at next MB.
J.Shiers said that three out of four have sent their requests. CMS is discussing the requirements in their MB and will provide the information to H.Renshall. In general, CMS’ intention is to use all resources that are available after SC4.
To be done.
3. September 2006 Accounting
L.Robertson asked for comments on the September Accounting Reports that were distributed.
H.Marten suggested that the MoU commitment and the installed capacity lines on the graphs should be drawn in different styles (e.g. one dotted), because when they overlap one of the two is not visible.
J.Templon asked for additional information comparing the resources requested at each site by each VO with those actually used by that VO. The information available in terms of pledges and the planning in the Megatable covers only 2008. It was agreed that the regularly updated requests maintained by H.Renshall should be the reference, as there the values are specified by month, by experiment and by site.
L.Bauerdick disagreed and explained that CMS had asked for 25-30 TB for each of their Tier-1 sites and some sites did not fulfil the request, which could therefore not be used. On the other hand, incomplete usage by the experiments is normal in this phase and sites should not be worried, because the resources (esp. CPU) are used in peaks when they are needed.
G.Merino supported J.Templon’s proposal and asked that the requests should all come in the same format, with the same granularity and with the same kind of information (showing incremental and cumulative needs).
The MB requested that all resource requests have the same format, the same granularity (monthly) and cumulative monthly values.
H.Renshall should propose the template to use and update it monthly.
4. SAM Tests Review (document)
The SAM tests for October were distributed and the sites provided feedback on the results.
Not all sites have moved to allow the OPS VO for the tests; some sites still only allow DTEAM. This should be corrected urgently.
4.1 Status of the SAM Tests - P.Nyczyk (slides)
The presentation provided a summary of the:
- Deployment status
- Sensors status
- Existing tests
- VO specific tests
- Availability metrics
- Open issues
Slide 3 (with animations) shows how the SAM system has replaced the SFT submission mechanism, and how Oracle has replaced MySQL for storing the test information and results.
The sensors available (slide 4) cover the main services, the ones still missing are for:
- gLite WMS
- 3D databases at the Tier-1 sites
The VO-specific test submission (slide 5) is the following:
- LHCb: has already provided VO-specific jobs, but using the old SFT framework. Only monitoring CEs for now and needs to migrate to SAM.
- ATLAS: The standard jobs for all sensors are submitted from the SAM UI with ATLAS credentials, but no specific tests have been developed by ATLAS.
- CMS: The account on the SAM UI has been created and some sample jobs were sent. But no regular submission of the standard jobs is done.
Slides 6 and 7 show the algorithm used to calculate the “site availability” values. The service and site status values are recorded every hour (24 snapshots per day).
Daily, weekly and monthly availability is calculated by integration (averaging) over the given period; if there is scheduled downtime information from the GOC DB, this is also recorded and taken into account in the calculations.
Details of the algorithm are available on the GOC wiki portal: http://goc.grid.sinica.edu.tw/gocwiki/SAM_Metrics_calculation
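The averaging described above can be sketched roughly as follows; the function name, data layout and downtime handling are illustrative assumptions, not the actual SAM code:

```python
# Hypothetical sketch of the availability calculation described above.
# One status snapshot is recorded per hour (24 per day); hours falling
# inside a scheduled-downtime window from the GOC DB are excluded from
# the average instead of being counted as failures.

def daily_availability(snapshots, scheduled_downtime):
    """snapshots: 24 booleans, one per hour (True = site OK).
    scheduled_downtime: set of hour indices (0-23) with announced downtime."""
    counted = [ok for hour, ok in enumerate(snapshots)
               if hour not in scheduled_downtime]
    if not counted:  # whole day in scheduled downtime: availability undefined
        return None
    return sum(counted) / len(counted)

# Example: 20 hours up, 2 hours failing, 2 hours in scheduled downtime.
snaps = [True] * 20 + [False] * 2 + [True] * 2
print(round(daily_availability(snaps, {22, 23}), 3))  # 20/22, about 0.909
```

Weekly and monthly values would then be averages of the daily ones over the period.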
Several issues are still open (slide 8), in particular:
All sensors have to be reviewed and fixed:
- Several missing sensor/tests have to be still developed
- All tests should be better documented (test inline doc + Wiki)
- “Job wrapper” tests should be put in production (simple display or data export needed)
- Availability metric should be reviewed: is the current one the way to go?
4.2 Summary of the Sites Reports and Issues (slides) - G.Merino
G.Merino summarized some of the issues received from the Tier-1 sites.
4.2.2 Differences between SAM and GridView Reports
Slides 2 and 3 show the summary of the values by SAM and GridView, both are extracting them from the same database.
The graphs do not match because the SAM report distributed as the monthly report considers only the OPS VO, while the GridView one includes also DTEAM.
For ASGC and TRIUMF, DTEAM was used in both graphs because they had not yet moved to OPS at the beginning of the month. They should move urgently.
Slide 4 shows the main differences for IN2P3, CNAF, TRIUMF and ASGC.
P.Nyczyk explained that in the period 20-27 October some debugging was done on the SAM “aggregation module”. That explains why the SAM graphs are probably wrong.
14 Nov 2006 - The SAM report for October should be recalculated and redistributed, using the best of the OPS and DTEAM VOs. When the SAM tests are invalid or undefined, this should be noted on the graphs (not counted as a site failure).
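The “best of OPS and DTEAM” recalculation could look roughly like this; the function name and the three-valued result encoding are illustrative assumptions:

```python
# Hypothetical sketch of the "best of OPS and DTEAM" recalculation:
# for each hourly snapshot, take the better of the two VOs' results,
# and treat invalid/undefined tests as "no data" (None) rather than
# as site failures.

def best_of(ops, dteam):
    """ops, dteam: lists of hourly results: True (OK), False (failed),
    or None (test invalid/undefined)."""
    merged = []
    for o, d in zip(ops, dteam):
        vals = [v for v in (o, d) if v is not None]
        merged.append(max(vals) if vals else None)  # True beats False
    return merged

print(best_of([True, False, None], [False, True, None]))
# [True, True, None]
```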
For now, having two systems makes it possible to spot inconsistencies in the calculations; in the long term GridView is the one that will be developed further, in order to provide more information (daily, weekly, monthly averages, export to Excel, etc.) in addition to the graphs.
J.Gordon noted that sites should use GridView to monitor their site daily and dig into the failing tests when they see failures.
M.Mazzucato noted that the tests should check what the experiments use in order to provide what is relevant to the VOs supported by a site.
J.Templon asked that the values be visible “by service” so that when a site is considered down it should be visible what was not running properly. It was pointed out that this can be seen easily from GridView.
4.2.2 Operational Issues
Several sites reported “hanging GridFTP doors”, a dCache scalability problem causing site unavailability.
Some of them report about corrective actions taken:
- NIKHEF: Improvement of problem-detection scripts and after the upgrade to dCache v1.7 the situation improved
- RAL: reduced the TCP window size on the GridFTP doors, after which the situation improved
- FZK: Solved an inconsistency: the SRM from gLite was “too” new. Tuned several dCache parameters (max logins, number of streams per client) and is automatically restarting the GridFTP doors. The situation improved for the experiments, but not for the SAM tests.
Some sites reported stability issues:
- RAL: upgraded the CE to a dual-CPU, dual-core machine with 4 GB RAM, which considerably improved the load-related issues
- PIC: suffering a lot from users submitting through many RBs and killing the CE
The Torque security vulnerability (20-Oct, Friday) was tackled in different ways:
- RAL: Patch applied in less than one day, but the CE SAM tests kept failing for about 3-4 days afterwards.
- FZK: PBSPro released the patch on Monday. Queues at FZK were closed for the whole weekend.
- PIC: Patch applied in “urgent” mode; it created other problems in the WN configuration that caused intermittent CE unavailability for some days.
4.2.3 SAM Issues
ISSUE: SAM is not able to manage Tier-1 centres formed by an “aggregation” of sites.
For example, SARA-NIKHEF is a Tier-1 centre formed by two LCG sites. The CE service of the SARA-NIKHEF Tier-1 is 90% at NIKHEF, so SARA CE unavailability is not representative of the whole service.
Error in the Summary Information
From IN2P3: It would be very useful if SAM reports were easier to interpret and allowed site managers to easily spot problems. Also the individual test results (get, put, JS, JL, etc.) should be visible. This is available in GridView, where it is visible which services are failing.
OPS vs DTEAM VOs
In the October availability report, only CNAF reports issues with configuring the OPS VO in the SE service; the DTEAM tests are OK. As of yesterday, the SRM service still does not show up in the SAM OPS VO page for CNAF.
Slide 11 shows a typical case of miscalculation. The CE test “lcg-rm” (replica management) fails due to a simple timeout, but it is not retried for the next 5 hours, until the next SAM test series is run. This causes the site to appear down for 21% of the day. The proposal is to repeat the “lcg-rm” test soon after an error, to check for spurious failures. Another example (slide 12): during 7-8 October GridView shows that data is missing. This is a problem of the framework, but it lowers the average for all sites.
IN2P3 reported (slide 13) that on the 21 October only 4 tests were executed and on the 22 October only one test (!).
This is seen on all site graphs and is a SAM problem; therefore it should not be counted against the sites, and the framework should be improved.
J.Gordon stressed, and many at the MB agreed, that these cases should be highlighted, because investigating the failures takes a lot of the sites’ time.
Lcg-rm CE Test
This test introduces a correlation between the CE and SE services, making it more difficult to understand where the failure is.
M.Schulz commented that some tests checking the whole system are important and are needed. Having separate services working does not imply that the site is properly working and that the interfaces between services are configured correctly.
M.Mazzucato stressed that the SAM tests should all be run under the experiments VO and this would increase the realism of the tests.
P.Nyczyk agreed and explained that this has been started (for ATLAS and soon for CMS) but it takes a long time to get the accounts, the certificates and to set up the cron jobs for all experiments.
It was asked whether it is worth the effort to show these reports in public while the values are still very incorrect, often because the SAM tests are still being debugged; this is confusing.
L.Robertson agreed with the statement, but also highlighted that only by finally distributing the results, after several months since April 2006, are the problems discussed and followed up by the sites.
M.Lamanna mentioned that the “job wrappers” tests will also improve the situation, once they are put in place (will be presented in December’s GDB). This will reproduce the usage of each experiment under its perspective and the services it uses.
The MB agreed with H.Marten that the OPS tests should be used as the metric until the problems have been fixed.
5. Status Review of the SRM 2.2 - M.Litmaath (slides)
5.1 CASTOR status
The CASTOR SRM v2.2 endpoint (srm-v2.cern.ch) was announced on 1 November 2006.
Now it is being configured as front-end for the experiments instances:
- Grid mappings as for srm.cern.ch
- Stage mapping
- Service classes to map for the experiments storage classes
A second endpoint will be set up at RAL; the hardware is being installed. J.Gordon said that it should be available within a week.
5.2 dCache status
FNAL has also installed an endpoint (fledgling06.fnal.gov) for the test system setup.
The Tape1 classes are emulated by an Enstore “null mover”, which does not copy to tape but returns a file full of zeros; this can nevertheless be used as tape for all tests.
The v2.2 functionality is mostly complete, still missing:
- srmGetSpaceTokens expected this week
- Minimal permission functions soon
- Further development when ACLs become available (1-2 months). But this is not needed for the current SRM release.
The status of the SRM requests is viewable on this web page: http://cduqbar.fnal.gov:8080/srmwatch/
A second endpoint will be installed at DESY, within a week or two. This will also be usable by ATLAS because DESY is also an ATLAS site.
5.3 DPM status
The DPM endpoint lxdpm01.cern.ch has been available for several months, with 1.6 TB of disk.
The few missing features do not concern LCG needs, only those of other VOs.
It is the reference for the GFAL/lcg-utils tests. ATLAS asked for a second endpoint possibly at another site than CERN.
6. Next Steps with the Megatable - C.Eck
This presentation was cancelled because the topic was already covered at the GDB the next day, where the status was presented. The goal remains to agree on the next actions and on when the values in the Megatable have to be completed.
J.Gordon asked when the 2007 milestones will be discussed. A.Aimar replied that after the QR reports are completed (this week), the LCG milestones will be prepared and circulated to the MB in a week or two.
8. Summary of New Actions
No new actions.
The full Action List, current and past items, will be in this wiki page before next MB meeting.