LCG Management Board |
|
Date/Time:
|
Tuesday 07 November 2006 at 16:00 |
Agenda: |
|
Members: |
|
|
(Version
1 - 10.11.2006) |
Participants: |
A.Aimar (notes), L.Bauerdick, S.Belforte, L.Betev, K.Bos, N.Brook, T.Cass, Ph.Charpentier, L.Dell’Agnello, Di Quing, C.Eck, B.Gibbard, J.Gordon, D.Foster, F.Hernandez, M.Lamanna, M.Litmaath, J.Knobloch, H.Marten, M.Mazzucato, G.Merino, P.Nyczyk, B.Panzer, R.Pordes, G.Poulard, L.Robertson (chair), M.Schulz, J.Shiers, J.Templon |
Action List |
|
Next
Meeting: |
Tuesday 14 November from 16:00 to
17:00, CERN time |
1. Minutes and Matters arising (minutes) |
|
1.1 Minutes of Previous MeetingNo comments. Minutes approved. 1.2 LHCC Referees MeetingWill take place on Monday 13 November, 14:00-16:00 (see agenda). Two referees will be absent (F.Forti and P. Dauncey) the topics proposed are: - Report from CMS on the status CSA06 (20’ + 10’ for questions) -
Report on the status and
progress of the current In addition to the already agreed: -
Summaries of the revised
computing requirements for 2007-2010, by each experiment. |
|
2. Action List Review (list of actions)Actions that are late
are highlighted in RED. |
|
Done. B.Panzer
distributed a document (document). Comments and
feedback are expected.
Done. The sites have
included their in their plans the milestones for the installation of the 3D
databases.
Ongoing.
H.Renshall will present the situation at next MB. J.Shiers said that
three out of four have sent their requests. CMS is discussing the requirements
in their MB and will provide the information to H.Renshall. In general,
CMS’ intention is to use all resources that are available after SC4.
To be done. |
|
3. September 2006 Accounting |
|
L.Robertson asked for comments on the September Accounting Reports that were distributed. http://lcg.web.cern.ch/LCG/MB/accounting/accounting_summaries.pdf H.Marten suggested that the MoU commitment and the installed capacity line on the graphs should be dotted because when they overlap one of the two is not visible. J.Templon asked that there is additional information with a comparison between the resources requested, at each site, by each VO vs. those actually used by the VO. The information available in terms of pledges and the planning in the Megatable is available only for 2008. It was agreed that the regularly updated requests that are maintained by H.Renshall should be the reference; as there the values are specified by month, by experiment and by site. L.Bauerdick disagreed and explained that CMS had asked for 25-30 TB for each of their Tier-1 sites and some sites did not fulfil the request – which could therefore not be used. On the other hand in-complete usage by the experiments is normal in this phase and sites should not be worried because the resources (esp. CPU) are used with peaks in which the resources are needed. G.Merino supported J.Templon’s proposal and asked that the requests should all come in the same format and with the same granularity and with the same kind of information (showing incremental and cumulative needs). The MB requested that all resources requests should have
the same format, granularity (monthly) and the cumulative monthly values. H.Renshall should propose the template to use and update it
monthly. |
|
4. SAM Tests Review (document) |
|
The SAM tests for October where distributed and the sites provided feedback on the results. Not all sites have moved to allow the OPS VO for the tests;
some sites still only allow DTEAM. This should be corrected urgently 4.1 Status of the SAM Tests - P.Nyczyk (slides)The presentation provided a summary of the: -
Deployment
status -
Sensors
status -
Existing
tests -
VO
specific tests -
Availability
metrics -
Open
issues Slide 3 (with animations) shows how the SAM system has replaced the SFT submission mechanism and how Oracle is used to replace mysql for storing the tests information and results. The sensors available (slide 4) cover the main services, the ones still missing are for: - gLite WMS - MyProxy - VOMS - R-GMA - 3D databases at the Tier-1 sites The VO-specific test submission (slide 5) is the following: -
LHCb:
has already provided VO-specific jobs, but using the old SFT framework. Only
monitoring CEs for now and needs to migrate to SAM. -
ATLAS:
The standard jobs for all sensors submitted from SAM UI with Atlas
credentials. But no specific tests developed by ATLAS. -
CMS: The
account on the SAM UI has been created and some sample jobs were sent. But no
regular submission of the standard jobs is done. -
Slide 6 and 7 show the algorithm used to calculate the “site availability” values. The service and site status values are recorder every hour (24 snapshots per day). Daily, weekly, monthly availability is calculated using integration (averaging) over the given period, if there is scheduled downtime information from GOC DB this is also recorded and integrated into the calculations. Details of the algorithm are available on the GOC wiki portal: http://goc.grid.sinica.edu.tw/gocwiki/SAM_Metrics_calculation Several issues are still open (slide 8), in particular: -
All sensors have to be
reviewed and fixed: - Several missing sensor/tests have to be still developed - All tests should be better documented (Tested inline doc + Wiki) - “Job wrapper” tests should be put in production (simple display or data export needed) - Availability metric should be reviewed: is the current one the way to go? 4.2 Summary of the Sites Reports and Issues (slides) - G.MerinoG.Merino summarized some of the issues received from the Tier-1 sites. 4.2.1 Differences between Sam and GridView ReportsSlides 2 and 3 show the summary of the values by SAM and GridView, both are extracting them from the same database. The
graphs do not match because the SAM report distributed as the monthly report
considers only the OPS VO, while the GridView one includes also DTEAM. For ASGC and TRIUMF
dteam was used in both graphs because they had not moved to OPS yet at the
beginning of the month. They should move urgently. Slide 4 shows the main differences for IN2P3, CNAF, TRIUMF and ASGC. P.Nyczyk explained that in the period (20-27 October) some debugging was done to the SAM “aggregation module”. That explain why the SAM graphs are probably wrong. Action: 14 Nov 2006 - The SAM
report for October should be recalculated and redistributed, using the best
of OPS and dteam VOs. When the SAM tests are invalid or undefined this should
be noted on the graphs (not note as a site failure). For now having two systems allows to spot inconsistencies in the calculations, in the long term GridView will be the one developed further in order to provide more information (daily, weekly, monthly averages, export to Excel, etc) in addition to the graphs. J.Gordon noted that sites should use GridView to monitor daily their site and dig into the tests that fail when they see failures. M.Mazzucato noted that the tests should check what the experiments use in order to provide what is relevant to the VOs supported by a site. J.Templon asked that the values be visible “by service” so that when a site is considered down it should be visible what was not running properly. It was pointed out that this can be seen easily from GridView. 4.2.2 Operational IssuesdCache
Issues Several sites report about the “hanging gridftp doors” dCache scalability problems causing their site unavailability. Some of them report about corrective actions taken: - NIKHEF: Improvement of problem-detection scripts and after the upgrade to dCache v1.7 the situation improved - RAL: reduced the TCP window size on GridFTP doors then the situation improved -
FZK: Solves inconsistency:
SRM from gLite was “too” new. Tuned several dCache parameters (max
logins, number of streams per client) and automatically restarting the
gridftp doors. The situation improved for experiments, but not for the SAM
tests. CE
Issues Some sites reported stability issues: - RAL: upgraded CE to a 2xCPU, 2xCore with 4GB RAM and really improved load-related issues -
PIC: suffering a lot from
users submitting through many RBs and killing the CE. The Torque security vulnerability (20-Oct, Friday) was tackled in different ways: -
RAL: Patch applied in less
than one day, but CE SAM tests still failing for about 3-4 days later. - FZK: PBSPro released the patch on Monday. Queues at FZK were closed for the whole weekend. - PIC: Patch applied in “urgent” mode, created other problems in the WNs configuration that caused CE intermittent unavailability for some days. 4.2.3 SAM IssuesSite
Aggregation For example
SARA-NIKHEF is a Tier-1 centre formed by two LCG sites. The CE service from
SARA-NIKHEF Tier-1 is 90% at NIKHEF, so SARA CE unavailability is not
representative of the whole service. ISSUE: SAM not able to manage Tier-1s made of an
“aggregation” of sites Error
in the Summary Information From In2P3: It
would very useful if SAM reports were easier to interpret and allowed site managers
to easily spot problems. Also the individual tests results (get, put, JS, JL,
etc.) should be visible. This is available in the GridView where is
visible which services are failing. OPS vs DTEAM VOs In the October availability report, only
TRIUMF and CNAF reports that it had issues with configuring the OPS VO in the SE service. DTEAM tests ok. As of yesterday, SRM service still does not show up in the SAM OPS VO page for CNAF False
positives Slide 11 shows a typical case of miscalculation. The CE test
“lcg-rm” (replica management) fails due to a simple time out, but
it is not retried for next 5h, when the next SAM test series is run. This
causes the site to be down 21% of the day. The proposal is to repeat the
“lcg-rm” test soon after the error to check for spurious
failures. Another example (slide 12) is during 7-8 October when GridView
shows that data is missing. So it is a problem of the framework but this
lowers the average for all sites. Test Rates IN2P3 reported
(slide 13) that on the 21 October only 4 tests were executed and on the 22
October only one test (!). This is seen on all
site graphs and is a SAM problem, therefore should not be accounted on the
sites and the framework improved. J.Gordon stressed, and many at the MB agreed,
the fact that these cases should be highlighted because checking the failures
takes a lot of time for the sites. Lcg-rm CE Test This test introduces
correlation between CE and SE services, and makes it more difficult to
understand where the failure is. M.Schulz commented that some tests checking
the whole system are important and are needed. Having separate services
working does not imply that the site is properly working and that the
interfaces between services are configured correctly. M.Mazzucato
stressed that the SAM tests should all be run under the experiments VO and
this would increase the realism of the tests. P.Nyczyk
agreed and explained that this has been started (for ATLAS and soon for CMS)
but it takes a long time to get the accounts, the certificates and to set up
the cron jobs for all experiments. H.Marten
asked whether it is worth the effort to show these reports in public if the
values are still very incorrect and often because of the SAM tests still
being debugged. This is confusing the L.Robertson
agreed with the statement but he also highlighted that only by distributing
the results finally, after several months since April 2006, the problems are
discussed and followed up by the sites. M.Lamanna
mentioned that the “job wrappers” tests will also improve the
situation, once they are put in place (will be presented in December’s
GDB). This will reproduce the usage of each experiment under its perspective
and the services it uses. The MB agreed with H.Marten that the OPS tests should used
as the metric until the problems have been fixed. |
|
5. Status Review of the SRM 2.2 - M.Litmaath (slides) |
|
5.1
CASTOR
status
The CASTOR SRM v.2.2 SRM endpoint (srm-v2.cern.ch) was announced on the 1 November 2006. Now it is being configured as front-end for the experiments instances: -
Grid
mappings as for srm.cern.ch -
Stage
mapping -
Service
classes to map for the experiments storage classes A second endpoint
will be setup at RAL, the hardware is being installed. J.Gordon said that it
should be available in a week 5.2
DCache
status
FNAL has also
installed an endpoint (fledgling06.fnal.gov) for the test system setup. The Tape1 classes
are emulated by an Enstore “null mover”, which is not copying to
tape but will return a file full of zeros. But this can be used for all tests
as a tape. The v2.2 functionality
is mostly complete, still missing: -
srmGetSpaceTokens
expected this week -
Minimal
permission functions soon -
Further
development when ACLs become available (1-2 months). But this is not needed
for the current SRM release. The status of the
SRM requests viewable from this web page: http://cduqbar.fnal.gov:8080/srmwatch/ A second endpoint
will be installed at DESY, within a week or two. This will also be usable by
ATLAS because DESY is also an ATLAS site. 5.3
DPM
status
The DPM endpoint
lxdpm01.cern.ch is available since several months, with 1.6 TB of disk. The few features
missing are not concerning the LCG needs but other VOs. It is the reference
for the GFAL/lcg-utils tests. ATLAS asked for a second endpoint possibly at
another site than CERN. 5.4
|
|
6. Next Steps with the Megatable - C.Eck
The status will be presented at the GDB. The goal here is to agree what
are the next actions and by when the values in the Megatable have to be
completed. |
|
This presentation was cancelled because already covered at the GDB the next day. |
|
7. AOB |
|
J.Gordon
asked when the 2007 milestones will be discussed. A.Aimar replied that after
the QR reports are completed (this week) the LCG milestones will be prepared
and circulated to the MB, in a week or two. |
|
8. Summary of New Actions |
|
No new actions. The full Action List, current and past items, will be in this wiki page before next MB meeting. |