LCG Management Board

Date/Time:

Tuesday 07 November 2006 at 16:00

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=a0632703 

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 10.11.2006)

Participants:

A.Aimar (notes), L.Bauerdick, S.Belforte, L.Betev, K.Bos, N.Brook, T.Cass, Ph.Charpentier, L.Dell’Agnello, Di Qing, C.Eck, B.Gibbard, J.Gordon, D.Foster, F.Hernandez, M.Lamanna, M.Litmaath, J.Knobloch, H.Marten, M.Mazzucato, G.Merino, P.Nyczyk, B.Panzer, R.Pordes, G.Poulard, L.Robertson (chair), M.Schulz, J.Shiers, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 14 November from 16:00 to 17:00, CERN time

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

No comments. Minutes approved.

1.2         LHCC Referees Meeting

Will take place on Monday 13 November, 14:00-16:00 (see agenda).

 

Two referees will be absent (F.Forti and P.Dauncey). The topics proposed are:

-          Report from CMS on the status of CSA06 (20’ + 10’ for questions)

-          Report on the status and progress of the current ALICE tests (20’ + 10’ for questions)

 

In addition to the topic already agreed:

-          Summaries of the revised computing requirements for 2007-2010, by each experiment.

 

 

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

  • 6 October - B.Panzer distributes to the MB a document on “where disk caches are needed in a Tier-1 site”, with everything included (buffers for tapes, network transfers, etc.).

 

Done. B.Panzer distributed a document (document).

Comments and feedback are expected.

 

  • 10 Oct 2006 - The 3D Phase 2 sites should provide, in the next Quarterly Report, the 3D status and the time schedules for installations and tests of their 3D databases.

 

Done. The sites have included in their plans the milestones for the installation of the 3D databases.

 

  • 13 Oct 2006 - Experiments should send to H.Renshall their resource requirements and work plans at all Tier-1 sites (cpu, disk, tape, network in and out, type of work) covering at least 2007Q1 and 2007Q2.

 

Ongoing. H.Renshall will present the situation at the next MB meeting.

J.Shiers said that three of the four experiments have sent their requests. CMS is discussing the requirements in their MB and will provide the information to H.Renshall. In general, CMS’ intention is to use all resources that are available after SC4.

 

  • 27 Oct 2006 - The MB members should send to I.Bird names of candidates for coordination of, and participation in, the three working groups (Site Management, Monitoring and System Analysis).

 

To be done.

 

3.      September 2006 Accounting

 

 

L.Robertson asked for comments on the September Accounting Reports that were distributed.

 

http://lcg.web.cern.ch/LCG/MB/accounting/accounting_summaries.pdf

 

H.Marten suggested that the MoU commitment and the installed capacity lines on the graphs should be dotted, because when they overlap one of the two is not visible.

 

J.Templon asked for additional information comparing the resources requested at each site by each VO with those actually used by the VO. The pledge information and the planning in the Megatable are available only for 2008. It was agreed that the regularly updated requests maintained by H.Renshall should be the reference, as there the values are specified by month, by experiment and by site.

L.Bauerdick disagreed and explained that CMS had asked for 25-30 TB at each of their Tier-1 sites and some sites did not fulfil the request, so those resources could not be used. On the other hand, incomplete usage by the experiments is normal in this phase and sites should not be worried, because the resources (especially CPU) are used in peaks when they are needed.

G.Merino supported J.Templon’s proposal and asked that the requests should all come in the same format, with the same granularity and with the same kind of information (showing incremental and cumulative needs).

 

The MB requested that all resource requests should have the same format, monthly granularity and cumulative monthly values.

H.Renshall should propose the template to use and update it monthly.
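
As a purely hypothetical illustration of such a common format (the real template is still to be proposed by H.Renshall; the experiment, site and figures below are invented), a request could list the incremental values per experiment, site and month and derive the cumulative ones:

```python
# Hypothetical sketch only: the actual template will be proposed by H.Renshall.
requests = [
    # (experiment, site, month, incremental disk request in TB)
    ("ExpA", "Tier1-X", "2007-01", 25),
    ("ExpA", "Tier1-X", "2007-02", 10),
    ("ExpA", "Tier1-X", "2007-03", 15),
]

cumulative = 0
for experiment, site, month, incremental_tb in requests:
    cumulative += incremental_tb
    print(f"{experiment} {site} {month}: +{incremental_tb} TB (cumulative {cumulative} TB)")
```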

 

4.      SAM Tests Review (document)

 

 

The SAM tests for October were distributed and the sites provided feedback on the results.

 

Not all sites have moved to allowing the OPS VO for the tests; some sites still only allow DTEAM. This should be corrected urgently.

4.1         Status of the SAM Tests - P.Nyczyk (slides)

The presentation provided a summary of:

-          Deployment status

-          Sensors status

-          Existing tests

-          VO specific tests

-          Availability metrics

-          Open issues

 

Slide 3 (with animations) shows how the SAM system has replaced the SFT submission mechanism and how Oracle has replaced MySQL for storing the test information and results.

 

The sensors available (slide 4) cover the main services; the ones still missing are for:

-          gLite WMS

-          MyProxy

-          VOMS

-          R-GMA

-          3D databases at the Tier-1 sites

 

The status of VO-specific test submission (slide 5) is the following:

-          LHCb: has already provided VO-specific jobs, but using the old SFT framework; it is only monitoring CEs for now and needs to migrate to SAM.

-          ATLAS: the standard jobs for all sensors are submitted from the SAM UI with ATLAS credentials, but no specific tests have been developed by ATLAS.

-          CMS: the account on the SAM UI has been created and some sample jobs were sent, but there is no regular submission of the standard jobs yet.

-          ALICE: no information

 

Slides 6 and 7 show the algorithm used to calculate the “site availability” values. The service and site status values are recorded every hour (24 snapshots per day).

 

Daily, weekly and monthly availability is calculated by integration (averaging) over the given period; if there is scheduled downtime information from the GOC DB, this is also recorded and taken into account in the calculation.

 

Details of the algorithm are available on the GOC wiki portal: http://goc.grid.sinica.edu.tw/gocwiki/SAM_Metrics_calculation
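
As an illustration of the averaging described above, a minimal sketch follows (this is not the actual SAM/GridView code; it assumes that hours flagged as scheduled downtime in the GOC DB are excluded from the average):

```python
# Sketch of the availability averaging described above (not the actual SAM code).
# snapshots: one status value per hour (1 = OK, 0 = failing), 24 per day;
# downtime_hours: hour indices flagged as scheduled downtime in the GOC DB.
def availability(snapshots, downtime_hours=()):
    considered = [status for hour, status in enumerate(snapshots)
                  if hour not in set(downtime_hours)]
    if not considered:                  # the whole period was scheduled downtime
        return None
    return sum(considered) / len(considered)

# Example: a site failing for 3 hours, 2 of which were scheduled downtime.
day = [1] * 21 + [0] * 3
print(availability(day))                             # 0.875 with no downtime info
print(availability(day, downtime_hours=[21, 22]))    # ~0.95 once downtime is excluded
```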

 

Several issues are still open (slide 8), in particular:

-          All sensors have to be reviewed and fixed:
- check that the tests reflect real usage by the experiments
- avoid dependencies on central services and third-party services where possible
- increase the reliability of the results, becoming more resistant to failures not related to site configuration
- increase the verbosity of the tests, making it easier to find the real problem and debug the site

-          Several missing sensors/tests still have to be developed

-          All tests should be better documented (inline test documentation + Wiki)

-          “Job wrapper” tests should be put in production (simple display or data export needed)

 

-          The availability metric should be reviewed: is the current one the way to go?

 

4.2         Summary of the Sites Reports and Issues (slides) - G.Merino

G.Merino summarized some of the issues received from the Tier-1 sites.

4.2.1          Differences between SAM and GridView Reports

Slides 2 and 3 show the summary of the values from SAM and GridView; both extract them from the same database.

The graphs do not match because the SAM report distributed as the monthly report considers only the OPS VO, while the GridView one also includes DTEAM.

 

For ASGC and TRIUMF, DTEAM was used in both graphs because they had not yet moved to OPS at the beginning of the month. They should move urgently.

 

Slide 4 shows the main differences for IN2P3, CNAF, TRIUMF and ASGC.

 

P.Nyczyk explained that in the period 20-27 October some debugging was done on the SAM “aggregation module”. That explains why the SAM graphs for that period are probably wrong.

 

Action:

14 Nov 2006 - The SAM report for October should be recalculated and redistributed, using the best of the OPS and DTEAM VOs. When the SAM tests are invalid or undefined this should be noted on the graphs (not counted as a site failure).

 

For now, having two systems makes it possible to spot inconsistencies in the calculations; in the long term GridView will be the one developed further, in order to provide more information (daily, weekly and monthly averages, export to Excel, etc.) in addition to the graphs.

 

J.Gordon noted that sites should use GridView to monitor their site daily and dig into the failing tests when they see failures.

 

M.Mazzucato noted that the tests should check what the experiments use in order to provide what is relevant to the VOs supported by a site.

 

J.Templon asked that the values be visible “by service”, so that when a site is considered down it is visible which service was not running properly. It was pointed out that this can be seen easily from GridView.

4.2.2          Operational Issues

 

dCache Issues

Several sites reported the “hanging GridFTP doors” dCache scalability problem as a cause of unavailability of their site.

 

Some of them reported on the corrective actions taken:

-          NIKHEF: improved its problem-detection scripts; after the upgrade to dCache v1.7 the situation improved

-          RAL: reduced the TCP window size on the GridFTP doors, after which the situation improved

-          FZK: solved an inconsistency (the SRM from gLite was “too” new), tuned several dCache parameters (max logins, number of streams per client) and automatically restarts the GridFTP doors. The situation improved for the experiments, but not for the SAM tests.
Their dCache upgrade from v1.6.6.0 to v1.6.6.5 did not improve the situation.

 

CE Issues

Some sites reported stability issues:

-          RAL: upgraded the CE to a dual-CPU, dual-core machine with 4 GB RAM, which really improved the load-related issues

-          PIC: suffering a lot from users submitting through many RBs and killing the CE.
Planning to deploy a second CE (redundant service) on more powerful hardware.

 

The Torque security vulnerability (Friday 20 October) was tackled in different ways:

-          RAL: patch applied in less than one day, but the CE SAM tests kept failing for about 3-4 days afterwards.
After the patching, problems appeared in the CE (a WN can stop the server from scheduling jobs). Still investigating.

-          FZK: PBSPro released the patch on Monday. Queues at FZK were closed for the whole weekend.

-          PIC: the patch, applied in “urgent” mode, created other problems in the WN configuration that caused intermittent CE unavailability for some days.

4.2.3          SAM Issues

 

Site Aggregation

For example, SARA-NIKHEF is a Tier-1 centre formed by two LCG sites. The CE service of the SARA-NIKHEF Tier-1 is 90% at NIKHEF, so the unavailability of the SARA CE is not representative of the whole service.

 

ISSUE: SAM is not able to manage Tier-1s made of an “aggregation” of sites.
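
As one possible way of aggregating such a federated Tier-1 (this is a sketch, not an existing SAM feature), the per-site results could be combined with weights reflecting each site’s share of the service, e.g. the 90%/10% CE split mentioned above:

```python
# Sketch of a capacity-weighted aggregation for a Tier-1 made of several LCG sites.
# Not an existing SAM feature; the weights reflect the CE split described above.
def aggregate(per_site_availability, weights):
    return sum(per_site_availability[site] * weight
               for site, weight in weights.items())

ce_availability = {"NIKHEF": 1.0, "SARA": 0.0}                    # SARA CE down all day
print(aggregate(ce_availability, {"NIKHEF": 0.9, "SARA": 0.1}))   # 0.9, not 0.0
```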

 

Error in the Summary Information

From IN2P3: it would be very useful if the SAM reports were easier to interpret and allowed site managers to spot problems easily. The individual test results (get, put, JS, JL, etc.) should also be visible. This is available in GridView, where it is visible which services are failing.

 

OPS vs DTEAM VOs

In the October availability report, only TRIUMF and Taiwan have some “DTEAM availability” at the beginning of the month; this should be considered for all sites or for none.

CNAF reports that it had issues with configuring the OPS VO in the SE service; the DTEAM tests are OK. As of yesterday, the SRM service still does not show up on the SAM OPS VO page for CNAF.

 

False positives

Slide 11 shows a typical case of miscalculation. The CE test “lcg-rm” (replica management) fails due to a simple timeout, but it is not retried for the next 5 hours, until the next SAM test series is run. This causes the site to be counted as down for 21% of the day (5 of the 24 hourly snapshots). The proposal is to repeat the “lcg-rm” test soon after the error, to check for spurious failures. Another example (slide 12) is 7-8 October, when GridView shows that data is missing. This is a problem of the framework, but it lowers the average for all sites.
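
A minimal sketch of the retry proposal above follows (the test-runner function and the retry parameters are hypothetical; a real implementation would live inside the SAM submission framework):

```python
import time

# Sketch of the proposed behaviour: if "lcg-rm" fails only with a timeout,
# retry it shortly afterwards instead of waiting ~5 hours for the next SAM
# series, so a transient timeout is not counted as 5 hours of downtime.
# run_lcg_rm_test, retries and delay_seconds are hypothetical.
def run_with_retry(run_lcg_rm_test, retries=2, delay_seconds=600):
    result = run_lcg_rm_test()
    for _ in range(retries):
        if result != "timeout":        # a real pass or failure is kept as-is
            break
        time.sleep(delay_seconds)      # wait a few minutes before retrying
        result = run_lcg_rm_test()
    return result                      # only a repeated timeout counts as a failure
```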

 

Test Rates

IN2P3 reported (slide 13) that on 21 October only 4 tests were executed and on 22 October only one test (!).

This is seen on all site graphs and is a SAM problem; it should therefore not be counted against the sites, and the framework should be improved.

 

J.Gordon stressed, and many at the MB agreed, that these cases should be highlighted, because checking the failures takes a lot of the sites’ time.

 

Lcg-rm CE Test

This test introduces a correlation between the CE and SE services and makes it more difficult to understand where the failure is.

 

M.Schulz commented that some tests checking the whole system are important and needed. Having the separate services working does not imply that the site as a whole is working properly or that the interfaces between services are configured correctly.

 

M.Mazzucato stressed that the SAM tests should all be run under the experiments’ VOs, which would increase the realism of the tests.

P.Nyczyk agreed and explained that this has been started (for ATLAS, and soon for CMS), but it takes a long time to get the accounts and certificates and to set up the cron jobs for all experiments.

 

H.Marten asked whether it is worth the effort to show these reports in public while the values are still often very incorrect because the SAM tests are still being debugged. This is confusing for the OB and for the site managers. The framework should be improved under the OPS VO before running it under the experiments’ VO credentials and making the results public. J.Gordon agreed with this statement.

L.Robertson agreed with the statement, but also highlighted that only by finally distributing the results, after several months since April 2006, are the problems discussed and followed up by the sites.

 

M.Lamanna mentioned that the “job wrapper” tests will also improve the situation once they are put in place (they will be presented at December’s GDB). These will reproduce the usage of each experiment, from its own perspective and for the services it uses.

 

The MB agreed with H.Marten that the OPS tests should be used as the metric until the problems have been fixed.

 

5.      Status Review of the SRM 2.2 - M.Litmaath (slides)

 

 

5.1         CASTOR status

The CASTOR SRM v2.2 endpoint (srm-v2.cern.ch) was announced on 1 November 2006.

Now it is being configured as a front-end for the experiments’ instances:

-          Grid mappings as for srm.cern.ch

-          Stage mapping

-          Service classes to map to the experiments’ storage classes

 

A second endpoint will be set up at RAL; the hardware is being installed. J.Gordon said that it should be available in a week.

5.2         dCache status

FNAL has also installed an endpoint (fledgling06.fnal.gov) for the test system setup.

 

The Tape1 classes are emulated by an Enstore “null mover”, which does not copy to tape but returns a file full of zeros. This can nevertheless be used as tape for all tests.

 

The v2.2 functionality is mostly complete; still missing are:

-          srmGetSpaceTokens expected this week

-          Minimal permission functions soon

-          Further development when ACLs become available (1-2 months). But this is not needed for the current SRM release.

 

The status of the SRM requests is viewable from this web page: http://cduqbar.fnal.gov:8080/srmwatch/

 

A second endpoint will be installed at DESY, within a week or two. This will also be usable by ATLAS because DESY is also an ATLAS site.

5.3         DPM status

The DPM endpoint lxdpm01.cern.ch has been available for several months, with 1.6 TB of disk.

The few missing features do not concern the LCG needs but those of other VOs.

 

It is the reference for the GFAL/lcg-utils tests. ATLAS asked for a second endpoint, possibly at a site other than CERN.

5.4         Berkeley SRM and StoRM status

The front-end for the tests of the production system is available (dmx09.lbl.gov).

It is the reference for the SRM-Tester suite.

 

CNAF has installed a second front-end for the test system (ibm139.cnaf.infn.it).

5.5         Client status

GFAL/lcg-utils have been successfully tested against DPM:

-          A tarball is to become available this week, but not the RPM yet.

-          Tests against CASTOR showed some issues that are being debugged (on both the client and the server side)

-          Tests against other endpoints are to be started soon

5.6         FTS

The first version compatible with v2.2 is not expected before December. The development was delayed due to the production system support load.

This is a real risk to testing and deployment, and it should not be delayed further.

 

A test UI+BDII has been prepared; the BDII will include the statically published v2.2 endpoints.

5.7         Storage Classes WG status

The summary of the 3 October pre-GDB meeting is here: http://indico.cern.ch/conferenceDisplay.py?confId=a058490

F.Donno’s minutes (in the agenda) provide a condensed and complete summary.

 

In summary:

-          “Tape1Disk1” is a supported storage class

-          The transitions “Tape1Disk0 to/from Tape1Disk1” are supported

-          The transitions “Tape0Disk1 to/from Tape1DiskN” are NOT supported

 

A space token can be associated with a specific tape set (so that files are not mixed on the same tape).

This means that the name space can remain orthogonal to the storage class, but VOs are advised to structure their name spaces to take advantage of possible optimizations.
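
As an illustrative encoding of the transition rules summarised above (this is not an official SRM interface; a VO or site tool could use something like this to validate a requested storage-class change):

```python
# Illustrative only: the supported transitions listed above, encoded as pairs.
SUPPORTED_TRANSITIONS = {
    ("Tape1Disk0", "Tape1Disk1"),
    ("Tape1Disk1", "Tape1Disk0"),
    # "Tape0Disk1 to/from Tape1DiskN" is NOT supported, so it is absent here.
}

def transition_allowed(current, requested):
    return current == requested or (current, requested) in SUPPORTED_TRANSITIONS

print(transition_allowed("Tape1Disk0", "Tape1Disk1"))   # True
print(transition_allowed("Tape0Disk1", "Tape1DiskN"))   # False
```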

 

The presentations by some Tier-1 sites during the 7 November pre-GDB meeting were very useful and highlighted very different uses and choices.

Can the other Tier-1 sites follow the examples and present their plans? And some Tier-2 sites?

They also need more details from the experiments (disk space needed, etc.) in order to find all show stoppers and misunderstandings.

5.8         Test suite status

The Berkeley SRM-Tester is available (http://sdm.lbl.gov/srm-tester/). Currently it is run manually, typically 5 out of 7 days a week.

Problems are reported to developers concerned and to a mailing list. The tester will be presented at SC’06 (http://sdm.lbl.gov/sc2006).

 

 

The S2 Test Suite is now further developed by F.Donno, with occasional advice from the original author J.Mencak.

This tester can be run as a cron job; the summary web pages are almost ready for general use and will be distributed once they are available.

5.9         GLUE schema status

The documents (by F.Donno et al.) have been submitted to the GLUE schema working group:

-          http://glueschema.forge.cnaf.infn.it/Spec/V13

-          Storage Element Model for SRM 2.2 and GLUE schema description

 

They present the WLCG use cases in detail.

 

After the August SRM workshop, a “Proposal for GLUE 1.3 for Storage” (by J.Jensen et al.) was written describing the changes requested. The documents do not completely agree; both were presented at the GLUE collaboration meeting held at IC on 30-31 October. Some of the proposed changes were too drastic for GLUE 1.3, which is intended to be mostly a “bug fix” release. Too many changes could jeopardize interoperability with other projects, and some aspects might be expensive for the WMS ClassAds SE representation. The GLUE experts proposed a simplified set of changes, still sufficient to deal with the WLCG client use cases. This is being discussed and small changes are still allowed; the agreed deadline is the end of the week.

5.10     Plan of work

The main actions are:

-          Discuss issues in phone conferences + mailing lists. Avoid non-trivial changes to WSDL/spec for the time being and collaborate with Storage Classes WG

-          Continue interoperability testing using the Berkeley SRM-Tester and the S2 Test Suite run from cron jobs.

-          Develop statistics summaries

-          Start using the test UI with GFAL/lcg-utils now, and FTS later, including functionality tests and stress tests that will be introduced gradually.

 

L.Robertson asked when SRM v2.2 will be available for testing by the experiments. M.Litmaath replied that it will be before the end of the week. The BDII needs to be configured and the libraries distributed. S.Burke is already testing for ATLAS, and the other experiments will benefit from the bugs already found.

 

L.Robertson asked for clarification about the FTS availability, which seems late and a possible risk. M.Litmaath said that it will be available during December. M.Schulz will report at the next MB meeting.

 

J.Templon asked whether GLUE 1.3 is compatible with 1.2. M.Litmaath said that compatibility is guaranteed.

 

 

6.      Next Steps with the Megatable - C.Eck

The status will be presented at the GDB. The goal here is to agree on the next actions and on when the values in the Megatable have to be completed.

 

 

This presentation was cancelled because the topic was already covered at the GDB the next day.

 

 

7.      AOB

 

 

J.Gordon asked when the 2007 milestones will be discussed. A.Aimar replied that after the QR reports are completed (this week) the LCG milestones will be prepared and circulated to the MB, in a week or two.

 

 

8.      Summary of New Actions 

 

 

No new actions.

 

The full Action List, with current and past items, will be on this wiki page before the next MB meeting.