LCG Management Board

Date/Time:

Tuesday 6 June 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a061504

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 08.06.2006)

Participants:

A.Aimar (notes), J.Bakken, D.Barberis, L.Bauerdick, I.Bird, K.Bos, S.Belforte, N.Brook, T.Cass, L.Dell’Agnello, D.Foster, J.Gordon, B.Gibbard, F.Hernandez, J.Knobloch, E.Laure, M.Lamanna, S.Lin, G.Merino, L.Robertson (chair), O.Smirnova, J.Templon, J.van Wezel

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 13 June 2006 at 16:00

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of the Previous Meeting

Minutes approved.

1.2         NDGF Participation in the MB

L.Robertson welcomed Oxana Smirnova, who will represent NDGF.

 

2.      Action List Review (list of actions)

 

 

Note: the actions that are still due are marked in RED.

  • 21 May 06 - J.Shiers and M.Schulz: Flavia’s list should be updated, maintained and used to control changes and releases. A fixed URL link should be provided to that list.

Which list to use will be clarified after the gLite 3.0 presentation at a forthcoming MB meeting.

 

  • 23 May 06 - Sites should send to lcg.office@cernNOSPAM.ch the names of the contact persons attending, possibly in person, the Internal Review meeting.

Done. The Internal Review is on the 8-9 June 2006.

 

  • 23 May 06 - Tier-1 sites should confirm via email to J.Shiers that they have set-up and tested their FTS channels configuration for transfers to Tier-1 and Tier-2 sites.

Not done by most sites. Confirmations should be sent to J.Shiers in the framework of the SC4 Coordination Meeting. As of 30.05.06, only RAL, FZK and PIC had done it.

 

Update: No new mails. Are the SAM tests sufficient? Have the Tier-1 sites configured their transfers to Tier-1 and Tier-2 sites?

  • 26 May 06 - Sites should complete and send the questionnaire for the Internal Review to the chair, V.Guelzow.

Questionnaire missing from INFN (30.05.06).

 

Update: Done. INFN questionnaire sent on the 7 June 2006.

  • 30 May 06 - J.Shiers and N.Brook will add a note in the document to remind sites of the need to announce the draining of jobs in advance.

Not done. J.Shiers waits for a suitable sentence from LHCb to add to the document.

 

  • 30 May 06 - ALICE, CMS and LHCb will send to J.Shiers the list of the Tier-2 sites to monitor in SC4.

Ongoing. Presented to the GDB and a few issues of some Tier-1/Tier-2 associations are still under discussion.

 

  • 31 May 06 - J.Shiers to define a mechanism to coordinate maintenance interruptions of the Tier-1s.

Done. And now there is also the verification that the procedures are followed by the sites.

 

  • 31 May 06 - M.Litmaath presents the results of the SRM workshop held at FNAL.

Done at June's GDB.

  • 31 May 06 - C.Grandi presents to the MB the EGEE middleware priorities and development of the features needed by the LHC (in Flavia's list).

This will be done in June.

  • 31 May 06 - J.Shiers should clarify with the LCG Operations how, and to whom, the announcements of interventions should be distributed.

Done. See above.

  • 31 May 06 - K.Bos should start a discussion forum to share experience and tools for monitoring rates and capacities and to provide information as needed by the VOs.

Not done.

  • 31 May 06 - J.Shiers proposes a plan for demonstrating the capability to recover from short and long interventions at the Tier-0 and Tier-1 sites.

Not done yet.

 

3.      Job Priority (transparencies) - D.Liko

-          TCG Job Priorities working group page (more information)

 

 

D.Liko reported on the work of the TCG Job Priority working group. Slide 2 shows the membership of the WG, drawn from JRA1 Middleware, the LHC Experiments and the Grid Sites.

 

A document reporting the work of the WG is being completed and should be distributed within a week.

 

Action:

15 Jun 2006 - D.Liko to distribute the Job Priority WG report to the MB.

 

In the WG the experiments (in particular ATLAS and CMS) expressed their requirements, and it was decided that a simple schema would be preferable at the beginning. Having a few job priority groups managed manually would be sufficient for now. More complicated scenarios will be defined and implemented in the future but are not urgent.

 

ATLAS, for example, needs to have four shares for:

-          Production jobs

-          Long jobs

-          Short jobs

-          Special jobs defined locally

The intention is to have the minimal setup that allows production and analysis work to be done in parallel.

3.1         Proposals

The proposal of the WG (slide 6), supported by the experiments, is to have:

-          VOMS groups mapped to shares in the batch system

-          Two queues (long/short) for ATLAS and CMS, so that first tests can be performed

-          Publish the information on the shares in the VOView


-          Patch the WMS to be aware of this information so that the WMS can be used to send jobs

-          The dynamic settings will be done first by email, and later by GPBOX or some other mechanism.

 

The first three points do not require any development but are only configuration issues.
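As a concrete sketch of the mapping idea, a simple lookup table suffices for the "few manually managed groups" schema. This is hypothetical only: the FQANs, share names and percentages below are invented for illustration and are not part of the WG proposal.

```python
# Hypothetical sketch of mapping VOMS FQANs to batch-system shares.
# The group names and percentages are invented for illustration; the
# real mapping would be agreed between each experiment and the site.
VOMS_SHARE_MAP = {
    "/atlas/Role=production": ("atlas_prod", 50),    # production jobs
    "/atlas/analysis/long":   ("atlas_long", 25),    # long jobs
    "/atlas/analysis/short":  ("atlas_short", 20),   # short jobs
    "/atlas/special":         ("atlas_special", 5),  # locally defined special jobs
}

def share_for(fqan):
    """Return (share_name, percentage) for a job's primary FQAN,
    falling back to a catch-all share when no mapping exists."""
    return VOMS_SHARE_MAP.get(fqan, ("atlas_other", 0))

print(share_for("/atlas/Role=production"))  # -> ('atlas_prod', 50)
```

The four entries correspond to the four ATLAS shares listed above; a site would implement each share in its own batch system (LSF, Condor, etc.) and publish it via the VOView.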

The patch to the WMS has reportedly already been implemented and only needs to be tested.

 

Changes that imply re-configuration of the sites should be tested on the corresponding batch systems and on the PPS services.

 

An alternative proposal, developed outside the WG (in the EGEE/LCG task forces, etc.), is to use GPBOX right away to map VOMS attributes to service classes. This set-up is being implemented at CNAF and the proposal is to use:

-          Three service classes: Gold, Silver, Bronze (called “olympic” model)

-          With, in the beginning, static mapping between the predefined shares

3.2         Considerations about the proposals

The strategy for job priorities is needed urgently. The WG feels that GPBOX has not been sufficiently discussed, and the experiments have not yet clearly understood whether and how to use the “olympic” model.

 

If the first proposal is adopted, less work needs to be done urgently. The experiments will eventually want a more complicated scenario than the initial one, but they first need to test GPBOX and define the scenarios in detail, in order to check whether the olympic model is suitable.

3.3         Conclusions

The experiments agreed, in the WG, that the simple solution is sufficient for them for the moment, and that sites should proceed to implement and test it. Therefore the points on which the WG agrees are:

-          Start to configure sites along the lines of the proposed procedure and batch configuration.
If some development needs to be done on some back-end batch systems (Condor, LSF, etc), it should be implemented urgently.

-          Following the mandate of the TCG, move the setup to the pre-production service

-          The WMS patch for VOMS needs to be released and deployed in order to test it adequately

-          The patch to the VOView is implemented and needs to be tested and integrated

 

The strategy for testing and evaluating GPBOX still needs to be defined. This will be discussed at the JRA1 middleware presentation in a forthcoming MB meeting.

 

Several sites are interested in starting user analysis soon, and therefore the implementation of the initial solution (back-end modifications, batch configuration and middleware patches) is the more urgent task. Additional components, for more refined scheduling and authorization policies, can also be installed at lower priority, in order to test them for middleware development and for the fine-grained authorization the experiments will need in the future.


4.      Sites issues with SC4

-          In particular installation and testing of FTS services and channels.

 

During the meeting there was a roundtable to assess the current status of the services on the Tier-1 sites, and to gather experience and issues encountered during the deployment of gLite 3.0.

 

4.1         Status of the sites

In the order of the membership list.

 

TRIUMF

Not present at the MB meeting.

 

IN2P3

FTS installed and configured, still to be published in the BDII information system. All other services are installed (CE, etc).

By Friday all services should be in place.

 

FZK/GridKa

Everything will be installed by the end of the week. There were some problems due to lack of memory on some hosts, and the affected nodes had to be replaced.

 

CNAF

gLite installed on the production system. FTS configured with the help of the CERN team.

 

SARA

The services at SARA are running the gLite 3.0 “classic” services, not the “new” gLite services (gLite CE, etc).

FTS is configured and all channels tested.

 

PIC

PIC has installed and configured gLite 3.0, both the classic and the new gLite services (gLite CE, gLite RB).

The dCache SRM disk-only deployment is not yet in production and there are some problems. The backup solution is to install a Castor 1 SRM disk-only service.

 

ASGC

The FTS should be installed and configured, as well as the other services.

 

RAL

The update to FTS 1.5 was successful; there were some problems testing dCache.

 

BNL

No show stoppers, but the installation was slower than hoped.

Currently testing FTS 1.5 and should be ready by the end of the week.

 

FNAL

Similar to BNL, but support for MySQL in FTS is missing in the latest version.

This requires retrieving the FTS configuration from previous FTS installations.

A GGUS ticket was submitted.

 

Update: The FTS developers are investigating the MySQL scalability issues, but this is not high on their priority list and a solution will not be available in the near future.

4.2         General discussion and comments

The setup of the sites took longer than four weeks, but no major problems seem to have occurred, as very few GGUS tickets were submitted.

 

Sites did not have the same configuration and, in some cases, there were other installations and upgrades during the same period.

 

I.Bird reminded sites that if they have problems they should always submit GGUS tickets, and not report issues only in operations meetings or via informal channels to individuals.

 

A period of four weeks is in general considered sufficient for such an installation, provided there is adequate preparation in advance and the staff performing the installations at the sites have gained the necessary experience.

 

5.      Accounting Report (tables) - L.Robertson

 

 

No feedback received. Suggestions on new graphic presentations are welcome.

 

The slides show:

-          A summary of CERN+Tier-1 sites

-          A summary of the Tier-1 sites

-          The CERN (Tier-0 + CAF) accounting

-          The individual accounting of each Tier-1 site

 

The graphs include both the nominal (MoU pledged) capacity and the installed capacity, each scaled by the standard efficiency factors (CPU: 85%, disk: 70%, tape: 100%).
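The arithmetic behind the efficiency factors can be sketched as follows; the installed numbers in the example are invented for illustration, only the factors come from the report.

```python
# Standard efficiency factors quoted in the accounting report.
EFFICIENCY = {"cpu": 0.85, "disk": 0.70, "tape": 1.00}

def effective_capacity(installed):
    """Scale raw installed capacity by the per-resource efficiency factor."""
    return {res: raw * EFFICIENCY[res] for res, raw in installed.items()}

# Invented example: 1000 kSI2K CPU, 500 TB disk, 800 TB tape installed;
# the effective values are what gets compared with the MoU pledges.
print(effective_capacity({"cpu": 1000, "disk": 500, "tape": 800}))
```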

 

Comments at the MB

 

From the graphs one can see that many sites were not fully utilized by the experiments in the first quarter, but in May usage increased considerably.

 

J.Templon showed the graphs of the SARA-NIKHEF utilization for the month of April:

http://www.nikhef.nl/grid/stats/ndpf-prd/voview-long

 

The CERN disk storage usage shown is only a fraction of what is actually used, because only the usage from Castor 2 is accounted, not the usage on disks still under Castor 1.

 

The accounting reports for Tier-2s will be sent when it becomes clear which Tier-2 sites participate in Service Challenge 4.

 

Decision:

The MB agreed to make this accounting information public.

 

Action:

The sites should send the accounting data for May before 16 June 2006.

 

 

6.      SAME results for May (tables) - P. Nyczyk

 

The tables attached show the availability of CERN and of the Tier-1 sites as monitored by the SAM (renamed from SAME, previously SFT) system.

 

The tests are performed at regular fixed intervals. When a test fails it is not retried more frequently, and the service is considered down until the next successful execution. SAM will repeat failed tests more frequently in the future, but this feature is not yet in operation.
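The rule above can be sketched as a short calculation; this is a minimal illustration with invented timestamps, and real SAM results carry much more detail per test.

```python
def availability(results, window_end):
    """Fraction of time a service counts as 'up': after a failed test it
    is down until the next successful test; a passed test keeps it up
    until the next test (or the end of the window).
    `results` is a time-ordered list of (hour, passed) test outcomes."""
    up = 0.0
    closed = results + [(window_end, True)]  # close the final interval
    for (t, ok), (t_next, _) in zip(results, closed[1:]):
        if ok:
            up += t_next - t
    return up / (window_end - results[0][0])

# Tests at hours 0 (pass), 2 (fail), 4 (pass); window ends at hour 6:
# the service counts as up in [0, 2) and [4, 6), i.e. 4 of 6 hours.
print(availability([(0, True), (2, False), (4, True)], 6))
```

This also shows why retrying failed tests more frequently would matter: a single failed test marks the whole interval until the next scheduled run as downtime.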

 

Service downtime includes both scheduled and unscheduled downtime periods. The SAM monitoring is currently used to check the “availability” of the sites, not their reliability. The MB agreed that in the future scheduled, pre-announced downtime could be presented differently from unscheduled unavailability.

 

Comments at the MB

 

CERN commented that the low values in the period 23-25 May 2006 were due solely to one missing certificate (for NDGF), and that all other services and connections to the other 10 Tier-1 sites were fully operational.

 

BNL have not deployed the CE and this makes the SAM test fail. NDGF should move to the standard deployment, like the other sites.

 

The exact description and results of all these tests are available daily on the SAM site (a certificate is needed).

See: https://lcg-sam.cern.ch:8443/sam/sam.cgi

 

The MB is in principle not against making these values public in the future, but for now access should be limited to the MB members, until the reasons for the downtimes can be commented on and explained by the sites. The usage of these tests and their results should be discussed during the Operations Workshop (19 June 2006). It was also noted that the SAM values are in any case already available via the EGEE reports.

 

Action:

30 Jun 2006 - P.Nyczyk should compile a web page with the description of each SAM test. This has already been done; see the email from L.Robertson on 7 June:

-          The web page http://lxb2063.cern.ch:8080/sqldb/site_avail.xsql allows you to extract the daily availability data by VO, site and time period.

-          The test sets used to calculate availability are: CE, SE, site_BDII. The details of these tests can be found via the main SAM page https://lcg-sam.cern.ch:8443/sam/sam.cgi - select the test set (sBDII, CE or SE) and then click ShowSensorTests.

 

Action:

13 Jun 2006 - I.Bird to add the “discussion on the SAM tests and results” to the Operations Workshop agenda.

 

 

7.      AOB

 

7.1         Internal Review Questionnaires

The MB suggested that the questionnaires filled in by the sites should be available only to the reviewers.

But the decision will be taken by the GDB the following day.

 

 

8.      Summary of New Actions

 

 

Action:

15 Jun 2006 - D.Liko to distribute the Job Priority WG report to the MB.

 

Action:

The sites should send the accounting data for May before 16 June 2006.

 

Action:

13 Jun 2006 - I.Bird to add the “discussion on the SAM tests and results” to the Operations Workshop agenda.

 

 

 

The full Action List, with current and past items, will be available on the wiki page above before the next MB meeting.