LCG Management Board


Tuesday 23 May 2006 at 16:00




(Version 2 - 29.5.2006)


A.Aimar (notes), D.Barberis, I.Bird, K.Bos, S.Belforte, N.Brook, L.Dell’Agnello, J.Gordon, B.Gibbard, I.Fisk, T.Kleinwort, H.Marten, G.Merino, Di Qing, L.Robertson (chair), J.Shiers, J.Templon

Action List 

Next Meeting:

Tuesday 30 May 2006 at 16:00

1.      Minutes and Matters arising (minutes)


1.1         Minutes of the Previous Meeting

Minutes approved.


1.2         Draining Jobs at the Sites

The discussion of “draining jobs at the sites” continued from the previous meeting.


LHCb would like a footnote reminding sites that they should drain jobs before any intervention longer than 12 hours. Sites that are draining jobs should not accept new batch jobs, in order to prepare for the intervention.


In addition, when possible, sites should announce one week in advance that they will not accept new submissions of long jobs. In this way experiments can queue their jobs at other grid sites.


When an intervention cannot be announced in advance (e.g. urgent security update), experiments should re-submit the jobs that could not be executed.
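The draining rules above can be expressed as a small check. This is only a sketch under stated assumptions: the function and constant names are hypothetical, and the minutes do not say how sites actually implement the policy. Only the 12-hour threshold and the one-week notice come from the discussion.

```python
from datetime import datetime, timedelta

# Values taken from the minutes; all names are illustrative assumptions.
DRAIN_THRESHOLD = timedelta(hours=12)  # interventions longer than this require draining
ADVANCE_NOTICE = timedelta(weeks=1)    # notice period requested "when possible"


def must_drain(intervention_length: timedelta) -> bool:
    """Return True when the intervention is longer than 12 hours."""
    return intervention_length > DRAIN_THRESHOLD


def latest_announcement(intervention_start: datetime) -> datetime:
    """Return the latest time at which one week of notice can still be given."""
    return intervention_start - ADVANCE_NOTICE
```

For example, an intervention starting on 1 June 2006 at 08:00 would have to be announced by 25 May 2006 at 08:00 to give experiments the full week.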


While the policy document should not contain operational details, and the circumstances of different interventions will require different operational procedures, it is important that the issues concerning job queue management are taken into account and discussed at the Operations meeting. Sites and experiments should always participate and maintain a constant dialogue on progress and changes on the LCG grid.



J.Shiers and N.Brook will add a note to the document as a reminder that the draining of jobs at the sites must be announced in advance.

1.3         2006Q1 Reports and Executive Summary (documents)

The Executive Summary and all QR reports are available for comments until Friday. Both documents will then be sent to the Overview Board.


2.      Action List Review (list of actions, more information)



-          09 May 06 - L.Robertson will discuss with E.Laure and C.Grandi the status of the development of the features needed by the LHC (in Flavia’s list).

Done. L.Robertson discussed with C.Grandi the features being developed in JRA1. C.Grandi will present priorities and next developments to the MB.


-          20 May 2006 – All SC4 sites send accounting data using the report form that will be sent to them by Fabienne Baud-Lavigne (for Jan, Feb, Mar, Apr 2006).

All Tier-1 sites replied except:

-          CNAF, which gave assurance that it would send the report before Wed 24 May 2006

-          NDGF and TRIUMF, which received the accounting sheets to fill in a few days later than the other sites.


Note. The LCG Office email is: L.Robertson apologised for the erroneous address in his original mail.


-          23 May 06 – Tier-1 sites should confirm via email to J.Shiers that they have set up and tested their FTS channel configuration for transfers to Tier-1 and Tier-2 sites.

Not done by most sites. Should be sent to J.Shiers in the framework of the SC4 Coordination Meeting.


-          23 May 06 – V.Guelzow will add a question on the “site monitoring system status” to the Internal Review’s questionnaire.

Action removed. The reviewers decided that the questionnaire already covers this issue.


-          21 May 06 – J.Shiers and M.Schulz: Flavia’s list should be updated, maintained and used to control changes and releases. A fixed URL link should be provided to that list.

This will be clarified after the gLite 3.0 presentation at the next MB meeting.

The gLite presentation should cover all points mentioned in Flavia’s list and not be a general presentation.


3.      Conclusion of the Review of Site Monitoring and Operation in SC4 (more information) - J.Shiers

-          Analysis of the Tier-0/Tier-1 throughput tests

-          Metrics


This item continues the discussion on the attached document, as revised following last week’s comments.

3.1         Metrics

The proposed “Metrics” (page 2) are intended to quantify the status of the services and operations provided by a site. In particular, they take into account:

-          Ability to ramp up to the nominal rate

-          Stability of the services

-          Submission of explanatory weekly operations reports

-          Attendance at the weekly meeting

-          Site monitoring and operation log

-          Scheduled and unscheduled interventions following the agreed process.


The values of the metrics are:


            1. Excellent – consistently meets targets;

            2. Good – normally meets targets;

            3. Average – sometimes meets targets;

            4. Poor – rarely meets targets.
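Since the table is to be produced monthly by different volunteers, a shared encoding of the four values may help keep entries consistent. The following is a minimal sketch only: the class name and the use of an integer enumeration are assumptions, not something agreed at the meeting.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Metric values as listed in the minutes; a lower number is a better rating."""
    EXCELLENT = 1  # consistently meets targets
    GOOD = 2       # normally meets targets
    AVERAGE = 3    # sometimes meets targets
    POOR = 4       # rarely meets targets
```

Because the values are ordered integers, ratings can be compared directly (for instance, Rating.EXCELLENT < Rating.POOR holds).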


3.2         Table with metrics

The table on page 3 was discussed during the MB meeting, but the specific sites and values were not.

The table currently refers only to the SC4 disk-disk Throughput Phase and is intended to identify where the services are adequate and which aspects should be improved. These metrics are proposed as a starting point to quantify the status of the different sites: suggestions, other rating criteria and new metrics should be sent to J.Shiers.


The definitions of the metrics and the criteria used to assign the values will be documented by J.Shiers. This table will be produced every month and submitted to the MB, but will not at present be published. Volunteers to take turns at producing the table should contact J.Shiers.

3.3         Debugging and managing the connections

All sites involved must monitor and operate their end of all channels, and should always help in debugging network and other problems, but the general responsibility for managing channels is as follows:

-          Tier-0<->Tier-1 – CERN

-          Tier-1<->Tier-2 – The Tier-1 site

-          Tier-1<->Tier-1 – the receiving Tier-1
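The responsibility rules listed above can be sketched as a lookup function. This is an illustration under stated assumptions: the function name, its signature and the site names used in the example are hypothetical; only the three rules themselves come from the discussion.

```python
def channel_manager(src_site: str, src_tier: int, dst_site: str, dst_tier: int) -> str:
    """Return who has general responsibility for managing a transfer channel."""
    tiers = sorted((src_tier, dst_tier))
    if tiers == [0, 1]:
        # Tier-0<->Tier-1: CERN, in either direction
        return "CERN"
    if tiers == [1, 2]:
        # Tier-1<->Tier-2: the Tier-1 end of the channel
        return src_site if src_tier == 1 else dst_site
    if tiers == [1, 1]:
        # Tier-1<->Tier-1: the receiving Tier-1
        return dst_site
    raise ValueError("no responsibility rule for this combination of tiers")
```

For example, channel_manager("FZK", 1, "IN2P3", 1) returns "IN2P3", the receiving Tier-1.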


If a transfer fails, experiments should submit a ticket to GGUS, and LCG Operations will route it appropriately.


From now on, all issues will be submitted as tickets so that detailed logs exist; they will no longer be constantly monitored or “informally” fixed by CERN and the sites involved. The intention is to give the highest priority to “service stability” and to move towards running the LCG service as a stable operation.


Note: Tier-1 sites should test and confirm to J.Shiers that their FTS set-up is configured for transfers to other Tier-1 sites and to their Tier-2 sites (action already due). Sites should do their own tests but, if needed, they should contact CERN, which can help them using the network load generator.


Pages 5 to 7 contain comments for each site, which were not discussed at the MB meeting.



4.      SC4 Status


4.1         Installations of gLite 3.0 at the sites (more information) - Sites Representatives


Deployment on all sites should be completed, and tested, by 1 June 2006, as agreed.


The status of the gLite 3.0 installations at the sites is (ordered as discussed):

-          TRIUMF: installed

-          IN2P3: starting deployment this week (absent, information from the Operations meeting)

-          FZK/GridKa: started last week. All services installed but not completely tested. Problems with upgrades for FTS 1.5, gLite CE and the gLite WMS. The upgrade to gLite RC5 was done with priority on the Production System and has not yet been done on the Pre-Production System.

-          INFN: installation is being completed.

-          SARA/NIKHEF: information on SARA will be distributed via email to the MB list. NIKHEF is installing the components before the weekend.

-          PIC: upgrade done for all LCG services. The gLite CE and RB will be deployed this week.

-          ASGC: installation and testing ongoing. Information will be distributed via email.

-          RAL: updates done. Not upgrading dCache, the RB and the UI (until the longstanding ticket about the rpm modifying the root configuration is fixed).

-          BNL: started the upgrade but is checking that there are no issues in coupling the new gLite CE with the Condor batch system. The LCG CE should not have any issue because it is unchanged.

-          FNAL: the LCG 2.7 installation is stable and upgrades should be easy. Focus is on testing the gLite CE, but documentation is poor and installation scripts had to be modified. The coupling with the Condor batch system is being investigated. Tickets were sent to GGUS.

-          CERN: installed the gLite WMS and worked on CE submission to the LSF batch system. Installation, via Quattor, on all nodes should be done within a week.

4.2         Names of the Tier-2 sites participating in SC4 - Experiments Representatives


The experiments should specify in detail which Tier-2 sites will participate in SC4:

-          ALICE: not present at the meeting.


30 May 2006 - ALICE will send to J.Shiers the list of the sites to monitor in SC4.

-          ATLAS: positive replies from 5 sites: Frascati, Manno, Milano, Paris (GRIF) and Prague.
Other sites are preparing for SC4, but they will probably not be ready for June.

-          CMS: about 20 sites should be ready to start in June.


30 May 2006 - CMS will send to J.Shiers the list of the sites to monitor in SC4.

-          LHCb: for the DC, it is using only the CERN Tier-1 site. For production, it is using all available resources at the other sites (now about 50 sites). The MoU specifies 14 sites for LHCb, and those in particular should be monitored.


30 May 2006 - LHCb will send to J.Shiers the list of the sites to monitor in SC4.

4.3         Summary of Experiment Plans for SC4 Start-up (more information) - J.Shiers


Postponed to the next MB meeting.


5.      AOB





6.      Summary of New Actions



J.Shiers and N.Brook will add a note to the document as a reminder that the draining of jobs at the sites must be announced in advance.



30 May 2006 - ALICE will send to J.Shiers the list of the sites to monitor in SC4.



30 May 2006 - CMS will send to J.Shiers the list of the sites to monitor in SC4.



30 May 2006 - LHCb will send to J.Shiers the list of the sites to monitor in SC4.



The full Action List, current and past items, will be on this wiki page before the next MB meeting.