LCG Management Board |
|
Date/Time: |
Tuesday 23 May 2006 at 16:00 |
Agenda: |
|
Members: |
|
|
(Version 2 - 29.5.2006) |
Participants: |
A.Aimar (notes), D.Barberis, I.Bird, K.Bos, S.Belforte, N.Brook, L.Dell’Agnello, J.Gordon, B.Gibbard, I.Fisk, T.Kleinwort, H.Marten, G.Merino, Di Qing, L.Robertson (chair), J.Shiers, J.Templon |
Action List |
|
Next Meeting: |
Tuesday 30 May 2006 at 16:00 |
1. Minutes and Matters arising (minutes)
|
|
1.1 Minutes of the Previous Meeting

Minutes approved.

1.2 Draining Jobs at the Sites
The issue of “draining jobs at the sites” was discussed further, following on from the previous meeting.

LHCb would like a footnote added as a reminder that a site should drain its jobs for all interventions longer than 12 hours. Sites that are draining jobs should not accept new batch jobs, in order to prepare for the intervention. In addition, when possible, sites should announce one week in advance that they will not accept new submissions of long jobs. In this way experiments can queue their jobs at other grid sites.

When an intervention cannot be announced in advance (e.g. an urgent security update), experiments should re-submit the jobs that could not be executed.

While the policy document should not contain operational details, and the circumstances of different interventions will require different operational procedures, it is important that the issues concerning job queue management are taken into account and discussed at the Operations meeting. Sites and experiments should always participate and maintain a constant dialogue on progress and changes on the LCG grid.

Action: J.Shiers and N.Brook will add a note to the document as a reminder of the need to announce in advance the draining of jobs at the sites.

1.3 2006Q1 Reports and Executive Summary (documents)

The Executive Summary and all QR reports are available for comments until Friday. Both documents will then be sent to the Overview Board. |
|
2. Action List Review (list of actions, more information)
|
|
- 09 May 06 - L.Robertson will discuss with E.Laure and C.Grandi the status of the development of the features needed by the LHC (in Flavia’s list).
  Done. L.Robertson discussed with C.Grandi the features being developed in JRA1. C.Grandi will present priorities and next developments to the MB.
- 20 May 2006 – All SC4 sites send accounting data using the report form that will be sent to them by Fabienne Baud-Lavigne (for Jan, Feb, Mar, Apr 2006).
  All Tier-1 sites replied except:
  - CNAF, which gave assurance that it would send the report before Wed 24 May 2006
  - NDGF and TRIUMF, which received the accounting sheets to fill in a few days after the other sites.
  Note. The LCG Office email is: lcg.office@cern.ch. L.Robertson apologised for the erroneous address in his original mail.
- 23 May 06 – Tier-1 sites should confirm via email to J.Shiers that they have set up and tested their FTS channel configuration for transfers to Tier-1 and Tier-2 sites.
  Not done by most sites. The confirmation should be sent to J.Shiers in the framework of the SC4 Coordination Meeting.
- 23 May 06 – V.Guelzow will add a question on the “site monitoring system status” to the Internal Review’s questionnaire.
  Action removed. The reviewers decided that the questionnaire already covers this issue.
- 21 May 06 – J.Shiers and M.Schulz: Flavia’s list should be updated, maintained and used to control changes and releases. A fixed URL link should be provided to that list.
  This will be clarified after the gLite 3.0 presentation at the next MB meeting. The gLite presentation should cover all points mentioned in Flavia’s list and not be a general presentation. |
|
3. Conclusion of the Review of Site Monitoring and Operation in SC4 (more information) - J.Shiers
- Analysis of the Tier-0/Tier-1 throughput tests
- Metrics |
|
This item continues the discussion of the attached document, as revised following last week’s comments.

3.1 Metrics

The “Metrics” (page 2) proposed are intended to quantify the status of the services and operations provided by a site. In particular they take into account:
- Ability to ramp up to the nominal rate
- Stability of the services
- Submission of explanatory weekly operations reports
- Attendance at the weekly meeting
- Site monitoring and operation log
- Scheduled and unscheduled interventions following the agreed process.

The values of the metrics are:
1. Excellent – consistently meets targets;
2. Good – normally meets targets;
3. Average – sometimes meets targets;
4. Poor – rarely meets targets.

3.2 Table with metrics

The table on page 3 was discussed during the MB meeting, but not the specific sites and values.

The table currently refers only to the SC4 disk-disk Throughput Phase and is intended to identify where the services are adequate and which aspects should be improved. These metrics are proposed as a starting point to quantify the status of the different sites: suggestions, other rating criteria and new metrics should be sent to J.Shiers.

The definitions of the metrics and the criteria used to assign the values will be documented by J.Shiers. This table will be produced every month and submitted to the MB, but will not at present be published. Volunteers to take turns at producing the table should contact J.Shiers.

3.3 Debugging and managing the connections
Obviously all sites involved must monitor and operate their end of all channels, and should always help in debugging network and other problems, but the general responsibility for managing channels is as follows:
- Tier-0<->Tier-1 – CERN
- Tier-1<->Tier-2 – the Tier-1 site
- Tier-1<->Tier-1 – the receiving Tier-1

If a transfer fails, experiments should submit a ticket to GGUS and LCG Operations will route it appropriately.

From now on all issues will be submitted as tickets in order to have detailed logs, and will no longer be constantly monitored or “informally” fixed by CERN and the sites involved. The intention is to give the highest priority to “service stability” and to move to running the LCG service as a stable operation.

Note: Tier-1 sites should test and confirm to J.Shiers that their FTS set-up is configured for transfers to other Tier-1 sites and to their Tier-2 sites (action already due). Sites should do their own tests but, if needed, they should contact CERN, which can help them using the network load generator.

Pages 5 to 7 contain comments for each site, which were not discussed at the MB meeting. |
|
4. SC4 Status
|
|
4.1 Installations of gLite 3.0 at the sites (more information) - Sites Representatives

Deployment on all sites should be completed and tested by the 1st June 2006, as agreed.

The status of the gLite 3.0 installations at the sites is (ordered as discussed):
- TRIUMF: installed
- IN2P3: starting deployment this week (absent, information from the Operations meeting)
- FZK/GridKa: started last week. All services installed but not completely tested. Problems with upgrades for FTS 1.5, gLite CE and the gLite WMS. The upgrade to gLite RC5 was done with priority on the Production System, and has not yet been done on the Pre-Production System.
- INFN: installation is being completed.
- SARA/NIKHEF: information on SARA will be distributed via email to the MB list. NIKHEF is installing the components before the weekend.
- PIC: upgrade done for all LCG services. The gLite CE and RB will be deployed this week.
- ASGC: installation and testing ongoing. Information will be distributed via email.
- RAL: updates done. Not upgrading dCache, the RB and the UI (until the longstanding ticket about the rpm modifying the root configuration is fixed)
- BNL: started the upgrade but is checking that there are no issues in coupling the new gLite CE with the Condor batch system. The LCG CE should not have any issue because it is unchanged.
- FNAL: the LCG 2.7 installation is stable and upgrades should be easy. Focus is on testing the gLite CE, but documentation is poor and installation scripts had to be modified. The coupling with the Condor batch system is being investigated. Tickets were sent to GGUS.
- CERN: installed the gLite WMS and worked on CE submission to the LSF batch system. Installation, via Quattor, on all nodes should be done within a week.

4.2 Names of the Tier-2 sites participating in SC4 - Experiments Representatives
The experiments should specify in detail which Tier-2 sites will participate in SC4:
- ALICE: not present at the meeting.
  Action: 30 May 2006 - ALICE will send to J.Shiers the list of the sites to monitor in SC4.
- ATLAS: positive replies from 5 sites: Frascati, Manno, Milano, Paris (GRIF) and Prague.
- CMS: about 20 sites should be ready to start in June.
  Action: 30 May 2006 - CMS will send to J.Shiers the list of the sites to monitor in SC4.
- LHCb: for the DC, using only the CERN Tier-1 site; for production, using all available resources at the other sites (now about 50 sites). In the MoU there are 14 sites specified for LHCb and those should be monitored in particular.
  Action: 30 May 2006 - LHCb will send to J.Shiers the list of the sites to monitor in SC4.

4.3 Summary of Experiment Plans for SC4 Start-up (more information) - J.Shiers

Postponed to the next MB meeting. |
|
5. AOB
|
|
No AOB. |
|
6. Summary of New Actions
|
|
Action: J.Shiers and N.Brook will add a note to the document as a reminder of the need to announce in advance the draining of jobs at the sites.
Action: 30 May 2006 - ALICE will send to J.Shiers the list of the sites to monitor in SC4.
Action: 30 May 2006 - CMS will send to J.Shiers the list of the sites to monitor in SC4.
Action: 30 May 2006 - LHCb will send to J.Shiers the list of the sites to monitor in SC4.

The full Action List, current and past items, will be in this wiki page before the next MB meeting. |