LCG Management Board

Date/Time: Tuesday 4 December 2007 16:00-18:00 – F2F Meeting

Agenda:

Members:

(Version 1 - 7.12.2007)
Participants: A.Aimar (notes), I.Bird, K.Bos, F.Carminati, Ph.Charpentier, L.Dell’Agnello, A.Di Girolamo, T.Doyle, M.Ernst, I.Fisk, S.Foffano, J.Gordon, C.Grandi, F.Hernandez, R.Kalmady, M.Kasemann, M.Lamanna, E.Laure, S.Lin, U.Marconi, H.Marten, P.Mato, P.Mendez Lorenzo, G.Merino, L.Robertson (chair), Y.Schutz, J.Shih, J.Shiers, O.Smirnova, R.Tafirout, J.Templon
Action List

Mailing List Archive: https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting: Tuesday 11 December 2007 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
2. Action List Review (List of actions)

Actions that are late are highlighted in RED.
Not done. Date estimates for the 2008 capacity are still missing from TW-ASGC, NDGF and NL-T1. The other sites said that they will fulfill the 2008 capacity by 1 April 2008.

DE-KIT: H.Marten clarified that they will fulfill the 2008 pledges in two separate milestones and not all in April. The corresponding planning information was sent to H.Renshall and S.Foffano in October 2007.

FR-CCIN2P3: F.Hernandez confirmed that IN2P3 will match their pledges as agreed in the MoU.

TW-ASGC: S.Lin clarified that they will receive the 2008 capacity by the end of the year and all of it will be installed by April. Only for tape will they not be at the 2008 capacity, but that is easy to recover when needed. They will send the revised version of their accounting data to S.Foffano.

NL-T1: J.Templon confirmed that they will send H.Renshall a new estimate for the delivery of the 2008 capacity.

New Action: NDGF and NL-T1 should send to H.Renshall an estimate of the delivery of their 2008 capacity.

J.Gordon asked what will happen if a Site cannot deliver all that was committed to an Experiment. It seems that, for instance, ATLAS has redistributed the reduction from CNAF to all the other ATLAS sites, increasing their required capacity. The other Sites cannot adapt to such changes but follow their own pledges.

L.Robertson replied that if there is enough capacity for 2008 at a site it should not be a problem. But in general, if a site reduces its capacity the Experiment will receive less capacity, unless some action is taken to solve the specific issue with the funding organizations. Experiments cannot shift their requirements onto other Sites when a specific site delivers less than agreed; they have to ask the sites and funding agencies to see where (and whether) it is possible to increase capacity elsewhere.

K.Bos added that for the CCRC 2008 challenges this will not be a problem. But this situation should be clarified in order to know what to do when the capacity turns out to be insufficient at some sites.
Not done. Received from TW-ASGC and IT-INFN (Stefano Zani). RAL's OPN contact is Robin Tasker.

K.Bos explained that OPN contacts from all WLCG Tier-1 sites are needed. Some representatives in the OPN group represent the NREN organizations and not the Tier-1 sites. In addition, DANTE has agreed to provide the monitoring of the OPN. A document describing the proposal should be reviewed by the site representatives in the OPN. Only once it is agreed at the OPN level will it be presented to the MB for approval.

New Action: D.Foster should send a mail to clarify his request for a representative for the OPN network site operations.

Update: D.Foster asked the MB members for the names of whom to contact at the sites for reviewing the document about OPN Monitoring. See his email here (NICE login required).
3. SRM 2.2 Weekly Update (Slides) - J.Shiers
J.Shiers presented an update on the SRM 2.2 Production Deployment. The main points are:
- The production deployment is basically on schedule. At FNAL, because of the (delayed) “CMS global run”, the deployment was rescheduled for this week.
- Three new bugs were found and a new release should go to certification and into pre-release for the Applications Area shortly.
- Information about storage classes is now sufficient for the February CCRC.
- The space for recalled files does not seem to be well defined (i.e. the knowledge is “lost”). It is important that the same solution is found by all SRM implementations and for all VOs.
- If more development is required, the new versions of the SRM implementations will be released and deployed after February.

Here is the slide shown.
4. Update on CCRC-08 Planning (Slides) - J.Shiers
The F2F CCRC Meeting at the pre-GDB managed to:
- Conclude on scaling factors.
- Conclude on the scope and scale of February’s challenge. Not all 2008 resources will be available at all sites.
- Agree that the minimum exercise at Tier-1 sites is to “loop over all events with calibration DB look-up”.
- Agree on monitoring, logging and reporting the (progress of the) challenge. The discussion at the WLCG Service Reliability workshop was prepared with a concrete proposal to the CCRC’08 planning meeting.

Other issues are still to be completed:
- Conclude on the SRM v2.2 storage setup details.
- Stop-gap and long-term solutions for the “storage token” on recall.
- ‘Walk-throughs’ by the experiments of ‘blocks’ of activities, emphasising the “Critical Services” involved and the appropriate scaling factors.
- CDR challenge in December – splitting out ‘temporary’ (challenge) and permanent data.
- Are there other tests that can be done prior to February?
- De-scoping – this needs to be clarified.

The results of the F2F meeting were positive: good exposure of plans from the Experiments, in particular by LHCb, with several pictures. They had some questions, which were answered, and the presentations generated quite a few questions and comments from the sites. Ideally, one needs the same level of detail (and presentation) from all Experiments. The fact that the LHCb presentation generated questions suggests (to me at least) that there would also be questions for the other Experiments. The ATLAS Tier-1 Jamboree this week and the CMS week next week will help to clarify things. But there is no time for another F2F this year; there will be one in January or later.

There are problems with the Sites’ resources for CCRC’08 and LHC data taking in 2008: the current plans seem to deliver too little, too late, for the official schedule.

The next steps include monitoring the status of preparations and the progress with tests on an (at least) daily basis, starting at the latest at the beginning of next year. This is needed in order to know what is going on, at least at the (week-)daily operations meeting, via the EIS people.

The “Service Coordinator on duty” – aka “Run Coordinator” – that was proposed by the Experiment Spokesmen in 2006 will be needed. Most likely there will be two levels of run coordination, as per the ALEPH model:
- Management level – i.e. Harry or myself.
- Technical level – knows the full technical detail, or knows who knows…

F.Hernandez commented that it is still not so clear what amount of data and which rates are needed by all Experiments. Information like that from LHCb seems useful and he will try to show it to the IN2P3 experts (in mass storage, networking, etc.) in order to check whether that kind of information is sufficient for the setup of the Sites’ services (Tier-1 storage, conditions databases, networking, etc.).

L.Robertson proposed that a few site representatives (L.Dell’Agnello, F.Hernandez and G.Merino) could agree on a set of questions and topics that need to be clarified by the Experiments. The MB agreed.

New Action: L.Dell’Agnello, F.Hernandez and G.Merino will prepare a questionnaire or a checklist for the Experiments in order to collect the Experiments’ requirements in a form suitable for the Sites.

H.Marten asked whether the clarification of the SRM requirements at the sites is also in the mandate of the GSSD group.

J.Gordon replied that GSSD covers more the requirements on SRM functionality and its deployment at the sites, not performance, throughput and rates. He also added that the GSSD’s mandate should be revised in January 2008.
5. Storage Reporting (Slides) - L.Robertson
During the Comprehensive Review it was asked how the space available at a site can be known by the Experiments.

The proposal is that this issue should probably be formalised and brought together as part of the CCRC’08 coordination. A short note should define exactly what the parameters are and how they map onto the different storage systems. Harry’s table and the monthly accounting should be extended to collect and report on these parameters. The Automated Storage Accounting should evolve to collect and report the same set of values.

The Megatable was a good tool for collecting the Experiments’ information but is now not up to date and it will be abandoned. Required and available resources should be specified via Harry’s tables until a replacement is centrally implemented. The Megatable expresses, for each Tier-1 and experiment, the requirements by storage class, including the disk cache “hidden” in T1D0. The MB agreed that if the disk cache size is of the order of 1-2% of the total it can be neglected.
6. WLCG Reliability Workshop (Paper; Slides) - J.Shiers
J.Shiers summarized the Reliability workshop of the previous week. The details are in the attached slides.

The focus of the workshop was on the fact that the best way to achieve the requested level of resilience is to build fault tolerance into the services, including the experiment-specific ones. The techniques are simple and well tested; a paper summarizes them (Paper):
- DNS load balancing (illustrated in the sketch below)
- Oracle “Real Application Clusters”
- H/A Linux (less recommended, because it is not really H/A…)

Reliable services take less effort to run than unreliable ones. Some ‘flexibility’ not needed by this community has sometimes led to excessive complexity, i.e. complexity is the enemy of reliability (e.g. the WMS). At least one WLCG middleware service (VOMS) does not currently meet the stated service availability requirements. There is also a need to work through the experiment services using a ‘service dashboard’, as was done for the WLCG services (use of a service map?).
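As a small aside on the first technique above: DNS load balancing publishes several A records behind one service alias, so clients spread across the machines in the pool. A minimal sketch of checking such an alias, with a hypothetical host name:

```python
import socket

# Resolve a (hypothetical) load-balanced service alias and list the
# addresses behind it; with DNS load balancing one alias maps to
# several A records, one per machine in the pool.
ALIAS = "myservice.example.org"  # hypothetical alias name

try:
    infos = socket.getaddrinfo(ALIAS, 80, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    print(f"{ALIAS} resolves to {len(addresses)} host(s): {addresses}")
except socket.gaierror as err:
    print(f"Could not resolve {ALIAS}: {err}")
```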
The text description from ATLAS is the clearest, because not all services can be fixed in the same amount of time, nor do they all have the same real criticality. Depending on its nature, a problem can take days to be solved. But clearly the three categories are well defined: down, seriously degraded and perturbed.

Here are just some examples of critical services specified by each Experiment:
- ATLAS: only 2 services are in the top category (ATONR, DDM central services)
- CMS: a long list also containing numerous IT services (including phones, Kerberos, …)
- LHCb: CERN LFC, VO boxes, VOMS proxy service
- ALICE: CERN VO box, CASTOR + xrootd@T0 (which is surprising)

In summary, the main goals for the coming months are:
- Measure the improvements in service quality by the April workshop.
- Monitor the progress, possibly using a ‘Service Map’ (size of box = criticality; colour = status wrt the “checklist”).
- Have, by CHEP 2009, all main services at the required service level 24x7x52.

Database(-dependent) and data/storage management services appear (naturally) very high in the list, together with the experiment services. A 24x7 stand-by should be put in place at CERN for these services, at least for the initial running of the LHC. For some services there are not enough experts and they must be reachable when not at CERN; therefore the call-out system must be in place. The operators can only restart the DB services, but often a DB expert is needed on site to solve the problem. Defining more reliable services will also reduce the need for call-outs to the experts.

J.Gordon asked for details about the 24x7 CERN milestones.

T.Cass replied that there is a 24x7 presence on site and an on-call team of technicians for the storage and the main services. The current arrangement satisfies the WLCG milestone because the 24x7 team and procedures are established and in operation.

L.Robertson asked why for some sites it seems difficult to complete the 24x7 milestones with people and procedures in place.

J.Gordon replied that the sites have to define the response for all possible alarms. In addition, some sites did not yet have all the administrative and employment procedures to support and allow 24x7 operations.
7. Tier-2 Reliability Reports - L.Robertson
The Tier-2 reliability data is now published every month. The Tier-2 sites are not officially represented at the MB; therefore the Tier-2 reliability data will be sent to the Collaboration Board representatives. Starting next January the CB members of, for instance, the “6 (or 10) least reliable sites” will be asked for an explanation, just as the Tier-1 representatives comment on the Tier-1 monthly reliability data.

G.Merino asked how the reliability of the Tier-2 federations is calculated.

L.Robertson replied that it is a “non-weighted average of the sites in a Tier-2 federation”. Each site counts for one, because the pledges of each site are not available.
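For illustration only, a minimal sketch of this unweighted average (the values are made up; this is not the GridView implementation):

```python
def federation_reliability(site_reliabilities):
    """Unweighted mean of the per-site reliabilities (each between 0 and 1).

    Each site counts for one, since per-site pledges are not available
    to weight the average.
    """
    if not site_reliabilities:
        return None
    return sum(site_reliabilities) / len(site_reliabilities)

# Hypothetical federation with three Tier-2 sites:
print(federation_reliability([0.95, 0.80, 0.99]))  # -> 0.9133...
```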
8. Status of the VO-specific SAM tests (VO tests results - new and old)
8.1 ALICE (Slides) - P.Mendez Lorenzo

P.Mendez Lorenzo presented a summary of the usage of SAM in ALICE.

8.1.1 SAM and FCR for ALICE

ALICE uses SAM to monitor the VOBOXES only, with specific tests associated to the VOBOX service only. FCR is not used to black-list any site because ALICE already has its own mechanism. For the CEs, ALICE assumes the test suite defined by the SAM developers and all sites are set to “YES”, while for the VOBOXES FCR is used to define the list of Critical Tests. The SAM developers have implemented special sensors for the VOBOXES, so that ALICE can decide which test suite to run on them. Implementation and visualization in MonaLisa is possible via a SAM-MonaLisa interface. The alarm system built by ALICE provides direct notification of problems to the sites. See slide 5 for the details of how the SAM features are wrapped to execute the ALICE test suite and collect the results every 2 hours.

The SAM developers implemented a registration procedure which ensures the freedom required by ALICE: ALICE creates the list of VOBOXES and puts it in a web area, and this list can be changed as often as needed. Every hour a tool reads that file and registers the new entries; it also deletes old entries that have not been monitored for one week (a minimal sketch of such a loop is given after section 8.1.4 below).

J.Gordon asked why ALICE has sites that are not registered in GOCDB.

P.Mendez Lorenzo replied that there are sites not in EGEE (e.g. in South America) that are not registered in GOCDB. The SAM tools were modified in order to provide this infrastructure also to the sites not in GOCDB.

8.1.2 ALICE VOBOXES Tests

All tests can be defined as Critical Tests and this is the only use ALICE makes of FCR; this defines the global status of the site.

Tests checking proxy issues:
- Proxy renewal service of the VOBOX
- User proxy registration
- VOBOX registration within the MyProxy server
- Proxy of the machine
- Duration of the delegated proxy

Tests checking UI and service features:
- Access to the software area
- JA tests (NEW)
- Status of the RB being used (NEW) – could not be defined until the failover mechanism was extended to all VOBOXES
- Status of the CE (NEW)

8.1.3 SAM, MonaLisa and the Alarm System

The results of SAM have been interfaced with MonaLisa. SAM allows queries to the SAM DB to get different information, and the SAM developers created a special query for ALICE to visualize the status of the tests in MonaLisa. The status of all individual tests, and access to the SAM page where the details of the tests are explained, are available via the link here.

The SAM developers have created a tool able to send emails and SMS to the contact persons at each site in case a Critical Test fails. Currently ALICE uses the email feature only. Simultaneously a detailed message is also sent from the MonaLisa interface. It needs just a configuration file containing: name of the VOBOX, contact person mail, phone. The contact persons and/or lists have been provided by each site.

8.1.4 Pending Issues and Requirements

A connection to the GOCDB is needed to establish which VOBOXES are in scheduled downtime; the idea is to avoid sending the test suite and emails to the sites that declare their VOBOXES in downtime. The SAM developers provided a query to the GOC DB; the implementation is ALICE’s responsibility and the code is ready.

The flexibility to delete tests and sites should be improved: ALICE needs to decide when to include new sites and tests. The alarm system should also be improved to include the ALICE requirements.

Reports via GGUS (a SITES REQUIREMENT): up to now this is not possible, as the GGUS DB does not provide read access.

More detailed messages in the emails: the message only warns about the error but does not provide additional details (a link to the test in the SAM interface, etc.).
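The hourly VOBOX registration procedure described in section 8.1.1 can be pictured with a minimal sketch. The one-week retention rule follows the description above, but the URL and the register/deregister helpers are placeholders, not the actual ALICE or SAM code:

```python
import time
import urllib.request

VOBOX_LIST_URL = "https://example.org/alice/voboxes.txt"  # hypothetical web area
RETENTION = 7 * 24 * 3600  # entries not monitored for one week are removed

last_seen = {}  # VOBOX hostname -> last time it appeared in the list

def register(vobox):    # placeholder for the real SAM registration call
    print(f"registering {vobox}")

def deregister(vobox):  # placeholder for the real SAM removal call
    print(f"removing {vobox}")

def sync_once(now=None):
    """Read the ALICE-maintained VOBOX list, add new entries, drop stale ones."""
    now = now or time.time()
    with urllib.request.urlopen(VOBOX_LIST_URL) as response:
        current = {line.strip() for line in response.read().decode().splitlines()
                   if line.strip()}
    for vobox in current - last_seen.keys():
        register(vobox)
    for vobox in current:
        last_seen[vobox] = now
    for vobox, seen in list(last_seen.items()):
        if now - seen > RETENTION:
            deregister(vobox)
            del last_seen[vobox]

# The description above says this runs every hour, e.g.:
#   while True: sync_once(); time.sleep(3600)
```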
J.Templon commented that accessing the GGUS database, even in read-only mode, gives access to information that is not specific to one project. A better solution would be to track the state transitions of a ticket via emails; the project would then know exactly the status of each ticket by tracking those emails.

P.Mendez Lorenzo replied that receiving an email would require monitoring a mailbox dedicated to this purpose and would duplicate some of the GGUS database information.
L.Robertson asked how, for instance, the October test results and the failures displayed in GridView are to be considered for ALICE.

P.Mendez Lorenzo replied that only the VOBOXES set is to be considered. The other test results (i.e. on CE, SE and SRM) are set by the SAM team but are not used by ALICE.

F.Hernandez noted that GridView does not show the VOBOXES results but all the other test suites (CE, SE and SRM).

P.Mendez Lorenzo replied that this needs to be changed in GridView. Also, those other tests shown in GridView are not checked because ALICE has its own black-listing mechanism.

F.Hernandez added that in GridView it is not clear where to find the tests that are actually interesting for the VOs.

The MB agreed that it will be important to organize, outside the MB, a review of the tests and of what is displayed in GridView for the VO-specific SAM results.

8.2 ATLAS (Slides) - A.Di Girolamo

A.Di Girolamo then presented the ATLAS-specific SAM tests.

8.2.1 ATLAS Critical Tests
The standard OPS tests now running with ATLAS credentials (i.e. the original SAM tests run under the ATLAS VO) are:

SE & SRM:
- put: lcg-cr using the cern-prod LFC, with files in the SAM test directory
- get: lcg-cp from the site to the SAM UI
- del: lcg-del, cleaning the catalogue and the storage

CE:
- Check of the CA RPMs version
- Job submission tests on a WN
- VO swdir (sw installation directory)

LFC:
- lfc-ls, lfc-mkdir

FTS:
- glite-transfer-channel-list, Information System configuration and publication

The list of sites supporting ATLAS is taken from GOCDB. A sketch of the SE & SRM put/get/del cycle is given below.
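As an illustration of the put/get/del cycle listed above, here is a minimal sketch driving the lcg-utils commands from a script; the SE name, LFN path and command options follow common lcg-utils usage but are assumptions for illustration, not the actual SAM test code:

```python
import subprocess
import uuid

VO = "atlas"
SE = "srm.example-site.org"                        # hypothetical storage element
LFN = f"lfn:/grid/atlas/sam-tests/{uuid.uuid4()}"  # hypothetical catalogue entry

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# put: copy a local file to the SE and register it in the LFC
run(["lcg-cr", "--vo", VO, "-d", SE, "-l", LFN, "file:/tmp/sam-test-input"])
# get: copy the file back from the site to the local (SAM UI) machine
run(["lcg-cp", "--vo", VO, LFN, "file:/tmp/sam-test-output"])
# del: remove all replicas and clean up the catalogue entry
run(["lcg-del", "--vo", VO, "-a", LFN])
```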
8.2.2 Current Development

Currently they are developing and testing ATLAS-specific SAM tests in order to:
- monitor the availability of the ATLAS critical Site Services
- verify the correct installation and the proper functioning of the ATLAS software on each site
- check the SE & SRM & CE endpoint definitions, cross-checking the information in GOCDB and TiersOfATLAS (the ATLAS-specific sites configuration file with the Cloud Model)

SE & SRM (centrally from the SAM UI):
- put: lcg-cr with the Cloud LFC, with and without using BDII information
- get: lcg-cp

CE (job submitted to each ATLAS CE):
- keep running a large part of the OPS suite
- for ATLAS Tier-1s and Tier-2s:
  - check the presence of the required version of the ATLAS software
  - compile and execute a real analysis job based on a sample dataset
  - test put/get to local storage via native protocols (dccp, rfcp, …)

8.2.3 November 2007 Results

The SAM Critical Tests were not reliable for:
- France: BDII configuration (the ATLAS endpoint should be explicitly specified).
- NDGF/BNL: different service setup.

The SAM Critical Test failures last month:
- FZK: real SRM failures. Problems under investigation with the site responsible.
- SARA: (mainly) due to unscheduled network problems.

F.Hernandez asked for details on the ATLAS issues at IN2P3.

A.Di Girolamo replied that they are already working with the site responsible. The full endpoint name must be specified at IN2P3 and ATLAS was not doing so. OPS had the correct configuration and therefore the OPS tests are succeeding.

8.2.4 Next Steps

New ATLAS-specific tests, now running in pre-production, are in preparation and will be more realistic for the Experiment. The alarms are not sent during this test phase but will be turned on soon.

In particular, the focus is on improving the completeness of the monitoring information:
- Consistency of the information across TiersOfATLAS, GOCDB and BDII.
- An ATLAS Cloud topology view is needed for better monitoring.
- Integration with the Ganga Robot and other ATLAS tools.
- Integration with the ATLAS dashboard for easier visualization.

Ph.Charpentier asked why NIKHEF and SARA appear as two separate sites in GridView.

J.Templon replied that the reason is that SAM is not able to handle CEs and SEs at different sites.

Ph.Charpentier asked how GridView knows which are the Tier-1 sites for a given VO.

R.Kalmady replied that the list is coded inside GridView and the VO should send this information to the GridView team.

J.Gordon noted that this information is also available in the GOCDB database and should therefore be updated and retrieved from there.

H.Marten presented some examples where it was difficult for the site to understand why the OPS tests would succeed and the VO-specific tests would fail on exactly the same node.

The explanation from A.Di Girolamo was that different VOs use different credentials and this could cause different results. Or it could be that different VOs have different time-out definitions for the same test, producing successes and failures of the same test on the same node.

The MB agreed that the VOs and the sites should work together to investigate behaviours that look inconsistent, and that the VO-specific tests should be reviewed in a more in-depth dedicated meeting.
9. AOB

No AOB.

10. Summary of New Actions

The full Action List, with current and past items, will be in this wiki page before the next MB meeting.