LCG Management Board

Date/Time:

Tuesday 4 December 2007 16:00-18:00 – F2F Meeting

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=22189

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 7.12.2007)

Participants:

A.Aimar (notes), I.Bird, K.Bos, F.Carminati, Ph.Charpentier, L.Dell’Agnello, A.Di Girolamo, T.Doyle, M.Ernst, I.Fisk, S.Foffano, J.Gordon, C.Grandi, F.Hernandez, R.Kalmady, M.Kasemann, M.Lamanna, E.Laure, S.Lin, U.Marconi, H.Marten, P.Mato, P.Mendez Lorenzo, G.Merino, L.Robertson (chair), Y.Schutz, J.Shih, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 11 December 2007 16:00-17:00 – Phone Meeting

1.    Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

 

2.    Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

  • 21 October 2007 - Sites should send to H.Renshall their resource acquisition plans for CPU, disk and tape up to April 2008

Not done. Date estimates for the 2008 capacity are still missing from TW-ASGC, NDGF and NL-T1. The other sites said that they will provide the full 2008 capacity by 1 April 2008.

 

DE-KIT: H.Marten clarified that they will fulfill the 2008 pledges in two separate installments, not all in April. The corresponding planning information was sent to H.Renshall and S.Foffano in October 2007.

 

FR-CCIN2P3: F.Hernandez confirmed that IN2P3 will match their pledges as agreed in the MoU.

 

TW-ASGC: S.Lin clarified that they will receive the 2008 capacity by the end of the year and that it will all be installed by April. Only for tape will they not be at the 2008 capacity, but that is easy to recover when needed. They will send the revised version of their accounting data to S.Foffano.

 

NL-T1: J.Templon confirmed that they will send to H.Renshall a new estimate for the delivery of the 2008 capacity.

 

New Action:

NDGF and NL-T1 should send to H.Renshall an estimate about the delivery of their 2008 capacity.

 

J.Gordon asked what will happen if a site cannot deliver all the capacity committed to an Experiment. It seems that, for instance, ATLAS has redistributed the shortfall from CNAF across all the other ATLAS sites, increasing their required capacity. The other sites cannot adapt to such changes; they follow their own pledges.

 

L.Robertson replied that if there is enough capacity for 2008 at a site it should not be a problem. But in general, if a site reduces its capacity the Experiment will receive less capacity, unless some action is taken to resolve the specific issue with the funding organizations. Experiments cannot shift their requirements to other sites when a specific site delivers less than agreed; they have to ask the sites and funding agencies whether (and where) it is possible to increase capacity elsewhere.

 

K.Bos added that for the CCRC 2008 challenges this will not be a problem. But this situation should be clarified in order to know what to do when the capacity at some sites turns out to be insufficient.

  • 30 November 2007 - The Tier-1 sites should send to A.Aimar the name of the person responsible for the operations of the OPN at their site.

Not done. Received from TW-ASGC and IT-INFN (Stefano Zani). RAL's OPN contact is Robin Tasker.

 

K.Bos explained that OPN contacts from all WLCG Tier-1 sites are needed. Some representatives in the OPN group represent the NREN organizations and not the Tier-1 sites.

 

In addition, DANTE has agreed to provide the monitoring of the OPN. A document describing the proposal should be reviewed by the site representatives in the OPN. Only once it is agreed at the OPN level will it be presented to the MB for approval.

 

New Action:

D.Foster should send a mail to clarify his request for a representative for the OPN site operations.

 

Update: D.Foster asked the MB members for the names of whom to contact at the sites for reviewing the document about OPN Monitoring. See his email here (NICE login required).

 

3.    SRM 2.2 Weekly Update (Slides) - J.Shiers

J.Shiers presented an update of the SRM 2.2 Production Deployment.

 

The main points are:

-       The production deployment is basically on schedule. At FNAL – because of the (delayed) “CMS global run” - the deployment was rescheduled for this week.

-       Three new bugs were found; a new release should go to certification and into pre-release for the Applications Area shortly.

-       Information about storage classes is now sufficient for the February CCRC.

-       The space for files recalled does not seem to be well defined (i.e. the knowledge is “lost”). It is important that the same solution is found by all SRM implementations and for all VOs.

-       If more development is required the new versions of the SRM implementations will be released and deployed after February.

 

Here is the slide shown.

 

       SRM v2.2 production deployment is proceeding on schedule without hiccoughs

      Typically 1 to 1.5 days per (dCache) site, including other housekeeping operations

       NDGF, FZK, SARA, IN2P3 (done), FNAL (deferred), others in coming weeks…

      A log is kept of each upgrade and the issues found

       Bugs in the client tools are being tracked by the Engineering Management Taskforce (EMT) with high priority

      3 new bugs found: 2 had been solved by yesterday, the other is well understood

       Recent schedule has disrupted weekly SRM management con-calls: restarted this week

>  Information from experiments on storage classes at sites urgently required!

- Information from CMS expected week before Xmas

- For other experiments now have sufficient for preparation for February

Issue: space for files recalled does not seem to be well defined (knowledge is “lost”)

      Workaround then long-term solution?

      Standard behaviour is essential! (and was agreed at the con-call)

      If development is required, we are talking about a point release after February…

      The consequences of the current situation will be documented

      Sites will take the requests from the experiments and look at how to implement them

      Iterate via phone meeting(s) in December and next CCRC’08 F2F in January

     We need practical experience – we cannot make decisions based on interpretation alone!

 

 

 

4.    Update on CCRC-08 Planning (Slides) - J.Shiers

 

The F2F CCRC Meeting at the pre-GDB managed to:

-       Conclude on scaling factors.

-       Conclude on the scope and scale of February’s challenge. Not all 2008 resources will be available at all sites

-       The minimum exercise at Tier-1 sites is to “loop over all events with calibration DB look-up”.

-       Monitoring, logging and reporting of the (progress of the) challenge was agreed. The discussion at the WLCG Service Reliability workshop was prepared, with a concrete proposal going to the CCRC’08 planning meeting.

 

Other issues still to be completed:

-       Conclude on SRM v2.2 storage setup details

-       Stop-gap and long-term solutions for “storage token” on recall

-       ‘Walk-throughs’ by experiments of ‘blocks’ of activities, emphasising the “Critical Services” involved and the appropriate scaling factors

-       CDR challenge in December – splitting out ‘temporary’ (challenge) and permanent data

-       Are there other tests that can be done prior to February?

-       De-scoping – this needs to be clarified.

 

The results of the F2F meeting were positive:

There was good exposure of plans from the Experiments, in particular by LHCb, with several pictures. They had some questions, which were answered, and the presentations generated quite a few questions and comments from the sites.

 

Ideally, one needs the same level of detail (and presentation) from all Experiments. The fact that the LHCb presentation generated questions suggests (to me at least) that there would also be questions for the other Experiments.

 

The ATLAS Tier-1 Jamboree this week and the CMS week next week will help to clarify things. But there is no time for another F2F meeting this year; there will be one in January or later.

 

There are problems with the sites’ resources for CCRC’08 and LHC data taking in 2008: the current plans seem to deliver too little, too late with respect to the official schedule.

 

The next steps will include monitoring the status of preparations and the progress of the tests on an (at least) daily basis, starting at the latest at the beginning of next year. This is needed in order to know what is going on, at least at the daily (and weekly) operations meetings, via the EIS people.

 

The “Service Coordinator on duty” (aka “Run Coordinator”) that was proposed by the Experiment Spokesmen in 2006 will be needed. Most likely there will be two levels of run coordination, as in the ALEPH model:

-       Management level – i.e. Harry or myself

-       Technical level – knows the full technical detail, or knows who knows…

 

F.Hernandez commented that it is still not clear what amount of data and which rates are needed by all Experiments. Information like that provided by LHCb seems useful, and he will try to show it to the IN2P3 experts (mass storage, networking, etc.) in order to check whether that kind of information is sufficient for setting up the site services (Tier-1 storage, conditions databases, networking, etc.).

 

L.Robertson proposed that a few site representatives (L.Dell’Agnello, F.Hernandez and G.Merino) could agree on a set of questions and topics that need to be clarified by the Experiments. The MB agreed.

 

New Action:

L.Dell’Agnello, F.Hernandez and G.Merino should prepare a questionnaire or checklist for the Experiments in order to collect the Experiments’ requirements in a form suitable for the sites.

 

H.Marten asked whether the clarification of the SRM requirements at the sites is also in the mandate of the GSSD group.

J.Gordon replied that GSSD covers mainly the requirements on SRM functionality and its deployment at the sites, not performance, throughput and rates. He also added that GSSD’s mandate should be revised in January 2008.

 

5.    Storage Reporting (Slides) - L.Robertson

During the Comprehensive Review it was asked how the space available at a site can be known by the Experiments.

 

The proposal is that this issue should probably be formalised and brought together as part of CCRC’08 coordination. A short note should define exactly what the parameters are and how they map to different storage systems.

 

Harry’s table and the monthly accounting should be extended to collect and report on these parameters. The Automated Storage Accounting should evolve to collect and report the same set of values.

 

The Megatable was a good tool for collecting the Experiments’ information but is now not up to date and it will be abandoned. Required and available resources should be specified via Harry’s tables until a replacement is centrally implemented.

 

The Megatable expresses, for each Tier-1 and experiment, the requirements by storage class, including the disk cache “hidden” in T1D0. The MB agreed that if the disk cache size is of the order of 1-2% of the total it can be neglected.

 

6.    WLCG Reliability Workshop (Paper; Slides) - J.Shiers

J.Shiers summarized the Reliability workshop of the previous week. The details are in the attached slides.

 

The focus of the workshop was on the fact that the best way to achieve the requested level of resilience is by building fault tolerance into the services, including experiment-specific ones.

 

The techniques are simple and well tested. A paper summarizes these techniques (Paper); a small client-side sketch of the first one follows the list below:

-       DNS load balancing

-       Oracle “Real Application Clusters”

-       H/A Linux (less recommended, because it is not really H/A…)
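
As a purely illustrative, client-side view of DNS load balancing (the alias name and port below are hypothetical), a client that resolves a load-balanced alias obtains several A records and can simply fall back to the next address if one host is down:

# Minimal sketch of client-side failover behind a DNS load-balanced alias.
# The alias and port are assumptions for illustration only.
import socket

ALIAS, PORT = "myservice.example.org", 8443   # hypothetical load-balanced alias

def connect_with_failover(alias, port, timeout=5):
    # the alias resolves to several A records; try each in turn
    addresses = {info[4][0] for info in socket.getaddrinfo(alias, port, socket.AF_INET)}
    for ip in sorted(addresses):
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError:
            continue                          # this replica is down, try the next one
    raise RuntimeError("no host behind %s is reachable" % alias)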

 

Reliable services take less effort to run than unreliable ones. And some ‘flexibility’ not needed by this community has sometimes led to excessive complexity, i.e. complexity is the enemy of reliability (e.g. WMS)

 

At least one WLCG middleware service (VOMS) does not currently meet the stated service availability requirements.

Need also to work through experiment services using a ‘service dashboard’ as was done for WLCG services (use of a service map?)

 

Experiment     Down       Seriously Degraded     Perturbed
ALICE          2 hours    8 hours                12 hours
ATLAS          As text    As text                As text
CMS            30’        8 hours                24 hours (72)
LHCb           30’        8 hours                24 hours (72)

 

The text description from ATLAS is the clearest, because not all services can be fixed in the same amount of time or have the same real criticality. Depending on its nature, a problem can take days to be solved. But the three categories are clearly defined: down, seriously degraded and perturbed.

 

Here are just some examples of critical services specified by each Experiment:

-       ATLAS: only 2 services are in top category (ATONR, DDM central services)

-       CMS: long list containing also numerous IT services (including phones, Kerberos, …)

-        LHCb: CERN LFC, VO boxes, VOMS proxy service

-       ALICE: CERN VO box, CASTOR + xrootd@T0 (which is surprising)

 

In summary, the main goals for the coming months are:

-       Measure the improvements in service quality by April workshop

-       Monitor the progress possibly using a ‘Service Map’ (Size of box = criticality; colour = status wrt “checklist”)

-       Have by CHEP 2009: all main services at required service level 24x7x52.

-       Database(-dependent) and data / storage management services appear (naturally) very high in the list + experiment services

-       24x7 stand-by should be put in place at CERN for these services, at least for initial running of the LHC.

 

For some services there are not enough experts, and they must be reachable when not at CERN; therefore the call-out system must be in place. The operators can only restart the DB services, but often a DB expert is needed on site to solve the problem. Making services more reliable will also reduce the need for call-outs to the experts.

 

J.Gordon asked for details about the 24x7 CERN milestones.

T.Cass replied that there is a 24x7 presence on site and an on-call team of technicians for the storage and the main services. The current arrangement satisfies the WLCG milestone because the 24x7 team and procedures are established and in operation.

 

L.Robertson asked why for some sites it seems difficult to complete the 24x7 milestones with people and procedures in place.

J.Gordon replied that the sites have to define the response for all possible alarms. In addition, some sites do not yet have all the administrative and employment procedures in place to support and allow 24x7 operations.

 

7.    Tier-2 Reliability Reports - L.Robertson

 

The Tier-2 reliability data is now published every month. The Tier-2 sites are not officially represented at the MB; therefore the Tier-2 reliability data will be sent to the Collaboration Board representatives.

 

Starting next January, the CB members of, for instance, the “6 (or 10) least reliable sites” will be asked for an explanation, just as the Tier-1 representatives comment on the Tier-1 monthly reliability data.

 

G.Merino asked how the reliability of the Tier-2 federations is calculated.

L.Robertson replied that it is a “non-weighted average of the sites in a Tier-2 federation”. Each site counts for one, because the pledges of each site are not available.
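
For illustration (with invented numbers), the federation value is the plain mean of its member sites’ reliabilities, each site counting equally:

# Hypothetical example of the unweighted Tier-2 federation average.
site_reliability = {"SiteA": 0.95, "SiteB": 0.80, "SiteC": 0.65}   # invented values

federation = sum(site_reliability.values()) / len(site_reliability)
print(round(federation, 2))   # 0.8 - each site counts for one, pledges are ignored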

 

8.    Status of the VO-specific SAM tests (VO tests results - new and old)

 

8.1      ALICE (Slides) - P.Mendez Lorenzo

P.Mendez Lorenzo presented a summary of the usage of SAM in ALICE.

8.1.1       SAM and FCR for ALICE

ALICE uses SAM to monitor the VOBOXES only, with specific tests associated with the VOBOX service only.

FCR is not used to blacklist any site because ALICE already has its own mechanism.

 

For the CEs, ALICE adopts the test suite defined by the SAM developers and all sites are set to “YES”, while for the VOBOXES FCR is used to define the list of Critical Tests.

 

The SAM developers have implemented special sensors for the VOBOXES, so that ALICE can decide which test suite to run on the VOBOXES. Implementation and visualization in MonaLisa are possible via a SAM-MonaLisa interface.

The Alarm system by ALICE provides direct notification of problems to the sites.

 

See slide 5 for the details of how the SAM features are wrapped to execute the ALICE test suite and collect the results every 2 hours.

 

The SAM developers implemented a registration procedure which gives ALICE the required flexibility: ALICE creates the list of VOBOXES and publishes it in a web area, and this list can be changed as often as needed.

Every hour a tool reads that file and registers new entries; it also deletes old entries that have not been monitored for one week.
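
As a rough illustration only (the URL, file format and local state file are assumptions, not the actual SAM/ALICE code), one hourly synchronisation step could look like this:

# Sketch of the hourly registration step described above (all names assumed).
import json, time, urllib.request

VOBOX_LIST_URL = "https://example.org/alice/voboxes.txt"   # hypothetical web area
STATE_FILE = "registered_voboxes.json"                     # hypothetical local state
EXPIRY = 7 * 24 * 3600                                      # drop entries unseen for one week

def sync_once():
    wanted = set(urllib.request.urlopen(VOBOX_LIST_URL).read().decode().split())
    try:
        with open(STATE_FILE) as f:
            state = json.load(f)               # {vobox_name: last_seen_timestamp}
    except FileNotFoundError:
        state = {}
    now = time.time()
    for vobox in wanted:
        if vobox not in state:
            print("registering new VOBOX:", vobox)   # the real tool registers it in SAM here
        state[vobox] = now
    # forget VOBOXES that disappeared from the list more than a week ago
    state = {v: t for v, t in state.items() if t > now - EXPIRY}
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)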

 

J.Gordon asked why ALICE has sites that are not registered in GOCDB.

P.Mendez Lorenzo replied that there are sites outside EGEE that are not registered in GOCDB (e.g. in South America). The SAM tools were modified in order to provide this infrastructure to those sites not in GOCDB.

8.1.2       ALICE VOBOXES Tests

All tests can be defined as Critical Tests and this is the only use ALICE makes of FCR. This defines the global status of the site.

 

Tests checking proxy issues

-       Proxy renewal service of the VOBOX

-       User proxy registration

-       VOBOX registration within the MYPROXY server

-       Proxy of the machine

-       Duration of the delegated proxy

 

Tests checking UI and service features

-       Access to the software area

-       JA tests (NEW)

-       Status of the RB under used (NEW)

-       This could not be defined until the failover mechanism was extended to all VOBOXES

-       Status of the CE (NEW)

8.1.3       SAM, MonaLisa and the Alarm System

The results of SAM have been interfaced with MonaLisa. SAM allows queries to the SAM DB to retrieve different information, and the SAM developers created a special query for ALICE to visualize the status of the tests in MonaLisa. The status of all individual tests, and access to the SAM page where the details of the tests are explained, are available via the link here.

 

The SAM developers have created a tool able to send emails and SMS to the contact persons at each site in case a Critical Test fails. Currently ALICE uses the email feature only. Simultaneously a detailed message is also sent from the MonaLisa interface. It needs just a configuration file that contains: name of the VOBOX, contact person’s email, phone number.

The contact persons and/or lists have been provided by each site.
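
Purely as an illustration (the configuration entries, addresses and mail relay are assumptions, not the actual SAM tool), the notification step could be sketched as follows:

# Sketch of the email notification on a failed Critical Test (all names assumed).
import smtplib
from email.message import EmailMessage

# one entry per site: VOBOX name -> (contact e-mail, phone); phone unused, email only
CONTACTS = {"voalice.example.org": ("alice-admin@example.org", "+00 000 0000")}

def notify_failure(vobox, test_name):
    email, _phone = CONTACTS[vobox]
    msg = EmailMessage()
    msg["Subject"] = "[ALICE SAM] Critical test %s failed on %s" % (test_name, vobox)
    msg["From"] = "alice-sam-alarms@example.org"
    msg["To"] = email
    msg.set_content("The critical test '%s' failed on %s. "
                    "See the SAM test page for details." % (test_name, vobox))
    with smtplib.SMTP("localhost") as s:       # assumed local mail relay
        s.send_message(msg)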

8.1.4       Pending Issues and Requirements

A connection to the GOCDB is needed to establish which VOBOXES are in scheduled downtime. The idea is to avoid sending the test suite and emails to those sites that have declared their VOBOXES in downtime. The SAM developers provided a query to the GOCDB; the implementation is ALICE’s responsibility and the code is ready.

 

Improve the flexibility to delete tests and sites. ALICE needs to decide when to include new sites and tests.

 

Improve the alarm system to include ALICE requirements. Reporting via GGUS is a requirement from the sites, but up to now this is not possible: the GGUS DB does not provide read access.

 

More detailed messages in the emails: the message only warns about the error but does not provide additional details (link to the test in the SAM interface, etc.).

 

 

J.Templon commented that accessing the GGUS database, even in read-only mode, gives access to information that is not specific to a project. A better solution would be to track the state transitions of a ticket; by tracking the corresponding emails the project would know exactly the status of each ticket.

P.Mendez Lorenzo replied that receiving an email would require monitoring a mailbox dedicated to this purpose and duplicating some of the GGUS database information.

 

L.Robertson asked how, for instance, the October test results and the failures displayed in GridView are to be considered for ALICE.

P.Mendez Lorenzo replied that only the VOBOXES results are to be considered. The other test results (i.e. on CE, SE and SRM) are set by the SAM team but are not used by ALICE.

 

F.Hernandez noted that GridView does not show the VOBOXES results but all the other test suites (CE, SE and SRM).

P.Mendez Lorenzo replied that this needs to be changed in GridView. Also, the other tests shown in GridView are not checked, because ALICE has its own blacklisting mechanism.

F.Hernandez added that in GridView it is not clear where to find the tests that are actually of interest to the VOs.

 

The MB agreed that it will be important to organize, outside the MB, a review of the tests and of what is displayed in GridView for the VO-specific SAM results.

8.2      ATLAS (Slides) - A.Di Girolamo

A.Di Girolamo then presented the ATLAS-Specific SAM tests.

8.2.1       ATLAS Critical Tests

ATLAS is currently running the standard OPS tests using ATLAS credentials (i.e. the original SAM tests run under the ATLAS VO); a sketch of the SE/SRM put/get/del sequence is given at the end of this subsection:

 

SE & SRM:  

-        put: lcg-cr using cern-prod LFC, files in SAM test directory

-        get: lcg-cp from site to the SAM UI

-        del: lcg-del - clean the catalog and the storage

CE

-       Check CA RPMs version

-       Job Submission on a WN tests

-       VO swdir (sw installation directory)

LFC

-       lfc-ls, lfc-mkdir

FTS

-       glite-transfer-channel-list, Information System configuration and publication

 

The list of sites supporting ATLAS is taken from GOCDB
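
For illustration, the SE/SRM put/get/del sequence listed above could be scripted roughly as follows (the LFN path, local file names and options are assumptions; the actual SAM sensor is more elaborate):

# Rough sketch of the SE & SRM put/get/del test (paths and options assumed).
import subprocess, uuid

LFN = "lfn:/grid/atlas/sam-tests/test-%s" % uuid.uuid4()    # assumed test directory
LOCAL_IN, LOCAL_OUT = "/tmp/sam_put.txt", "/tmp/sam_get.txt"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

with open(LOCAL_IN, "w") as f:
    f.write("SAM SE test payload\n")

# put: copy-and-register the local file, catalogued in the LFC
run(["lcg-cr", "--vo", "atlas", "-l", LFN, "file://" + LOCAL_IN])
# get: copy the file back from the site to the SAM UI
run(["lcg-cp", "--vo", "atlas", LFN, "file://" + LOCAL_OUT])
# del: clean both the catalogue entry and the storage (all replicas)
run(["lcg-del", "-a", "--vo", "atlas", LFN])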

8.2.2       Current Development

Currently they are developing and testing ATLAS-specific SAM tests in order to:

-        monitor the availability of ATLAS critical Site Services

-        verify the correct installation and the proper functioning of the ATLAS software on each site

-       check the SE, SRM and CE endpoint definitions, cross-checking the information in GOCDB and TiersOfATLAS (the ATLAS-specific site configuration file with the Cloud Model); a sketch of such a cross-check is given at the end of this subsection

 

SE & SRM (centrally from SAM UI):  

-        put: lcg-cr with Cloud LFC, with and without using BDII infos

-        get: lcg-cp

CE (job submitted on each ATLAS CE):

-        keep on running large part of OPS suite

-        for ATLAS Tier1 and Tier2:

-       Check the presence of the required version of the ATLAS sw

-       Compile and execute a real analysis job based on a sample dataset

-       Test put/get to local storage via native protocols (dccp, rfcp …)
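
As a sketch of the endpoint cross-check mentioned above (the site names and endpoint values are invented; real code would query GOCDB and parse TiersOfATLAS through their own interfaces):

# Illustrative cross-check of SE endpoints between two sources (invented data).
gocdb_endpoints = {
    "SiteA": {"srm.sitea.example.org"},
    "SiteB": {"srm.siteb.example.org"},
}
tiers_of_atlas_endpoints = {
    "SiteA": {"srm.sitea.example.org"},
    "SiteB": {"srm-old.siteb.example.org"},    # stale entry that should be flagged
}

for site in sorted(set(gocdb_endpoints) | set(tiers_of_atlas_endpoints)):
    g = gocdb_endpoints.get(site, set())
    t = tiers_of_atlas_endpoints.get(site, set())
    if g != t:
        print("%s: mismatch GOCDB=%s TiersOfATLAS=%s" % (site, sorted(g), sorted(t)))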

8.2.3       November 2007 Results

The SAM Critical Tests were not reliable for:

-        France: BDII configuration (the ATLAS endpoint must be specified explicitly).

-       NDGF/BNL: different service setup

 

The SAM Critical Test failures last month:

-        FZK: real SRM failures. Problems under investigation with site responsible

-        SARA: (mainly) due to unscheduled network problems

 

F.Hernandez asked for details on the ATLAS issues at IN2P3.

A.Di Girolamo replied that they are already working with the site responsible. The full endpoint name must be specified at IN2P3 and ATLAS was not doing so. OPS had the correct configuration and therefore the OPS tests succeed.

8.2.4       Next Steps

New ATLAS-specific tests are in preparation (now running in pre-production) and will be more realistic for the Experiment. The alarms are not sent during this test phase but will be turned on soon.

In particular the focus is on improving the completeness of the monitoring information:

-        Information across TiersOfATLAS, GOCDB and BDII.

-        ATLAS Cloud topology view is needed for better monitoring

-        Integration with Ganga Robot and other ATLAS tools

-        Integration with the ATLAS dashboard for easier visualization

 

Ph.Charpentier asked why there are two sites, NIKHEF and SARA, in GridView.

J.Templon replied that the reason is that SAM is not able to handle CEs and SEs at different sites.

 

Ph.Charpentier asked how GridView knows which are the Tier-1 sites for a given VO.

R.Kalmady replied that the list is coded inside GridView and the VO should send this information to the GridView team.

J.Gordon noted that this information is available also in the GOCDB database and therefore should be updated and retrieved from there.

 

H.Marten presented some examples where it was difficult for the site to understand why the OPS tests would succeed and the VO-specific tests would fail on exactly the same node.

The explanation from A.Di Girolamo was that different VOs use different credentials, which could cause different results. It could also be that different VOs have different time-out definitions for the same test, leading to different successes or failures of the same test on the same node.

 

The MB agreed that the VO and the sites should work together to investigate behaviours that look inconsistent.

The VO-specific tests should be reviewed in a more in-depth dedicated meeting.

 

9.    AOB

 

 

No AOB.

 

 

10. Summary of New Actions

 

The full Action List, current and past items, will be in this wiki page before next MB meeting.