LCG Management Board

Date/Time

Tuesday 20 May 2008, 16:00-17:00 – Phone Meeting

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=33695

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 - 23.5.2008)

Participants  

A.Aimar (notes), L.Betev, I.Bird (chair), K.Bos, D.Britton, T.Cass, Ph.Charpentier, D.Collados, L.Dell’Agnello, M.Ernst, I.Fisk, S.Foffano, J.Gordon, M.Kasemann, U.Marconi, P.Mato, P.McBride, G.Merino, A.Pace, R.Pordes, Di Qing, R.Quick, M.Schulz, J.Shiers, O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 27 May 2008 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous MB meeting were approved.

1.2      Tier-1 and Tier-2 Reliability Reports (drafts) (Tier-1_SR_200804; Tier-2_SR_200804)

The Tier-1 and Tier-2 Reports have been updated and will be distributed to the WLCG boards.

 

The Tier-2 report shows that the US and the North European federations are not reporting any data yet (reported as “N/A”).

1.3      QR Reports (March-May 2008) Preparation (QR-Nov07-Feb08)

A.Aimar will ask for contributions to the Quarterly Report covering March-May 2008.

Written contributions will have to be prepared for LCG Services, the GDB and the HLM dashboard.

The Experiments should present a short report to the MB in the next few weeks.

 

ALICE and ATLAS agreed to give their QR presentations in the next two MB meetings.

 

2.   Action List Review (List of actions)

Actions that are late are highlighted in RED.

 

-       31 March 2008 - OSG should prepare Site monitoring tests equivalent to those included in the SAM testing suite.

-       J.Templon and D.Collados will verify this equivalence and report to the MB, as was done for NDGF.

 

Will be discussed today.

The OSG tests are described here: http://rsv.grid.iu.edu/documentation/help/.

The proposed new list of critical tests is available here:

https://twiki.cern.ch/twiki/bin/view/LCG/OSGCriticalProbes#Proposed_Critical_Probes_for_OSG

 

-       31 March 2008 - ALICE, ATLAS and CMS should provide the read and write rates that they expect to reach, in terms of clear values (MB/sec, files/sec, etc.), including all phases of processing and re-processing.

 

Not Done. It was agreed that the Experiments provide the same information in the format used by LHCb.

Here is the example from LHCb: https://twiki.cern.ch/twiki/pub/LCG/GSSDLHCB/Dataflows.pdf

 

I.Bird proposed that the Experiments distribute to the MB list the pages where they keep their expected rates and capacities at each site.

Next week we will close this item and address specific issues in the future.

 

M.Kasemann noted that D.Bonacorsi has already communicated these values to the sites.

G.Merino replied that the information is apparently in some CMS tables, but the sites do not know the expected nominal rates in detail.

J.Gordon added that some values received apparently include rates for catch-up (twice the nominal rate).

 

J.Templon added that for ATLAS it is still unclear what size some “spaces” should have.

 

-       30 Apr 2008 – Sites should send H.Renshall their plans for the 2008 installations, specifying what will be installed for May and when the full 2008 capacity will be in place.

 

H.Renshall will be absent for several weeks in May; Sites should therefore send their plans to S.Foffano in addition to H.Renshall.

 

S.Foffano reported that she had received some information from CNAF, NL-T1 and IN2P3.

I.Bird noted that the values have to be ready by the end of June; H.Renshall will be back next week.

 

-       18 May 2008 - Sites should send the email addresses where they are going to receive the alarms. Sites should confirm that they can achieve this.

 

A new page for the sites’ alarm email addresses has to be prepared. It is not the same as the contact page already in place (https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails): it should be a separate list, used only for alarms and restricted to a few people from each Experiment.

 

New Action:

24 May 2008 - A.Aimar will set up a page for the site operations alerts.

Update: The page for Operations Alarms is here: https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage

 

New Action:

27 May 2008 - Sites and Experiments should fill the Operations Alarm Page: https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage

 

 

-       16 May 2008 - Each Experiment proposes 4 users who can raise alarms at the sites and are allowed to mail to the sites’ alarm mailing lists.

 

Removed. This will be covered by the wiki page to be filled in.

 

-       9 May 2008 - Milestones and targets should be set for the LCAS solution (deployment on all sites) and for the SCAS solution (development, certification and deployment).

 

M.Schulz reported that a few sites had agreed to deploy the LCAS solution. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it.

Other sites that want to install it should confirm their interest.

 

Ph.Charpentier asked that the PPS installation allow access to the existing production data.

M.Schulz agreed.

 

3.   CCRC08 Update (Draft Agenda of June Post-mortem Workshop; Slides) - J.Shiers

 

J.Shiers presented the weekly summary of the status and progress of the recent CCRC08 activities.

3.1      Post Mortem Workshop

The CCRC08 Post Mortem Workshop will take place on 12-13 June 2008.

Only 10 people have registered so far, but more ought to participate. The agenda will be finalized by the end of May.

3.2      SRM MoU Addendum

The Addendum is in the process of being signed by the implementers, the sites and then the experiments. Some feedback and iteration will be inevitable.

 

The last conf-call focused on short-term issues. The implementers have signed off, with small caveats on timetable and priorities.

Not all implementations (e.g. DPM) were represented at the conf-call, and there are some concerns related to:

-       performance implications in adding new features;

-       general behaviour – e.g. handling of finite lifetimes (less relevant for WLCG)

 

It is clear that for 2009, only “short-term” solutions will be possible. These will not necessarily be consistent across implementations, nor necessarily delivered to a common timetable (maybe hidden to some degree by client tools).

 

The target for delivering the timetable (both short and long term) is 2 June 2008. The longer-term “SRM v2.2” conformance is TBD, for 2010 data taking.

The recommendation is to focus for now on getting short-term solutions in place and revisit longer-term proposals later.

 

Ph.Charpentier agreed that the client tools should hide the differences among implementations, and proposed that the GSSD group follow up on checking the interfaces of gfal and lcg-utils. For instance, there is currently no interface for retrieving space metadata, and the clients should be extended to provide one.

The WLCG had agreed to access the SRM only via the gfal and lcg-utils interfaces.

 

J.Shiers agreed with that.
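
The following is a purely hypothetical sketch (in Python) of the kind of client extension being proposed: get_space_metadata is an invented name, not an existing gfal or lcg-utils call, and it would conceptually wrap the SRM v2.2 srmGetSpaceMetaData request.

```python
from dataclasses import dataclass

@dataclass
class SpaceMetadata:
    token: str         # SRM space token being queried
    total_bytes: int   # total size of the space reservation
    free_bytes: int    # unused part of the reservation
    retention: str     # e.g. "REPLICA" or "CUSTODIAL"
    latency: str       # e.g. "ONLINE" or "NEARLINE"

def get_space_metadata(endpoint: str, token: str) -> SpaceMetadata:
    """Invented entry point: an extended client library would issue
    srmGetSpaceMetaData against the given SRM endpoint and return the
    parsed result to the experiment frameworks."""
    raise NotImplementedError("illustration only - no such call exists yet")
```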

3.3      General Observation

The number of problems reported (at the daily meetings) by the experiments appears to be considerably lower than in February.

 

Attendance at the meetings is 10-15 people (2/3 local to CERN).

Minutes for last week’s meetings are here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08DailyMeetingsWeek080512

 

Resolution of the problems reported seems rather fast, even if the level of activity is clearly much higher. But is it high enough to ‘prove’ that we are ready for real data taking?

3.4      Core Services Report

-       (Tue) “Extreme locking issue with SRM_ATLAS plus deadlocks in SRM_SHARED, plus huge number of sessions” (no outage – the impact is actually unclear!). This and other CASTOR SRM issues seen since the start of the May run are being investigated – post-mortem asap. See: https://prod-grid-logger.cern.ch/elog/Data+Operations/2

-       (Tue) lack of space in SRM_DATA for LHCb (03:27-12:22 outage!). Must be investigated.

 

-       (Thu) CASTOR ALICE upgraded to 2.1.7

-       (Thu) SRM ATLAS suffered from instabilities during Thursday. The problems, which had actually started late on Wednesday, were fixed by restarting srmServer processes on two servers at 5:00 p.m. No operator alarms were raised - work is ongoing to improve the infrastructure. No ATLAS report, although lots of transfers were failing. (Severe service degradation for almost 1 day!)

-       (Thu) Garbage collection on the “t1transfer” pool (for T0 export) had a problem yesterday: it was deleting files before they were exported. The problem was fixed during the night and the following morning.

 

-       (Fri) SRM problems - report from Shaun: “close to solution on 1 of deadlock problems. Then will focus on 'stuck connections'.”

 

-       There is also “the long and tragic tale of multiple requests for srm…”

-       How do these issues get fed (cleanly) into the operations meetings? This can be done if someone participates in the CASTOR daily meeting.

 

Upcoming Upgrades:

-       (Transparent) upgrades to C2ATLAS & C2CMS scheduled for this week (Wed & Thu respectively).
The intervention will start at 09:00 CEST, and should be totally transparent.

3.5      DB Services

Forgotten from the previous week:

Two parallel Streams setups, for the LHCb conditions and the LFC, have been configured between the LHCb downstream capture database and the new RAC database installed for LHCb at PIC.

 

On Sunday 11 May there was a problem with the Streams setup for ATLAS between the online and offline databases. One of the replicated schemas had several tables without a primary key defined, and one of these tables had several rows duplicated. The apply process at the destination was not able to identify the unique row to which the changes should be applied, and aborted. ATLAS cleaned up the duplicated rows and replication was restarted.
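
As a purely illustrative sketch (Python, not Oracle Streams itself) of the failure mode: without a primary key, a row-level apply can only locate the target row by matching all of its old column values, and duplicated rows make that match ambiguous.

```python
def apply_update(table, old_row, new_row):
    """Mimic a row-level apply on a table without a primary key: the
    target row can only be found by matching every column of its old
    image, which fails as soon as the table contains duplicates."""
    matches = [i for i, row in enumerate(table) if row == old_row]
    if len(matches) != 1:
        # The analogue of the aborted apply process: no unique target row.
        raise RuntimeError("cannot identify a unique row: %d candidates"
                           % len(matches))
    table[matches[0]] = new_row

t = [{"run": 1, "val": 7}, {"run": 1, "val": 7}]             # duplicated rows
apply_update(t, {"run": 1, "val": 7}, {"run": 1, "val": 8})  # -> RuntimeError
```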

 

A cleanup of the Streams monitoring repository was performed, which decreased its size from 86 GB to less than 3 GB.

 

Two hours of downtime were needed last Friday. Automatic aggregation tools were put in place.

There is still no significant increase of load during CCRC'08.

3.6      Monitoring

The FTM needs to be installed at all Tier1 sites. See https://twiki.cern.ch/twiki/bin/view/LCG/FtsFTM

 

The CCRC’08 “Baseline Versions” page has been updated to reflect this request: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

The FTM (a subcomponent of FTS) was explicitly added to the list on 15 May 2008.

 

Dashboards:

-       ATLAS - new version for ATLAS following shifters’ requests;

-       CMS - site status board has new info about available space on storage.

 

Admin interface for the GridMap for critical services - improvements were made to the LHCb view.

 

We will start to analyze the problems reported by the experiments to understand whether they should have been seen by the existing monitoring tools. The same will be done for questions raised by sites, e.g. “we see no jobs”.

3.7      Sites Reports

Here are the main issues reported (as on the slide):

-       NIKHEF (Thu) – all worker nodes were turned off on Wednesday due to a cooling failure.

-       BNL (update from Tuesday evening): low data rates.

-       A very low incoming rate was observed following the start of the ATLAS Tier-1 to Tier-1 transfers (<30 MB/s, while outgoing was 400-500 MB/s). The FTS channels pointing from the ATLAS Tier-1 centres to BNL were found to be clogged with transfers related to Raw Data Object replication in preparation for FDR-2 in June (RDOs are produced in all ATLAS clouds and are needed for bytestream/mixing production at BNL).

-       The FTS channel priority had to be adjusted from 3 to 5 at CERN and from 5 to 3 at BNL respectively, to raise the incoming rate of the CCRC stream into BNL to a decent level. In summary, CCRC transfers are competing with ATLAS mc08 dataset replication: changing the FTS priority to 5 for the CCRC-related transfers gave them priority over the MC replication (see the sketch after this list).

-       Other ATLAS Tier-1 centres do not observe this effect because their MC-related (incoming) transfer rate is small compared to what is observed at BNL. Following the aforementioned adjustments, the CCRC-related rate went up to 180-200 MB/s around 5 p.m. (CDT) and was maintained at this level for more than 6 hours (continuing).

-       RAL (Fri) – CASTOR CMS outage of ~2 hours the previous day due to a misconfiguration – reverted.

-       GRIF (Fri) – 200MB/s (250MB/s peaks) from GRIF/LAL to CCIN2P3 using new 5Gb/s link
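
As background to the BNL priority change above, here is a minimal sketch (Python; illustrative only, not the FTS scheduler, and treating the 1-5 channel priority as a proportional share is an assumption) of how raising one class of transfers from priority 3 to 5 shifts the available slots in its favour:

```python
def allocate_slots(priorities, slots):
    """Share the available transfer slots among competing transfer
    classes in proportion to their channel priority."""
    total = sum(priorities.values())
    return {name: slots * p // total for name, p in priorities.items()}

# Before: CCRC and MC replication both at priority 3 -> equal shares.
print(allocate_slots({"CCRC": 3, "MC-replication": 3}, 60))  # 30 each
# After the adjustment: CCRC raised to 5 -> it wins most of the slots.
print(allocate_slots({"CCRC": 5, "MC-replication": 3}, 60))  # 37 vs 22
```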

3.8      Experiments Reports

As on the slides.

ALICE: Still working on preparing sites for the 3rd commissioning exercise. By the end of last week, only 2 sites had not upgraded (Birmingham (expected) and Kosice). The xrootd interface for CASTOR2 at RAL is not expected to be ready.

ATLAS: Report uploaded by Simone on the results of the T1-T1 exercise. A quality plot is included – generally rather successful! See also https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/95

CMS: Highlighted activities at the moment: analysis of the T1-T1 tests from last week; extension to non-regional T1-T2 transfer tests; production transfers with latency measurements to prepare for T1 workflows; T1 workflows consisting of (iCSA) re-processing and (CCRC) skimming at T1 sites, also exploiting non-custodial areas; final developments on the monitoring side to accommodate feedback from the CCRC running. See https://twiki.cern.ch/twiki/bin/view/CMS/CCRC08-Phase2-OpsElog

LHCb: “We can summarize that CCRC is going fairly smoothly everywhere apart from a few minor issues (still under investigation and logged with the relative GGUS tickets in this elog).” See also elog entries, e.g. https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/92

3.9      Open Questions

What (if anything) are we missing?

-       Is this really all that we expect in 2008? e.g. why no observable increase in DB load?

-       If we haven’t tested it now, we will find out too late if something is not ready.

 

How can we (sustainably) provide the necessary high-level and detailed reports?

-       Experiment reports are (IMHO) detailed and comprehensive

-       Some improvements could be made in site reports – elog use starting

-       The “service (~run) coordinator” idea is still around.

 

(When) should we expect all “flows” / functional blocks to be exercised continuously?

-       My guess is that things will be limited by manpower, more than compute-power, for some time to come

 

Should we retain “WLCG” section in current joint-operations-meeting?

-       Or run “daily” WLCG operations meetings Mon-Fri@15:00, plus the EGEE/OSG/NDGF operations meeting (release update etc.) at 16:00?

3.10    Summary

The May run of CCRC’08 is a big step forward and an improvement over February.

The hope is to see “full nominal rates” from all(?) experiments “real soon now”.

 

The service load appears to be sustainable – but is what we see representative of what will happen when the LHC is running?

 

The indications are:

-       We will probably be OK for 2008 data taking and (re-) processing;

-       We can expect to find some new problems, but we will certainly find solutions – or at least a work-around.

 

4.   LHCC Review Meeting (1st July 2008) (Agenda) - I.Bird

 

I.Bird showed the agenda he proposes for the LHCC Review day.

 

The reviewers asked for:

-       Status of the Tier-0/1 sites

-       CCRC status and outcome

-       Middleware and SRM

-       Applications Area

 

They especially requested more data and numbers on the Sites and their usage by the Experiments.

 

 

J.Gordon asked whether attendance by all Sites is needed, and volunteered to provide the Tier-1 summary.

I.Bird confirmed that all should at least phone in.

 

New Action:

31 May 2008 - MB Members should send feedback to I.Bird on the LHCC Review Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=33695

 

5.   OSG RSV Tests for T2 Reliability (Slides) - R.Quick, D.Collados

 

D.Collados presented the status of the OSG tests and their equivalence to the standard SAM test suite.

5.1      OSG Equivalence

There is a URL in place from which resources are pulled by SAM: http://oim.grid.iu.edu/publisher/get_osg_interop_monitoring_list.php

 

Only the OSG CEs are published for now. Soon SRMs will be included.

 

The open issue is:

-       The Indiana collector is not able to publish a single name for several publishing sources (impact on SAM & GridView). Example:

            cmsosgce.fnal.gov   -> uscms-fnal-wc1-ce

            cmsosgce2.fnal.gov -> uscms-fnal-wc1-ce2

            cmsosgce4.fnal.gov -> uscms-fnal-wc1-ce4 

 

Some OSG sites were publishing in GOCDB separately (example: USCMS-FNAL-WC1); now they are under the OSG tree in the database.

 

The SAM script was ignoring resources of new sites in the above URL (now fixed).

 

I.Bird noted that these CE nodes should not be published as separate sites, each reporting its results individually. They can report separately, but not as belonging to different sites; otherwise the metrics will not aggregate the values into a single entity.

 

J.Templon added that in the Information System they should be registered under the same site name in the GLUE Schema. It is a change in the DB, not in the SAM software, that should solve the issue.
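
To illustrate the aggregation point, here is a minimal sketch (Python; illustrative only, not the SAM/GridView code) of grouping several CE hosts under one site name before computing a metric. The host-to-site mapping is the FNAL example above; the OR semantics across a site's CEs is an assumption.

```python
# Host-to-site mapping taken from the FNAL example above.
HOST_TO_SITE = {
    "cmsosgce.fnal.gov":  "USCMS-FNAL-WC1",
    "cmsosgce2.fnal.gov": "USCMS-FNAL-WC1",
    "cmsosgce4.fnal.gov": "USCMS-FNAL-WC1",
}

def aggregate_by_site(host_results):
    """Group per-host test results under one site name, then call the
    site OK if at least one of its CEs passes (assumed OR semantics)."""
    grouped = {}
    for host, ok in host_results.items():
        site = HOST_TO_SITE.get(host, host)  # unmapped hosts stand alone
        grouped[site] = grouped.get(site, False) or ok
    return grouped

results = {"cmsosgce.fnal.gov": False,
           "cmsosgce2.fnal.gov": True,
           "cmsosgce4.fnal.gov": True}
print(aggregate_by_site(results))  # {'USCMS-FNAL-WC1': True} - one entity
```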

5.2      Critical Tests for OSG

The critical tests for OSG are described at: https://twiki.cern.ch/twiki/bin/view/LCG/OSGCriticalProbes

 

Existing list of critical tests (for OSG CEs only):

-       certificates crl expiry

-       general osg directories CE permissions

-       general osg version

-       globus gridftp simple

 

Requested approval of new list of critical tests (for OSG CEs):

-       certificates cacert expiry (check if CA certs are still valid)

-       general ping host (check if CE responds to pings)

-       globus gram authentication (authenticate to remote CE using Globus)

-       batch jobmanagers available (get job managers running on remote site)

 

I.Bird asked whether the tests are equivalent to the ones now in SAM.

D.Collados replied positively.

 

The discussion focused on whether to remove the ‘globus gridftp simple’ test or to modify it, because not all OSG CEs run gridftp.

SAM does not accept critical tests defined per Service, VO or Site, and therefore cannot be selective.

Should the test instead be made to succeed on CEs that are not supposed to run gridftp?

 

R.Quick asked that the test be marked as non-critical.

J.Templon instead replied that the test should remain critical, but should always report “success” on the sites where it does not apply, because it is an important data access test on the sites where it does apply (see the sketch below).
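
A minimal sketch of the workaround J.Templon describes (Python; illustrative only, not an actual RSV/SAM probe). How a CE advertises whether it runs gridftp is assumed here to be a simple flag, and the use of globus-url-copy for the actual check is also an assumption.

```python
import subprocess

def gridftp_probe(ce_host, runs_gridftp):
    """Critical gridftp test that always reports success on CEs where
    gridftp is not deployed, so the test can stay critical everywhere."""
    if not runs_gridftp:
        # Not applicable here: report OK instead of failing, so that the
        # site is not penalised in the reliability metric.
        return "OK: gridftp not deployed on this CE (test not applicable)"
    # Try a simple listing against the CE's gridftp endpoint.
    result = subprocess.run(
        ["globus-url-copy", "-list", "gsiftp://%s/" % ce_host],
        capture_output=True)
    return "OK" if result.returncode == 0 else "CRITICAL: listing failed"
```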

 

The SRM tests should be in the next OSG release, 1.0.0 (2 June). A full set of probes equivalent to the EGEE CE & SRM tests is still missing.

One could check whether the EGEE SRM tests could be used instead.

5.3      Reporting OSG Downtime

The first OSG SAM submission tests are successful.

But the following is still missing:

-       Script for automatic downtimes submission (OSG)

-       DB trigger in GridView database to store the downtimes (GridView).

 

The time scale is to have this in Production in 2 weeks.

 

The OSG Sites appear in grey (N/A) in GridView. To solve this issue we should (see the sketch after this list):

-       Remove all OSG resources for OPS not coming from the interoperability resources URL.

-       Modify availability calculation algorithm to group flavors of CEs (CE, OSG-CE), SRMs (SRMv1, SRMv2, OSG-SRM) and Site Info systems (SBDII, CE-MON).

-       Count them as one single site service availability metric.
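
A minimal sketch of that grouping (Python; illustrative only, not the GridView algorithm; the OR-across-instances and AND-across-groups semantics is an assumption based on the description above):

```python
# Map equivalent service flavours onto one logical service group.
FLAVOUR_GROUP = {
    "CE": "CE", "OSG-CE": "CE",
    "SRMv1": "SRM", "SRMv2": "SRM", "OSG-SRM": "SRM",
    "SBDII": "SiteInfo", "CE-MON": "SiteInfo",
}

def site_availability(instances):
    """instances: list of (flavour, up) pairs for one site. A group is
    up if any instance in it is up (OR); the site counts as available
    only if every group present is up (AND)."""
    groups = {}
    for flavour, up in instances:
        g = FLAVOUR_GROUP.get(flavour, flavour)
        groups[g] = groups.get(g, False) or up
    return all(groups.values())

# Example: the OSG-CE instance is down but an equivalent CE is up, so
# the CE group - and here the whole site - still counts as available.
print(site_availability([("CE", True), ("OSG-CE", False),
                         ("OSG-SRM", True), ("CE-MON", True)]))  # True
```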

 

They can implement the above solutions and generate the May reports in early June.

5.4      Final Comments

SAM and GridView were designed for the EGEE infrastructure, and the work now is to support other Grid infrastructures as well.

This does not scale with the present implementation and is the main reason for most of the problems.

The SAM team is fixing and redesigning the SAM DB:

-       to comply with Grid Monitoring Topology Database

-       to provide more flexibility (like critical tests per site, node X belonging to several sites, etc)

 

Many software changes in the core of the SAM and GridView portals, and in the related data consumers (APIs, GridMaps, Dashboards, etc.), will be needed.

 

The SAM team needs to define priorities for the next months; the team went from 6 people (Jan 2008) to 4 at present (hiring is ongoing).

The time spent supporting EGEE sites, the LHC experiments and WLCG MB requirements (reports, OSG, metrics, etc.) does not leave enough time to cover all priorities.

 

I.Bird concluded that the SRM tests are important and should be added. But, even if the tests are neither complete nor equivalent, we should generate the reports for May and have all the data properly going from RSV to the GridView Reports.

-       In two weeks the reports should be available with what currently exists;

-       the rest will follow later in June.

 

R.Pordes asked whether the publication can proceed with a note that the tests still have to be completed.

5.5      RSV Progress (by R.Quick)

R.Quick then presented his slides on the progress and plans of the RSV implementation.

 

The table below summarizes what is completed and what is still ongoing (shown in green and yellow respectively on the original slide).

 

 

CE Availability:

-       Probe development: completed Nov 2007.

-       Probes deployed to sites: 50% of sites reporting; working to get a completion commitment from US-ATLAS and US-CMS for the remaining sites.

-       Publish data to WLCG: completed Jan 2008.

CE Reliability (planned downtime reporting):

-       Probe development: 16/5/2008 - this is not a probe, but development of the mechanism is complete.

-       Probes deployed to sites: it is unnecessary to deploy this to sites; completion expected this week (23/5/2008).

-       Publish data to WLCG: 01-Jun-08 (depends on the transport mechanism from the SAM developers).

SE Availability:

-       Probe development: completed 07-Apr-08 (SRM).

-       Probes deployed to sites: deployment starts end of May 2008 with OSG 1.0; need to work with US-CMS and US-ATLAS for a deployment commitment.

-       Publish data to WLCG: data available within one week of each site completing deployment.

SE Reliability:

-       Probe development: 16/5/2008 - this is not a probe, but development of the mechanism is complete.

-       Probes deployed to sites: deployment starts end of May 2008 with OSG 1.0; need to work with US-CMS and US-ATLAS for a deployment commitment.

-       Publish data to WLCG: data available within one week of each site completing deployment.

 

One can note that:

-       CE Reliability will be complete by 1 June 2008.

-       SE Availability and Reliability will start at the end of May, with one week per site to publish the data in the SAM DB.

 

J.Templon asked why the probes have to be deployed at the sites. The SE probes in SAM run from outside the sites to test SRM access remotely; testing locally is not a realistic use case of the SRM.

R.Quick replied that for now the tests are not remote, but this will be changed if needed.

I.Bird replied that remote testing is needed to really check that, for instance, firewalls are properly set up; running locally is not sufficient.

 

R.Pordes added that the remote tests could be the VO probes.

I.Bird replied that this will also be needed, but the general tests should also check basic remote access.

 

The milestones originally proposed are:

-       WLCG verification of OSG test equivalence of RSV tests to WLCG required tests. - David is reporting on this at this meeting.

-       Necessary OSG RSV probes released and available for deployment. - CE done, SE being tested.

-       All ATLAS/CMS T1 and T2s  report site availability information. - CE done, SE being tested.

-       OSG Reliability Reported to WLCG - Testing done, available from GOC in June

 

A.Aimar proposed that only the first and last be considered as High Level Milestones:

-       WLCG verification of OSG test equivalence of RSV tests to WLCG required tests. - David is reporting on this at this meeting.

-       OSG Reliability Reported to WLCG - Testing done, available from GOC in June

The others are internal to RSV.

The MB agreed.

 

R.Quick added that GridView does not seem to show consistent agreement with SAM for OSG Resources, and that GridView is difficult to navigate: it is hard for Resource admins to find their results, and it is difficult or impossible to send a link to specific GridView pages.

 

There is an unclear testing infrastructure for SAM (to test the new SE probe results and reporting from our Testbed).

There is confusion on the sources of OSG resource information for SAM (why is SAM using GOCDB for some OSG resources?).

 

Here are the modified HL Milestones:

 

OSG RSV Tests:

-       WLCG-08-01 (May 2008): RSV Tier-2 CE Tests Equivalent to SAM.
Successful WLCG verification of OSG test equivalence of RSV tests to WLCG CE tests. (OSG-RSV)

-       WLCG-08-01b (Jun 2008): RSV Tier-2 SE Tests Equivalent to SAM.
Successful WLCG verification of OSG test equivalence of RSV tests to WLCG SE tests. (OSG-RSV)

-       WLCG-08-02 (Jun 2008): OSG Tier-2 Reliability Reported.
OSG RSV information published in the SAM and GOCDB databases. Reliability reports include the OSG Tier-2 sites. (OSG-RSV)

 

 

 

6.   High Level Milestones for 2008 (HLM_20080518) - A.Aimar

 

The High Level Milestones will be used for the Quarterly Report. MB members should send their comments.

 

 

7.   AOB
 

No AOB.

 

8.   Summary of New Actions

 

 

The full Action List, with current and past items, will be in the Action List wiki page (linked above) before the next MB meeting.

 

 

New Action:

24 May 2008 - A.Aimar will set up a page for the site operations alerts.

Update: The page for Operations Alarms is here: https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage

 

New Action:

27 May 2008 - Sites and Experiments should fill the Operations Alarm Page: https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage

 

New Action:

31 May 2008 - MB Members should send feedback to I.Bird on the LHCC Review Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=33695