LCG Management Board

Date/Time

Tuesday 27 January 2009 - 16:00-17:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=49388

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 31.1.2009)

Participants

A.Aimar (notes), L.Betev, I.Bird (chair), K.Bos, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, S.Foffano, J.Gordon, A.Heiss, F.Hernandez, M.Kasemann, P.Mato, G.Merino, A.Pace, Di Qing, H.Renshall, M.Schulz, J.Templon

Invited

A.Di Girolamo, P.Mendez-Lorenzo, R.Santinelli, A.Sciabá

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 3 February 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

1.1      Minutes of Previous Meeting

J.Templon asked to comment on the minutes of the meeting of two weeks ago. I.Bird proposed that all discussions referring to future procurement wait until the workshop in Chamonix, where the new LHC schedule will be discussed. J.Templon agreed.

 

No other comments. The minutes of the previous MB meeting were approved.

 

2.   Action List Review (List of actions) 
 

 

Not all due actions were discussed at the meeting.

  • VOBoxes SLAs:
    • Experiments should reply to the VOBoxes SLAs at CERN (all 4 Experiments) and at IN2P3 (CMS).
    • NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

J.Templon reported that the NL-T1 SLA has been sent to the Experiments for review and approval.

 

3.   LCG Operations Weekly Report (Slides) – H.Renshall

Summary of status and progress of the LCG Operations since last MB meeting. The daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

Business as usual: the rate of new problems that arise and need follow-up service changes remains high. The problems remain heterogeneous, with no predictable pattern to indicate where effort would best be invested. However, the resulting service failures are usually of short duration.

3.1      GGUS Tickets

Below is the summary of the tickets submitted during the week; most of them from ATLAS (34 of 39).

 

 

These were the 4 alarms raised:

-       CMS to CNAF created 20 Jan, addressed same day and marked as solved 21 Jan

-       Tue 20 Jan D.Bonacorsi reported that PhEDEx exports from CNAF had been failing for a few hours with CASTOR timeouts.

-       An additional problem was that he was not allowed to trigger an SMS alarm for INFN T1 (being followed up).

-       The problem was addressed the same day and found to be due to a single disk server.

-       ATLAS to FZK created 21 Jan and marked as solved same day

-       Test of alarm ticket workflow after new release. Closed within 1 hour.

-       ATLAS to RAL created 22 Jan, marked as verified same day

-       Thu 22 Jan RAL was failing to accept data from the Tier-0, giving an Oracle error on a bulk insert call. Solved within 1 hour by restarting the SRM processes, after which FTS reported no further errors.

-       ATLAS to CERN created 23 Jan marked as verified same day

-       Fri 23 Jan from 15:00 almost all transfers to the CERN ATLASMCDISK space token were failing with ‘possible disk full’ errors. This was due to a misconfigured disk server that was then removed. On Saturday imports failed again, reported as the pool being full: monitoring showed it full while stager and SRM queries did not. Removing the misconfigured disk server had also taken out state information, a known CASTOR problem. The machine was back in the pool on Monday and the state information was resynchronised.

3.2      Service Incidents

These are the new or outstanding Services Incidents.

-       ASGC: Jan 24 FTS job submission failures due to the Oracle maximum table space limit being reached. 100 MB had to be added manually. The follow-up is to add a new plug-in that monitors the table space size in order to avoid the same situation in the future.

-       FZK: Jan 24 The FTS and LFC services at FZK went down due to a problem with the Oracle backend. The problem was quite complex and Oracle support was involved. Reported as solved late Monday.

 

A.Heiss reported that the problem seems similar to the one at ASGC. The table space filled up and it was not possible to add disk space, leaving the database in an inconsistent state. An Oracle support DBA had to be involved.

 

-       BNL: Jan 23 hit by the FTS delegated proxy corruption bug, a repeated source of annoyance. Back porting of the fix from FTS 2.2 to 2.1 is now in certification and eagerly awaited.

 

I.Bird asked whether the issue is due to a bug in Oracle, to the way Oracle is used, or to a problem in the application.

 

-       PIC: Jan 24 Barcelona was hit by a storm with strong winds and PIC suffered a power cut around midday, which turned off the air conditioning and closed the site down. Operations resumed Monday morning and the site was fully back around midday. There were some problems bringing back the Oracle databases after their unclean shutdown.

-       CERN:  Jan 22/23 the lcg-cp command (which makes SRM calls) started failing when requested to create 2 levels of new directories. Recent SRM server upgrade suspected - to be followed up.

-       PIC: Jan 21 high SRM load due to srmLs commands thought to be from CMS jobs (as previously seen at FZK). No easy/feasible control mechanism.

-       CERN: Jan 20 25K LHCb jobs "stuck" in the WMS in waiting status. Further investigation suggests an LB bug that occasionally leaves jobs in a limbo state. The plan is to see whether the latest patch fixes it. In the meantime DIRAC will discard such jobs after 24 hours.

-       CERN: Jan 19 the ATLAS eLogger backend daemon hung and had to be killed. The follow-up will be to trap the condition and raise an alarm. The service is heavily used at many levels of ATLAS and users did not know where to report the problem; a new ATLAS-eLogger-support Remedy workflow has therefore been created, with GS group members as the service managers.

3.3      Situation at ASGC

Jason Shih was at CERN last week and face-to-face discussions (with J.Shiers) covered the ASGC database services, 3D and CASTOR+SRM, including the migration away from the OCFS2 file system. A target date for restoring the “3D” service (ATLAS LFC and conditions, FTS) to production is early February 2009.

 

The new hardware should be ready before that, allowing sufficient time for testing, resynchronization of the ATLAS conditions data and then VO testing before the service is announced as open.

 

A tentative target for a “clean” CASTOR+SRM+DB service is mid-February 2009, preferably in time for the CASTOR external operations F2F (Feb 18-19 at RAL). It is less clear that this date can be met and regular checkpoints will be needed (work will start after the Chinese New Year, on Feb 2 – see notes).

 

ASGC will participate in the bi-weekly CASTOR external operations calls, the 3D conference calls and the daily WLCG operations meeting.

3.4      Other Service Changes

GGUS has now released direct routing to sites for all tickets, not just team and alarm tickets. The submitter has a new field to target the site when it is known.

 

CERN moved back into production the two LCG backbone routers that were shut down before Christmas (the operation was attended by hardware support engineers).

 

CNAF has been investigating the long-standing problematic access to the LHCb shared software area in GPFS. A debugging session showed that, because of intrinsic limitations of its caching mechanism, the GPFS file system runs the LHCb setup script in 10-30 minutes (depending on the load of the worker node), while an NFS file system (on top of GPFS storage) runs the same script in the more reasonable time of 3-4 seconds, as at other sites. A migration is planned.

 

L.Dell’Agnello reported that CNAF will move 30% of the farm to this new shared-area model for testing purposes.

 

Following the ATLAS 10-million-file test, IN2P3 has deployed a third load-balanced FTS server addressing CNAF, PIC, RAL and ASGC. Some improvement, though not much, was initially seen.

 

M.Schulz reported that the LHCb gssklog problem is going to be fixed. It appeared only at CERN and will be resolved.

 

4.   Proposal of New CPU Benchmarking Unit (Document) – G.Merino

 

G.Merino summarized the attached document (Link). The document explains how to organize the transition from the old to the new CPU benchmarking unit (called “HEP-SPEC” in the document, but to be called HEP-SPEC06) and asks for agreement on the proposed conversion factor.

 

Within the work of the HEPIX Benchmarking WG, a cluster (LXBENCH1) of eight machines was set up at CERN to be used as the reference platform for running the benchmarking codes. The old (SI2K) and the newly proposed (HEP-SPEC) benchmarks have been run on every machine. The results obtained are summarized in the document.

 

The proposal in the document is to use a conversion ratio of 4.00.
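
A minimal sketch of what this conversion means in practice, assuming the proposed factor of 4.00 (the site names and pledge values below are hypothetical and are not taken from the document):

# Illustrative only: translate a kSI2K table into HEP-SPEC06 with the proposed factor.
KSI2K_TO_HEPSPEC06 = 4.00

def to_hepspec06(ksi2k: float) -> float:
    """Convert a CPU power figure from kSI2K to HEP-SPEC06."""
    return ksi2k * KSI2K_TO_HEPSPEC06

# Hypothetical pledge table in kSI2K
pledges_ksi2k = {"Tier1-A": 3500.0, "Tier1-B": 1200.0}
pledges_hepspec06 = {site: to_hepspec06(v) for site, v in pledges_ksi2k.items()}
print(pledges_hepspec06)  # {'Tier1-A': 14000.0, 'Tier1-B': 4800.0}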

 

From the document:

 

The members of this group believe that uncertainties of the order of 5% are inherent to this kind of CPU power measurements when realistic scenarios are considered. On the other hand, the requirements that these measurements have to be compared with, also suffer from uncertainties of the same order.

 

Given this situation, rather than arguing about different values within this 5% resolution, this group proposes a simple conversion rule which eases the whole transition process. Therefore the proposal is to adopt 4.00 as conversion factor representing the benchmarks ratio HEPSPEC/kSI2K.

 

The proposal is that wherever the current units appear the new ones are used instead, namely in:

-       the MoU document

-       the pledges from the Sites

-       the Experiment requirements

 

This is the time scale, from the document:

 

Once a fixed value has been agreed for the conversion factor, this group proposes the following scenario for the transition period, which should conclude before the April 2009 CRRB:

 

1. The current tables for requirements and pledges are translated into the new units as they are, by just applying the agreed factor.

 

2. Sites buy the SPECCPU2006 benchmark and calibrate their farms to be able to report their current CPU power in the new unit. The appropriate scripts and configuration files are provided by the HEPIX Benchmarking WG at Ref. [2].

 

3. A technical group is appointed to propose a detailed plan on how to make the migration to the new units regarding the CPU power published by sites through the Information System and stored in the Accounting system. Ideally, the plan should allow a coexistence period for the two units.

 

4. For the April CRRB, sites publish their updated pledges tables in the new units.

 

5. For the April CRRB, experiments re-compute their requirements tables given the new LHC schedule. New numbers should be computed already in the new unit.

 

Finally, this group also proposes to set up a web site for WLCG sites to publish the HEP-SPEC results obtained for their machines, so that other sites and the experiments can query these results. Publication of results can be automated at the level of the script to run the benchmark by enabling the reporting flag.

 

I.Bird agreed that there should be a place where the results can be reported and shared with the Tier-2 Sites. In this way all sites benefit from the same benchmarks.

 

I.Bird asked about the fact that at the previous GDB the values for CMS seemed not to be in line with those of the other Experiments.

G.Merino replied that CMS was not able to run its entire suite but only a representative subset. With this subset CMS confirmed that its code benchmarks like that of the other Experiments.

 

Decision.

I.Bird asked the MB to approve a conversion factor of 4.00. The MB approved.

 

So the next steps are:

-       Convert the current requirements to the new unit.

-       Tier-1 Sites and the main Tier-2 Sites buy the license for the benchmark.

-       A web site should be set up at CERN to store the values from the WLCG sites.

-       A group prepares the plan for the migration of the CPU power published by sites through the Information System (J.Gordon replied that this will be discussed at the MB next week).

-       Pledges and requirements need to be updated.

 

New Action:

Convert the current requirements to the new unit.

Tier-1 Sites and the main Tier-2 Sites buy the license for the benchmark.

A web site should be set up at CERN to store the values from the WLCG sites.

A group prepares the plan for the migration of the CPU power published by sites through the Information System (J.Gordon replied that this will be discussed at the MB next week).

Pledges and requirements need to be updated.

 

J.Templon reminded the MB that the reference platform is no longer SLC4 (the CERN version of SL4) but should now be RHEL 5 (and SL5), without requiring that the CERN-specific version be used. In addition, the OS version should not be fixed but expressed as a minimum (“bigger than”), in order to cover future upgrades (e.g. 5.1, 6.0, etc.).

 

I.Bird agreed, but added that everybody should use the same compiler version and exactly the same options for running the benchmarks.

 

G.Merino noted that the feeling in the group was that one ought to run the benchmark on the exact hardware and OS used to provide the service; in this way the values obtained correctly represent the resources providing those services.

 

Ph.Charpentier and M.Kasemann noted that the Experiments will also change their requirements following their own reassessment. They could report in both units for some time, until the new unit is used everywhere.

 

F.Hernandez proposed that the name HEP-SPEC should include the year, because the benchmark may change in the future.

 

HEP-SPEC06 was the name selected for the new CPU benchmarking unit.

 

5.   Comments on VO-specific SAM results (Dec 2008) (SAM Reports; Slides)

 

5.1      ALICE – P.Mendez-Lorenzo

ALICE has established a collaboration with the VECC-T2 site (India) for the improvement and maintenance of the SAM structure for the ALICE experiment, based on the good experience gained with them in the past while creating the SAM-VOBOX test environment for ALICE.

 

Progress was made in November-December 2008:

-       Documentation of the SAM-VOBOX test suite for ALICE

-       Migration of the CE sensor tests from RB to WMS submission mode

 

The future plans are:

-       Creation of a WMS sensor test suite

-       (Proposed already during the F2F MB meeting in November 2008)

 

Even if it was not running a large number of jobs, ALICE was in production at that time. Tier-1 stability in December 2008 was very good for ALICE and the Tier-1 support was efficient, as usual.

 


 

Most Sites had no problems in December 2008 with the ALICE SAM tests. Only RAL and FZK had a few:

-       FZK (slide 4): SAM reported errors associated with the CE sensor, related to the job submission test (CT for ALICE), affecting all CEs at the site (ce-[1…5]-fzk.gridka.de) in the period 26-31/12/2008.
FZK has 5 computing elements; only one was running in December 2008 and the Site was considered up.

-       RAL (slide 6): RAL had some issues at the beginning of December 2008: for a few days a user proxy problem was observed at the queue. Jobs were properly submitted by the RB, but problems appeared at the CE level.

 

These were mostly WMS-related issues; they were presented at the GDB in January and are being addressed. The inclusion of a new sensor related to the WMS and the setup of the CREAM-CE at all sites are the most important goals of the Experiment at this moment.

 

I.Bird asked whether issues from the ALICE tests are sent to the Sites via some automatic ticketing or messaging system.

P.Mendez replied that for the moment Nagios, or another messaging system, is not connected at all Sites; ALICE manually sends an email message when it spots an issue.

 

I.Bird asked whether ALICE considers these SAM tests representative of the status of a Site for ALICE, and whether the results can be considered reliable and therefore reported.

P.Mendez replied that, at the moment, the SAM tests are sufficient for ALICE.

5.2      ATLAS – A.Di Girolamo

Below is the availability for all the ATLAS Tier-1 Sites. It is sufficient at all Sites except FZK. However, the test submission itself fails very often and the results are then counted as “unknown”.

 

 

The problems at FZK were due to the new SRM v2 tests that are now critical. The SRM had many time-outs under load, without any particular reason, and this caused the SRM v2 tests to block on lock files on the SAM servers.

 

 

Below are the details of the tests, showing the service attempting and often failing the “lcg-cp” and “lcg-cr” tests. Only the “lcg-del” tests work properly all the time, because delete operations do not use any lock file.

 

 

A related issue was that the execution of the ATLAS tests failed and resulted in numerous “unknown” results, as shown in the picture below.

 

DECEMBER 2008

 

JANUARY 2009

 

Many lock files are left on the SAM production servers and this has been reported several times to the SAM service. The solution is to introduce a time-out in the SAM test execution so that non-responding tests do not cause the subsequent tests to find the files locked.
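
A minimal sketch of the kind of time-out guard meant here (this is not the actual SAM implementation; the lock-file path and test command are hypothetical):

# Illustrative sketch: run a test with a time-out and always release its lock file,
# so that a hanging test does not leave the lock behind for the following tests.
import os
import subprocess

LOCK_FILE = "/var/lock/sam-atlas-srmv2.lock"  # hypothetical path
TEST_CMD = ["run-srmv2-test"]                 # hypothetical test command
TIMEOUT_SECONDS = 600

def run_test_with_timeout() -> None:
    with open(LOCK_FILE, "w") as lock:
        lock.write(str(os.getpid()))
    try:
        subprocess.run(TEST_CMD, timeout=TIMEOUT_SECONDS, check=False)
    except subprocess.TimeoutExpired:
        print("test timed out; result reported as unknown instead of blocking later tests")
    finally:
        # Remove the lock even if the test hung, so later tests do not find it locked.
        if os.path.exists(LOCK_FILE):
            os.remove(LOCK_FILE)

if __name__ == "__main__":
    run_test_with_timeout()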

 

Now a new LEMON sensor has been introduced and will report immediately if a lock file already exists. Until now this had to be verified manually.

 

ATLAS and SAM should try to understand the causes of the lock files left behind.

 

P.Mendez reported that this also happens to ALICE, but it is far less relevant than for ATLAS.

 

New Action:

M.Schulz will report about the issues on the SAM servers (and the issues of the ATLAS lock files).

 

I.Bird asked whether the tests for ATLAS are sufficient and report correct alarms.

A.Di Girolamo replied that the granularity is not yet sufficient for ATLAS and that more tests should also cover the availability of the space tokens.

 

J.Templon noted that the failures are very often due to errors caused by users and not by Sites (for example trying to open wrong file names, or a copy to PIC where the problem was on the target site and not on the source). Often he also submits a GGUS ticket in such cases.

 

I.Bird asked whether NL-T1 is the only Site with problems or simply the only Site actually checking the SAM results.

 

J.Gordon noted that it is the VOs that should verify their tests, not the Sites.

I.Bird agreed, but added that in this preparation phase both should verify the test failures: the reports will soon start being sent to the Overview Board and should be carefully verified before that. Sending alerts for test failures would help to analyse and clarify every failure.

 

J.Gordon asked what happens when new tests are introduced and who manages the changes of the VO tests.

M.Kasemann proposed that new tests be introduced as non-critical and added as critical only after verification and approval.

 

A.Di Girolamo noted that new tests are always announced 2 weeks in advance at the Operations meeting.

 

F.Hernandez asked whether ATLAS is still using the SRMV1.

A.Di Girolamo replied that GridView is still showing the SRMV1 but the ATLAS critical tests are on the SRMV2.

5.3      CMS – A.Sciabá

 

The critical services for CMS are the CE and the SRMv2 (since December).

For this reason the critical tests implemented are:

-       CE: job submission (the same OPS tests but run by CMS), CA certificates (run by OPS)

-       SRMv2: “lcg-cp” (copies a local file to the SRM, in the provided space tokens and in the default space, and then copies it back), run by CMS; a minimal sketch follows below.
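
A minimal sketch of such a copy-out/copy-back check (illustrative only, not the actual CMS SAM test; the SRM endpoint and file names are hypothetical and the space-token options of the real test are omitted):

# Illustrative round-trip check built on the lcg-cp command line tool.
import filecmp
import os
import subprocess
import sys
import tempfile

SRM_URL = "srm://srm.example.org/castor/example.org/cms/sam/testfile"  # hypothetical

def round_trip() -> bool:
    src = tempfile.NamedTemporaryFile(delete=False)
    src.write(b"SAM SRMv2 round-trip test payload\n")
    src.close()
    back = src.name + ".back"
    try:
        # Copy the local file to the SRM, then copy it back and compare the contents.
        if subprocess.call(["lcg-cp", "file://" + src.name, SRM_URL]) != 0:
            return False
        if subprocess.call(["lcg-cp", SRM_URL, "file://" + back]) != 0:
            return False
        return filecmp.cmp(src.name, back, shallow=False)
    finally:
        for path in (src.name, back):
            if os.path.exists(path):
                os.remove(path)

if __name__ == "__main__":
    sys.exit(0 if round_trip() else 1)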

 

There were very few test failures for CMS in December 2008.

 

CERN: No relevant problems.

 

ASGC: Oracle problem that overloaded CASTOR. Reliability is poor as all downtime was unscheduled.
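
For context, the availability and reliability figures quoted in these reports are assumed here to follow the usual WLCG definitions, in which only scheduled downtime is excused from reliability:

\text{availability} = \frac{T_{\text{up}}}{T_{\text{total}}}, \qquad \text{reliability} = \frac{T_{\text{up}}}{T_{\text{total}} - T_{\text{scheduled down}}}

When all downtime is unscheduled the two coincide, so the reliability is as low as the availability.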

 

CNAF: Very good reliability. There was the problem of the central software area, but the corresponding test was not critical; maybe it should be made critical.

 

FNAL: Good results, but the CE is still not visible in GridView. In the BDII there are two FNAL sites, one with the CE and one with the SRM. This was reported a long time ago and should be reported again.

 

 

IN2P3: Not good results. There was electrical maintenance at the beginning of December. On 26 Dec the CE was the problem, but for the rest of the month it was usually the SRM v2. The SRM was overloaded by a recursive “lcg-ls” and it took a long time to fix.

 

PIC: Three days of maintenance, but the rest of the month was good.

 

FZK: The CE was down, without a declared downtime, for the whole Christmas period.

 

RAL: Excellent reliability.

 

I.Bird asked whether CMS sends automatic alerts to Sites or currently follows another procedure.

A.Sciabá replied that when the people on shift see a problem they send a GGUS ticket. For the moment this will not be automated, because that is what was agreed with the Sites.

 

I.Bird reminded the Sites that they should announce downtimes in advance in the GOCDB (a declared downtime appears yellow in the availability graph).

5.4      LHCb – R.Santinelli

LHCb has 2 independent kinds of tests:

-       DIRAC3 and LHCb-specific tests, plus installation of the core applications (if not in ‘safe’ mode): Gauss, Boole, Brunel, DaVinci, Dirac-Install, queues check, OS check

    • running as SGM

    • not critical in the WLCG FCR

    • offering, however, the effective LHCb perception of site usability

    • submitted via DIRAC

-       Trivial infrastructure tests “JS”, “CSH” and “Swdir” (inherited from OPS, running with LHCb production credentials)

    • disentangled from DIRAC (following a bad experience in which tests were missing for several months because of internal DIRAC problems)

    • critical in the FCR

    • targeted at real problems with the site infrastructure (which the site administrators should then take care of)

    • using a dedicated machine and a SAM UI installed from CVS

 

LHCb plans to reintroduce more tests (critical in the past):

-       ConditionDB access

-       LHCb-availability test that shows when it is impossible (from DIRAC) to submit to the CE (like the JS test)

-       Include the trivial infrastructure tests in the DIRAC submission framework and polish them a bit

 

Problems with the SAM framework:

-       9th December: because of a known issue with SAM clients not supporting soft links, a migration to a dedicated box was needed. No data was available until 13th Dec. Posted as Savannah bug #45246.

-       20th December: because some jobs were lost (due to a network problem), CE test jobs were not being submitted. LHCb discovered the issue (which also affected other tests) only after the Christmas break. Reported via GGUS #45090.

 


 

Below is the general availability of LHCb at the Sites.

 

Fairly smooth behavior for all sites until the 10th but then most tests stopped working.

 

Apart from GridKA for the file access test (a problem under investigation for a long time, see GGUS #43893), LHCb noticed that all sites had bad results only because of external or general causes.

 

The red traces in the period 14th-16th Dec for all sites are due to the CE tests containing incorrect code for the “caver” test (set critical at that time). The other red traces around the 25th (for all sites) are due to the SRMv2 test checking the space tokens in the site BDII. The reason for this failure has been lost, though it looks like a network problem (being common across all sites).

 

GridView still shows two major pieces of incorrect information for LHCb:

-       ASGC is not an LHCb Tier-1 Site.

-       LHCb is still waiting for GridView to consider NL-T1 as a single site and not as two Tier-1s (merging NIKHEF and SARA).

 

I.Bird noted that NL-T1 must be declared in the GOCDB as a single site; this does not depend on GridView.

 

LHCb would like the LHCb dashboard to be more heavily used. The dashboard is in sync with SAM as far as test results and their criticality are concerned, and it offers different groupings and availability definitions that better fit the LHCb perception of site usability.

 

The dashboard offers historical views of results and availability measurements and is available here: http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry

 

 

6.   CMS Quarterly Report 2008 Q4 (Slides) – M.Kasemann

 

Due to lack of time, the CMS Quarterly Report was not presented. The Slides are attached.

7.   AOB

 

 

No AOB.

 

 

8.    Summary of New Actions

 

 

 

New Action:

M.Schulz will report about the issues on the SAM servers (and the issues of the ATLAS lock files).

 

New Actions:

-       Convert the current requirements to the new unit.

-       Tier-1 Sites and the main Tier-2 Sites buy the license for the benchmark.

-       A web site should be set up at CERN to store the values from the WLCG sites.

-       A group prepares the plan for the migration of the CPU power published by sites through the Information System (J.Gordon replied that this will be discussed at the MB next week).

-       Pledges and requirements need to be updated.