LCG Management Board

Date/Time

Tuesday 29 September 2009 16:00-17:00 – Phone Meeting 

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=62560

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 3.10.2009)

Participants

A.Aimar (notes), J.Bakken, I.Bird (chair), K.Bos, D.Britton, T.Cass, L.Dell’Agnello, M.Ernst, P.Flix, S.Foffano, Qin Gang, J.Gordon, D.Groep, A.Heiss, M.Kasemann, P.McBride, A.Pace, B.Panzer, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout 

Invited

S.Traylen, R.Wartel

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 6 October 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes of the previous meeting.

1.2      Quarterly Report Preparation – A.Aimar

A.Aimar proposed the dates for the Quarterly Reports from the Experiments.

The agreed dates are:

-       6 October: ALICE, CMS

-       13 October: LHCb

-       20 October: ATLAS

1.3      GridView Availability Calculations – I.Bird

As agreed at the previous MB Meeting, the GridView team has just fixed the problem of wrongly counting unscheduled downtimes in the reliability calculations. The September reports will show the correct implementation of the agreed definition of reliability.
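
For reference, the agreed definition can be sketched as follows, assuming the usual WLCG-style convention that reliability discounts scheduled downtime from the denominator while unscheduled downtime still counts against the site (an illustrative sketch, not the GridView implementation):

    # Illustrative availability/reliability calculation; the different
    # treatment of scheduled vs. unscheduled downtime is the point of the fix.
    def availability(up_hours: float, total_hours: float) -> float:
        # All downtime, scheduled or not, reduces availability.
        return up_hours / total_hours

    def reliability(up_hours: float, total_hours: float,
                    scheduled_down_hours: float) -> float:
        # Scheduled downtime is removed from the denominator; unscheduled
        # downtime is not, so it still lowers reliability.
        return up_hours / (total_hours - scheduled_down_hours)

    # Example: a 720-hour month with 24h scheduled and 12h unscheduled downtime.
    up = 720 - 24 - 12
    print(availability(up, 720))     # ~0.95
    print(reliability(up, 720, 24))  # ~0.983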

1.4      Update on Tier-1 Pledges – S.Foffano

S.Foffano is expecting the confirmation of the 2010 pledges from the Tier-1 Sites. Not all Sites replied by the deadline of 28 September.

 

She asked for news from the following Sites (the respective answers are reported below):

-       TRIUMF: R.Tafirout replied that they will confirm what is installed so far in 2009 and their pledges for 2010.

-       IN2P3: Not present at the meeting, she will ask via email.

-       INFN: Not present at the meeting, she will ask via email.

-       NL-T1: Will answer via email

-       NDGF: No information from Sweden. O.Smirnova said that she will send some incomplete information with the current proposed procurement draft.

-       ASGC: S.Foffano will forward the questions to Qin Gang.

 

2.   Action List Review (List of actions)

 

  • 5 May 2009 - CNAF completes, and sends to R.Wartel, their plans to solve the issues concerning security procedures, alarm contact point, etc.

No news.

  •  Each Site should send the URL of the XML File with the Tape Metrics to A.Aimar.

Not done by: FR-CCIN2P3, NDGF and NL-T1. Sites can provide what they have at the moment.

See http://sls.cern.ch/sls/service.php?id=WLCG_Tier1_Tape_Metrics

Sites should send URLs pointing to their existing information until they can provide the required information.

No updates from NDGF.
D.Groep added that NL-T1 will report the values within the following day.


Update: NL-T1 added to the SLS metrics page on the following day. Only CC-IN2P3 and NDGF left.
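
For illustration, pulling one of these XML feeds could look like the minimal Python sketch below; it assumes an SLS-style layout with named numeric-value entries, and the feed URL and element names are placeholders rather than the actual site feeds:

    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_tape_metrics(url):
        # Download and parse the site's XML feed.
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        # Collect every named numeric value, ignoring XML namespaces.
        metrics = {}
        for elem in root.iter():
            if elem.tag.endswith("numericvalue"):
                metrics[elem.get("name")] = float(elem.text)
        return metrics

    # Hypothetical feed URL; each Tier-1 would publish its own.
    print(fetch_tape_metrics("https://tier1.example.org/tape_metrics.xml"))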

 

3.   LCG Operations Weekly Report (Slides) – H.Renshall
 

 

Summary of the status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

The report covers the three weeks from 6 to 27 September. There was a mixture of problems, mostly database-related.

 

There was only one test alarm ticket, from CERN to BNL-T1. It was acknowledged by BNL but did not automatically open an OSG ticket. New global tests will be performed this week.

 

Incidents leading to service incident reports

-       CERN: Two separate LHCb CASTOR blockages 7 and 8 September.

-       FZK: Access to the ATLAS DB blocked on 7 September with too many open sessions. The service was degraded from 8 to 16 September.

-       RAL: Disk-to-disk transfers failing from 15 to 17 September during a planned upgrade to the CASTOR Name Server.

-       CERN: ATLAS Replication Tier0 to Tier1 down on 21 Sep.

 

It is hard to understand the several Oracle bugs/features/side-effects at CERN-LHCb, FZK-ATLAS and CERN-ATLAS.

 

Tickets per VO during the period:

VO      User  Team  Alarm  Total
ALICE      2     0      0      2
ATLAS     41    81      1    123
CMS        8     0      0      8
LHCb       2    56      0     58
Totals    53   137      1    191

 

Slide 4 shows the results of the VO SAM tests of the 4 Experiments. One can notice:

-       FZK ATLAS 3D DB degradation from 7-16 Sep – see SIR.

-       RAL name server upgrade led to disk-to-disk copy failures 15-17 Sep for all VOs. See SIR.

-       NIKHEF had scheduled network maintenance on the morning of the 22nd, three dCache-SRM crashes in one week (being debugged with the developers) and a SAN problem on the 21st which cut off their BDII.

-       IN2P3 scheduled electrical power upgrades from 22 to 24 September, extended to “at risk” for 2 days (followed now by the dCache migration to the Chimera file system, 28-30 Sep).

3.2      Service Incidents Reports (SIRs)

CERN CASTOR LHCb Outages

On 7 September the CASTORLHCB stager was blocked from 05:00 for 3 hours when a table space could not be extended. The extension should be automatic, but the file system holding an associated control file had been filled by Oracle dumps written there by a known bug in the Clusterware. The Lemon metric that should warn of the full file system was unfortunately not correctly configured. This has now been fixed and the DES group is checking Oracle patch availability to prevent further core files.
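
The kind of check the misconfigured metric should have performed is simple; below is a hedged Python sketch (the mount point and threshold are invented for illustration, and this is not the Lemon sensor itself):

    import os

    def fs_usage_fraction(path):
        # Fraction of the file system at 'path' that is in use.
        st = os.statvfs(path)
        used = (st.f_blocks - st.f_bfree) * st.f_frsize
        return used / (st.f_blocks * st.f_frsize)

    def warn_if_nearly_full(path, threshold=0.90):
        frac = fs_usage_fraction(path)
        if frac >= threshold:
            print(f"WARNING: {path} is {frac:.0%} full")

    # Hypothetical partition holding the Oracle control files.
    warn_if_nearly_full("/ora/controlfiles")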

 

On 8 September one of the nodes of the LHCb RAC cluster became blocked (logins could not complete), with console logs showing errors on a disk partition. The Clusterware should have ejected the node but did not, the suspicion being that the node was still partially responding. The DB monitoring did detect the problem and the node was rebooted.

 

ATLAS 3D DB RAC at GridKa degradation Sep 7-16, 2009.

Due to too many open sessions on the first of the two ATLAS RAC nodes, we rebooted it on Sep 7. After the reboot the DB was not properly open (this was visible only in the alert log). We tried to restart the instance several times without success.
From Sep 8 we stopped the first instance and the DB ran only on the second node. The next day we opened an Oracle Service Request. (Streams replication of the ATLAS conditions to FZK was hence down from 7 to 8 September.)

 

On Sep 16 the second node rebooted for unknown reasons (maybe network). Surprisingly, the DB afterwards started properly on both nodes and everything seems fine now. Therefore, on Sep 17 we closed the SR.

 

From 8 to 16 Sep the service was at risk from a second failure and possible overloading (activity was in fact low during this period).

 

Follow-up: The cause is not clear, but we suspect this problem to be an unexpected consequence of the Oracle July CPU patches applied on Aug 11. Had we rebooted the nodes (one after the other) directly after patching, we might have noticed the issue already on Aug 11. We are discussing whether to re-open the SR to find the real cause (during the SR the Oracle support was not very helpful).

 

RAL Disk to Disk transfers failing 15-17 Sep

Disk to Disk (D2D) transfers started failing during a planned upgrade to the Name Server and were down for all VOs for 44 hours.

 

During testing after a scheduled upgrade of the CASTOR NS from version 2.1.7-27 to version 2.1.8, it was found that Disk to Disk (D2D) copies were failing across all instances. Any possible link with the NS upgrade was ruled out. Investigations carried out with the assistance of CASTOR developers at CERN revealed an LSF job scheduler problem resulting in D2D transfer jobs failing. After LSF was restarted, both on the central servers and on all disk servers, the D2D transfer problem disappeared.

 

Why this problem started affecting services after the NS upgrade is so far unknown as it has not been possible to reproduce. The D2D transfer problem had also been detected on the certification instance immediately prior to this upgrade during NS testing. Since D2D transfers are unrelated to the NS, it was decided to proceed with the upgrade. We now believe that the problem was caused by a wrong procedure for stopping castor services and LSF prior to the NS upgrade, and bringing them up afterwards.

 

Follow-up: Certification instance must always be fully functional, and no future upgrades should proceed if anything is broken. A clearly defined startup and shutdown sequence for the LSF scheduler (both central LSF master and on daemons on all disk servers) and other CASTOR services needs to be written and tested, and should be used during future upgrades.

 

ATLAS Replication Tier0->Tier1 down Sep 21st

The capture process aborted with “ORA-01280: Fatal LogMiner Error”. The failure occurred at 8:05 and was not noticed by the shifter until 18:00 due to the lack of email notification from the monitoring.

 

The lack of email notification from the monitoring was caused by an overload of the local sendmail (not the central sendmail), which sends a number of emails at 8:00. Because of the overload the message from the monitoring was not sent; in the meantime the monitoring web page was not checked by the shifter. Replication to the Tier-1 sites was completely re-established around 18:00.

The ATLAS DBAs noticed that the capture process was down but sent a private email to a person on holiday. The mail should have been sent to the physics database support email list.

 

Follow-up: The shifter is now obliged to check the monitoring web page. The monitoring is being reviewed to make sure that email notifications are sent each time an abort occurs. For each message that the monitoring fails to send in 3 tries, a thread will be created that retries until the message is sent.

We are in contact with Oracle support in order to identify the cause of the LogMiner error.
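
The retry scheme described above could look roughly like the following Python sketch; the send_email helper, addresses and intervals are illustrative assumptions, not the actual monitoring code:

    import smtplib
    import threading
    import time
    from email.message import EmailMessage

    MAX_DIRECT_TRIES = 3
    RETRY_INTERVAL = 60  # seconds between background retries

    def send_email(to_addr, subject, body):
        msg = EmailMessage()
        msg["From"] = "monitoring@example.org"  # hypothetical sender
        msg["To"] = to_addr
        msg["Subject"] = subject
        msg.set_content(body)
        with smtplib.SMTP("localhost") as smtp:  # local mail relay
            smtp.send_message(msg)

    def notify(to_addr, subject, body):
        # Try a few times directly ...
        for _ in range(MAX_DIRECT_TRIES):
            try:
                return send_email(to_addr, subject, body)
            except OSError:
                time.sleep(1)
        # ... then hand the message to a thread that retries until it is sent.
        def retry_forever():
            while True:
                try:
                    return send_email(to_addr, subject, body)
                except OSError:
                    time.sleep(RETRY_INTERVAL)
        threading.Thread(target=retry_forever, daemon=True).start()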

3.3      Miscellaneous Reports

Several other shorter reports were mentioned:

-       CREAM-CEs at CERN were failing ALICE job submission from 7 to 14 September with ‘blparser service is not alive’. This needed detailed investigation and understanding.

-       ASGC had problems migrating CMS data to tape over last week – tracked down to badly configured cartridges in the CMS pool. Planning to go from 6 to 24 tape drives mid-October so tape performance will be limited till then.

-       LHCb Dirac has now been certified for SL5. The CERN LHCb SRM was upgraded to 2.8 to properly support xroot TURLs.

-       Many WLCG staff were at EGEE’09 in Barcelona last week, giving presentations and demonstrations (at the WLCG booth) and attending EGI/HEP SSC meetings.

-       RAL had problems with their Maui/Torque batch scheduler since converting to SL5, suspected to be due to running the 32-bit server version; they cleared when the server and worker-node clients were brought up to the same software level.

-       CMS plan to hold what they call the October Exercise, involving all physics groups, in which they act as if they were trying to push out the first physics papers in a hurry after they have data. It starts on October 5th and lasts 2 weeks, with intensive grid job submission via their CRAB (CMS Remote Analysis Builder) servers.

-       For this they needed to increase their number of Grid pool accounts at CERN (cmsxxx) from the current 200 to 999, and this has been done (thanks to AIS and FIO). These accounts appear on lxplus, lxbatch and the CMS VO-boxes but are not in AFS. In fact they request to go to 2000 as a long-term solution.

-       To appear in LDAP they also need a NICE account, hence a mail account. An individual CCID can only have 1023 mail accounts, so a second service provider ‘CMS GRID-USER2’ has been set up in CRA by the AIS group.

 

I.Bird asked why so many email accounts are needed.

T.Cass replied that a different ID is needed for each user submitting jobs.

A.Pace added that these were created as secondary IDs of a single account, and there is a limit of about one thousand. Probably they should have been created as main accounts, which would not have this limitation.

I.Bird and T.Cass proposed to solve the issue outside the meeting.

M.Schulz added that the versions installed at many Sites are at least 6 months old and several bugs have been fixed since.

 

-       ATLAS will perform a throughput test of data distribution from the Tier-0 to their Tier-1 Sites for 5 days from 5 October, using version 2.2 of the FTS software.

-       ATLAS user analysis test for 21-23 October (or following weekend if not ready in time) to get many user analysis jobs running over the world-wide resources. Users are meant to run their normal analysis jobs during this test. This is a follow-on test from STEP-09 and the last before data taking.

-       The plan is for users to run analysis jobs on a set of large AOD datasets for the first two days, with the third day for copying output to Tier-3s or local disks.

-       In preparation for this test ATLAS are distributing 5 large containers of AOD data (each of 100M events = 10 TB), two to each Tier-2 cloud; expert users will then run over 3 containers so as to exercise at least two Tier-2 clouds with their analysis (these details are subject to change).

 

 

4.   Summary of LHCC Referees Meeting (Slides) – I.Bird

 

 

I.Bird presented a short summary of the LHCC Referees Meeting and the feedback received.

No comments from the MB.

4.1      Main Points in the Closed Session

Transition to SL5

The referees were worried that this transition ends up happening just as data taking is about to start.

 

ALICE

“They assume MSS at sites will be materialized. And they lack Tier 2 disk.” “They are satisfied with the performance and ask sites to be more proactive” (CREAM).

 

ATLAS:

DB access – Adding Frontier etc: “This is a bit surprising ... To wake up on this at this stage of experiment...”

SW performance improvement: “Sure there is still room for additional improvement”

 

LHCb:

CRSG support level of resources requested: “However CRSG report claims a significant change in LHCb computing model and recommends a review by LHCC for 2011”

 

Interaction with CRSG:

“Experiments strongly desire that the final proposed numbers reflect their estimations and not those from the CRSG limited model”

“In some cases (ALICE) non-fully resolved discrepancies are still present”

“Future iterations, this review process should be organised in such a way that avoids a double referee process and unnecessary delays and maybe involves LHCC early on”

4.2      Feedback

Next WLCG mini-review:

“Now I can confirm you we will cancel our meeting in November.”

 

Scrutiny group and process:

“Concerning your comments on CRSG/LHCC coordination and the overall timeliness of the resources review process, we have discussed it in closed session”.

 

“We consider that we should suggest now to CERN management and LHC management to come up with an early and somewhat conservative estimation of the total 2011 LHC running time. This information is needed by the funding agencies at the next RRB in order to define their procurement cycles. This should be the first step towards establishing a well-defined and coherent review process. We have to consider various running scenarios.”

 

 

 

5.   GDB Summary (SL5, CREAM, HEP-SPEC06, etc) (Text) – J.Gordon

 

 

J.Gordon summarized September’s GDB Meeting. He read the following text.

 

GDB Issues for MB

John Gordon 29/9/2009

 

1.   Security Patching

The EGEE Security Officer expressed concern that so few sites had applied a critical security patch a couple of weeks after it went public. It has since been escalated further, and last week the EGEE PMB said they would recommend disconnecting from the Grid sites that were still unpatched after the end of September. This topic does not seem to have been discussed at the daily WLCG operations meetings. I don’t know the OSG view on this, but I think WLCG Management should support the infrastructure security operations staff in stressing how important an issue this is. Any security incident associated with the Grid will be a big setback to our progress, at the worst possible time just before data taking.

 

2.   SL5

A significant fraction of WNs are now deployed as SL5 64 bit. Experience will tell whether they get used. Soon, but probably not at the October GDB, sites will be asking the experiments how much longer they need to provide SL4 for LHC.

 

3.   WMS

WMSs supporting the LHC experiments should upgrade to version 3.2 by the end of October. This version can submit to CREAM, and migration will allow CREAM CEs to be flagged as production and matched by the WMS for general work. It is an open question whether this change should be made to enable significant testing of CREAM as a replacement for the LCG-CE, or whether wide-scale testing by consenting adults should be done before exposing CREAM to all jobs.

 

4.   Authorisation

For some months we have been encouraging the testing of SCAS and gLite to enable identity changing in pilot jobs. This has not been forthcoming at any significant scale. The current advice is to use gLite and SCAS. The issues are that an SL5 release of gLite will only go into production in October, and the longer widespread deployment of SCAS takes, the more credible Argus (an alternative gLite product) becomes.

 

5.   Virtualisation

Tony Cass gave a good presentation on establishing the trust of sites in distributed images. Paraphrasing, the experiment view is that user-built images should be run instead of jobs, as happens in Clouds; the site view is: why would they trust random images from random sources not to subvert their site security? Tony gave a roadmap for how experiments and projects (like WLCG) could establish the trust of sites. I will be convening a track at HEPiX next month which can start this.

 

 

No comments from the MB.

 

I.Bird asked whether the discussion on SCAS and Argus will take place at the next GDB.

J.Gordon replied that he will add it to the GDB Agenda.

 

 

6.   Security Monitoring (Slides) – R.Wartel

 

 

R.Wartel presented the general situation on Security at the WLCG Sites.

6.1      Technical Aspects

Existing frameworks (Nagios/SAM) for availability/reliability monitoring are available and used on a daily basis, also to certify or suspend sites. Currently they are being extended with additional security tests.

 

The key objective is to identify:

-       Sites vulnerable to a given serious vulnerability (typically a root exploit)

-       Sites with no patching process whatsoever

 

The main users are the ROC security contacts/NGI security officers (about 20 people). Nobody else can view the results, and encrypted transport is used from the WN to the central security monitoring server. No information on the findings is available from SAM.

 

It only uses data/information available to any user; therefore it is not using any private information.

They are just checking the system installations, not the gLite packages or other additional packages.

 

The process collects data and concludes on the security status of the WN. The tests are non-intrusive:

-       No violation of the site security controls or policy

-       No active test of security vulnerabilities (exploits, etc.)

-       No penetration testing

 

J.Gordon noted that several Sites use tar files for their installations.

R.Wartel replied that currently the security verifications check the versions of the installed RPM packages. This could be extended to .deb files too.
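
In the same spirit, here is a minimal Python sketch of such a non-intrusive check, comparing locally queryable RPM versions against a list of known-bad versions; the package names and versions below are invented placeholders, not real advisories, and this is not the actual SAM probe:

    import subprocess

    # Package -> versions known to be exploitable (invented placeholders).
    VULNERABLE = {
        "kernel": {"2.6.18-128.el5"},
    }

    def installed_versions(package):
        # 'rpm -q' is readable by any local user; no privileges needed.
        out = subprocess.run(
            ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}\n", package],
            capture_output=True, text=True)
        return set(out.stdout.split()) if out.returncode == 0 else set()

    def check_host():
        findings = []
        for pkg, bad in VULNERABLE.items():
            hit = installed_versions(pkg) & bad
            if hit:
                findings.append((pkg, sorted(hit)))
        return findings  # would be sent encrypted to the central server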

6.2      Strategy Followed

The current benefits are:

-       Targeted alerting system for a given vulnerability

-       “Live” view of the patching status of the infrastructure with regards to known security holes

 

The caveats are:

-       Currently running via SAM (not via SWAT)

-       As with any “beta” software, there may be bugs in the code (false positives, false negatives)

-       ROC/NGI security contacts need some time to practice and understand the implications of the results, and communicate with their sites

 

The plans are:

-       Enable each NGI/ROC to use the framework for its own regional monitoring

-       Release the whole tool for the sites, if they wish to deploy a local service

 

Status of the current pilot phase

-       First overall view of the infrastructure in August 2009

-       Initial results highlighted an immediate threat for the infrastructure.

-       Several alerts were sent to the sites; the issue was raised at the last GDB and the EGEE PMB.

-       Not all sites were pleased to be alerted about critical OS vulnerabilities in their farms

 

Note: Some of the details discussed are not reported here because of their security-sensitive nature.

 

Decision:

The WLCG MB fully supported and endorsed this monitoring approach; Sites should proceed urgently with the needed patches. They should be reminded at the GDB and the daily Operations meetings.

 

 

7.   Verification of Installed Capacity (Slides) – S.Traylen

 

 

S.Traylen reported the status and issues with collecting the installed capacity at the WLCG Sites.

7.1      Tier 2 CPU Capacity

A report is being produced every month (https://twiki.cern.ch/twiki/bin/view/LCG/SamMbReports), e.g. Tier2_Reliab_200908.pdf.

 

All software for Installed Capacity is now released, with information providers for the lcg-CE and CREAM, available in YAIM for DPM, lcg-CE and CREAM.

It is fully documented, with detailed documents on Installed Capacity, YAIM documentation and instructions for running the benchmarks; all Sites can now publish their installed CPU capacity.

7.2      Major Issues with CPU Reporting

Some examples of the major issues that one can observe are:

-       Major Omissions, i.e. zero or null capacity:

-       OSG sites are all missing. OSG have provided data via MyOSG (see http://tinyurl.com/y9hfl5s) and the GridView team has been contacted to include it in the report. This is easier now that the report has switched to installed capacity.

-       AU-ATLAS - Australia-UNIMELB-LCG2: Site closed for good, should drop out next month.

-       IL-HEPTier-2 - IL-TAU-HEP: Completely unclear why this is null; the GridView team has been contacted.

-       UK-London - UKI-LT2-IC-LESC: Site is closed in GOCDB.

7.3      Random Tier 2 CPU Anomalies

There are clearly inconsistencies, but GStat2 is detecting them. Sites and their respective Tier-1s should please review the GStat2 results.

GStat2 can never determine whether a capacity is correct, only whether it is sane. Read the report or GStat2 for instantaneous values.

 

Below, as examples, are some clear Site anomalies that the Sites should have identified and fixed:

 

HEPHY-Vienna 416*physical, 416*logical:

Probably wrong, unless the processors really are single-core.
gstat2 shows 10 errors (see http://gstat-prod.cern.ch/gstat/site/Hephy-Vienna/ce/), e.g.:
ERROR: hephygr.oeaw.ac.at:2119/jobmanager-lcgpbs-dteam, GlueCEPolicyAssignedJobSlots has negative or null value
ERROR: hephygr.oeaw.ac.at, The Cores format is wrong, Cores not set
ERROR: hephygr.oeaw.ac.at, The Benchmark format is wrong, Benchmark not set

I.e. the site is not publishing:
GlueHostProcessorOtherDescription: Cores=<N>, Benchmark=<X>-HEP-SPEC06
If it were, the check <Cores>*<Physical> = <Logical> could be performed; gstat2 cross-checks these 3 values.
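
A minimal Python sketch of this cross-check, assuming the GLUE 1.3 attribute format quoted above (the example values are invented):

    def parse_other_description(value):
        # e.g. "Cores=4,Benchmark=8.5-HEP-SPEC06"
        fields = dict(item.split("=", 1) for item in value.split(","))
        return int(fields["Cores"]), fields.get("Benchmark")

    def check_subcluster(other_description, physical_cpus, logical_cpus):
        cores, benchmark = parse_other_description(other_description)
        errors = []
        if benchmark is None or not benchmark.endswith("-HEP-SPEC06"):
            errors.append("Benchmark not set or wrong format")
        if cores * physical_cpus != logical_cpus:
            errors.append(f"Cores*Physical != Logical "
                          f"({cores}*{physical_cpus} != {logical_cpus})")
        return errors

    # Invented example: 100 dual-socket quad-core nodes.
    print(check_subcluster("Cores=4,Benchmark=8.5-HEP-SPEC06", 200, 800))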

 

UKI-NORTHGRID-LANCS-HEP

1 Logical CPU spread over 656 physical CPUs. Obviously wrong. 5 errors reported by GStat2.

 

Finland - CSC

4 Logical CPUs spread over 64 physical CPUs. 6 errors reported by GStat2.

7.4      Gstat2 Results

Some useful links for reading and browsing Gstat2 results:

-       GStat2 - Pretty Page http://gstat-prod.cern.ch/gstat/summary to search by site, region or tier.

-       Gstat2 - Less Pretty Nagios Results https://gstat-dev.cern.ch/nagios/ to search by site-bdii hostname.

-       Gstat2 view requests: project-grid-info-support@cern.ch

 

There are/may be bugs, though (e.g. https://savannah.cern.ch/bugs/?55235).

New ones should be reported to the EGEE-OAT Savannah: https://savannah.cern.ch/projects/sa1tools/

 

Also look at the OAT Nagios, ROC or Site instances; these have the same probes running.

 

Gstat2 Results

Slide 7 shows the CE Gstat2 results:

-       Shows #errors and #warnings from gstat2. Ordered by #errors

-       Totals: 268 Site BDIIs with CEs. 6202 errors, 9497 warnings.

-       Many duplicates (e.g. every CESEbind gives one error)

-       Only 10 Sites fully pass

 

Slide 8 shows the SE Gstat2 results:

-       Shows #errors and #warnings from gstat2. Ordered by #errors

-       Totals: Site BDIIs with SEs: 268. Errors: 15522. Warnings: 12995

-       Only 7 Sites fully pass

7.5      Related Work - SubClusters

SubCluster publishing in EGEE must be improved; this was requested by a number of regions at the EGEE conference. Its absence does NOT stop correct publication of installed capacity, but improving it will make occupancy figures easier, or possible, to compute.

 

New code for YAIM exists but needs another testing round. The YAIM code has no semantics; it is basically a bash representation of GLUE.

 

The OSG GIP plugin seems better and works; OSG publishes multiple non-overlapping SubClusters today. The OSG plugin should be reviewed for inclusion in EGEE.

7.6      Conclusions

The main actions to undertake are:

-       Add OSG results to report.

-       Sites can now view sanity errors via gstat2.

-       T1s please follow up with sites.

-       Open GGUS tickets or submit bugs where you disagree.

 

Reference documents for Installed Capacity: https://twiki.cern.ch/twiki/pub/LCG/WLCGCommonComputingReadinessChallenges/WLCG_GlueSchemaUsage-1.8.pdf

 

Pledges information should also be added to the Tier-2 reports.

J.Gordon asked whether the verification probes are in production already.

S.Traylen replied positively, but added that for the moment GGUS tickets are not issued when anomalies are found; they want to wait until the migration to Nagios is complete.

 

Decision:

The MB supported the request that all WLCG Sites verify their installed capacity numbers in Gstat2. This possibility should be advertised at the meetings, and milestones for the Sites should be defined.

 

 

8.    AOB

 

 

 

Upgrade LCG CE and WMS

M.Schulz reminded the Sites that they must update the LCG CE and the WMS, which have many bug fixes and have been out for 6 months. The Tier-1 Sites should also remind their Tier-2 Sites.

 

 

9.    Summary of New Actions

 

 

 

No new actions.