LCG Management Board
Tuesday 29 September 2009 16:00-17:00 – Phone Meeting
(Version 1 – 3.10.2009)
A.Aimar (notes), J.Bakken, I.Bird (chair), K.Bos, D.Britton, T.Cass, L.Dell’Agnello, M.Ernst, P.Flix, S.Foffano, Qin Gang, J.Gordon, D.Groep, A.Heiss, M.Kasemann, P.McBride, A.Pace, B.Panzer, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout
Mailing List Archive
Next Meeting: Tuesday 6 October 2009 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No comments received about the minutes of the previous meeting.
1.2 Quarterly Report Preparation – A.Aimar
A.Aimar proposed the dates for the Quarterly Reports from the Experiments.
The agreed dates are:
- 6 October: ALICE, CMS
- 13 October: LHCb
- 20 October: ATLAS
1.3 GridView Availability Calculations – I.Bird
As agreed at the previous MB Meeting, GridView have just fixed the problem of wrongly counting unscheduled downtimes in the reliability calculations. The reports for September will show the correct implementation of the agreed definition of reliability.
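The effect of how downtime is counted can be sketched with a small calculation. The sketch assumes the usual WLCG-style definitions, which are not restated in these minutes: availability is uptime over total time, and reliability is uptime over total time minus scheduled downtime only. The function names and the sample figures are hypothetical, and this is not GridView code. The last function is included purely for comparison; the minutes do not say that this was the actual GridView problem.

```python
def availability(up_hours, total_hours):
    """Fraction of the whole period during which the site was up."""
    return up_hours / total_hours


def reliability(up_hours, total_hours, scheduled_down_hours):
    """Uptime relative to the time the site was expected to be up.

    Assumed definition: only *scheduled* downtime is removed from the
    denominator, so unscheduled downtime still lowers the result.
    """
    expected_hours = total_hours - scheduled_down_hours
    if expected_hours <= 0:
        raise ValueError("scheduled downtime covers the whole period")
    return up_hours / expected_hours


def reliability_ignoring_unscheduled(up_hours, total_hours,
                                     scheduled_down_hours,
                                     unscheduled_down_hours):
    """For comparison only: also removing unscheduled downtime inflates the figure."""
    expected_hours = total_hours - scheduled_down_hours - unscheduled_down_hours
    return up_hours / expected_hours


# Hypothetical 30-day month: 720 h total, 24 h scheduled and 36 h unscheduled downtime.
total, scheduled, unscheduled = 720, 24, 36
up = total - scheduled - unscheduled

print(round(availability(up, total), 4))
print(round(reliability(up, total, scheduled), 4))
print(round(reliability_ignoring_unscheduled(up, total, scheduled, unscheduled), 4))
```

With these sample numbers the reliability is higher than the availability, because the scheduled downtime is forgiven, but it stays below 100% because the unscheduled downtime is not.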
1.4 Update on Tier-1 Pledges – S.Foffano
S.Foffano is expecting the confirmation of the 2010 pledges from the Tier-1 Sites. Not all Sites replied by the deadline of 28 September.
She asked for news from the following Sites (their answers are given below):
- TRIUMF: R.Tafirout replied that they will confirm what has been installed so far for 2009 and their pledges for 2010.
- IN2P3: Not present at the meeting, she will ask via email.
- INFN: Not present at the meeting, she will ask via email.
- NL-T1: Will answer via email
- NDGF: no information from Sweden. O.Smirnova said that she will send some incomplete information with the current proposed procurement draft.
- ASGC: S.Foffano will forward the questions to Qin Gang.
2. Action List Review (List of actions)
Not done by: FR-CCIN2P3, NDGF and
updates from NDGF.
3. Operations Weekly Report (Slides)
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meeting summaries are available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
The report covers the three weeks from 6 to 27 September. There was a mixture of problems, mostly database related.
There was only one test alarm ticket, from CERN to BNL-T1. It was acknowledged by BNL but did not automatically open an OSG ticket. New global tests will be performed this week.
Incidents leading to service incident reports
- CERN: Two separate LHCb CASTOR blockages 7 and 8 September.
- FZK: Access to ATLAS DB blocked on 7 September with too many open sessions. Degraded from 8 to 16 September.
- RAL: Disk to Disk transfers failing from 15 to 17 September during a planned upgrade to the CASTOR Name Server.
- CERN: ATLAS Replication Tier0 to Tier1 down on 21 Sep.
It is hard to understand the several Oracle bugs/features/side-effects at CERN-LHCb, FZK-ATLAS and CERN-ATLAS.
Slide 4 shows the results of the VO SAM tests of the 4 Experiments. One can notice:
- FZK ATLAS 3D DB degradation from 7-16 Sep – see SIR
- RAL name server upgrade led to disk-to-disk copy failures from 15 to 17 Sep for all VOs. See SIR.
- NIKHEF had scheduled network maintenance on morning of 22nd. Three dCache-SRM crashes in one week – debugging with developers. SAN problem on 21st which cut off their BDII.
- IN2P3: scheduled Electrical Power upgrades from 22 to 24 September, extended to “at risk” for 2 days (now followed by the dCache migration to the Chimera file system, 28-30 September).
3.2 Service Incidents Reports (SIRs)
CERN CASTOR LHCb Outages
7 September CASTORLHCB stager was blocked from 05.00 for 3 hours when a table space could not be extended. This should be automatic but in fact an associated control file file-system was filled by Oracle dumps put there by a known bug in the clusterware. The lemon metric to warn of the full file-system was unfortunately not correctly configured. This has now been fixed and DES group is checking Oracle patch availability to prevent further core files.
On 8 September one of the nodes of the LHCb RAC cluster became blocked (login could not complete), with console logs showing errors on a disk partition. Clusterware should have ejected the node but did not, the suspicion being that the node was still partially responding. DB monitoring did detect the problem and the node was rebooted.
ATLAS 3D DB RAC at GridKa degradation Sep 7-16, 2009.
Due to too many open sessions on the first of two ATLAS RAC nodes we rebooted it on Sep 7. After rebooting, the DB was not properly open (this was to be seen only in the alert log). We tried to restart the instance several times, but not successfully.
On Sep 16 the second node rebooted for unknown reasons (maybe network). Surprisingly, afterwards the DB started properly on both nodes and everything seems fine now. Therefore, on Sep 17 we closed the SR.
From 8 to 16 Sep the service was at risk from a second failure and possible overloading (activity was in fact low during this period).
Follow up: The cause is not clear, but we suspect this problem to be an unexpected consequence of the Oracle July CPU patches that we applied on Aug 11. Had we rebooted the nodes (one after the other) directly after patching, we might have noticed this issue already on Aug 11. We are discussing whether to re-open the SR to find the real cause (during the SR, Oracle support was not very useful).
RAL Disk to Disk transfers failing 15-17 Sep
Disk to Disk (D2D) transfers started failing during a planned upgrade to the Name Server and were down for all VOs for 44 hours.
After applying a scheduled upgrade of the Castor NS from version 2.1.7-27 to version 2.1.8, during testing it was found that Disk to Disk (D2D) copies started failing across all instances. Any possible link with the NS upgrade was ruled out. Investigations carried out with the assistance of CASTOR developers at CERN revealed that there was an LSF job scheduler problem resulting in D2D transfer jobs failing. After LSF was restarted, both on the central servers and on all disk servers, the D2D transfer problem disappeared.
Why this problem started affecting services after the NS upgrade is so far unknown as it has not been possible to reproduce. The D2D transfer problem had also been detected on the certification instance immediately prior to this upgrade during NS testing. Since D2D transfers are unrelated to the NS, it was decided to proceed with the upgrade. We now believe that the problem was caused by a wrong procedure for stopping castor services and LSF prior to the NS upgrade, and bringing them up afterwards.
Follow-up: The certification instance must always be fully functional, and no future upgrades should proceed if anything is broken. A clearly defined startup and shutdown sequence for the LSF scheduler (both the central LSF master and the daemons on all disk servers) and other CASTOR services needs to be written and tested, and should be used during future upgrades.
ATLAS Replication Tier0->Tier1 down Sep 21st
Capture aborted after ORA-01280: Fatal LogMiner Error. Failure occurred at 8:05 and was not noticed by shifter until 18:00 due to lack of email notification from monitoring.
The lack of email notification from the monitoring was caused by an overload of sendmail (not the central sendmail), which sends a number of emails at 8:00. Because of the overload, the message from the monitoring was not sent. In the meantime the monitoring web page was not checked by the shifter. Replication to Tier1 sites was completely re-established around 18:00.
ATLAS DBAs noticed that the capture process was down but sent a private email to a person on holidays. Mail should have been sent to the physics database support email list.
Follow up: The shifter is now obliged to check the monitoring web page. Monitoring is being reviewed to make sure that email notifications will be sent each time an abort occurs. For each message that the monitoring was not able to send in 3 tries, a thread will be created that keeps trying to send the message until it succeeds.
We are in contact with Oracle support in order to identify the cause of the LogMiner error.
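The notification follow-up described above (3 tries, then a thread that keeps trying until success) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the actual monitoring: the names send_with_retry and try_send, the retry_delay parameter, and the choice of OSError as the failure signal are all hypothetical.

```python
import threading
import time


def try_send(send, message):
    """Call the sender once; report success instead of raising."""
    try:
        send(message)
        return True
    except OSError:  # assumption: a failed send raises OSError
        return False


def send_with_retry(send, message, max_tries=3, retry_delay=1.0):
    """Try to send up to max_tries times, then hand over to a thread.

    Returns None if the message was sent directly, otherwise the
    background thread that keeps retrying until the send succeeds.
    """
    for _ in range(max_tries):
        if try_send(send, message):
            return None
        time.sleep(retry_delay)

    def keep_trying():
        # Retry until the message finally goes out.
        while not try_send(send, message):
            time.sleep(retry_delay)

    worker = threading.Thread(target=keep_trying, daemon=True)
    worker.start()
    return worker
```

A caller would pass its own sending function, for example a wrapper around the local mail command, and can join the returned thread if it needs to know when the message finally went out. In a real service one would probably also log each failed attempt, so that a persistently overloaded mailer is visible on the monitoring page.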
3.3 Miscellaneous Reports
Several other shorter reports were mentioned:
- CREAM-CEs at CERN failing for ALICE job submission from 7 to 14 September with ‘blparser service is not alive’. This needed detailed investigation and understanding.
- ASGC had problems migrating CMS data to tape over last week – tracked down to badly configured cartridges in the CMS pool. Planning to go from 6 to 24 tape drives mid-October so tape performance will be limited till then.
- LHCb Dirac has now been certified for SL5. CERN LHCb SRM upgraded to 2.8 to properly support xroot TURL’s.
- Many WLCG staff at EGEE’09 in Barcelona last week giving presentations and demonstrations (at the WLCG booth) and attending EGI/HEP SSC meetings.
- RAL had problems with their Maui/Torque batch scheduler since converting to SL5. Suspected due to running 32-bit server version but cleared when server and worker node clients were brought up to the same software level.
- CMS plan to hold what they call the October Exercise, involving all physics groups, in which they act as if they were trying to push out the first physics papers in a hurry after they have data. It starts on October 5th and lasts for 2 weeks with intensive grid job submission via their CRAB (CMS Remote Analysis Builder) servers.
- For this they needed to increase their number of Grid pool accounts at CERN (cmsxxx) from the current 200 to 999 and this has been done (thanks to AIS and FIO). These accounts appear on lxplus, lxbatch and cms VO-boxes but are not in AFS. They request in fact to go to 2000 for a long term solution.
- To appear in LDAP they also have to have a NICE account hence a mail account. An individual CCID can only have 1023 mail accounts so a second service provider ‘CMS GRID-USER2’ has been setup in CRA by AIS group.
I.Bird asked why so many email accounts are needed.
T.Cass replied that a different ID is needed for each user submitting jobs.
A.Pace added that these were created as secondary IDs of a single account. And there is a limitation to one thousand. Probably these should have been created as main accounts and there would not be limitations.
I.Bird and T.Cass proposed to solve the issue outside the meeting.
M.Schulz added that the versions installed at many Sites are at least 6 months old and several bugs have been fixed since.
- ATLAS will perform a throughput test of data distribution from Tier-0 to their Tier-1 for 5 days from 5 October and using version 2.2 of the FTS software.
- ATLAS user analysis test for 21-23 October (or following weekend if not ready in time) to get many user analysis jobs running over the world-wide resources. Users are meant to run their normal analysis jobs during this test. This is a follow-on test from STEP-09 and the last before data taking.
- Plan is for users to run analysis jobs on a set of large AOD datasets for the first two days with the third day for copying output to Tier 3s or local disks.
- In preparation for this test ATLAS are distributing 5 large containers of AOD (each of 100M events = 10 TB), two to each Tier 2 cloud. Expert users will then run over 3 containers so as to exercise at least two Tier 2 clouds with their analysis (these details are subject to change).
4. Summary of LHCC Referees Meeting – (Slides) – I.Bird
I.Bird presented a short summary of the LHCC Referees Meeting and the feedback received.
No comments from the MB.
4.1 Main Points in the Closed Session
Transition to SL5
The referees were worried that this ends up happening just when data is about to come.
“They assume MSS at sites will be materialized. And they lack Tier 2 disk”, “They are satisfied with the performance and ask sites to be more proactive” (CREAM).
DB access – Adding Frontier etc: “This is a bit surprising ... To wake up on this at this stage of experiment...”
SW performance improvement: “Sure there is still room for additional improvement”
CRSG support level of resources requested: “However CRSG report claims a significant change in LHCb computing model and recommends a review by LHCC for 2011”
Interaction with CRSG:
“Experiments strongly desire that the final proposed numbers reflect their estimations and not those from the CRSG limited model”
“In some cases (ALICE) non-fully resolved discrepancies are still present”
“Future iterations, this review process should be organised in such a way that avoids a double referee process and unnecessary delays and maybe involves LHCC early on”
Next WLCG mini-review:
“Now I can confirm you we will cancel our meeting in November.”
Scrutiny group and process:
“Concerning your comments on CRSG/LHCC coordination and the overall timeliness of the resources review process, we have discussed it in closed session”.
“We consider that we should suggest now CERN management LHC management to come up with an early and some conservative estimation of the total 2011 LHC running time. This information is needed by the funding agencies at next RRB in order to define their procurement cycles.
This should be the first step towards establishing a well-defined and coherent review process.” “We have to consider various running scenarios”.
5. GDB Summary (SL5, CREAM, HEP-SPEC06, etc) (Text) – J.Gordon
J.Gordon summarized the September GDB Meeting by reading the text linked above.
No comments from the MB.
I.Bird asked whether the discussion on SCAS and Argus will take place at the next GDB.
J.Gordon replied that he will add it to the GDB Agenda.
6. Security Monitoring (Slides) – R.Wartel
R.Wartel presented the general situation on Security at the WLCG Sites.
6.1 Technical Aspects
Existing frameworks (Nagios/SAM) for availability/reliability monitoring are available and used on a daily basis, also to certify or suspend sites. Currently they are being extended with additional security tests.
The key objective is to identify:
- Sites vulnerable to a given serious vulnerability (typically a root exploit)
- Sites with no patching process whatsoever
The main users are the ROC security contacts/NGI security officers (about 20 people). Nobody else can view the results, and encrypted transport is used from the WN to the central security monitoring server. No information on the findings is available from SAM.
It only uses data/information available to any user; therefore it is not using any private information.
They are just checking the system installations, not the gLite packages or other additional packages.
The process collects data and draws conclusions on the security status of the WN.
- Non intrusive tests:
- No violation of the site security controls or policy
- No active test of security vulnerabilities (exploits, etc.)
- No penetration testing
J.Gordon noted that several Sites use tar files for their installations.
R.Wartel replied that currently the security verifications are checking the versions of .rpm. This could be extended to .deb files too.
6.2 Strategy Followed
The current benefits are:
- Targeted alerting system for a given vulnerability
- “Live” view of the patching status of the infrastructure with regards to known security holes
The caveats are:
- Currently running via SAM (not via SWAT)
- As any “beta” software, possible bugs in the code (false positive, false negative)
- ROC/NGI security contacts need some time to practice and understand the implications of the results, and communicate with their sites
The plans are:
- Enable each NGI/ROC to use the framework for its own regional monitoring
- Release the whole tool for the sites, if they wish to deploy a local service
Status of the current pilot phase
- First overall view of the infrastructure in August 2009
- Initial results highlighted an immediate threat for the infrastructure.
- Several alerts sent to the sites, the issue was raised during the last GDB and EGEE PMB.
- Not all sites were pleased to be alerted about critical OS vulnerabilities in their farm
Note: Some of the details discussed are not reported here because of their security-sensitive nature.
The WLCG MB fully supported and endorsed this monitoring approach and Sites should proceed urgently with the needed patches. They should be reminded in the GDB and the daily Operation meetings.
7. Verification of Installed Capacity (Slides) – S.Traylen
S.Traylen reported the status and issues with collecting the installed capacity at the WLCG Sites.
7.1 Tier 2 CPU Capacity
A report is being produced every month (https://twiki.cern.ch/twiki/bin/view/LCG/SamMbReports), e.g. Tier2_Reliab_200908.pdf
All software for Installed Capacity is now released, with info providers for lcg-CE and CREAM, and is available in YAIM for DPM, lcg-CE and CREAM.
It is fully documented, with detailed documentation on Installed Capacity, YAIM documentation and instructions for running benchmarks. All sites can now publish their installed CPU capacity.
7.2 Major Issues with CPU Reporting
Some examples of the major issues that can be observed are:
- Major omissions, i.e. zero or null capacity:
- OSG sites are all missing. OSG have provided data via MyOSG. See http://tinyurl.com/y9hfl5s
- AU-ATLAS - Australia-UNIMELB-LCG2: Site closed for good, should drop out next month.
- IL-HEPTier-2 - IL-TAU-HEP: Completely unclear why this is Null - GridView team contacted.
- UK-London - UKI-LT2-IC-LESC: Site is closed in GOCDB.
7.3 Random Tier 2 CPU Anomalies
There are clearly inconsistencies, although GStat2 is detecting them. Sites and their respective Tier-1s are asked to review the GStat2 results.
GStat2 can never determine a correct capacity, only a sane one. Read the report or gstat2 for instantaneous values.
As examples below are some clear Sites anomalies, that the Sites should have identified and fixed:
- HEPHY-Vienna: 416 physical CPUs, 416 logical CPUs. Probably wrong unless these are really single-core processors.
- 1 Logical CPU spread over 656 physical CPUs. Obviously wrong. 5 errors reported by GStat2.
- Finland - CSC: 4 Logical CPUs spread over 64 physical CPUs. 6 errors reported by GStat2.
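The kind of sanity check that flags the anomalies above can be sketched in a few lines. This is only an illustration: the rules and the function name check_cpu_counts are hypothetical and are not the actual GStat2 probes. As noted above, such a check can only say whether the published numbers are sane, not whether they are correct.

```python
def check_cpu_counts(physical_cpus, logical_cpus):
    """Return a list of (severity, message) findings for published CPU counts.

    Hypothetical rules, chosen to match the examples in the minutes:
    - a zero or missing count is an error (null capacity),
    - fewer logical than physical CPUs is an error,
    - equal counts are only plausible for single-core processors (warning).
    """
    findings = []
    if not physical_cpus or not logical_cpus:
        findings.append(("error", "zero or missing CPU count"))
    elif logical_cpus < physical_cpus:
        findings.append(("error", "fewer logical CPUs than physical CPUs"))
    elif logical_cpus == physical_cpus:
        findings.append(("warning", "logical equals physical: single-core only?"))
    return findings


# The three anomalies quoted in the minutes, plus one plausible site.
for physical, logical in [(416, 416), (656, 1), (64, 4), (100, 400)]:
    print(physical, logical, check_cpu_counts(physical, logical))
```

An empty list means that the pair of numbers is at least self-consistent.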
7.4 Gstat2 Results
Some useful links for reading and browsing Gstat2 results
- GStat2 - Pretty Page http://gstat-prod.cern.ch/gstat/summary to search by site, region or tier.
- Gstat2 - Less Pretty Nagios Results https://gstat-dev.cern.ch/nagios/ to search by site-bdii hostname.
- Gstat2 view requests firstname.lastname@example.org
There are, or may be, bugs though: https://savannah.cern.ch/bugs/?55235
New ones should be submitted to the EGEE-OAT Savannah: https://savannah.cern.ch/projects/sa1tools/
Also look at the OAT Nagios, ROC or Site instance. Both of these have same probes running.
Slide 7 shows the CE Gstat2 results:
- Shows #errors and #warnings from gstat2. Ordered by #errors
- Totals: 268 Site BDIIs with CEs. 6202 errors, 9497 warnings.
- Many duplicates (e.g. every CESEbind gives one error).
- Only 10 Sites fully pass
Slide 8 shows the SE Gstat2 results:
- Shows #errors and #warnings from gstat2. Ordered by #errors
- Totals: Site BDIIs with SEs: 268. Errors: 15522. Warnings: 12995
- Only 7 Sites fully pass
7.5 Related Work - SubClusters
We must improve SubCluster publishing in EGEE. Was requested by a number of regions at EGEE conference. This does NOT stop correct publication of installed capacity. It will make occupancy figures easier/possible.
New code for YAIM exists but needs another testing round. YAIM code has no semantics. A bash representation of glue basically.
OSG GIP plugin seems better and works. OSG publishes multiple non-overlapping SubClusters today. Review OSG plugin for EGEE inclusion.
The main actions to undertake are:
- Add OSG results to report.
- Sites can now view sanity errors via gstat2.
- T1s please follow up with sites.
- Submit GGUS tickets or bugs where you disagree.
Reference documents for Installed Capacity: https://twiki.cern.ch/twiki/pub/LCG/WLCGCommonComputingReadinessChallenges/WLCG_GlueSchemaUsage-1.8.pdf
Pledges information should also be added to the Tier-2 reports.
J.Gordon asked whether the verification probes are in production already.
S.Traylen replied positively but added that for the moment GGUS tickets are not issued when anomalies are found. They want to wait until the migration to Nagios is complete.
The MB supported the request that all WLCG Sites must verify the installed capacity numbers in Gstat2. The possibility should be advertised at the Meetings and milestones for the Sites should be defined.
Upgrade LCG CE and WMS
M.Schulz reminded the Sites that they must update the LCG CE and the WMS: the new version has many bugs fixed and has been out for 6 months. The Tier-1 Sites should also remind their Tier-2 Sites.
9. Summary of New Actions
No new actions.