Week of 090810

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Harry, Patricia, Gang, Andrea, Sophie, Jacek, Olof, Alessandro, Ueda);remote(Xavier/FZK, Gareth/RAL, Alexei/ATLAS ADC).

Experiments round table:

  • ATLAS - 1) The MCDISK space token at FZK filled up over the weekend. A ticket was submitted and more space was added. 2) NIKHEF is in scheduled downtime for the next two weeks (though GOCDB says only one), so Functional Tests will be sent to SARA to continue testing the NL-T1 cloud.

  • CMS reports - CMS uses a private version of RFIO with a common worker-node client library for interfacing to CASTOR and DPM. The French Tier-2 at IPHC (Strasbourg) finds that this does not work with the default SLC5 configuration because an ld path entry is missing (see the sketch after this list).

  • ALICE - Production going ok. The latest Alien version 2.17 has been installed everywhere and services will now be restarted where needed.
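
The missing ld path mentioned in the CMS report above is a generic dynamic-linker configuration issue on the worker nodes. As a rough illustration only - the actual library location and fix were not given in the report, so the paths in this Python sketch are hypothetical placeholders - a check of this kind could be run on an SLC5 node:

    # Sketch only: LIB_DIR and CONF_FILE are hypothetical placeholders, not the
    # actual IPHC paths. Checks whether the directory holding the private RFIO
    # client library is visible to the dynamic linker and, if not, prints the
    # ld path entry that would need to be added.
    import os
    import subprocess

    LIB_DIR = "/opt/cms/rfio/lib"                    # hypothetical library location
    CONF_FILE = "/etc/ld.so.conf.d/cms-rfio.conf"    # hypothetical drop-in file

    def linker_knows(path):
        """True if the ldconfig cache already lists libraries under `path`."""
        cache = subprocess.check_output(["/sbin/ldconfig", "-p"], text=True)
        return path in cache

    in_env = LIB_DIR in os.environ.get("LD_LIBRARY_PATH", "").split(":")
    if linker_knows(LIB_DIR) or in_env:
        print("RFIO client library directory is already visible to the linker")
    else:
        print("Missing ld path entry; as root one would run:")
        print("  echo %s > %s && /sbin/ldconfig" % (LIB_DIR, CONF_FILE))

The real fix would of course use whatever directory the CMS RFIO client library is actually installed in at the site.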

Sites / Services round table:

* NIKHEF: Moving all grid service machines and network infrastructure to a new data center. All services will be unavailable during this period. Start of downtime [UTC]: 10-08-2009 07:00, End downtime [UTC]: 14-08-2009 15:00

* IN2P3: The AMI (ATLAS Metadata Interface) capture is down due to an error in LogMiner. The problem is under investigation. Streams replication to CERN is therefore stopped.

* RAL: Just been notified of an air conditioning problem in the new computer centre - planning to stop all batch work.

* FZK: Had memory problems in dCache on Saturday, then lost one of their tape libraries on Saturday evening. The library was fixed this morning but is still running at risk. A technician will come tomorrow; a downtime has been scheduled from 9 am for 4-5 hours. The tape service will be degraded but not down.

AOB:

Tuesday:

Attendance: local(Sophie, Alessandro, Olof (chair), Jacek, Andrea, Gang, Markus, Patricia, Antonio);remote(Jeremy, Alexei, Xavier Mol (FZK), Angela, Gareth, Brian, Elisabetta).

Experiments round table:

  • ATLAS - (Alessandro) Two issues: 1) RAL ATLASMCDISK is still down. 2) The Santa Claus subscription for data export has been modified for the muon data taking starting next week; the details are in the elogger and the Twiki.

  • CMS reports - (Andrea) no significant issue to report.

  • ALICE - (Patricia) The latest AliEn version (2.17) has been installed centrally at CERN but is so far only available at three sites: IN2P3, the INFN Tier-2 and CERN. ALICE will watch how it performs before deciding on a wider deployment later this week. Two tickets (one new, one updated from two weeks ago) are open about the CREAM CE at CERN: numbers 50508 (old) and 50883. Some small issues are being followed up with Tier-2 site admins.

Sites / Services round table:

  • RAL (Gareth, report submitted prior to meeting) - Just before 14:00 (local time) we were notified of an air conditioning problem in the computer room (in the new building). We immediately reduced the batch load but, as temperatures were still rising, this quickly became a stop of all Castor and batch worker nodes, which were powered off. Air conditioning was restored fairly quickly after that and, after allowing a period of stability for the temperatures to fall back, systems were restarted. We ended the Outage in the GOC DB with systems (Castor, batch, FTS) back at 20:00. Other services continued running. An "At Risk" was declared in the GOC DB until midday today. This morning there were a significant number of minor issues to resolve (e.g. individual disks that had gone offline). Work is ongoing to understand the underlying cause of the problem and what appropriate changes should be made. The ATLASMCDISK issue reported by ATLAS above should have been fixed yesterday evening; Alessandro confirmed that the report was from yesterday and it is OK now.
  • FZK (Angela) - nothing special to report. (Xavier) The dCache tape library was repaired but there are still some issues and it might break again. Another intervention needs to be scheduled.
  • GridPP - NTR
  • ASGC (Gang) - some CMS issues with Tier-2s being sorted out.
  • CERN (Sophie) - please test wms209, which has some new patches planned for the next release. Andrea/Patricia: we are both using it already. Antonio: there have been some changes applied by Maarten in the meantime, so it would be good to test again. Alessandro: ATLAS can run a HammerCloud test against it.
  • CNAF (Elisabetta) - ATLAS jobs are failing on SLC4/32-bit nodes. In contact with Platform about the problem. If there is no resolution, CNAF will upgrade to SLC4/64-bit or SLC5, probably in ~2 weeks. Olof: if it only happens on 32-bit, it is perhaps better not to wait for Platform to answer but just to go ahead and upgrade the WNs to SLC4/64-bit or SLC5.

AOB:

  • Brian: when trying to test the TCP settings between BNL and RAL we are having problems putting some files into the BNL SE because it does not appear in lcg-infosites, which takes its information from the BDII (see the sketch below). Alessandro: there is a long-standing issue with the site naming convention, which could be related, but we should wait for BNL to comment.
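
Since the discussion above hinges on whether the BNL SE is actually published in the top-level BDII (lcg-infosites only reports what the BDII publishes), a quick check can be made directly against the information system. The following is a minimal Python sketch, assuming the python-ldap module, the standard Glue 1.3 schema, and a placeholder SE hostname (the actual BNL endpoint is not named in these minutes):

    # Sketch: query a top-level BDII for a given SE. The SE hostname below is a
    # placeholder; the BDII endpoint is the standard CERN top-level BDII on port 2170.
    import ldap

    BDII = "ldap://lcg-bdii.cern.ch:2170"
    SE_HOST = "se.example.bnl.gov"   # placeholder - substitute the real BNL SE hostname

    con = ldap.initialize(BDII)
    results = con.search_s(
        "o=grid",
        ldap.SCOPE_SUBTREE,
        "(&(objectClass=GlueSE)(GlueSEUniqueID=%s))" % SE_HOST,
        ["GlueSEUniqueID", "GlueSEImplementationName"],
    )
    if results:
        print("SE %s is published in the BDII" % SE_HOST)
    else:
        print("SE %s is NOT published, so lcg-infosites will not list it" % SE_HOST)

The end-user equivalent is simply to look for the SE in the output of lcg-infosites --vo atlas se.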

Wednesday

Attendance: local(Harry(chair), Luca, Antonio, Gang, Olof, Wayne, Patricia, Julia, Oliver, Andrea, Alessandro);remote(Gareth/RAL, Jeremy/GRIDPP, Alexei/ATLAS ADC, Brian/RAL, Elisabetta/CNAF).

Experiments round table:

  • ATLAS (AdG) - Muon cosmics data taking is in progress with data export only to the three muon calibration sites. The project name is data09_muoncomm (can be seen in the ATLAS elogger). Brian asked if the project name could be updated in the Twiki which was immediately done.

  • CMS reports - Nothing special to report.

  • ALICE - Production jobs have dropped to 5000 because AliRoot jobs (current version) are now consuming large amounts of memory. Some sites have stopped using this version. Under investigation. The tickets on the CERN CE are still open.

Sites / Services round table:

* RAL (from broadcast): The RAL Tier1 (RAL-LCG2) suffered an air conditioning failure during the night of Tuesday-Wednesday 11-12 August. This is the second such failure. All batch and storage (Castor) services are down. An Outage on these services has been declared until tomorrow (13th) while investigations take place. Currently other services hosted at the RAL Tier1 are available, except the MyProxy service, which will be restarted as soon as practical.

Meeting update: The air conditioning failure at midnight was due to the chillers failing through lack of water flow, but the underlying cause is not yet clear. All batch and CASTOR are off but central services are up, with FTS serving only non-RAL channels. The GOCDB is in failover to its read-only copy, which happened before it could be updated. Confidence in the cooling is needed before running under full load, which might happen on Friday (definitely not Thursday).

* CNAF: Following the ATLAS failures on LSF 32-bit worker nodes, these will be upgraded to 64-bit and the ATLAS LSF configuration changed to point to them only. There will then be about 1050 64-bit job slots.

* GridPP: The MB has now requested sites to migrate worker nodes to SL(C)5, but two UK sites take ATLAS builds from elsewhere and these are failing under SL5, so they would like assurance that ATLAS will address this issue. Alexei, when asked, confirmed that ATLAS had told WLCG management they were ready for SL5. He asked what percentage of CERN worker nodes are SLC5; Harry thought 55% and rising.

* CERN databases: Trying to schedule integration database upgrades with the experiments for next Monday/Tuesday. As usual the production databases would come two weeks later.

* CERN PPS: gLite 3.1 Update 53 will contain WMS 3.2 together with a new LB and a GLUE2-enabled BDII. One remaining issue is a possible memory leak in the WMS, but there is not enough evidence to stop the release, which should be made next week.

AOB: Antonio asked if ALICE and ATLAS had been able to test the patched CERN WMS219 - both had, with no problems found.

Thursday

Attendance: local(Harry (chair), Olof, Sophie, Andrea, Gang, MariaDZ, Luca, Diana);remote(Angela+Xavier/FZK, Jeremy/GridPP, Michael/BNL, Gareth+Brian/RAL, Alexei/ATLAS).

Experiments round table:

  • ATLAS -

  • CMS reports - A networking problem affected CASTORCMS (and also CASTORALICE) and CMS raised an alarm ticket against CERN at 21:12. About half of the Tier-0 processing jobs on CRAFT data were failing because they could not access CASTOR data. This was diagnosed as a network switch problem. The complaint from CMS was that the ticket was not updated when the problem was resolved at about 02:10; they noticed jobs flowing again from about 03:00. (See the CERN report below for details, and apologies from the SMoD for not updating the ticket.)

  • ALICE -

Sites / Services round table:

* RAL: (by email confirmed at meeting) Current situation is that Castor and batch systems are down, other services (LFC etc.) are up. We have a better understanding of the air conditioning situation. It looks like a flow sensor. However, tests are not sufficiently far advanced to give us confidence in the system. We do not yet have a complete inventory of hardware problems following on from the second incident. We are working on better detection of temperature problems and system shutdown. This is largely in place, but would be a hard power off at the moment. We anticipate restoring services (as best estimate) on Monday. This will enable us to better monitor systems as compared to the weekend.

* IN2P3 (by broadcast): ATLAS AMI database Streams replication is down. Last Sunday we had a block corruption causing a failure of the capture process. In spite of this corruption we could restart the capture process while in contact with support. Unfortunately, since this morning the capture process is down and we are no longer able to restart it.

* FZK: The tape library failed for the third time and was definitively fixed between 09:00 and 13:00 today. This is one of three tape libraries at FZK.

* CERN (networking by email): During the night IT-CS were called for a problem affecting many services in the computer centre. The cause was a broadcast storm coming from the service IP328 that overloaded the router's CPU. Our engineer disconnected the switch IP328 and the router recovered. This morning the switch IP328 was checked and the broadcast storm seems to have been coming from port 19, connected to LXBRF2703. The machine was disconnected and the service reconnected: all fine. We have now also re-enabled LXBRF2703 and it seems to work properly. At the meeting it was pointed out that the other servers connected to the switch had been cut off from 02:10 till 08:00, so was this the only solution (no representative from CS was present)?

* ASGC: A shipment of disk servers to be added for ATLAS is expected next month.

AOB: (from EGEE broadcast) Following the air conditioning problem at RAL, the GOCDB was switched yesterday to its failover instance (database at CNAF-INFN and web portal at ITWM) in read-only mode. This failover instance is now fully operational in normal (read/write) mode.

(MariaDZ) ATLAS VO Support has a lot of GGUS tickets in status 'assigned', i.e. not yet handled. Details in https://gus.fzk.de/download/escalationreports/vo/html/20090810_EscalationReport_VOSupport.html Please go through and clean up.

Friday

Attendance: local(Harry(chair), Sophie, Jan, MatthiasS, Julia, Olof, Andrea, Luca, Gang, Patricia);remote(Angela+Xavier/FZK, Ronald (Nikhef), Gareth/RAL, Michael/BNL, Ian/CMS).

Experiments round table:

  • ATLAS -

  • ALICE - Have several tickets open against the CERN CREAM CE, one of them for two weeks. Sophie promised to look into them after the meeting.

Sites / Services round table:

* CERN: A Red Hat Linux exploit applies to the 2.6 kernel, so RH 4 and 5 and SL(C) 4 and 5 are vulnerable. LXPLUS SLC4 at CERN is protected against loading of modules, but LXBATCH and the SLC5 nodes are not. A workaround, adding the vulnerable modules to /etc/modprobe.conf, has been applied to CERN LXPLUS and LXBATCH, requiring a reboot if one of the vulnerable modules is already loaded (see the sketch below). Only one SLC5 LXPLUS node was rebooted (out of 50) and 150 worker nodes (out of 1100) were rebooted. Other CERN service providers have been warned and servers and VO boxes are currently being checked. Please ask me if you want the bugzilla link (mail to Harry.Renshall@cernNOSPAMPLEASE.ch)
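
Since a reboot is only needed where one of the affected modules is already loaded, a quick check like the following Python sketch could be used when applying the same workaround elsewhere. This is a sketch only: the module names are placeholders (the actual list was circulated with the advisory, not in these minutes), and the /etc/modprobe.conf entries referred to above are typically of the generic form "install <module> /bin/true".

    # Sketch: decide whether the modprobe.conf workaround needs a reboot on this
    # node. SUSPECT_MODULES is a placeholder list, not the real advisory list.
    SUSPECT_MODULES = ["example_proto1", "example_proto2"]

    def loaded_modules(path="/proc/modules"):
        """Return the set of kernel module names currently loaded."""
        with open(path) as f:
            return set(line.split()[0] for line in f)

    already_loaded = set(SUSPECT_MODULES) & loaded_modules()
    if already_loaded:
        print("Reboot required; affected modules already loaded: " + ", ".join(sorted(already_loaded)))
    else:
        print("No affected modules loaded; the modprobe.conf entries take effect without a reboot")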

* FZK: 1) Had a blocked firewall from 03:00 to 08:00 causing name resolution problems. 2) Found one CREAM CE and one ALICE VO box which had the vulnerable Linux modules loaded. 3) At midday had a kernel panic on an ATLAS disk server, losing access to several ATLAS disk-only pools. Rapidly fixed.

* RAL: There is now better understanding of the problem with the air conditioning system although there are some remaining issues to resolve and checks to be made. We are fairly confident that we will be able to restart services on Monday.

A separate problem has been found on the tape robot. A small amount of water, from condensation, has dripped onto the tape robot and penetrated inside. An assessment is being made of the extent of damage to the robot and any contamination of media. We await the outcome of that assessment. May delay the Monday restart of this robot.

* NIKHEF: Normal servers have been moved to the new computer centre and are up and running. The majority of disk servers and worker nodes will be moved next week.

AOB:

-- JamieShiers - 05 Aug 2009
