Week of 090810

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Harry, Patricia, Gang, Andrea, Sophie, Jacek, Olof, Alessandro, Ueda);remote(Xavier/FZK, Gareth/RAL, Alexei/ATLAS ADC).

Experiments round table:

  • ATLAS - 1) The MCDISK space token at FZK filled over the weekend. A ticket was sent and more space was added. 2) NIKHEF is in scheduled downtime for the next two weeks (though Gocdb says only one), so Functional Tests will be sent to SARA to continue testing the NL-T1 cloud.

  • CMS reports - CMS use a private version of RFIO with a common worker node client library for interfacing to CASTOR and DPM. The French Tier2 at IPHC (Strasbourg) finds that this does not work with the default SLC5 configuration since an ldpath entry is missing.

  • ALICE - Production going ok. The latest Alien version 2.17 has been installed everywhere and services will now be restarted where needed.

Sites / Services round table:

* NIKHEF: Moving all grid service machines and network infrastructure to a new data center. All services will be unavailable during this period. Start of downtime [UTC]: 10-08-2009 07:00, End downtime [UTC]: 14-08-2009 15:00

* IN2P3: The AMI (ATLAS Metadata Interface) capture is down due to an error on the LOG miner. The problem is under investigation. Streams replication to CERN is hence stopped.

* RAL: Just been notified of an air conditioning problem in the new computer centre - planning to stop all batch work.

* FZK: Had memory problems in dCache on Saturday, then lost one of their tape libraries on Saturday evening. The library was fixed this morning but is still running at risk. A technician will come tomorrow for a scheduled downtime starting at 9 am and lasting 4-5 hours. The tape service will be degraded but not down.

AOB:

Tuesday:

Attendance: local(Sophie, Alessandro, Olof (chair), Jacek, Andrea, Gang, Markus, Patricia, Antonio);remote(Jeremy, Alexei, Xavier Mol (FZK), Angela, Gareth, Brian, Elisabetta).

Experiments round table:

  • ATLAS - (Alessandro) One issue: RAL ATLASMCDISK is still down. Second: Santa Claus subscription for data export has been modified for the Muon data taking starting next week. The details are in Elogger and Twiki.

  • CMS reports - (Andrea) no significant issue to report.

  • ALICE - (Patricia) The latest version of Alien, 2.17, has been installed centrally at CERN but is only available at three sites: IN2P3, INFN tier-2 and CERN. Will watch how it performs this week before deciding on a wider deployment later this week. Two new tickets (one updated from two weeks ago) about the CREAM CE at CERN - ticket numbers are 50508 (old) + 50883. Following up some small issues with Tier-2 site admins.

Sites / Services round table:

  • RAL (Gareth, report submitted prior to meeting) - Just before 14:00 (local time) we were notified of an air conditioning problem in the computer room (in the new building). We immediately reduced batch load, but as temperatures were still rising this quickly became a stop of all Castor and batch worker nodes, which were powered off. Air conditioning was restored fairly quickly after that and, after allowing a period of stability for the temperatures to fall back, systems were restarted. We ended the Outage in the GOC DB with systems (Castor, batch, FTS) back at 20:00. Other services continued running. An "At Risk" was declared in the GOC DB until midday today. This morning there were a significant number of minor issues to resolve (e.g. individual disks that had gone offline). Work is ongoing to understand the underlying cause of the problem and what appropriate changes should be made. The issue with ATLASMCDISK reported by ATLAS above should have been fixed yesterday evening. Alessandro confirms that the report was from yesterday and it's OK now.
  • FZK (Angela) - nothing special to report. (Xavier) dCache tape library was repaired but there are still some issues and it might break. Another intervention needs to be scheduled.
  • GridPP - NTR
  • ASGC (Gang) - some CMS issues with Tier-2s being sorted out.
  • CERN (Sophie) - please test wms209, which has some new patches that are planned for the next release. Andrea/Patricia: we are both using it already. Antonio: there have been some changes applied by Maarten in between, so it would be good to test again. Alessandro: ATLAS can run a HammerCloud test against it.
  • CNAF (Elisabetta) - ATLAS jobs failing on SLC4/32bit nodes. In contact with Platform about the problem. If there is no resolution, CNAF will upgrade to SLC4/64bit or SLC5, probably in ~2 weeks. Olof: if it only happens on 32bit, it's perhaps better not to wait for Platform to answer but just go ahead and upgrade the WNs to SLC4/64bit or SLC5?

AOB:

  • Brian: when trying to test the TCP settings between BNL and RAL we are getting problems putting some files into the BNL SE because it doesn't appear in lcginfosites, which takes the info from the BDII. Alessandro: there is a long-standing issue with the site naming convention, which could be related, but we should wait for BNL to comment.

Wednesday:

Attendance: local(Harry(chair), Luca, Antonio, Gang, Olof, Wayne, Patricia, Julia, Oliver, Andrea, Alessandro);remote(Gareth/RAL, Jeremy/GRIDPP, Alexei/ATLASDC, Brian/RAL, Elisabetta/CNAF).

Experiments round table:

  • ATLAS (AdG) - Muon cosmics data taking is in progress with data export only to the three muon calibration sites. The project name is data09_muoncomm (can be seen in the ATLAS elogger). Brian asked if the project name could be updated in the Twiki which was immediately done.

  • CMS reports - Nothing special to report.

  • ALICE - Production jobs have dropped to 5000 due to aliroot jobs (current version) now eating large amounts of memory. Some sites have stopped using this version. Under investigation. Tickets on the CERN CE are still open.

Sites / Services round table:

* RAL (from broadcast): The RAL Tier1 (RAL-LCG2) suffered an air conditioning failure during the night Tuesday-Wednesday 11-12 August. This is the second such failure. All batch and storage (Castor) services are down. An Outage on these services has been declared until tomorrow (13th) while investigations take place. Currently other services hosted at the RAL Tier1 are available except the MyProxy service, which will be restarted as soon as practical.

Meeting update: Aircon failure at midnight was due to chillers failing for lack of water flow, but the underlying cause is not yet clear. All batch and CASTOR are off but central services are up, with FTS serving only non-RAL channels. The Gocdb failed over to its read-only copy before it could be updated. Confidence in the cooling is needed before running under full load, which might happen on Friday (definitely not Thursday).

* CNAF: Following the ATLAS failures on LSF 32-bit worker nodes these will be upgraded to 64-bit and the ATLAS LSF configuration changed to point to them only. There will then be about 1050 job slots in 64-bits.

* GridPP: The MB has now requested sites to migrate worker nodes to SL(C)5 but two UK sites take ATLAS builds from elsewhere and are failing in SL5 so would like assurances ATLAS will address this issue. Alexei, when asked, confirmed that ATLAS had told WLCG management they were ready for SL5. He asked what percentage of CERN WN are SLC5 and Harry thought 55% and rising.

* CERN databases: Trying to schedule integration database upgrades with the experiments for next Monday/Tuesday. As usual the production databases would come two weeks later.

* CERN PPS: gLite 3.1 update 53 will contain the WMS 3.2 together with new LB and glue2 enabled bdii. One issue left is a possible memory leak in the WMS but there is not enough evidence to stop the release which should be made next week.

AOB: Antonio asked if ALICE and ATLAS had been able to test the patched CERN WMS219 - both had with no problems found.

Thursday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • CMS reports - Alarm ticket raised against CERN at 21.12 - About half of Tier 0 processing jobs of CRAFT data failing to access CASTOR data. Diagnosed as network switch problem - resolved about 02.10 when faulty switch (giving broadcasting overload) was taken out of service.

  • ALICE -

Sites / Services round table:

* IN2P3 (by broadcast): ATLAS AMI database streams replication down. Last Sunday, we had a block corruption causing a failure of the capture process. In spite of this corruption, we could restart the capture process while we were in contact with support. Unfortunately, since this morning the capture process is down and we are no longer able to restart it.

AOB: (from EGEE broadcast) Following the aircon problem at RAL, GOCDB was switched yesterday to its failover instance (database at CNAF-INFN and web portal at ITWM) in read-only mode. This failover instance is now fully operational in normal (read/write) mode.

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 05 Aug 2009

Topic revision: r8 - 2009-08-13 - HarryRenshall
 