Week of 131209

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Simone (SCOD), Alessandro (ATLAS), Raja (LHCb), Przemek (CERN-DB), Vitor (CERN-PES), Maarten (ALICE)
  • remote: Xavier (KIT), Pepe (PIC), Sang-Un (KISTI), Rolf (IN2P3), Michael (BNL), Tiju (RAL), Jeremy (GridPP), Roger (NDGF), Onno (NL-T1), Rob (OSG)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • ATLAS_DDM_VOBOXes were unstable on Dec. 5th. Stable again since 02:00 UTC on Dec. 6th.
      • PilotFactories were also affected: degraded from 12:00 UTC to 24:00 UTC on Dec. 5th.
    • T0/T1
      • FZK-LCG2: Network trouble caused DNS lookup errors on Dec. 6th. GGUS:99571. Fixed.
      • FZK-LCG2: Transfer failures due to 'RQueued' (reported last Thursday) are still happening. Failure rate of around 10% since Dec. 8th.
      • TAIWAN-LCG2: Recovered from disk server crash on Oct. 30th. GGUS:98482 closed.

  • CMS reports (raw view) -
    • It has been a very quiet few days, largely just some scattered issues at scattered T2 sites.
    • The exception to this is CNAF, for which the storage was down for several days. It's back now.
    • I have just (13:40) learned that there is trouble with the CERN BDII that is making sites appear unavailable in SAM tests (GGUS:99521). Perhaps there will be an update by 15:00?

  • ALICE -
    • NTR

Sites / Services round table:

  • KIT: there will be 3 downtimes tomorrow: CMS dCache, firewall, tape management software. Thursday dCache for ATLAS will be upgraded as well.
  • PIC: finishing the SIR on the network incident that occurred last week. It will be provided ASAP.
  • BNL: there will be a 2h network intervention one week from now (next Monday). Besides T1 services, it will also affect access to LFC and FTS (and therefore T2 activity). On Tuesday of next week dCache will be upgraded to the SHA-2 compliant version.
  • IN2P3: downtime tomorrow. Operations portal down from 8:30 to 10:30.
  • NL-T1: downtime on December 17th for 24 hours of MSS maintenance. It will not be possible to stage files during that time. Maarten: what is the situation with the disk servers (which gave a lot of trouble in the past weeks)? Onno: they seem to be stable now after a lot of hardware replacement. New hardware should also arrive before the end of the year. Maarten: at the end of the process, a SIR should be provided (there was also some minimal data loss). Onno: will do.
  • NDGF: in downtime this morning for an upgrade of central storage services. On Wednesday there will be a network intervention which will affect some pools; therefore some data might be unavailable.
  • CERN DB: intervention on Wednesday (10 AM CET) on the WLCGR test and integration database.
  • ASGC: during the weekend, the data center suffered from high temperatures due to problems with the air conditioning, which caused some CASTOR disks to become unstable. The situation should improve on Monday morning.
  • Maarten for PES: it is very urgent to upgrade the CERN and SAM BDII to the latest version to make sure the FCR mechanism does not affect SAM tests. Also T1s are invited to upgrade.

AOB:

Thursday

Attendance:

  • local: Simone (SCOD), Kate (CERN-DB), Maarten (ALICE), Felix (ASGC), Luca (CERN-DSS), Raja (LHCb), Alessandro (ATLAS), Pablo (Dashboards), Vitor (CERN-PES), MariaD
  • remote: Xavier (KIT), Sang-Un (KISTI), Lisa (FNAL), Rolf (IN2P3), Roger (NDGF), Jose (CMS), Kyle (OSG), Dennis (NL-T1), Stefano (CNAF), Gareth (RAL), Thorsten (ATLAS), Pepe (PIC)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • ATLAS pilot factories were unable to submit new pilots on Dec 11 due to an upgrade of the production system (new status introduced). Quick patch was provided, new version to be released soon.
      • All jobs were killed starting Dec 10th, fixed on Dec 11th with a patch of ATLAS pilot factories.
      • openssl-1.0.1e-16.el6_5.x86_64 breaks job submission (->Peter Love)
        • affected version: glite-ce-cream-1.14.4-1.sl6.noarch; openssl version not working: openssl-1.0.1e-16.el6_5.x86_64; last version working OK: openssl-1.0.0-27.el6.x86_64 (reported by Andreas Haupt, DESY-ZN)
        • There might be multiple issues here, but the problem has to do with the latest update of the OS. Maarten will investigate offline and broadcast a recipe for the sites.
    • T0/T1:
      • No major issues to report

  • CMS reports (raw view) -
    • Again some stale information observed in the SAM BDII (GGUS:99684). It went back to normal before Maria Alandes could debug the problem. We'll keep an eye on it.
      • The SAM BDII at CERN was upgraded and this should fix the problem reported on Monday. This could/should be a different problem.
    • Some transient transfer issues with KIT after the switch of the SRM endpoint from cmssrm-fzk.gridka.de to cmssrm-kit.gridka.de. Fine now.

  • ALICE -
    • RRC-KI-T1: commissioning activities ongoing since late Nov
      • EOS, VOBOX, CEs

  • LHCb reports (raw view) -
    • Main activity is simulation at all sites.
    • Some DIRAC monitoring down due to work on re-indexing accounting tables.
    • T0:
      • SRM problem solved on Tuesday (stuck request - GGUS:99614)
    • T1:
      • IN2P3 : Problem with nagios probe (GGUS:99420) seems to have gone away (for now). Not sure we understand why.
        • The issue looks correlated with CVMFS. Yesterday CVMFS was upgraded, but the problem was still there afterwards and then disappeared, so there is no clear understanding of what happened. Will keep an eye on it.
      • GridKa : Possibly flickering jobSubmit to all CEs
      • GridKa : Also possible problem with batch system. Job submission to GridKa currently stopped. Local contact (Alexey) and admins informed.
        • GridKa: please submit a GGUS ticket concerning the issue. LHCb will do.

Sites / Services round table:

  • Sang-Un (KISTI): there is a network problem on the campus network (investigating). Also, SAN storage volumes managed by the hypervisor cannot be mounted, so the services running on virtual machines are currently off.
  • KIT: still in downtime for ATLAS dCache upgrade. Operations on the firewall will follow.
  • IN2P3-CC: downtime finished yesterday as planned. Everything OK.
  • PIC: next Thursday (19th), complete downtime of the center for the whole day for an intervention on the cooling system. PIC will also upgrade several services (all documented in GOCDB). Queues will be drained in advance.
    • Alessandro: for ATLAS, draining is not needed.
  • CERN DB: on Monday, upgrade of the ATLAS integration DB.
  • CERN Storage: on Monday, short upgrade of EOSCMS (almost transparent).

  • CERN CvmFS: the stratum 0 (cvmfs-stratum-zero.cern.ch) and stratum 1 (cvmfs-stratum-one.cern.ch) will migrate to new hardware and a new OS, and in the stratum 1 case also from 2.0.? to 2.1.15. The migration will be transparent for all stratum 1s that are replicating from the stratum 0. It will also be transparent for all CvmFS clients (both 2.0.* and 2.1.*) that are using the stratum 1. ITSSB entry.

AOB:

  • Middleware Readiness WG meeting TODAY at 4pm CET. Agenda and connection details here
Topic revision: r12 - 2013-12-12 - MaartenLitmaath
 