Week of 120813

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Alexandre, Cedric, Doug, Jan, Maarten, Raja);remote(Burt, Daniele, Dimitri, Jhen-Wei, Marc, Michael, Rob, Tiju, Ulf).

Experiments round table:

  • ATLAS reports -
    • T0/T1
      • EOS DATADISK space problem (GGUS:85057): errors on the dashboard and production jobs failing. Monitoring reported 150 TB free; the limit hit was not the space quota but the file-count quota of 10M files. Solution: file-count quota raised to 15M. (1.8 PB over 10M files implies an average of 180 MB/file, but we also have many small log files there.)
        • Jan: the next version will have better error messages, distinguishing file from space quotas
      • LSF submission times increased to several minutes (GGUS:85058). Early Sunday the service was completely dead (e.g. bsub from lxplus failed). Solution: at 17:20 Ulrich fixed it; apparently an errant user process had killed it.
        • Maarten: all 4 experiments were affected; will follow up with the LSF experts
          • the LSF team are looking into further mitigations to prevent query overload from unmindful user scripts, e.g. by replacing the bjobs command with an intelligent wrapper; no timeline yet, but this month looks unlikely (an illustrative sketch of such a wrapper follows this report)
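
    An illustrative sketch only (the minutes do not record what the LSF team eventually deployed): assuming the real binary were renamed to a hypothetical /usr/bin/bjobs.real, a caching wrapper installed in its place could look roughly like the Python script below. It reuses the previous answer for an identical argument list for a short time window, so tight polling loops in user scripts do not translate into one LSF query per iteration.

        #!/usr/bin/env python
        # Hypothetical caching wrapper around the real bjobs binary, assumed to
        # have been renamed to /usr/bin/bjobs.real. Identical queries repeated
        # within CACHE_TTL seconds are answered from a per-user cache on disk.
        import hashlib
        import os
        import subprocess
        import sys
        import time

        CACHE_TTL = 60                      # seconds to reuse an answer (assumed value)
        CACHE_DIR = "/tmp/bjobs-cache-%d" % os.getuid()
        REAL_BJOBS = "/usr/bin/bjobs.real"  # assumed location of the renamed binary

        def cache_file(args):
            # One cache file per distinct argument list.
            key = hashlib.sha1(" ".join(args).encode()).hexdigest()
            return os.path.join(CACHE_DIR, key)

        def main():
            args = sys.argv[1:]
            if not os.path.isdir(CACHE_DIR):
                os.makedirs(CACHE_DIR)
            path = cache_file(args)
            # Serve the cached output if it is recent enough.
            if os.path.exists(path) and time.time() - os.path.getmtime(path) < CACHE_TTL:
                sys.stdout.write(open(path).read())
                return 0
            # Otherwise query LSF once and refresh the cache.
            proc = subprocess.run([REAL_BJOBS] + args, stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE, universal_newlines=True)
            if proc.returncode == 0:
                with open(path, "w") as f:
                    f.write(proc.stdout)
            sys.stdout.write(proc.stdout)
            sys.stderr.write(proc.stderr)
            return proc.returncode

        if __name__ == "__main__":
            sys.exit(main())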

  • CMS reports -
    • LHC / CMS
      • CMS solenoid went into fast discharge on Friday, ~7 am CERN time. 4-5 days before recovery is complete.
      • on Saturday, there was a T0 problem with the HLT menu for some runs, now fixed
      • Physics fill through Monday, 0 Tesla data taking with a special HLT menu and high rates into ZeroBias and AlCa PDs
      • High load expected on the T0; the current plan is then to have no beams until Tuesday night (TBC)
      • for CASTOR Ops: following up from Friday's WLCG call, please stay in contact with <cms-crc-on-duty@cern.ch> to schedule interventions
        • Jan: the CASTOR update is still being discussed and the amount of downtime is uncertain, as e.g. DB indices will need to be rebuilt; it seems better to delay the operation until the next technical stop (Sep)
    • CERN / central services and T0
      • LSF: 3.5k job saturation observed on Saturday
      • over the weekend, we saw LSF issues (see ATLAS GGUS), from the T0 Ops:
        ...
           Returned stderr: Cannot connect to LSF. Please wait ...
           Cannot connect to LSF. Please wait ...
           Cannot connect to LSF. Please wait ...
           Failed in an LSF library call: A socket operation has failed: Connection reset by peer. Job not submitted.
           ...
    • Tier-1/2:
      • Only minor Savannahs at the T2 level, nothing major at the T1 level

  • ALICE reports -
    • CERN: low job throughput during the weekend (see ATLAS GGUS)
    • QM 2012 (Quark Matter) has started

  • LHCb reports -
    • T0:
      • GGUS:85062 Significant problems with jobs over the weekend again. Solved now.
      • GGUS:85069 Ongoing problems with jobs resolving TURLs. Problematic diskserver(s)? Interestingly, these files are (wrongly?) reported as nearline; they should not have a tape copy.
        • Jan: last week 300 files were reported as lost, which is linked to this incident; a few machines were briefly put in production, then taken out and reinstalled; this caused various problems; a list of files would be helpful
        • Raja: will provide a list
      • Lost files due to diskserver failure - dealt with within LHCb (https://lblogbook.cern.ch/Operations/11304).
    • T1:
      • IN2P3 : New file with corrupted checksum seen (cf. old ticket GGUS:82247). It was created on 10 August. Filename : srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/lhcb/MC/MC11a/ALLSTREAMS.DST/00019658/0000/00019658_00000818_5.AllStreams.dst (a small sketch for verifying a file checksum locally is given after this report)
    • Others
      • FTS : Switched off checksum checks to allow very old files to be transferred to CNAF (GGUS:85039). For now we do not see significant errors transferring to GridKa - will keep an eye on it.
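
    The tickets do not state which checksum algorithm is involved; assuming it is the Adler32 value commonly stored in grid file catalogues, a locally downloaded copy of a suspect file could be cross-checked roughly as follows (the script name, file path and expected value are placeholders, not taken from the ticket).

        # check_adler32.py - minimal sketch: compute the Adler32 checksum of a
        # local file and compare it with the catalogue value passed on the
        # command line. The choice of Adler32 is an assumption for illustration.
        import sys
        import zlib

        def adler32_of(path, chunk_size=1024 * 1024):
            # Stream the file in chunks so large files need not fit in memory.
            value = 1  # standard Adler32 seed
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    value = zlib.adler32(chunk, value)
            return "%08x" % (value & 0xffffffff)

        if __name__ == "__main__":
            # Usage: python check_adler32.py <local_file> <expected_hex_checksum>
            local_file, expected = sys.argv[1], sys.argv[2].lower()
            actual = adler32_of(local_file)
            print("computed %s, expected %s" % (actual, expected))
            sys.exit(0 if actual == expected else 1)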

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • storage - nta

AOB:

Tuesday

Attendance: local(Alexandre, Cedric, Doug, Eva, Jan, Maarten, Raja);remote(Ian, Jeremy, Jhen-Wei, Lisa, Marc, Michael, Rob, Ronald, Tiju, Ulf).

Experiments round table:

  • ATLAS reports -
    • T0
      • Nothing to Report
    • T1
      • Taiwan CASTOR glitch (GGUS:85138) caused some job failures - solved
      • Doug: reminder for ATLAS Tier-1 sites to implement atlasgrouptape directory

  • CMS reports -
    • LHC / CMS
      • Magnet being brought back up
    • CERN / central services and T0
      • Nothing new to report
    • Tier-1/2:
      • Only minor Savannahs at the T2 level, nothing major at the T1 level

  • LHCb reports -
    • T0:
      • GGUS:85069 Ongoing problems resolving turls by jobs - files lost in bad diskserver
      • GGUS:85134 Opened second ticket as requested in above ticket. More files found lost.
        • Jan: both tickets were updated and the first was closed; about half of the files on the list were covered by that ticket and have been recovered; the rest are still being looked into and are probably lost

Sites / Services round table:

  • ASGC
    • yesterday around 23:00 CEST the CASTOR stager became unavailable; the node was rebooted and it recovered; the cause of the problem is being investigated
  • BNL - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3
    • tomorrow is a holiday in France
  • NDGF
    • for some 50 minutes the FTS and site BDII were affected by a loss of IP connectivity to the outside, but it seems to have gone mostly unnoticed
    • tomorrow there will be short breaks for dCache pool upgrades, downtimes have been declared in the GOCDB
  • NLT1 - ntr
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • storage
    • EOS-ALICE nodes are being rebooted transparently to bring the service into a clean state

AOB:

Wednesday

Attendance: local(Alexandre, Jan, Luca C, Maarten, Raja);remote(Alexander, Dimitri, Doug, Gareth, Ian, Jhen-Wei, Lisa, Michael, Rob, Ulf).

Experiments round table:

  • ATLAS reports -
    • T0
      • Nothing to Report
    • T1
      • Nothing to Report

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Reasonable number of jobs on the Tier-0
    • Tier-1/2:
      • A few pre-staging and tape family requests were submitted for continuing MC production

  • LHCb reports -
    • T0:
      • GGUS:85134 Lost files marked as bad within LHCb. Much lower failure rate at CERN now. However, as a result of pulling the faulty diskservers we are down to a low level of storage: only 22 TB free now.
        • Jan: will check and update the ticket
    • T1 :
      • GridKa : GGUS:85208 Possible SRM problems; TURLs are not being returned for some files.
        • Dimitri: will ping our storage experts
      • CNAF : Diskserver rebalancing - slow storage access has been causing jobs and transfers to fail or run very slowly since ~Monday. The SE is banned until the situation improves; a backlog of files to be transferred to CNAF is building up.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • FNAL - ntr
  • KIT - nta
  • NDGF
    • One site found its dCache update taking more time than anticipated and postponed the remainder to Friday
  • NLT1 - ntr
  • OSG
    • there was a network issue affecting the ticket synchronization between the OSG GOC and BNL; still being looked into, but the tickets seem to be OK now
  • RAL
    • our grid test queue now has a few WN running the EMI-2 SL5 release
      • Maarten: in the coming weeks and months we will ramp up the testing of EMI-1 and EMI-2 WN, esp. by ATLAS and CMS, who depend much more on the WN contents than ALICE and LHCb

  • dashboards - ntr
  • databases - ntr
  • storage
    • EOS-ATLAS SRM access was blocked from early morning because BeStMan had run out of file descriptors; the service was restarted (GGUS:85201). A minimal sketch for checking a process's open file descriptors follows below.
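
    As a side note on the file-descriptor exhaustion, a minimal Linux-only sketch for checking how close a given process is to its open-files limit is shown below; it only reads /proc and assumes you know the PID of the BeStMan (JVM) process.

        # fd_check.py - minimal sketch, Linux only: count the file descriptors a
        # process currently has open and compare with its soft "Max open files"
        # limit, both read from /proc. Requires permission to inspect the PID.
        import os
        import sys

        def open_fds(pid):
            # Each entry in /proc/<pid>/fd is one open file descriptor.
            return len(os.listdir("/proc/%d/fd" % pid))

        def soft_nofile_limit(pid):
            # The soft limit is the 4th field of the "Max open files" line.
            with open("/proc/%d/limits" % pid) as f:
                for line in f:
                    if line.startswith("Max open files"):
                        return int(line.split()[3])
            return None

        if __name__ == "__main__":
            pid = int(sys.argv[1])  # usage: python fd_check.py <pid>
            used, limit = open_fds(pid), soft_nofile_limit(pid)
            print("pid %d: %d open fds, soft limit %s" % (pid, used, limit))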

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 02-Jul-2012
