Week of 140310

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Akos, Alessandro, Andrew, Eddie, Felix, Maarten, Xavier
  • remote: Christian, Dimitri, Eric, Kyle, Lisa, Lucia, Michael, Onno, Pepe, Rolf, Sang-Un, Tiju

Experiments round table:

  • ATLAS reports (raw view) -
    • T1
    • ATLAS Internal:
      • Rucio File Catalog migration: we checked that switching the CVMFS "latest" symlink did not cause any issues; confirmed. Some US sites seem to have a hardcoded CVMFS link instead of the "latest" one; to be followed up.
      • RAL deletion to be followed up: still a huge backlog (today's deletion rate is 4 Hz; 370k datasets still to be deleted, approx. 3M files). A list of files to be deleted can be created centrally, but we are not sure how useful it would be for the site. The deletion chunk size can be tuned: it was 50 files per chunk, now it is 200.
      • Disk space: a lot of data secondarized or deleted. FZK did not gain much space; to be checked.

  • ALICE -
    • CNAF down

  • LHCb reports (raw view) -
    • MC simulation, User jobs and Stripping.
    • T0: NTR
    • T1: CNAF unscheduled downtime
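
As a rough illustration of the RAL deletion backlog in the ATLAS report above: if the quoted 4 Hz rate stayed constant, the approx. 3M files would take on the order of nine days to delete. The sketch below only does that arithmetic; the function name is hypothetical and the constant-rate assumption is ours, not something stated at the meeting.

```python
def drain_time_days(files_remaining: int, rate_hz: float) -> float:
    """Days needed to delete files_remaining files at a constant rate in files/second."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    seconds = files_remaining / rate_hz
    return seconds / 86400  # 86400 seconds per day


# Figures quoted in the ATLAS report: approx. 3M files, 4 Hz deletion rate.
backlog_days = drain_time_days(3_000_000, 4.0)
print(f"{backlog_days:.1f} days")
```

The same function can be reused with other rates, e.g. a rate given in files/hour divided by 3600.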

Sites / Services round table:

  • ASGC
    • data transfer timeouts are affecting the ATLASDATADISK space due to high level of concurrent xrootd accesses; experts looking into it
  • BNL - ntr
  • CNAF
    • downtime due to cooling failure ongoing since 04:00 CET yesterday, expected to be resolved by ~20:00 today
  • FNAL - ntr
  • IN2P3
    • outage on Tue March 18:
      • batch and MSS all day
      • dCache data export degraded all day
  • KISTI
    • downtime Wed 03:00-10:00 UTC for intervention on optical link between New York and Amsterdam; the backup link should make this transparent...
  • KIT
    • a network problem caused data transfers to time out; cured by a roll-back to the old configuration
    • will involve the dCache team for the CMS ticket
  • NDGF - ntr
  • NLT1
    • downtime extended into tomorrow because of a HW issue: after a DIMM replacement the machine does not boot; the vendor has been contacted
  • OSG
    • can we close the ATLAS tickets for the OSG T3 sites for now?
      • Alessandro: OK. Those tickets were about how to make the total and free SE space discoverable for those sites, which do not appear in the BDII and do not have an SRM either; we first should have that discussion elsewhere indeed.
  • PIC - ntr
  • RAL - ntr

  • CERN grid services - ntr
    • Alessandro: what are the numbers of slots and the HEP-SPEC ratings in Wigner vs. the CC?
      • after the meeting (updated on Tue):
        • Wigner: 7976 non-dedicated slots, 57613 HS06
        • Meyrin: 32236 non-dedicated slots, 288042 HS06
        • SLC5: 118897 HS06
        • SLC6: 226781 HS06
  • CERN storage
    • tomorrow CASTOR DB upgrade for CMS 08:00-12:00 CET
    • Wed idem for ATLAS
    • Thu idem for ALICE
    • Wed afternoon transparent memory module change on CASTOR Name Server DB
  • dashboards - ntr
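
The Wigner vs. Meyrin figures given under CERN grid services above can also be compared per slot. A minimal sketch, using only the numbers quoted in the minutes; the dictionary layout and function name are illustrative, not part of any WLCG tool.

```python
# Non-dedicated slots and HS06 totals as quoted in the minutes.
capacity = {
    "Wigner": {"slots": 7976, "hs06": 57613},
    "Meyrin": {"slots": 32236, "hs06": 288042},
}


def hs06_per_slot(site: dict) -> float:
    """Average HS06 rating per slot for one computer centre."""
    return site["hs06"] / site["slots"]


for name, site in capacity.items():
    print(f"{name}: {hs06_per_slot(site):.2f} HS06/slot")

# Fraction of all non-dedicated slots that are hosted at Wigner.
total_slots = sum(site["slots"] for site in capacity.values())
wigner_share = capacity["Wigner"]["slots"] / total_slots
print(f"Wigner share of slots: {wigner_share:.1%}")
```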

AOB:

Thursday

Attendance:

  • local: Akos, Andrew, Belinda, Felix, Kate, Maarten
  • remote: Dennis, Gareth, Kyle, Lisa, Michael, Sang-Un, Sonia, Stefano, Thomas

Experiments round table:

  • ATLAS reports (raw view) -
    • T1
      • TAIWAN-LCG2: 2 files seem to be corrupted (GGUS:102006)
      • FZK-LCG2: one directory had wrong permissions (GGUS:102013)
    • ATLAS internal:
      • FZK freed a lot of space on Wednesday (15 TB); it is not clear exactly what happened, but it is not a problem.
      • RAL-LCG2 deletion: 10k files/hour, 40/50 TB/day; we will keep it like this, as there is now enough space.
      • Free space reviewed: it seems OK for all the Tier1s.

  • CMS reports (raw view) -
    • Nothing bad to report
    • Running production and analysis full throttle and working on increasing site utilization

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • MC simulation, User jobs and Stripping.
    • T0: NTR
    • T1
      • A large number of pilots died simultaneously at RAL this morning (due to an ISB server failure), but the problem did not persist.
      • At IN2P3, many jobs died because the new dCache SRM installation returned "file:" PFNs, which were not accessible. This has happened before, and LHCb strongly requests that it be avoided in the future, perhaps by disabling "file:" by default. We appreciated IN2P3's rapid response to the ticket this morning (alarm GGUS:102034).

Sites / Services round table:

  • ASGC
    • 2 corrupted files found, see ATLAS report
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • KISTI - ntr
  • KIT - ntr
  • NLT1 - ntr
  • OSG - ntr
  • RAL
    • an account of the March 5 GOCDB outage has been made available

  • CERN grid services
    • WMS: Deployment of EMI3 Update 14 (ITSSB)
  • CERN storage
    • CASTOR DB upgrades went fine so far; LHCb scheduled for Tue March 25
  • databases - ntr

AOB:

-- SimoneCampana - 20 Feb 2014

Topic revision: r7 - 2014-03-13 - MaartenLitmaath
 