Week of 140421

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday: Easter Monday holiday

  • The meeting will be held on Tuesday instead.

Tuesday

Attendance:

  • local: Felix (ASGC), Maarten (SCOD), Maria A (WLCG), Ulrich (CERN grid services)
  • remote: Alexander (NLT1), Antonio (CNAF), Kyle (OSG), Lisa (FNAL), Oliver (CMS), Rolf (IN2P3), Sang-Un (KISTI), Tiju (RAL), Xavier (KIT)

Experiments round table:

  • CMS reports (raw view) -
    • Processing/production situation
      • Queues pretty much full with 13 TeV preparation MC production and digitization/reconstruction
    • EOS trouble on April 19th
      • Jan caught it on our support list before I could even open a ticket (thanks, Jan!)
      • Lockup of EOSCMS namespace? Solved by restart?
    • CERN: xrootd trouble on April 21st
    • JINR: xrootd trouble
      • a file was announced to be served by JINR but was not accessible afterwards
      • GGUS:104810

  • ALICE -
    • KIT
      • network usage by jobs was OK over the long weekend
      • the job cap was removed this morning and the number of concurrent jobs rose from 1k to more than 3k without incident

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KISTI - ntr
  • KIT - nta
  • NLT1
    • yesterday morning one dCache pool node crashed; it was restarted this morning and is running OK again
  • OSG
    • GGUS:104806 opened for documentation on adding, removing or changing resources in REBUS
      • Maria: pledges or resources published by the BDII?
      • Kyle: site names
      • Maria: REBUS has its own support unit, so the ticket just needs to get routed there
      • Maarten: will check after the meeting
        • the ticket has been routed OK, but the expert may be unavailable this week
  • RAL - ntr

  • CERN grid services
    • this morning an AFS intervention went wrong, leaving local jobs unable to get AFS tokens; fixed shortly after noon
    • we plan to ramp down the number of CEs submitting to the remaining SLC5 resources; further announcements will follow

AOB:

Thursday

Attendance:

  • local: David (CMS), Felix (ASGC), Maarten (SCOD), Marcin (databases), Maria A (WLCG), Stefan (LHCb), Xavi (CERN storage)
  • remote: Dennis (NLT1), Jeremy (GridPP), Kyle (OSG), Lisa (FNAL), Michael (BNL), Pepe (PIC), Rolf (IN2P3), Sang-Un (KISTI), Sonia (CNAF), Thomas (KIT), Tiju (RAL), Ulf (NDGF), Xavier (KIT)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
    • T1
      • FZK-LCG2 file transfers failing with SRM_GET_TURL error on the TURL request (GGUS:104864)
      • RAL-LCG2 file transfer errors with "SRM_FILE_UNAVAILABLE". RAL admins reported a problem with a disk server (ELOG:49033)

  • CMS reports (raw view) -
    • Rather quiet two days -- generally trying to keep things busy with 13 TeV MC production and accompanying reco, with legacy reprocessing of MC to complete Run 1 analyses
    • Rereco of 2011 HI data nearly complete -- successful use of CERN HLT and AI resources for reprocessing.
      • "Nearly complete" -- waiting on ~dozen files inaccessible in CERN EOS INC:538717 (open)
    • GGUS:104827 (closed) Transfer errors between Beijing & FNAL -- sites need to remember to use SHA1 certs until FNAL upgrades dCache to allow SHA2 in a few weeks.
    • SAV:143092 (open) HPSS outage at IN2P3 last week is still affecting outgoing transfers of files this week.
      • Rolf: GGUS:104842 appears to be about the same issue?
      • David: looks like it

  • ALICE -
    • KIT: the firewall and OPN link were saturated between 16:30 and 21:30 yesterday due to a large batch of jobs running an older analysis tag that does not have the patch for using the local SE when possible
      • the job cap was lowered to 1.5k for the night, then lifted again this morning

  • LHCb reports (raw view) -
    • Incremental Stripping campaign finished - many thanks to all T1 sites!
    • MC simulation, Working Group Production and User jobs.
    • T0:
      • Migration of user data Castor -> EOS started on Tuesday
        • Xavi: ~100k out of a few M files still to be done
    • T1:
      • GridKa: some 102 input files for the stripping were not accessible to jobs (GGUS:104852). Fixed Wed evening and jobs succeeded.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP
    • do we need to be concerned about concurrent major downtimes at T1 sites?
      • after some discussion and correction of erroneous information, the conclusion was that such clashes should be avoided to a reasonable extent, but that there seems to be no cause for concern in this matter
        • major interventions are usually announced well in advance and (where feasible) agreed with the experiments concerned
        • over the past many years there has rarely, if ever, been a clash that caused a big nuisance
        • sometimes the T0 has been down, with a much bigger impact, and we just dealt with it
  • IN2P3 - nta
  • KISTI - ntr
  • KIT
    • the ATLAS ticket has been solved
  • NDGF
    • high ALICE traffic observed, but the system is coping OK
      • Maarten: probably due to increased activities in preparation for Quark Matter 2014 (May 19-24); furthermore, the issue that was observed for KIT can also happen at other sites in principle
    • Maria: you were going to open a ticket about better values for GLUE Validator limits?
      • Ulf: will follow up
  • NLT1
    • ATLAS opened 2 tickets:
      • GGUS:102716 about slow transfers from NIKHEF to BNL: we did some tuning, now waiting for a reply
      • GGUS:104769 about SARA transfer failures: the cause does not seem to be at SARA
    • yesterday the dCache SRM door was down for ~1h; it has happened before and been reported to the developers
  • OSG - ntr
  • PIC
    • our multi-core queue is seeing significant use by ATLAS and CMS; we have re-tuned the scheduler and will report in the Multi-core WG on our observations
  • RAL
    • the ATLAS ticket has been solved
    • on Tue April 29 there will be a site outage for network maintenance, 07:00-17:00 local time

  • CERN storage - nta
  • databases - ntr

AOB:

-- SimoneCampana - 20 Feb 2014
