Week of 130325

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Ivan, Luca M, Maarten, Stefan, Steve, Ulrich
  • remote: Alexander, David, Kyle, Lucia, Michael, Pepe, Rolf, Stephane, Tiju, Ulf, Wei-Jen, Xavier

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • CERN-PROD ALARM : GGUS:92166 : file transfer failures due to 'SECURITY_ERROR'. No reply to the ticket, and the problem occurs each day for a few hours.
        • Steve: will investigate further as time permits
    • T1s
      • RAL: Limitations observed in current FTS3 setup at RAL -> No FTS 3 over the week-end until (Savannah:135151)
        • Transfer timeout does not scale with file size
        • Bug in the optimizer, resulting in the number of streams being too low
        • Maarten: please ensure the FTS-3 developers are informed of these issues
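
To illustrate the first RAL point above (a transfer timeout that does not scale with file size), here is a minimal sketch of a size-scaled timeout. It is not taken from FTS-3: the function name, the base timeout and the assumed minimum throughput are all hypothetical values chosen for illustration only.

```python
def transfer_timeout(file_size_bytes,
                     base_timeout_s=600.0,
                     min_throughput_bytes_per_s=1_000_000.0):
    """Return a timeout in seconds that grows linearly with file size.

    Hypothetical illustration only: the defaults are assumptions,
    not FTS-3 settings.
    """
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Fixed overhead plus the time needed at the slowest acceptable rate.
    return base_timeout_s + file_size_bytes / min_throughput_bytes_per_s


if __name__ == "__main__":
    for size in (0, 1_000_000_000, 10_000_000_000):
        print(size, transfer_timeout(size))
```

With a fixed timeout, a large file can be cut off even while it is transferring at an acceptable rate; a size-dependent term avoids that while still bounding stalled transfers.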

  • CMS reports (raw view) -
    • LHC / CMS
      • Rereconstruction of 2012 data progressing well -- processing the last of 4 data-taking eras. Still utilizing all T1 resources for this
    • CERN / central services and T0
      • Working on reconfiguring for reprocessing (details in last Thursday's report -- coordination meeting between CMS/CERN will occur this week)
        • Luca: t0streamer in progress; T0 restructuring details to be agreed during that meeting
    • Tier-1:
      • GGUS ticket 92754 (RAL) -- xrootd fallback reconfig caused errors in SAM tests briefly -- fixed right away on 3/21, but ticket still open...
        • Tiju: will follow up
    • Tier-2:
      • Continue to work on moving some T1 workflows to larger T2's with xrootd input, due to high T1 use.
        • US T2's working well, expanding to larger European T2's -- requests to enable xrootd fallback, accept t1production role issued end of last week

  • ALICE -
    • CERN: job submission to CERN CEs OK again since Thu evening, thanks to a big effort of IT-PES! (GGUS:92521)
    • Luca: an EOS update needs to happen soon, will send details via e-mail

  • LHCb reports (raw view) -
    • Mainly user jobs with some MC ongoing. Preparation for Re-Stripping to be launched next week.
    • T0:
      • latest SW releases not available in CVMFS Stratum 1 instance for LHCb (GGUS:92815)
        • Steve: for some reason the LHCb repository decided to resync from scratch, being looked into
    • T1:
      • PIC: Failed pilots, is this already the draining of queues for tomorrow's DT? LHCb person at PIC contacted.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF
    • kernel upgraded on half of the WN, back in production; the other half will be done gradually
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • tape library maintenance on Wed
  • OSG
    • during tomorrow's maintenance window all services will be rebooted at some time
  • PIC
    • reminder: downtime 5-19 UTC tomorrow for electrical maintenance, queues already being drained
  • RAL - ntr

  • dashboards - ntr
  • GGUS
    • Reminder! Monthly Release on 2013/03/27 as usual (last Wednesday of the month). There will be NO ALARM tests this time. See Savannah:136545 for the reason. Basically, the release is minor; the list of development items is at http://bit.ly/12nJ9ev . (Text entered by MariaD, now away.)
  • grid services
    • CVMFS Stratum-0 instance offline tomorrow for back-end storage change; Stratum-1 instances should not be affected
    • Argus service nodes need improved memory tuning, should be transparent
  • storage
    • file updates have been disabled on CASTOR-PUBLIC today, the other instances to follow on Apr 8

AOB:

Thursday

Attendance:

  • local: Andrea, Ivan, Luca M, Maarten, Manuel, Stefan
  • remote: Gareth, Lisa, Lucia, Michael, Oliver, Rob, Rolf, Ronald, Ulf, Wei-Jen, Xavier

Experiments round table:

  • CMS reports (raw view) -
    • LHC / CMS
      • Rereconstruction of 2012 data in the tails, load at the T1 sites small
    • CERN / central services and T0
      • GGUS:92909 : SAM tests stopped on the 27th because the machine needed to be rebooted. Although an alarm was automatically generated, which is supposed to prompt a reboot by the sysadmin, no reboot has been done so far, even after repeated requests by the SAM team. The impact for CMS is that no SAM tests have run since yesterday, so there is no monitoring information.
        • A ticket has been opened for the SAM team to get the SAM production machines' importance increased, so that the sysadmins will look into such issues earlier, as suggested by Manuel
    • Tier-1:
      • GGUS:92754 -- xrootd fallback reconfig caused errors in SAM tests briefly -- fixed right away on 3/21, closed
    • Tier-2:
      • Continue to work on moving some T1 workflows to larger T2's with xrootd input, due to high T1 use.
        • US T2's working well, expanding to larger European T2's -- requests to enable xrootd fallback, accept t1production role

  • ALICE -
    • CNAF: VOBOX suffered network issues from Mon afternoon, fixed Tue afternoon (GGUS:92845)
    • IN2P3: investigating very low job rates since switch to new VOBOX on Mon
      • fixed
    • KIT: all jobs lost around 09:00 CET today, being investigated
      • Xavier: no idea what might have caused that
      • the VOBOX was rebooted and ended up in a bizarre state: not understood, but jobs are running again

  • LHCb reports (raw view) -
    • Mainly user jobs with some MC ongoing.
    • T0:
      • No SAM tests displayed on the SUM dashboard (GGUS:92924)
        • also see the CMS SAM incident reported above
    • T1:
      • PIC: Failed pilots after DT. The submitted jobs had requested 700k HepSpec06.seconds but for some reason ended up in the short queue (17k sec) (GGUS:92914). No more failures as of yesterday evening; ticket closed.

Sites / Services round table:

  • ASGC
    • this morning one of the DPM daemons crashed; fixed
    • yesterday 50 ATLAS files were lost in CASTOR; ATLAS and the CASTOR developers have been informed
  • BNL - ntr
  • CNAF
    • kernel upgrades on UI nodes and second half of the WN will be done next week
    • last night there were problems with the Italian top-level BDII, which was recently upgraded to EMI-2; being investigated
      • added after the meeting: a few sites reported memory usage increases on the LCG-Rollout list today
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - nta
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • RAL
    • on Tue morning there was a network issue caused by the RALPP T2 site overloading the site firewall, due to Xrootd redirection tests by CMS

  • dashboards - ntr
  • grid services - ntr
  • storage - ntr

AOB:

  • ATTENTION: start of European Summer Time on Sun March 31 !
  • ATTENTION: next meeting on Tuesday April 2 !
  • Have a good Easter break !
Topic revision: r11 - 2013-03-28 - MaartenLitmaath
 