Week of 130311

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

| VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web |
| ALICE ATLAS CMS LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web |

General Information

| General Information | GGUS Information | LHC Machine Information |
| CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1 |


Monday

Attendance:

  • local: (Elena - ATLAS, Ivan, Jérôme, Maarten, MariaD)
  • remote: ( Onno, Joel - LHCb, Wei-Jen, David - CMS, Tiju, Salvatore, Rolf, Christian)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • CERN-PROD: file transfer failures from T2s due to SECURITY_ERROR (GGUS:92166 and INC:252496). Reappeared on Friday (no feedback since Wednesday). Escalated to Alarm. Agents restarted. OK now. NB! CERN Grid Services, please add more explanation to the ticket about the cause of the incident and how exactly it was fixed. Thanks!
    • T1s
      • TRIUMF FTS stuck. GGUS:92371. Channels left inactive after downtime. Extremely fast reaction from site (Sunday). Fixed.

  • CMS reports (raw view) -
    • LHC / CMS
      • Rereconstruction of 2012 data progressing well -- utilizing all T1 resources
    • CERN / central services and T0
      • Working on reconfiguring for reprocessing
        • CASTOR disk pools are being cleaned; a request to move them to EOS will follow soon.
        • HLT cloud commissioning progressing after network reconfig.
    • Tier-1:
      • NTR
    • Tier-2:
      • Tested and moved some T1 workflows to larger T2s with xrootd input, due to high T1 use.
        • US T2s at first, expanding to German and Italian T2s soon.

  • ALICE -
    • CNAF: since Sun afternoon most of the disk SE capacity looked offline, only 25.28 GB was visible through xrootd and tests failed with "Stale NFS file handle"; looks OK again since Mon morning

  • LHCb reports (raw view) -
    • Mainly user jobs. MC production on hold for new requests.
    • T0:
      • NTR
    • T1:
      • NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: not connected; probably due to OSG AHM
  • CNAF: ntr
  • FNAL: ntr (sent by email). NB! The USA has been on summer time since yesterday!
  • IN2P3: Next Monday to Tuesday, 18-19 March, all batch activity will be stopped at the site for a scheduled intervention. GOCDB will be updated as appropriate.
  • KIT: not connected.
  • NDGF: Suffering from issues with their storage system. Working on them now.
  • NL-T1: The SARA network maintenance intervention completed successfully.
  • OSG: not connected; probably due to OSG AHM
  • PIC: not connected.
  • RAL: There will be a network update tomorrow, 8:45-15:30 UK time.

  • CERN:
    • Dashboards: ntr
    • Grid services: ntr

  • GGUS: ntr

AOB: Next meeting is on THURSDAY!

Thursday

Attendance:

  • local: ( Elena - ATLAS, Belinda, Alexandre, Gavin, Maarten, MariaD )
  • remote: ( Ronald, Joel - LHCb, Wei-Jen, David - CMS, Gareth, Lucia, Rolf, Ulf, Lisa )

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • CERN-PROD: file transfer failures due to SECURITY_ERROR seen last week reappeared on Wednesday. Transfers succeeded within one hour. New GGUS:92487. Alarm GGUS:92166 was re-opened in the morning. NB! CERN Grid Services, please see the reappearance of the issue discussed at last Monday's meeting.
    • T1s
      • ntr

  • CMS reports (raw view) -
    • LHC / CMS
      • Rereconstruction of 2012 data progressing well -- began processing the last of 4 data-taking eras. Still utilizing all T1 resources for this.
      • Enforced CERN mapping of grid certs on 3/12; this resulted in only ~6 Savannah tickets (one user did manage to have an apostrophe in their CERN email address somehow).
    • CERN / central services and T0
      • Working on reconfiguring for reprocessing
        • CASTOR disk pools are being cleaned; a request to move them to EOS will follow soon.
    • Tier-1:
      • Minor issues resolved quickly at CNAF and KIT earlier in the week
      • Occasional problems with tape migration at ASGC due to repacking
    • Tier-2:
      • Moving some T1 workflows to larger T2s with xrootd input, due to high T1 use.
        • US T2s working well, expanding to German and Italian T2s soon.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Mainly user jobs. MC production on hold for new requests.
    • T0:
      • NTR
    • T1:
      • NTR

Sites / Services round table:

  • ASGC: Working now on a problem with their CASTOR transfer manager.
  • BNL: not connected; probably due to OSG AHM
  • CNAF: WNs' upgrade to EMI2 on SL5 went well.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: not connected
  • NDGF: ntr
  • NL-T1: One SARA file server is down. It will not be repaired until tomorrow.
  • OSG: not connected; probably due to OSG AHM
  • PIC: not connected
  • RAL: Last Tuesday's scheduled intervention finished much later than originally planned (around 9pm UK time). Currently only one of the two 10 Gbps uplinks is working; this should not cause immediate problems, as the traffic rarely exceeds 10 Gbps. Experts are investigating.

  • CERN:
    • Batch systems: Job scheduling takes too long. Investigation is ongoing with help from ATLAS and CMS.
    • CASTOR: The scheduled upgrade was successful. A problem is now seen with the CMS stager; external help is being sought from DB experts.
    • Dashboards: ntr

  • GGUS:
    • File ggus-tickets.xls is up-to-date and linked from page WLCGOperationsMeetings as every week. ALARM drills next week for the MB.
    • Unannounced SNOW changes resulted in GGUS-SNOW interface trouble, discovered on Tue 2013/03/12 and confined to 1-2 tickets.
    • Following discussion on ATLAS ALARM GGUS:92166, MariaD opened GGUS dev. item Savannah:136499 for ALARM email notifications on ALARM ticket re-opening.

AOB: Next meeting will be on Monday 2013/03/18!!

Topic revision: r13 - 2013-03-14 - NicoloMagini
 