Week of 150406

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or invite the right people to the meeting.

Monday: Easter Monday holiday

  • The meeting will be held on Tuesday instead.

Tuesday

Attendance:

  • local: Maarten (SCOD + ALICE)
  • remote: Alessandro (ATLAS), Christoph (CMS), Di (TRIUMF), Jeremy (GridPP), Kyle (OSG), Lisa (FNAL), Onno (NLT1), Pepe (PIC), Rolf (IN2P3), Tiju (RAL), Ulf (NDGF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • CERN-PROD GGUS:112835: one or more files were not staged within the timeout (which is now 2 days). These files should be checked; it is not normal for prestaging to take this long.

  • CMS -
    • Sorry, nobody could join the call today.
    • Nothing to report

  • ALICE -
    • High activity, except between Sun ~19:00 and Mon ~07:00 CEST

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF:
  • FNAL:
    • downtime this Wed 06:30-19:00 local time for power maintenance affecting the building hosting the WN
  • GridPP: ntr
  • IN2P3: ntr
  • JINR:
  • KISTI:
  • KIT:
  • NDGF:
    • received a lot of ATLAS data OK over the weekend
  • NL-T1:
    • congratulations on the first beams in the LHC!
  • NRC-KI:
  • OSG: ntr
  • PIC:
    • last week there was a 2-day downtime for a UPS upgrade, which did NOT go well:
      • a major electrical fault occurred towards the end of the maintenance, which had to be extended by 4h
      • a DDN system hosting ~100 TB then suffered a RAID controller misconfiguration, which took ~10h to solve
      • a broken climatization unit affected two thirds of the CPU capacity, which had to be kept switched off over Easter
      • at the moment the T1 is running with 50% of the CPU
      • more blades will be moved to the main computer room to recover part of the missing capacity
      • the new UPS is suspected to have been installed incorrectly, which would then have to be fixed in the near future
  • RAL:
    • quiet Easter
    • TRIUMF-RAL network issue under investigation
    • CASTOR downtime for upgrade tomorrow
  • RRC-KI:
  • TRIUMF: ntr

  • CERN batch and grid services:
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Maarten (SCOD + ALICE), Steve (grid services)
  • remote: Alessandro (ATLAS), Dennis (NLT1), Di (TRIUMF), Felix (ASGC), John (RAL), Matteo (CNAF), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI), Ulf (NDGF), Zoltan (LHCb)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • any news from RAL? it seems from GOCDB that the downtime is over, is everything ok?
        • Alessandro: can we use the FTS-3 instances again?
        • John: AFAIK it was proposed that ATLAS wait until Monday; will be followed up offline
        • Alessandro: OK, we keep using CERN and BNL for now

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • User and MC jobs are running in the system. The failure rate of the jobs is very low.
    • No issues to report

Sites / Services round table:

  • ASGC: ntr
  • BNL:
  • CNAF: ntr
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF:
    • Downtime for storage 11-12 UTC on Monday due to head node reboots to get new kernels.
    • We got an ATLAS ticket complaining about lack of space: they had filled the space token for MCTAPE. We increased it.
  • NL-T1: ntr
  • NRC-KI:
  • OSG:
    • can ATLAS have a look at GGUS:112470 about SRM failures?
      • Alessandro: OK
    • there appears to be an issue with the latest monthly accounting report, possibly related to the introduction of multi-core records; this matter is being looked into
  • PIC:
  • RAL:
    • yesterday there was a scheduled downtime for a CASTOR upgrade
    • it was taken as an opportunity to reboot a switch as well
    • that unexpectedly caused a lot of network problems, which took until this morning to resolve
    • the CASTOR upgrade had to be canceled; a new date will be agreed
  • TRIUMF: ntr

  • CERN batch and grid services: ntr
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

-- AndreaSciaba - 2015-02-27

Topic revision: r7 - 2015-04-09 - MaartenLitmaath
 