Week of 150803

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Maria Alandes (chair, minutes), Christoph Wissing (CMS), Jiri Chudoba (ATLAS), Ben Jones (Grid&Batch), Mark Slater (LHCb), Belinda Chan Kwok Cheong (Storage)
  • remote: Dima Kovalskyi (CMS), Dimitri (KIT), Alexander Verkooijen (NL-T1), Michael Ernst (BNL), Gerard Bernabeu (FNAL), Gareth Smith (RAL), Antonio Falabella (CNAF), Sebastian Gadrat (IN2P3)

Experiments round table:

  • ATLAS reports (raw view) -
    • RAL network outage for 1 hour (affected mainly squid availability)
    • Monitoring shows wrong status of Squid (service is up, monitoring says down)

The discussion on how to improve this monitoring is taken offline between ATLAS and Ops Coordination to work out a solution; a minimal independent liveness check is sketched below.
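
As context for that discussion, the sketch below shows one way to probe a squid directly, independent of the monitoring. It is a minimal sketch only: the proxy host, the default port 3128 and the test URL are illustrative assumptions, not details from the minutes.

    import urllib.request

    # Hypothetical endpoints: a site squid on the default proxy port and a small
    # test URL to fetch through it; both are placeholders, not values from the minutes.
    SQUID_PROXY = "http://squid.example-site.org:3128"
    TEST_URL = "http://frontier.example.org/squid-probe"

    def squid_is_alive(proxy: str, url: str, timeout: float = 10.0) -> bool:
        """Fetch a URL through the squid; a successful response means the service answers."""
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": proxy}))
        try:
            with opener.open(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    if __name__ == "__main__":
        print("squid up" if squid_is_alive(SQUID_PROXY, TEST_URL) else "squid down")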

  • CMS reports (raw view) -
    • Nothing to report.

  • ALICE -
    • high activity
    • CERN: a lot of the raw data currently being processed could not be accessed from CASTOR during the weekend.
      • Disk pool was under heavy load.
      • Some mitigation was applied by the CASTOR team.
      • A fix is scheduled for Tue Aug 4.

  • LHCb reports (raw view) -
    • Data Processing: data "stripping" productions ongoing; user and MC jobs also running.
    • T0
      • Discussion with the LSF team about wall and CPU time queue lengths is ongoing (GGUS:115027). Some progress made and bugs found on both sides; deployment of fixes is in progress.
    • T1
      • RAL: LAN instabilities continue; CVMFS and data access problems over the weekend.

Sites / Services round table:

  • ASGC: NA
  • BNL: Very high tape migration activity over the weekend, e.g. on Saturday a data volume of 80 TB (70k files) was moved to tape in 24 hours, the highest activity registered for the tape storage so far.
  • CNAF: NTR
  • FNAL: There will be a downtime on Wednesday 5th August from 8h to 17h, during which all storage services will be rebooted.
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NA
  • KIT: A major downtime is planned from 30th September to 7th October, during which the whole site will be offline. GOCDB details will be provided soon.
  • NDGF: Two upcoming warning downtimes for srm.ndgf.org, which will affect the availability of some ALICE and ATLAS data.
  • NL-T1: Scheduled downtime tomorrow afternoon for the tape backend that will last 24h.
  • NRC-KI: NA
  • OSG: GGUS:115413 was opened against the site but is in fact related to the Global Xrootd redirector. Maria will follow up after the meeting.
  • PIC: NA
  • RAL: Outage scheduled tomorrow to check the routers that caused the network problems affecting RAL over the weekend. The T1 will be unavailable for 6h. Details are in GOCDB.
  • TRIUMF: NA

  • CERN batch and grid services: NTR
  • CERN storage services: A set of transparent CASTOR upgrades is happening this week.
  • Databases: NA
  • GGUS: NA
  • Grid Monitoring:
    • Draft availability reports for July have been sent and are available in the SAM3 UI.
  • MW Officer: NA

AOB: None

Thursday

Attendance:

  • local:
  • remote:

Experiments round table:

  • ALICE -
    • high activity
    • CERN: thousands of files still persistently inaccessible from CASTOR
      • reported as 'STAGED', but inaccessible from the WN or interactively
        • xrootd or rfcp
      • many retries were done over several days
      • also after the interventions earlier this week
      • the CASTOR team are aware (a minimal access-check sketch follows below)
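
The sketch below illustrates the kind of check described above: query the stager status of a file and independently try to read it via xrootd. It is a minimal sketch only, assuming the CASTOR client tools (stager_qry, xrdcp) are available on the node; the file path and the endpoint are hypothetical placeholders, not the actual ALICE files from the report.

    import subprocess

    # Placeholders: a hypothetical CASTOR endpoint and namespace path,
    # not the actual files or service class from the report.
    CASTOR_ENDPOINT = "root://castoralice.cern.ch"
    FILES = ["/castor/cern.ch/alice/raw/example_file.root"]

    def run(cmd):
        """Run a command and return (exit code, combined output)."""
        p = subprocess.run(cmd, capture_output=True, text=True)
        return p.returncode, (p.stdout + p.stderr).strip()

    for path in FILES:
        # stager_qry reports the stager status of the file (e.g. STAGED).
        rc, out = run(["stager_qry", "-M", path])
        print(f"{path}: stager says: {out or 'no answer'}")

        # Independently try to read the file via xrootd; a failure here despite a
        # STAGED status would reproduce the inconsistency reported above.
        rc, out = run(["xrdcp", "-f", f"{CASTOR_ENDPOINT}/{path}", "/dev/null"])
        print(f"{path}: xrdcp {'succeeded' if rc == 0 else 'FAILED: ' + out}")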

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3:
  • JINR:
  • KISTI:
  • KIT:
  • NDGF:
  • NL-T1:
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL:
  • TRIUMF:

  • CERN batch and grid services:
  • CERN storage services:
    • Upgrade of CMS AAA redirectors to XrootD 4.2.2: the 2nd node will be upgraded next Tuesday; so far no lockup of the CMSd has been observed on the upgraded node (see the probe sketch after this list).
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:
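
As a follow-up to the AAA redirector item above, the sketch below shows one way to probe a redirector for the kind of lockup being watched for. It is a minimal sketch only, assuming the XRootD Python bindings are installed; the redirector hostname and the sample path are illustrative assumptions, not values taken from the minutes.

    from XRootD import client
    from XRootD.client.flags import OpenFlags

    REDIRECTOR = "root://cms-xrd-global.cern.ch:1094"   # assumed endpoint
    SAMPLE_PATH = "/store/data/example/file.root"       # hypothetical file

    fs = client.FileSystem(REDIRECTOR)

    # Ping checks that the xrootd daemon itself answers.
    status, _ = fs.ping()
    print("ping:", "ok" if status.ok else status.message)

    # A locate request additionally exercises the cluster lookup; a cmsd
    # lockup would be expected to surface as a timeout or error here.
    status, locations = fs.locate(SAMPLE_PATH, OpenFlags.REFRESH)
    print("locate:", "ok" if status.ok else status.message)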

AOB:
