Week of 160111

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints make it impossible to choose another time slot, the Tier-1 will report the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and of the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, covering the current and the following two weeks; if a conflict is found, it will be discussed at the next operations call, or offline if a relevant experiment or site contact is absent. A minimal sketch of such an overlap check is given below.
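
The conflict condition described above amounts to a plain interval intersection, restricted to pairs of Tier-1 sites supporting at least one common VO. As a minimal sketch in Python (the downtime records below are made-up illustrations, not the real downtimes-calendar data model):

    from datetime import datetime

    # Hypothetical downtime records: (site, supported VOs, start, end).
    # A real check would read these from the downtimes calendar.
    downtimes = [
        ("SITE_A", {"atlas", "cms"}, datetime(2016, 1, 18, 8),  datetime(2016, 1, 18, 18)),
        ("SITE_B", {"atlas"},        datetime(2016, 1, 18, 12), datetime(2016, 1, 19, 12)),
        ("SITE_C", {"cms"},          datetime(2016, 1, 20, 0),  datetime(2016, 1, 20, 8)),
    ]

    def conflicts(downtimes):
        """Yield pairs of "outage" downtimes that overlap in time and
        affect at least one common VO."""
        for i, (s1, vos1, b1, e1) in enumerate(downtimes):
            for s2, vos2, b2, e2 in downtimes[i + 1:]:
                # Two intervals overlap iff each starts before the other ends.
                if b1 < e2 and b2 < e1 and vos1 & vos2:
                    yield s1, s2, sorted(vos1 & vos2)

    for s1, s2, vos in conflicts(downtimes):
        print("conflict: %s and %s overlap for VO(s) %s" % (s1, s2, ", ".join(vos)))

This flags SITE_A and SITE_B (overlapping outages, both supporting atlas), but not SITE_C, whose downtime does not overlap with the others.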

Links to Tier-1 downtimes

ALICE   ATLAS   CMS     LHCb
        BNL     FNAL

Monday

Attendance:

  • local: Lorena (batch + grid), Maarten (SCOD + ALICE)
  • remote: Antonio (CNAF), Daniele (CMS), Dario (ATLAS), David (IN2P3), Di (TRIUMF), John (RAL), Kyle (OSG), Lisa (FNAL), Michael (BNL), Onno (NLT1), Pavel (KIT), Pepe (PIC), Renato (LHCb)

Experiments round table:

  • ATLAS reports (raw view) -
    • Production continuing full steam. Reprocessing almost completed.
    • Probable network problem between Canadian sites and several EU Tier-2s: low or zero transfer rates. Reported.

  • CMS reports (raw view) -
    • CMS Tier-0 on the AI project could only use a fraction of its quota (GGUS:118546)
      • opened on Dec 29th, updated on Jan 7th, problem now understood (see ticket), now being worked on with OpenStack
    • issues in accessing HI data on some Wigner machines persist but are improving (GGUS:118082)
      • files fail to open with a "Machine is not on the network" error (most frequent) or "Unable to read replica - read failed ; remote I/O error"
      • 3 unavailable machines at Wigner. Several manual actions, file by file, upon CMS notifications.
      • plenty of help from IT, but painful and risky: some files urgently need fixing, as they must be processed before getting deleted (e.g. at least one problematic file was 1 month old)
      • Keeping the ticket open until the Tier-0 processing tails are done.
    • a couple of tickets on KIT by the shifters
      • GGUS:118630 - our transfer team should say whether this is still needed; probably not
      • GGUS:118660 - glexec issues, I think it can be closed now
    • Job failures at GRIF_IRFU on Jan 8th (GGUS:118716)
      • CVMFS issue, impacting CMS. SAM and HC do not look that bad, so the problem might only affect not-yet-cached files.
      • IRFU confirmed they had hardware problems and worked on them the same day; failure rates are recovering and being monitored.

  • ALICE -
    • high activity (often 80k+ jobs)

  • LHCb reports (raw view) -
    • Data Processing
      • Monte Carlo and User analysis at T0/1/2D sites during weekend
    • Issues
      • T1:
        • IN2P3: files could not be opened because SRM returned a wrong endpoint (GGUS:118655); ticket closed as solved on Friday (Jan 8th)

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF:
    • on Friday the SRM and GridFTP servers started getting new certificates containing the aliases of those services in the Subject Alternative Name field, as will be required by a Globus update in the near future (a minimal sketch for checking the SAN entries follows after the site reports below)
      • there was some discussion on a supposed premature revocation of the old certificates
      • the old certificates should not be revoked at all
      • revocation is only for certificates that were or might have been compromised
      • the CRL for each CA should not be bloated unnecessarily, to avoid slowing down GSI handshakes
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI:
  • KIT: nta
  • NDGF:
  • NL-T1: ntr
  • NRC-KI:
  • OSG: ntr
  • PIC: ntr
  • RAL:
    • 2 disk servers for the CMS tape SE are out with multiple failed drives; tape access may be slow
  • TRIUMF: ntr
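
As context for the CNAF certificate item above: whether a service certificate already carries the service alias in its Subject Alternative Name field can be checked from the certificate itself. A minimal sketch in Python (the endpoint is a hypothetical example; it uses the third-party cryptography package, skips verification since grid CAs are usually not in the default trust store, and whether the handshake succeeds at all depends on the service configuration):

    import ssl
    from cryptography import x509

    # Hypothetical SRM endpoint; real host names and ports vary per site.
    host, port = "srm.example.infn.it", 8443

    # Fetch the server certificate without verifying it (inspection only).
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())

    # Print all DNS names in the Subject Alternative Name extension;
    # the service alias should appear among them.
    san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
    for name in san.value.get_values_for_type(x509.DNSName):
        print(name)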

  • CERN batch and grid services: ntr
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Christoph (CMS), Lorena (batch + grid), Maarten (SCOD + ALICE), Maria A (WLCG), Maria D (WLCG)
  • remote: Andrew (NLT1), David (ATLAS), Di (TRIUMF), Jens (NDGF), John (RAL), Kyle (OSG), Lisa (FNAL), Michael (BNL), Renato (LHCb), Rolf (IN2P3)

Experiments round table:

  • ATLAS reports (raw view) -
    • A network problem disrupted traffic between EU and Canada Mon-Wed. Some discussion about who is responsible for detecting/reporting such issues.
      • there was some discussion during the meeting:
        • anybody can report suspected network issues e.g. via a GGUS ticket
        • the GGUS support unit is WLCG Network Throughput
        • it did indeed follow up on the case at hand (GGUS:118730 and GGUS:118748)
        • the automatic detection of such issues and subsequent alarming are works in progress
    • A large amount of dark data was created on grid storage over the Christmas break by particularly nasty jobs.
      • A campaign is ongoing to clean up the dark data; thanks to the sites that responded to the requests for storage dumps (a sketch of the dump-vs-catalogue comparison is given after this report)
    • FTS: the developers recommended changing a MySQL configuration setting; we would like to know whether all service admins have made the change
      • BNL done shortly after the meeting, RAL and CERN were done earlier
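
Regarding the dark-data cleanup above: dark data is what a site's storage dump contains but the experiment's file catalogue does not know about, so the core of the campaign is a set difference between the two lists. A minimal sketch in Python (the file names are made up; a real campaign compares full storage dumps against the catalogue):

    # What the site's storage dump says is actually on disk.
    storage_dump = {
        "/atlas/data/file1.root",
        "/atlas/data/file2.root",
        "/atlas/tmp/job1234.root",  # e.g. left behind by a failed job
    }

    # What the experiment's file catalogue has registered at the site.
    catalogue = {
        "/atlas/data/file1.root",
        "/atlas/data/file2.root",
    }

    dark_data = storage_dump - catalogue   # on disk, unknown to the catalogue
    lost_files = catalogue - storage_dump  # registered, but gone from disk

    print("dark data to clean up:", sorted(dark_data))
    print("lost files to declare:", sorted(lost_files))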

  • CMS reports (raw view) -
    • CERN/Tier-0
      • Still some file access issues on EOS, GGUS:118858
      • Not clear if this is related to the issue with the storage nodes at Wigner, GGUS:118082
    • Distributed sites
      • Had a number of files that got lost during production
      • Still the annoying bug: files get deleted locally, but remain registered in the Data Bookkeeping
      • One example: GGUS:118851 - usually the site can only confirm that the files are really lost
      • Data transfers to CCIN2P3 Lyon observed to be rather slow (except from CERN): GGUS:118757
    • Central Services
      • Some monitoring issues, basically spotted by CMS Computing Shifters
        • FTS reported as down early this week
          • Was actually only a monitoring issue, already fixed (INC:0940068)
        • LXBATCH reported to have 50% availability - INC:0941762
          • Actually only a ~5% reduction in host count

  • ALICE -
    • high activity

  • LHCb reports (raw view) -
    • Data Processing
      • User jobs, since Monday, at a good rate (approx. 5K jobs/hour) without major errors.
      • MC running (more than 15K jobs/hour) with no significant issues.
    • Issues
      • T1:
        • No major issues to report

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF:
  • FNAL: ntr
  • GridPP:
  • IN2P3:
    • will follow up on the CMS ticket
  • JINR:
  • KISTI:
  • KIT:
  • NDGF: ntr
  • NL-T1: ntr
  • NRC-KI:
  • OSG: ntr
  • PIC:
  • RAL: ntr
  • TRIUMF: ntr

  • CERN batch and grid services: ntr
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:
