Week of 171211

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow choosing another time slot, the Tier-1 should point out the conflict to the SCOD mailing list and at the next WLCG operations call, so that it can be discussed with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; if a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent. A minimal sketch of such an overlap check is given below.
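To illustrate the check described above, here is a minimal sketch (not an official tool) that flags overlapping "outage" downtimes at Tier-1 sites supporting the same VO. The record format and the example entries are hypothetical; in practice the downtime information comes from the downtimes calendar / GOCDB.

```python
from datetime import datetime
from itertools import combinations

# Hypothetical downtime records: (site, supported VOs, severity, start, end).
# In reality these would be taken from the downtimes calendar / GOCDB.
downtimes = [
    ("IN2P3-CC", {"alice", "atlas", "cms", "lhcb"}, "OUTAGE",
     datetime(2017, 12, 13, 8, 0), datetime(2017, 12, 13, 18, 0)),
    ("RAL-LCG2", {"alice", "atlas", "cms", "lhcb"}, "OUTAGE",
     datetime(2017, 12, 13, 14, 0), datetime(2017, 12, 14, 2, 0)),
    ("FZK-LCG2", {"alice", "atlas", "cms", "lhcb"}, "AT_RISK",
     datetime(2017, 12, 12, 9, 0), datetime(2017, 12, 12, 12, 0)),
]

def conflicts(entries):
    """Yield pairs of "outage" downtimes that overlap in time and share a VO."""
    outages = [d for d in entries if d[2] == "OUTAGE"]
    for a, b in combinations(outages, 2):
        shared_vos = a[1] & b[1]
        overlap = a[3] < b[4] and b[3] < a[4]  # standard interval-overlap test
        if shared_vos and overlap:
            yield a[0], b[0], sorted(shared_vos)

for site_a, site_b, vos in conflicts(downtimes):
    print(f"Conflict: {site_a} / {site_b} overlap for VO(s): {', '.join(vos)}")
```

With the example records above, the sketch reports the IN2P3-CC / RAL-LCG2 overlap, while the "at risk" downtime is ignored, since the procedure only concerns downtimes classified as "outage".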

Links to Tier-1 downtimes

| ALICE | ATLAS | CMS  | LHCb |
|       | BNL   | FNAL |      |

Monday

Attendance:

  • local: Julia (WLCG), Maarten (WLCG, ALICE), Kate (chair, DB), Vincent (security), Gavin (Computing), Belinda (storage), Marian (networks), Alberto (monitoring)
  • remote: Xavier (KIT), Christoph W (CMS), Andrei T (LHCb), Di Qing (TRIUMF), Marcelo (CNAF), Sabine (ATLAS), Victor (JINR), Kyle (OSG), David M (FNAL), Christian (NDGF), Gareth (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • ATLAS software week ongoing
    • Smooth operation
    • 250k-300k running job slots, depending on the use of the 50k from HLT, peak at 500k with HPC
    • T0 ran a short Bphys late stream this week; its 19k slots are back for grid use
    • CNAF data replication: 1.3 PB for data17_13TeV done, for data16_13TeV: 340 TB (215k files) being replicated, 590 TB (405k files) done
    • No jobs ran at IN2P3-CC during part of the weekend due to modifications in the database read/write protocol => solved
    • Reprocessing about to start; no data pre-staging for the start

  • CMS reports (raw view) -
    • Still high activity: ~175k cores in use
    • Tier-0 CPU is now in use for processing
    • Asked several dCache sites to restart their GSI components (GridFTP, SRM, ...)
    • This weekend the CMS Global HTCondor Pool shrank to ~100k cores
      • Issue not fully understood yet
      • Likely related: a firewall issue with the CERN Glidein Factory: RQF:0904789

  • ALICE -
    • High to very high activity on average
    • CERN: EOS-ALICE restarted Thu afternoon for urgent intervention
      • The service was then unstable until late evening (GGUS:132382)
    • CERN: many reco failures due to CASTOR issue on Tue (GGUS:132319)
      • Caused by a non-transparent maintenance operation
    • CERN: ~80k reco job failures due to CASTOR issue Sun evening (GGUS:132428)
      • Reco had to be stopped overnight
    • CERN: EOS-PUBLIC badly affected by an ALICE file being read from everywhere
      • It was a fallback option that was not supposed to be reached
      • As a mitigation the file was renamed, while a proper fix is being looked into
      • Our apologies for this incident!

  • LHCb reports (raw view) -
    • Activity
      • Stripping validation, user analysis, MC
    • Site Issues
      • T1
        • RAL: problems with file downloads from Castor (GGUS:132356)
        • RRC-KI: Downtime for the tape storage update, should be finished now
        • FZK: network maintenance foreseen on 12 Dec, expect possible temporary connectivity problems; also temporary file unavailability due to a disk pool migration (should be mostly transparent for the users)

Sites / Services round table:

  • ASGC: nc
  • BNL: nc
  • CNAF: The work on the first power line has started and should be finished before Christmas.
    • An upgrade of the storage link to 100 Gbps will start tomorrow
  • EGI: nc
  • FNAL: Factory kept running for CMS.
    • Final checks are being made. FTS upgrade/switch possible on Wednesday. 1 hour downtime to be confirmed.
  • IN2P3: nc
  • JINR: WNs restarted after AFS upgrade.
    • Maarten asked whether other, non-LHC VOs are using AFS; it was confirmed.
  • KISTI: nc
  • KIT:
    • Restarted all dCache "doors" in order to work around the eScience CA cert-triggered bug on Wednesday (Dec 6th).
    • Tomorrow, Dec 12th, we'll be working on our firewall from 9 to 12 CET. A downtime at-risk has been added to GOC-DB.
  • NDGF: UCPH will replace a UPS next week, probably on Wednesday. The tape robot will be offline for ~30 minutes; tape frontends unaffected. A GOCDB downtime will follow shortly.
  • NL-T1: dCache maintenance on Monday 18th to update the root certs (which for v1.88 requires a restart, see GGUS:132205). While we're at it, we'll take the opportunity to do some OS, firmware and dCache updates. https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=24479
  • NRC-KI: nc
  • OSG: NTR
  • PIC: nc
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services:
    • CASTOR upgrade for CASTORATLAS and CASTORCMS this morning
    • The recall queue is several days to a week long, due to the necessary ongoing REPACK of tapes and recalls of data for CNAF. If you need an urgent recall, contact CASTOR support.
    • EOSALICE - scheduled restart on Thursday afternoon to fix an issue preventing instance maintenance
    • EOSALICE - unstable on Thursday evening after the restart due to an intrinsic limitation of the xrootd-4.7 internal file descriptors (max 32k); a workaround was put in place. New xrootd-4.8-rc1 installed on Sunday to fix it
    • EOSATLAS - instabilities on Friday evening due to a user misusing the system. The user's activity has been limited to prevent further issues. The instance was in read-only mode during this period.
  • CERN databases: NTR
  • GGUS: NTR
  • Monitoring:
    • Recomputed the CMS_CRITICAL profile; reminder that the final reports will be sent around the 15th
  • MW Officer: nc
  • Networks: NTR
  • Security: NTR

AOB:
