Week of 160725

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a topic to be discussed at the daily meeting requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints make it impossible to choose another time slot, the Tier-1 will point out the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
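
The conflict check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Downtime` record, its fields, the severity strings and the site names are hypothetical and do not reflect the actual downtimes calendar or GOCDB data model. The three-week window corresponds to "the current and the following two weeks".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    # Hypothetical record; field names are assumptions, not a real schema.
    site: str
    vos: frozenset      # VOs supported by the site
    severity: str       # assumed values: "outage" or "warning"
    start: datetime
    end: datetime


def overlaps(a, b):
    """True if the two time intervals share any time."""
    return a.start < b.end and b.start < a.end


def outage_conflicts(downtimes, window_start, weeks=3):
    """Return (site_a, site_b, shared_vos) for overlapping "outage"
    downtimes of different sites that support at least one common VO,
    restricted to the window [window_start, window_start + weeks)."""
    window_end = window_start + timedelta(weeks=weeks)
    outages = [
        d for d in downtimes
        if d.severity == "outage"
        and d.start < window_end and d.end > window_start
    ]
    conflicts = []
    for a, b in combinations(outages, 2):
        shared = a.vos & b.vos
        if a.site != b.site and shared and overlaps(a, b):
            conflicts.append((a.site, b.site, sorted(shared)))
    return conflicts
```

A site planning an outage could run such a check with its candidate slot added to the list of declared downtimes; the SCOD could run it once per shift on the declared downtimes alone.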

Links to Tier-1 downtimes

ALICE  ATLAS  CMS   LHCB
       BNL    FNAL

Monday

Attendance:

  • local: Maria A (chair and minutes), Julia (WLCG), Luca (Storage), Andrea M (MW Officer, FTS), Maria D (GGUS), Tomas (ATLAS), Maarten (ALICE), David (Network), Fa-Hui Lin (ASGC), Shao Ching (ASGC), Jose (DB)
  • remote: Stefano (CMS), Eric (BNL), Lucia (CNAF), Dave (FNAL), Sang Un (KISTI), Alexander (NL-T1), Kyle (OSG), Tiju (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • Activities:
      • Two T0 spillover tasks completed fine, their derivations are done. ICHEP derivations almost done, last run is being processed, to finish today/tomorrow.
      • Low on MC simulation again, 10M events left to be done; asking the MC team to keep at least 100M events in the system.
    • Problems:
      • Export rate from EOS close to 6 GB/s, maximum throughput we can get with the current hardware.
      • Discussions last week with CMS on CERN-RAL network link saturation. Summary and actions put together.
      • TZDISK space getting tight, lifetime reduced to 2 weeks from 2.5 weeks, mostly RAW occupancy.
      • Data Quality monitoring service was affected by migration of nodes at CERN last week. The DQ server was replicated to another machine in the meantime. A long term robust solution with DQ team has been under discussion for some time.

Tomas also adds that ATLAS is experiencing some slow transfers from EOS to CASTOR. Luca explains that this is due to the tape incident during the weekend: a backlog accumulated and is slowing down transfers. Luca adds that two more gridftp doors are to be added to EOS-ATLAS. David from networking asks whether these machines are connected to the right ports to provide the needed throughput; Luca replies that, as the EOS team understands from the networking team, this is the case.

Maarten reminds that there was also an ALARM ticket during the weekend affecting EOSATLAS (GGUS:123045). This is mentioned in the Storage report in more detail.

  • CMS reports (raw view) -
    • Computing resources very busy but coping.
    • Data flow out of T0 at a satisfactory level (stable at 4 GB/s), backlog going down. Close to being limited by T1 ingestion capability; tuned down transfers to KIT and RAL to avoid eating into the ATLAS share.
    • No special problem to report.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Activity
      • Monte Carlo simulation, data reconstruction/stripping and user jobs on the Grid
    • Site Issues
      • T0:
        • Backlog for tape migration, due to a problematic tape library.
        • Pilot problem (cannot estimate lifetime of proxy)
      • T1: Issue with pilot at CNAF, similar to CERN

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: Scheduled downtime for storage system intervention affecting CMS, from today until tomorrow. See GOCDB downtime for more details.
  • FNAL: NTR
  • GridPP: NA
  • IN2P3: NA
  • JINR: No notable problems. Our SL67 has been upgraded to SL68.
  • KISTI: NTR
  • KIT: NA
  • NDGF: NA
  • NL-T1: NTR
  • NRC-KI: NA
  • OSG: NTR
  • PIC: NA
  • RAL: We are working to get our failover 10G OPN link into active usage.
  • TRIUMF: NA

  • CERN computing services: NA
  • CERN storage services:
    • EOSATLAS out of memory: the master headnode ran out of memory at 22:20; this triggered a no_contact alarm and the node was power-cycled. The service was restored in read-only mode at 22:40 and was fully back in read-write mode at midnight. The underlying issue of the recent instability has been found, and an update with the latest fixes is available.
    • CASTOR tape library unavailable: accessor and gripper issues during the weekend. As a result, mounts failed and consequently tape transfers piled up over the weekend. Write transfers have been diverted to other libraries this morning. An IBM engineer is currently on site to inspect the library and fix the problem. Currently reads from this library are not possible. Estimated time of resolution today at 18:00.
    • FTS issue on Tuesday evening, OTG:0031767. FTS DB tables remained locked after the usual scheduled DB backup (implemented by DB On Demand). For 1h 15m FTS client connections hung or got disconnected, and some running transfers were aborted; we then unlocked the tables and everything recovered. After investigation with DB On Demand support we discovered a possible long-running FTS query which may interfere with backup operations. For now we have disabled the DB On Demand backup (which locks the tables before running) and implemented our own backup, which runs without locking.
  • CERN databases: A rolling intervention in CMSONR and CMSINTR databases tomorrow to replace a broken chassis
  • GGUS: Reminder: Important changes on the release this Wed. jira-1526, jira-1527, jira-1528 explain all.
  • Monitoring: NA
  • MW Officer: First C7 UI bundle and rpm available for testing (/cvmfs/grid.cern.ch/centos7-ui-test and https://jenkins.indigo-datacloud.eu:8080/job/emi-ui/3/artifact/RPMS/emi-ui-4.0.0-1.el7.centos.x86_64.rpm).
  • Networks: NTR
  • Security: NTR

AOB:

  • A new egroup, wlcg-ops-reminder, has been created to remind people on Monday morning about the WLCG Operations meeting. Current members of this egroup are:
    • wlcg-scod
    • Vincent Brillault
    • Andrea Manzi
    • Marian Babik
    • It-dep-cm-is-rota (Computing people)
  • If T1s or experiments would like to subscribe to this egroup, please let us know.
  • We shall use this egroup to communicate with the participants of the meeting and ask for reports on particular issues if necessary.

Luca confirms that eos-admin and castor-admin egroups could be added as well. Jose will confirm what to use for DB.

  • wlcg-tier2-contacts mail deluge last Friday
    • an ordinary message was sent to wlcg-operations late Fri morning CEST
    • that list includes the Tier-2 contacts list
    • a non-delivery report for one of its members was sent back to the list
      • instead of the sender
    • that message then generated another NDR message and so on
    • the mail server for another site marked such messages as infected
      • maybe because the "same" message was being re-sent all the time
    • also that mail server sent its alerts to the list
    • after 50 messages the mail server of yet another site reached a limit
      • for subsequent messages it sent warnings, yet again to the list
    • several tickets were opened to get the deluge to stop
    • other lists were used as well to alert experts
    • as of 13:27 CEST the traffic stopped after the list was blocked
      • it was kept closed until Monday afternoon
    • the incident amounted to almost 3500 messages
    • admins of misbehaving mail servers have been asked to fix their configurations
    • the settings of the Tier-2 list have been improved
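
The minutes do not say how the list settings were changed. As an illustration only, one common safeguard against such loops is for a list to refuse automatically generated mail. The sketch below assumes a hypothetical pre-delivery hook that receives the raw message text; the function name is made up for this example. It relies on three standard markers: the `Auto-Submitted` header (RFC 3834), an empty `Return-Path` (`<>`) as used for bounces, and the `multipart/report` content type used by delivery status notifications.

```python
from email import message_from_string


def is_auto_generated(raw_message: str) -> bool:
    """Heuristic: True if the message looks like a bounce or auto-reply
    that a mailing list should not redistribute to its members."""
    msg = message_from_string(raw_message)

    # RFC 3834: any value other than "no" marks an automatic message.
    auto = (msg.get("Auto-Submitted") or "no").strip().lower()
    if auto != "no":
        return True

    # Bounces are sent with an empty envelope sender.
    if (msg.get("Return-Path") or "").strip() == "<>":
        return True

    # Delivery status notifications use multipart/report.
    if msg.get_content_type() == "multipart/report":
        return True

    return False
```

A list applying such a filter would have dropped the first non-delivery report instead of forwarding it to all members, which is the step that started the chain described above. Mail servers that omit all three markers would still get through, so this is a mitigation rather than a complete fix.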
Topic revision: r17 - 2016-07-26 - VictorZhiltsov
 