Week of 151012

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Andrea Sciabà, Asa Hsu (ASGC), Ignacio Reguero (IT-PES), Maarten Litmaath (ALICE), Alessandro Di Girolamo (ATLAS)
  • remote: Dmytro Karpenko (NDGF), Federico Stagni (LHCb), Francesco Noferini (CNAF), Rolf Rumler (IN2P3-CC), Pavel Weber (KIT), Sang Un Ahn (KISTI), Onno Zweers (NL-T1), Tiju Idiculla (RAL), Kyle Gross (OSG)

Experiments round table:

  • ATLAS reports (raw view) -
    • The ATLAS pilot factory was in trouble on Friday. The problem started early Friday morning (3am) and disappeared around 11am. It caused resource usage to drop to ~100k (from 200k) for a few hours. The cause is not clear and the problem resolved by itself. It may have been a network issue in the CERN computing centre, but there was no news on the IT SSB. Can anyone from CERN comment on whether other problems were seen?
    • DDM errors "stalled FTS connection": one node on the BNL FTS was drained over the weekend. No evidence of the issue reappearing after the change was made.
    • ATLAS submitted an alarm ticket to CERN, GGUS:116841, to quickly get the quota of EOSATLAS TZDISK increased from 2.5PB to 2.8PB. This was not a CERN site fault; the GGUS alarm workflow was used to get in touch quickly with the EOS experts (many thanks to them for quickly updating the values, together with one ATLAS expert who can also update the quota).

  • CMS reports (raw view) -
    • This week it is CMS Computing and Offline Week - likely nobody from CMS available to call in
    • High activity since end of last week
      • re-miniAOD campaign
    • Problem with bad Site Readiness due to CE issues at CERN is solved: GGUS:116468
    • Lost some fractions of a few datasets due to a bug in DDM (Dynamic Data Management)
      • Files got removed before proper archiving or replication
      • Only derived data affected
        • Lost files will be re-created or replaced by new versions from upcoming campaigns

  • ALICE -
    • high activity

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of pp data at T0/1/ sites, Monte Carlo mostly at T2, user analysis at T0/1/2D sites
      • Data processing stopped for Full Stream and Turbo Calibration due to conditions distribution problem - ready to restart
    • DTs:
      • OUTAGE DTs for CERN and RAL tomorrow, overlapping: can this be avoided?

Andrea mentions that Tier-1 sites will be asked from now on to do their best to check for possible conflicts with other Tier-1 sites when declaring a new downtime (of type OUTAGE). Currently it is not easy using GOCDB, but the plan is to have Google calendars with all T1 and T2 downtimes (separately).

Sites / Services round table:

  • ASGC: ntr
  • BNL:
  • CNAF:
    1. We will migrate to LSF9 as soon as we can. We currently have a problem on LSF9 with ATLAS jobs, which prevents us from moving to LSF9. We urgently need support from ATLAS experts to understand and fix the problem (pilot jobs are not able to get any task from ATLAS).
    2. We tried, as previously announced, to update the OS of the general purpose router. Unfortunately the update failed and we had to roll back to the previous version of the OS. We are investigating the problem with Cisco before trying the update again.

Alessandro asked Francesco to send an email to him and Salvatore Tupputi to understand the problem with the LSF9 queue.

  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI:
  • KIT: ntr
  • NDGF: tomorrow, electrical power outage in Copenhagen affecting both compute and data services for the whole day. On Thursday, Oslo and Umea will have their storage unavailable for half an hour
  • NL-T1: Network maintenance of dCache on October 19

Andrea will check for downtime conflicts and will report how he did it.

  • NRC-KI:
  • OSG: All CMS sites received a ticket concerning PhEDEx, not clear if and how OSG operations can help. Andrea will contact CMS for clarifications
  • PIC:
  • RAL: reminder of tomorrow's downtime affecting CASTOR for all VOs
  • TRIUMF:

  • CERN batch and grid services: reminder of tomorrow's downtime to upgrade LSF; it should be "almost" transparent
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Andrea Sciabà, Ignacio Reguero (IT-PES), Maarten Litmaath (ALICE), Xavi Espinal (IT-DSS), Jesús López (IT-DSS), Pablo Sáiz (IT-SDC)
  • remote: Michael Ernst (BNL), Asa Hsu (ASGC), Dmytro Karpenko (NDGF), Rolf Rumler (IN2P3-CC), Oliver Gutsche (CMS), Thomas Hartmann (KIT), Tiju Idiculla (RAL), Matteo (CNAF), Federico Stagni (LHCb), Di Qing (TRIUMF), Rob Quick (OSG), Dennis van Dok (NL-T1)

Experiments round table:

  • CMS reports (raw view) -
    • 25ns Data taking, normal operation
    • Production overview: High activity
      • Re-MINIAOD campaign and MC production/processing, expecting partial data re-reco pass to start soon
    • Routine campaign to change PhEDEx DB credentials in local PhEDEx installations at all sites ongoing
      • This is rare but necessary for security reasons
      • Used the GGUS bulk tool to open tickets to all sites, but with the non-optimal settings "top priority" and "incident"; they should have been "Urgent" (has to be done by November 2nd) and "Change Request"
        • This caused some confusion, which site admins resolved themselves.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of pp data at T0/1/ sites, Monte Carlo mostly at T2, user analysis at T0/1/2D sites
      • Data processing ramp up at T0/1
    • T0
      • Ticket about aborted SRM at CERN: can be closed for us, but left open presumably for further investigations (GGUS:116897)
      • Ticket about aborted SRM at CERN: LHCb EOS down - closed (GGUS:116877)
    • T1
      • Ticket about aborted transfer failures to SARA (GGUS:116939)

Xavi explains that the cause of the problem in SRM was BeSTMan having been broken by the automatic update of some packages in the OSG repository. During the meeting Rob investigates and finds that the problem is due to a Globus package; it is understood and a fix will be available soon.

Federico asks what happened with the LSF upgrade at CERN. Ignacio explains that LSF9 showed serious scalability problems and it was necessary to roll back to LSF7. A ticket has been opened with IBM.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: the migration to LSF9 is ongoing. The problem with ATLAS reported at the previous meeting has been solved. Two CEs (and 5k cores) are on LSF9, while the other CEs (and 11k cores) are still on LSF7.
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI:
  • KIT: yesterday there was an emergency downtime affecting ATLAS, caused by a problem with the dCache database; it lasted about one hour.
  • NDGF: ntr
  • NL-T1: ntr
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: ntr
  • TRIUMF: ntr

  • CERN batch and grid services:
  • CERN storage services: The Quattor head nodes of EOS for ATLAS had a certificate close to expiry, and after replacing it, it was necessary to restart the namespace. Unfortunately the namespace crashed during the restart, causing an unexpected downtime of 40'. The same will need to be done for ALICE next Thursday. EOS for CMS is not concerned as it was puppetized half a year ago.
  • Databases: yesterday the read-only copy of the ATLAS offline database was partially down for 15' due to a switch problem. Only a few nodes in the cluster were affected.
  • GGUS:
    • GGUS was unavailable on 13/10/2015 from 15:39 CEST to 16:19 due to the troubleshooting of the external KIT firewall
    • Future unscheduled downtimes will be announced following this procedure. Please comment in the ticket if you have any suggestions
  • Grid Monitoring:
  • MW Officer:
    • Regarding the slapd crashes affecting Top BDIIs and ARC-CE resource BDIIs running the latest version of openldap (2.4.40-5/2.4.40-6), RedHat may have found the cause of the issue. They will soon provide us with a fix for testing.
    • Sites should stay with the version of openldap which is working OK (2.4.39) until the fix is tested and released
    • DPM 1.8.10 pushed to EPEL stable today, it fixes an issue affecting CMS AAA

AOB:

Andrea: tried to check for downtime conflicts in GOCDB in two ways:

  • "Active and imminent downtimes" (link). Information is very poorly presented but I completed the check in one minute. Some margin for error.
  • Checking the Downtimes list for every Tier-1, searching them by name. It took around 3 minutes. Safer but it takes more time.
  • OSG downtimes (link). Clicked "All Resource Groups" and then "Update page". Not very familiar with the page, but the results are easy to interpret (one has to look only for BNL or FNAL).
Conclusion: a Google calendar (or equivalent) is sorely needed. The first method above is the one I recommend right now.
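
Until such a calendar exists, the manual check described above could in principle be scripted. The following is only a minimal sketch of the overlap test, not a tool in use: it does not query GOCDB or the OSG pages, and the Downtime record, the find_conflicts helper and the sample times are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    """Hypothetical record for one declared downtime."""
    site: str
    start: datetime
    end: datetime
    severity: str = "OUTAGE"


def overlaps(a, b):
    """True if the two downtimes share any time interval."""
    return a.start < b.end and b.start < a.end


def find_conflicts(downtimes):
    """Return pairs of OUTAGE downtimes at different sites that overlap."""
    outages = [d for d in downtimes if d.severity == "OUTAGE"]
    return [
        (a, b)
        for a, b in combinations(outages, 2)
        if a.site != b.site and overlaps(a, b)
    ]


# Hypothetical entries: the start and end times are made up for illustration.
sample = [
    Downtime("CERN", datetime(2015, 10, 13, 8, 0), datetime(2015, 10, 13, 12, 0)),
    Downtime("RAL", datetime(2015, 10, 13, 10, 0), datetime(2015, 10, 13, 16, 0)),
    Downtime("NL-T1", datetime(2015, 10, 19, 9, 0), datetime(2015, 10, 19, 11, 0)),
]

conflicts = find_conflicts(sample)
for a, b in conflicts:
    print(f"{a.site} and {b.site} have overlapping OUTAGE downtimes")
```

Comparing every pair of sites is enough here because the number of Tier-1 sites is small; the downtime list itself would still have to be exported by hand or by some other means.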

Rob provided convenient links to see downtimes for BNL and FNAL as a web page and in XML.

Topic revision: r17 - 2015-10-15 - AndreaSciaba
 