Week of 151005

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Maria Alandes (Chair and minutes), Iain Bradford Steers (Grid and Batch), Andrei Dumitru (DB), Xavier Espinal (Storage), Asa Hsu (ASGC), Maarten Litmaath (ALICE), Andrew McNab (LHCb)
  • remote: Michael Ernst (BNL), Francesco Noferini (CNAF), Lisa Giachetti (FNAL), Rolf Rumler (IN2P3), Dimitri (KIT), Ulf Tigerstedt (NDGF), Onno Zweers (NL-T1), Kyle Gross (OSG), Pepe Flix (PIC), Gareth Smith (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • FTS server issue at BNL (GGUS:116603): the problem "disappeared", but we need to follow up. Alejandro commented
    • On Friday the ATLAS DBA experts noticed "odd" deletion attempts. They were caused by a filter that was missing under particular conditions, which had not occurred before Friday. Luckily nothing was deleted. The Rucio team fixed the issue and, together with the DBAs, is considering how to add further protections.

  • CMS reports (raw view) -
    • I (Christoph) have to leave at 15:15 for another appointment - please address issues to me in case I have to leave before the end of the call
    • Rather moderate load
      • Waiting for new campaigns to be launched
    • Ticketing campaign to update Phedex DB connection secrets
      • To be done every few years
      • A main developer has recently left CMS

Maria asks whether there is any news on the SAM issues currently being experienced at CERN. Christoph explains that there are internal discussions within CMS about shortening the lifetime of pilots, but nothing has been decided yet. Iain explains that so far the problem seems to be related to priority scheduling issues. It only affects LSF batch; the Condor-based service is working OK.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of pp data at T0/1 sites, Monte Carlo mostly at T2, user analysis at T0/1/2D sites
      • A central LHCb cvmfs problem has been tracked down and fixed. This affected some jobs at all sites as new software and conditions versions in October could not be pushed out.
    • T0
      • Failed EOS (GGUS:116608) transfers appear to be due to wider cvmfs problems (CRLs?)
      • CERN-RAW (GGUS:116321). This is related to very large numbers of small files being produced by the online system.
    • T1
      • RAL DB downtime tomorrow.

Andrew explains that the CVMFS problem was caused by a cron job that failed because of bad syntax, which prevented publication to the LHCb CVMFS repository. Measures have now been taken to get notifications for failed cron jobs in the future.
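As an illustration of the kind of notification described above, here is a minimal sketch of a wrapper that runs a command and reports a non-zero exit status. It is not the mechanism LHCb put in place; the function name run_and_report and the notify callback are hypothetical and stand in for whatever alerting channel (email, monitoring) a site actually uses.

```python
import subprocess


def run_and_report(command, notify):
    """Run `command` (a list of arguments) and call `notify(message)` if it fails.

    `notify` is a hypothetical callback: any callable taking one string.
    Returns the exit code of the command.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Include the exit code and stderr so the failure can be diagnosed.
        notify(
            "cron job failed: {} (exit {})\n{}".format(
                " ".join(command), result.returncode, result.stderr
            )
        )
    return result.returncode
```

A cron entry could then call a small script built around this wrapper instead of calling the publication command directly, so that a syntax error in the job produces an alert rather than a silent failure.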

Sites / Services round table:

  • ASGC: A router will be replaced on October 14th. The downtime will last from 14.10 at 2pm to 15.10 at 2am.
  • BNL: There are no issues with T1 services. FTS server processes keep crashing at a high rate, leaving big core dumps each time they crash. BNL is in contact with the main developer. BNL confirms they are running the latest version of the FTS release.
  • CNAF: We have planned an upgrade of the OS in the frontier general-purpose (non LHC) router to fix an OS bug. The intervention may cause a network interruption of a few minutes on that connection. We will inform you as soon as the date of the intervention is scheduled.
  • FNAL: NTR
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NA
  • KIT: All services are now online after the downtime scheduled last week. Maarten mentions that job success rate has gone down dramatically. Dimitri confirms this is a known issue and experts are looking into it.
  • NDGF:
    • ALICE tape problems: A documentation error led to too little memory being given to the dCache pools accepting ALICE tape data. This has been fixed now, and ALICE automatically resubmitted the files that got broken.
    • Downtime: On 7.10 18:00 for 3 hours there is a window when our network provider will reboot our main router. This will take a lot less time than 3 hours. So far we have a warning downtime for this.
  • NL-T1: NTR
  • NRC-KI: NA
  • OSG: NTR
  • PIC: After the dCache upgrade last week, CMS user IDs for files were misconfigured and files were not migrated to tape. Some transfers were also affected. This has been fixed now and more disk has been added to the buffer. In the coming days, the backlog of files will be cleared and the files will be copied to tape.
  • RAL: Several problems during the weekend: high ATLAS load caused some issues for the disk servers in front of tape. More disk servers have been added. LHCb disk was also unavailable during the weekend. There were also some issues with gLexec. All this is now fixed.
  • TRIUMF:

  • CERN batch and grid services: Iain reminds everyone of the LSF upgrade planned for October. More details in OTG:0025165.
  • CERN storage services: There will be an update of EOS instances in one month to the latest version. This will affect all LHC instances.
  • Databases: NTR
  • GGUS: NA
  • Grid Monitoring: NA
  • MW Officer: NA

AOB:

RAL has reported that a downtime declared in GOCDB was not propagated to the Google downtimes calendar. Maria will check with the monitoring team taking care of the calendar.

Thursday

Attendance:

  • local: Maria Alandes (Chair and minutes), Iain Bradford Steers (Grid and Batch), Xavier Espinal (Storage), Asa Hsu (ASGC), Maarten Litmaath (ALICE), Andrew McNab (LHCb)
  • remote: Michael Ernst (BNL), Lisa Giachetti (FNAL), Rolf Rumler (IN2P3), Thomas Hartmann (KIT), Ulf Tigerstedt (NDGF), Dennis van Dok (NL-T1), Kyle Gross (OSG), Gareth Smith (RAL), Sang Uhn An (KISTI), Di Qing (TRIUMF)

Experiments round table:

  • CMS reports (raw view) -
    • Likely nobody from CMS available to call in
    • Progress on the two present issues at CERN
      • GGUS:116092 - Argus failures
      • GGUS:116468 - Scheduling of SAM tests
        • Pilot lifetime reduced from one week to 3 days
        • Some problems fixed on the CEs

  • ALICE -
    • high activity
    • we thank KIT for applying more flexible cgroups limitations, allowing again for a good success rate of our jobs!

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of pp data at T0/1 sites, Monte Carlo mostly at T2, user analysis at T0/1/2D sites
      • Data processing stopped for Full Stream and Turbo Calibration due to conditions distribution problem
    • T0
      • Ticket about aborted pilots at CERN after BLAH related timeouts (GGUS:116795)
    • T1
      • Recovered from RAL storage DB downtime on Tuesday (thank you!)
      • SARA ticket about aborted pilots due to PBS authorization (GGUS:116797)

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NA
  • FNAL: NTR
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NTR
  • KIT: The memory limits reported in the VO Cards are now being enforced using cgroups, which allow more accurate memory usage accounting and make it possible to limit job RSS. Maarten explains that ALICE jobs started to fail dramatically last week once this was put in place, but that the situation has improved a lot with the latest tuning. Thomas confirms that tuning is indeed ongoing to configure cgroups properly.
  • NDGF: Due to a network intervention that caused a misconfiguration, the Danish cluster disappeared for a while, but this has now been fixed.
  • NL-T1: NTR
  • NRC-KI: NA
  • OSG: NTR
  • PIC: NTR
  • RAL: The scheduled downtime for the CASTOR system happened without any major issues. A small upgrade is still needed and is scheduled for next Tuesday. More details in GOCDB.
  • TRIUMF: NTR

  • CERN batch and grid services:
    • LSF upgrade: CERN Public Batch will be upgraded to LSF 9 on Tuesday 13th from 10:00 to 12:30.
      • Job submission and querying will not work, but running jobs should remain unaffected.
      • ATLAS T0 instance upgrade delayed due to ongoing data processing.
    • Experienced several slapd crashes with ARC-CE/top BDII servers.
      • Maria explains that there seems to be a problem with the relay/overlay mechanism in the slapd config file. For top BDIIs this could easily be worked around by removing the o=shadow relay. However, this is not so easy for ARC CEs, where more than one relay is used. This has been reported to Red Hat and we hope the developers can provide some more details. A fix for top BDII could be released in the meantime, but the problem would still occur in ARC CEs. Site and Resource BDIIs are not affected.
    • There was an ARGUS service incident on Tuesday evening, with several thousand connections to ARGUS at the same time. They seemed to originate from glexec on batch nodes. Still investigating.
  • CERN storage services:
    • The upgrade of EOS during the technical stop is being prepared. There is now an agreement to upgrade the ATLAS offline instance on 09.11.2015 from 9h00 to 10h00.
    • The latest version is running in EOSPUBLIC since last Tuesday without any problems.
    • The configuration change to CASTOR ALICE has now been implemented without issues. This is a big improvement that addresses some of the performance issues reported in recent weeks.
  • Databases: NTR
  • GGUS: NTR
  • Grid Monitoring: NTR
  • MW Officer: NTR

AOB:

Maria explains that it's not clear whether the downtimes calendar displays all T1s or only CMS T1s. This is being discussed between the SCOD team and the monitoring team in IT-SDC. Once this is clear, the T1s/T2s Google Calendar will be advertised again and a procedure to avoid downtime clashes among T1s will also be proposed.
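No procedure for avoiding downtime clashes has been defined yet. Purely as an illustration of the check such a procedure might include, the sketch below flags pairs of downtimes at different sites whose time intervals overlap. The function name find_clashes, the dictionary layout and the site names in the usage example are hypothetical; they are not taken from GOCDB or from the calendar.

```python
from datetime import datetime
from itertools import combinations


def find_clashes(downtimes):
    """Return (site_a, site_b) pairs whose downtime intervals overlap.

    `downtimes` is a list of dicts with hypothetical keys:
    "site" (str), "start" and "end" (datetime, same timezone).
    """
    clashes = []
    for a, b in combinations(downtimes, 2):
        # Two intervals overlap when each one starts before the other ends.
        if a["site"] != b["site"] and a["start"] < b["end"] and b["start"] < a["end"]:
            clashes.append((a["site"], b["site"]))
    return clashes


def example_downtimes():
    """Hypothetical sample data, only for illustration."""
    return [
        {"site": "SITE-A", "start": datetime(2015, 10, 14, 14), "end": datetime(2015, 10, 15, 2)},
        {"site": "SITE-B", "start": datetime(2015, 10, 14, 20), "end": datetime(2015, 10, 14, 23)},
        {"site": "SITE-C", "start": datetime(2015, 10, 16, 9), "end": datetime(2015, 10, 16, 10)},
    ]
```

With the sample data, find_clashes(example_downtimes()) reports only the SITE-A/SITE-B pair, since SITE-C's downtime falls on a different day.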

Topic revision: r16 - 2015-10-08 - MaartenLitmaath