Week of 170403

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see whether another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow choosing another time slot, the Tier-1 will point out the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
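The conflict check in the procedure above boils down to a per-VO interval-overlap test against the already-declared outages. A minimal Python sketch, assuming the declared downtimes have already been fetched from the downtimes calendar; the tuple layout and function names are illustrative, not a real GOCDB interface:

```python
# Sketch: does a proposed Tier-1 "outage" downtime overlap an outage
# already declared by another Tier-1 supporting the same VO?
# Downtimes are represented as (site, vo, start, end) tuples; in practice
# they would come from the downtimes calendar (e.g. GOCDB).
from datetime import datetime

def overlaps(start_a, end_a, start_b, end_b):
    # Two half-open intervals [start, end) overlap iff each one
    # starts before the other one ends.
    return start_a < end_b and start_b < end_a

def find_conflicts(proposed, declared):
    # Step 1 of the procedure: list declared outages of *other* Tier-1
    # sites supporting the same VO that overlap the proposed slot.
    site, vo, start, end = proposed
    return [d for d in declared
            if d[0] != site and d[1] == vo
            and overlaps(start, end, d[2], d[3])]

# Illustrative data: two declared outages, one proposed slot.
declared = [
    ("RAL-LCG2", "atlas",
     datetime(2017, 4, 10, 8), datetime(2017, 4, 10, 16)),
    ("IN2P3-CC", "lhcb",
     datetime(2017, 4, 10, 8), datetime(2017, 4, 10, 16)),
]
proposed = ("FZK-LCG2", "atlas",
            datetime(2017, 4, 10, 14), datetime(2017, 4, 10, 20))
conflicts = find_conflicts(proposed, declared)
# "conflicts" is non-empty -> step 2 applies: choose another time slot.
```

If the list is non-empty and no other slot is possible, step 3 applies and the conflict goes to the SCOD mailing list and the next operations call.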

Links to Tier-1 downtimes




  • local: Andrea (MW Officer + FTS), Belinda (storage), Gavin (computing), Julia (WLCG), Liana (databases), Maarten (SCOD + ALICE + GGUS), Marian (networks), Michal (ATLAS)
  • remote: FaHui (ASGC), Alberto (monitoring), Andrew (NLT1), David B (IN2P3), David M (FNAL), Di (TRIUMF), Dimitri (KIT), Gareth (RAL), Kyle (OSG), Luca (CNAF), Oliver (CMS), Pepe (PIC), Ulf (NDGF), Vincenzo (EGI), Xin (BNL)

Experiments round table:

  • ATLAS reports -
    • Activities:
      • reprocessing - jobs sometimes require slightly more memory than 2 GB/core
      • frontier problems
        • started during the weekend (only Lyon and RAL were affected)
        • there is ongoing work on understanding frontier queries
        • tasks needing a lot of data from frontier were paused
    • Problems:
      • Taiwan-LCG2
        • Transfers from Taiwan were failing with "Internal server error" and to Taiwan with "Could not set service name" (GGUS:127403) - IPv6 setting problems
        • Unavailable and corrupted files at TAIWAN-LCG2_LOCALGROUPDISK and TAIWAN-LCG2_PHYS-SM (GGUS:127429) - files should be declared lost

  • CMS reports -
    • 24/7 running, recording cosmic rays, began on March 27
    • low production and analysis activity at the moment
    • CMS informed collaboration about planned interventions
      • Tuesday, April 4th, 13:30: CMSONR will be migrated back to P5
      • Tuesday, April 4th, 10:00: FTS3 service upgrade
    • Report on March 28 that CMS jobs are killing VMs
      • Used the given information and also asked for more details to track down the jobs
      • Identified the pilot and sent the job information to the analysis and production teams
      • The jobs did not show abnormal behavior, so the killed VMs cannot be explained
      • Would it be possible to get more information in such cases, e.g. job logfiles or hypervisor logfiles?
      • Oliver: those jobs were running on condor at CERN,
        submitted through the CEs and executed from CMS pilots.
    • LSF problem on March 29
      • LSF batch system lost all jobs (running and pending), OTG:0036705
      • No big complaints from users and production
    • Oracle Kibana monitoring problem March 30
      • Tracked down to problem with the stats collector machine by Kate
    • Network problems between Italy and CERN on March 30
      • root cause has been identified in a faulty linecard in the GEANT router in Geneva
      • OTG:0036738
    • Puppet and cron jobs didn't get along well
      • All crontab entries on the CERN cvmfs-* release manager machines were wiped and needed to be restored by the release managers
      • Reports that other CMS machines were affected as well
    • EOS monitoring
      • since Friday, the EOS availability Kibana monitoring has shown EOS as not available, although network traffic indicates that EOS is working fine
      • ticket: INC:1329643
      • Belinda: we see the same problem with all EOS instances and are looking into it

  • ALICE -
    • Very high activity
      • The record number of running jobs was repeatedly surpassed, reaching up to 134k on Sunday!

  • LHCb reports -
    • These items were provided after the meeting:
    • Activity
      • MC Simulation, Stripping
      • The staging campaign for Stripping27, Stripping28 and Stripping24b, as well as the 2015 EM, should take 6 to 7 weeks, with peaks of staging.
    • Site Issues
      • T0:
        • The 3 gridftp doors were saturated. Added 2 new ones.
      • T1:
        • RAL: suffering a huge issue with SRM. Under investigation
        • CNAF: Stager was blocked for a while
        • FZK: Seems to have found a reasonable balance between timeouts and performance for transfers

Sites / Services round table:

  • ASGC:
    • FaHui: the IPv6 issue is solved and we are looking into the file corruptions
  • BNL: NTR
  • CNAF:
    • Maarten: please acknowledge also the GGUS test alarms promptly, thanks!
  • EGI: NTR
  • IN2P3: NTR
  • JINR:
    • All OK for the last week.
    • The SE Disk end-point will be closed tomorrow for 6 hours for a major upgrade of dCache 2.13->2.16 and Postgres 9.4->9.6.
    • The SE MSS end-point will be closed on 07-04-2017 for 6 hours for a major upgrade of dCache 2.13->2.16 and Postgres 9.4->9.6.
  • KISTI:
  • KIT:
    • Dimitri: we will replace our site BDII hosts and also their alias, can we just update the URL in the GOCDB?
    • Maarten: that should indeed be sufficient, as top BDIIs should regularly update their configurations;
      when the new alias has been declared, the old hosts should see their clients disappear
      • after the meeting: the ALICE LDAP server will also need to be updated
  • NDGF:
    • Ulf:
      • 1 ALICE disk array was moved OK to another rack today
      • slowly recovering from the ATLAS Frontier incident over the weekend
  • NL-T1: NTR
  • NRC-KI:
  • OSG:
    • Kyle: WLCG security officer Romain Wartel will be meeting the new OSG security officer Susan Sons
  • PIC: NTR
  • RAL:
    • There are problems with the LHCb Castor instance. We are working with LHCb (Raja) on it, but it is still not understood.
    • Alastair is involved in investigating the problems with the ATLAS Frontier service.

  • CERN computing services:
    • A standard manual procedure on LSF ended up with a bad system file, resulting in its state directory not being mountable: all jobs (running and pending) were lost [ OTG:0036706 ]. There were subsequent issues with CREAM CEs connecting and then suffering from "missing jobs". The HTCondor service was not affected.
    • The Puppet 4 upgrade affected a few services when a default changed, and manually managed cron jobs were lost (in some cases with no backup).
  • CERN storage services:
    • FTS: upgrade to v 3.6.7 planned for tomorrow, April 4th, at 10:00 CEST (OTG:0036732). The service will be unavailable from 10:00 to 12:00; experiments are advised to switch to another FTS instance during the intervention (if possible)
    • EOSCMS went down several times over the weekend due to LevelDB error. We have rolled back to the previous version and are looking into the problem.
    • EOSLHCB crashed several times over the weekend due to being over-loaded. We have added more capacity to handle the extra load.
    • EOS is having monitoring problems (outdated probe information is being displayed): the "Service unavailable" status is wrong. We are investigating.
  • CERN databases:
  • GGUS:
    • Last week's release and alarm tickets went well
      • CNAF only responded to their ticket the next day, though
  • Monitoring:
    • Draft SAM3 reports for March 2017 sent to the WLCG office.
  • MW Officer: An issue affecting IPv6 transfers to DPM when gridftp redirection is enabled has been discovered (GGUS:127285). We advise sites not to enable gridftp redirection until the issue is fixed.
  • Networks:
  • Security: NTR


Topic revision: r17 - 2017-04-04 - MaartenLitmaath