Week of 141027

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Maria Dimou (chair & notes), Luca Canali (CERN DB), Alessandro Fiorot (CERN Data Mgmt), Hervé Rousseau (CERN Data Mgmt), Tsung-Hsun Wu (ASGC), Xavier Espinal (CERN Storage), Maarten Litmaath (ALICE), Aris Angelogiannopoulos (CERN Batch & Grid Services).
  • remote: Christian (NDGF), Dea-Han Kim (KISTI), Sonia (CNAF), Tiju Idiculla (RAL), José Hernandez (CMS), Rolf Rumler (IN2P3), Onno Zweers (NL-T1), Vladimir Romanovskiy (LHCb), Josep Flix (PIC), Kyle Gross (OSG).

Experiments round table:

  • ATLAS reports (raw view) - Not connected.
    • Daily Activity overview
      • Deletion backlog at the FZK and Taiwan Tier-1s - DDM ops moved the DATADISK endpoints to different machines (the SCRATCHDISK endpoints stayed)
    • CentralService/T0/T1s

  • CMS reports (raw view) -
    • Sunday: LSF problems:
      • Group : Pending job threshold reached
      • Only affected/affects CMS
      • A user wrote a script to submit jobs to LSF; by mistake it was calling itself, submitting more than 10^5 jobs
      • LSF support (Ulrich) helped on Sunday (many thanks!): "I have inactivated the queue and killed first the running jobs, then all the pending ones. All jobs are gone now, and the queue is active again"
    • Monday (today): starting 3 week cosmic data taking run
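The runaway submission above is a classic failure mode. A minimal sketch of one defence, assuming LSF's standard LSB_JOBID environment variable (set inside a running batch job) and using a hypothetical queue name and payload script, is a submit wrapper that refuses to run when it is itself already executing inside a batch job, which breaks the accidental self-submission loop:

```shell
#!/bin/sh
# Hypothetical submit-wrapper guard. Assumes LSF exports LSB_JOBID inside
# a running batch job; queue and payload names are placeholders.
if [ -n "${LSB_JOBID:-}" ]; then
    # We are already inside a batch job: a wrapper that reaches this point
    # is probably resubmitting itself, so stop instead of looping.
    echo "already inside LSF job ${LSB_JOBID}; refusing to submit" >&2
    exit 1
fi
for i in 1 2 3; do
    # 'echo' stands in for the real bsub call in this sketch
    echo "bsub -q cmscaf payload.sh $i"
done
```

Run outside LSF (LSB_JOBID unset) the wrapper submits normally; inside a job it exits immediately with an error.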

  • ALICE -
    • the pilot (job agent) code has been patched to run as many tasks as possible within the pilot lifetime
    • KIT: high OPN traffic to CERN due to raw data reprocessing that did not find the data available locally
      • experts found a misconfiguration of the ALICE storage system at KIT

  • LHCb reports (raw view) -
    • MC and User jobs. Prestaging from tapes for Stripping21 campaign has started.
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: not connected
  • CNAF: ntr
  • FNAL: not connected
  • GridPP: not connected
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: ntr
  • KIT: not connected
  • NDGF: The network connection to the Slovenian sites will be reconfigured for IPv6; the intervention should be brief, hopefully with no service interruption.
  • NL-T1: ntr
  • OSG: ntr
  • PIC: ntr
  • RAL: ntr
  • RRC-KI: not connected
  • TRIUMF: not connected

  • CERN batch and grid services: The batch problem experienced yesterday (user submitting too many jobs) was fixed.
  • CERN storage services: ntr
  • Databases: ntr
  • GGUS: not represented
  • Grid Monitoring: not present
  • MW Officer: joins on Thursdays

AOB:

  • Service Incident Report has been provided for the CASTOR SRM-CMS instabilities two weeks ago (11 and 14 of October): WLCG SIRs

Thursday

Attendance:

  • local: Maria Dimou (chair & notes), Luca Canali (CERN DB), Hervé Rousseau (CERN Data Mgmt), Felix Lee (ASGC), Xavier Espinal (CERN Storage), Maarten Litmaath (ALICE), Aris Angelogiannopoulos (CERN Batch & Grid Services), Andrea Manzi (MW Officer).
  • remote: Christian (NDGF), Soyun (KISTI), John Kelly (RAL), Christoph Wissing (CMS), Rolf Rumler (IN2P3), Dennis van Dok (NL-T1), Vladimir Romanovskiy (LHCb), Josep Flix (PIC), Lisa Giachetti (FNAL), Michael Ernst (BNL), Thomas Hartmann (KIT).

Experiments round table:

  • ATLAS reports (raw view) -
    • Apologies: most probably no one from ATLAS can connect today
    • Full SCRATCHDISKs at CA-VICTORIA-WESTGRID-T2 and AUSTRALIA
      • Caused by user amoroso - DAST contacted

  • CMS reports (raw view) -
    • One machine of the CMS Glidein Infrastructure ran wild with too many connections
    • Launched new MINIAOD production round today

  • ALICE - NTR

  • LHCb reports (raw view) -
    • MC and User jobs. Prestaging from tapes for Stripping21 campaign.
    • T0: NTR
    • T1: Problem with ROOT6 application at PIC

Sites / Services round table:

  • ASGC: ntr
  • BNL: connected but inaudible
  • CNAF: not connected
  • FNAL: voms-proxy-init didn't work yesterday. The outage was published HERE but lasted longer than expected. CERN DB and Grid services will investigate whether the information could have been announced better, so as to reach all users. So far it seems that all due steps were taken, as the Service Status Board and GOCDB were updated. Christoph and Lisa will check whether a need for additional communication channels between CMS and USCMS should be brought to next week's WLCG Ops Coord meeting.
  • GridPP: not connected
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: ntr
  • KIT: A SIR is ready for the "tapes not readable" problem, which is now understood: an end-of-data marker was written at the wrong position, so the tape appeared to have reached its end prematurely. A workaround is now in place. SIR linked as KIT_SIR_Storage_20141023.pdf
  • NDGF: ntr
  • NL-T1: A short DPM service interruption occurred yesterday when a partition filled up due to heavy ATLAS activity. The problem appeared shortly before noon and was solved at 13:00. GOCDB was in read-only mode, so an announcement couldn't be published.
  • OSG: A seemingly transient issue with the CERN VOMS was noticed yesterday. GGUS:109730 was created to track the issue. OSG will add the new CERN VOMS entries to the OSG VO Package for the upcoming November 11th release. ATLAS and CMS haven't yet opened access to the new servers for their users as they haven't finished testing their workflows.
  • PIC: ntr
  • RAL: ntr
  • RRC-KI: not connected
  • TRIUMF: not connected

  • CERN batch and grid services:
  • CERN-site: Quattor will be stopped tomorrow for all services except a few named experiment ones, which will keep it for another month. The lxadm service will be closed down in favour of aiadm.cern.ch; lxvoadm.cern.ch will remain open until the end of November. In principle this change concerns CERN/IT only, so it should be transparent; the experiments have one more month to move their services.
  • CERN storage services: CASTOR was moved from Quattor to Puppet; an unavailability was observed between 11:10 and 11:20 while the DNS alias was being moved. Now all is OK.
  • Databases: Complete info on yesterday's incident reported by Lisa (FNAL): the downtime of the DB on Demand infrastructure, announced at https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&n=OTG0015412&plogin=true, was scheduled for the morning of October 29th by the IT-DB infrastructure team as an urgent change. It was needed to follow up on a series of incidents that had started on Sunday the 26th with sporadic reboots and had escalated by Wednesday morning into a major incident, judged a critical situation that could only be fixed with a hard reboot. Instead of performing the reboot immediately in the morning, the infrastructure team postponed it to outside official CERN working hours to reduce the impact.
  • GGUS: Release next Wed Nov. 5th with ALARM tests, as usual.
  • Grid Monitoring: not present
  • MW Officer: Reminder for sites running dCache 2.2.x: the decommissioning deadline is 31-10-2014 (https://wiki.egi.eu/wiki/Software_Retirement_Calendar#dCache_v._2.2.X). Sites that don't upgrade will receive tickets.

AOB:

Topic revision: r16 - 2014-10-31 - ThomasHartmann