Week of 150518

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Andrea Sciabà, Hervé Rousseau (IT-DSS), Lorena Lobato (IT-DB), Manuel Guijarro (IT-PES), Maarten Litmaath (ALICE)
  • remote: Michael Ernst (BNL), Hung-Te Lee (ASGC), Lisa Giacchetti (FNAL), Sang Un Ahn (KISTI), Tommaso Boccali (CMS), Tiju Idiculla (RAL), Dmytro Karpenko (NDGF), David Cameron (ATLAS), Rolf Rumler (IN2P3-CC), Pepe Flix (PIC), Dimitri (KIT), Kyle Gross (OSG)

Experiments round table:

  • ATLAS reports (raw view) -
    • Grid running very full over the weekend, not much else to report.

  • CMS reports (raw view) -
    • 2015 Production ongoing (at least the one for Trigger)
    • A few issues which were worth a ticket (mostly for CERN / storage-related problems):
      • GGUS:113687 (CERN OPEN): all the SAM tests for T2_CH_CERN* are failing on eoscms, with "ERROR: [SE][GetSpaceTokens][SRM_AUTHORIZATION_FAILURE] httpg://srm-eoscms.cern.ch:8443/srm/v2/server: not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba". It could be a nasty effect of user "sciaba" having primary group = ALEPH(xu) and not CMS(zh). But in any case, why is it happening just now (it has always been like this)? Still open; the error has changed to "CGSI-gSOAP running on samnag-ai-08 reports Error reading token data header: Connection reset by peer".
      • GGUS:113664 (CERN CLOSED): a serious problem with all the files written to EOS by T0 becoming unreadable later on, with metadata error. Solved by IT, but basically blocked T0 activities for ~12 hours at least.
      • GGUS:113657 (CERN OPEN): "the transfers to/from T0_CH_CERN are failing"; it seems to be working by now. The answer we got was that CASTOR was overloaded, but the ticket is still open. It happened again during the weekend.
      • GGUS:113032 (CERN OPEN): it seems to be a local EOS redirector misconfiguration, with some of the used ports firewalled?
      • GGUS:113707 (IN2P3 OPEN): job failures with "site-local-config.xml does not exist in CVMFS (looked at /cvmfs); is CVMFS running and configured properly?". Not completely clear if the problem is on the site or on services, investigating. More info sent to IN2P3: it seems SAM and production jobs use a different site-local-config.xml?

Hervé adds that the redirector problem has been identified as caused by a firewall misconfiguration of the node itself.

Rolf adds that there are two versions of the site-local-config.xml because at IN2P3-CC there are two CMS sites, the Tier-1 and a Tier-2. More details in the ticket.

  • ALICE -
    • CERN: intermittent problems with writing to CASTOR on Sat, cured by CASTOR operations team late Sat evening!
      • one of the headnodes had a disk full
    • CNAF: VOBOX unreachable since late morning today (GGUS:113778)

  • LHCb reports (raw view) -
    • Operations dominated by MC jobs and user analysis
    • T0:
    • T1
      • The problem contacting the SARA SRM seems to be back. Reopened ticket GGUS:113766.

Sites / Services round table:

  • ASGC: scheduled downtime this Friday to upgrade the DPM mirroring, already in GOCDB
  • BNL: ntr
  • CNAF:
  • FNAL: scheduled downtime this Thursday from 8am to 12pm Central Time for a network intervention
  • GridPP:
  • IN2P3: scheduled downtime on June 16 affecting all storage services, duration 3 hours plus 5 hours for dCache alone, possibly other services. More details will be given one week before
  • JINR:
  • KISTI: scheduled downtime on May 21 from 0700Z to 1300Z due to a network intervention, which should be transparent
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: Unable to dial in because we're in a dCache workshop, but NTR
  • NRC-KI:
  • OSG: we receive support tickets for KISTI and it is not clear why this happens. They are created in GGUS and routed to OSG support. Maarten thinks it may be a human error and invites Kyle to send the ticket ID.

Update: The ticket is https://ticket.grid.iu.edu/25313, it was related to the KISTI CMS Tier-3 (not the ALICE Tier-1) and it was intended for the VO support, not for OSG.

  • PIC: scheduled downtime on May 26 for 8 hours to apply some changes in the router for the new firewall
  • RAL: the CASTOR namespace intervention initially foreseen for tomorrow has been postponed to a date to be decided
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services: ntr
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Andrea Sciabà, Hervé Rousseau, David Cameron (ATLAS), Claire Adam (ATLAS), Manuel Guijarro (IT-PES), Maarten Litmaath (ALICE)
  • remote: Dennis van Dok (NL-T1), Dmytro Karpenko (NDGF), Hung-Te Lee (ASGC), Rolf Rumler (IN2P3-CC), Sang Un Ahn (KISTI), Gareth Smith (RAL), Thomas Hartmann (KIT), Christoph Wissing (CMS), Lucia (CNAF), Kyle Gross (OSG)

Experiments round table:

  • ATLAS reports (raw view) -
    • Collisions in ATLAS at 13 TeV but nothing on the Grid
    • Reaching new highs in running job slots (210k this morning)
    • Unexpected RAL downtime this morning caused hiccups in some very high priority jobs
    • Data transfer stress tests in the next days: T0 internal traffic + T1 export, and large-scale transfer of mc15 AODs from T1 -> T2

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Operations dominated by MC jobs and user analysis
    • LHCb Computing workshop going on, so not much operation activity
    • Downtime today though due to the Oracle DB upgrade
    • T0:
    • T1
      • The problem contacting the SARA SRM seems to be back. They claim that the fetch-crl problem could now be on the CERN side. An upgrade on the FTS servers and our VOBOX is planned.

Sites / Services round table:

  • ASGC: reminder of the downtime scheduled for tomorrow to upgrade the DPM server
  • BNL:
  • CNAF: ntr
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: the downtime scheduled today was completed without any problem
  • KIT: ntr
  • NDGF: ntr
  • NL-T1:
  • NRC-KI:
  • OSG: the KISTI ticket mentioned last Monday was routed to the CMS T3 support
  • PIC:
  • RAL:
    • there is a problem with the CASTOR database, as the backup is not working correctly after moving some nodes to another room. Therefore the database transactions are not properly propagated to the backup and they are just logged. If the problem is not solved in time, CASTOR will be stopped at the end of the day. Currently this is marked as a WARNING in GOCDB
    • as reported by Maarten in a ticket (GGUS:113843), a scheduled downtime for the decommissioning of the CREAM-CEs disappeared before its end because the endpoints were mistakenly removed from the GOCDB before the end of the downtime
  • TRIUMF:

  • CERN batch and grid services:
    • MyProxy running on myproxy.cern.ch was updated on May 20th (please take a look at the ITSSB entry for more details)
  • CERN storage services: ntr
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

  • ATTENTION: next meeting on Tuesday May 26!
Topic revision: r17 - 2015-05-21 - AndreaSciaba
 