Week of 150907

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Oliver (CMS), Marc (LHCb), Emil (IT-DB), Xavier (IT-DSS+SCOD)
  • remote: Ulf (NDGF), Antonio (CNAF), Tiju (RAL), Andrzej (ATLAS), David (IN2P3), Sang-Uhn (KISTI), Pepe (PIC)

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • Temporary failures in exports from T0 to TAPE endpoints at CERN with a communication error, followed in GGUS:115680
      • Problems with Rucio file transfer services over the weekend, leading to a slight decrease in production on Monday morning
      • Problems with cvmfs access by 32-bit software detected at a few sites; to be monitored by cloud support in HC tests

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Data Processing:
      • Dominated by MC and user jobs
    • T0
      • Investigation of slow worker ongoing (GGUS:116023)
      • DBoD downtime on 9 Sep: plans in place; waiting for final confirmation about the cluster migration beforehand.
    • T1
      • CNAF Outage: All seems OK now
      • SARA: DNS intervention caused some failed transfers over the weekend. All OK now.

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF: Still running with limited CPU power (missing 10% of pledged resources). Waiting for the final green light for full operation. Post-mortem report in progress.
  • FNAL:
  • GridPP:
  • IN2P3:
  • JINR:
  • KISTI:
  • KIT:
  • NDGF: An old bug has resurfaced: an ATLAS file has 100 replicas on tape, and at recall time the system asks for 100 replicas on disk. Investigating.
  • NL-T1:
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL:
  • TRIUMF:

  • CERN batch and grid services:
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

  • Note: the next meeting will be held on Friday.

Thursday: Jeûne genevois holiday

  • The meeting will be held on Friday instead.

Friday

Attendance:

  • local:
  • remote:

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • CERN EOSATLAS namespace crashed on Tuesday, repaired after a few hours.
      • RAL FTS repeatedly getting into a stalled state under certain (load) conditions; ticket opened.
      • Status of the FTS software update at BNL, RAL and CERN? Needed for better handling of transfer messages.

  • CMS reports (raw view) -
    • There is a small chance that no one from CMS makes it to the call
      • In case of feedback, please send it by e-mail to Christoph
    • High load - up to 120k jobs in the Global Pool
    • Issues with one type of MC workflows with severe memory leak
      • Tried to allocate 'all' memory
      • Might have 'killed' some WNs at sites (sorry)
      • Problem localized by CMS Offline & Generator experts
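A leaking workflow that "tries to allocate all memory" can take a whole worker node down with it. A minimal sketch, assuming a Linux batch node, of the common mitigation of capping a job's address space so the leak fails inside the job instead (the 4 GiB value is purely illustrative, not a CMS or WLCG setting):

```python
import resource

def cap_address_space(limit_bytes):
    """Lower the soft RLIMIT_AS for this process and its children.

    A process that leaks past the cap gets allocation failures
    (MemoryError in Python, NULL from malloc in C) instead of
    exhausting the worker node's memory.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Only lower the soft limit; the hard limit stays untouched so a
    # privileged wrapper could still adjust it later.
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

cap_address_space(4 * 1024**3)  # 4 GiB, illustrative only
soft, _ = resource.getrlimit(resource.RLIMIT_AS)
print(soft)
```

Batch systems typically apply such limits themselves (e.g. via cgroups or ulimit in the job wrapper); this only illustrates the effect at the process level.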

  • ALICE -
    • CERN: myproxy.cern.ch had expired CRLs
      • team ticket GGUS:116095 opened Mon evening
      • cause: hosts in the Wigner data center had no outgoing connectivity for IPv6
      • in the meantime the Squid proxy service for CRLs uses only hosts in the Meyrin data center
    • CERN: EOS read and write failures Tue late afternoon (GGUS:116118)
      • port 1094 got overloaded for a few hours by badly behaved clients
      • WNs at a few sites accessed EOS through NAT boxes without reverse DNS
        • should not be a problem, but the sites were asked to correct that
    • Accessing CASTOR for reading or writing raw data files:
      • A constructive meeting was held between ALICE experts and the CASTOR team.
      • Short- and longer-term ideas were discussed.
      • Reco jobs now download the raw data files instead of streaming them.
      • A few percent of those jobs still failed, possibly due to older jobs or unrelated issues.
      • DAQ and CASTOR experts will retrace how a particular file ended up lost.
      • Thanks for the good support!
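The reverse-DNS point above amounts to checking that the IP a NAT box or WN presents to EOS resolves back to a host name. A minimal sketch of that check (the helper name is ours; only the loopback address is queried here):

```python
import socket

def reverse_dns(ip):
    """Return the host name an IP resolves back to (PTR record), or None.

    Sites could run this against the public IP of a NAT box to verify
    that reverse DNS is configured, as was requested in the EOS incident.
    """
    try:
        name, _aliases, _addresses = socket.gethostbyaddr(ip)
        return name
    except OSError:  # socket.herror etc.: no PTR record found
        return None

print(reverse_dns("127.0.0.1"))  # loopback normally maps back to localhost
```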

  • LHCb reports (raw view) -
    • Data Processing:
      • Dominated by MC and user jobs
    • T0
      • Investigation of slow worker ongoing (GGUS:116023)
      • dbod 9 Sep downtime - DB transfer complete and all services restarted without issue
    • T1
      • CNAF: Network issues causing slow downloads/uploads (GGUS:116023)

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3:
  • JINR:
  • KISTI:
  • KIT:
  • NDGF:
  • NL-T1:
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL:
  • TRIUMF:

  • CERN batch and grid services:
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:


Topic revision: r13 - 2015-09-11 - XavierEspinal
 