Week of 150907

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Oliver (CMS), Marc (LHCb), Emil (IT-DB), Xavier (IT-DSS+SCOD)
  • remote: Ulf (NDGF), Antonio (CNAF), Tiju (RAL), Andrzej (ATLAS), David (IN2P3), Sang-Uhn (KISTI), Pepe (PIC)

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • Temporary failures in exports from the T0 to the TAPE endpoints at CERN with communication errors; followed up in GGUS:115680
      • Problems with the Rucio file transfer services over the weekend, leading to a slight decrease in production on Monday morning
      • Problems with CVMFS access by 32-bit software detected at a few sites; this needs to be monitored by cloud support in the HC tests

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Data Processing:
      • Dominated by MC and user jobs
    • T0
      • Investigation of slow worker ongoing (GGUS:116023)
      • DBoD downtime on 9 Sep: plans in place; waiting for final confirmation about the cluster migration beforehand.
    • T1
      • CNAF Outage: All seems OK now
      • SARA: DNS intervention caused some failed transfers over the weekend. All OK now.

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF: Still running with limited CPU power (missing 10% of the pledged resources). Waiting for the final green light for full operation. A post-mortem report is in progress.
  • FNAL:
  • GridPP:
  • IN2P3:
  • JINR:
  • KISTI:
  • KIT:
  • NDGF: An old bug has resurfaced: an ATLAS file has 100 replicas on tape, and at recall time the system asks for 100 replicas on disk. Investigating.
  • NL-T1:
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL:
  • TRIUMF:

  • CERN batch and grid services:
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

  • Note: the next meeting will be held on Friday.

Thursday: Jeûne genevois holiday

  • The meeting will be held on Friday instead.

Friday

Attendance:

  • local: Maarten (Alice), Prasanth (IT-DB), Xavier (IT-DSS+SCOD)
  • remote: Andrzej (ATLAS), Christoph (CMS), David (IN2P3), Michael (BNL), Ulf (NDGF), Tiju (RAL), Di (TRIUMF), Kyle (OSG), Andrew (NIKHEF)

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • CERN EOSATLAS namespace crashed on Tuesday, repaired after a few hours.
      • RAL FTS repeatedly getting into a stalled state under some (load) conditions; a ticket has been opened.
      • What is the status of the FTS software update at BNL, RAL and CERN? It is needed for better handling of transfer messages.

  • CMS reports (raw view) -
    • There is a small probability that no one from CMS makes it to the call
      • In case of feedback, please address it by mail to Christoph
    • High load - up to 120k jobs in the Global Pool
    • Issues with one type of MC workflow with a severe memory leak
      • Tried to allocate 'all' memory
      • Might have 'killed' some WNs at sites (sorry)
      • Problem localized by CMS Offline & Generator experts

  • ALICE -
    • CERN: myproxy.cern.ch had expired CRLs
      • team ticket GGUS:116095 opened Mon evening
      • cause: hosts in the Wigner data center had no outgoing connectivity for IPv6
      • in the meantime the squid proxy service for CRLs only used hosts in the Meyrin data center
    • CERN: EOS read and write failures Tue late afternoon (GGUS:116118)
      • port 1094 got overloaded for a few hours by badly behaved clients
      • WNs at a few sites accessed EOS through NAT boxes without reverse DNS
        • should not be a problem, but the sites were asked to correct that
    • Accessing CASTOR for reading or writing raw data files:
      • A constructive meeting was held between ALICE experts and the CASTOR team.
      • Short- and longer-term ideas were discussed.
      • Reco jobs now download the raw data files instead of streaming them.
      • A few percent of those jobs still failed, possibly due to older jobs or unrelated issues.
      • DAQ and CASTOR experts will retrace how a particular file ended up lost.
      • Thanks for the good support!

  • LHCb reports (raw view) -
    • Data Processing:
      • Dominated by MC and user jobs
    • T0
      • Investigation of slow worker ongoing (GGUS:116023)
      • DBoD downtime on 9 Sep: DB transfer complete and all services restarted without issues
    • T1
      • CNAF: Network issues causing slow downloads/uploads (GGUS:116023)

Sites / Services round table:

  • ASGC:
  • BNL: The FTS upgrade is planned to be carried out on Monday.
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3: Intervention on the ATLAS AMI database on Monday the 14th, from 14:00 to 17:00.
  • JINR:
  • KISTI:
  • KIT:
  • NDGF: Downtime on Monday for dCache patching and head-node reboots; this will take ~40 minutes. According to GOCDB, the intervention will take place between 9:00 and 10:00.
  • NL-T1:
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: CASTOR upgrade affecting ATLAS and ALICE instances next Tuesday the 15th
  • TRIUMF:

  • CERN batch and grid services:
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:
  • Networking:
    • This is to let people concerned with computing operations know that 3 out of the 4 transatlantic circuits provided by ESnet are currently either out of service or impaired. Connectivity between the majority of ATLAS and CMS sites in the U.S. and Europe is currently provided by the remaining 100 Gbps circuit, which runs across an independent submarine cable system that is unaffected so far. We are not aware of any connectivity issues or bandwidth limitations for either LHCOPN or LHCONE traffic. For the record, a few details on the current situation:
      • A number of unrelated outages are affecting the transatlantic portion of ESnet today. ESnet is working with all of its vendors to repair things as soon as possible and mitigate the risk of significant impact.
      • ESNET-20150807-002
        • The 40G circuit from Boston to Amsterdam is down. It was cut on the 7th. A ship is on-site and actively working on recovering and repairing the cable.
      • ESNET-20150910-005
        • The 100G circuit from Washington DC to CERN went down at 12:08 PDT. The vendor is working on identifying the fault. At this point ESnet does not know the cause, nor does it have an estimate of the time to repair.
      • ESNET-20150910-001
        • The 100G circuit from NY to London has been experiencing intermittent faults. The vendor is working on replacing an amplifier in New York. The circuit is currently up, but intrusive maintenance events are expected in the next day or so that will take it down for brief periods. A 30-minute maintenance window is scheduled for 04:00-04:30 PDT on 11 September 2015.
    • The remaining 100G circuit from NY to London is up and running fine.
    • Finally, should the above-mentioned circuit also fail, ESnet will move LHC-related transatlantic traffic to two (lightly loaded) 100 Gbps circuits owned by GEANT (NY - Paris) and the ANA consortium, both of which have agreed to provide backup services for the LHC community.
      Michael Ernst

AOB:

  • ATLAS asked for an update about the plans to upgrade the different FTS instances at BNL, RAL and CERN.