Week of 150615

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have the time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Ilija (ATLAS), Jerome (batch and grid services), Maarten (SCOD + ALICE), Massimo (storage), Prasanth (databases), Stefan (LHCb), Xavi (storage)
  • remote: Christian (NDGF), Christoph (CMS), Felix (ASGC), Kyle (OSG), Lisa (FNAL), Michael (BNL), Onno (NLT1), Preslav (CMS), Rolf (IN2P3), Sang-Un (KISTI), Sonia (CNAF), Tiju (RAL), Vladimir (LHCb)

Experiments round table:

  • ATLAS reports (raw view) -
    • FTS upgrade to fix the issue with StoRM: all FTS servers have applied it, thanks a lot!
    • CERN-PROD: the high failure rate was understood and fixed thanks to quick interactions with CERN-IT DSS. GGUS:114293 was reopened today; possibly the fix was not complete.

  • CMS reports (raw view) -
    • File access problems at CCIN2P3: GGUS:114343
      • Rolf: that issue concerns our T2
    • Possible file corruption issues at CERN EOS: GGUS:114304
      • Massimo: for different experiments we have seen a strong correlation with the network incident of June 11
    • Some files seem not to migrate to CASTOR tape at CERN: GGUS:114282
    • File transfer issues from FNAL to RAL: GGUS:114275
    • Tape staging test started at CERN: GGUS:114283 (ticket kept for information/logging)
    • Any news regarding P5-Wigner network link?
      • Xavi: this is still being followed up by the network experts

  • ALICE -
    • high activity

  • LHCb reports (raw view) -
    • Validating data processing workflow
    • T0
      • NTR
    • T1
      • NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3:
    • reminder: downtime tomorrow
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF:
    • downtime tomorrow 10:00-16:00 CEST for dCache upgrades
  • NL-T1:
    • this morning the SARA squid service crashed and was restarted
    • there was a DNS issue affecting the computing cluster at SARA; it has been fixed, but the issue might not fully be gone until all previously cached information has expired
    • the NIKHEF farm seemed rather quiet, maybe due to the squid problem?
      • after the meeting: ALICE had 2k jobs running throughout the day...
  • NRC-KI:
  • OSG: ntr
  • PIC: ntr
  • RAL:
    • reminder: network maintenance Wed afternoon
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services:
    • tomorrow new HW will be added; this should be transparent, but could in principle affect any experiment
  • Databases:
    • tomorrow the CMS DB firewall rules will be moved from DB triggers to iptables (a purely illustrative sketch is shown after this list)
  • GGUS:
  • Grid Monitoring:
  • MW Officer:
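  • Note on the Databases entry above: a purely illustrative sketch of what an iptables-based restriction could look like, assuming the default Oracle listener port (1521); the subnet is a hypothetical placeholder, not the actual CERN configuration:
        # hypothetical example: accept DB client connections only from a trusted subnet
        iptables -A INPUT -p tcp --dport 1521 -s 10.10.0.0/16 -j ACCEPT
        # drop all other connection attempts to the DB listener port
        iptables -A INPUT -p tcp --dport 1521 -j DROP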

AOB:

Thursday

Attendance:

  • local: Ilija (ATLAS), Stefan (SCOD), (MW), Oliver Gutsche (CMS), Sang-Un (KISTI), Massimo (Storage), Jerome (Grid Services), Maarten (ALICE)
  • remote: Felix (ASGC), Michael (BNL), Sonia (CNAF), Lisa (FNAL), Rolf (IN2P3), Thomas (KIT), Christian (NDGF), Dennis (NL-T1), Rob (OSG), John (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • Large backlog of FTS transfers from CERN to BNL - at the moment 30k files. We raised the number of active transfers to 30, 60 and now 100, but the rate is still hovering around 1 GB/s. We could possibly go to 200, but it is not clear that would help. We have contacted the site.
      • SARA - some directories (and the data in them) need to be deleted, as central deletion fails. GGUS:114430
      • RAL - transfers with RAL as destination fail due to a full DATADISK. GGUS:114391

  • CMS reports (raw view) -
    • Rolling network intervention on Tuesday June 16th was successful and did not cause any problems, thanks.
    • Some files seem not to migrate to CASTOR tape at CERN: GGUS:114282
      • needs to be reassigned to CASTOR support!
    • File transfer issues from FNAL to RAL: GGUS:114275
      • on hold for now; the suspicion is that the problem coincides with CMS jobs on the RAL farm causing heavy WAITIO on the storage nodes; this cannot be confirmed right now because the load has decreased
      • Lisa: for the transfers from FNAL to RAL the ticket was updated and the slowness is back. John: will look into it soon
    • Tape staging test at CERN ongoing: GGUS:114283
    • Intermittent EOS read issues from CMS Tier-0: GGUS:113389
      • a bug in the xrootd client has been identified; the workaround is to set
        export XRD_LOADBALANCERTTL=86400
        (a minimal sketch of applying this is shown at the end of this report)
      • Ilija: which client version is this? Oliver: Don't know.
    • Any news regarding P5-Wigner network link?
      • Massimo: a report on this will be given in the WLCG Ops Coordination meeting today at 15:30
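    • a minimal sketch of the xrootd workaround above, assuming it is applied in the job environment before any xrootd client call; the xrdcp command and the EOS file path are illustrative placeholders only:
        # raise the xrootd client load-balancer TTL to one day, so that cached
        # redirector information is not refreshed during the job
        export XRD_LOADBALANCERTTL=86400
        # subsequent xrootd client calls in the same environment inherit the setting, e.g.
        xrdcp root://eoscms.cern.ch//eos/cms/store/test/example.root /tmp/example.root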

  • ALICE -
    • low activity today

  • LHCb reports (raw view) -
    • Validating data processing workflow
    • T0
    • T1
      • RAL - problem with mapping for a few users; already fixed.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NTR
  • FNAL: NTR
  • GridPP: NR
  • IN2P3: Downtime yesterday went well. Some problems on Monday with electrical equipment; some worker nodes were affected, solved by now.
  • JINR: NR
  • KISTI: HW maintenance on the network link to Amsterdam this morning 7:00-8:00, which was transparent.
  • KIT: NTR
  • NDGF: NTR
  • NL-T1: request from SARA to ATLAS concerning a GGUS ticket asking for the removal of several directories; SARA wants confirmation from ATLAS for those. Ilija: will come back to you
  • NRC-KI: NR
  • OSG: multicore accounting has been fixed and confirmation was received that the correct numbers are now in the APEL system. The previous few months will now be updated with good numbers.
  • PIC: NR
  • RAL: CMS disk server unavailable this morning, should be back tomorrow morning
  • TRIUMF: NR

  • CERN batch and grid services:
  • CERN storage services: Last Tuesday there was a transparent upgrade of services. Yesterday there was an incident in which all CASTOR DBs went down and needed to be rebooted. We are chasing files corrupted following the June 11 network incident.
  • Databases: NR
  • GGUS:
    • Release scheduled for the 24th of June. Downtime announced on GOCDB. The service might not be available during the intervention. More info about the release
  • Grid Monitoring: NR
  • MW Officer: NTR

AOB:
