Week of 150601

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Alexey (ATLAS), Stefan (SCOD), Alessandro (Storage), Christoph (CMS), Ben (Grid Services), Katarzyna (Databases), Maarten (ALICE)
  • remote: Daniele (CMS), Hung-Te (ASGC), Jens (NDGF), Lisa (FNAL), Elizabeth (OSG), Alexander (NL-T1), Di-Qing (TRIUMF), Rolf (IN2P3), Sonia (CNAF)
  • apologies: Tiju (RAL), Joel (LHCb), Sang-Un (KISTI)

Experiments round table:

  • ATLAS reports (raw view) -
    • Tile reprocessing is running well
    • New record in number of transferred files (Rucio)
    • Issues with NDGF/FTS3 and INFN T1 GGUS ticket

  • CMS reports (raw view) -
    • Services
      • May-25: XROOT (criticality level 8) at 67% availability: check cmsxrootd.fnal.gov:1094
        • opened GGUS:113934.
        • Prompt reaction by Brian Bockelman. The service ran out of file descriptors and filled the log file with one line per failed accept(), until the host ran out of space. Immediate measures were taken; the lack of alarms was discussed and preventive actions were planned for the future.
        • system monitored, looked stable, GGUS closed by the CMS-CRC on Jun 1st
      • May 27: ELOG slow/unresponsive around mid-week
        • Cause unclear; the problem resolved itself. Archived as “glitch”. No tickets.
    • T1:
      • since the weekend: T1_UK_RAL pileup reading errors (GGUS:114010 as of Jun 1st, currently investigating)
      • since May-27: issues in file transfers from T1_FR_CCIN2P3 to T2_US_Florida (GGUS:113937, pending)
      • since May-27: failing transfers from T1_RU_JINR_Disk to T1_UK_RAL_Disk (GGUS:113936, now understood, a few corrupted files need to be invalidated)

  • ALICE -
    • high job failure rates starting Fri evening, then very low activity on Sat through Sun morning due to an ALICE SW issue
      • a cure was applied Sat evening and MC production was restarted

  • LHCb reports (raw view) -
    • Operations dominated by MC jobs and user analysis; some RAW data is foreseen on Wednesday
    • T0
      • tape set defined for the new RAW data file on CASTOR
    • T1
      • RRCKI: the site declared a downtime in GOCDB as WARNING, but all transfers failed. If there is an interruption of service, the downtime should be declared as DOWN, not WARNING or AT RISK.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NR
  • CNAF: NTR (tomorrow bank holiday)
  • FNAL: Intervention in computer rooms, approx. quarter of CMS allocations down. Currently working to restore services, expect to be back by mid week.
  • GridPP: NR
  • IN2P3: NTR
  • JINR: NR
  • KISTI: NTR
  • KIT: NR
  • NDGF: FTS issues last Friday caused problems with transfers for most of the day. Mitigated by a software fix the day after.
  • NL-T1: Tomorrow morning DT for software updates on dCache and worker node kernels.
  • NRC-KI: NR
  • OSG: NTR
  • PIC: NR
  • RAL: NTR
  • TRIUMF: NTR

  • CERN batch and grid services: ongoing issue with FTS bringonline daemon (code problem?), currently being looked at.
  • CERN storage services: new EOS/ATLAS GridFTP door opened (now four)
  • Databases: NTR
  • GGUS: NR
  • Grid Monitoring: NR
    • Draft availability reports for May sent, and available at the SAM3 UI
  • MW Officer: NR

AOB:

Thursday

Attendance:

  • local: Joel (LHCb), Stefan (SCOD), Christoph (CMS), Alessandro (ATLAS), Ben (Grid Services), Xavi (Storage), Nacho (Grid Services), Maarten (ALICE)
  • remote: Dennis (NL-T1), Rolf (IN2P3), Lisa (FNAL), Michael (BNL), Hung-Te (ASGC), John (RAL), Jeremy (GridPP), Sang Un (KISTI), Jens (NDGF), Di Qing (TRIUMF), Pepe (PIC), Rob (OSG), Daniele (CMS), Lucia (CNAF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Smooth first official day of data taking!
    • Tier-0 issues on some WNs: can someone clarify the issue?
      • Xavi: seen on long jobs, tracked down to a bug in the hypervisor NUMA mapping of cores in OpenStack, not fixed yet. Alessandro prefers not to use those machines.
    • HLT reprocessing: successfully managed to reprocess 1M events for the 1st week HLT reprocessing needs.

  • CMS reports (raw view) -
    • General
      • exciting days for the start of the 13 TeV LHC era!
      • most people focussed on smooth start-up, Tier-0 and surroundings (i.e. short report today!)
    • Services
      • nothing major to report on central services since last Monday’s 3pm call
      • Daniele noticed CVMFS degradation, CMS shifters will monitor this
    • T1:
      • since early May: T1_FR_CCIN2P3 is debugging stuck outgoing transfers, GFAL-related, testing a solution (GGUS:113510, GGUS:113937)
      • since Jun-01: T1_UK_RAL pileup reading errors, currently under investigation (GGUS:114010)
      • since May-27: issues in file transfers from T1_FR_CCIN2P3 to T2_US_Florida (GGUS:113937, pending)
      • since May-27: failing transfers from T1_RU_JINR_Disk to T1_UK_RAL_Disk, now understood, a few corrupted files need to be invalidated (GGUS:113936)
        • Xavi: in contact with P5 people because of slow transfers from the pit to CERN, issue possibly related to routers in Wigner
        • Rolf: the IN2P3 transfer problems are related to the old version of GFAL used by PhEDEx; local expertise is limited, so alternatives are being tested with care; an update to GFAL2 is being tested. Daniele: Sebastien is also following up on this

  • ALICE -
    • CERN: CEs are extremely slow since Wed evening (team ticket GGUS:114103)
      • Ben: looking into it

  • LHCb reports (raw view) -
    • First data collisions since yesterday
    • T0
    • T1
      • GGUS tickets sent to all T1s to make sure that all the RAW data will be stored on the same tape set in each tape system when feasible (114013 to 114019).

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NTR
  • FNAL: NTR
  • GridPP: NTR
  • IN2P3: NTR
  • JINR: NR
  • KISTI: NTR
  • KIT: NR
  • NDGF: NTR
  • NL-T1: NTR
  • NRC-KI: NR
  • OSG: issue with BDII publishing in OSG from Monday night to Tuesday, problem is fixed. Only effect seen is that some monitoring statistics for a site in Florida were impacted.
  • PIC: NTR
  • RAL: NTR
  • TRIUMF: NTR

  • CERN batch and grid services:
    • MyProxy to be migrated to CC7 and a new backend for credentials storage (ITSSB entry)
    • New NorduGrid ARC Compute Elements are being added to BDII and GOCDB. They front the HTCondor-based grid pilot at CERN Batch. Bear in mind that the resources currently behind these CEs are limited. Hostnames are ce501 and ce502.
    • Maarten: an update of globus-gssapi (in EPEL testing) changes how host certificates are validated on the client side. This may cause issues with services behind aliases (e.g. at CERN FTS, VOMS, MyProxy, ...). Nacho: it can be fixed on the service side. Discussion ongoing; once fixed, an EGI broadcast will be sent, see also GGUS:114076
  • Databases: NR
  • GGUS: NR
  • Grid Monitoring: NR
  • MW Officer:

AOB:

Topic revision: r16 - 2015-06-04 - MaartenLitmaath
 