Week of 131216

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Alessandro, Belinda, Felix, Jerome, Maarten, Stefan, Steve, Xavier E
  • remote: Christian, Jose, Lisa, Michael, Onno, Pepe, Rob, Rolf, Sang-Un, Stefano, Tiju, Xavier M

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • NTR
    • T0/T1
      • IN2P3-CC: SOURCE error during the TRANSFER_PREPARATION phase: RQueued, GGUS:99777, solved
      • INFN-T1: transfers failing with error "Request timeout", GGUS:99771, solved
      • RAL-LCG2: transfer failures with "source file doesn't exist", GGUS:99768, waiting for reply
      • FZK-LCG2: issue in reading from tape, the site is working on it (FZK internal monitoring shows no activity: http://gridmon-kit.gridka.de/tapeview/atlas/index.html)
      • BNL-ATLAS is in scheduled maintenance; the US cloud is offline during the first part of the intervention (which affects the network).
    • openssl issue: https://operations-portal.egi.eu/broadcast/archive/id/1066
      • Maarten summarized the events leading up to the broadcast (further details there) and added that, besides CREAM, other services on SLC6.5 can also be affected, e.g. WMS or even storage elements, as reported below by LHCb. As it looks unlikely that Red Hat will re-enable support for 512-bit proxies in a future update, all "client" instances that still generate such proxies will need to be fixed (see the sketch after the experiments' reports)
      • Rob added that OSG experts are working on reducing the fallout on the OSG side
      • new "gridsite" versions have just been released now:

  • CMS reports (raw view) -
    • Very quiet weekend. No relevant issues to report.
      • Rob: the glideinWMS factory at Indiana University ran out of disk space on Friday and has temporarily been taken out of the list while a new SSD drive is awaited, which will probably not arrive before January

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • reprocessing of Proton-Ion collisions started last week at GRIDKA/CERN
    • at other sites the main activities are simulation & user jobs
    • T0:
    • T1:
      • FZK: Pilot problems, solved (GGUS:99725)
      • FZK: issue with the tape system over the weekend, now resolved; staging throughput is increasing.
    • Other:
      • Problems with FTS3 transfers to CBPF, which is running SLC6.5; this version causes SSL3 handshake problems (GGUS:99398)
        • Steve: the FTS-3 nodes have almost finished getting reinstalled with SLC6.4 (sic), which we probably can live with for a few weeks
        • after the meeting FTS-3 project lead Michail Salichos clarified that both the FTS-3 client and the server depend on the "gridsite" provided by EPEL-stable; since the new version should get there soon and the few server instances can be kept on SL6.4 for now, standard updates can be done in Jan; the FTS-3 Wiki has been updated
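
A minimal sketch of how the key size of a locally generated proxy can be checked, to spot "client" instances that still produce 512-bit keys rejected by the OpenSSL shipped with SLC6.5. This is only an illustration, not an official tool; it assumes the proxy sits at the default /tmp/x509up_u<uid> location and that the Python "cryptography" package is available:

  # Hypothetical check (not part of any WLCG release): report the RSA key size
  # of the local proxy, so clients still generating 512-bit keys can be found.
  import os
  from cryptography import x509
  from cryptography.hazmat.backends import default_backend

  proxy_path = "/tmp/x509up_u%d" % os.getuid()   # default VOMS proxy location

  with open(proxy_path, "rb") as f:
      pem = f.read()

  # The proxy file concatenates the proxy certificate, its private key and the
  # signing certificate(s); keep only the first certificate block.
  begin = pem.index(b"-----BEGIN CERTIFICATE-----")
  end = pem.index(b"-----END CERTIFICATE-----") + len(b"-----END CERTIFICATE-----")
  cert = x509.load_pem_x509_certificate(pem[begin:end], default_backend())

  bits = cert.public_key().key_size
  print("proxy key size: %d bits" % bits)
  if bits < 1024:
      print("WARNING: such a proxy is rejected by the OpenSSL update in SLC6.5")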

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • network intervention ongoing
      • new switch installed, connectivity restarted
      • new spanning tree algorithm just started, being checked
    • tomorrow dCache upgrade to v2.6 for SHA-2 support
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KISTI
    • last week's network problems were due to a chain of events:
      • a logical volume for hypervisor storage accidentally got overwritten in a test
      • VMs then could not mount their storage
      • as the DNS was running in a VM, it became unavailable, which caused all kinds of services to fail
    • the DNS is now running on a physical node and the services have been recovered
  • KIT
    • last week the SE for ATLAS was upgraded and ran into file system problems:
      • 90 TB are still unavailable; the tech support is coming from the US
      • reading from tape was not possible, but should be OK again now
  • NDGF
    • short downtime Wed ~noon CET to reboot some pool nodes and update them to dCache 2.6
  • NLT1
    • tomorrow evening at-risk downtime for tape back-end; files only on tape will be unavailable for a while
  • OSG - nta
  • PIC
    • Thu Dec 19 downtime for cooling system maintenance plus various upgrades
  • RAL - ntr

  • grid services
    • CVMFS Stratum-0 and -1 have been migrated and upgraded OK
    • FTS-3 is being downgraded to SL6.4 because of the openssl issue (almost done)
  • storage
    • transparent EOS updates to improve http performance and e-groups support:
      • EOS-CMS ongoing
      • EOS-ATLAS tomorrow morning

AOB:

Thursday

Attendance:

  • local: Alessandro, Belinda, Felix, Jerome, Maarten, Maria D, Pablo, Stefan
  • remote: Christian, Dennis, John, Lisa, Rob, Rolf, Sonia, Xavier

Experiments round table:

  • ATLAS reports (raw view) -
    • OpenSSL issue: is there an official broadcast with the latest news (as described by Maarten at the WLCG ex-daily on Monday)?
      • re-observed for the FTS3 pilot; Steve has been contacted.
      • Maarten explained that the matter is not fully understood at this time:
        • there has not been a big impact on the infrastructure so far
        • in direct job submission tests with CERN CREAM and UI instances the delegated proxies ended up with 1024-bit keys, even though nothing had been updated
          • but 512-bit keys can still be reproduced at DESY-ZN
        • the SAM WMS have not been updated with the new gridsite version and hence continue generating 512-bit proxies, yet nobody has reported problems due to that
          • the new gridsite has been tested and the update can be done at short notice, if needed
          • otherwise it will be done in Jan
        • sites are advised to keep SL6 services on SL6.4 for the next 2 weeks, unless an urgent security update requires SL6.5
      • Rob explained that OSG have done an emergency release on Tue to fix the affected Globus components
        • complications due to the use of a private interface to OpenSSL; the code now uses a public interface instead
    • plans for the next weeks at today's WLCG Ops Coordination meeting
    • ADC20131210.pdf: CVMFS inode issue

  • ALICE -
    • Thanks for your contributions to another successful year and best wishes for 2014!

  • LHCb reports (raw view) -
    • reprocessing of Proton-Ion collisions in full swing at CERN/GRIDKA
    • At other sites main activities are simulation & user jobs
    • T0:
      • Impressive staging performance (140 TB in 24 h, i.e. roughly 1.6 GB/s); the stage-in for reprocessing is finished
    • T1:
      • GRIDKA: problems with file access via xroot, switched back to dcap for the time being.

      *
     /.\
    /..'\      Many season's greetings
    /'.'\      to all sites and services.
   /.''.'\     Thanks for all your work
   /.'.'.\     and support during 2013. 
  /'.''.'.\    
  ^^^[_]^^^    LHCb Grid Operations Team

Sites / Services round table:

  • ASGC
    • network intervention tomorrow 07:00-10:00 UTC
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • the 90 TB of unavailable ATLAS data are back; checksum verifications are still ongoing
  • NDGF - ntr
  • NLT1
    • many ATLAS jobs at NIKHEF are failing due to the CVMFS inode counter overflow bug to be fixed in the next release; in the meantime the only cure is to unmount and remount the ATLAS repository, which can only be done when no process has it open
      • Alessandro: not a big impact so far; you could try a rolling intervention, i.e. draining selected WNs of ATLAS jobs and then fixing those WNs (see the sketch after this list)
  • OSG - nta
  • PIC
    • Apologies, I cannot attend today's meeting (Pepe). Today's downtime is going well; we expect to bring services back up even before the declared end time of today's downtime.
  • RAL - ntr
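
As a rough illustration of the remount constraint mentioned for the CVMFS inode issue above, the sketch below walks /proc to list processes that still hold the ATLAS repository open; only when none remain can it be unmounted and remounted. It is only a sketch, assuming it is run as root on a worker node with the repository mounted at /cvmfs/atlas.cern.ch:

  # Hypothetical helper: list PIDs that still use the mount point, since the
  # repository can only be remounted once no process has it open.
  import os

  MOUNT = "/cvmfs/atlas.cern.ch"

  def users_of(mount):
      pids = []
      for pid in filter(str.isdigit, os.listdir("/proc")):
          paths = []
          for entry in ("cwd", "exe"):
              try:
                  paths.append(os.readlink("/proc/%s/%s" % (pid, entry)))
              except OSError:
                  pass
          fd_dir = "/proc/%s/fd" % pid
          try:
              for fd in os.listdir(fd_dir):
                  try:
                      paths.append(os.readlink(os.path.join(fd_dir, fd)))
                  except OSError:
                      pass
          except OSError:
              pass
          if any(p.startswith(mount) for p in paths):
              pids.append(pid)
      return pids

  busy = users_of(MOUNT)
  if busy:
      print("cannot remount %s yet, still used by PIDs: %s" % (MOUNT, ", ".join(busy)))
  else:
      print("%s is idle; safe to umount and remount the repository" % MOUNT)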

  • dashboards - ntr
  • GGUS
    • Reminder for the year-end period: GGUS is monitored by a system connected to the on-call service. In case of total GGUS unavailability, the on-call engineer (OCE) at KIT will be informed and will take appropriate action. If GGUS is available but there is a problem with the workflow, e.g. an ALARM to CERN does not generate an email notification to the operators, then WLCG should submit an ALARM ticket notifying Site DE-KIT, which triggers a phone call to the OCE. If the web portal is unavailable, contact details for KIT are recorded in the GOCDB.
  • grid services
    • lcg-bdii.cern.ch and sam-bdii.cern.ch can have different statuses for services that currently are not found in their site BDII (GGUS:99827)
      • lcg-bdii would then be wrong, due to a faulty component that will be updated in Jan (a comparison sketch follows this list)
  • storage - ntr
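
To illustrate the kind of discrepancy reported in GGUS:99827, the sketch below compares the GlueServiceEndpoint values published by lcg-bdii.cern.ch and sam-bdii.cern.ch. It is only a sketch, assuming ldapsearch (openldap-clients) is installed and both top-level BDIIs answer on port 2170:

  # Hypothetical comparison: query both top-level BDIIs for service endpoints
  # and print those published by only one of them.
  import subprocess

  def endpoints(host):
      cmd = ["ldapsearch", "-x", "-LLL",
             "-H", "ldap://%s:2170" % host,
             "-b", "o=grid",
             "(objectClass=GlueService)", "GlueServiceEndpoint"]
      out = subprocess.check_output(cmd).decode()
      # Note: long LDIF values may be wrapped onto continuation lines;
      # ignored here for brevity.
      return {line.split(":", 1)[1].strip()
              for line in out.splitlines()
              if line.startswith("GlueServiceEndpoint:")}

  lcg = endpoints("lcg-bdii.cern.ch")
  sam = endpoints("sam-bdii.cern.ch")

  for ep in sorted(lcg - sam):
      print("only in lcg-bdii: %s" % ep)
  for ep in sorted(sam - lcg):
      print("only in sam-bdii: %s" % ep)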

AOB:

  • THANKS for your contributions to making 2013 a very successful year for WLCG!
    • Further challenges and opportunities await us in 2014... smile
  • Next meeting: Mon Jan 6

Season's Greetings!

Topic attachments

  • ADC20131210.pdf (PDF, 169.2 K, 2013-12-19, AleDiGGi) - CVMFS inode issue
  • MB-Dec.pptx (pptx, 2843.5 K, 2013-12-16, PabloSaiz)