Week of 131216

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
SIRs: WLCG Service Incident Reports
Broadcasts: Broadcast archive
Operations Web: Operations Web

General Information

General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Alessandro, Belinda, Felix, Jerome, Maarten, Stefan, Steve, Xavier E
  • remote: Christian, Jose, Lisa, Michael, Onno, Pepe, Rob, Rolf, Sang-Un, Stefano, Tiju, Xavier M

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • NTR
    • T0/T1
      • IN2P3-CC: SOURCE error during TRANSFER_PREPARATION phase: RQueued (GGUS:99777), solved
      • INFN-T1: transfers failing with error "Request timeout" (GGUS:99771), solved
      • RAL-LCG2: transfer failures with "source file doesn't exist" (GGUS:99768), waiting for reply
      • FZK-LCG2: issue in reading from tape, the site is working on it (FZK internal monitoring shows no activity: http://gridmon-kit.gridka.de/tapeview/atlas/index.html)
      • BNL-ATLAS is in scheduled maintenance; the US Cloud is offline during the first part of the intervention, which affects the network.
    • openssl issue: https://operations-portal.egi.eu/broadcast/archive/id/1066
      • Maarten summarized the events leading up to the broadcast (further details there) and added that besides CREAM, other services running on SLC6.5 can also be affected, e.g. the WMS or even storage elements, as reported below by LHCb. As it looks unlikely that RedHat will re-enable support for 512-bit proxies in a future update, we will need to fix all "client" instances that still generate such proxies (a minimal check sketch follows below).
      • Rob added that OSG experts are working on reducing the fallout on the OSG side
      • new "gridsite" versions have just been released
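
As an illustration of the "client" fix mentioned above, the following is a minimal sketch, not taken from the broadcast: the proxy path, the 1024-bit threshold and the use of the Python "cryptography" library are illustrative assumptions. It flags a proxy file whose key is shorter than what the openssl shipped with SLC6.5 accepts.

    # Minimal sketch: flag grid proxies generated with RSA keys shorter than
    # 1024 bits, which the openssl shipped with SLC6.5 no longer accepts.
    # The proxy path and the threshold are illustrative assumptions.
    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    PROXY_PATH = "/tmp/x509up_u1000"   # hypothetical proxy location
    MIN_BITS = 1024                    # keys below this are rejected

    with open(PROXY_PATH, "rb") as f:
        pem_data = f.read()

    # The proxy certificate itself is the first PEM block in the file.
    proxy_cert = x509.load_pem_x509_certificate(pem_data, default_backend())
    key_size = proxy_cert.public_key().key_size

    if key_size < MIN_BITS:
        print("proxy key is only %d bits, regenerate it with a longer key" % key_size)
    else:
        print("proxy key size OK: %d bits" % key_size)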

  • CMS reports (raw view) -
    • Very quiet weekend. No relevant issues to report.
      • Rob: the glideinWMS factory at Indiana University ran out of disk space on Fri and has temporarily been taken out of the list while a new SSD drive is awaited; it will probably not arrive before Jan

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • reprocessing of Proton-Ion collisions started last week at GRIDKA/CERN
    • At other sites the main activities are simulation & user jobs
    • T0:
    • T1:
      • FZK: Pilot problems, solved (GGUS:99725)
      • FZK: Issue with the tape system over the weekend, now resolved; staging throughput is increasing.
    • Other:
      • Problems with FTS3 transfers to CBPF, which is running SLC6.5: this Linux version produces SSLv3 handshake problems (GGUS:99398)
        • Steve: the FTS-3 nodes have almost finished being reinstalled with SLC6.4, which we can probably live with for a few weeks
        • after the meeting, FTS-3 project lead Michail Salichos clarified that both the FTS-3 client and server depend on the "gridsite" provided by EPEL-stable. Since the new version should get there soon and the few server instances can be kept on SL6.4 for now, standard updates can be done in Jan. The FTS-3 Wiki has been updated.

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • network intervention ongoing
      • new switch installed, connectivity restored
      • new spanning tree algorithm just started, being checked
    • tomorrow dCache upgrade to v2.6 for SHA-2 support
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KISTI
    • last week's network problems were due to a chain of events:
      • a logical volume for hypervisor storage accidentally got overwritten in a test
      • VMs then could not mount their storage
      • as the DNS was running in a VM, it became unavailable, which caused all kinds of services to fail
    • the DNS is now running on a physical node and the services have been recovered
  • KIT
    • last week the SE for ATLAS was upgraded and ran into file system problems:
      • 90 TB are still unavailable; the tech support is coming from the US
      • reading from tape was not possible, but should be OK again now
  • NDGF
    • short downtime Wed ~noon CET to reboot some pool nodes and update them to dCache 2.6
  • NLT1
    • tomorrow evening at-risk downtime for tape back-end; files only on tape will be unavailable for a while
  • OSG - nta
  • PIC
    • Thu Dec 19 downtime for cooling system maintenance plus various upgrades
  • RAL - ntr

  • grid services
    • CVMFS Stratum-0 and -1 have been migrated and upgraded OK
    • FTS-3 is being downgraded to SL6.4 because of the openssl issue (almost done)
  • storage
    • transparent EOS updates to improve http performance and e-groups support:
      • EOS-CMS ongoing
      • EOS-ATLAS tomorrow morning

AOB:

Thursday

Attendance:

  • local:
  • remote:

Experiments round table:

  • ALICE -
    • Thanks for your contributions to another successful year and best wishes for 2014!

  • LHCb reports (raw view) -
    • reprocessing of Proton-Ion collisions in full swing at CERN/GRIDKA
    • At other sites the main activities are simulation & user jobs
    • T0:
      • Impressive staging performance (140 TB / 24 h); stage-in for reprocessing is finished
    • T1:
      • GRIDKA: problems with file access via xroot, switched back to dcap for the time being.

      *
     /.\
    /..'\      Many season's greetings
    /'.'\      to all sites and services.
   /.''.'\     Thanks for all your work
   /.'.'.\     and support during 2013. 
  /'.''.'.\    
  ^^^[_]^^^    LHCb Grid Operations Team

Sites / Services round table:

  • PIC
    • Apologies, I cannot attend today's meeting (Pepe). Today's downtime is going well; we expect to start services even before the declared end time of today's downtime.
  • GGUS
    • Reminder: For the Year End period: GGUS is monitored by a system connected to the on-call service. In case of total GGUS unavailability the on-call engineer (OCE) at KIT will be informed and will take appropriate action. If GGUS is available but there is a problem with the workflow, e.g. ALARM to CERN doesn't generate email notification to the operators, then WLCG should submit an ALARM ticket, notifying Site DE-KIT, which triggers a phone call to the OCE. If the web portal is unavailable, contact details for KIT are recorded in the GOCDB.

AOB:

Topic attachments
  • ADC20131210.pdf (pdf, 169.2 K, 2013-12-19 15:18, AleDiGGi): CVMFS inode issue
  • MB-Dec.pptx (pptx, 2843.5 K, 2013-12-16 10:36, PabloSaiz)
