Week of 140317

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Alessandro, Felix, Luca M, Maarten, Przemek
  • remote: Dimitri, Lisa, Lucia, Onno, Pepe, Rob, Rolf, Sang-Un, Stefano, Tiju, Vladimir

Experiments round table:

  • ATLAS reports -
    • Central services
      • ss-cnaf-asgc(atlas-SS04), ss-ral(atlas-SS06) and ss-sara-ru(atlas-SS07): services have gone down repeatedly since 12:00 UTC on Sunday. Being followed up by ADC DDM ops
    • T1
      • RRC-KI-T1: transfer failures due to a WAN routing problem. GGUS:101911. Fixed.
      • INFN-T1_DATATAPE: transfer failures due to "permission denied" errors. The site fixed the relevant ACLs. GGUS:101528 closed.
      • TAIWAN-LCG2_MCTAPE: transfers from UK are failing (GGUS:102275). Corrupted files are being cleaned up; transfers now look OK.

  • CMS reports -
    • Continues to be quiet.
    • Running production and analysis full throttle and working on increasing site utilization
    • open tickets with T1's:
      • GGUS:101912 slow transfers to IN2P3 - Storage Consistency Check campaign at CC-IN2P3 completed, with good collaboration between the site and CMS ops. Expected to be resolved
      • GGUS:102264 SAM error at CNAF, can be closed

  • ALICE -
    • NTR

  • LHCb reports -
    • Stripping, MC simulation and user jobs.
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC
    • ATLAS transfer failures suspected to be due to network issue; experts looking into it
      • Alessandro: please check if your PerfSONAR installation is useful in debugging this matter and let us know
  • CNAF
    • CMS problem due to SAN issue; as it comes and goes, the ticket has not been closed yet
  • FNAL - ntr
  • IN2P3
    • reminder of downtime tomorrow, affecting batch, dCache export and MSS all day
  • KISTI
    • downtime March 19-21 for EMI-3 migration, ALICE queue being drained
  • KIT
    • network restart Wed March 19 07:00-08:00 UTC: outage for LHCb, at risk for others
  • NLT1
    • tomorrow afternoon a dCache pool node memory module will be swapped; warning declared in GOCDB
  • OSG - ntr
  • PIC - ntr
  • RAL
    • downtime to 5 pm today for firewall migration; looking good so far

  • CERN storage
    • next Mon at 09:00 CET EOSCMS upgrade, 5-10 min downtime
    • Tue next week LHCb CASTOR + DB upgrade
  • databases
    • tomorrow 09:30-11:30 CET LCGR DB downtime for upgrade, affecting various grid services

AOB:

Thursday

Attendance:

  • local: Felix, Luca M, Maarten, Maria A, Pablo, Przemek
  • remote: Alessandro, Daniele, Dennis, John, Michael, Pepe, Rob, Saerda, Sang-Un, Sonia, Vladimir

Experiments round table:

  • ATLAS reports -
    • T1
      • TAIWAN-LCG2: problems writing to TAPE, GGUS:102275
    • Alessandro: we are investigating slow transfer rates between several sources and destinations
      • the issues were first seen with FTS-3 transfers and discussed between FTS-3 experts, but were then found to be present also in FTS-2 transfers
      • the first case to be noticed was for very recent transfers between Cambridge and BNL, but bad performance between those sites already occurred in Dec
      • in some cases lcg-cp was found to work better with fewer streams
      • investigations are continuing and will be summarized in a report

  • CMS reports -
    • In general, quiet
    • Production activities
      • HeavyIon rereco pass launched and running, smooth so far. Quite demanding in memory (3-4 GB) and CPU (96h jobs). Also using HLT-Cloud
      • Phase II upgrade MC, soon also 13 TeV MC digi/reco
    • Tickets on T1s
    • Other tickets (newer on top)
    • Miscellanea
      • next week is CMS Spring Offline & Computing week at CERN
        • Daniele may be traveling to CERN during the 3pm call on Monday, will post report anyway

  • ALICE -
    • CERN: EOS briefly unavailable this morning
      • Luca: the wrong enclosure was switched off due to human error; CMS was also affected, but for them the service appears to have remained available read-only

Sites / Services round table:

  • ASGC
    • ATLAS ticket: transfer timeouts between UK and TW were investigated with PerfSONAR, which showed that a bandwidth of 8 MB/s should be reachable, whereas only 100 kB/s was observed for the failed transfers. After tuning the disk servers, the transfer bandwidth rose to 7 MB/s and the problem looks solved
  • BNL - ntr
  • CNAF - ntr
  • KISTI
    • downtime extended until March 22 because of errors in VOBOX and DPM configuration: the grid-mapfile cannot be generated due to SSL negotiation failures for voms.cern.ch
      • Maarten is looking into it
  • NDGF - ntr
  • NLT1 - ntr
  • OSG
    • Rob: to what extent are the new VOMS servers already available?
    • Maarten: they can already be used for making grid-mapfiles, but we have put firewall rules in place that prevent clients from getting proxies from them until we are sure WLCG is ready; this means OSG can and should add the new servers to its configuration files now
    • Rob: OK, should be part of the next OSG SW release foreseen for Apr 8
  • PIC
    • multicore queues to be opened on Apr 1
    • joining Xrootd federations for ATLAS and CMS
  • RAL
    • CMS SRM test failures have been understood by the CASTOR developers

  • CERN grid services
    • The SAM WMS have been unable to cope with the load since they were updated yesterday (GGUS:102492)
      • things look better after rolling back the ICE rpm
  • CERN storage
    • reminder of CASTOR + DB upgrade downtime for LHCb next Tue
  • databases
    • LCGR DB upgrade on Tue went OK
    • next Mon 09:30-12:30 CET LHCb offline DB upgrade to 12c
  • GGUS
    • Next release on the 26th of March. The release will bring the possibility to notify multiple sites with one ticket, add Shibboleth support, and implement several CMS-specific requests. The service will be in downtime from 7:00 to 10:00
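
The bandwidth figures in the ASGC item above (8 MB/s reachable according to PerfSONAR, 100 kB/s observed, 7 MB/s after tuning) can be put in proportion with a few lines of arithmetic. This is only an illustrative sketch: the variable and function names are made up for the example, and decimal units (1 MB = 1000 kB) are an assumption, since the minutes do not say which convention was used.

```python
# Figures quoted in the ASGC report; names below are illustrative only.
KB_PER_MB = 1000  # assumption: decimal units, 1 MB = 1000 kB

reachable_kb_s = 8 * KB_PER_MB  # bandwidth PerfSONAR showed should be reachable
observed_kb_s = 100             # rate seen for the failed transfers
tuned_kb_s = 7 * KB_PER_MB      # rate after tuning the disk servers


def speedup(after, before):
    """Factor by which the rate improved."""
    return after / before


def fraction_of(value, reference):
    """Share of the reference rate that was actually achieved."""
    return value / reference


improvement = speedup(tuned_kb_s, observed_kb_s)
share_before = fraction_of(observed_kb_s, reachable_kb_s)
share_after = fraction_of(tuned_kb_s, reachable_kb_s)

print(f"speed-up after tuning: {improvement:.0f}x")
print(f"share of reachable bandwidth before tuning: {share_before:.2%}")
print(f"share of reachable bandwidth after tuning: {share_after:.1%}")
```

Under these assumptions the tuning gave a 70-fold speed-up, moving the transfers from 1.25% to 87.5% of the bandwidth PerfSONAR indicated was reachable.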

AOB:

-- SimoneCampana - 20 Feb 2014

Topic attachments

  • MB-Mar-v2.pptx (2850.9 K, 2014-03-17 18:35, MaartenLitmaath): GGUS activity summary for WLCG service report
Topic revision: r9 - 2014-03-20 - MaartenLitmaath
 