Week of 120618

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO summaries of site usability: ALICE ATLAS CMS LHCb
SIRs, open issues & broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General information: LHC Machine Information, CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS information: GgusInformation
Sharepoint site - Cooldown Status - News


Monday

Attendance: local (Andrea, Doug, Ian, David, Maarten, Vladimir, Luca, Eva); remote (Michael/BNL, Ulf/NDGF, Lisa/FNAL, Rolf/IN2P3, Jhen-Wei/ASGC, Tiju/RAL, Onno/NLT1, Kyle/OSG, Dimitri/KIT).

Experiments round table:

  • ATLAS reports -
    • T1
      • IN2P3 FTS channels got stuck. GGUS:83320 solved: some channel agents did not recover their Oracle connection after the 4:00 AM logrotate, due to a problem with Oracle virtual IPs. Solved by defining a new connection string that does not use the Oracle virtual IPs.
      • TRIUMF: 1745 files lost. Files declared to the consistency service. Savannah:95440. The ticket will be updated when the exact number of lost files is confirmed.
      • [Doug: we also had a hiccup in T0 processing two nights ago due to a full disk; it was not properly monitored, hence not noticed by the shifters. David: what monitoring was this? Doug: the issue was in Firefox for the T0 console monitoring; we are now trying to improve our monitoring]
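The IN2P3 logrotate failure above is a common pattern: the rotation hook restarts or signals a daemon, which must then re-establish its database connections. A minimal logrotate stanza of this kind might look as follows (a sketch only; all paths and service names are invented for illustration, not taken from the FTS setup):

```
/var/log/fts/channel-agent.log {
    daily
    rotate 7
    compress
    postrotate
        # Restarting forces the agent to reopen its log file -- and to
        # re-establish its Oracle connection, which is the step that
        # failed at IN2P3 when the virtual IPs misbehaved.
        /sbin/service fts-channel-agent restart > /dev/null 2>&1 || true
    endscript
}
```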

  • CMS reports -
    • LHC machine / CMS detector
      • Good data taking
    • CERN / central services and T0
      • Problems with a few Express stream files. Software experts are looking into it.
    • Tier-1/2:
      • Problems with FNAL over the weekend: network issues and problems with the submission services. They seem to have recovered now.
      • Migration issues at ASGC. Local admins are working on them.

  • ALICE reports -
    • Over the weekend, a large number of jobs at CERN failed due to insufficient scratch space: GGUS:83345

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • T0:
      • CERN (GGUS:83351) batch system: we see a peak of submitted jobs every night.
      • [Vladimir: also had pilots aborted with reason 999 during the last two days, no ticket as the issue is now fixed. Maarten: must have been an LSF issue.]
    • T1: ntr

Sites / Services round table:

  • Michael/BNL: ntr
  • Ulf/NDGF:
    • the ATLAS ticket about files not returned from tape is being investigated; it may be related to dCache rather than to the tapes
    • tomorrow electrical maintenance in Slovenia, some ATLAS files will be unavailable
  • Lisa/FNAL: ntr
  • Rolf/IN2P3: ntr
  • Jhen-Wei/ASGC: ntr
  • Tiju/RAL: work on site network tomorrow morning 8am to 11am
  • Onno/NLT1: SARA downtime this morning, completed at 2pm: dCache was upgraded to 2.2 and the tape library's cartridge-insertion issues were fixed
  • Kyle/OSG: ntr
  • Dimitri/KIT: ntr

  • Luca/Storage: ntr
  • David/Dashboard: ntr
  • Eva/Databases: ntr

AOB: (MariaDZ) https://twiki.cern.ch/twiki/pub/LCG/WLCGOperationsMeetings/ggus-tickets.xls is up-to-date and attached to twiki WLCGOperationsMeetings. Complete ALARM drills are attached at the end of this page. There were 7 real ALARMS since the last MB, all from ATLAS, all for CERN, mostly storage and LSF issues.

Tuesday

Attendance: local(David, Eva, Ignacio, Luca M, Maarten, Oliver, Yuri);remote(Gareth, Gonzalo, Jeremy, Jhen-Wei, Lisa, Lorenzo, Michael, Rob, Rolf, Ulf, Vladimir, Xavier M).

Experiments round table:

  • ATLAS reports -
    • T0
      • T0, CERN-PROD: ~5000 transfer failures with SRMV2STAGER: SRM_FAILURE. GGUS:83361 solved: see GGUS:83360.
      • T0: problems writing/retrieving data to/from the t0merge and t0atlas pools. ALARM GGUS:83360 solved: configuration issue, all the disk servers were unreachable; fixed.
      • T0: delay finishing 6000 running jobs, many jobs pending since 5pm June 18. ALARM GGUS:83362 solved: filesystem unavailability fixed in <1h on June 18.
        • Luca: all those problems were due to the CASTOR disk servers not being reachable; it took ~45 minutes before almost all of them had recovered; 5 remained in a funny state, fixed ~20:00 yesterday evening
      • T0: very slow LSF response time to bsub (>3-5 min) affects event reco distribution to the T1s. ALARM GGUS:83375 assigned at ~7:40am June 19; looks better after 8am.
        • Ignacio: this time no culprit was identified yet; snapshots and logs have been sent to Platform who are trying to reproduce the problem in their labs; the problem disappeared by itself, then it got a bit worse again later around noon
    • T1
      • NDGF-T1: issue with files that cannot be pinned from tape. GGUS:83349 solved: an HSM script failure pointed to tape problems; files restored, transfers succeeded.
      • FZK: many transfer failures due to the FTS log partition being full. ALARM GGUS:83367 solved: cleaned up this morning.

  • CMS reports -
    • LHC machine / CMS detector
      • Machine development
      • Preparing to move the Tier-0 to a different machine with more disk space; testing today
    • CERN / central services and T0
      • The CERN Security Team discovered a security incident on the CMS HyperNews system: a security hole was exploited and a number of passwords were exposed online. All measures were taken within a few hours, including informing users and blocking unsafe access to resources. Operations are not compromised, and a post-mortem is in progress.
    • Tier-1/2:
      • KIT had tape issues yesterday which caused very low CPU efficiencies for running jobs (no writing and almost no reading from tape). This is fixed now, but CMS is currently running almost no jobs there because of its fair share.
        • Xavier: the problem is back; it started failing during the night. Currently no one can write to tape, because the failing library is the only one with free space; the other libraries are available for reading only. We will post an entry in the GOCDB.

  • ALICE reports -
    • Some EOS-ALICE disk servers cannot be reached from outside CERN, leading to job failures and/or inefficiencies; being worked on.
      • Luca: it has been fixed just now
    • IN2P3: bad job efficiency being investigated; it appears to be due to issues with accessing local storage.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3
    • will look into issues reported by LHCb and ALICE
  • KIT - nta
  • NDGF - ntr
  • OSG
    • one week from today (i.e. June 26), OSG central services will be patched during that day's maintenance window
  • PIC - ntr
  • RAL
    • today's planned network outage went OK, the access routers were updated

  • dashboards - ntr
  • databases - ntr
  • grid services - nta
  • storage - nta

AOB:

Wednesday

Attendance: local (Andrea, Yuri, Oliver, David, Luca, MariaDZ, Ignacio, Eva); remote (Michael/BNL, Ulf/NDGF, Lisa/FNAL, Pavel/KIT, Jhen-Wei/ASGC, Ron/NLT1, Gonzalo/PIC, Tiju/RAL, Rolf/IN2P3, Rob/OSG; Vladimir/LHCb).

Experiments round table:

  • ATLAS reports -
    • CENTRAL SERVICES
      • GGUS:82907 updated: it is still not possible to specify the VO when sending a GGUS team ticket from comp@P1; next to the VO field there is neither a drop-down option nor a box to fill in. [MariaDZ: the GGUS developer discovered what happens: it works if the certificate is loaded in the browser, because VOMS then knows it is ATLAS; it does not work if username and password are used, because the certificate is not seen and the user is not associated with ATLAS. It will be fixed on Monday; ATLAS please test it on Monday.]
    • T0
      • NTR
    • T1
      • NTR
    • T2 + OTHERS

  • CMS reports -
    • LHC machine / CMS detector
      • Machine development
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • NTR

  • ALICE reports -
    • The ALICE VOMRS service is failing, GGUS:83432 [Ignacio: fixed now; it was related to the TNS database upgrade on Friday: the low-level address was used instead of the TNS alias, and the port had changed. Eva: it is always better to use the TNS alias if possible.]
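Eva's advice can be illustrated with a minimal tnsnames.ora entry (a sketch only; all names below are invented): clients that connect through the alias are insulated from changes to the underlying host, port or low-level address.

```
# Hypothetical tnsnames.ora entry -- names are for illustration only
ALICEDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = alicedb.example.cern.ch)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = alicedb.example.cern.ch))
  )
```

A client would then connect as user@ALICEDB; if the port or host changes, as in the upgrade above, only the tnsnames.ora entry needs updating, not every service configuration.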

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • T0:
      • CERN (GGUS:83351) batch system: we see a peak of submitted jobs every night.
    • T1:
      • FZK-LCG2: (GGUS:83425) Jobs failed with "dcap: Last IO operation timeout."

Sites / Services round table:

  • Michael/BNL: ntr
  • Ulf/NDGF: ntr
  • Lisa/FNAL: ntr
  • Pavel/KIT: tape system is now fully operational
  • Jhen-Wei/ASGC: ntr
  • Ron/NLT1: had to reboot the SRM to fix a dCache issue and a storage issue; now moving to a new kernel and a new driver
  • Gonzalo/PIC: announcement of a major full-day intervention on the core router on July 4 (urgent, but presently on hold because of ICHEP pressure)
  • Tiju/RAL: investigating a problem with network traffic into RAL
  • Rolf/IN2P3: ntr
  • Rob/OSG: ntr

  • David/Dashboard: ntr
  • Eva/Databases: LHCb online database is being patched with security updates
  • Luca/Storage: ntr
  • Ignacio/Grid: still working with the platforms group on the problem with latency and submissions, looking into both network and storage

  • CERN VOMRS: registration processing, including renewals, has been impossible for the LHC VOs for the past day or two. The situation will be corrected today, or tomorrow at the latest.

AOB: none

Thursday

Attendance: local (Andrea, Yuri, Stephen Marcin, Mike, Luca, Ignacio); remote (Gonzalo/PIC, Ulf/NDGF, Lisa/FNAL, John/RAL, Jhen-Wei/ASGC, Ronald/NLT1, Rolf/IN2P3, Rob/OSG; Vladimir/LHCb).

Experiments round table:

  • ATLAS reports -
    • CENTRAL SERVICES
      • NTR
    • T0
      • NTR
    • T1
      • SARA: file transfer failures to various sites in CA with "failed to contact on remote SRM". GGUS:82490. Probably caused by a wrong kernel level on the SRM; the SRM was rebooted with a new kernel on the afternoon of June 20.
    • T2
      • GOEGRID->FZK transfer failures. Source error: failed to contact on remote SRM. GGUS:83444 solved (June 21, 8:17). A pool node got stuck and needed to be rebooted.

  • CMS reports -
    • LHC machine / CMS detector
      • Machine development
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • T0:
      • CERN (GGUS:83351) batch system: we see a peak of submitted jobs every night.
    • T1:
      • FZK-LCG2: (GGUS:83425) Jobs failed with "dcap: Last IO operation timeout."
      • FZK-LCG2: (GGUS:83456) LHCb VO-box at GridKa is down; Fixed

Sites / Services round table:

  • Gonzalo/PIC: ntr
  • Ulf/NDGF: ntr
  • Lisa/FNAL: ntr
  • John/RAL:
    • the network issue mentioned yesterday was understood and fixed this morning at 10am
    • next Wednesday the database behind CASTOR will be upgraded; the intervention will be declared in GOCDB
  • Jhen-Wei/ASGC: ntr
  • Ronald/NLT1: ntr
  • Rolf/IN2P3: ntr
  • Rob/OSG: ntr

  • Mike/Dashboard: ntr
  • Luca/Storage: ntr
  • Ignacio/Grid: ntr
  • Marcin/Database:
    • yesterday patched LHCb online db
    • tomorrow will patch CMS online db and CMS active data guard

AOB: none

Friday

Attendance: local (Andrea, Yuri, Maarten, Luca, Marcin, Mike, Ignacio); remote (Alexander/NLT1, Lorenzo/CNAF, Xavier/KIT, Lisa/FNAL, Gareth/RAL, Ulf/NDGF, Jhen-Wei/ASGC, Jeremy/GridPP, Rolf/IN2P3; Vladimir/LHCb).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • NTR
    • T1
      • GridKa reported issues with a disk stack at ~5pm June 21, resulting in 13 dCache disk-only pools being taken offline. No downtime announcement in GOCDB. Did not affect ATLAS transfers or production. [Xavier/KIT: issues solved, pools back online this morning at 9am.]
    • T2 + OTHERS
      • Running job failures after applying the SL5 python security update (python-2.4.3-46.el5_8.2): https://cern.service-now.com/service-portal/view-request.do?n=RQF0111006 Discussed on Wed June 20. The update is now in both the FNAL and CERN repositories. ATLAS has prepared a special pilot patch to fix this issue, but it will only be deployed on Monday, in order to first complete urgent tasks. We recommend that sites also postpone the SL5 python update until Monday if possible. [Maarten: Rod Walker sent an EGI broadcast about this; Maarten also asked OSG to do a similar broadcast. Rob/OSG: thanks, will follow up.]
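For sites on SL5 that want to follow the recommendation, one possible way to hold back the update until Monday is a yum exclude, sketched below as a hypothetical /etc/yum.conf fragment (the exact package glob should be verified locally before use):

```
# Hypothetical /etc/yum.conf addition: hold back the python security
# update until the ATLAS pilot patch is deployed on Monday
[main]
exclude=python-2.4.3-46.el5_8.2*
```

The exclusion should be removed once the pilot patch is in place, so that the security update is eventually applied.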

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • T0:
      • CERN (GGUS:83351) batch system: we see a peak of submitted jobs every night.
    • T1:
      • FZK-LCG2: (GGUS:83425) Jobs failed with "dcap: Last IO operation timeout." [Xavier/KIT: the problem seems a bit random; we can see that it only happens on a certain set of nodes, but this still needs investigation. In any case it is probably not specific to GridKa; there is a lot of discussion about these issues on the dCache lists.]
      • PIC: (GGUS:83469) Pilots aborted; Fixed

Sites / Services round table:

  • Alexander/NLT1: ntr
  • Lorenzo/CNAF: ntr
  • Xavier/KIT: nta
  • Lisa/FNAL: ntr
  • Gareth/RAL: work planned for next Wednesday, declared in GOCDB
  • Ulf/NDGF: ntr
  • Jhen-Wei/ASGC: scheduled network intervention tomorrow; the site will rely on its backup during the work
  • Jeremy/GridPP: ntr
  • Rolf/IN2P3: ntr
  • Rob/OSG: issues in communication between Indiana and Brookhaven for BDII publishing; it seems to be a network issue in the Chicago area, still being investigated

  • Mike/Dashboard: ntr
  • Luca/Storage: ntr
  • Ignacio/Grid: ntr
  • Marcin/Databases:
    • CMS patching today went ok
    • next Monday afternoon will patch ALICE database
    • next Monday will also do two transparent interventions on storage infrastructure: in the morning for CMS at the pit, in the afternoon for CMS online, LCGR, CMSR and COMPASS

AOB: none

-- JamieShiers - 22-May-2012

Topic attachments
  • ggus-data.ppt (PowerPoint, 2546.5 K, 2012-06-18 11:54, MariaDimou): Complete GGUS ALARM drills for the 2012/06/19 WLCG MB.
Topic revision: r20 - 2012-06-22 - AndreaValassi
 