Week of 120917

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO summaries of site usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, open issues & broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • GGUS Information
  • LHC Machine Information
  • CERN IT status board
  • M/W PPSCoordinationWorkLog
  • WLCG Baseline Versions
  • WLCG Blogs
  • GgusInformation
  • Sharepoint site: Cooldown Status, News


Monday

Attendance: local(Massimo, Peter, Yuri, Kath, Alexandre, Maarten, Ivan, Xavier, Maria, Ian);remote(Joel, JhenWei, Michael, Saverio, Lisa, Pavel, Rolf, Onno, Tiju).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • TRIUMF file transfer failures: "failed to contact on remote SRM" errors. GGUS:86120 submitted at ~7am on Saturday. Network issue: the TRIUMF-BNL link was broken and traffic did not automatically reroute via another path, for reasons not yet understood.
      • BNL -> CA file transfer failures: "failed to contact on remote SRM" errors. GGUS:86121 resolved on Saturday: the issue with the direct BNL-TRIUMF circuit is under investigation by the network service providers; the BNL team temporarily rerouted the traffic to the CA cloud via CERN.
      • RAL file transfer failures at UK sites: the proxy on lcgfta02.gridpp.rl.ac.uk had expired. GGUS:86124 filed at ~1pm on Saturday, solved on Sunday (see also Savannah:97509); the problem seems to have disappeared at ~10pm and no more errors of this type have been seen (see the proxy-lifetime check sketch after this report).
      • TAIWAN-LCG2 ~1K transfer failures: gridftp errors and "failed to contact on remote SRM" for srm2.grid.sinica.edu.tw. GGUS:86127 filed at ~4am on Sunday, solved at ~6pm: a backup process had caused high load on the CASTOR DB; now recovered.
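
The expired FTS proxy behind GGUS:86124 is the kind of failure a small periodic check can flag before transfers start failing. Below is a minimal sketch of such a check, assuming the standard voms-proxy-info client is available on the host; the proxy path and warning threshold are hypothetical and would need adapting to the local FTS setup.

    # proxy_check.py - minimal sketch: warn when a delegated proxy is close to expiry.
    # Assumes the VOMS client tools (voms-proxy-info) are installed on the host.
    import subprocess
    import sys

    PROXY_FILE = "/path/to/delegated/proxy.pem"   # hypothetical location of the service proxy
    WARN_BELOW_SECONDS = 6 * 3600                 # warn if less than 6 hours of lifetime remain

    def proxy_seconds_left(proxy_file: str) -> int:
        """Return the remaining proxy lifetime in seconds, as reported by voms-proxy-info."""
        result = subprocess.run(
            ["voms-proxy-info", "-file", proxy_file, "-timeleft"],
            capture_output=True, text=True, check=True,
        )
        return int(result.stdout.strip())

    if __name__ == "__main__":
        left = proxy_seconds_left(PROXY_FILE)
        if left < WARN_BELOW_SECONDS:
            print("WARNING: proxy %s expires in %d s" % (PROXY_FILE, left), file=sys.stderr)
            sys.exit(1)
        print("Proxy OK: %d s remaining" % left)

Such a check could run from cron on the FTS host and raise an alarm (or trigger re-delegation) well before the proxy actually expires.
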
  • CMS reports -
    • LHC / CMS
      • Technical Stop
    • CERN / central services and T0
      • We are rotating the Tier-0 DB during the technical stop, probably on Tuesday
      • We submitted a TEAM ticket last night for 2 inaccessible files holding up the Tier-0 Prompt Calibration loop. This was promptly addressed.
    • Tier-1/2:

  • ALICE reports -
    • This morning the AliEn DB servers had their MySQL version upgraded and other central services also underwent maintenance operations during the 3.5 h downtime.

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Reprocessing at T1s and selected T2s : started
    • New GGUS (or RT) tickets
    • T0:
    • T1 :
      • IN2P3 : downtime tomorrow
      • PIC : downtime tomorrow

Sites / Services round table:
  • ASGC: recovering from a power cut caused by an animal entering the main power distribution system
  • BNL: Network problem being analysed (involving ESnet and the Canadian network provider). One of the reasons for the problem was that no automatic failover was triggered, since the link was reported as up even though no data could flow. Michael pointed out that in similar cases a ticket should be assigned to both ends for better troubleshooting (e.g. two GGUS tickets with cross-references).
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: Downtime announced for LHCb. The dCache intervention will last 30 minutes, but recalls from tape might take longer (a few hours)
  • KIT: On Wednesday we have a team event, which could result in delayed responses during the day.
  • NLT1: ntr
  • RAL: ntr

  • CASTOR/EOS: EOSCMS upgraded to 0.2
  • Central Services: ce203.cern.ch will be drained as of Wednesday for re-installation and software upgrade. Other CEs will follow in the coming weeks.
  • Databases: CASTOR NS DB intervention tomorrow (10:00-13:00). The campaign of rolling (transparent) interventions is continuing
  • Dashboard: ntr

  • GGUS: (MariaDZ) File ggus-tickets.xls is up to date and attached to the twiki page WLCGOperationsMeetings. Final slides for tomorrow's MB are attached to this page.

AOB:
  • Phone conference not working correctly (the names of participants cannot be read from the usual window)

Tuesday

Attendance: local(Jamie, Massimo, Xavier, Peter, Ivan, Maria, Kate, Julia, Alessandro, Lothar, Maarten, MariaDZ, Alex);remote(Ulf, Michael, Xavier, Rolf, Lisa, Joel, Kyle, Jhen-Wei, Ronald, Tiju).

Experiments round table:


  • CMS reports -
  • LHC / CMS
    • Technical Stop
  • CERN / central services and T0
    • We are rotating the Tier-0 DB during the technical stop; downtime on Tuesday for 2 hours
    • We submitted a TEAM ticket last night for 2 inaccessible files holding up the Tier-0 Prompt Calibration loop. This was promptly addressed.
    • However, these errors are getting too frequent, and the proposed "resolution" (scheduled downtime Sep 17) is not consistent with the time the problem occurred (Sep 16)
  • Tier-1/2:
    • NTR
Lothar Bauerdick is CRC


  • ALICE reports -
    • CERN: some EOS-ALICE disk servers are not externally accessible, which causes jobs to be inefficient (failing over on timeout) or even fail altogether. GGUS:86186.

  • LHCb reports -
  • Running user analysis, prompt reconstruction and stripping at T0 and T1s
  • Simulation at T2s
  • Reprocessing at T1s and selected T2s : started

  • T0:

  • T1 :
    • IN2P3 : downtime tomorrow
    • CERN : castor intervention
    • PIC : downtime tomorrow
    • KIT: GGUS ticket about the test system being overloaded with a lot of requests; the load will be decreased. The ticket number will be added to this report.

Sites / Services round table:

  • NDGF - ntr
  • BNL - ntr
  • KIT - nta
  • IN2P3 - still in downtime, proceeding as planned; batch will be back at full capacity tomorrow during the day instead of Thursday morning. Joel: will storage be available tonight? A: should be OK now; the dCache intervention is very small.
  • FNAL - ntr
  • ASGC - ntr
  • NL-T1 - ntr
  • CNAF - ntr
  • RAL - next Tuesday downtime to upgrade CASTOR for ATLAS
  • OSG - ntr

  • KIT: (from yesterday) On Wednesday we have a team event, which could result in delayed responses during the day.

  • CERN (cvmfs-stratum-one.cern.ch): There will be a disk server change for the CVMFS stratum one at CERN at 07:00 UTC on Wednesday. This should be transparent to all.

  • CERN DB - intervention on the CASTOR DB: migrated to the latest DB version and CPU. Intervention on the ATLAS DB and downstream capture. Future interventions for CMS confirmed: online on Thursday, offline on Wednesday. Tomorrow: rolling intervention on the COMPASS DB, and NAS storage intervention for all CASTOR DBs with possible downtime of up to 30 minutes

  • CERN Storage - yesterday there was an incident on ATLAS: many WNs from Uni Freiburg effectively performing a "denial of service". Ale: these are user jobs running on local WNs, i.e. not through Panda; we do not yet have a complete understanding. The jobs were failing because they lacked authentication. Will look into how to shield the name server. Xavi: want to update EOSALICE to the newest version; confirmed by online, waiting for offline OK; ~1h downtime.

  • CERN dashboards - presentation of the new version of the WLCG transfer dashboard; a lot of useful feedback was received from people running xrootd federations, and hopefully a new version covering both xrootd and FTS transfers will be available within one month

AOB:

  • WLCG operations coordination: kick-off meeting next Monday 24 September. Half-day meeting from 14:00 to 17:00, with a pause at 15:00 for this call. Sites are strongly encouraged to join: Tier-1 responsibles plus all other sites who want to be part of this team. The meeting will be transmitted via Vidyo.

Wednesday

Attendance: local(Peter, Xavier, Maarten, Jamie, Lothar, Kate, Ivan, Alex);remote(Saverio, Michael, Joel, Lisa, Rolf, Tiju, Jhen-Wei, Pavel).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • FZK-LCG2_MCTAPE errors staging from tape. Some concern about this, although the dataset is not needed urgently... for now. GGUS:85955
      • PIC_DATADISK getting full. This is an ATLAS data management problem being investigated. Thanks to PIC ops for adding more space. Savannah:132089


  • CMS reports -
    • LHC / CMS
      • Technical Stop
    • CERN / central services and T0
      • Tier-0 intervention at 2pm, cleaning up/reinstalling databases
      • EOS issues: does anyone else see such problems? Follow-up on existing tickets?
    • Tier-1/2:
      • NTR
Lothar Bauerdick is CRC

  • ALICE reports -
    • CERN: some EOS-ALICE disk servers were still found to be externally inaccessible, which was causing job failures. GGUS:86186.
    • CERN: the high-rate conditions data replication from old disk servers into the EOS-ALICE OCDB name space was causing EOS deadlocks; the rate has been significantly lower since ~14:45.
    • CNAF: yesterday's WN overload was due to jobs that, after having downloaded their SW, instantly failed trying to access data on EOS disk servers that were not reachable from outside CERN. Some hot data has been replicated elsewhere, and hopefully some of the affected disk servers have become accessible in the meantime. Has the situation at CNAF improved?

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Reprocessing at T1s and selected T2s : 12k jobs running
    • T0:
    • T1 :
      • IN2P3 : downtime finished!

Sites / Services round table:

  • CNAF - ntr
  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - downtime finished
  • RAL - ntr
  • ASGC - ntr
  • KIT - no news on staging from tape; update tomorrow
  • NDGF - ntr

  • OSG - We have made some changes in the RSV uploading logic. We have contacted the team with details and look forward to confirmation that everything is working correctly.

  • CERN DB - today: patching of CMS offline and COMPASS; during the NAS storage upgrade there was a problem accessing the storage; tomorrow: rolling intervention on CMS online to apply the latest security patches

  • CERN storage - ALICE EOS update this morning; will follow up on other issues
AOB:

Thursday

Attendance: local(Jamie, Maarten, Peter, Xavier, Lothar, Przymek, Ivan, Alex);remote(Joel, Rolf, John, Jhen-Wei, Gonzalo, Kyle, Finland, WooJin, Salvatore, Lisa, Paco).

Experiments round table:

  • CMS reports -
  • LHC / CMS
    • Technical Stop
  • CERN / central services and T0
    • Tier-0 intervention successfully performed, thanks to Kate for swiftly helping with a database backup
  • Tier-1/2:
    • some data transfer problems at sites; asking them to restart PhEDEx agents

  • ALICE reports -
    • CERN: EOS-ALICE disk servers should all be externally accessible now, thanks! Let's see... (GGUS:86186)

  • LHCb reports -
  • Running user analysis, prompt reconstruction and stripping at T0 and T1s
  • Simulation at T2s
  • Reprocessing at T1s and selected T2s : 12k jobs running
  • Dashboard availability: We see NIKHEF and PIC as unavailable in the dashboard as of this morning. This is most probably not a real problem, but jobs are timing out. The reason is that we are submitting with the role lhcb_prod and the jobs are dying in the queues because we are running reprocessing heavily at the moment and the sites are fully loaded.

  • T0:

  • T1 :
    • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework.

Sites / Services round table:

  • IN2P3 - ntr
  • RAL - ntr
  • ASGC - ntr
  • PIC - yesterday evening there was a problem with ATLAS reporting FTS failures due to an expired proxy. Added a reference to the open ticket; a patch is expected in the near future
  • NDGF - ntr
  • KIT - regarding the tape staging errors: still no progress / news
  • CNAF - ntr; quick feedback on yesterday's ALICE issue: the failures were intermittent and the situation now looks OK
  • FNAL - ntr
  • NL-T1 - ntr
  • OSG - ntr

  • CERN storage - ticket from LHCb: seems to be an inaccessible CASTOR file; will follow up after the meeting. ALICE: reboot of the name server; should be recovering

  • CERN DB - CMS online DB being patched now; Monday 10:00 intervention on the CMS stager together with CASTOR; Tuesday 10:00 migration of the ATLAS stager together with CASTOR; ALICE also on Tuesday - all in the SSB

AOB:

Friday

Attendance: local(Jamie, Peter, Xavier, Lothar, Giuseppe, Maarten, Ivan, Kate, Alex);remote(Joel, Xavier, Salvatore, John, Gonzalo, Onno, Lisa, Rolf, Christian, Jhen-Wei).

Experiments round table:


  • CMS reports -
  • LHC / CMS
    • CMS preparing for end of Technical Stop
  • CERN / central services and T0
    • CASTOR problems this morning disrupted Tier-0 running -- do we know what happened?
    • Just opening a ticket about tens of files giving errors when rfcp'ing: "File has no copy on tape and no diskcopies are accessible". Kate: there was a problem with the CMS standby DB; it ran out of space at ~06:00 this morning and also blocked the production DB as well as the LHCb stager. Lothar: the service degradation went on until ~09:00. Xavier: degradation visible in SLS from 06:00 to 07:30. Kate: some things were changed in the standby configuration, so this should not reoccur. Giuseppe: the files were successfully written but unsuccessfully overwritten. More news in the ticket - we saw it just 5 minutes ago!
  • Tier-1/2:


  • LHCb reports -
  • Running user analysis, prompt reconstruction and stripping at T0 and T1s
  • Simulation at T2s
  • Reprocessing at T1s and selected T2s : 12k jobs running

  • T0:
    • CERN : cleaning of TMPDIR on lxbatch (GGUS:86039)
    • Xavier - the lost file is being investigated

  • T1 :
    • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework.

Sites / Services round table:

  • KIT - ntr
  • CNAF - ntr
  • RAL - last night there were problems with the ATLAS CASTOR instance for ~4h, due to an orphaned subrequest.
  • PIC - ntr
  • NL-T1 - ntr
  • FNAL - ntr
  • NDGF - ntr
  • ASGC - ntr
  • IN2P3 - ntr
  • OSG - ntr (by e-mail)

  • CERN DB - latest patches applied to CMS online and its active Data Guard. Interventions on the other DBs will be postponed by 2 weeks because of the LHC schedule change; they have been removed from the SSB, but this also needs to be done in GOCDB. Peter: impact? Kate: the DB requires a storage migration, so there is no possibility to do it online

  • CERN Storage: nta

AOB:

  • Alcatel problems: see below.
Dear user,

With the latest releases of web browsers (IE, FireFox, Chrome) available at the beginning of September, MyTeamWork users who upgraded their browsers are experiencing difficulties in accessing and/or running their audio-conferences on the server. 

Since then, experts have been actively working on finding workarounds detailed below. Please note that these are only workarounds for which we cannot guarantee that all features are working in all conditions. If ever you are still experiencing difficulties after implementing them, we invite you to report to the Service Desk, so that we can further investigate alternatives.

Link to workarounds: http://cern.ch/audioconferencing/WebBrowserCompatibility.htm 

The software editor is also informed of the situation, and the next release of MyTeamWork, announced for November 2012, should correct all bugs related to the new web browsers. 

We are sincerely sorry about the inconvenience caused,
Best regards,
IT/CS

-- JamieShiers - 02-Jul-2012

Topic attachments
  • ggus-data.ppt (PowerPoint, 2382.0 K, 2012-09-17 11:44, MariaDimou): Final slides with GGUS ALARM drills for the 2012/09/18 WLCG MB