Week of 121217

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs: WLCG Service Incident Reports
Open Issues & Broadcasts: WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site, Cooldown Status, News


Monday

Attendance: local(Alessandro, David, Ignacio, Jan, Jerome, Maarten, Maria D);remote(Christian, Dimitri, Gonzalo, Jacob, Joel, John, Michael, Paolo, Rob, Rolf, Tiziana, Wei-Jen).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • CERN-PROD: "failed to contact on remote SRM" errors (GGUS:89701). SRM-EOSATLAS was stuck; the service was restarted at 10:30 local time and seems to be OK now.
    • T1
      • NTR
    • Alessandro: details on ATLAS activities during the Xmas break will be provided in the coming days

  • CMS reports -
    • LHC / CMS
      • Proton physics finished this morning
    • CERN / central services and T0
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573, last updated 2012-12-14
    • Tier-1:
      • HC Job submission failures on T1_TW_ASGC computing elements, GGUS:89705
        • the error was due to a bad CVMFS symlink, fixed
      • SUM-CE problems at T1_DE_KIT, GGUS:89703
        • see KIT report
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: network problem starting some time between 11:50 and 12:50, still ongoing. ALICE VOBOX is reachable, but intermittently it cannot resolve e.g. myproxy.cern.ch; GGUS is unreachable intermittently. KIT have declared an unscheduled downtime.
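      • A minimal sketch (illustrative only; the hostname and probing interval are examples, not the actual tool used at KIT) of how such intermittent name-resolution failures can be probed from the VOBOX:

        import socket
        import time

        HOSTNAME = "myproxy.cern.ch"  # example of a name that intermittently failed to resolve

        for attempt in range(20):
            try:
                ip = socket.gethostbyname(HOSTNAME)
                print("%02d: %s -> %s" % (attempt, HOSTNAME, ip))
            except socket.gaierror as exc:
                print("%02d: %s -> resolution FAILED (%s)" % (attempt, HOSTNAME, exc))
            time.sleep(30)  # probe every 30 seconds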

  • LHCb reports -
    • Reprocessing running smoothly. All reprocessing for 2012 submitted. LHCb will start to test the ONLINE FARM to be used during Christmas.
    • T0:
    • T1:
      • CERN: some pilots failing (still the same issue, which is treated in an existing GGUS ticket)

Sites / Services round table:

  • ASGC
    • during the weekend 1 week ago there were transfer failures for ATLAS due to a HW problem affecting a disk partition: that issue has now been fixed
    • during the Xmas break there will be scheduled downtimes for DPM, CASTOR and network maintenance, details to be decided in the coming days
  • BNL - ntr
  • CNAF
    • tomorrow at 18:00 CET there will be a network intervention of ~1h affecting the grid services
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • since ~10:00 there are network problems around the institute; most of them look fixed, but some services are still down
      • Tiziana: services in the gridka.de domain are reachable when GGUS is not
      • Dimitri: we are awaiting feedback from the network team
      • Maarten: please send a broadcast when most of the services are back, or if the downtime may last a lot longer
        • as of 21:00 CET the declared downtimes have expired and KIT looks OK at least for GGUS and ALICE
  • NDGF - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL
    • 1 ATLAS disk server currently broken, expected to be back later this afternoon

AOB:

  • Joel: for many days we have had an open ticket against VOMS without an answer
    • Maria: use the escalate button!

  • Jacob: are EOS upgrades still foreseen to happen this week?
    • Jan: no, we are not ready; new provisional dates have not been decided yet

Tuesday

Attendance: local(Alessandro, Eddie, Jerome, Maarten, Marcin, Maria D, Massimo);remote(Christian, Gonzalo, Ian, Jeremy, Joel, John, Lisa, Michael, Paolo, Rob, Rolf, Ronald, Wei-Jen, Xavier).

Experiments round table:

  • CMS reports -
    • LHC / CMS
      • Proton physics still finished
    • CERN / central services and T0
      • NTR
    • Tier-1:
    • Tier-2:
      • NTR

  • LHCb reports -
    • Reprocessing running smoothly. All reprocessing for 2012 submitted. LHCb will start to test the ONLINE FARM to be used during Christmas.
      • Alessandro: ATLAS would be interested in how LHCb make their online farm available for offline computing; will follow up privately
    • T0:
    • T1:
      • RAL : Outage for a short period.

Sites / Services round table:

  • ASGC
    • downtime Dec 23-25 for network intervention, DPM and CASTOR upgrades
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT
    • resolution of the network incident is part of the GGUS report
    • between 9:00 and 10:00 CET 5 file servers were down due to power supply maintenance
    • the batch farm migration is proceeding OK
  • NDGF
    • Norway: BCCS site ARC upgrade downtime extended; only computing services are affected, all data should have replicas elsewhere
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL
    • there was a site-wide network outage this morning, mostly solved now; still looking into possible fall-out, e.g. CMS ticket GGUS:89792

  • dashboards - ntr
  • databases
    • yesterday and today the CMS online DataGuard instance suffered random broken connections; the error messages suggested "the database was being shut down" although it was not; after a restart of one instance the problem was gone; the DB team will try to find out whether the problem was due to a bug
  • GGUS
    • About yesterday's discussion on the VOMS-GGUS synchronisation problems: It is correct that Raja (LHCb), Alessandra (ATLAS) and a number of U.K. VO members can be found in the TEAM or ALARM groups or roles in VOMS, so there seem to be hiccups in the server's ability to list members. It is also correct that the renewal of the UK users' membership in the VO was automated, courtesy of the server manager, Steve Traylen, but not their membership in the Groups/Roles. So, if people are not yet brought up to date, they should check themselves (see the instructions in yesterday's entry) a.s.a.p. The VOMS-GGUS interface is being re-designed; progress can be followed via Savannah:133063.
      • Joel: is there a way for the GGUS portal to show the authorized teamers and alarmers? It would be quite useful
      • Alessandro, Maarten: agreed
      • MariaD: please add a request for that feature to Savannah:133063
        • Done
    • About yesterday's network problem at KIT: the broadcast for the end of the incident at 17:20 CET is https://operations-portal.egi.eu/broadcast/archive/id/844 , while its beginning was announced around 10:40 CET on http://www.scc.kit.edu/dienste/meldungen.php (in English at the end). Official statement issued today:
       The network failure yesterday (Monday, 17.12.2012) was caused by two
       independent faults in the KIT network that occurred almost simultaneously.
       Because the two faults interacted, the real causes were initially very
       unclear.  On North Campus there was a hardware failure in one of the
       core backbone routers: the redundant hardware component rebooted completely
       unexpectedly, without any preceding event and without any configuration change.
       On South Campus there was another fault, caused by the network in a building;
       because of its large impact, localising the cause was very difficult.
       The faulty network components in the wiring closets were replaced in the
       afternoon of Dec 17th.  Sorry for the inconvenience caused.
            
    • For the Year End period: GGUS is monitored by a monitoring system connected to the on-call service. In case of total GGUS unavailability, the on-call engineer (OCE) at KIT will be informed and will take appropriate action. Apart from that, WLCG should submit an alarm ticket, which triggers a phone call to the OCE.
  • grid services - ntr
  • storage - ntr

AOB:

Wednesday

Attendance: local(Alessandro, David, Maarten, Marcin, Maria D, Massimo, Stefan);remote(Ian, John, Michael, Paolo, Pavel, Rob, Rolf, Ron, Ulf, Wei-Jen).

Experiments round table:

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR
    • General news: CMS will be running data reprocessing, simulation, and analysis over the holiday. We recognize that effort levels are lower for everyone during this period, but ask that issues be responded to as well as possible during this time. Our schedule for the spring depends on accomplishing some work now.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • IN2P3 - ntr
  • KIT
    • 90% of the batch system migration to UNIVA GE (Grid Engine) has been done
  • NDGF
    • 1 site will reboot their disk pools for upgrades tomorrow, should be transparent
    • some pool nodes rebooted spontaneously today; the issue has been fixed
  • NLT1 - ntr
  • OSG - ntr
  • RAL
    • in the coming days a bunch of disk servers will be rebooted for firmware upgrades; the experiments will be informed as needed

  • dashboards - ntr
  • databases - ntr
  • GGUS
    • Concerning the experiments' desire to see in GGUS all TEAMers/ALARMers known to GGUS and to compare them with the VOMS view: this was recorded in Savannah:133063#comment10 and discussed at the dev. meeting today. The functionality is already in place for TEAMers of one's own VO. If you are a TEAMer, from https://ggus.eu click on "My TEAM tickets", then follow https://ggus.eu/pages/view_team_member.php . The request for implementing the ALARMers' list is Savannah:134672. When these workflows entered production on 2008/07/03, the experiments claimed there would be no more than 5 authorised ALARMers per VO.
      • Maarten: I have asked for such info to be made easier to find and visible to anyone in the VO
    • We periodically revisit Support Units that appear inactive. We have the VO CPPM defined with supporters at IN2P3 and the VO Planck with a VOMS server at CNAF. Do people know if they are active, if they have users and if they wish to remain in GGUS for support?
      • Rolf: please open tickets for the respective NGIs
  • grid services
    • 2 new EMI-2 CREAM CEs have been added: ce201 and ce202
  • storage
    • this morning during maintenance the e-group of CMS users was not visible for a while, which may have caused some glitches
    • question for CMS: what is the status of the deletion of streamer data from 2009, 2010 and possibly 2011?
      • Massimo will follow up with CMS
    • there is some spare capacity that could be added to busy CASTOR pools for the Xmas break
      • Massimo: ALICE should be OK already; will follow up with CMS
      • Stefan: LHCb will be creating MC data that will go to LHCb_DISK (and/or EOS)
      • Alessandro: CASTOR-ATLAS should not be exercised; EOS looks OK as well

AOB:

Thursday

Attendance: local(Alessandro, David, Jan, Jerome, Maarten, Marcin, Maria D, Stefan);remote(Gareth, Ian, Jeremy, Lisa, Michael, Paolo, Pavel, Rob, Rolf, Ronald, Ulf, Wei-Jen).

Experiments round table:

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • There have been many HC job submission failures on the CERN CREAM CEs, but things seem to be OK now
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • LHCb reports -
    • Reprocessing running smoothly, all files submitted; shall be finished by the weekend
    • Simulation workflows on the online farm have been successfully validated, starting ramp up of production activities now
    • T0:
      • VOMS : (GGUS:89497) voms-admin command timing out, being investigated
      • LHCb pilots were failing on the grid during the network glitch, because they tried to access an AFS-hosted web service
      • Redundant pilots (GGUS:87448): fixed on 3 CEs; submission of pilots to those CEs has started and no issues have been seen so far
        • Jerome: currently only the EMI-2 CREAM CEs (ce201, ce202, ce203) have a patch to prevent stuck jobs
    • T1:
      • RAL: picking up jobs again after the network problem yesterday
        • Gareth: the network problem was on Tue; not clear why LHCb jobs did not ramp up earlier, maybe other VOs had priority
      • GRIDKA: timeouts of transfers to Gridka Disk storage under investigation (GGUS:88425)
      • GRIDKA: new CEs are publishing 999999999 in the BDII for max CPU time (GGUS:89857)
        • Pavel: we will look into that after the meeting
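        • A minimal sketch (assumptions: the python-ldap module is available and a top-level BDII is reachable at lcg-bdii.cern.ch:2170; the hostname is illustrative) of how the published max CPU time values for the new KIT CEs can be checked, to spot the 999999999 placeholder reported in GGUS:89857:

          import ldap

          def as_str(v):
              # python-ldap returns str under Python 2 and bytes under Python 3
              return v.decode() if isinstance(v, bytes) else v

          conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")  # assumed top-level BDII endpoint
          conn.simple_bind_s()  # anonymous bind

          # GLUE 1.3: one GlueCE entry per CE/queue; GlueCEPolicyMaxCPUTime is in minutes
          results = conn.search_s(
              "o=grid", ldap.SCOPE_SUBTREE,
              "(&(objectClass=GlueCE)(GlueCEUniqueID=cream-ge-*-kit.gridka.de*))",
              ["GlueCEUniqueID", "GlueCEPolicyMaxCPUTime"])

          for dn, attrs in results:
              ce = as_str(attrs["GlueCEUniqueID"][0])
              maxcpu = as_str(attrs.get("GlueCEPolicyMaxCPUTime", ["?"])[0])
              note = "  <-- placeholder value" if maxcpu == "999999999" else ""
              print("%s: MaxCPUTime=%s%s" % (ce, maxcpu, note))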

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP
    • the next 2 weeks will have best effort support at the UK T2 sites
      • and elsewhere...
  • IN2P3
    • the next 2 weeks T1 services will have fairly standard support (taking into account bank holidays and weekends as usual)
  • KIT
    • the migration of the farm has finished, 17k cores are available for production
    • 8 new CREAM CEs, cream-ge-{1,2,...,8}-kit.gridka.de:8443/cream-sge-core1, should be used and added to the experiment-specific monitoring (see the sketch below)
      • Alessandro: ATLAS are already testing some of them
      • Maarten: will update the ALICE configuration
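      • A purely illustrative sketch expanding the compact host pattern above into the full list of CE endpoints, e.g. for adding to monitoring configurations:

        # expand cream-ge-{1,...,8}-kit.gridka.de:8443/cream-sge-core1
        kit_ces = ["cream-ge-%d-kit.gridka.de:8443/cream-sge-core1" % n for n in range(1, 9)]
        for ce in kit_ces:
            print(ce)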
  • NDGF
    • this morning the FTS suffered a 15 min unavailability due to an incident with SUNET (Swedish University Network)
  • NLT1 - ntr
  • OSG - ntr
  • RAL
    • currently doing firmware upgrades on ATLAS/LHCb/... disk servers; they are temporarily unavailable during the interventions

AOB:

Friday

Attendance: local(Alessandro, David, Jan, Jerome, Maarten, Marcin);remote(Gareth, Gonzalo, Jeremy, Lisa, Michael, Onno, Paolo, Rob, Thomas, Wei-Jen, Xavier).

Experiments round table:

  • ATLAS reports -
    • MC! (which stands for Merry Christmas, but also for a lot of MC production during the Christmas holidays, together with delayed stream processing and group production)

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

Sites / Services round table:

  • ASGC
    • Reminder: Dec 23-25 downtime for network, CASTOR and DPM maintenance
  • BNL - ntr
  • CNAF
    • Due to an upgrade of the Tier-1 core switch and of the largest part of our storage systems, CNAF will be down for 3 days, from Monday January 7 at 9:00 CET (all SRM end-points will be closed) until Thursday January 10 at 10:00 CET. To allow a smooth completion of jobs, the queues on the CNAF farm will be closed on the evening of January 4 2013, except the LHC ones (as requested by the local support staff). On Tuesday morning the core switch will be upgraded starting from 10:00 CET: on that day CNAF may not be accessible from the WAN and the LAN connectivity may be interrupted. The storage upgrade operations will be completed within the 3 days, and the farm, the queues and the SRM end-points will consequently be reopened on Thursday January 10 at 10:00 CET.
  • FNAL - ntr
  • GridPP - ntr
  • KIT
    • 1 broken tape caused 52 LHCb files to be lost, the list has been sent to LHCb
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards
    • ATLAS CREAM CE job submission tests for CNAF timed out yesterday, OK now
    • on Jan 8 at 10:00 CET the ATLAS DDM dashboard will be upgraded, should take ~30 min
  • databases - ntr
  • grid services - ntr
  • storage - ntr

AOB:

  • THANKS for your contributions to making 2012 a very successful year for WLCG!
    • Further challenges and opportunities await us in 2013... :-)
  • Next meeting: Mon Jan 7

Season's Greetings!

-- JamieShiers - 18-Sep-2012

Topic attachments
  • ggus-data.ppt (2564.5 K, 2012-12-17 14:34, MariaDimou): Final ALARM drills for the 2012/12/18 WLCG MB