Week of 110815

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local (AndreaV, Elena, Cedric, Ricardo, Daniele, Raja, Eva, Nilo, Massimo, Jamie, Mike, Lola); remote (Michael, Ron, Giovanni, Catalin, Tiju, Jhen-Wei, Dimitri, Christian, Rob).

Experiments round table:

  • ATLAS reports -
    • Physics
      • Reprocessing started on Friday
    • T0/Central Services
      • CERN-PROD_DATADISK: problem with transfers to IN2P3-CC_PHYS-SM (GGUS:73470)
    • T1 site
      • INFN-T1: srm down on Saturday afternoon and early Sunday morning (GGUS:73452; it was escalated to ALARM, GGUS:73456). The escalation to ALARM again did not seem to work. The storage instabilities should be fixed with the new release of StoRM (expected at the end of this month). [Giovanni: I think the ALARM was received properly, but will check. Andrea: why does ATLAS think this was not received properly? Elena: it took 4h before we got a reply. Giovanni: will check whether the 4h response time was due to problems receiving the alarm or to other issues.]
      • NIKHEF-ELPROD: CVMFS problem on WNs (GGUS:73451). Fixed in 2 h. Thanks. Problem with the storage system (GGUS:7346).
      • IN2P3-CC: 50% of reprocessing jobs were failing; problem with the SW release on specific WNs (GGUS:73461).
      • BNL: Reduced efficiency for transfers from CERN-PROD-T0 was observed on Friday night (GGUS:73446; the ticket was wrongly assigned to CERN, sorry). It was caused by a high load on the PNFS server. Solved. Thanks. Reprocessing jobs were failing due to a pilot problem (GGUS:73467). [Michael: we noticed issues even before shifters reported them. There is a suspicion that Panda resubmits jobs more than once and that log files cannot be written out because the LFN is always the same. In summary, we suspect that the problem is in pilot submission or Panda, rather than in the storage element. Elena: strange, because the problem suddenly appeared yesterday afternoon, while before that all was working fine. Michael: will follow up offline and report tomorrow; Paul Nilsson is looking at this with the local experts.]
      • RAL-LCG2: FT tests and 20% of transfers from RAL were failing (GGUS:73454). Problem with one disk server. Solved. Thanks.
      • TW-LCG2: problems in the international network. Routing had been switched to the backup link, but insufficient bandwidth made transfers time out and fail (GGUS:73464). Immediate reply. Thanks.
    • T2 sites
      • Nothing to report

  • CMS reports -
    • LHC / CMS detector
      • Cryo issues (Sat). Cryo compressor restarted (Sat afternoon). Trip of the cold compressor (Sat night). Change of the control electronics of the cold compressor (Sun morning). Electronics for cryogenics replaced (Sun). Cryo condition expected to be recovered around 14:00 on Tuesday. Beams not expected before Tuesday.
    • CERN / central services
      • issues with AFS and CVS? ALARM GGUS:73474
        • after the alarm the IT-SSB was updated, and within 30 mins from the ticket most volumes were reported to be back. Very prompt response, thanks. [Andrea: was this AFS issue only seen by CMS? Raja: also LHCb saw it. Massimo: this was a problem with one specific server, so it affected only specific experiments and individual users. It was due to problems with the SSD frontend cache of this AFS server.]
      • Full T0 CPU utilization (according to the current config) was observed twice last week. Discussing set-up optimizations to go beyond this.
      • SLS observations by shifters: read below. None impacted ops, so no tickets.
        • CASTORCMS_DEFAULT: still some SLS low availability from time to time. Could not add info to the GGUS ticket since it had already been commented on and verified
        • CASTORCMS_T0EXPORT: still SLS low availability observed, but could be just glitches
        • CASTORCMS_T0EXPRESS: SLS drops in 'total space', observed few times, duration ~15 minutes each
    • T1 sites:
      • ASGC: >2k production jobs failed due to stage-out problems at ASGC. Felix promptly checked but so far cannot reproduce the error (Savannah:122820). [Jhen-Wei: did not get any update yet, will follow up with colleagues at ASGC.]
      • FNAL-CNAF transfers: low quality, but due to just one file (issues with its adler32 checksum, probably not a site issue: investigating; see the checksum sketch after this report) (Savannah:122835)
    • T2 sites:
      • Mistakenly opened a ticket to the Florida T2, which was in maintenance, but shifters did not spot the downtime in OIM (ticket closed)
      • T2_PT_LIP_Lisbon fails SRM SAM tests after the downtime is over (Savannah:122829)
      • T2_US_Caltech and T2_US_UCSD have stage-out problems in the MC production (Savannah:122834)
      • T2_UK_London_IC and T2_US_Wisconsin show a sizeable number of transfer errors on import (Savannah:122840)
    • CRC change: Daniele Bonacorsi -> Ian Fisk (from tomorrow, for the next 7 days)
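
(Checksum note: the adler32 mismatch above is normally confirmed by recomputing the checksum of the transferred replica and comparing it with the value recorded in the catalogue. The sketch below is a minimal, hypothetical illustration in Python of such a check; it is not the actual CMS/PhEDEx tooling, and the catalogue value and file path are placeholders.)

      import zlib

      def adler32_of_file(path, chunk_size=1024 * 1024):
          """Compute the adler32 checksum of a file, reading it in chunks."""
          value = 1  # adler32 initial value
          with open(path, "rb") as f:
              while True:
                  chunk = f.read(chunk_size)
                  if not chunk:
                      break
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xffffffff)

      # Hypothetical comparison against the checksum recorded in the catalogue.
      catalogue_checksum = "0a1b2c3d"                        # placeholder value
      local_checksum = adler32_of_file("/path/to/replica")   # placeholder path
      if local_checksum != catalogue_checksum:
          print("adler32 mismatch: %s vs %s" % (local_checksum, catalogue_checksum))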

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK: A high load on one of the SEs started on Thursday evening, and on Friday another two SEs were affected. It looks like the machine cannot deal with the high number of files opened by xrootd clients. The problem is under investigation but has not been understood yet. [Dimitri: cannot comment now, will follow up offline.]
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • Quiet weekend. Backlog of jobs at GRIDKA cleared now. Running out of disk space (LHCb-Disk) at SARA. High number of problems with "input data resolution" at PIC.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • AFS problems this morning
      • T1
        • ntr

Sites / Services round table:

  • BNL: nta
  • NLT1: two issues
    • overloaded disk server is under investigation
    • file corruption on CVMFS is under investigation
  • CNAF: nta
  • FNAL: ntr
  • RAL: ntr
  • ASGC: network problem last night, should be ok now
  • KIT: short problem with power supply this morning (around 20 minutes), some dcache pools were affected
  • NDGF:
    • Oslo is back after security incident
    • downtime this morning was cancelled on ATLAS request, will need to be rescheduled
  • OSG: ntr

  • Storage services:
    • reminder of interventions tomorrow and Wednesday as announced last week
    • ticket mentioned by ATLAS has been understood
  • Database services: ntr
  • Grid services: ntr
  • Dashboard services: ntr

AOB: none

Tuesday:

Attendance: local (AndreaV, Cedric, Ricardo, Jamie, Raja, Massimo, Mike, Nilo, Eva, Lola); remote (Catalin, Xavier, Ronald, Michael, Jeremy, Giovanni, Tiju, Jhen-Wei, Marc, Christian, Rob; Ian).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • Transfer errors from Castor@CERN: still no answer on GGUS:73470. [Massimo: sorry, I mistakenly thought this was essentially closed. Will follow up with the support colleagues.]
    • T1 sites
      • Lyon: Incident yesterday evening: "Following the electrical incident of yesterday evening around 22:30, the batch system is still working in very degraded mode. About 15% of the workers are in production. The rest of the infrastructure (interactives, file transfer, ...) is working. The experts are working on the electrical system. We do not expect the stopped workers to be back before the end of the day." We stopped the production/analysis queues in the meantime.
      • INFN-T1: Some files (needed for reprocessing) could not be brought online after 3 attempts (GGUS:73490)
      • NIKHEF-ELPROD_PRODDISK full because of reprocessing
      • [Michael: it turns out that BNL issues yesterday were caused by jobs that were incorrectly set up, not by site issues. Cedric: thanks, confirm all ok now.]
    • T2 sites
      • ntr

  • CMS reports -
    • LHC Beam hopefully back this afternoon
    • CERN / central services
      • We exceeded the PhEDEx EOS quota. Now fixed.
    • T1 sites:
      • No new issues
    • T2 sites:
      • Data loss of 70 TB at T2_IT_PISA. We will invalidate and retransfer the affected data
      • 2 Reports of transfer problems
    • CRC is now Ian Fisk

  • ALICE reports -
    • T0 site
      • The root directory of both VOBoxes at CERN filled up last evening. In the afternoon AliEn was updated to the latest version, which had a bug that provoked this problem. During the night the number of jobs dropped because the CE AliEn service was not working, due to an incomplete definition of the environment variables.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 1 (SARA #73478)
      • T2: 0
    • Issues at the sites and services
      • T0
      • T1
        • IN2P3 power cut problem affected many jobs. Seems to be recovering now.
        • Continuing problems with PIC - problem with their tape system.
        • SARA has added some capacity to LHCb-Disk. Looking forward to the rest of the disk coming online.
        • 3 files lost at RAL due to a bad tape. Recovery by LHCb ongoing.

Sites / Services round table:

  • FNAL: ntr
  • KIT: ntr
  • NLT1: CVMFS was updated today to fix an issue reported yesterday; but we just discovered that there is an even more recent version of CVMFS with a specific fix for those issues, so we will make another update
  • BNL: nta
  • GridPP: ntr
  • CNAF: checked with Luca who confirmed that the ALARM escalation (reported yesterday) worked ok; the delay in response was due to problems with the phone connection
  • RAL: ntr
  • ASGC: ntr
  • IN2P3: the power cut yesterday appears to have been caused by a UPS failure. One of the two UPS units was restarted this morning, but the system is still running in degraded mode. Putting WNs back into production; all should be ok by this evening. Still following up the UPS issue with a UPS expert from the vendor.
  • NDGF: ntr
  • OSG: ntr

  • Storage services:
    • transparent intervention on CMS CASTOR was completed successfully
    • intervention on CASTORT3 is ongoing, but we realised that it cannot be done transparently. Is it ok for ATLAS and CMS to have a short interruption this afternoon at 4pm? It will be an interruption of only a few minutes, but the problem is that existing transfers will be dropped. Jamie: are you not increasing the risk that this happens during data taking if we delay it? Massimo: this is only for analysis, not production. Cedric: OK for ATLAS. Ian: OK for CMS, analysis jobs retry if transfers fail.
    • reminder: intervention on CASTOR public tomorrow
  • Grid services:
    • CERN VOMS reminder: VOMS upgrade from gLite 3.2 to EMI 1. Expected to be transparent. See http://cern.ch/itssb for any updates.
      • voms.cern.ch will upgrade 08:00 -> 10:00 UTC Wednesday 17th August.
      • lcg-voms.cern.ch will upgrade 08:00 -> 10:00 UTC Thursday 18th August
  • Database services: ntr
  • Dashboard services: ntr

AOB: none

Wednesday

Attendance: local (AndreaV, Cedric, Jamie, Mike, Massimo, Luca, Ricardo, Raja, Nilo, Eva, Lola); remote (Michael, Gareth, Giovanni, Catalin, Rob, Jhen-Wei, Ron, Christian, Pepe; Ian, Daniele).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • Downtime on castorpublic: https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=59064&grid_id=0 . srm-atlas.cern.ch is marked as affected, which is bad since automatic blacklisting is applied. Can we remove the dependency? [Massimo: the dependency comes from automatic testing; it has been like this for a couple of years already. Plan to take this out completely at the beginning of September, after the many interventions at the end of August.]
      • [Eva: tonight received a call at 4am from the shifter due to the maximum number of sessions being exceeded on ATLAS_COOLONL_PIXEL. Normally I should not be called for these application-specific issues. Cedric: will follow up offline.]
    • T1 sites
      • Lyon back.
      • KIT : GridKa link to CERN running on backup : "currently, the direct link from GridKa to CERN is down. Connections are routed via IN2P3. Most likely, the backup will be used throughout the night."
      • NIKHEF-ELPROD_PRODDISK full. ~20 TB reassigned from ATLASDATADISK. Situation now under control.
    • T2 sites

  • CMS reports -
    • Short fill taken last night was uneventful from a computing perspective
    • CERN / central services
      • Castor under intervention this morning. Transfers failing as expected
    • T1 sites:
      • FNAL lost power to all of the worker nodes overnight due to a UPS failure. Need to look at how this was communicated to operations. [Ian: learnt this by chance by email through a FNAL mailing list. Rob: OSG also did not receive any message from FNAL. Catalin: yes, there is probably something to be improved in the communication, will follow this up.]
      • [Ian: also observed problems with VOMS yesterday afternoon in some FNAL jobs, any explanation? Ricardo: there was an intervention this morning, but Steve confirmed that nothing was done yesterday to prepare for this. Ian: thanks, will follow up within CMS.]
    • T2 sites:
      • Two reports of PhEDEx agents down at Tier-2 sites

  • ALICE reports -
    • T0 site
      • A new AliEn version has been running at CERN since yesterday afternoon. The patches to solve the issues reported yesterday have been applied and services are working smoothly.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • Ongoing processing of data
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 1 (Aborted pilots at INFN CAGLIARI #73582)
    • Issues at the sites and services:
      • T0
      • T1
        • Continuing problems at IN2P3 (more power cuts?).
        • Continuing problems with PIC - problem with their tape system. Any update? [Pepe: LTO5 tape drives degraded: 3 out of 4 drives are out of production due to persistent read/write errors. Only one (LTO5) tape drive is available at the moment, in read (R) mode. VOs may see low performance when reading data from LTO5 tapes. In contact with the vendor, but they have no spare drives for the moment.]
        • Possible problems accessing data at NIKHEF and possibly SARA this morning. Wait and see. [Ron: two problems with storage this morning: one dcache head node issue, fixed; another issue with hanging processes on 3 pool nodes, fixed but being investigated. All should be back ok now.]
        • [Raja: was there any problem with GGUS in the last ten minutes? Did not manage to submit tickets. Andrea: GGUS expert (MariaD) on holiday till tomorrow. Please try again and send a message if it still fails, will contact GGUS support in that case. - After the meeting, Raja confirmed that the problem persisted and Andrea contacted ggus-info.]

Sites / Services round table:

  • BNL: ntr
  • RAL: ntr
  • CNAF: a CE issue was discovered via the Nagios tests and is being fixed right now.
  • FNAL:
    • WNs are down now, power should be up by noon.
    • also following up the problem with authentication
  • OSG: there was an outage yesterday at IU affecting BDII. The network is now back up and stable. More details can be found at http://osggoc.blogspot.com/2011/08/network-outage-at-indiana-university.html.
  • ASGC: ntr
  • NLT1: nta
  • NDGF: ntr
  • PIC: nta

  • Database services: nta
  • Grid services: new myproxy-test.cern.ch can now be tested by the experiments
  • Dashboard services: ntr
  • Storage services:
    • in addition to the Castor public intervention this morning, an EOS intervention is ongoing now
    • summary of proposed interventions during the machine development and technical stop at the end of August:
      • Castor interventions (1h) for CMS (25th), LHCb (30th), Alice (31st) to upgrade to 2.1.11-2 and to enable the Transfer Manager (already in production in ATLAS, CERNT3 and PUBLIC)
      • Castor intervention (2.5h) for Castor ATLAS (proposed for 29th) - the main part is DB maintenance
      • we are considering a longer (order of 4h) intervention on the nameserver, proposed for the 31st, including a hardware intervention and affecting all instances
      • on the 31st we would like to intervene on the CMS instance (DB only) in order to provide better performance and better isolation (new hardware)

AOB: none

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

  • CERN VOMS: The two VOMS interventions were completed without incident; now running the EMI 1 release. Unfortunately the bug that motivated this upgrade in the first place, GGUS:73577, is not fixed. Each of the VOMS servers continues to crash from time to time. The service has been increased from 4 hosts to 6 hosts, so the crashes will happen less often and there is less chance of hitting a bad node before cron restarts it.

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 18-Jul-2011

