Week of 110815

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local (AndreaV, Elena, Cedric, Ricardo, Daniele, Raja, Eva, Nilo, Massimo, Jamie, Mike, Lola); remote (Michael, Ron, Giovanni, Catalin, Tiju, Jhen-Wei, Dimitri, Christian, Rob).

Experiments round table:

  • ATLAS reports -
    • Physics
      • Reprocessing started on Friday
    • T0/Central Services
      • CERN-PROD_DATADISK: problem with transfers to IN2P3-CC_PHYS-SM (GGUS:73470)
    • T1 site
      • INFN-T1: SRM down on Saturday afternoon and in the early hours of Sunday (GGUS:73452, escalated to ALARM as GGUS:73456). Once again the upgrade to ALARM did not seem to work. The storage instabilities should be fixed with the new StoRM release (expected at the end of this month). [Giovanni: I think the ALARM was received properly, but will check. Andrea: why does ATLAS think this was not received properly? Elena: it took 4h before we got a reply. Giovanni: will check whether the 4h response time was due to problems receiving the alarm or to other issues.]
      • NIKHEF-ELPROD: CVMFS problem on WNs (GGUS:73451). Fixed in 2 h. Thanks. Problem with the storage system (GGUS:7346).
      • IN2P3-CC: 50% of reprocessing jobs were failing, problem with SW release at specific WNs (GGUS:73461).
      • BNL: Reduced efficiency for transfers from CERN-PROD-T0 was observed on Friday night (GGUS:73446, which was initially wrongly assigned to CERN, sorry). It was caused by a high load on the PNFS server. Solved. Thanks. Reprocessing jobs were failing due to a pilot problem (GGUS:73467). [Michael: we noticed issues even before shifters reported them. There is a suspicion that Panda resubmits jobs more than once and that log files cannot be written out because the LFN is always the same. In summary, we suspect that the problem is in pilot submission or Panda, rather than in the storage element. Elena: strange, because the problem suddenly appeared yesterday afternoon, while before all was working fine. Michael: will follow up offline and report tomorrow; Paul Nilsson is looking at this with the local experts.]
      • RAL-LCG2: FT tests and 20% of transfers from RAL were failing (GGUS:73454). Problem with one disk server. Solved. Thanks.
      • TW-LCG2: problems in the international network. Routing had been switched to the backup, but insufficient bandwidth made transfers time out and fail (GGUS:73464). Immediate reply. Thanks.
    • T2 sites
      • Nothing to report

  • CMS reports -
    • LHC / CMS detector
      • Cryo issues (Sat). Cryo compressor restarted (Sat afternoon). Trip of the cold compressor (Sat night). Change of the control electronics of the cold compressor (Sun morning). Electronics for cryogenics replaced (Sun). Cryo condition expected to be recovered around 14:00 on Tuesday. Beams not expected before Tuesday.
    • CERN / central services
      • Issues with AFS (and CVS?): ALARM GGUS:73474
        • after the alarm the IT-SSB was updated, and within 30 mins from the ticket most volumes were reported to be back. Very prompt response, thanks. [Andrea: was this AFS issue only seen by CMS? Raja: also LHCb saw it. Massimo: this was a problem with one specific server, so it affected only specific experiments and individual users. It was due to problems with the SSD frontend cache of this AFS server.]
      • Full T0 CPU utilization (according to the current configuration) was observed twice last week. Discussing set-up optimizations to go beyond this.
      • SLS observations by shifters (see below). None impacted operations, so no tickets were opened.
        • CASTORCMS_DEFAULT: still some SLS low availability from time to time. Could not add info to the GGUS since it was commented and verified already
        • CASTORCMS_T0EXPORT: still SLS low availability observed, but could be just glitches
        • CASTORCMS_T0EXPRESS: SLS drops in 'total space', observed few times, duration ~15 minutes each
    • T1 sites:
      • ASGC: >2k production jobs failed for stage-out problems at ASGC. Felix promptly checked but so far cannot reproduce the error (Savannah:122820). [Jhen-Wei: did not get any update yet, will follow up with colleagues at ASGC.]
      • FNAL-CNAF transfers: low quality, but due to just one file (adler32 checksum mismatch, probably not a site issue; investigating) (Savannah:122835). See the checksum sketch after this report.
    • T2 sites:
      • mistakenly opened a ticket to the Florida T2, which was in scheduled maintenance, but the shifters did not spot the downtime in OIM (ticket closed)
      • T2_PT_LIP_Lisbon fails SRM SAM tests after the downtime is over (Savannah:122829)
      • T2_US_Caltech and T2_US_UCSD have stage-out problems in the MC production (Savannah:122834)
      • T2_UK_London_IC and T2_US_Wisconsin show quite a few transfer errors on import (Savannah:122840)
    • CRC change: Daniele Bonacorsi -> Ian Fisk (from tomorrow on for next 7 days)
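
    A note on the adler32 checksum issue above: such mismatches are usually confirmed by recomputing the checksum of the transferred replica and comparing it with the value recorded in the transfer catalogue. The sketch below (Python) shows a minimal version of such a check; the file path and the catalogued value are hypothetical, and this is not the actual PhEDEx or site verification code.

      import zlib

      def file_adler32(path, chunk_size=1024 * 1024):
          """Compute the adler32 checksum of a file, reading it in chunks."""
          value = 1  # adler32 starting value
          with open(path, "rb") as f:
              while True:
                  chunk = f.read(chunk_size)
                  if not chunk:
                      break
                  value = zlib.adler32(chunk, value)
          # Catalogues usually store the checksum as 8 zero-padded hex digits.
          return "%08x" % (value & 0xffffffff)

      if __name__ == "__main__":
          # Hypothetical local replica and catalogued checksum.
          local_copy = "/tmp/example_file.root"
          catalogued = "1a2b3c4d"
          computed = file_adler32(local_copy)
          if computed == catalogued:
              print("checksum OK:", computed)
          else:
              print("MISMATCH: computed", computed, "catalogued", catalogued,
                    "- candidate for invalidation and retransfer")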

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK: High load on one of the SEs started on Thursday evening, and on Friday another two were affected. It looks like the machines cannot cope with the high number of files opened by xrootd clients (see the file-descriptor sketch after this report). The problem is under investigation but has not been understood yet. [Dimitri: cannot comment now, will follow up offline.]
    • T2 sites
      • Nothing to report
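
    On the FZK/xrootd load issue above: one quick way to check the "too many open files" hypothesis on an SE node is to count the open file descriptors of the xrootd processes. The sketch below (Python, Linux /proc based) is only an illustrative diagnostic, not the actual investigation done at the site; the process name "xrootd" is an assumption.

      import os

      def open_fd_count(pid):
          """Count open file descriptors of a process via /proc (Linux only)."""
          return len(os.listdir("/proc/%s/fd" % pid))

      def xrootd_fd_counts():
          """Return {pid: open fd count} for all processes named 'xrootd'."""
          counts = {}
          for entry in os.listdir("/proc"):
              if not entry.isdigit():
                  continue
              try:
                  with open("/proc/%s/comm" % entry) as f:
                      name = f.read().strip()
                  if name == "xrootd":
                      counts[int(entry)] = open_fd_count(entry)
              except (OSError, IOError):
                  continue  # process disappeared or not readable; skip it
          return counts

      if __name__ == "__main__":
          for pid, nfds in sorted(xrootd_fd_counts().items()):
              print("xrootd pid %d: %d open file descriptors" % (pid, nfds))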

  • LHCb reports -
    • Experiment activities:
      • Quiet weekend. Backlog of jobs at GRIDKA cleared now. Running out of disk space (LHCb-Disk) at SARA. High number of problems with "input data resolution" at PIC.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • AFS problems this morning
      • T1
        • ntr

Sites / Services round table:

  • BNL: nta
  • NLT1: two issues
    • overloaded disk server is under investigation
    • file corruption on CVMFS is under investigation
  • CNAF: nta
  • FNAL: ntr
  • RAL: ntr
  • ASGC: network problem last night, should be ok now
  • KIT: short problem with the power supply this morning (around 20 minutes); some dCache pools were affected
  • NDGF:
    • Oslo is back after security incident
    • downtime this morning was cancelled on ATLAS request, will need to be rescheduled
  • OSG: ntr

  • Storage services:
    • reminder of interventions tomorrow and Wednesday as announced last week
    • ticket mentioned by ATLAS has been understood
  • Database services: ntr
  • Grid services: ntr
  • Dashboard services: ntr

AOB: none

Tuesday:

Attendance: local (AndreaV, Cedric, Ricardo, Jamie, Raja, Massimo, Mike, Nilo, Eva, Lola); remote (Catalin, Xavier, Ronald, Michael, Jeremy, Giovanni, Tiju, Jhen-Wei, Marc, Christian, Rob; Ian).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • Transfer errors from Castor@CERN: still no answer on GGUS:73470. [Massimo: sorry, thought by mistake this was essentially closed. Will follow up with support colleagues.]
    • T1 sites
      • Lyon: incident yesterday evening: "Following the electrical incident of yesterday evening around 22:30, the batch system is still working in very degraded mode. About 15% of the workers are in production. The rest of the infrastructure (interactive services, file transfer, ...) is working. The experts are working on the electrical system. We do not expect the stopped workers to be back before the end of the day." We stopped the production/analysis queues in the meantime.
      • INFN-T1: some files (needed for reprocessing) cannot be brought online after 3 attempts (GGUS:73490).
      • NIKHEF-ELPROD_PRODDISK full because of reprocessing
      • [Michael: it turns out that BNL issues yesterday were caused by jobs that were incorrectly set up, not by site issues. Cedric: thanks, confirm all ok now.]
    • T2 sites
      • ntr

  • CMS reports -
    • LHC Beam hopefully back this afternoon
    • CERN / central services
      • We exceeded the PhEDEx EOS quota. Now fixed.
    • T1 sites:
      • No new issues
    • T2 sites:
      • Data loss of 70 TB at T2_IT_PISA. We will invalidate and retransfer the affected data.
      • 2 Reports of transfer problems
    • CRC is now Ian Fisk

  • ALICE reports -
    • T0 site
      • Root directories of both VOBoxes at CERN filled up last evening. In the afternoon AliEn had been updated to the latest version, which had a bug that provoked this problem. During the night the number of jobs dropped because the CE AliEn service was not working, due to an incomplete definition of the environment variables.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 1 (SARA #73478)
      • T2: 0
    • Issues at the sites and services
      • T0
      • T1
        • IN2P3 power cut problem affected many jobs. Seems to be recovering now.
        • Continuing problems with PIC - problem with their tape system.
        • SARA has added some capacity to LHCb-Disk. Looking forward to rest of disk coming online.
        • 3 Files lost at RAL due to bad tape. Recovery ongoing by LHCb.

Sites / Services round table:

  • FNAL: ntr
  • KIT: ntr
  • NLT1: CVMFS was updated today to fix an issue reported yesterday; but we just discovered that there is an even more recent version of CVMFS with a specific fix for those issues, so we will make another update
  • BNL: nta
  • GridPP: ntr
  • CNAF: checked with Luca who confirmed that the ALARM escalation (reported yesterday) worked ok; the delay in response was due to problems with the phone connection
  • RAL: ntr
  • ASGC: ntr
  • IN2P3: the power cut yesterday appears to have been caused by a UPS failure. One of the two UPSes was restarted this morning, but the site is still running in degraded mode. WNs are being put back into production; all should be OK by this evening. Still following up the UPS issue with an expert from the vendor.
  • NDGF: ntr
  • OSG: ntr

  • Storage services:
    • transparent intervention on CMS CASTOR was completed successfully
    • intervention on CASTORT3 is ongoing, but realised that it cannot be done transparently. Is it ok for ATLAS and CMS to have a short interruption this afternoon at 4pm? It will be an interruption of only a few minutes, but the problem is that existing transfers will be dropped. Jamie: are you not increasing the risk that this happens during data taking if we delay it? Massimo: this is only for analysis, not production. Cedric: OK for ATLAS. Ian: OK for CMS, analysis jobs retry if transfers fail.
    • reminder: intervention on CASTOR public tomorrow
  • Grid services:
    • CERN VOMS reminder: VOMS upgrade from gLite 3.2 to EMI 1. Expected to be transparent. See http://cern.ch/itssb for any updates.
      • voms.cern.ch will be upgraded 08:00 -> 10:00 UTC on Wednesday 17th August.
      • lcg-voms.cern.ch will be upgraded 08:00 -> 10:00 UTC on Thursday 18th August.
  • Database services: ntr
  • Dashboard services: ntr

AOB: none

Wednesday

Attendance: local (AndreaV, Cedric, Jamie, Mike, Massimo, Luca, Ricardo, Raja, Nilo, Eva, Lola); remote (Michael, Gareth, Giovanni, Catalin, Rob, Jhen-Wei, Ron, Christian, Pepe; Ian, Daniele).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • Downtime on castorpublic: https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=59064&grid_id=0 . srm-atlas.cern.ch is marked as affected, which is bad, since automatic blacklisting is applied. Can we remove the dependency? [Massimo: the dependency comes from automatic testing; it has been like this for a couple of years already. Plan to take this out completely at the beginning of September, after the many interventions at the end of August.]
      • [Eva: tonight received a call at 4am from the shifter due to the maximum number of sessions being exceeded on ATLAS_COOLONL_PIXEL. Normally the DB team should not be called for these application-specific issues. Cedric: will follow up offline.]
    • T1 sites
      • Lyon back.
      • KIT : GridKa link to CERN running on backup : "currently, the direct link from GridKa to CERN is down. Connections are routed via IN2P3. Most likely, the backup will be used throughout the night."
      • NIKHEF-ELPROD_PRODDISK full. ~20 TB reassigned from ATLASDATADISK. Situation now under control.
    • T2 sites

  • CMS reports -
    • Short fill taken last night was uneventful from a computing perspective
    • CERN / central services
      • Castor under intervention this morning. Transfers failing as expected
    • T1 sites:
      • FNAL lost power to all of the worker nodes overnight due to a UPS failure. Need to look at how this was communicated to operations. [Ian: learnt about this by chance through a FNAL mailing list. Rob: OSG also did not receive any message from FNAL. Catalin: yes, there is probably something to be improved in the communication, will follow this up.]
      • [Ian: also observed problems with VOMS yesterday afternoon in some FNAL jobs, any explanation? Ricardo: there was an intervention this morning, but Steve confirmed that nothing was done yesterday to prepare for this. Ian: thanks, will follow up within CMS.]
    • T2 sites:
      • 2 Reports of down PhEDEx agents at Tier-2s

  • ALICE reports -
    • T0 site
      • New AliEn version running at CERN since yesterday afternoon. The patches to solve the issues reported yesterday have been applied and the services are working smoothly.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • Ongoing processing of data
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 1 (Aborted pilots at INFN CAGLIARI #73582)
    • Issues at the sites and services:
      • T0
      • T1
        • Continuing problems at IN2P3 (more power cuts?).
        • Continuing problems with PIC - problem with their tape system. Any update? [Pepe: LTO5 tape drives degraded: 3 out of 4 drives are out of production due to persistent read/write errors. Only one LTO5 tape drive is available at the moment, in read mode. VOs may see the system performing poorly when reading data from LTO5 tapes. In contact with the vendor, but they have no spare drives for the moment.]
        • Possible problems accessing data at NIKHEF and possibly SARA this morning. Wait and see. [Ron: two problems with storage this morning: one dcache head node issue, fixed; another issue with hanging processes on 3 pool nodes, fixed but being investigated. All should be back ok now.]
        • [Raja: was there any problem with GGUS in the last ten minutes? Did not manage to submit tickets. Andrea: GGUS expert (MariaD) on holiday till tomorrow. Please try again and send a message if it still fails, will contact GGUS support in that case. - After the meeting, Raja confirmed that the problem persisted and Andrea contacted ggus-info.]

Sites / Services round table:

  • BNL: ntr
  • RAL: ntr
  • CNAF: a CE issue was discovered via the Nagios tests and is being fixed right now.
  • FNAL:
    • WNs are down now, power should be up by noon.
    • also following up the problem with authentication
  • OSG: there was an outage yesterday at IU affecting BDII. The network is now back up and stable. More details can be found at http://osggoc.blogspot.com/2011/08/network-outage-at-indiana-university.html.
  • ASGC: ntr
  • NLT1: nta
  • NDGF: ntr
  • PIC: nta

  • Database services: nta
  • Grid services: new myproxy-test.cern.ch can now be tested by the experiments
  • Dashboard services: ntr
  • Storage services:
    • in addition to Castor public intervention this morning, EOS intervention is ongoing now
    • summary of proposed interventions during the machine development and technical stop at the end of August:
      • Castor interventions (1h) for CMS (25th), LHCb (30th), ALICE (31st) to upgrade to 2.1.11-2 and to enable the Transfer Manager (already in production for ATLAS, CERNT3 and PUBLIC)
      • Castor intervention (2.5h) for Castor ATLAS (proposed for 29th) - the main part is DB maintenance
      • we are considering a longer (of the order of 4h) intervention on the nameserver, proposed for the 31st, including a hardware intervention and affecting all instances
      • on the 31st we would like to intervene on the CMS instance (DB only) in order to provide better performance and better isolation (new hardware)

AOB: none

Thursday

Attendance: local (Mike, Cedric, Jamie, Ricardo, Eva, Nilo, Miguel, Ignacio, Raja, MariaDZ); remote (Ian, John, Michael, Paco, Jhen-Wei, Marc, Giovanni, Rob, Pavel, Christian, Catalina).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • Problem with Santa Claus (T0 export): /var/log was full, but the notification was sent to the wrong mailing list. Some datasets were declared processed although they were not. Under investigation. (A disk-space check sketch follows this report.)
      • Central Catalogs issue: the two writer nodes (voatlas72 and voatlas181) started to show many segmentation faults in the Apache logs at about the same time. The reason is unknown, since no changes were made in the last weeks. The service was moved to two new nodes, atlddm16 and voatlas61, and put behind load balancing.
    • T1 sites
      • KIT: the network problem reported yesterday seems to have fixed itself.
    • T2 sites
      • DESY-HH : routing to/from IN2P3 fixed
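
    On the "/var/log was full" issue above: the standard protection is a small monitoring check that raises an alert well before a partition fills up. The sketch below (Python) illustrates the idea only; the threshold, mailing-list address and SMTP host are hypothetical, and this is not the actual Santa Claus or CERN monitoring code.

      import shutil
      import smtplib
      from email.message import EmailMessage

      # Hypothetical settings; the real alert list and threshold are service-specific.
      PATH = "/var/log"
      THRESHOLD = 0.90                   # alert above 90% usage
      ALERT_TO = "ops-list@example.org"  # hypothetical notification list

      def used_fraction(path):
          """Fraction of the filesystem containing `path` that is in use."""
          usage = shutil.disk_usage(path)
          return usage.used / usage.total

      def send_alert(fraction):
          """Send a simple alert mail via the local SMTP server."""
          msg = EmailMessage()
          msg["Subject"] = "%s is %.0f%% full" % (PATH, 100 * fraction)
          msg["From"] = "monitor@example.org"
          msg["To"] = ALERT_TO
          msg.set_content("Disk usage on %s has reached %.0f%%; please clean up."
                          % (PATH, 100 * fraction))
          with smtplib.SMTP("localhost") as smtp:
              smtp.send_message(msg)

      if __name__ == "__main__":
          fraction = used_fraction(PATH)
          if fraction > THRESHOLD:
              send_alert(fraction)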

  • CMS reports -
    • CERN / central services
      • Power cut: some machines at the pit are still coming back; most services in the CC seem to have survived OK
      • First runs from Tuesday will be working their way through prompt reco
      • When the machine is back we will write a dedicated stream of high pile-up events. This is a sparsely populated dataset of the most complicated events. There is some chance that the reconstruction of these events will cause Tier-0 node problems, and we will adjust accordingly
    • T1 sites:
      • FNAL is back and we're making heavy use of it. Whatever was wrong with the authentication module was fixed with a local reboot.
    • T2 sites:
      • NTR

  • LHCb reports -
    • Experiment activities:
      • Ongoing processing of data
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 2 (#73589 RALPP, #73590 Oxford)
    • Issues at the sites and services:
      • T0
      • T1
        • Continuing problems at IN2P3. [Marc: is there a ticket for this problem? Raja: no, we assumed it was a continuation of the power cut issues, but if you think the site should be OK we will open a ticket. Marc: there have been no problems since the power cut. Raja: will open a ticket after the meeting.]
        • Continuing problems with PIC - degraded tape system. Any update?
        • Some user jobs failing data access at GridKa - wait and see.
        • Accessing data at NIKHEF and SARA is fine now.
        • GGUS problem sorted out last evening.

Sites / Services round table:

  • RAL - ntr
  • BNL - ntr
  • NL-T1 - ntr
  • ASGC - ntr
  • IN2P3 - power cut issue over now - everything should be available
  • CNAF - the problem with the CE reported yesterday is solved; currently there are problems with CMS data transfers; the problem just started, so no details yet
  • KIT - net problems but now everything fine
  • NDGF - ntr
  • FNAL - solved the problem with authorization by rebooting the server. Power cut: power was restored around 1:30 local time and within 1h all WNs were back in production
  • OSG - ntr

  • CERN VOMS: the two VOMS interventions were completed without incident; we are now running the EMI 1 release. Unfortunately the bug (GGUS:73577) that motivated this upgrade in the first place is not fixed. Each of the VOMS servers continues to crash from time to time. The service has been increased from 4 to 6 hosts, so the crashes will happen less often and there is less chance of hitting the bad node before cron restarts it.

  • CERN DB - two incidents affecting the online DBs: one node ran out of space for audit trace files, affecting that node; and after the power cut at P5 some storage was not reachable, so the DB could not be started up. The storage is now visible, but corruption has been found at the ASM level and work is ongoing to recover the DB.

AOB:

  • Effect of the CERN power cut: the CC was protected by UPS

  • FacOPS from CMS is about to open a ticket about the VOMRS server being unreachable. [Ricardo: will check.]

Friday

Attendance: local (AndreaV, Miguel, Jamie, Mike, Cedric, Raja, Eva, Ricardo, Lola); remote (Michael, Alexander, Jhen-Wei, John, Jeremy, Rob, Marc, Xavier, Catalin, Giovanni; Ian).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • All ATLAS production was badly affected by the migration of the CERN VOMS servers from gLite 3.2 to EMI 1. The problem was traced back to a bug in GridSite (GGUS:71190). After installation of lcg-vomscerts-6.4.0-1.slc4 on our Panda/Bamboo server, the problem vanished.
    • T1 sites
      • BNL: problem contacting the LFC from CERN, reported by Hiro. [Michael: we do not see any operational issue with LFC or DDM, production tasks are running smoothly. The only issues observed are with some DQ2 commands; suspecting problems in the WAN, under investigation.]
      • [Mike: the results of one SRM test at RAL are unavailable, investigating.]
    • T2 sites

  • CMS reports -
    • CERN / central services
      • Runs earlier in the week going through Prompt Reco
    • T1 sites:
      • Sites were asked to run a consistency check of the hosted data
    • T2 sites:
      • NTR

  • ALICE reports -
    • T0 site
      • VOBox Nagios tests had been failing for a couple of days on voalice12 and voalice14. The cause was some settings in .bashrc that were no longer valid; in particular, X509_USER_PROXY was set to a non-existing proxy. (A proxy-check sketch follows this report.)
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report
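
    On the VOBox Nagios failures above: the failing check essentially verifies that X509_USER_PROXY points to an existing, unexpired proxy file. The sketch below (Python) reproduces that logic in a simplified form; it is not the actual Nagios probe used for the VOBoxes, and it only checks file presence and the certificate end date, not the VOMS attributes.

      import os
      import subprocess
      import sys

      def check_proxy():
          """Return an error string if X509_USER_PROXY looks unusable, else None."""
          proxy = os.environ.get("X509_USER_PROXY")
          if not proxy:
              return "X509_USER_PROXY is not set"
          if not os.path.isfile(proxy):
              # This was the failure mode seen on the VOBoxes: the variable
              # pointed to a file that no longer existed.
              return "X509_USER_PROXY points to a non-existing file: %s" % proxy
          # 'openssl x509 -checkend 0' exits non-zero if the certificate has expired.
          rc = subprocess.call(["openssl", "x509", "-in", proxy,
                                "-noout", "-checkend", "0"],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL)
          if rc != 0:
              return "proxy in %s has expired" % proxy
          return None

      if __name__ == "__main__":
          problem = check_proxy()
          if problem:
              print("CRITICAL:", problem)
              sys.exit(2)  # Nagios convention: exit code 2 means CRITICAL
          print("OK: proxy looks usable")
          sys.exit(0)      # exit code 0 means OK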

  • LHCb reports -
    • Experiment activities:
      • Ongoing processing of data. Waiting for new data.
    • New GGUS (or RT) tickets:
    • Issues at the sites and services:
      • T0
      • T1
        • Continuing problems at IN2P3 with access to data. Overloaded pool. GGUS ticket opened as requested.
        • PIC :
          • Continuing problems with PIC - degraded tape system. Many ongoing problems with access to data there.
          • Problems transferring some files to CERN (GGUS ticket 73576). This is blocking further transfers from PIC to CERN.
        • GridKa :
          • Abnormally terminated connections from GridKa IP 192.108.46.248 overloaded some services in DIRAC this morning. These services have been restored, but there is the danger that other services can be affected in future. GGUS ticket opened.
          • Also problems with pilots aborted at cream-5-kit.gridka.de - GGUS ticket opened.

Sites / Services round table:

  • PIC [Pepe]: we are still suffering from the LTO5 drive problem at PIC. The vendor came yesterday, but the problem is not yet solved; they will provide an additional drive this afternoon, or on Monday at the latest. For LHCb, there are some files that can be retrieved from tape but cannot be transferred. There is a GGUS ticket, and the PIC team is working on this.
  • BNL: nta
  • NLT1: ntr
  • ASGC: ntr
  • RAL: ntr
  • GridPP: ntr
  • OSG: ntr
  • IN2P3: ntr
  • KIT:
    • the link from GridKa to CERN was unstable this morning; this may be due to a transmitter in Frankfurt, which will be changed; the network will be rerouted via Lyon if this does not fix the issue
    • some files could not be retrieved from tape for ATLAS; this should now be fixed and the files will be staged soon
  • FNAL: job submission recovering after the power cut
  • CNAF:
    • problem reported yesterday with CMS transfers disappeared without any intervention, probably due to transient issues
    • observing a failure of a CE test, under investigation
    • other failures in CE tests may be due to a StoRM problem, under investigation

  • Grid services: ntr
  • Dashboard services: nta
  • Database services: will start deploying Oracle security updates next week, starting with LCGR, PDBR and ATLASARCHIVE

AOB: none

-- JamieShiers - 18-Jul-2011
