Week of 110912

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation, Sharepoint site - Cooldown Status - News
  • LHC Machine Information


Monday:

Attendance: local(Massimo, Claudio, Peter, Guido, Dirk, Jamie, Maarten, Ivan, Eva, Maria D, Maria G, Andrea, Alexandre);remote(Michael BNL, Lisa FNAL, Rolf IN2P3, Onno NLT1, Vladimir LHCb, Jhen Wei ASGC, Tiju RAL, Gonzalo PIC, Daniele CNAF, Rob OSG).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • EOS: moving to a new machine with more memory (and testing failover from master to slave) - intervention ongoing
      • EOS support: successfully tested the escalation of GGUS alarm tickets about EOS
    • T1 sites
      • TAIWAN-LCG2: CRL expired in the SRM (the refresh cron job was not running) GGUS:74039 (see the sketch after this report)
      • INFN-CNAF: problem with the SRM, apparently the StoRM daemon was stuck GGUS:74237
    • T2 sites
      • NTR
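
The CRL expiry above is a recurring failure mode: the cron job that refreshes CRLs (typically fetch-crl) stops running and the CRLs silently expire. The following is a minimal, hypothetical monitoring sketch, not the tool used at the site; the CA directory path, the 12-hour warning window and the *.r0 file pattern are assumptions based on common grid installations.

    #!/usr/bin/env python
    # Hypothetical CRL-expiry check; the paths, file pattern and thresholds are assumptions.
    import glob
    import subprocess
    from datetime import datetime, timedelta

    CA_DIR = "/etc/grid-security/certificates"   # common CA/CRL location, adjust as needed
    WARN_WINDOW = timedelta(hours=12)            # warn if a CRL expires within 12 hours

    def crl_next_update(path):
        """Return the nextUpdate timestamp of a CRL file, using the openssl CLI."""
        out = subprocess.check_output(
            ["openssl", "crl", "-in", path, "-noout", "-nextupdate"])
        # Output looks like: nextUpdate=Sep 20 10:00:00 2011 GMT
        value = out.decode().strip().split("=", 1)[1]
        return datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")

    now = datetime.utcnow()
    for crl in sorted(glob.glob(CA_DIR + "/*.r0")):
        try:
            next_update = crl_next_update(crl)
        except (subprocess.CalledProcessError, ValueError):
            print("ERROR: cannot read nextUpdate from %s" % crl)
            continue
        if next_update < now:
            print("EXPIRED: %s (nextUpdate %s)" % (crl, next_update))
        elif next_update - now < WARN_WINDOW:
            print("WARNING: %s expires soon (nextUpdate %s)" % (crl, next_update))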

  • CMS reports -
    • LHC / CMS detector
      • Physics on Friday and Saturday; data taking should restart in the afternoon
    • CERN / central services
      • A few CASTOR glitches overnight; nothing serious
    • T0:
      • Processing time has more than doubled and memory consumption is high because of the high pile-up, which is causing problems. The backlog from last week's CASTOR problems has not yet been absorbed. Job splitting and memory requirements for T0 jobs were changed this morning.
    • T1 sites:
      • NTR
    • T2 sites:
      • NTR

  • ALICE reports -
    • T0 site
      • Central AliEn services were down for many hours Sunday afternoon and evening due to a power cut and associated network failures
    • T1 sites
      • RAL: GGUS:74098. ALICE SAM tests still fail; under investigation
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Only a few short runs during the weekend.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 3
    • Issues at the sites and services
      • CERN:
        • (GGUS:74175) Several files that are supposed to be on Castor are not accessible
      • CIC: we received no downtime notifications last week.

Sites / Services round table:

  • ASGC: CRL problem fixed
  • BNL: ntr
  • FNAL: ntr
  • PIC: intervention tomorrow (9:00 local time - all day): dCache intervention + network
  • RAL: intervention tomorrow (8:00 local time): network
  • KIT: 20 files lost (LHCb) due to dCache problems. Tomorrow (13:00 local time - 3 hours) intervention on the tape infrastructure
  • NLT1: intervention tomorrow (9:00 local time - all day): many changes (notably retiring the LCG CE, plus network, dCache and WN work). Another disk pool trip was observed today at lunch time; the default time-outs are being restored to verify whether this fixes the problem.
  • IN2P3: ntr
  • CNAF: intervention tonight (19:00 local time - less than 1 hour): network (transparent)
  • NDGF: network problems fixed. In the next days there will be a series of short interventions (dCache)
  • OSG: ntr

  • Grid services: ntr
  • Dashboard services: ntr
  • Storage services: ntr
  • DB: LCG DB incident (overload). The VOMS DB cannot cope with a much-increased load and number of connections. The problem is still to be understood.

AOB:

Tuesday:

Attendance: local(Peter, Alexandre, Massimo, Maarten, Ivan, Stephan, Eva, Alessandro, MariaD); remote(Michael (BNL), Xavier (KIT), Lisa (FNAL), Gonzalo (PIC), Tiju (RAL), Rolf (IN2P3), Ronald (NLT1), Mette (NDGF), Rob (OSG), Vladimir (LHCb)).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • EOS migration ongoing today, good progress
    • T1 sites
      • Errors in transfers from T0 to INFN-T1_DATADISK, disk server issue GGUS:74237
      • IN2P3-CC srm/chimera load causing job failures GGUS:74267
      • RAL LFC timeout, 25% job failures. No apparent problem with LFC GGUS:74253
      • In the last hour, RAL DATADISK errors; possibly an FTS problem
    • T2 sites
      • NTR

  • CMS reports -
    • LHC / CMS detector
      • Next fill for physics
    • CERN / central services
      • NTR
    • T0:
      • The increased memory reservation per job is helping. A few nodes were reported with problems. The reduced capacity (due to the memory reservation) is an issue.
    • T1 sites:
      • NTR
    • T2 sites:
      • NTR

  • ALICE reports -
    • T0 site
      • NTR
    • T1 sites
      • RAL: GGUS:74098. ALICE SAM tests still fail; under investigation
      • NIKHEF: the dashboard team reported SAM failures since yesterday 11:00 UTC - the ALICE queue is not enabled
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Low level activity; reconstruction and stripping.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 1
      • T2: 1
    • Issues at the sites and services
      • GRIDKA: Pilots aborted (GGUS:74265) - fixed very quickly.
      • CIC: downtime notifications fixed (167 notifications received today)

Sites / Services round table:

  • BNL: ntr
  • FNAL: ntr
  • PIC: intervention continuing as scheduled
  • RAL: LSF problem just fixed
  • KIT: intervention finished as scheduled
  • NLT1: intervention continuing as scheduled. In addition, the pool node problem is not solved: investigating the PostgreSQL version. The ALICE problem was due to closing the ALICE resources too early (by mistake) in preparation for the intervention; they are expected back when the intervention finishes.
  • IN2P3: ntr
  • NDGF:ntr
  • OSG: further investigation of the outage at the University of California site affecting CMS analysis in the US: it was a power cut.

  • Grid services: ntr
  • Dashboard services: ntr
  • Storage services: ntr
  • DB: ntr

AOB: (MariaDZ) The CIC developers answered on the LHCb lack-of-notification problem: after a bad restart (thousands of notifications were sent without headers), the problem is now fixed. The initial problem was an overload of the machine due to a bad configuration of the local monitoring system. Info also in GGUS:74252

Wednesday:

Attendance: local(Jamie, Alessandro, Edoardo, Luca, Maria, Ivan, Peter, Maarten, Alex, Nicolo, Stephen, MariaDZ);remote(Michael, Gonzalo, Lisa, Ron, Mette, Pavel, Rolf, Daniele, John, Rob, Vladimir, Shu-Ting).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • EOS migration completed yesterday; the new EOS backend is now online
    • T1 sites
      • BNL: FTS errors with transfers from CERN-PROD_DATADISK to US Tier-2 sites, GGUS:74293 ("Invalid SRM version") - does anyone know what this means? Michael: an issue with this particular FTS channel; Hiro is looking into it and it will be fixed this morning (US time)
      • RAL: reprocessing a little slow, investigating with RAL, possible (pre-)stager issue, GGUS:74312. Both Shaun and Alistair are at CERN
    • T2 sites
      • NTR


  • CMS reports -
    • LHC / CMS detector
      • Data taking
    • CERN / central services
      • NTR
    • T0:
      • CASTOR issues returned yesterday afternoon. A team ticket (GGUS:74283) was escalated to ALARM and resolved in the evening: Massimo restarted the transfer manager and things got moving again; the ticket is now closed. The new CASTOR version deployed last Thursday has more debugging enabled, so hopefully more information will be available next time. Nicolo: some files were spotted today still stuck in PhEDEx for export; a couple of tickets were opened, references will be added.
    • T1 sites:
      • NTR
    • T2 sites:
      • NTR


  • ALICE reports -
    • T0 site
      • NTR
    • T1 sites
      • RAL: GGUS:74098. ALICE SAM test failures; not yet resolved, hope to have more time to look at this today (Maarten)
    • T2 sites
      • Usual operations

  • LHCb reports - Reconstruction and stripping. The trigger was changed before the corresponding packages were installed on the grid; as a result all jobs failed at GRIDKA and CNAF. Fixed by installing the required packages. Could GRIDKA and CNAF please migrate to CernVM-FS, as we had no such problems at the Tier-1 sites that use CernVM-FS (a minimal client sanity check is sketched below). KIT: OK, we are planning this and will let the team know of this wish. CNAF: taken into account, working on a solution
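
As a side note on the CernVM-FS request above, the following is a minimal, hypothetical worker-node sanity check, not an official probe: it verifies that the lhcb.cern.ch repository can be listed under /cvmfs, which is roughly what a site would want to confirm before dropping a shared software area. The repository name and mount point are the standard ones but should be treated as assumptions here.

    #!/usr/bin/env python
    # Hypothetical CVMFS availability check for a single repository.
    import os
    import sys

    REPO = "lhcb.cern.ch"                  # repository the LHCb jobs expect
    MOUNT = os.path.join("/cvmfs", REPO)   # standard CVMFS mount point

    def cvmfs_ok(mount):
        """True if the repository can be listed and is an actual mount point."""
        try:
            entries = os.listdir(mount)    # touching the path triggers autofs, if used
        except OSError:
            return False
        return os.path.ismount(mount) and len(entries) > 0

    if __name__ == "__main__":
        if cvmfs_ok(MOUNT):
            print("OK: %s is mounted and readable" % MOUNT)
            sys.exit(0)
        print("FAIL: %s is not available" % MOUNT)
        sys.exit(2)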

Sites / Services round table:

  • BNL - nta
  • PIC - ntr
  • FNAL - ntr
  • NL-T1 - ntr
  • NDGF - ntr
  • KIT - ntr
  • IN2P3 - ntr
  • CNAF - nta
  • RAL - nta
  • ASGC - will have a scheduled downtime from 11 pm to 4 am tomorrow.
  • OSG - ntr

  • CERN DB - added 2 accounts related to reprocessing to the ATLR DB (?)

AOB:

Thursday:

Attendance: local(Alex, Andrea, Dirk, Eva, Ivan, Jan, John, Maarten, Maria G, Peter);remote(Andreas, Felix, Gonzalo, John, Lisa, Mette, Michael, Rob, Rolf, Ronald, Stephen, Vladimir).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • NTR
    • T1 sites
      • RAL reprocessing affected by ALICE jobs filling /tmp; the staging issue re-appeared but is OK now GGUS:74312
    • T2 sites
      • NTR

  • CMS reports -
    • LHC / CMS detector
      • Record fill yesterday (records for peak luminosity, integrated luminosity delivered and recorded, and pile-up)
    • CERN / central services
      • NTR
    • T0:
      • Validating new release to save CPU time and memory
    • T1 sites:
      • Backfill mostly, some MC in progress
    • T2 sites:
      • NTR

  • ALICE reports -
    • T0 site
      • NTR
    • T1 sites
      • RAL: switching from the shared software area to Torrent; the incident in the ATLAS report was due to a configuration error...
      • RAL: GGUS:74098. ALICE SAM test failures; pending...
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Reconstruction and stripping.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 2
      • T2: 1
    • Issues at the sites and services

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • yesterday evening the SARA SRM got stuck and was quickly restarted
    • the announcement for SARA's downtime last Tue arrived late
      • Maarten: the CIC Portal team only fixed the downtime announcement system on Tue, so SARA's announcement probably fell victim to the earlier troubles; let's keep a close watch on the timelines for upcoming downtimes
    • Peter: can you check if sufficient job slots are available for ATLAS reprocessing?
    • Ronald: the backlog was due to the downtimes at SARA and NIKHEF, now things are ramping up to normal levels
  • OSG
    • various tickets from Armenia wrongly assigned to OSG
      • Maarten: they were assigned by the submitter himself, who then acknowledged they were assigned to the wrong support unit
    • Rob will be at CERN this Friday and at the EGI Tech Forum in Lyon next week
  • PIC - ntr
  • RAL - nta

  • CASTOR/EOS
    • Stephen: more news on CMS CASTOR incident of Mon evening?
    • Jan: investigations were not conclusive, but a debugging recipe has been established for the next time
    • Jan: CMS EOS downtime to add new HW next Tue
      • Stephen: Wed?
      • Jan: will check
    • Jan: dip in spare capacity due to ongoing HW retirement campaign
  • dashboards - ntr
  • databases
    • around 14:00 CEST the 2nd instance of the CMS offline DB rebooted and was back after a few minutes; the cause of the reboot is being investigated
  • grid services - ntr

AOB:

Friday:

Attendance: local(Peter, Scott, Jamie, Tiju, Gareth, Rob, Torre, Stefan, Ivan, Alessandro, Ignacio, Stephen, John, Alex);remote(Michael, Xavier, Lisa, Gonzalo, Rolf, Onno, Andreas, Lorenzo, Jhen-Wei).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • NTR
    • T1 sites
      • Reprocessing mostly finished; NL pending due to a network problem. FZK is helping with the NL reprocessing jobs.
    • T2 sites
      • NTR


  • CMS reports -
    • LHC / CMS detector
      • Some network problems overnight; eventually the router was rebooted and it recovered
    • CERN / central services
      • CASTORCMS SRM displayed 0% availability in SLS, but the cause was CASTORPUBLIC, GGUS:74343. [Ignacio: working to remove the dependency of the SLS probes on CASTORPUBLIC, which comes from the certificate handling. The CASTORPUBLIC problem was due to overload; fixes already deployed on the other instances are not yet on PUBLIC but will be applied today. Without the fixes it is hard to see what is going on. John has the piquet for the weekend and has agreed.]
    • T0:
      • PromptReco was down from 2 pm to 3:30 am due to a calibration problem that was found. There is now a backlog of 20k jobs.
    • T1 sites:
      • Backfill mostly, some MC in progress
    • T2 sites:
      • NTR


  • ALICE reports -
    • T0 site
      • NTR
    • T1 sites
      • RAL: GGUS:74098. ALICE SAM tests OK today, but cause of improvement not understood
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Reconstruction and stripping; LHCb magnet polarity change today
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 2
      • T2: 1
    • Issues at the sites and services
      • SARA: problem with the network overnight; storage and switches were not communicating properly.
      • IN2P3: shared-area problem because of an overloaded volume; the LHCb volume was moved and the problem is fixed (GGUS:74334)


Sites / Services round table:

  • BNL - ntr
  • KIT - ntr
  • FNAL - ntr
  • PIC - last night we had a problem affecting ATLAS jobs and transfers: 5 storage pools failed within a few hours in the same way, with a kernel panic. Analysis this morning points to an XFS bug that shows up on a nearly full file system under high activity: a cron job tries to defragment the file system, hits the bug and the node panics. The problem is over; we are investigating how to solve it definitively (see the sketch after this list).
  • IN2P3 - ntr
  • NL-T1 - network issue: at NIKHEF a network problem was caused by a configuration error introduced during this week's maintenance, which made part of the traffic from the WNs fail. The error was fixed an hour ago; full bandwidth should be available again. Another issue at NIKHEF this morning: a grid router had a problem with a line card; it was fixed after a reload. At SARA a few new WNs were activated: 80 nodes with 960 cores, almost all available since about an hour ago. This will hopefully help with the backlog.
  • NDGF - this Monday 09:00 - 12:00 there will be an upgrade on pools in Norway; some ATLAS data will be unavailable for some time, plus a short glitch of about 30 seconds when the head node is patched.
  • CNAF - ntr
  • ASGC - ntr
  • RAL - AT RISK Tuesday morning: a brief network outage of a few minutes while a fibre is changed, in an attempt to chase a low-level packet loss problem.
  • OSG - ntr
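
The PIC report above (a defragmentation cron job hitting an XFS bug on a nearly full file system) suggests one simple mitigation: guard the defragmentation run with a usage check. The sketch below is hypothetical and is not PIC's actual fix; the mount point and the 90% threshold are assumptions.

    #!/usr/bin/env python
    # Hypothetical wrapper around xfs_fsr: skip online defragmentation when the
    # target file system is nearly full. Mount point and threshold are assumptions.
    import os
    import subprocess
    import sys

    TARGET = "/storage/pool1"   # hypothetical XFS pool mount point
    MAX_USED_FRACTION = 0.90    # skip defragmentation above 90% usage

    def used_fraction(path):
        """Fraction of file-system blocks in use at the given mount point."""
        st = os.statvfs(path)
        return 1.0 - float(st.f_bfree) / st.f_blocks if st.f_blocks else 1.0

    used = used_fraction(TARGET)
    if used > MAX_USED_FRACTION:
        print("Skipping xfs_fsr on %s: %.0f%% used" % (TARGET, 100 * used))
        sys.exit(0)

    # xfs_fsr is the standard XFS online defragmenter; -t limits the run time in seconds.
    sys.exit(subprocess.call(["xfs_fsr", "-t", "7200", TARGET]))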

  • CERN - ntr

  • CERN DB - question from CMS about why CMSR2 rebooted at 14:00; this was mentioned yesterday, but no DB representative is present today to answer.

AOB:

-- JamieShiers - 25-Aug-2011
