Week of 120423

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Luc, Jamie, Stefan, Doug, Oliver, Andrea, Alessandro, Maarten, MariaDZ, Luca, Ignacio, Pablo);remote(Gonzalo, Mette, Lisa, Jhen-Wei, Jeremy, Tiju, Pavel, Onno, Paolo, Rolf, Rob).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • T0/Central Services
      • LFC stable
      • T0 -> US & FR transfer backlog. Fixed by recreating the MySQL DB on the SS t0export machine.
      • FR: DQ2 errors when reporting complete datasets; the PanDA service had to be restarted on voatlas255.
    • T1s/CalibrationT2s
      • SARA-MATRIX: transfers failing with SRM authentication errors. GGUS:81488.
      • NDGF-T1: 50k deletion errors. GGUS:81490. [ Mette - GGUS ticket should be closed soon; it was due to wrong LFC permissions, now solved. ]
      • FR: the DDM SS for the FR cloud has problems contacting the IN2P3 FTS server (cclcgftsprod.in2p3.fr). Transfers reduced to a minimum to absorb the backlog. GGUS:81480.
  • ATLAS internal

  • CMS reports -
  • LHC machine / CMS detector
    • Technical stop, planned to restart with beam Friday, 6 PM CERN time
  • CERN / central services and T0
    • new:
    • old:
      • GGUS:81199 => GEANT Networking Problems to Baltics and Norway
        • last update April 17; waiting for final confirmation that the problem is fixed
      • LSF problems from last week were not seen again over the weekend; the ATLAS alarm ticket is still open: GGUS:81445
  • Tier-1/2:
    • new:
      • GGUS:81475 => imbalance of jobs at T1_DE_KIT: on Saturday many dcms jobs were running and only a few production jobs; the situation improved over the weekend
        • SOLVED: dcms is not included in the CMS fairshare resources, and the overall CMS fairshare is currently lower than that of all other LHC experiments.
    • old:
      • GGUS:81453 => CNAF WMS wms002.cnaf.infn.it does not match anything; it seems to have mostly recovered, but problems are still seen from time to time. Feedback would be appreciated.


  • LHCb reports -
  • Prompt data reconstruction, data stripping and user analysis ongoing at Tier-1s and the T0.
  • DataReprocessing of 2012 data with new alignment to be launched today
  • MC simulation at Tier-2s

  • T0
    • lost files due to a broken Castor disk server (GGUS:80973), final info on lost files received [ ticket can be closed ]
  • T1
    • GRIDKA:
      • Job submission to the GridKa WMS is failing (GGUS:81405). Ongoing (2 instances fixed, 2 still failing). [ Pavel - the logs now show some different problems which need to be investigated ]
    • CNAF
      • problem with mapping of user credentials after upgrade of WMS server (GGUS:81291) [ needs to be followed up ]

Sites / Services round table:

  • PIC - ntr
  • NDGF - ntr
  • FNAL - ntr
  • ASGC - ntr
  • RAL - ntr
  • NL-T1 - ntr
  • KIT - ntr
  • IN2P3 - pre-announcement of an outage on 22 May: batch will be down all day, dCache in the morning and tapes all day; details will come later. Still working with LHCb on the short-job problem
  • CNAF - ntr
  • BNL - VOMS outage 04:00 - 06:00 Eastern time. A probe detected the issue and attempted to restart the service, but the restart failed due to a full filesystem; looking into why the filesystem was full
  • GridPP - ntr
  • OSG - trying to follow up on GGUS:78772: cannot access gstat. Any known issues? [ Maarten - the server at CERN forwards all traffic to ASGC; it was working recently. ]

  • CERN Grid: batch experts are working with Platform (the LSF vendor) on the ticket opened last Friday

AOB: (MariaDZ) GGUS monthly release this Wednesday 2012/04/25. The usual test ALARMs will take place, as well as tests with normal tickets to check the correct transfer of attachments and other data to the external ticketing systems. Details on what will be tested from now on are in Savannah:126505. The up-to-date file ggus-tickets.xls is attached to page WLCGOperationsMeetings. There were 4 real ALARMs last week. Detailed drills for the whole past month, for tomorrow's MB, are attached at the end of this page.

Tuesday

Attendance: local(Jamie, Stefan, Maarten, Jarka, Luca, Luca, Doug, Guido, Oli, Ignacio);remote(John, Gonzalo, Mette, Xavier, Jhen-Wei, Lisa, Michael, Ronald, Rolf, Rob).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • T0/Central Services
      • Tier-0 problems recalling from tape (t0atlas pool); a backlog plus a tape-library misconfiguration created the problem. Fixed (GGUS:81512)
      • LSF was down over the weekend and slow for a couple of days; the situation is back to normal. Reason: high load on the system; the problem is being investigated together with the LSF vendor (GGUS:81445)
    • T1s/CalibrationT2s
      • BNL VOMS server vo.racf.bnl.gov hanging (GGUS:81505, fixed very promptly). A service monitoring tool detected the problem and attempted a restart, which failed because of a full log partition. It remains to be investigated why the partition status did not generate any internal alerts. [ Michael - this was already reported yesterday ]
      • FZK-LCG2 CVMFS not correctly set on some WNs (GGUS:81491). Problem fixed
      • IN2P3-LPC network issue after migration to LHCONE (GGUS:81481). Still under investigation


  • CMS reports -
  • LHC machine / CMS detector
    • Technical stop, planned to restart with beam Friday, 6 PM CERN time
  • CERN / central services and T0
    • new:
    • old:
  • Tier-1/2:
    • new:
      • The 10 Gb uplink from Taipei to Chicago was down and is up again; the root cause is unknown and ASGC's ISP is investigating
        • All traffic was switched to the 2.5 Gb backup route, which limits the bandwidth; some transfer errors (in particular timeouts) are expected during the downtime of the 10 Gb uplink
    • old:



  • LHCb reports -
  • DataReprocessing of 2012 data at T1s with new alignment was launched today
  • MC simulation at Tier-2s

  • T0
    • NTR
  • T1
    • GRIDKA:
      • Job submission to the GridKa WMS is failing (GGUS:81405). Ongoing (2 instances fixed, 2 still failing); will be fixed during the next downtime
    • RAL
      • one disk server stopped working yesterday at 9 pm; fixed this morning


Sites / Services round table:

  • RAL - LHCb disk server problem: two weeks ago we reported a problem with disk servers losing their network connection. It is now spreading to LHCb disk servers (previously ATLAS); the solution is to update the kernel.
  • PIC - ntr
  • NDGF - crashed pool in Stockholm. Still investigating; it holds some unique ATLAS data and some replicated data
  • KIT - a downtime for the tape back-end was planned for tomorrow; it had to be cancelled and a new date is being sought
  • ASGC - this morning the 10 Gb link was temporarily unreachable. Switched to the backup link; some errors, as CMS mentioned. The 10 Gb link came back up at 10:00 UTC and has been running for 3 hours without any problem, so the downtime was ended just now
  • FNAL - ntr
  • BNL - ntr
  • NL-T1 - ntr
  • IN2P3 - ntr
  • OSG - ntr

AOB: (MariaDZ) As requested by ATLAS in the past, here is a reminder of the GGUS monthly release tomorrow, Wednesday 2012/04/25. Details in yesterday's AOB.

Wednesday

Attendance: local(Massimo, Jamie, Jarka, Alexei, Stefan, Luca, Ignacio, Maarten, MariaDZ);remote(Mette, Lisa, Oliver, Michael, John, Jhen-Wei, Rolf, Giovanni, Pavel, Ron, Rob).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • ATLAS : muon calibration runs during the night, tests during the day
    • T0/Central Services
      • ATLAS fast reprocessing is in progress, Tier0 uses ~6500 slots, datasets are replicated to Tier-1s/2s
      • HLT reprocessing has started; 150 slots are used (under investigation)
  • ASGC is doing an emergency CASTOR DB migration; a 24-hour downtime has been announced and the T1 is excluded from MC production and analysis


  • CMS reports -
  • LHC machine / CMS detector
    • Technical stop, planned to restart with beam Friday, 6 PM CERN time
  • CERN / central services and T0
    • new:
    • old:
  • Tier-1/2:
    • new:
    • old:



  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • T1
      • RAL
        • another 18 disk servers to be rebooted today to avoid the problem with the network card
      • GRIDKA
        • 12 files lost on GridKa disk servers; a first analysis shows that the files were deleted; how this could happen is currently under investigation (GGUS:81322)

Sites / Services round table:

  • NDGF - ntr
  • FNAL - ntr
  • BNL - ntr
  • ASGC - Over the last two weeks we restored the CASTOR DB after a hardware breakdown. Since then we have been preparing new hardware and looking for an opportunity to migrate the database service. Unfortunately, before we could schedule it, Oracle raised an alarm this morning. We discussed with the vendor and decided to declare an emergency 24-hour downtime to migrate the DB right away. This unscheduled downtime continues until tomorrow morning; during this period the Taiwan CASTOR is unavailable.
  • CNAF - ntr
  • RAL - this morning we rebooted any disk servers we believed vulnerable to the bug; all disk servers have now been rebooted with the new kernel.
  • IN2P3 - this morning's GGUS update triggered alarm tickets which revealed a problem with the interface to our system; under investigation. Until then, keep an eye on GGUS for other "real" alarms
  • KIT - downtime for 3 of our WMS instances; this will last until Friday
  • NL-T1 - ntr
  • OSG - ntr

  • GGUS - 13 test alarms since the GGUS release; the results will be available tomorrow.

AOB:

Thursday

Attendance: local(Alexei, Jarka, Jamie, Stefan, Maarten, Massimo, Ignacio, Eva, Nilo);remote(Gonzalo, Xavier, Paco, Mette, Michael, Oliver, Lisa, Gareth, Jhen-Wei, Rolf, Giovanni, Rob).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • ATLAS : muon calibration runs during the night, tests during the day
    • T0/Central Services
      • ATLAS fast reprocessing is in progress, Tier0 uses ~6500 slots, datasets are replicated to Tier-1s/2s
  • ASGC is in downtime, T1 is excluded from MC production and analysis
  • wget from remote sites to one of the ATLAS PanDA server machines at CERN failed (pandaserver.cern.ch, port 25080).


  • CMS reports -
  • LHC machine / CMS detector
    • Technical stop, planned to restart with beam Friday, 6 PM CERN time
  • CERN / central services and T0
    • new:
    • old:
  • Tier-1/2:
    • new:
    • old:
      • ASGC CASTOR still down: a lot of time was spent today correcting local SAN configuration issues with the new storage; the Oracle DB migration continues with the Oracle ASM rebalance mechanism and will hopefully be done by tonight



  • LHCb reports -
  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Test: one T2 attached for the new production workflow (CPU intensive, low I/O)

  • T1
    • GRIDKA
      • 12 files lost at GridKa disk servers (responsibility probably on the LHCb side - the files were removed with the DN of the LHCb data manager); the removal happened mid-March, no more log files are available and no further investigation is possible (GGUS:81322) [ KIT - please rephrase so it is clear we are not responsible! ]
    • IN2P3
      • So far 10 corrupted files have been found; the file size is correct but the checksum is not (GGUS:80338). A minimal checksum-verification sketch follows below.
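
  For reference, here is a minimal sketch of how such a size-vs-checksum mismatch could be detected, assuming Adler-32 checksums (commonly used for grid file integrity); the file path and catalogue value in the example are purely illustrative and not taken from GGUS:80338.

    # Hedged sketch: compare the on-disk Adler-32 checksum of a file with a catalogue value.
    # The "catalogue" dictionary is illustrative; a real check would query the file catalogue.
    import zlib

    def adler32_of(path, chunk_size=1 << 20):
        """Compute the Adler-32 checksum of a file, reading it in 1 MB chunks."""
        value = 1  # Adler-32 starts from 1
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return format(value & 0xFFFFFFFF, "08x")

    # Hypothetical usage: flag files whose checksum does not match the catalogue.
    catalogue = {"/tmp/example.dst": "01a2b3c4"}  # illustrative entry only
    for path, expected in catalogue.items():
        actual = adler32_of(path)
        status = "OK" if actual == expected else f"CORRUPTED (catalogue {expected}, on disk {actual})"
        print(path, status)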


Sites / Services round table:

  • PIC - ntr
  • KIT - a tape is stuck in the tape library, hence 16 (ATLAS) tapes are not available; a technician will be on site tomorrow to fix it
  • NL-T1 - ntr
  • NDGF - ntr
  • BNL - due to a communication problem the certificate on the LFC expired; resolved in 30 minutes and the service was restored
  • FNAL - ntr
  • RAL - ntr
  • ASGC - still working on DB / CASTOR issue
  • IN2P3 - on yesterday's test alarm issue: the ticket arrived late in the local ticketing system due to a temporary overload of the mail servers; consider this solved
  • CNAF - ntr
  • OSG - ntr

  • CERN DB - continuing with security patches: INTR for ATLAS, next week LHCb and LCG INTR. This morning some tests were run on the CMS online cluster; some network configuration issues were found that will be fixed soon

AOB:

  • GGUS alarm ticket summary (modulo IN2P3 "issue" above):

All other ALARMs went well, including those to North American sites that were raised after our daily meeting yesterday. A total of 16 tickets are involved. Details: here

Friday

Attendance: local(Jamie, Stefan, Alessandro, David, Maarten, Ignacio);remote(Onno, Michael, Ian, Xavier, Jhen-Wei, Rolf, Lisa, Gareth, Giovanni, Roger, Rob).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • ATLAS/LHC: data taking will start tonight. Stable beams are expected as of Sunday
    • T0/Central Services
      • ATLAS fast reprocessing almost done (expected end today), datasets are replicated to Tier-1s/2s
  • ASGC ended its downtime this morning; the T1 is excluded from MC production, analysis and RAW data export from CERN. If no problems are seen by 15:00 CET today, we will re-include them
  • wget from remote sites to one of the ATLAS PanDA server machines at CERN failed (pandaserver.cern.ch, port 25080). GGUS:81645. A minimal reachability sketch follows below.
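
  For reference, here is a minimal sketch of the kind of reachability check behind GGUS:81645, assuming a raw TCP connect followed by a plain HTTP GET on "/" with an illustrative 10-second timeout; the real service may expect HTTPS or a specific path.

    # Hedged sketch: check whether pandaserver.cern.ch:25080 is reachable from this host,
    # first at the TCP level and then with a simple HTTP request.
    import http.client
    import socket

    HOST, PORT = "pandaserver.cern.ch", 25080

    def check_reachable(host, port, timeout=10):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # TCP connect succeeded
        except OSError as exc:
            return f"TCP connect to {host}:{port} failed: {exc}"
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request("GET", "/")  # illustrative request path
            resp = conn.getresponse()
            return f"HTTP GET / -> {resp.status} {resp.reason}"
        except (OSError, http.client.HTTPException) as exc:
            return f"TCP connect OK, but HTTP request failed: {exc}"
        finally:
            conn.close()

    if __name__ == "__main__":
        print(check_reachable(HOST, PORT))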


  • CMS reports -
  • LHC machine / CMS detector
    • Technical stop, planned to restart with beam Friday, 6 PM CERN time
    • Over the weekend a low pile-up run is expected; datasets for this run are in place.
  • CERN / central services and T0
    • NTR
  • Tier-1/2:
    • ASGC appears to be back successfully taking jobs
    • Transfer errors reported at ASGC
    • Software installations are missing at T1_FR_IN2P3; experts are working on it.

  • Other:
    • Ian Fisk CRC until Tuesday


  • LHCb reports -
  • DataReprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • One T2 was successfully attached to T1 storage for the new production workflow (CPU intensive, low I/O); the plan is to attach one T2 to each T1 storage
  • Will switch the way conditions data are distributed: they will be served via CVMFS.

  • T1
    • IN2P3
      • So far 10 corrupted files have been found; the file size is correct but the checksum is not (GGUS:80338)
      • Pilots aborted during the night (GGUS:81677) [ Rolf - it was not possible for LHCb to add new waiting jobs; the total number of jobs running and queued was above the limit; no running jobs were killed, only waiting jobs were affected. The limit has now been raised ]


Sites / Services round table:

  • NL-T1 - this morning the SARA SRM was stuck due to an overload of srmls requests; a restart fixed it. The nameserver DB will be moved to faster hardware to remove this bottleneck. This affected all VOs. Monday is a national holiday!
  • BNL - ntr
  • KIT - ntr
  • ASGC - the CASTOR DB migration was done and the downtime ended this morning. High load was seen with a high number of jobs in CASTOR; about 2k CMS jobs are now running and a large number of connections to CASTOR was observed. Looking into this.
  • FNAL - noticed our FTS agents were consuming a large amount of CPU and memory. Yesterday we found that 3 packages explicitly mentioned in the upgrade document had not been upgraded; they were installed yesterday and we are monitoring the machine to make sure the problems do not return
  • IN2P3 - nta
  • RAL - ntr
  • CNAF - ntr
  • NDGF - ntr
  • OSG - ntr

  • CERN - ntr

AOB:

  • N.B. no call on Tuesday 1st May

-- JamieShiers - 23-Apr-2012

Topic attachments
ggus-data.ppt (PowerPoint, 2634.5 K, 2012-04-23 12:07, MariaDimou): Complete ALARM drills for the 2012/04/24 MB.