Week of 120130

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Torre, Yuri, Jamie, Eva, Eddie, Maarten, Andrea, MariaDZ, Alex, Luca);remote(Mette, Lisa, Tiju, Onno, Jhen-Wei, Rob, Jose, Rolf, Vladimir, Gonzalo, Dimitri).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • T0/Central Services
      • CERN-PROD TZERO: file transfers with "SRM_INVALID_PATH No such file or directory" errors on Sunday afternoon. GGUS:78737. Savannah:90999.
    • T1s
      • RAL-LCG2: 900 file transfer failures from the SCRATCHDISK token. Source errors: "locality is unavailable". GGUS:78731 filed on Saturday at ~17:40, fixed at ~21:40. A couple of the pools had gone offline with 'too many open files' errors. The ulimit was increased and all the ATLAS pools were restarted to pick up the increase (a sketch of the idea follows after this list).
      • TRIUMF-LCG2: GGUS:78468,78425 solved on Saturday. Problematic files (about 180) have been deleted.
      • NDGF-T1: Sunday morning 447 file transfer failures on the MCTAPE space token: "Failed to pin file. No pool candidates available/configured/left for stage". GGUS:78733. There was an ATLAS read pool failure (lhc-disk-39.pdc.kth.se) at PDC during the weekend (possibly unrelated). Still some (>100) errors this morning. [ Resolved this morning and all ok now. ]
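
      Illustration only (not from the report): the RAL fix above amounts to raising the open-file limit for the pool processes before restarting them. A minimal Python sketch of the same idea for a generic long-running service, assuming an illustrative target of 65536 descriptors; the actual RAL change was applied to the dCache pool startup and the chosen limit is not stated in the ticket.

        import resource

        def ensure_open_file_limit(minimum=65536):
            """Raise the soft RLIMIT_NOFILE of the current process if it is below
            'minimum' (an assumed value); the soft limit can only be raised up to
            the hard limit without extra privileges."""
            soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
            if soft < minimum:
                new_soft = minimum if hard == resource.RLIM_INFINITY else min(minimum, hard)
                resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
            return resource.getrlimit(resource.RLIMIT_NOFILE)

        if __name__ == "__main__":
            # Print the effective (soft, hard) limits after the adjustment.
            print("RLIMIT_NOFILE now:", ensure_open_file_limit())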

  • ATLAS internal
    • No transfer activity was reported for the DE cloud (voatlas119 box) by the DDM dashboard on Sunday evening; it seems to have recovered within a couple of hours. The cause (monitoring issue?) is under investigation. Savannah:91000. [ Eddie - issue related to the LCGR database problem yesterday ]


  • CMS reports -
  • CERN / central services
    • INC:099331 LSF down this morning. Recovered, but now we see only ~350 slots for the Tier-0 (see here) while we used to have 3-4k. [ Alex - Gavin said that this is now fixed. ]
  • CMS C2 T3 ran out of space (85 TB)


  • ALICE reports -
    • Opened alarm ticket GGUS:78739 just after midnight today because the CERN VOMS servers were stuck on a DB problem, causing voms-proxy-init to hang and thereby affecting ALICE workflows. It was quickly cured by the DB team. NOTE: due to a comedy of errors the solution advertised in the ticket has nothing to do with the problem...
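
    Side note (not part of the ALICE report): since a hung voms-proxy-init can stall a whole workflow, one defensive pattern on the client side is to enforce a timeout from the calling process. A minimal Python sketch, assuming the standard voms-proxy-init command with its -voms option and a hypothetical 60-second budget:

        import subprocess

        def make_proxy(vo="alice", timeout_seconds=60):
            """Run voms-proxy-init for the given VO, but give up after
            timeout_seconds instead of hanging forever if the VOMS server
            is stuck (e.g. on a database problem)."""
            try:
                result = subprocess.run(
                    ["voms-proxy-init", "-voms", vo],
                    timeout=timeout_seconds,
                    capture_output=True,
                    text=True,
                )
                return result.returncode == 0
            except subprocess.TimeoutExpired:
                # Treat a hang like a failure so the workflow can retry or fail over.
                return False

        if __name__ == "__main__":
            print("proxy created:", make_proxy())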


  • LHCb reports -
    • Experiment activities: MC11 Monte Carlo productions
    • New GGUS (or RT) tickets
      • T0
      • T1
        • CNAF: problem with pilot submission (GGUS:78713). Solved immediately.
        • GRIDKA: failure to log on to the LHCb Tier-1 VO-box (GGUS:78720). Solved; it was a network problem.
        • RAL: SAM tests for WMS submission failing (GGUS:78760). Solved; misconfiguration on the batch system.


Sites / Services round table:

  • NDGF - ntr
  • FNAL - ntr
  • RAL - have successfully completed the ATLAS SRM update. Regarding the file transfer failures mentioned by ATLAS: this was actually RAL T2, not RAL T1.
  • NL-T1 - ntr
  • ASGC - ntr
  • IN2P3 - ntr
  • PIC - ntr
  • KIT - ntr
  • OSG - ntr

  • CERN DB - yesterday morning a problem was observed with the LCGR DB: archive processes were blocked by a backup. Instances #1 and #4 had to be killed to get rid of the locking situation. The DB recovered after this, but unfortunately in the night the same problem occurred, and worse: the whole DB got stuck and it was impossible to log in on any instance. After killing instance #1 and rebooting the node everything came back to normal. Archive log backups have been disabled since then and the 2 standbys are kept up to date. Tomorrow: migration to new hardware and to Oracle 11g. It could be a bug in 10g and/or a disk problem - 1 disk failed yesterday.

  • CERN storage - tomorrow and the day after there will be some security updates for the CASTOR DBs. Details are on the CERN status board and in GOCDB. Today we will activate the tape gateway for ATLAS. An EOS update for ATLAS will probably be scheduled for this week.

  • CERN dashboards - degraded service for CMS and ATLAS for 4 hours yesterday due to LCGR problems.

  • CERN Grid services: for tomorrow: an EGI broadcast was sent this Monday about the CERN WMS server upgrade to EMI-1, inviting VOs to update their UI config files.
    https://operations-portal.in2p3.fr/broadcast/archive/id/587

AOB:

Tuesday

Attendance: local(Alex, Eddie, Eva, Luca M, Maarten, Maria D, Nilo);remote(Burt, Gonzalo, Jeremy, Jhen-Wei, Lorenzo, Mette, Rob, Rolf, Ronald, Stefano B, Stefano P, Tiju, Torre, Vladimir, Xavier).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • T0/Central Services
      • LCGR scheduled downtime for the Oracle 11g and hardware upgrade caused unexpected disruption: VOMS was unavailable because the fallback to BNL did not work as expected
        • Maarten: the VOMS clients in gLite 3.x are known to hang when a particular VOMS server has a problem, even when a timeout option is supplied on the command line; for example, when the CERN VOMS server database had a problem on Sunday night, voms-proxy-init just hung
        • Stefano B: we saw voms-proxy-init fail immediately during today's DB downtime
        • Torre: we will have a look at the details
      • CERN-PROD TZERO: file transfers with "SRM_INVALID_PATH No such file or directory" errors on Sunday afternoon. Still investigating apparent deletion of dataset. GGUS:78737. BUG:90999.
    • T1s
      • RAL-LCG2: unscheduled downtime yesterday 16:45-20:45, ATLAS SRM problems, resolved
      • SARA-MATRIX: transfer failures, transfer efficiency 70%. Ticketed at 4:48 this morning; at 11:33 the site asked us to check the status after an SRM restart. GGUS:78786
  • ATLAS internal
    • Large accrual of activated jobs on ANALY_LONG_BNL_ATLAS queue, over 20k waiting jobs. Under investigation. GGUS:78736


  • CMS reports -
    • T1: GGUS:78725 (proxy expiration at CCIN2P3 - keep it open or open an ad-hoc one? investigate with another tool?)
      • Rolf: will check
    • CERN services: GGUS:78619 (problem with gLite 3.2.10-0 on lxplus) - the solution is not satisfactory: the 3.2.10 setup points to the 3.2.8 UI and new tools break as they expect well-formed JSON. New ticket in SNOW: INC:099759
      • Maarten: the 3.2.11 UI would be better, I will ask for it to be installed on AFS (GGUS:78808)



  • LHCb reports -
    • T0
      • Vladimir: there was no downtime for VOMS etc. in the GOCDB
      • Maarten: indeed, someone made a mistake
    • T1
      • CNAF: address of the LFC service disappeared in the DNS
        • Vladimir: the problem disappeared, no ticket was opened

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ?
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3
    • downtime Feb 7 all day for batch + dCache upgrades (1.9.12 and new HW), network will be available
  • KIT
    • LHCb conditions DB + LFC upgrade tomorrow
  • NDGF - ntr
  • NLT1 - ntr
  • OSG
    • a few problems were observed when comparing availability/reliability reports from OSG RSV and SAM, probably due to today's DB downtime
  • PIC
    • yesterday LHCb LFC was upgraded to Oracle 11g, went OK
  • RAL
    • Oracle 11g upgrade of LHCb LFC and 3D databases ongoing, looking OK
    • LHCb SRM upgrade tomorrow

  • CASTOR/EOS
    • ALICE Xrootd redirector had a configuration problem, fixed
    • all CASTOR instances are using the tape gateway now (ALICE since today)
  • dashboards
    • no problems observed after today's DB upgrade
  • databases
    • LCGR upgrade took 15 min longer than scheduled, looking OK
    • ATLAS DDM dashboard DB remains disabled to clean up old data, should be back later this afternoon
    • Oracle security patches applied on CASTOR stager and SRM DBs for all LHC experiments
  • grid services
    • VOMS/FTS/LFC affected by LCGR Oracle DB upgrade around noon, now OK
    • Upcoming grid software upgrades:
      • 10% of worker nodes running EMI-1 in preprod in the coming days
      • 1 preprod CREAM CE to be upgraded to EMI-1 this week (5 remain on gLite 3.2)
      • users should switch to our EMI-1 WMS servers, an EGI broadcast was sent to affected communities
    • Reminder 1: the gLite UI 3.1 has been deprecated for a long time and will not work with the new WMS servers (6 March). A warning message has been added to the UI startup script.
    • Reminder 2: LCG-CE nodes to be stopped ASAP, but some VOs are still using them!

AOB:

Wednesday

Attendance: local(Jamie, Mike, Luca, Luca, MariaDZ, Alex);remote(Mette, Gonzalo, Lisa, Stefano, Pavel, Kyle, Tiju, Vladimir, Ron, Jhen-Wei).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • T0/Central Services
      • Further issues arising from LCGR downtime (they are not issues with the LCGR downtime itself):
        • ATLAS DDM collection/caching of VOMS info was disabled by voms-admin returning error code 0 (with an empty result) despite the VOMS service being down; protection added in DDM (a sketch of that protection follows after this list)
        • CERN-PROD: transfer submission to FTS services at CERN not working after the LCGR intervention, fixed by a restart at 17:00. GGUS:78807
    • T1s
      • SARA-MATRIX: many transfer failures on srm.grid.sara.nl, ticketed at 7:23 today. The site reported at 11:46 that the problem was due to high namespace load; the number of concurrent srmLs requests was reduced; awaiting the end of the FTS downtime (Oracle 11g upgrade) to see if it helps. GGUS:78819
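
      Side note (not from the minutes): the "protection added in DDM" mentioned under T0/Central Services is, in essence, refusing to trust a successful exit code when the result is empty. A minimal Python sketch of that idea; query_voms() is a hypothetical stand-in for the real voms-admin call used by DDM:

        def refresh_voms_cache(query_voms, cache):
            """Update the cached VOMS info only if the query returned a non-empty result.

            query_voms: callable returning (exit_code, entries); a hypothetical
            stand-in for the voms-admin invocation used by DDM.
            cache: dict holding the last known-good VOMS entries.
            """
            exit_code, entries = query_voms()
            # Protection: exit code 0 with an empty result is treated as a failure,
            # so a down VOMS service cannot silently wipe the cached info.
            if exit_code != 0 or not entries:
                return cache  # keep the previous known-good data
            cache["entries"] = entries
            return cache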

  • ATLAS internal
    • NTR


Sites / Services round table:

  • NDGF - ntr
  • PIC - on Monday night it was decided to disconnect from the LHCONE test: the link from the NREN to GEANT was limited to 1 Gbps. As more and more T2s are joining, PIC will stay disconnected until the final 10 Gbps link is there.
  • FNAL - ntr
  • KIT - the upgrade of the LHCb conditions DB and LFC is running smoothly; otherwise ntr
  • RAL - successfully upgraded LHCb SRM
  • IN2P3 - ntr
  • CNAF - the LHCb LFC has been restored; the intervention is complete and the service is now working
  • NL-T1 - maintenance today on the Oracle service: upgrade to 11g. Unfortunately it is going a lot slower than expected and will probably prolong the downtime until tomorrow
  • ASGC - ntr
  • OSG - Rob reported that RSV / SAM reports were down; now back to normal

  • CERN DB - LCGR was upgraded yesterday and had a few problems afterwards. Node #2 had a broken network card causing instabilities; today a redundant card was put in. An investigation is ongoing into a specific workload (some jobs of the CMS dashboard) that gives errors related to 11g. The CMSR upgrade to 11g is ongoing.

  • CERN dashboards - following the LCGR upgrade most services were back within 10 minutes, except the ATLAS DDM dashboard: the cleanup took longer (the rebuild of an index was on node #2) and the dashboard was working by 20:30.

AOB:

Thursday

Attendance: local(Jamie, Maria, Mike, Maarten, Torre, Alex);remote(Elizabeth, Michael, Mette, Jhen-Wei, Stefano, Gonzalo, Alexander, Rolf, Gareth, Andreas, Burt, Vladimir, MariaDZ).

Experiments round table:

  • ATLAS reports -
  • Report to WLCGOperationsMeetings
    • T0/Central Services
      • CERN-PROD TZERO: still investigating the apparent deletion of a dataset. Alessandro Di Girolamo has asked on the ticket whether the CASTOR team can find any info on who issued the deletion. GGUS:78737
      • Large increase in connection counts on ADCR instance 2 (PanDA) in the last days, with periods of exceeding the limit on simultaneous sessions; investigating at the application level (PanDA monitor). A connection-pool sketch follows after this list.
    • T1s
      • SARA-MATRIX: awaiting return of FTS after Oracle 11g upgrade to check transfer failure resolution. GGUS:78819
      • IN2P3-CC: Brief (10min) period of unreachable SRM yesterday ~13:00, site investigating the cause. GGUS:78833
      • RAL-LCG2: job failures with stage-out errors yesterday, also transfer errors; OK by yesterday evening. This morning the site reported problems found with the SRM nodes, under investigation. GGUS:78842
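
      Side note (not from the report): one application-level way to keep per-server session counts on a database instance bounded is to route all queries through a fixed-size connection pool instead of opening ad-hoc connections. A minimal Python sketch using cx_Oracle's session pool; the credentials, DSN and pool sizes are placeholders, and the actual PanDA monitor fix is not described in these minutes.

        import cx_Oracle

        # Placeholder connection details; the real ADCR account and DSN are not given here.
        pool = cx_Oracle.SessionPool(
            user="panda_monitor",
            password="changeme",
            dsn="adcr-dsn",
            min=2,
            max=20,          # hard cap on sessions this server can hold open
            increment=1,
        )

        def run_query(sql, params=None):
            """Borrow a connection from the pool, run one query, and return the rows."""
            connection = pool.acquire()
            try:
                cursor = connection.cursor()
                cursor.execute(sql, params or {})
                return cursor.fetchall()
            finally:
                pool.release(connection)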

  • ATLAS internal
    • Procedural change for sites placed offline due to analysis functional test failures: Sites are placed in 'test' mode rather than 'brokeroff' mode. The goal of the change is to decrease the users' exposure to site problems. See elog:33458 for more information.


Sites / Services round table:

  • BNL - ntr
  • NDGF - ntr
  • CNAF - ntr
  • ASGC - ntr
  • PIC - ntr
  • NL-T1 - have completed Oracle 11g upgrade (and FTS is now back)
  • IN2P3 - ntr
  • RAL - outage next Wed (8 Feb) for some networking work on core - in GOCDB
  • FNAL - ntr
  • KIT - yesterday migrated the LHCb conditions DB and LFC to Oracle 11g. Finished with a few hours' delay: 21:00 for the conditions DB and 22:30 for the LFC
  • OSG - LHC operations found a problem with file attachments in tickets coming from GGUS: inability to access the SOAP attachment. The problem is being worked on. MariaDZ - would like to take this offline as she is trying to get in contact with the developers.

  • CERN Grid: decommissioning of the LCG CEs: it should have happened before Xmas; they are still being used, but will be drained from Monday on. More details tomorrow from Ulrich.

  • CERN DB - at the moment LCGR instance #2 is down because its eth2 network interface is down. Working on it, but keeping the node down to avoid affecting the service. The ATLAS archive DB is being migrated right now. CMS online was upgraded this morning, but streaming is currently down - a problem with LogMiner, being worked on. After yesterday's CMSR upgrade, PhEDEx applications at FNAL could not connect due to a wrong configuration. Fixed.

AOB:

Friday

Attendance: local(Jamie, Torre, Maarten, Mike, Eva);remote(Ulf, Michael, Alexander, Gonzalo, Xavier, Vladimir, Lisa, Stefano Perazzini, Stefano Belforte, John, Rolf, Kyle).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • CERN-PROD: castoratlas-xrdssl had a problem with proxies this morning, promptly acted on and resolved (20 min); the cause was an expired host certificate. GGUS:78886
    • Still investigating the high connection counts on ADCR instance 2 (PanDA). Application-level problem. Counts lowered in the interim by removing two servers.
  • T1s
    • IN2P3-CC: cooling system failure required shutdown of some worker nodes this morning, unscheduled downtime 00:30-11:00, WNs now restored


  • LHCb reports -
    • T0
      • CERN: pilots aborted (GGUS:78893). Solved: LCG CEs removed from the LHCb configuration, CREAM CEs fixed
    • T1

Sites / Services round table:

  • NDGF - ntr
  • BNL - ntr
  • NL-T1 - ntr
  • PIC - the SIR for the cooling incident of 10 days ago has been uploaded to the CERN TWiki
  • KIT - ntr
  • FNAL - ntr
  • CNAF - ntr
  • RAL - ntr
  • IN2P3 - the cooling failure this morning was in equipment located outside on the roof, freshly installed but with an error in the installation: a junction was frozen. Separately, there has been a long-standing ticket for LHCb to logically move files from one space token to another; the dCache developers helped us with some scripts and this is now finished. It might be interesting for other installations; see GGUS:75158 for more info.
  • OSG - ntr

  • CERN CEs - planning to retire the LCG CEs; 2 went into a strange state last night, the reason is not entirely clear. CREAM CE 208 is a pre-production CE running the EMI release; the number of gridftp sessions is too small due to a configuration error and will be patched locally, but it should be fixed in the official release. The LCG CEs will be in draining mode for final retirement next week.

  • CERN DB - the problem with CMS replication after the migration to 11g was fixed late in the evening. It was due to a parameter change: not all instances were up, so some had the new value and some the old one. The SR with Oracle was not very useful, but the capture process was able to restart.

AOB: MariaDZ: The problem with attachments in the GGUS-OSG ticketing system interface is handled in GGUS:78844 and hopefully solved.

-- JamieShiers - 12-Jan-2012
