Week of 111205

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Claudio, Jamie, Szymon, Dan, David, John, Manuel, Jhen-Wei, Maarten, MariaDZ, Massimo);remote(Michael, Gonzalo, Mette, Alexander, Kyle, Rolf, Tiju, Lisa, Dimitri, Paolo).

Experiments round table:

  • ATLAS reports -
  • T0
    • CERN-PROD transfers from T2s failed (GGUS:77007). The partition sizes on one of the FTS boxes are unsuitable; a transparent migration of some channels to a saner hardware configuration will be scheduled for Monday. Ticket closed.
    • The ss_BNL_01 VOBOX was degraded because two subscriptions for the same dataset, but with different versions, had been entered by Panda. The entries were deleted and the service was back.
    • ATLAS-KIT-Frontier was unavailable due to an outage. It was restored and the service is back.
    • ADCR's node 3 rebooted.
  • T1 sites
    • ntr
  • T2 sites
    • ntr


  • CMS reports -
  • LHC / CMS detector
    • HI Physics data taking
  • CERN / central services
    • Huge load on T1TRANSFER tonight (up to 1200 active + 1220 queued transfers), likely due to stageout to CERN by CRAB jobs. How to address this problem within CMS is still to be understood.
    • T0EXPRESS filling up (GGUS:77036); asked to move empty servers from CMST3. [ John - some extra capacity was moved to T0EXPRESS, so it should be OK now ]
  • T0
    • Running HI express and prompt reconstruction.
  • T1 sites:
    • MC production and/or reprocessing running at all sites.
    • T1_DE_KIT: File access problems (GGUS:76807)
    • T1_TW_ASGC: Migration problems (SAV:125041) and file access problems (GGUS:77047)
    • Problem of throughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983) [ Rolf - the ticket is with RENATER; we will try to speed things up, but it is for the most part out of our responsibility ]
  • T2 sites:
    • NTR
  • Other:
    • NTR


  • ALICE reports -
    • The ALICE build machines (which are not in the CC) were affected by the micro power cut in the early afternoon.


Sites / Services round table:

  • BNL - ntr
  • PIC - ntr
  • NDGF - ntr
  • NL-T1 - ntr
  • IN2P3 - we will have a major downtime tomorrow for various services; the exact times vary per service. Batch submission stops at 19:00 UTC, jobs still running will be stopped at 06:30 CET tomorrow, and batch will restart on 7 Dec at 09:00 CET. dCache, xrootd, MSS etc. are also impacted, but for much shorter times. Reminder for LHCb: we have been waiting for a reply to a GGUS ticket for more than a week (GGUS:75158).
  • RAL - ntr
  • FNAL - ntr
  • KIT - ntr
  • CNAF - ntr

  • OSG - ntr

  • CERN - ATLAS EOS was upgraded to RC44 at 11:00 this morning. One of the central servers crashed and the failover took 20 minutes.

  • CERN DB - the only problem was with ADCR and is still being investigated. After the reboot there was a high risk of a reboot of the first node due to overload. Services were restarted and are all back on their proper nodes.

AOB: (MariaDZ) There will be the usual test ALARMs with the GGUS Rel. this Wed 2011/12/07. You may browse here the list of 13 items that will enter production.

Tuesday:

Attendance: local(Przemyslaw, Jamie, Dan, Steve, Stephen, Mattia, Lola, MariaDZ, Maarten);remote(Michael, Burt, Mette, Jeremy, Ronald, Kyle, Tiju, Jhen-Wei, Rolf).

Experiments round table:

  • ATLAS reports -
  • T0
    • ADCR instance 3 was not available from ~11:00 to 11:45. ATLAS DQ2 and the LFCs had to be restarted. The LFC restart was initiated by an ALARM (GGUS:77049). [ Przemyslaw (IT-DB) - regarding the LFC problems on ADCR last night: the problem resolved itself. It looks like one of the sessions (probably an LFC one) died for an unknown reason and kept its resources; Oracle has an internal mechanism that pings a session after 2 hours, so the resources were only released at that timeout. ]
    • Around 18:00 the response to a single "bsub" call started taking 30-60 s on average. An ALARM was sent (GGUS:77065). The slowdown was caused by a user calling bstatus repeatedly.
    • The LFC had many locking sessions starting ~10pm. Jobs in the TW, DE and T0 clouds started failing with LFC errors. An ALARM was sent to the LFC folks (GGUS:77069). The locks were gone around 11:30pm, but there was no reply to the ticket last night (a query for spotting such blocking sessions is sketched after this report).
  • T1 sites
    • SARA transfers failing (GGUS:77072). The site reports that the SRM is overloaded. [ Ronald - the SRM is indeed overloaded; we are currently optimising the Postgres DB, which has so far not had any results. Has there been any change in access patterns? There are lots of put, get and ls requests, which seems abnormal. Dan - since the outage you have been blacklisted, so is the load from outside? We will check whether anything changed on our side. Ronald - being followed up in the GGUS ticket ]
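
Aside on the LFC locking sessions (GGUS:77069): a quick way to see which Oracle sessions are blocked and which session holds the lock is to query v$session for rows with blocking_session set. The sketch below is only an illustration of that query from Python, assuming the cx_Oracle module and a monitoring account with read access to the v$ views; the connect string is a placeholder, not a real ADCR account, and this is not necessarily how IT-DB diagnosed the incident.

    # Illustrative sketch: list Oracle sessions that are currently blocked and
    # the session blocking them, e.g. to spot an orphaned LFC session holding
    # locks. Assumes cx_Oracle and a monitoring account that can read v$session;
    # the DSN below is a placeholder.
    import cx_Oracle

    DSN = "monitor/secret@adcr-example.cern.ch:1521/ADCR"  # placeholder, not a real account

    QUERY = """
        SELECT sid, serial#, username, program, event, blocking_session, seconds_in_wait
          FROM v$session
         WHERE blocking_session IS NOT NULL
         ORDER BY seconds_in_wait DESC
    """

    def report_blocked_sessions():
        conn = cx_Oracle.connect(DSN)
        try:
            cur = conn.cursor()
            cur.execute(QUERY)
            rows = cur.fetchall()
            if not rows:
                print("no blocked sessions")
            for sid, serial, user, prog, event, blocker, wait in rows:
                print("session %s,%s (%s, %s) waited %ss on '%s', blocked by SID %s"
                      % (sid, serial, user, prog, wait, event, blocker))
        finally:
            conn.close()

    if __name__ == "__main__":
        report_blocked_sessions()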


  • CMS reports -
  • LHC / CMS detector
    • HI Physics data taking
  • CERN / central services
    • LCG Database issue today
    • Huge load on T1TRANSFER at the weekend (up to 1200 active + 1220 queued transfers), likely due to stageout to CERN by CRAB jobs. How to address this problem within CMS is still to be understood.
    • T0EXPRESS filling up (GGUS:77036) more servers added. Thanks!
  • T0
    • Running HI express and prompt reconstruction.
  • T1 sites:
    • MC production and/or reprocessing running at all sites.
    • T1_DE_KIT: File access problems (GGUS:76807). Closed now, was a bad host.
    • T1_TW_ASGC: Migration problems (SAV:125041) and file access problems (GGUS:77047). They think it is their tape system but not sure if it is hardware or software.
    • Problem of throughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983) [ Rolf - the CMS network problem seems to be related to another problem with a Japanese site; we are following up in both directions: to the US and to Japan. ]
  • T2 sites:
    • NTR
  • Other:
    • NTR


Sites / Services round table:

  • BNL -
  • FNAL - ntr
  • NDGF - ntr
  • NL-T1 - ntr
  • RAL - ntr
  • IN2P3 - still in downtime but considering opening the batch system earlier than foreseen, though with reduced capacity; maybe already this evening
  • ASGC - ntr
  • OSG - ntr
  • GridPP - ntr

  • CERN DB: this morning (around 03:00) a problem with LCGR started. The most affected application was VOMS, but also the dashboards and LCG SAM, i.e. most applications on LCGR. It was investigated all morning. Before the root cause was found, instance 5 became unresponsive and was rebooted; after that everything went back to normal. It looks like low-level locking in the DB; to find the root cause we will have to work with Oracle support.

  • CERN - FTS T2 service. IT SSB. At 09:00 UTC on Wednesday 7th December the tier2 FTS channels on fts-t2-service.cern.ch will be paused for up to two hours to allow migration to a sensible hardware configuration. During this time new transfers will queue for later processing.

  • CERN Dashboard - the ATLAS collector had issues due to the DB problem; it has started to catch up and recover the backlog. An ATLAS dashboard intervention was planned for this afternoon but has been postponed until tomorrow due to the LCGR problem.

AOB: (MariaDZ) With tomorrow's GGUS release the ALARM tickets' email notifications change to the template at https://savannah.cern.ch/file/Alarm_NEW.txt?file_id=22758, so that supporters notified by text message (SMS) on their mobile phones no longer see the series of asterisks that used to precede the affected service. Details in Savannah:124169.

Wednesday

Attendance: local(Jarka, Jamie, Lukasz, Przemyslaw, Manuel, John);remote(Gonzalo, Mette, Burt, Kyle, Tiju, Ron, Pavel, Stephen, Rolf).

Experiments round table:

  • ATLAS reports -
  • LHC:
    • 2011 data taking ends this afternoon.
  • T0
    • Regarding the LFC locking sessions (GGUS:77069) from Monday night, the source of the lock could not be identified (the relevant query is called from 44 places in the LFC code). Closed issue.
  • T1 sites
    • SARA transfers failing (GGUS:77072). Seem to be caused by an order of magnitude higher number of srmPut, Ls commands to the SARA SRM. ATLAS could not find anything corresponding in the Production Jobs or Data Transfers. Yesterday evening ATLAS un-blacklisted the queues and storage endpoints -- no errors observed afterwards.
    • RAL T0 export errors (GGUS:77108). Was fixed before ticket was sent. (error during TRANSFER_PREPARATION phase: [REQUEST_TIMEOUT] Request timeout (internal error or too long processing))


  • CMS reports -
  • LHC / CMS detector
    • HI Physics data taking finishing today
  • CERN / central services
    • CMSR database problem this morning, GGUS:77142 (started around 10:30 and restored itself around 11:00) [ IT-DB - a service request to Oracle support is being prepared; we cannot find the root cause ourselves ]
    • Huge load on T1TRANSFER at the weekend (up to 1200 active + 1220 queued transfers), likely due to stageout to CERN by CRAB jobs. How to address this problem within CMS is still to be understood.
  • T0
    • Running HI express and prompt reconstruction.
  • T1 sites:
    • MC production and/or reprocessing running at all sites.
    • T1_TW_ASGC: Migration problems (SAV:125041) and file access problems (GGUS:77047). They think it is their tape system but not sure if it is hardware or software.
    • Problem of thoughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983)
  • T2 sites:
    • NTR
  • Other:
    • NTR


Sites / Services round table:

  • PIC - we have had a network issue since around midnight: the OPN link is down due to an incident in Madrid. There is no impact as data is routed through the general internet. Around noon a secondary incident of about 20 minutes led to some disruption, but with little or no impact (CMS debug transfers). We are following up the incident with the NREN to get the OPN link fixed.
  • NDGF - tonight the CSC tape back-end will be down from 9 pm to midnight and some ALICE data will be unavailable
  • FNAL - ntr
  • ASGC - ntr
  • RAL - had a problem with DNS around 11:00; worked around it by changing the server order. Fixed now; it affected some ATLAS transfers and batch work.
  • NL-T1 - had a problem yesterday with the SRM: a considerable overload with put, get and ls requests. A large part of the day was spent trying to tune the Postgres DB further. Everything was restarted late yesterday afternoon and the problem disappeared.
  • IN2P3 - out of downtime and all services are back; the batch system was already opened yesterday afternoon with low capacity
  • KIT - ntr

  • OSG - ntr

  • CERN - ntr

AOB: (MariaDZ) Please observe the "Did You Know?..." link today on the GGUS homepage. It changes with every release and it now shows the new functionality of Site Availability Status info directly from GOCDB. Details on https://ggus.eu/pages/didyouknow.php#2011-12-07

Thursday

Attendance: local(Dan, Jamie, Lukasz, Maarten, Zbyszek, Edoardo, Manuel, MariaDZ);remote(Mette, Michael, Burt, Ronald, Gareth, Rolf, Kyle, Stephen, ShuTing).

Experiments round table:

  • ATLAS reports -
  • EGI
  • T0
    • ntr
  • T1 sites
    • SARA-MATRIX transfer errors (GGUS:77224). SRM was unreachable for ~2 hours this morning. [ Ronald - SARA had not seen any errors in logfiles but got complaints from local LHCb user. Still under investigation. ]
    • Failed to contact on remote SRM at IN2P3-CC (GGUS:77242). Queues were also auto-excluded because of failing jobs. [ Rolf - we had an SRM crash at noon which might explain problems seen. ]


  • CMS reports -
  • LHC / CMS detector
    • Shutdown
  • CERN / central services
    • Change of UK CA causing problems for CMS User (probably first of many) GGUS:77179
  • T0
    • Running HI prompt reconstruction.
  • T1 sites:
    • MC production and/or reprocessing running at all sites.
    • T1_TW_ASGC: Migration problems (SAV:125041) and file access problems (GGUS:77047). They think it is their tape system but not sure if it is hardware or software. [ ShuTing - service manager still investigating and will update ticket ]
    • Problem of throughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983)
    • Corrupt files at IN2P3 GGUS:77173
  • T2 sites:
    • NTR
  • Other:
    • NTR


Sites / Services round table:

  • NDGF - work on the CSC tape back-end is not yet finished and will continue until Saturday morning. A server in Sweden will be offline 08:00-17:00 tomorrow; this will affect ATLAS tape reading.
  • BNL - announcement: HPSS will be upgraded next week, so the tape back-end will not be available Monday to Friday. ATLAS has been informed.
  • FNAL - 1) increased FTS timeouts for non-US T2s. 2) 4 MC files were deleted that should not have been; additional integrity and safety checks are being added.
  • NL-T1 - nta
  • RAL - ntr
  • IN2P3 - nta
  • ASGC - nta
  • OSG - yesterday's alarm tests went through correctly and were passed on to CMS and ATLAS correctly.

  • CERN DB: next Wednesday we would like to do an intervention on the replication to the T1s: LHCb 10:00-12:00, ATLAS 14:30-16:30. If there are objections please say so. (This is in preparation for the upgrade to Oracle 11g.)

AOB:

  • LHC page 1: 07-12-2011 18:00 "End of 2011 run. Thank you all for this brilliant and exciting year. We look forward to another unforgettable year in 2012. Start of 2011 Xmas stop."

Friday

Attendance: local(Jamie, Lukasz, Przemyslaw, Jan);remote(Xavier, Tore, John, Joel, Alexander, Jhen-Wei, Paolo, Rolf, Stephen, Burt).

Experiments round table:

  • ATLAS reports -
  • T0
    • One PanDA monitor server down with a hardware problem.
  • T1 sites
    • Some file deletion errors at TRIUMF (GGUS:77254). TRIUMF had a network incident from 13:40 to 14:10 (UTC) but failures also occurred outside this period. Still investigating.


  • CMS reports -
  • LHC / CMS detector
    • Shutdown
  • CERN / central services
    • Change of UK CA causing problems for CMS User (probably first of many) GGUS:77179
  • T0
    • Running HI prompt reconstruction.
  • T1 sites:
    • MC production and/or reprocessing running at all sites.
    • T1_TW_ASGC: Migration problems (SAV:125041) and file access problems (GGUS:77047). They think it is their tape system but not sure if it is hardware or software.
    • Problem of throughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983)
    • Unable to run multicore jobs at PIC SAV:125174
  • T2 sites:
    • NTR
  • Other:
    • NTR

  • LHCb reports - yesterday we raised an ALARM ticket for a VO box at CERN which was not responding. Lemon did not report the problem because we have an issue running the Lemon agent on this machine. Manuel - there were more than 20K connections in TIME_WAIT, so the kernel could not provide sockets for any other application (a way to check this is sketched below); the Lemon agent is designed to die when it does not have resources. Will reply soon and ask LHCb to have a look - the connections are to MySQL on another VO box. Joel - this started without any change on our side; we have tried to tune it but without success. We should sit down with an expert to try to find the underlying issue. Main general comment from LHCb: we are starting a huge MC production which will last several weeks, so it will keep the Tier1s and Tier2s busy. Rolf - when will this start? This weekend or next week? Joel - we have started some production with a low number of jobs and will increase progressively in the coming weeks.
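
For reference, the TIME_WAIT pile-up Manuel describes can be checked directly from /proc/net/tcp, where TIME_WAIT sockets carry state code 06. The snippet below is a minimal, illustrative sketch in standard Python (no external dependencies), not the procedure actually used; grouping by remote endpoint only serves to show whether the connections all point at the same peer, e.g. the MySQL service on the other VO box.

    # Illustrative sketch: count TCP sockets in TIME_WAIT on a node such as the
    # affected VO box, grouped by remote endpoint, by parsing /proc/net/tcp.
    # TIME_WAIT is state 0x06 in the kernel's TCP state numbering.
    from collections import Counter

    def hex_to_ipv4(h):
        # /proc/net/tcp stores IPv4 addresses as little-endian hex, e.g. 0100007F -> 127.0.0.1
        return ".".join(str(int(h[i:i + 2], 16)) for i in (6, 4, 2, 0))

    def time_wait_by_peer(path="/proc/net/tcp"):
        counts = Counter()
        with open(path) as f:
            next(f)                      # skip the header line
            for line in f:
                fields = line.split()
                remote, state = fields[2], fields[3]
                if state != "06":        # keep only TIME_WAIT sockets
                    continue
                ip, port = remote.split(":")
                counts["%s:%d" % (hex_to_ipv4(ip), int(port, 16))] += 1
        return counts

    if __name__ == "__main__":
        counts = time_wait_by_peer()
        print("total TIME_WAIT sockets: %d" % sum(counts.values()))
        for peer, n in counts.most_common(10):
            print("%8d  %s" % (n, peer))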

Sites / Services round table:

  • KIT - ntr
  • NDGF - ntr
  • RAL - next Monday "an away day" so staff won't be able to call in
  • NL-T1 - ntr
  • CNAF -
  • ASGC - ntr
  • IN2P3 - ntr
  • FNAL - ntr

  • OSG - ntr

  • CERN Storage - received the first real EOS release and will try to deploy it next week. Stephen - this has been announced internally to CMS.

AOB:

-- JamieShiers - 28-Nov-2011
