Week of 130218

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs: WLCG Service Incident Reports
Broadcasts: Broadcast archive
Operations Web: Operations Web

General Information

General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance: local(Alex B, Belinda, Maarten, Stefan, Steve, Xavier);remote(Boris, Elizabeth, Gareth, Lisa, Michael, Onno, Rolf, Torre, Wei-Jen).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • NTR
    • Tier1s
      • Sat pm: transfer failures to Taiwan, attributed by the site to busy disk servers; OK again and ticket closed Sun night. GGUS:91581
      • Sun pm: Source errors in transfers from TRIUMF-LCG2 and other CA sites. FTS cannot contact non-CA FTS servers. Site is working on it. GGUS:91588
    • Tier 2 calibration centers
      • Sun am: ES CALIBDISK failures of functional test transfers; SRM down at IFIC, all file transfers failing. Failure in one RAID group, which is now offline; Lustre and SRM are being restored. GGUS:91586
    • FYI: ATLAS AMOD(s) for this and next week not yet identified.

Sites / Services round table:

  • ASGC
    • ATLAS and CMS jobs affected by the CVMFS 2.0.19 cache filling up due to a known bug; for now mitigated by manual cleanups; should be fixed in the upcoming 2.1.7 release, expected in the coming days (a cleanup sketch follows this round table)
  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • NDGF - ntr
  • NLT1
    • during the weekend one dCache pool node was stuck; it was restarted last night
  • OSG - ntr
  • RAL
    • some ongoing issues with the batch system not starting enough jobs, being investigated

  • dashboards - ntr
  • GGUS
    • NB!!! The Italian Tier1 needs to update the host certificate for its ticketing system (ticketing.cnaf.infn.it). The change will be made on Wednesday 2013/02/20 around 09:30 CET. A short interruption of the interface with GGUS may be observed, as the server needs to be rebooted. Details in Savannah:135912
  • grid services - ntr
  • storage
    • during the weekend EOS-LHCb was unstable; after SW updates earlier today its behavior looks smoother in the monitoring
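
A minimal sketch of how the "manual cleanups" mentioned in the ASGC item might be scripted is shown below. It is only an illustration, not the procedure used at ASGC: the cache location /var/lib/cvmfs (CVMFS_CACHE_BASE) and the use of "cvmfs_config wipecache" are assumptions about a standard client setup, and the 90% threshold is arbitrary.

    # Illustrative sketch only: wipe the CVMFS cache when its partition fills up.
    # Assumptions: cache base is /var/lib/cvmfs and `cvmfs_config wipecache`
    # is available on the worker node; adjust both to the local configuration.
    import shutil
    import subprocess

    CACHE_BASE = "/var/lib/cvmfs"   # assumed CVMFS_CACHE_BASE
    THRESHOLD = 0.90                # clean when the cache partition is >90% full

    def cache_usage_fraction(path: str = CACHE_BASE) -> float:
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    if __name__ == "__main__":
        if cache_usage_fraction() > THRESHOLD:
            # Drops all cached files; repositories are re-fetched on next access.
            subprocess.run(["cvmfs_config", "wipecache"], check=True)

Normally the client's own quota handling (CVMFS_QUOTA_LIMIT) keeps the cache bounded; the wipe above is just the simplest stand-in for the manual workaround described in the report.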

AOB:

Tuesday

Attendance: local(Alex B, Eva, Maarten, Maria D, Stefan, Steve);remote(Boris, Jeremy, Lisa, Matteo, Michael, Pepe, Rob, Rolf, Ronald, Saverio, Tiju, Wei-Jen, Xavier).

Experiments round table:

  • ATLAS reports -
    • NTR. Most probably no one from ATLAS can connect today. Sorry.

  • LHCb reports -
    • Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
    • T0: NTR
    • T1:
      • IN2P3: NAGIOS problem still ongoing (GGUS:91126).
        • Rolf: the ticket is with the SAM team now; we are not aware of changes that might explain why the test works only sometimes
        • Stefan: the test is failing randomly, the cause is not yet known

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP
    • the multi-core ATLAS jobs mentioned in the KIT report are PROOF-Lite jobs
  • IN2P3
    • on March 19 there will probably be an all-day outage for electrical work, details to follow later
  • KIT
    • our Frontier Squid servers will be updated between 9 and 10 UTC tomorrow and the day after; this should be transparent
    • single-core queues have been misused by ATLAS users submitting multi-core jobs, ATLAS are following up
  • NDGF
    • we have observed transfer errors due to a network problem, being investigated
  • NLT1 - ntr
  • OSG - ntr
  • PIC
    • complete downtime on March 26 between 5-19 UTC for electrical maintenance
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW
    • experiments will no longer be prompted to inform Maria of important tickets that are not making progress, as such tickets can just be included in the experiment reports of the bi-weekly Operations Coordination meeting
  • grid services - ntr

AOB:

Wednesday

Attendance: local(Alexei, Belinda, David, Dirk, Luca C, Maarten, Maria D, Massimo, Stefan, Steve);remote(Boris, John, Kyle, Lisa, Matteo, Michael, Pavel, Pepe, Rolf, Ron, Wei-Jen).

Experiments round table:

  • ATLAS reports -
    • Central services
      • GGUS: problem opening a TEAM ticket for a specific site: the site parameters have not been synchronized correctly with the GOC DB (GGUS:91634)
      • GGUS: a shifter could not open tickets due to a CRL issue (GGUS:91610)
      • SLS for ATLAS HammerCloud has been outdated (last update: 8 Feb 2013) https://sls.cern.ch/sls/service.php?id=HC.ATLAS
        • On Feb 12: "the web server that serves the SLS reports is decommissioned and I'm moving the thing to the new one."
    • T1s and network
      • FZK-LCG2: there are long-standing GGUS tickets for problems in transfers between FZK-LCG2 and UK sites (GGUS:87958, GGUS:91439)
      • RRC-KI-T1: ATLAS has started integrating the RU-T1 (RRC-KI-T1) in ATLAS systems. FTS3 servers at RAL and CERN were used for test file transfers.
        • Alexei: the prototype T1 will be used in a reprocessing exercise
      • Alexei: next week a small reprocessing campaign will run at the T1 sites

  • LHCb reports -
    • Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
    • T0: NTR
      • Migration CASTOR -> EOS progressing, estimated to last for another 6 weeks
    • T1:
      • IN2P3: NAGIOS problem still ongoing (GGUS:91126); logfiles of failed SAM probes seem to indicate that the probe is killed by the batch system (logs uploaded to the GGUS ticket)
        • Rolf: we have also involved our batch system experts

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF
    • CREAM ce01-lcg will be down Feb 21-28 for upgrade to EMI-2 on SL6
  • FNAL - ntr
  • IN2P3 - nta
  • KIT
    • 3 Frontier Squid servers were upgraded OK today, the remaining 3 will be done tomorrow
  • NDGF
    • the transfer errors reported yesterday are still being investigated
    • tomorrow there will be a short downtime of the SRM head node for security patching; it might even cure the transfer errors
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards
    • this morning the ATLAS job monitoring dashboard was affected by a DB problem, resulting in the job history having a few small gaps
  • databases
    • this morning one dashboard application was affected by a change in an Oracle query execution plan, fixed
  • GGUS/SNOW
    • The host certificate update for the Italian ticketing system (ticketing.cnaf.infn.it), announced on Monday, took place this morning and was successful.
    • Next GGUS Release will be in a week, on 2013/02/27.
    • An interface between GGUS and the IBERGRID RT ticketing system will enter production with next week's GGUS release. The change affects PIC. In case of any problem, please open a GGUS ticket against GGUS or comment in Savannah:130314.
  • grid services
    • there was a problem with the batch system dispatching jobs this morning, fixed
  • storage - ntr

AOB:

Thursday

Attendance: local(Alessandro, Alex B, Belinda, Luca M, Maarten, Stefan, Steve, Ueda);remote(Boris, Gareth, Lisa, Marian, Michael, Pepe, Rob, Rolf, Ronald, Saverio, Wei-Jen).

Experiments round table:

  • ATLAS reports -
    • Central services
      • GGUS: problem opening a TEAM ticket for a specific site: GGUS:91634 verified: the site parameters had not been synchronized correctly between the GOC DB and GGUS.
      • GGUS: a shifter could not open tickets due to a CRL issue. GGUS:91610 in progress: the shifter was recommended to temporarily use the account which is mapped to the certificate.
      • SLS for ATLAS HammerCloud unavailable (in grey). Fixed, migration to the new hardware completed.
    • T1s and network
      • File transfer problems to FZK-LCG2 from UK sites: 1340 failures ("GRIDFTP_ER.:server err.500") from UKI-SCOTGRID-GLASGOW and 30 failures from UKI-NORTHGRID-LIV-HEP. GGUS:87958 in progress, updated.
        • Marian: we are also looking into PerfSONAR measurements

  • LHCb reports -
    • Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
    • T0:
      • ALARM ticket (GGUS:91690) for an AFS-hosted web service which is not responding; it serves grid jobs for configuration and setup purposes
        • Stefan: it was due to an accidental DoS by one machine
      • Many failures in CASTOR->EOS migration because of different checksums in LFC and CASTOR
        • Luca: working on it, please send the list of affected files
        • Stefan: OK, ~300 files from 2008, also on tape; the issue is due to the presence or absence of leading zeroes in the checksum (a normalization sketch follows this report)
    • T1: NTR
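
On the checksum mismatches above: Adler-32 checksums are commonly stored as hexadecimal strings, and a value whose leading zero was dropped in one catalogue (e.g. "0a1b2c3d" vs "a1b2c3d") fails a naive string comparison even though the checksums are identical. The sketch below is a hypothetical illustration of such a normalization, not the code used in the CASTOR -> EOS migration.

    # Illustrative sketch: compare Adler-32 checksum strings while tolerating
    # dropped leading zeroes, as in the LFC vs CASTOR mismatch described above.
    import zlib

    def adler32_hex(data: bytes) -> str:
        # Canonical 8-hex-digit form of the Adler-32 checksum.
        return format(zlib.adler32(data) & 0xFFFFFFFF, "08x")

    def checksums_match(a: str, b: str) -> bool:
        # Interpreting both strings as hex integers makes "0a1b2c3d" and
        # "a1b2c3d" compare equal, while different values still differ.
        return int(a, 16) == int(b, 16)

    # A checksum stored with and without its leading zero still matches:
    assert checksums_match("0a1b2c3d", "a1b2c3d")
    assert not checksums_match("0a1b2c3d", "1a1b2c3d")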

Sites / Services round table:

  • ASGC
    • downtime Feb 25 23:00 to Feb 26 18:00 UTC for upgrades of CASTOR, DPM and storage firmware
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - nta
  • NDGF
    • the transfer errors were due to a network problem, things look better now
    • today's SRM maintenance went OK
  • NLT1
    • this morning SARA had an unscheduled outage: dCache was unavailable due to a loose fiber
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • grid services - ntr
  • storage
    • CASTOR DB NAS will have a HW intervention 17:30-21:30 CET, should be transparent

AOB:

Friday

Attendance: local(AndreaS, Kate, Mike, Steve, Belinda, Stefan);remote(Xavier/KIT, Gareth/RAL, Wei-Jen/ASGC, Onno/NL-T1, Michael/BNL, Matteo/CNAF, Lisa/FNAL, Rolf/IN2P3, Rob/OSG, Pepe/PIC).

Experiments round table:

  • ATLAS reports -
    • Central services
      • GGUS: a shifter could not open tickets due to a CRL issue. GGUS:91610 solved: the issue with the certificate was fixed and the shifter is able to open/update GGUS tickets now.
      • CERN/VOMS problems affecting ATLAS production and analysis jobs (GGUS:91704, GGUS:91706, GGUS:91710). Thanks to Maarten for the quick action and the ALARM (GGUS:91706).

  • ALICE reports -
    • CERN: VOMS incident (see below), alarm ticket GGUS:91706 opened yesterday evening ~20:00 [Steve: something similar happened months ago]
    • CERN: EOS lost 17 files, 12 were dark data

  • LHCb reports -
    • Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
    • T0:
      • ALARM ticket (GGUS:91690) for the AFS-hosted web service which was not responding; understood and fixed
    • T1: NTR

Sites / Services round table:

  • ASGC: had a CASTOR crash this morning
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NL-T1:
    • one dCache pool was stuck for six hours last night and was restarted this morning. This has already happened a few times; we hope that a kernel upgrade will fix the problem.
    • Next Monday and Tuesday SURF-SARA will be in maintenance
  • PIC: ntr
  • RAL: on Tuesday morning we declared a downtime "at risk" to reboot a network switch. The effect should be minimal.
  • OSG: ntr
  • CERN batch and grid services: VOMS incident: a wrong host certificate was put in place on voms.cern.ch (see IncidentVOMSFeb2013). The service was broken at 16:10 on Thursday and restored at 07:00 this (Friday) morning (a hostname-verification sketch follows this round table).
  • CERN storage services: ntr
  • Dashboards: ntr
  • Databases: ntr
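
On the VOMS incident above: a routine client-side check like the sketch below flags a wrong host certificate, because TLS hostname verification fails when the presented certificate does not match the server name. This is a hypothetical illustration only; the port 8443 and the assumption that the issuing CA is in the system trust store are not taken from the incident report.

    # Illustrative sketch: check that a service presents a host certificate
    # valid for its hostname (the Friday VOMS incident was a wrong host cert).
    import socket
    import ssl

    def host_cert_ok(hostname: str, port: int) -> bool:
        ctx = ssl.create_default_context()  # verifies the chain and the hostname
        try:
            with socket.create_connection((hostname, port), timeout=10) as sock:
                with ctx.wrap_socket(sock, server_hostname=hostname):
                    pass  # handshake succeeded => certificate matches hostname
            return True
        except (ssl.SSLError, ssl.CertificateError, OSError):
            return False

    # Hypothetical usage; 8443 is an assumed port, not necessarily the VOMS one.
    print(host_cert_ok("voms.cern.ch", 8443))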

AOB:

Topic attachments

  • ggus-data.ppt (PowerPoint, 2680.0 K, 2013-02-18 11:52, MariaDimou): Final ALARM drills for the 2013/02/19 WLCG MB.