Week of 121210

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local (AndreaV, LucaM, David, Peter, Torre, MariaD, Maarten); remote (Onno/NLT1, Paolo/CNAF, Gonzalo/PIC, Wei-Jen/ASGC, Lisa/FNAL, Tiju/RAL, Kyle/OSG, Zeeshan/NDGF, Dimitri/KIT; Daniela/LHCb).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • CERN-PROD: EOS source errors and several periods of EOSATLAS instability in SLS over the weekend. GGUS:89328 [Luca: all was OK until Friday afternoon, then there were a few crashes of the system on Friday, Saturday and Sunday. This is understood to be due to bugs in the xroot layer; a patch is available in a pre-release that will be installed this week. We would also like to do a hardware intervention that may reduce the impact of this bug.]
      • CERN-PROD: ALARM: ATLAS web server down Sat am, response in 10 min, resolution in ~30 min. Due to a power outage. GGUS:89334
    • T1
      • FZK-LCG2: Steady <8% job failure rate due to timeouts saving files to the local SE; logged in the reopened 2/12 ticket GGUS:89110
      • Taiwan-LCG2: Missing file needed for production. Affected by disk maintenance, recovered by site. GGUS:89332
      • PIC: failing source transfers Sat pm. Cured with SRM restart. Site is checking what caused the SRM failures. GGUS:89338

  • CMS reports -
    • LHC / CMS
      • 25 ns program: scrubbing and machine development have started and will last until Dec. 12th
      • 25 ns physics fills are not expected before Dec. 12th
    • CERN / central services and T0
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573, last updated 2012-12-07
    • Tier-1:
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, low level network investigation IN PROGRESS GGUS:88752, last updated 2012-12-06 via Footprints: "setting out for review"
    • Tier-2:
      • NTR

  • LHCb reports -
    • Reprocessing has started
    • Question to SARA: is the intervention affecting only tapes or also disks? [Onno and Maarten: it affects everything]

Sites / Services round table:

  • Onno/NLT1: downtime in progress; tapes will take longer than planned because a vendor intervention was necessary
  • Paolo/CNAF: ntr
  • Gonzalo/PIC: due to electricity costs in Spain during winter, we would like to reduce CPU power to 70% until February; we are in contact with the experiments to discuss this. This is not the same issue discussed last week, which was about a temporary reduction in CPU power due to a cooling intervention this week.
  • Wei-Jen/ASGC: there were some job failures due to disk maintenance, but no data loss
  • Lisa/FNAL: ntr
  • Tiju/RAL: reminder: UPS tests tomorrow; the site is declared at risk
  • Kyle/OSG: ntr
  • Zeeshan/NDGF: ntr
  • Dimitri/KIT: reminder: a few downtimes are planned between the 18th and the end of December (e.g. CREAM CE upgrades); all are declared in GOCDB

  • Luca/Storage: will upgrade the memory of the EOS head nodes and will ask ALICE for a short downtime this week. [Maarten: send an email so we can discuss it offline. This should be possible later this week.]
  • David/Dashboard: ntr
  • MariaD/GGUS: There will be a 'mini' release this Wednesday 2012/12/12, followed by ALARM tests. Detailed development item list here and announcement at https://ggus.eu/pages/news_detail.php?ID=474.

AOB: none

Tuesday

Attendance: local (AndreaV, David, Peter, Maarten, LucaM, Jacob, MariaD, Eva); remote (Wei-Jen/ASGC, Xavier/KIT, Paolo/CNAF, Rolf/IN2P3, Tiju/RAL, Rob/OSG, Lisa/FNAL, Ron/NLT1; Daniela/LHCb).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • NTR, but thanks to the SARA-MATRIX team for tweaking space tokens today; we are working on a DATADISK space issue.

  • CMS reports -
    • LHC / CMS
      • 25ns program started with scrubbing and machine development
      • 25ns physics fills are not expected before Dec. 12th
    • CERN / central services and T0
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573, last updated 2012-12-10
    • Tier-1:
    • Tier-2:
      • NTR

  • LHCb reports -
    • Reprocessing running smoothly.
    • T0:
    • T1:
      • NL-T1: Downtime SARA finished
      • PIC: files with no other replicas were lost from the archive during the migration. Not a big issue; the files were old and were supposed to be deleted anyway.
      • RAL has another T2 attached now: LCG.Krakow.pl

Sites / Services round table:

  • Wei-Jen/ASGC: ntr
  • Xavier/KIT: ntr
  • Paolo/CNAF: ntr
  • Rolf/IN2P3: still in downtime as scheduled; we should manage to be back up tomorrow as announced
  • Tiju/RAL: ntr
  • Rob/OSG: maintenance today; all services will be rebooted, which should be transparent
  • Lisa/FNAL: ntr
  • Ron/NLT1: yesterday's downtime of the tape system took longer than expected and required a vendor intervention due to a bug in the mass storage system; all OK now

  • David/Dashboard: ntr
  • LucaM/Storage: scheduling an EOS intervention with ALICE, waiting for xrootd release
  • Eva/Databases: ntr
  • MariaD/GGUS:
    • new GGUS release tomorrow; ALARM tests will take place (but at RAL's request those in Europe will be a bit later than the usual 7am)
    • Maarten discovered a bug in GGUS: the popup that prevents users from submitting tickets against sites that are down in GOCDB does not work for ALARM and TEAM tickets. A GGUS ticket has been opened.

AOB: none

Wednesday

Attendance: local (AndreaV, Kate, Peter, David, LucaM, MariaD); remote (Gonzalo/PIC, Stefano/CNAF, Wei-Jen/ASGC, John/RAL, Lisa/FNAL, Rolf/IN2P3, Ron/NLT1, Rob/OSG; Daniela/LHCb, Jacob/CMS).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • Please comment on the announcement of next week's General Power intervention [Kate: will check whether the databases are affected by this]
    • T1
      • RAL: Oracle blocking sessions (ELOG)

  • CMS reports -
    • LHC / CMS
      • 25ns machine development: 2 fills then more scrubbing
      • Stable beams for physics fills expected Friday evening or Saturday
    • CERN / central services and T0
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573, last updated 2012-12-11
    • Tier-1:
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, low level network investigation IN PROGRESS GGUS:88752, last updated 2012-12-11
    • Tier-2:
      • NTR

  • ALICE reports -
    • CERN: EOS intervention appears to have gone OK, thanks!

  • LHCb reports -
    • Reprocessing running smoothly.
    • T0:
    • T1:
      • NL-T1: Jobs failing to access files at SARA due to incorrect TURL resolution. Solved quickly. (GGUS:89511)
      • CERN: Pilots still failing at CERN (GGUS:88796, submitted in November, still not resolved). Error: "Invalid CRL: The available CRL has expired" (affects only some WNs); a diagnostic sketch follows this report. Also, VOMS was not responding yesterday (GGUS:89497); today it seems to work fine, though there was no response on the ticket.
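
      A minimal sketch of the check behind the "Invalid CRL: The available CRL has expired" error in GGUS:88796: list the CA CRLs on a node whose nextUpdate time has already passed. The /etc/grid-security/certificates path, the *.r0 file naming used by fetch-crl and the use of the Python cryptography package are illustrative assumptions, not a description of how the affected WNs were actually diagnosed.

          # Hypothetical sketch: report CA CRL files whose nextUpdate time has passed.
          import glob
          from datetime import datetime
          from cryptography import x509

          def expired_crls(directory="/etc/grid-security/certificates"):
              """Return the CRL files in 'directory' that are past their nextUpdate time."""
              expired = []
              for path in glob.glob(directory + "/*.r0"):  # fetch-crl stores CRLs as <hash>.r0
                  with open(path, "rb") as f:
                      data = f.read()
                  try:
                      crl = x509.load_pem_x509_crl(data)
                  except ValueError:
                      crl = x509.load_der_x509_crl(data)
                  # next_update is a naive UTC datetime in the cryptography package
                  if crl.next_update is not None and crl.next_update < datetime.utcnow():
                      expired.append(path)
              return expired

          if __name__ == "__main__":
              for path in expired_crls():
                  print("expired CRL:", path)

      On a real WN one would normally rely on fetch-crl and its log output; the sketch only illustrates what the error message means.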

Sites / Services round table:

  • Gonzalo/PIC: ntr
  • Stefano/CNAF: over the next few days the experiments may observe a decrease in job slots because of WN reconfiguration with the new file system
  • Wei-Jen/ASGC: ntr
  • John/RAL: the downtime mentioned by ATLAS is actually not over yet; there are still some issues being investigated
  • Lisa/FNAL: ntr
  • Rolf/IN2P3: still in downtime but restarting now; we are around 1h behind schedule and should be back at 5pm instead of the expected 4pm
  • Ron/NLT1: ntr
  • Rob/OSG: ntr

  • LucaM/Storage: EOS ALICE intervention completed; all OK, though it took longer than expected
  • David/Dashboard: ntr
  • Kate/Databases: ntr

  • MariaD/GGUS:
    • GGUS:89484, mentioned yesterday, about the missing functionality for TEAM and ALARM tickets (the site availability status is not shown), will be handled in the development tracker for proper testing and entry into production at the next release.
    • The release caused some email warnings in German from GGUS notifications. This is being followed up in Savannah:134078#comment9 and more recent comments; we are negotiating the need for further testing (see https://ggus.eu/ws/ticket_info.php?ticket=89558#update#8). This was solved before noon, as confirmed in GGUS:89586.

AOB: none

Thursday

Attendance: local (Simone, Kate, Edi, Luca); remote (Wei-Jen/ASGC, Peter Love/ATLAS, Kyle/OSG, Zeeshan/NDGF, Dimitrios/RAL, Marc/IN2P3-CC, Daniela/LHCb, Jacob/CMS, Stefano/CNAF, Ronald/NL-T1, Marian/KIT, Lisa/FNAL).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • If no information to the contrary, we will assume central services are not affected by power intervention next week.
    • T1
      • NTR

  • CMS reports -
    • LHC / CMS
      • 25ns machine development: injecting and ramping bunches, but no more collisions before physics fills
      • 25ns physics program scheduled to start midnight Friday night
    • CERN / central services and T0
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573, last updated 2012-12-11
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • LHCb reports -
    • Reprocessing: running the last jobs at the RAL, GridKa and CNAF "groups"; restarting with new files on 10 Dec
    • Prompt reconstruction: CERN + 5 Tier2 sites
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0: NTR
    • T1:
      • FTS transfer failures to Gridka disk from different sites (GGUS:88906)

Sites / Services round table:

  • OSG: BDII back to normal
  • RAL: one disk server still in intervention till tomorrow (ATLAS)
  • IN2P3: downtime finished, all services back in production.
  • KIT: upgrade of SGE next week (rolling and transparent). Old CEs will be drained on Sunday

AOB:

Friday

Attendance: local (AndreaV, Kate, Maarten, LucaM, Alexandre); remote (Xavier/KIT, Lisa/FNAL, Onno/NLT1, Marc/IN2P3, Paolo/CNAF, Wei-Jen/ASGC, Rob/OSG, Gareth/RAL, Christian/NDGF; Peter/ATLAS, Daniela/LHCb, Jacob/CMS).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • PIC SRM queue issues; seems OK in the last hour

  • CMS reports -
    • LHC / CMS
      • 25ns physics program still scheduled to start midnight tonight, but probably will be delayed until Saturday
    • CERN / central services and T0
      • CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573, last updated 2012-12-11
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • LHCb reports -
    • Reprocessing running smoothly. All reprocessing for 2012 submitted.
    • T0:
    • T1:
      • PIC: Jobs failing to access data due to TURL resolution errors (GGUS:89664). Reason: SRM instabilities, with a huge queue of get requests from different experiments. The maximum queue length was increased and the SRM restarted; the problem was solved quickly.

Sites / Services round table:

  • Xavier/KIT: migration of batch system is progressing as planned
  • Lisa/FNAL: ntr
  • Onno/NLT1: ntr
  • Marc/IN2P3: ntr
  • Paolo/CNAF: ntr
  • Wei-Jen/ASGC: ntr
  • Rob/OSG: ntr
  • Gareth/RAL: ntr
  • Christian/NDGF: we had to extend the downtime at the BCCS site for the ARC upgrade
  • Gonzalo/PIC (via email after the meeting):
    • We acknowledge the reported issue we are experiencing with SRM overloads. Our operator on duty and our experts are keeping a close eye on it to minimise the possible impact in case a new episode occurs. For the moment we have managed to bring the SRM back to a stable situation quickly. We are in contact with dCache experts to try to better understand the ultimate origin of these issues with the SRM queues.
    • We have also been in contact with LHCb regarding an issue we had with an archive tape whose data was accidentally lost during a repack procedure. In the end it seems that only 2 of the 930 deleted files had their only replica at PIC. A SIR has been filed and uploaded to the WLCG wiki.

  • LucaM/Storage: ntr
  • Kate/Databases: problem with CMS Data Guard today due to shared memory; a fix was applied and all looks OK now
  • Alexandre/Dashboard: an intervention for the WLCG transfer dashboard is planned for Monday; it should be transparent

AOB: none

-- JamieShiers - 18-Sep-2012
