Week of 110606

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local (AndreaV, Dirk, Massimo, Lola, Manuel, Jan, Jarka, Alessandro, Michal, Peter, Lukasz, MariaDZ); remote (Tiju, Ron, Rob, Brian, Rolf, Xavier, Ulf, Daniele, Jon, Michael, Gonzalo, Felix; Ian, Roberto).

Experiments round table:

  • ATLAS reports -
    • RAL-LCG2: Disk server gdss135 unavailable. Site investigates. "this disk server is part of the tape buffer and therefore no files are unavailable. The worst this error could cause is a few transient errors in tape recalls."
    • FZK-LCG2: 2 tape servers broken, ~190 files lost. Files declared as lost to ATLAS DDM Ops.
    • CERN-PROD: CERN-PROD_TMPDATADISK failed to contact on remote SRM GGUS:71164. GGUS solved (2nd Jun): The SRM server for EOSATLAS locked up at 4:00 this morning (cron.daily time). Unfortunately monitoring was still working to the point that the machine didn't get immediately rebooted, this only happened around 9:20. From the plots it looks like things were OK afterwards. We don't really know why SRM decided to act like this (logs are incomplete), so this could happen again (still, we've changed some settings for updatedb and log rotation).
    • CERN-PROD: Connection refused to CERN-PROD_DATATAPE GGUS:71216, evaluated as problem with data import into EOS (4th June). Under investigation.
    • CERN-PROD: destination file failed on the SRM with error [SRM_ABORTED] at CERN-PROD_DATADISK, GGUS:71226 (5th June). Transfers finished after several attempts; ticket closed by the ATLAS shifter.
    • CERN-PROD_TMPDATADISK: GGUS:71227, unable to connect to gdss352.gridpp.rl.ac.uk (RAL). It's a problem with CERN-PROD_TMPDATADISK with a misleading error. [Tiju: that machine was unavailable for just one night.]
    • [Jarka: ADCR database was down today, should be ok now, any comment?]
    • [Michal: problems with dashboard tests at ASGC. Felix: following up.]
    • [Andrea: high load on kerberos service at CERN due to access from POOL tools by some ATLAS users, under investigation. This could be due to a bug in xrootd, bug #82793. A ROOT patch is available and will be tested to see if it solves the issue.]

  • CMS reports -
    • LHC / CMS detector
      • Good Weekend for the LHC
    • CERN / central services
      • NTR
    • Tier-0 / CAF
      • Problem Saturday morning with the export disk buffer at P5: a network switch was lost, which put the system in an inconsistent state. Stabilized and recovered, but working to make the impacted data useful.
      • Backlog in copying files from Point 5 to CASTOR, GGUS:71047. CMS has combined 2 streams; no backlog reported over the weekend.
    • Tier-1
      • MC re-reconstruction ongoing. Restarting large scale simulation production with pile-up.
    • Tier-2
      • MC production and analysis in progress
    • AOB :
      • New CRC Ian Fisk until June 14

  • ALICE reports -
    • General information: Production went well during the long weekend, with ~30k jobs running. Pass-0 and Pass-1 reconstruction ongoing, plus a couple of Monte Carlo cycles. No issues to report.
    • T0 site: ntr
    • T1 sites: ntr
    • T2 sites: usual operations

  • LHCb
    • Experiment activities:
      • 23 pb-1 taken in 24 hours.
      • Overlap of many activities going on concurrently in the system.
      • Reprocessing (Reconstruction + Stripping + Merging) of all data taken before May's technical stop (about 80 pb-1) using new application software and conditions (~80% complete).
      • First pass reconstruction of new data + Stripping + Merging
      • Tails of old reconstruction productions using the previous version of the application (Reco09) are almost completed.
      • MC productions at various non-T1 centres.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services:
      • T0
        • CERN: the additional disk servers in the LHCb-Tape space token boosted the stripping jobs, which are now running smoothly at CERN.
      • T1
        • SARA: still a huge backlog of stripping jobs due to the low rate of pilot job submission. Decided to move to direct submission, bypassing the gLite WMS and the ranking expression.
        • SARA: reported some issues staging files, most likely a tape driver problem. The local contact person is in touch with Ron and co.
        • RAL: backlog drained by moving to direct submission. Many jobs were exiting immediately because their data had been garbage collected; asked RAL folks to use an LRU (Least Recently Used) policy (a minimal sketch of the idea follows this report). [Tiju: following up with local experts about the policy.]
      • T2: ntr
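
A sketch of the LRU idea mentioned above: under a least-recently-used policy the garbage collector evicts the files that have gone unused for the longest time, so inputs a stripping job has just recalled are the last candidates for eviction. The toy Python model below is purely illustrative; the class name and interface are assumptions for this sketch, not the actual RAL/CASTOR garbage collector.

    import time

    class LRUDiskCache:
        """Toy model of a least-recently-used disk buffer (illustrative only)."""

        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.used = 0
            self.files = {}  # path -> (size_bytes, last_access_time)

        def access(self, path, size_bytes):
            """Record a read or a fresh staging; evict old files if space runs out."""
            if path not in self.files:
                self.used += size_bytes
            self.files[path] = (size_bytes, time.time())
            self._evict_if_needed(protect=path)

        def _evict_if_needed(self, protect):
            # Evict the least recently used files first, never the file just touched.
            while self.used > self.capacity:
                candidates = [p for p in self.files if p != protect]
                if not candidates:
                    break
                victim = min(candidates, key=lambda p: self.files[p][1])
                size_bytes, _ = self.files.pop(victim)
                self.used -= size_bytes

This is the behaviour the drained RAL backlog relies on: recently staged job inputs survive, whereas a policy based purely on creation time could remove them before the jobs run.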

Sites / Services round table:

  • Dashboard services: migration of ATLAS DDM dashboard to new HW this week (3 days), should be transparent.

AOB: none

Tuesday:

Attendance: local(Michal, Lukasz, Maria, Jamie, MariaDZ, Oleg, Dirk, Massimo, Alessandro, Peter, Maarten, German, Eva, Ian);remote("can't login to alcatel").

Experiments round table:

  • ATLAS reports - ATLR dashboard shows many blocking sessions for application ATLAS_COOLOFL_SCT_W@AtlCoolCopy.exe@pc-sct-www01. [Eva: this was caused by ATLAS offline sessions which were blocking each other; the DBAs killed the sessions and it is now OK. Ale: should ATLAS do this themselves next time? Eva: you must contact Marcin.]
  • Ale: Kerberos denial-of-service issue - some updates in Savannah; some users are not sure what to do. [Tim: we are seeing about 3-10 users per day launching a DoS attack on the CERN Kerberos server, at around 600-700 requests per second, and have to manually follow up and kill the sessions. We can protect the service, but this requires disabling the user for a certain period of time, which is clearly undesirable. The root cause is not yet clear; from the service perspective we only see the effects. Dirk: some analysis points to the xrootd security plug-in, but at the moment there is no concrete statement. A commit to the code quite some time ago took out an infinite loop in the xrootd security plug-in, so this may not be "new" code. IT/DSS and the ROOT team are involved in looking for the problem. Ale: ATLAS is trying to understand the use cases; for now they all appear to be legitimate. bug #82790]

  • CMS reports -
    • LHC / CMS detector
      • Continued Running
    • CERN / central services
      • NTR
    • Tier-0 / CAF
      • Caught up and back to normal running
    • Tier-1
      • MC re-reconstruction ongoing.
      • Lower-quality transfers from CERN to FNAL: still succeeding, but with some recurring failures.
      • ASGC SRM issue, now resolved.
    • Tier-2
      • MC production and analysis in progress

  • ALICE reports -
    • T0 site - Nothing to report
    • T1 sites - Nothing to report
    • T2 sites - Poznan: delegation to CREAM-CE not working. GGUS:71261

  • LHCb reports - Main activities: stripping of reprocessed data, plus reconstruction and stripping of newly taken data. MC activities.
    • T0
      • CERN: tape system with a huge backlog, the reason why no migration has taken place since yesterday afternoon (GGUS:71268). [German: actually two problems. A misconfiguration in the migration policies, unveiled by a broken tape, caused some of the LHCb migrations to be stuck from 09:00. The other aspect is an upgrade/reconfiguration of the tape servers which caused a snowball effect, leaving us short of tape drives; we had to reduce some internal activity such as repack. Tape drives are now 90% available to the experiments, and the problem with the migration policy is now understood; trying to fix it.]
    • T1
      • SARA: moved to direct CREAM submission. Part of the waiting jobs redirected to NIKHEF; still a backlog to be drained. The problem with staging files has been fixed.
      • RAL: moved to LRU policy for GC

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • ASGC - our SAM tests should pass again today; the failures were apparently due to an expired CRL. We fixed the cron table today, so it should be fine now, thanks (a minimal CRL freshness check is sketched after this round table).
  • IN2P3 - ntr
  • RAL - ntr
  • NL-T1 - ntr
  • CNAF - ntr
  • OSG - ntr
  • NDGF - ntr

  • KIT (by email)
    • 1 host with 5 ATLAS disk-only pools went down yesterday around 17:45 local time. Unfortunately, this was not noticed until 08:00 today, because the issues were not severe enough to trigger the on-call duty during the night. After the host was rebooted, the pools came back online too.
    • The intervention during the downtime of gridka-dcache.fzk.de was successful. We changed the configuration such that LHCb is no longer able to write outside of space tokens. As a consequence, transfers that do not explicitly specify a space token will fail.

  • CERN DB - yesterday we had a problem with the ATLAS ADCR offline DB: it hung around 13:30 and was fully recovered at 14:00. The problem started in the ASM instance, triggered by two disk failures within a few hours. We are looking at the hardware to see what is causing the disk failures, as the rate is high compared to other systems. We took some extra logs to send to Oracle and are continuing the investigation.
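
For reference on the ASGC CRL item above: fetch-crl is normally run from cron, and the nextUpdate field of each installed CRL can be checked with openssl before SAM notices the expiry. The Python sketch below is a hedged illustration only; the certificates directory is the conventional grid location and may differ per site, and it assumes the openssl CLI is available.

    import glob
    import subprocess
    from datetime import datetime, timezone

    # Conventional location of CA files and CRLs on grid service nodes (may differ per site).
    CERTDIR = "/etc/grid-security/certificates"

    def crl_next_update(path):
        """Return the nextUpdate timestamp of a CRL file, using the openssl CLI."""
        out = subprocess.check_output(
            ["openssl", "crl", "-in", path, "-noout", "-nextupdate"]
        ).decode()
        # Output looks like: nextUpdate=Jun  9 10:30:00 2011 GMT
        stamp = " ".join(out.split("=", 1)[1].split())
        return datetime.strptime(stamp, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)

    def expired_crls():
        """List the CRLs (hash.r0 files) whose nextUpdate is already in the past."""
        now = datetime.now(timezone.utc)
        return [p for p in glob.glob(CERTDIR + "/*.r0") if crl_next_update(p) < now]

    if __name__ == "__main__":
        for path in expired_crls():
            print("expired CRL:", path)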

AOB: (MariaDZ) A presentation on the GGUS fail-safe system development will be given at tomorrow's GDB at 10:30am CEST. Agenda with EVO connection data on: https://indico.cern.ch/events.py?tag=GDB-June11. Development details on Savannah:113831#comment0 .

  • IN2P3 - yesterday someone from GridPP asked for advice on the FTS configuration. Could the contact info for GridPP be put into the minutes? Contact: Brian Davies.

  • FNAL - comment for KIT: FNAL networking people are trying to contact the KIT networking people. Please can you make sure that they are in contact on your end?

Wednesday

Attendance: local(Peter, Michal, Lukasz, Maria, Jamie, Massimo, Andrea V, Luca, Alessandro, Ian);remote(Michael, Xavier, Jon, Onno, Ulf, John, Rolf, Jhen-Wei, Foued, Rob).

Experiments round table:

  • ATLAS reports -
    • Central Services:
      • No US sites in GGUS TEAM ticket form, GGUS:71348
    • T1:
      • PIC LFC/FTS scheduled downtime, cloud offline in ATLAS

  • CMS reports - nothing significant to report, all running smoothly. Recovered fine from the FNAL downtime. The downtime calendar fed from GOCDB into Google Calendar shows a CERN downtime until 26 May 2026; there must be a problem somewhere. Please fix it to avoid confusing shifters.

  • LHCb reports - Experiment activities: data processing/reprocessing activities are proceeding almost smoothly everywhere, with a failure rate of less than 10%, mostly due to input data resolution.
    • T0
      • NTR
    • T1
      • NL-T1: a rolling intervention to update CVMFS reduced capacity, with the number of running jobs falling to 500. Now improving. The stripping backlog there has not yet been fully recovered.
      • PIC: in downtime today. Over the last days they received a bit too many jobs compared to the real capacity of the site (they were well behaving in the past), so a backlog of activities has formed there too.

Sites / Services round table:

  • BNL - ntr
  • KIT - this morning we had some problems with shared s/w area - now problem has been fixed and shared s/w area and home directories available again
  • FNAL - yesterday was the hottest day on record since 2006. Had a 2-hour cooling outage which affected jobs, but all restarted automatically at the end. Expecting an even hotter day today!
  • PIC - reminder: the PIC downtime is not just LFC and FTS but a global downtime. Also a reminder to the experiments about the decommissioning of the LCG CEs, which from now on will be converted to CREAM.
  • NL-T1 - ntr
  • NDGF - ntr
  • RAL - ntr
  • IN2P3 - ntr
  • ASGC - downtime for network construction this Friday. 06:00 - 09:00 UTC
  • CNAF - ntr
  • OSG - ntr

  • CERN - AT RISK for the site: an intervention on CASTOR public and other instances is foreseen for next week.

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

  • PIC: OPN transfer failures from 20:00 on 08-06-2011 to 08:30 on 09-06-2011:
    • During the scheduled downtime the SRM headnode was migrated to new hardware with a new IP address. The firewall rules were updated, but an ACL for this new IP was not set at the router level, preventing OPN traffic from flowing. The Manager on Duty saw traffic during the night from T2s, some T1s (IN2P3, SARA and NDGF) and functionality tests (lcg-cp's, etc.) and thought the problem was related to a DNS freshness issue. The real cause was not identified until this morning, when the network expert discovered the missing ACL at router level for the OPN VLAN.
    • Related actions: a GGUS ticket was opened to investigate the non-OPN route being used from NDGF, IN2P3 and SARA to PIC (a minimal route check is sketched below).
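
One way to catch this class of problem earlier is to verify which path a transfer actually takes: a traceroute from the storage headnode towards the remote OPN endpoints should stay within the expected LHCOPN address space. The Python sketch below is illustrative only; the prefix (a documentation range) and the remote hostname are placeholders, not PIC's real configuration, and it assumes the traceroute command is installed.

    import ipaddress
    import re
    import subprocess

    # Placeholder prefix list: substitute the LHCOPN prefixes your T1 traffic should traverse.
    OPN_PREFIXES = [ipaddress.ip_network("198.51.100.0/24")]  # hypothetical example

    def hops_to(host):
        """Return the IPv4 addresses seen by traceroute on the way to `host`."""
        out = subprocess.check_output(["traceroute", "-n", host]).decode()
        return re.findall(r"\d+\.\d+\.\d+\.\d+", out)

    def uses_opn(host):
        """Heuristic check: does at least one hop fall inside the expected OPN prefixes?"""
        return any(
            any(ipaddress.ip_address(h) in net for net in OPN_PREFIXES)
            for h in hops_to(host)
        )

    if __name__ == "__main__":
        for peer in ["srm.example-t1.org"]:  # hypothetical remote SRM headnode
            print(peer, "via OPN" if uses_opn(peer) else "NOT via OPN")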

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 31-May-2011
