Week of 120903

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Massimo, Doug, Maarten, Philippe, Marcin, Ivan);remote(Michael, Saverio, Elisabeth, Onno, Stephane, Jhen-Wei, Thierry, Tiju, Gonzalo, Vladimir, Dmitri, Rolf, Ian).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • GGUS 85704 (ALARM), Friday night, 2 issues:
        • Files produced by T0 were migrated to TAPE too slowly, causing a backlog. The tape buffer was increased (100 TB) during the night and, on Saturday, CERN-IT throttled down tape allocations to ALICE, dedicating drives to the other VOs. The backlog was cleared in a few hours. Solved.
        • One disk server went down at the same time, so 161 files could not be accessed for TAPE migration or data export. Files provided by SFO were recovered. For merged RAW and derived data produced by T0, ATLAS is waiting for the disk server to come back (no very urgent processing was requested by ATLAS for this weekend). Not solved yet.
    • T1
      • FZK: GGUS 85721: ATLAS files could not be staged in from TAPE starting Saturday. Should have been solved this morning.

  • ALICE reports -
    • CERN: migrating conditions data to the corresponding name space in EOS.
    • CNAF: debugging Torrent method for SW installation.

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • New GGUS (or RT) tickets
    • T0:
      • Constant rate of aborted pilots observed during last week (GGUS:85385); Solved
    • T1 :
      • IN2P3: Job failures, cannot load shared library, site was banned for production (GGUS:85644); Under investigation
      • GRIDKA: Pilots failed at two CEs (GGUS:85716)
Sites / Services round table:
  • ASGC: ntr
  • CNAF: ntr
  • IN2P3: ntr
  • KIT: downtime "at risk" (17-SEP-2012 5:00-7:30) to intervene on the LAN. Short interruptions possible.
  • NDGF: Same problem as Friday happened a second time (power). Fixed this morning
  • NLT1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: Working on recovering the node blocking ATLAS files. The "ALARM" part of the same ticket (slow migration to tape) is actually solved. Root cause under investigation.
  • Central Services: NTR
  • Data bases: The roll-out of security patches started today and will continue instance by instance. In general this should be transparent unless stated otherwise.
  • Dashboard: Problem in making the SAM test results available. The system should be back tomorrow (it seems connected to a change of execution plan in the database backend).
AOB:

Tuesday

Attendance: local(Jamie, Steve, Stephen, Kate, Doug, Ivan, Maarten);remote(Michael, Elizabeth, Rolf, Gonzalo, Saverio, JT, Xavier, Lisa, Thierry, Jhen-Wei, Tiju, Salvatore).

Experiments round table:

  • ATLAS reports -
    • CERN/T0 Nothing to report
    • T1
      • FZK: Timeout issues staging from tape (GGUS 85721). A few errors (3) in the last 12 hours.
    • ATLAS internal
      • RAL: file deletion changed to speed up clean-up of SCRATCH disk
      • TRIUMF LFC migration to CERN / Central deletion stopped

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Transfers to Tier-1s were interrupted at the weekend due to heavy user load, GGUS:85720. Should improve once users have moved to Tier-2 facilities.
    • Tier-1/2:
      • NTR

  • ALICE reports -
    • CNAF: running OK with Torrent since yesterday evening.

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • T0:
    • T1 :
      • IN2P3: Job failures, cannot load shared library, site was banned for production (GGUS:85644); Wrong version of tcmalloc with link to AFS and wrong version of gcc
      • GRIDKA: Pilots failed at two CEs (GGUS:85716); Still failing
      • PIC: Pilots aborted (GGUS:85746); Medium queues were removed, but they are still in BDII

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NTR
  • FNAL: NTR
  • IN2P3: NTR
  • KIT: NTR
  • NDGF: Still problems in Denmark, the same electricity problem; electricians are on site; some pools will probably have to be taken down during the night
  • NLT1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS:
  • Central Services: ntr
  • Data bases: Applying the latest Oracle security patch to the integration DBs, all of which should be done today; production in 2-3 weeks
  • Dashboard: ntr
AOB:

Wednesday

Attendance: local(Massimo, Steve, Marcin, Julia, Doug, Ivan, Maarten, Maria);remote(Michael, Elizabeth, Rolf, Gonzalo, Saverio, Dmitri, Lisa, Thierry, Jhen-Wei, Tiju, Ron, Vladimir).

Experiments round table:

  • ATLAS reports -
    • CERN/T0 Nothing to report
    • T1
      • Nothing to report

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • Security Challenge/Training launched yesterday. Many sites informed of two "malicious" DNs.

  • LHCb reports -

    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • New GGUS (or RT) tickets
    • T0:
    • T1 :
      • IN2P3: Job failures, cannot load shared library, site was banned for production (GGUS:85644); Wrong version of tcmalloc with link to AFS and wrong version of gcc; Fixed
      • GRIDKA: Pilots failed at two CEs (GGUS:85716); Probably fixed
      • PIC: Pilots aborted (GGUS:85746); Medium queues were removed, but they are still in BDII; Fixed

Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: The problem seen by LHCb was more on the WN side. It looks like it has been fixed.
  • NDGF: Power problems (DK) seem to be fixed after the electricians' intervention
  • NLT1: ntr
  • PIC: ntr; Maarten commented that when a CE is removed it stays in state UNKNOWN in BDII for 12h: software should take this behaviour into account
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: vobox rebooted on request of CMS
  • Central Services: 10% of the batch nodes have been upgraded to the latest gLite 3.2. The remaining 90% will be upgraded next week.
  • Data bases: ntr
  • Dashboard: ntr
AOB:

  • On Tuesday 18 September the daily operations meeting will be filmed as part of a TV documentary.

Thursday

No meeting (Jeune Genevois)

Experiments summaries:

Friday

Attendance: local(Massimo, Steve, Marcin, Ivan, Stephen);remote(Michael, Rolf, Gonzalo, Saverio, Lisa, Jhen-Wei, John, Alexandre, Vladimir, Kyle, Xavier).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • Problems with EOS overnight (GGUS:85919); after a headnode restart things seem fine
    • T1
      • Nothing to report

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • Running user analysis, prompt reconstruction and stripping at T0 and T1s
    • Simulation at T2s
    • Validation of reprocessing at T1s and selected T2s
    • New GGUS (or RT) tickets
    • T0:
    • T1 :
Sites / Services round table:
  • ASGC: Tape library problem. Vendor call opened.
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: 40k files overdue to tape. Expected to be solved at the beginning of next week.
  • NLT1: ntr
  • PIC: ntr
  • RAL:
    • Switch problem yesterday. Some jobs (and some transfers) were lost
    • New tapes for LHCb
    • CASTOR LHCb will be upgraded next week (2.1.12)
  • OSG: ntr

  • CASTOR/EOS: ATLAS requested the status of missing files on EOS; investigation ongoing (ticket GGUS:85933 et al.)
  • Central Services:
  • Data bases:
  • Dashboard:
AOB:

-- JamieShiers - 02-Jul-2012

Topic revision: r14 - 2012-09-07 - MassimoLamanna
 