Week of 120430

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

GGUS Information (GgusInformation), LHC Machine Information, CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs, Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Alessandro, Alexei, Luca M, Maarten, Maria D, Pablo, Raja);remote(Giovanni, Ian, Jhen-Wei, Lisa, Michael, Rob, Roger, Tiju).

Experiments round table:

  • ATLAS reports -
    • ATLAS/LHC: beam back as of early this morning. Stable beams were expected as of Sunday afternoon.
      • Sun Apr 29: stable beams not earlier than (NET) Monday, most probably Tuesday morning.
    • ASGC re-included in all the activities, including RAW data export (SantaClaus)
    • Very urgent HLT reprocessing during the afternoon; no issues to report, almost all jobs started immediately (average waiting time 8 min, max 23 min).
    • About the wget issue: it has to be handled internally within ATLAS, i.e. the Squid should be tested with a different configuration (see the sketch after this report); to be discussed with ATLAS central services on Monday.
    • Nothing else to report as of 20:00 on Saturday.
    • Fast reprocessing ~done; datasets are replicated to T1s and T2s.
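
    For illustration only, not the ATLAS procedure: a minimal sketch of checking that a download actually goes through a Squid proxy, assuming a hypothetical proxy host and test URL (neither is named in the report).

      # Illustrative sketch only: fetch a file through a Squid proxy and check
      # whether Squid's cache headers are present. Proxy host and URL are
      # placeholders, not the actual ATLAS configuration.
      import urllib.request

      PROXY = "http://squid.example.org:3128"          # hypothetical Squid frontend
      TEST_URL = "http://repo.example.org/some/file"   # hypothetical file behind the proxy

      opener = urllib.request.build_opener(
          urllib.request.ProxyHandler({"http": PROXY})
      )

      with opener.open(TEST_URL, timeout=30) as response:
          data = response.read()
          # Squid normally adds X-Cache / Via headers; their absence suggests the
          # request bypassed the proxy or the proxy is misconfigured.
          print("X-Cache:", response.headers.get("X-Cache"))
          print("Via:", response.headers.get("Via"))
          print("bytes received:", len(data))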

  • CMS reports -
    • LHC machine / CMS detector
      • Expect low pile-up runs sometime tonight
    • CERN / central services and T0
      • Systematic lowering of lxbatch availability over the weekend; it dropped to the high eighties in percent. Was there maintenance?
        • Alessandro: ATLAS observed no issues with jobs at CERN during the weekend
        • Raja: also normal for LHCb
        • Maarten: and for ALICE
        • Ian: as CMS had low job activity, maybe the overall fraction of issues for all experiments combined exceeded a threshold?
    • Tier-1/2:
      • Job Robot failures over the weekend; some fixed, some ongoing. It looks like it could be a problem with the test infrastructure.
        • Maarten: would that rather be HammerCloud instead of Job Robot?
        • Ian: yes
    • Other:
      • Ian Fisk is CRC until tomorrow, then Stephen Gowdy

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • CPU-intensive workflows successful at Tier-2s.
    • New GGUS (or RT) tickets
    • T0
    • T1
      • GridKa
        • Problem with the LFC (GGUS:81734) from 13:00 on 29 April. Problem solved this morning.
          • Q: Not clear why the ticket was assigned to GridKa only at 05:20 this morning.
          • Q: Not clear why GridKa's internal monitoring did not pick this up earlier - the problem was found via failing LHCb jobs.
          • A ticket requesting that the LFC probes be introduced into the LHCb_CRITICAL profile (GGUS:80753) has been open for 1 month. Still waiting for action on it.
            • Maarten: seems to depend on an upcoming SAM-Nagios release; Stefan is in the loop
      • PIC
        • Possible problem with queues / scaling of some nodes: many jobs are failing repeatedly due to lack of wall time before eventually succeeding (see the scaling sketch after this report).
          • In contact with the LHCb site contact; a GGUS ticket will be opened if this cannot be resolved internally.
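
    For illustration only, not the actual PIC or LHCb configuration: a small sketch of the kind of CPU-power scaling that batch wall-time limits typically involve, with made-up HEP-SPEC06 numbers.

      # Illustrative sketch only: a job's required wall time scales inversely with
      # the per-core power of the worker node, so the same job can fit under a
      # queue limit on a fast node and exceed it on a slow one. All numbers are
      # invented for the example.
      REFERENCE_HS06_PER_CORE = 10.0   # assumed benchmark of a "reference" core
      QUEUE_WALLTIME_HOURS = 48.0      # assumed queue wall-time limit

      def required_walltime(job_cpu_hours_ref, node_hs06_per_core):
          """Wall time (hours) the job needs on a node of the given per-core power."""
          return job_cpu_hours_ref * REFERENCE_HS06_PER_CORE / node_hs06_per_core

      # A job needing 40 reference CPU-hours fits on a 10-HS06 core (40 h < 48 h)
      # but exceeds the same limit on a slower 7-HS06 core (~57 h > 48 h), so it
      # fails repeatedly until it lands on a fast enough node.
      for hs06 in (10.0, 7.0):
          hours = required_walltime(40.0, hs06)
          print(hs06, round(hours, 1), "OK" if hours <= QUEUE_WALLTIME_HOURS else "killed")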

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • NDGF
    • some CE failures during the weekend
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW - ntr
  • storage - ntr

AOB:

Tuesday - No meeting - May Day holiday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Wednesday

Attendance: local(Jamie, Jhen-Wei, Luca C, Maarten, Maria D);remote(Dimitri, Giovanni, Gonzalo, John, Lisa, Michael, Raja, Rob, Roger, Rolf, Ron, Stephen).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • SARA-MATRIX: transfer errors to disk and tape; site excluded from T0 export. The site is dealing with this (GGUS:81786).
        • Ron: the main Chimera node had a high load due to various cron jobs; after they were killed, the particular error messages no longer occurred. The remaining errors were due to attempts to overwrite existing files, which our dCache configuration does not allow.

  • CMS reports -
    • LHC machine / CMS detector
      • Physics run ongoing
    • CERN / central services and T0
      • Yesterday some online testing of the HLT was blocked by an unresponsive web server (GGUS:81783).
      • The HLTMON Express stream was running at 900 Hz overnight (instead of O(30 Hz)), which caused a backlog of express jobs to process. Recovered after FNAL operations called Point 5; we caught up by morning (see the estimate after this report).
    • Tier-1/2:
      • Some network problems between CERN and KNU (Korea), which was recently connected to LHCONE (Savannah:128323): cannot connect to lcgvoms or myproxy. Should we open a GGUS ticket against CERN? [Has now been fixed]
    • Other:
      • NTR
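
    A back-of-the-envelope illustration only, with an assumed overnight window (the report does not give exact times): at 900 Hz instead of O(30 Hz) the stream accumulates roughly 30 times the nominal number of events.

      # Rough illustration of the backlog: event counts at the observed vs nominal
      # rate over an assumed 8-hour overnight window (the actual duration is not
      # stated in the minutes).
      NOMINAL_RATE_HZ = 30
      OBSERVED_RATE_HZ = 900
      HOURS = 8  # assumed

      nominal_events = NOMINAL_RATE_HZ * 3600 * HOURS    # ~0.9 million events
      observed_events = OBSERVED_RATE_HZ * 3600 * HOURS  # ~26 million events
      print("excess factor:", OBSERVED_RATE_HZ / NOMINAL_RATE_HZ)          # 30.0
      print("extra events to process:", observed_events - nominal_events)  # ~25 million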

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data taken overnight
    • New GGUS (or RT) tickets
    • T0
    • T1
      • PIC
        • Possible problem with queues / scaling of some nodes: many jobs are failing repeatedly due to lack of wall time before eventually succeeding.
          • GGUS ticket opened (GGUS:81814). Still trying to pin down the details of the problem.
      • SARA
        • Worker node with a possibly broken CVMFS mount (GGUS:81787); remounted just now.
        • Jobs failing to resolve input data; investigating whether it is an issue on the LHCb side.
      • RAL
        • SRM not visible from outside RAL (GGUS:81816)
        • Prompt reconstruction failing because the Squid proxy cache is not refreshing itself; in contact with Catalin at RAL about it.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • FTS Globus version was patched to re-enable the use of GridFTP v2
      • Maarten: that patch should be applied on every FTS instance; PIC has already had it for a few days; this will be discussed further during tomorrow's T1 service coordination meeting; contact FTS support if you want the details sooner
  • KIT - ntr
  • NDGF
    • 1 tape pool had a failure earlier today, fixed now
  • NLT1 - nta
  • OSG - ntr
  • PIC
    • LHCb jobs appear to be using more resources than expected; for now the wall time limit was increased accordingly; still under investigation
      • Raja: Philippe has provided further input now
  • RAL
    • looking into the issues reported by LHCb

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW
    • for tomorrow's T1 service coordination meeting please report outstanding GGUS ticket issues to Maria D
  • grid services
    • The 87% SLS availability of LXBATCH over the weekend is understood to be due to a faulty monitoring script, now corrected (see the sketch below).
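
    For illustration only (the actual SLS script is not shown in the minutes): the availability number is essentially the fraction of probes reported OK over the window, so a script that wrongly marks a small fraction of probes as failed drags a healthy service down to the high eighties.

      # Illustrative sketch only: an SLS-style availability as the fraction of
      # probes reported OK. If a buggy script wrongly flags ~13% of probes,
      # a perfectly healthy service shows ~87% availability.
      def availability(probe_results):
          """probe_results: list of booleans, True = probe reported OK."""
          return 100.0 * sum(probe_results) / len(probe_results)

      probes = [True] * 100          # the service was actually fine all weekend
      misreported = probes.copy()
      for i in range(0, 100, 8):     # a faulty script flags every 8th probe as failed
          misreported[i] = False

      print(availability(probes))        # 100.0
      print(availability(misreported))   # 87.0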

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 22-Mar-2012
