Week of 120430

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Alessandro, Alexei, Luca M, Maarten, Maria D, Pablo, Raja);remote(Giovanni, Ian, Jhen-Wei, Lisa, Michael, Rob, Roger, Tiju).

Experiments round table:

  • ATLAS reports -
    • ATLAS/LHC: beam back as of early this morning. Stable beams expected as of Sunday afternoon
      • Update Sun Apr 29: stable beams not earlier than (NET) Monday, most probably Tuesday morning
    • ASGC re-included in all activities, including RAW data export (SantaClaus)
    • Very urgent HLT reprocessing during the afternoon; no issues to report, almost all jobs started immediately (average waiting time 8 min, max 23 min)
    • About the wget issue: it has to be handled internally within ATLAS, i.e. test the Squid with a different configuration; to be discussed with ATLAS central services on Monday
    • Nothing else to report as of 20:00 on Saturday
    • Fast reprocessing ~done, datasets are replicated to T1s and T2s

  • CMS reports -
    • LHC machine / CMS detector
      • Expect low pile-up runs sometime tonight
    • CERN / central services and T0
      • Systematic lowering of lxbatch availability over the weekend; it dropped to the high eighties in percent. Was there maintenance?
        • Alessandro: ATLAS observed no issues with jobs at CERN during the weekend
        • Raja: also normal for LHCb
        • Maarten: and for ALICE
        • Ian: as CMS had low job activity, maybe the overall fraction of issues for all experiments combined exceeded a threshold?
    • Tier-1/2:
      • Job Robot failures over the weekend. Some fixed, some ongoing. Looks like it could be a problem with the test infrastructure
        • Maarten: would that rather be HammerCloud instead of Job Robot?
        • Ian: yes
    • Other:
      • Ian Fisk is CRC until tomorrow, then Stephen Gowdy

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • CPU-intensive workflows successful at Tier-2s.
    • New GGUS (or RT) tickets
    • T0
    • T1
      • GridKa
        • Problem with the LFC (GGUS:81734) from 1 PM on 29 April; solved this morning.
          • Q: Not clear why the ticket was assigned to GridKa only at 5:20 AM this morning
          • Q: Not clear why GridKa's internal monitoring did not pick this up earlier - the problem was found via failing LHCb jobs
          • A ticket has been open for 1 month requesting that the LFC probes be added to the LHCb_CRITICAL profile (GGUS:80753). Still waiting for action on it.
            • Maarten: seems to depend on an upcoming SAM-Nagios release; Stefan is in the loop
      • PIC
        • Possible problem with queues / scaling of some nodes. Many jobs fail repeatedly due to lack of wall time before eventually succeeding.
          • In contact with the LHCb-site contact. Will open a GGUS ticket if we cannot resolve this internally.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • NDGF
    • some CE failures during the weekend
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW - ntr
  • storage - ntr

AOB:

Tuesday - No meeting - May Day holiday

Wednesday

Attendance: local(Jamie, Jhen-Wei, Luca C, Maarten, Maria D);remote(Dimitri, Giovanni, Gonzalo, John, Lisa, Michael, Raja, Rob, Roger, Rolf, Ron, Stephen).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • SARA-MATRIX: transfer errors to disk and tape. Site excluded from T0 export. The site is dealing with this. GGUS:81786
        • Ron: the main Chimera node had a high load due to various cron jobs; they were killed and the particular error messages no longer occurred. The remaining errors were due to attempts to overwrite existing files, which our dCache configuration does not allow (a minimal sketch of a pre-transfer existence check follows right after this report).
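
A minimal sketch of the pre-transfer existence check mentioned above, written in Python with the gfal2 bindings; the library choice and the SURL are assumptions for illustration only, not the tooling actually used by ATLAS DDM or SARA:

    import gfal2  # assumed client library; any SRM/GridFTP client offers an equivalent stat call

    ctx = gfal2.creat_context()
    surl = "srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/atlas/example_file"  # hypothetical destination SURL

    try:
        info = ctx.stat(surl)
        # This dCache instance refuses overwrites, so an existing destination must be
        # deleted (or the transfer skipped) rather than retried blindly.
        print("destination exists (%d bytes): skip or delete before re-transfer" % info.st_size)
    except gfal2.GError:
        print("destination absent: transfer can proceed")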

  • CMS reports -
    • LHC machine / CMS detector
      • Physics run ongoing
    • CERN / central services and T0
      • Yesterday some online testing of the HLT was blocked by an unresponsive web server (GGUS:81783).
      • The HLTMON Express stream was running at 900 Hz overnight (instead of O(30 Hz)), which caused a backlog of express jobs to process. Recovered after FNAL operations called Point 5; we caught up by morning.
    • Tier-1/2:
      • Some network problems between CERN and KNU (Korea), which was recently connected to LHCONE (Savannah:128323). Cannot connect to lcgvoms or myproxy. Should we open a GGUS ticket against CERN? [Has now been fixed]
    • Other:
      • NTR

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data taken overnight
    • New GGUS (or RT) tickets
    • T0
    • T1
      • PIC
        • Possible problem with queues / scaling of some nodes. Many jobs fail repeatedly due to lack of wall time before eventually succeeding.
          • GGUS ticket opened (GGUS:81814). Still trying to resolve the details of the problem.
      • SARA
        • CVMFS possibly broken on a worker node (GGUS:81787). Remounted just now.
        • Jobs failing to resolve input data. Investigating whether it is an issue on the LHCb side.
      • RAL
        • SRM not visible from outside RAL (GGUS:81816)
        • Prompt reconstruction failing because the Squid proxy cache is not refreshing itself. In contact with Catalin at RAL about it.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • The Globus version used by FTS was patched to re-enable the use of GridFTP v2
      • Maarten: that patch should be applied on every FTS instance; PIC have already had it for a few days; this will be further discussed during tomorrow's T1 service coordination meeting; contact FTS support if you want the details sooner
  • KIT - ntr
  • NDGF
    • 1 tape pool had a failure earlier today, fixed now
  • NLT1 - nta
  • OSG - ntr
  • PIC
    • LHCb jobs appear to be using more resources than expected; for now the wall time limit was increased accordingly; still under investigation
      • Raja: Philippe has provided further input now
  • RAL
    • looking into the issues reported by LHCb

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW
    • for tomorrow's T1 service coordination meeting please report outstanding GGUS ticket issues to Maria D
  • grid services
    • the 87% SLS availability of LXBATCH over the weekend is understood: it was due to a bad monitoring script, which has been corrected

AOB:

Thursday

Attendance: local(Eva, Jhen-Wei, Maarten, Maria D, Mike, Raja, Ricardo, Simone, Xavier E);remote(Giovanni, Gonzalo, John, Lisa, Michael, Rob, Roger, Rolf, Ronald, Stephen, Xavier M).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • SARA-MATRIX: transfer errors to disk and tape. The problem went away after the site killed some cron jobs that were loading the name server. Ticket closed; site re-included in T0 export. GGUS:81786
      • TRIUMF: a job failed due to an input file get error. A DDM expert is looking into this. GGUS:81829
    • Middleware
      • ATLAS is having issues with authentication in Panda and DDM because of mod_gridsite:
        • With the old version (1.5.19-3) there is a (not really understood) authentication problem when verifying the FQAN: two users from the same CA and the same institute (Dresden) obtain different behaviors (failure in one case, success in the other)
        • ATLAS was advised to upgrade to a newer version (1.7.19), but this shows a different problem: if the client host does not have DNS reverse mapping (e.g. $ host 10.153.232.180 returns "host not found"), the authentication fails; a quick way to reproduce this check is sketched right after this report. When the newer version was deployed in production, ATLAS "lost" several sites (10k cores) and the change was therefore rolled back.
        • Everything is tracked in GGUS:81757 (currently assigned to gridsite support). Not critical, but "unpleasant".
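
Since the mod_gridsite 1.7.19 behaviour described above hinges on the client IP having a DNS reverse (PTR) mapping, the check can be reproduced with a few lines of plain Python (standard library only; the address is the one quoted in the report, shown purely as an example):

    import socket

    def has_reverse_dns(ip):
        """Return True if the address resolves back to a hostname, i.e. a PTR record exists."""
        try:
            socket.gethostbyaddr(ip)
            return True
        except socket.herror:
            return False

    # An address without reverse mapping would be rejected by gridsite 1.7.19,
    # which is what caused the affected sites to drop out.
    print(has_reverse_dns("10.153.232.180"))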

  • CMS reports -
    • LHC machine / CMS detector
      • Cryo recovery and RF studies, with a fill later
    • CERN / central services and T0
      • One of the (three) Frontier servers was unavailable earlier today during a DNS alias move.
    • Tier-1/2:
      • NTR
    • Other:
      • NTR

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data
    • New GGUS (or RT) tickets
    • T0
      • Raja: 1 corrupted file was discovered on CASTOR; the original is still available online
      • Xavier E: it has a checksum mismatch while the size is correct; it sits on a server of an older type, which is now being drained; under investigation
    • T1
      • PIC
        • The problem with queues is sorted: queue lengths were temporarily increased for LHCb.
          • Hopefully the underlying cause will be fixed with the next version of our reconstruction and stripping application software.
      • RAL
        • SRM not visible from outside RAL (GGUS:81816)
          • Actually a problem with FTS. GGUS ticket submitted against FTS (GGUS:81844)
        • Prompt reconstruction failing because the Squid proxy cache was not refreshing itself.
          • Solved.
      • GridKa
        • Failing pilots (GGUS:81846)
          • Xavier M: fixed, please check
      • CNAF: trying to understand proxy issues with directly submitted pilots.

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • the PNFS daemon crashed again last night, as it did almost 1 month ago, causing a brief service interruption; together with the dCache developers we are investigating how to run that daemon on a different host, i.e. not together with Chimera
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • 2 new file servers recently installed for CMS have performance issues with disk and network I/O; under investigation
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • databases
    • security patch being applied to CMS integration DB
  • GGUS/SNOW - ntr
  • grid services
  • storage
    • CASTOR Oracle DB rolling security patches:
      • Wed May 9, 14:00 CEST - ALICE + ATLAS
      • Thu May 10, 14:00 CEST - CMS + LHCb

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 22-Mar-2012
