Week of 120430

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs


Monday

Attendance: local(Alessandro, Alexei, Luca M, Maarten, Maria D, Pablo, Raja);remote(Giovanni, Ian, Jhen-Wei, Lisa, Michael, Rob, Roger, Tiju).

Experiments round table:

  • ATLAS reports -
    • ATLAS/LHC: beam back as of early this morning. Stable beam expected as of Sunday afternoon
      • Sun Apr 29: stable beam not earlier than Monday, most probably on Tuesday morning
    • ASGC re-included in all activities, including RAW data export (SantaClaus)
    • Very urgent HLT reprocessing during the afternoon; no issues to report, almost all jobs started immediately (average waiting time 8 min, max 23 min)
    • About the wget issue: to be handled internally by ATLAS, i.e. test the Squid with different configurations; to be discussed with ATLAS central services on Monday
    • Nothing else to report as of 20:00 on Saturday
    • Fast reprocessing ~done, datasets are replicated to T1s and T2s

  • CMS reports -
    • LHC machine / CMS detector
      • Expect low pile-up runs sometime tonight
    • CERN / central services and T0
      • Systematic lowering of lxbatch availability over the weekend; it dropped to the high eighties in percentage. Was there maintenance?
        • Alessandro: ATLAS observed no issues with jobs at CERN during the weekend
        • Raja: also normal for LHCb
        • Maarten: and for ALICE
        • Ian: as CMS had low job activity, maybe the overall fraction of issues for all experiments combined exceeded a threshold?
    • Tier-1/2:
      • Job Robot failures over the weekend. Some fixed, some ongoing. Looks like it could be a problem with the test infrastructure
        • Maarten: would that rather be HammerCloud instead of Job Robot?
        • Ian: yes
    • Other:
      • Ian Fisk CRC until Tomorrow, and then Stephen Gowdy

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • CPU-intensive workflows successful at Tier-2s
    • New GGUS (or RT) tickets
    • T0
    • T1
      • GridKa
        • Problem with the LFC (GGUS:81734) from 1 PM on 29 April. Problem solved this morning.
          • Q: Not clear why the ticket was assigned to GridKa only at 5:20 AM this morning
          • Q: Not clear why GridKa's internal monitoring did not pick this up earlier - the problem was found via failing LHCb jobs
          • Ticket open for 1 month, requesting to introduce the LFC probes into the LHCb_CRITICAL profile (GGUS:80753). Still waiting for action on it.
            • Maarten: seems to depend on an upcoming SAM-Nagios release; Stefan is in the loop
      • PIC
        • Possible problem with queues / scaling of some nodes. Many jobs repeatedly failing due to lack of wall time before eventually succeeding.
          • In contact with the LHCb-site contact. Will open a GGUS ticket if we cannot resolve this internally.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • NDGF
    • some CE failures during the weekend
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW - ntr
  • storage - ntr

AOB:

Tuesday - No meeting - May Day holiday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Wednesday

Attendance: local(Jamie, Jhen-Wei, Luca C, Maarten, Maria D);remote(Dimitri, Giovanni, Gonzalo, John, Lisa, Michael, Raja, Rob, Roger, Rolf, Ron, Stephen).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • SARA-MATRIX: transfer errors to disk and tape. Site excluded from T0 export. Site is dealing with this. GGUS:81786
        • Ron: the main Chimera node had a high load due to various cron jobs; they were killed and the particular error messages no longer occurred, while the remaining errors were due to attempts to overwrite existing files, which our dCache configuration does not allow

  • CMS reports -
    • LHC machine / CMS detector
      • Physics run ongoing
    • CERN / central services and T0
      • Yesterday some online testing of the HLT was blocked by an unresponsive web server (GGUS:81783).
      • The HLTMON Express stream was running at 900 Hz overnight (instead of O(30 Hz)), causing a backlog of express jobs to process. Recovered after FNAL operations called Point 5; we caught up by morning.
    • Tier-1/2:
      • Some network problems between CERN and KNU (Korea), recently connected to LHCONE (Savannah:128323). Cannot connect to lcgvoms or myproxy. Should we open a GGUS ticket against CERN? [Has now been fixed]
    • Other:
      • NTR

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data taken overnight
    • New GGUS (or RT) tickets
    • T0
    • T1
      • PIC
        • Possible problem with queues / scaling of some nodes. Many jobs repeatedly failing due to lack of wall time before eventually succeeding.
          • GGUS ticket opened (GGUS:81814). Still trying to resolve the details of the problem.
      • SARA
        • Worker node with CVMFS possibly broken (GGUS:81787). Remounted just now.
        • Jobs failing to resolve input data. Investigating if it is an issue from LHCb side.
      • RAL
        • SRM not visible from outside RAL (GGUS:81816)
        • Prompt reconstruction failing due to squid proxy cache not refreshing itself. In contact with Catalin at RAL about it.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • FTS Globus version was patched to re-enable the use of GridFTP v2
      • Maarten: that patch should be applied on every FTS instance; also PIC already have it since a few days; this will be further discussed during tomorrow's T1 service coordination meeting; contact FTS support if you want the details already now
  • KIT - ntr
  • NDGF
    • 1 tape pool had a failure earlier today, fixed now
  • NLT1 - nta
  • OSG - ntr
  • PIC
    • LHCb jobs appear to be using more resources than expected; for now the wall time limit was increased accordingly; still under investigation
      • Raja: Philippe has provided further input now
  • RAL
    • looking into the issues reported by LHCb

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW
    • for tomorrow's T1 service coordination meeting please report outstanding GGUS ticket issues to Maria D
  • grid services
    • 87% SLS availability of LXBATCH over the weekend understood to be due to a bad monitoring script - corrected

AOB:

Thursday

Attendance: local(Eva, Jhen-Wei, Maarten, Maria D, Mike, Raja, Ricardo, Simone, Xavier E);remote(Giovanni, Gonzalo, John, Lisa, Michael, Rob, Roger, Rolf, Ronald, Stephen, Xavier M).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • SARA-MATRIX: transfer errors to disk and tape. The problem was gone after the site killed some cron jobs that were loading the name server. Ticket closed, site re-included in T0 export. GGUS:81786
      • TRIUMF: jobs failed due to input file "get" errors. DDM expert is looking into this. GGUS:81829
    • Middleware
      • ATLAS is having issues with authentication in Panda and DDM because of mod_gridsite:
        • With the old version (1.5.19-3) there is a not really understood authentication problem when verifying the FQAN: two users from the same CA and the same institute (Dresden) obtain different behaviors (failure in one case, success in the other)
        • ATLAS was advised to upgrade to a newer version (1.7.19), but this shows a different problem: if the client host does not have DNS reverse mapping (e.g. $ host 10.153.232.180 returns "host not found"), the authentication fails. When the newer version was put into production ATLAS "lost" several sites (10k cores) and therefore the change was rolled back.
        • Everything is tracked in GGUS:81757 (currently assigned to gridsite support). Not critical but "unpleasant"
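The reverse-mapping failure described above can be illustrated outside mod_gridsite. A minimal sketch in Python of the same check (the helper name is illustrative; mod_gridsite itself performs this lookup in C, not with this code):

```python
import socket

def reverse_lookup(ip):
    """Return the PTR (reverse DNS) name for an IP address, or None when
    the reverse mapping is missing - the condition that made
    mod_gridsite 1.7.19 reject otherwise valid clients."""
    try:
        name, _aliases, _addresses = socket.gethostbyaddr(ip)
        return name
    except (socket.herror, socket.gaierror, OSError):
        return None

# A host without a PTR record yields None instead of a hostname;
# a guaranteed-unresolvable name behaves the same way.
missing = reverse_lookup("no-such-host.invalid")
```

A service that insists on a non-None result here would show exactly the kind of site loss described above whenever a client host lacks reverse DNS.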

  • CMS reports -
    • LHC machine / CMS detector
      • Cryo recovery and RF studies, with a fill later
    • CERN / central services and T0
      • One of the (three) frontier servers unavailable earlier today during a DNS alias move.
    • Tier-1/2:
      • NTR
    • Other:
      • NTR

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data
    • New GGUS (or RT) tickets
    • T0
      • Raja: 1 corrupted file was discovered on CASTOR; the original is still available online
      • Xavier E: it has a checksum mismatch while the size is correct; it sits on a server of an older type, which is now being drained; under investigation
    • T1
      • PIC
        • Problem with queues sorted out. Queue lengths temporarily increased for LHCb.
          • Hopefully the underlying cause will be fixed with the next version of our reconstruction and stripping application software.
      • RAL
        • SRM not visible from outside RAL (GGUS:81816)
          • Actually a problem with FTS. GGUS ticket submitted to FTS (GGUS:81844)
        • Prompt reconstruction failing due to squid proxy cache not refreshing itself.
          • Solved.
      • GridKa
        • Failing pilots (GGUS:81846)
          • Xavier M: fixed, please check
      • CNAF : Trying to understand proxy issues with directly submitted pilots.

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • the PNFS daemon crashed again last night, as it did almost a month ago, causing a brief service interruption; with the dCache developers we are investigating how to run that daemon on a different host, i.e. not together with Chimera
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • 2 new file servers recently installed for CMS have performance issues with disk and network I/O; under investigation
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • databases
    • security patch being applied to CMS integration DB
  • GGUS/SNOW - ntr
  • grid services
  • storage
    • CASTOR Oracle DB rolling security patches:
      • Wed May 9, 14:00 CEST - ALICE + ATLAS
      • Thu May 10, 14:00 CEST - CMS + LHCb

AOB:

Friday

Attendance: local(Eva, Jhen-Wei, Maarten, Raja, Ricardo, Stephen, Xavier E);remote(Alain, Catalin, Gonzalo, John, Michael, Onno, Paolo, Rolf, Ulf, Xavier M).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • ntr

  • CMS reports -
    • LHC machine / CMS detector
      • Cryo recovery and collisions this evening
      • Fill 2587 (852x852): 97.2% efficiency by luminosity recorded
    • CERN / central services and T0
      • Logged 10 kHz into the express stream for 10 minutes due to a misconfiguration in an online test. Did not cause any problems.
    • Tier-1/2:
      • T2_US_UCSD lost cooling water yesterday evening. They hope to be back soonish. We have two CRAB servers there for glideinWMS.
    • Other:
      • NTR

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data
    • New GGUS (or RT) tickets
    • T0
      • Corrupted file (RAW data) re-transferred and migrated
        • Raja: the checksum was OK right after the transfer
        • Xavier E: indeed, the corruption appears to have occurred shortly afterwards; a HW problem is suspected; the machine was already scheduled for retirement because of its age
    • T1
      • CNAF : Still trying to understand proxy issues with directly submitted pilots.
    • Others
      • FTS: Problem with FTS proxy expiration. GGUS ticket submitted to FTS by the RAL FTS admins (GGUS:81844)
        • Maarten: the main ongoing issue with FTS 2.2.8 is intermittent failures due to expired proxies, so far only seen at some of the T1 sites; the developers are investigating this with high priority, but the cause may be hard to pin down
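The corrupted RAW file above was caught through a checksum mismatch against the catalogue value while the size was correct. CASTOR records Adler-32 checksums for files; a minimal sketch of recomputing one file's checksum for such a comparison (a generic illustration, not LHCb's or CASTOR's actual verification code):

```python
import zlib

def adler32_hex(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file as an 8-digit hex string,
    reading in chunks so large RAW files need not fit in memory."""
    checksum = 1  # Adler-32 initial value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")
```

Re-running such a check against the stored catalogue checksum is what exposes silent on-disk corruption, e.g. the HW problem suspected on the retiring disk server here.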

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF
    • CREAM proxy issue for LHCb has been fixed
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL
    • Monday May 7 is a bank holiday in the UK; RAL will not connect

  • databases
    • Oracle released a new security patch that is particularly urgent for listeners exposed to the internet; next week it will be tested in integration, the week after it should be deployed in production; as the patch is rolling, downtimes are not expected; exact dates and times will be agreed with the experiments
  • grid services
    • Raja: are there any improvements in the fair share algorithm foreseen?
    • Ricardo: with LSF we currently cannot do much better in that area
    • Maarten: other batch solutions will be looked into, but they are not for this year
    • Raja: we still see some issues, will follow up offline
  • storage - nta

AOB:

  • Alain: can RAL provide details on the network issue affecting the GOCDB?
  • John: there was a problem with a core router that did not affect the T1; fixed now, details to follow (see below)

GOCDB downtime details

The GOCDB system experienced an unexpected downtime between 03.05.12 and 04.05.12. During this time parts of the system were inaccessible to end users and programmatic tools. The details of this downtime are below:

  • A network outage caused by a configuration problem on a router at STFC's RAL site caused complete inaccessibility between 15:42 and 16:58 UTC on 03.05.12
  • At around 21:40 UTC on 03.05.12 we experienced an issue with the storage attached to the virtual machine hypervisor that hosts the GOCDB VM instances
    • This issue resulted in the read/write portal instance (gocdb4.esc.rl.ac.uk) being completely inaccessible
    • The read only web portal (goc.egi.eu) was unaffected, as was the programmatic interface provided by the read only portal
  • The storage issue caused the filesystems of both hosts to be mounted read-only. The cause of the problem was not initially clear, but once diagnosed and the fault cleared, we rebooted both hosts to re-mount the filesystems in read/write mode; both hosts were back in production at 10:45 UTC.
  • The GOCDB team maintains a read-only failover instance with colleagues in Germany to address outages. However, we did not fail over in this instance, as our read-only instance was unaffected for the majority of this downtime.
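The storage fault above silently flipped the hosts' filesystems to read-only. On Linux such a condition can be spotted by scanning /proc/mounts; a minimal sketch (a hypothetical monitoring helper, not the GOCDB team's actual tooling):

```python
def readonly_mounts(mounts_text):
    """Parse /proc/mounts content and return the mount points whose
    option list contains 'ro' (filesystems mounted read-only)."""
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, ...
        if len(fields) >= 4 and "ro" in fields[3].split(","):
            readonly.append(fields[1])
    return readonly

# Example mimicking the failure mode above: /data flipped to read-only.
sample = (
    "/dev/vda1 / ext4 rw,relatime 0 0\n"
    "/dev/vdb1 /data ext4 ro,relatime 0 0\n"
)
```

On a live host one would pass `open("/proc/mounts").read()` and alarm on any unexpected entries in the result.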

-- JamieShiers - 22-Mar-2012

Topic revision: r16 - 2012-05-04 - MaartenLitmaath
 