Week of 100705

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Gavin, Marie-Christine, Jamie, Jarka, Jean-Philippe, Alex, Miguel, Luca, Nilo, MariaDZ, Stephane);remote(Jon, Federico Stagni (LHCb), Angela, Rolf, Ale, Ron, Gang, John, Vera, Alessandro (CNAF))

Experiments round table:

  • ATLAS reports -
    • No major issue to report.
    • Over the weekend ATLAS was taking data, mostly short runs at high data rate.
    • Some ATLAS users have problems submitting jobs to some of the EGEE sites, while they are able to run at BNL. This issue is being investigated.
    • Reprocessing test with data read back from tape - the test covers all 10 ATLAS T1s + CERN. The schedule is being built - a few T1s at a time. It will be presented at today's ATLAS weekly operations meeting; some sites will start this week.

  • CMS reports -
    • Issues
      • AFS problems
      • SSO fails with Safari - see CERN Site Status board entry.
      • CRAB server seems to be broken; working on it
    • T0 Highlights
      • T0 extremely slow in creating workflows, see the GGUS ticket opened by Stephen Gowdy: https://gus.fzk.de/ws/ticket_info.php?ticket=59728 (One AFS fileserver extremely overloaded - discussion on whether to replicate it or not. Also affecting users on lxplus who have this area in their environment.)
      • New T0 monitoring tool in production
    • T1 Highlights
      • skimming workflows
    • T2 Highlights
      • MC production

  • ALICE reports - GENERAL INFORMATION: Significant Grid activity during the weekend, with a large number of jobs running at all sites
    • T0 site
      • GGUS:59637: bad performance of the ce201 and ce202 CREAM services at CERN. The issue was handled quickly by the system experts, although the answer to the ticket submitter came more slowly. ce201 is still down: it seems the initial proxy is no longer transferred to the worker node, because the script generated for submission to LSF lacks the required directive (a proxy-check sketch follows the experiment reports below). A new downtime has been added until Monday evening; ce202 and ce203 can be used in the meantime.
      • GGUS:59702: the aliprod login was disabled on voalicefs01 to 05. Since these are the ALICE master condition DB repositories, the situation could have become rather serious. Ticket solved and validated by the experiment experts.
      • VOBOX support ticket CT695499: asked for the replacement of voalice07, which will shortly be out of warranty
    • T1 sites
      • SARA: GGUS:59639 (reopened): writing operation failing in both disk and tape backends
      • CNAF: Unstable performance of both CREAM systems throughout the weekend. At one point both systems were down, interrupting production at the site. At the moment both CREAM systems seem to be in good shape.
    • T2 sites
      • Kolkata: out of production due to unstable behavior of both the VOBOX and the CREAM system during the weekend
      • TriGrid: Failing on Friday evening, site recovered during the weekend, back in production
      • Bratislava: On Friday it was reported that the resource BDII of the CREAM system was publishing incorrect information about the status of the local queue. Problem solved; the site has been back in production since last night.

  • LHCb reports -
    • Decided to run new (test) reconstruction and stripping productions to process the new events (about 130M).
    • MC productions stopped due to an application problem.
    • T0 site issues:
      • Problems accessing files with RFIO at CERN, due to a peak of queued transfers on the default service class in CASTOR. GGUS:59632 (Miguel - users are trying to run more transfers than the pool can take. Limits can be introduced, but how many transfers a user may run on this pool is a decision for LHCb; we can advise. A throttling sketch follows the experiment reports below.)
    • T1 site issues:
    • T2 sites issues:
      • UK sites' issue uploading to CERN and the T1s: we don't see upload problems from Glasgow at the moment, but still a few issues from the other sites. (Brian - why is the UK more susceptible to this issue? It is thought to be related to TCP settings, which we would have expected to affect most sites...)
      • Could not determine shared area at site RO-15-NIPNE. GGUS:59688
      • Shared Area problem at dangus.itpa.lt ITPA-LCG2. GGUS:59691
      • Jobs aborted at ce.reef.man.poznan.pl PSNC. GGUS:59693
      • Jobs aborted at cert-15.pd.infn.it INFN-PADOVA. GGUS:59706
    • CREAM CE
      • The CREAM CE issue originally discovered at CERN (GGUS:59559) now requires intervention by the developers: reassigned to CREAM-BLAH.
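
Note on GGUS:59637 in the ALICE report above: the failure mode described (the delegated proxy not reaching the worker node) can be confirmed from inside a test job. The following is a minimal sketch, assuming only that openssl is available on the worker node; it is not the ALICE or CREAM diagnostic machinery.

    import os
    import subprocess
    import sys

    # Minimal check, run as the payload of a test job: verify that a delegated
    # proxy actually reached the worker node and print its expiry time.
    proxy = os.environ.get("X509_USER_PROXY")
    if not proxy or not os.path.isfile(proxy):
        sys.exit("FAIL: no proxy on the worker node (X509_USER_PROXY unset or file missing)")

    # openssl prints "notAfter=<date>" for the delegated proxy certificate
    enddate = subprocess.check_output(
        ["openssl", "x509", "-in", proxy, "-noout", "-enddate"]).decode().strip()
    print("OK: proxy found at %s, %s" % (proxy, enddate))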
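
Note on the RFIO item in the LHCb report above: any server-side limit is a decision for LHCb, but a user script can already avoid flooding the default service class by capping its own concurrency. A minimal sketch follows; the file list, destination and limit are hypothetical, and only the standard rfcp client is assumed.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    MAX_CONCURRENT = 5  # assumed per-user limit; the actual number is for LHCb to decide

    def copy(castor_path, local_dir="/tmp"):
        # rfcp copies one file out of CASTOR; a non-zero return code means failure
        return subprocess.call(["rfcp", castor_path, local_dir])

    # hypothetical list of files to fetch from the default service class
    files = ["/castor/cern.ch/grid/lhcb/user/example_%03d.dst" % i for i in range(20)]

    # the thread pool never runs more than MAX_CONCURRENT copies at once
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        failures = sum(1 for rc in pool.map(copy, files) if rc != 0)
    print("failed copies: %d" % failures)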

Sites / Services round table:

  • FNAL - had to restart FTS server last night - completely hung for a few hours. Paged and restarted.
  • KIT - on Saturday we had a problem with CMS head nodes for dCache - down for about 3 hours. H/W failure - tomorrow we will switch h/w.
  • NL-T1 - had this morning an issue with one pool node - f/s problems. Recovered situation - vendor looking at it, looks like have lost 37 files.
  • RAL - scheduled intervention went well this morning. Unfortunately, we had network problems due to faulty card in a router so entire site is in downtime until tomorrow!
  • NDGF - ntr
  • ASGC - ntr
  • CNAF -

  • CERN DB - upgrade of APEX to v4 on the CMS and ATLAS databases; replication of PVSS streams from CMS online to offline stopped for half an hour - had to be restarted on a different node.

AOB:

  • Calls - continue for the rest of the week, despite workshop!

Tuesday:

Attendance: local(Peter, Marie-Christine, Luca, Gav, Maria D.); remote(Federico[LHCb], Xavier[KIT], Jon[FNAL], Onno[NL-T1], Gonzalo[PIC], Gang[ASGC], Kyle[OSG], Jeremy[GridPP], Vera[NDGF], Rolf[IN2P3], Gareth[RAL], Michael[BNL], Alessandro[CNAF]).

Experiments round table:

  • ATLAS reports -
    • ATLAS continues data taking this week
    • Reprocessing from Tape, FZK/IN2P3 today, RAL/TRIUMF end of week
    • SARA disk outage mitigated with DATATAPE write/fetch enabled, other endpoints excluded from transfer activity
    • No trouble seen at RAL following scheduled Castor intervention or router outage.
    • 4 lost files seen on CERN-PROD_PHYS-GENER: hardware fault on raid controller
    • Issue with SARA ALARM tickets resolved

  • CMS reports -
    • CMS running on only 1 WebTools front-end server due to the failure of vocms105. Post-mortem analysis is ongoing. Ticket https://gus.fzk.de/ws/ticket_info.php?ticket=59750
    • Automatically extending datasets with CRAB standalone and the CRAB server is not completely functional in CRAB_2_7_3, and users have reported inconsistent results. Working on it.
    • Transfer problems at KIT: dCache head node rebooted - looks OK now (Xavier: hardware problem that the vendor could not recover from - it took until today to prepare a replacement machine)

  • LHCb reports -
    • Experiment activities: Only user analysis running now
    • T1: Problems with SARA storage: since there is a possibility of data corruption, some space tokens have been excluded. LHCb can still write to tape and stage from it. We can write to T1D1 (100TB will not be available). User disk is not available. Issue with SARA not receiving GGUS mails - now solved.
    • T2: Certificate not updated on a couple of sites. GGUS:59734 GGUS:59735 (solved now)

Sites / Services round table:

  • KIT, OSG, GridPP, NDGF, PIC, FNAL, BNL, IN2P3: NTR
  • RAL: as mentioned in the ATLAS report: site outage yesterday due to a DNS server issue. Resolved in the afternoon.
  • ASGC: transfer failures - issue on SRM storage backend DB (backup process eating memory). Unstable for some hours but now recovered. Monitoring.
  • NL-T1: Problems of silent data corruption on Infiniband storage. 7 of 11 racks are at risk and are now offline. Checking all parameters - the vendor is aware and following up with high priority. All files will be checksummed once the issue is understood (a checksum sketch follows this round table).
  • CNAF: Intervention on the batch system - no details, but it sounds like last week's problem. Issue opened on the CREAM CE: solved with a quick patch; proper solution pending a new release.
  • CERN DB: glitch on CMS online hosting PVSS apps - got stuck for 30 mins. Recovered after reboot.
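
Note on the NL-T1 item above: once the corruption issue is understood, re-verifying the files comes down to recomputing checksums and comparing them with the catalogue values. A minimal sketch, assuming adler32 checksums and a plain-text dump of "path checksum" pairs; this is not SARA's actual procedure.

    import sys
    import zlib

    def adler32(path, blocksize=1024 * 1024):
        # stream the file through zlib.adler32 in 1 MB blocks (seed value is 1)
        value = 1
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                value = zlib.adler32(block, value)
        return "%08x" % (value & 0xFFFFFFFF)

    bad = 0
    # argument: a file with one "<path> <expected adler32>" pair per line
    with open(sys.argv[1]) as catalogue:
        for line in catalogue:
            path, expected = line.split()
            if adler32(path) != expected.lower():
                print("MISMATCH", path)
                bad += 1
    print("%d corrupted file(s) found" % bad)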

AOB:

Wednesday

Attendance: local(Peter, Luca, Gav); remote(Federico[LHCb], Xavier[KIT], Markus[CMS], Jon[FNAL], Alexander[NL-T1], Gang[ASGC], Kyle[OSG], Vera[NDGF], Rolf[IN2P3], Michael[BNL], Alessandro[CNAF]).

Experiments round table:

  • ATLAS reports -
    • BNL outage affected transfers for 4-5 hours; recovered quickly
    • SARA SRM not contactable, all endpoints remain excluded from DDM awaiting ticket update
    • srm-atlas.cern.ch looks busy, getting increasing t0-export transfer errors
    • PIC -> CNAF transfer issue - probably issue at PIC.

  • CMS reports -
    • CMS running only on 1 WebTools front-end server due to failure of vocms105 . Post mortem analysis is on going. Ticket GGUS:59750 - now closed.
    • T0 worker nodes are now reading software from a distinct AFS area, which has reduced the overload on the AFS server in question - will continue to watch (a replication sketch follows the experiment reports below).

  • ALICE reports -
    • T0 site
      • GGUS:59637 concerning the performance of ce201 and ce202: CLOSED. Experts applied some fixes to ce201 suggested by the CREAM developers. Systems back in production
    • T1 sites
      • Nothing to report
    • T2 sites
      • Grenoble: Site out of production: local VOBOX not reachable. Reported to the ALICE contact person at the site

  • LHCb reports -
    • Experiment activities: Just launched a new reconstruction production on a selected number of runs (only T1 sites involved)
    • T0/T1 site issues:
    • T2 sites issues:
      • Closed tickets regarding certificate not updated on a couple of sites. GGUS:59734 GGUS:59735
      • sites WEIZMANN-LCG2 and RO-07-NIPNE put back in production mask.
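
Note on the CMS AFS item above: the overloaded software area was relieved by pointing the T0 worker nodes at a distinct AFS area; the other option discussed on Monday was replicating the volume. Purely as an illustration (volume, server and partition names are invented, and this is not what was actually done), adding read-only replicas with the standard OpenAFS tools would look roughly like this:

    import subprocess

    VOLUME = "p.cms.sw"          # hypothetical software volume
    REPLICA_SITES = [("afsserv1.cern.ch", "/vicepa"),
                     ("afsserv2.cern.ch", "/vicepb")]

    # define read-only replica sites for the volume
    for server, partition in REPLICA_SITES:
        subprocess.check_call(["vos", "addsite", server, partition, VOLUME])

    # propagate the current read/write contents to the new read-only sites,
    # so clients spread their reads over the replicas
    subprocess.check_call(["vos", "release", VOLUME])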

Sites / Services round table:

  • KIT, FNAL, ASGC, OSG, IN2P3, NDGF: NTR

  • SARA: situation still the same, with 7 of 11 disk servers down. ATLAS: another issue with SRM was seen this morning - SARA will look into it.

  • BNL: power outage yesterday - UPS tripped due to 20% drop in supply voltage - flywheel / diesel backup didn't kick in. Affected mostly ATLAS storage servers. BNL working on connecting ATLAS storage to dual UPS (expected ready in 2 weeks).

  • CNAF: investigated PIC -> CNAF transfer issue - likely issue at PIC.

  • CERN-PROD services: As per MyProxyFTSRetire, the server myproxy-fts.cern.ch will be retired tomorrow. It has been removed from GOCDB.

  • CERN-PROD DB: NTR

AOB: None.

Thursday

Attendance: local(Peter, Luca, Gav, Miguel, Marcin, Jacek, Ignacio); remote(Federico[LHCb], Markus[CMS], Xavier[KIT], Jon[FNAL], Onno[NL-T1], Gang[ASGC], Kyle[OSG], Vera[NDGF], Rolf[IN2P3], Michael[BNL], Alessandro[CNAF]).

Experiments round table:

  • ATLAS reports -
    • ALARM ticket to CASTOR support, srm-atlas.cern.ch degraded. A misbehaving process related to rsyslog was killed. 3 tickets; the latest: GGUS:59888
    • ALARM ticket to castor support, T0MERGE pool out of service. In progress GGUS:59850
    • CNAF->PIC transfer failures GGUS:59791 in progress (since Wed morning)
    • SARA T0 data export subscriptions stopped
    • Miguel (CERN): Related to the rsyslog problem - a bug causes a large number of threads to pile up in the rsyslog daemon, which slows down anything talking to it. Both issues (srm-atlas and T0MERGE) were caused by this problem.
      • Configuration of rsyslog changed and monitoring added (which should also fix the issue if it re-occurs); will verify that this has the desired effect. Tickets on hold for now (a monitoring sketch follows the experiment reports below).

  • CMS reports -
    • We saw problems with the srm-cern earlier today which caused transfers to CERN to fail.
      • Seems to be solved. Don't know why and how at time of writing.
    • Question about problems with the MCDB (database?) service for Monte Carlo - not clear if this is a CERN IT service or a CMS internal one - Markus will check.

  • ALICE reports -
    • T0 site
      • GGUS:59868: Strange information reported by the resource BDII of the local CREAM-CEs: a large number of running jobs is reported, while ALICE observes a much lower number of jobs (a query sketch follows the experiment reports below)
    • T1 sites
      • CNAF needs to update some software in the tape servers; this means stopping the service for a few hours. Asking ALICE experts for the green light to perform the operation as soon as possible.
    • T2 sites
      • GRIF_IPNO CREAM service is failing due to some proxy expiration. Checking the system in detail this morning

  • LHCb reports -
    • T0 site issues:
      • CREAM CE still aborting many pilots, more investigations requested. GGUS:59559 reopened
    • T1 site issues:
      • Observed degradations in the IN2P3 shared area. GGUS:59880
    • T2 sites issues:
      • NTR
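
Note on the rsyslog issue in the ATLAS report above: the added monitoring was described only in outline; a check along those lines simply watches the rsyslogd thread count and restarts the daemon when it runs away. A minimal sketch - the threshold and the restart command are assumptions, not the configuration actually deployed at CERN.

    import subprocess

    THRESHOLD = 100  # assumed upper bound on a healthy rsyslogd thread count

    # "ps -o nlwp= -C rsyslogd" prints the number of threads (NLWP) per rsyslogd process;
    # ps exits non-zero (raising CalledProcessError) if no such process is running
    out = subprocess.check_output(["ps", "-o", "nlwp=", "-C", "rsyslogd"]).decode()
    threads = sum(int(n) for n in out.split())

    if threads > THRESHOLD:
        print("rsyslogd has %d threads, restarting" % threads)
        subprocess.call(["service", "rsyslog", "restart"])
    else:
        print("rsyslogd thread count OK: %d" % threads)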
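
Note on GGUS:59868 in the ALICE report above: the numbers published by the resource BDII can be compared directly with ALICE's own job counts by querying the CE's Glue entries. A minimal diagnostic sketch; the CE host name is an assumption.

    import subprocess

    CE = "ce201.cern.ch"   # assumption: one of the CREAM CEs in question

    # query the resource BDII (port 2170, base o=grid) for the published job counts
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", "ldap://%s:2170" % CE, "-b", "o=grid",
           "(objectClass=GlueCE)",
           "GlueCEUniqueID", "GlueCEStateRunningJobs", "GlueCEStateWaitingJobs"]
    print(subprocess.check_output(cmd).decode())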

Sites / Services round table:

  • IN2P3, OSG, BNL, ASGC: NTR

  • KIT: ATLAS only pools were down for 2 hours today with controller problem - fixed now.

  • NDGF: Issue with LFC service affecting ATLAS (fixed within 4 hours - back online). Coincidental monitoring issue caused by switch problem slowed response.

  • NL-T1 (SARA): SRM outage extended until Monday - the problem is difficult to pinpoint and reproduce, so a test is being set up with the vendor on site (the vendor suspects a firmware issue). ATLAS: was the firmware upgraded recently? Root cause still to be understood.

  • NL-T1(NIKHEF): Failed disk on DPM - replaced and back online.

  • FNAL: dCache hang last night - had to restart (typically happens once per month). A VOMS(?) server on vocms113 is not responding, causing CMS jobs to crash - CERN will look at it.

  • CERN DB: Today, after a scheduled intervention at PIC, the apply process was not re-enabled as it should have been, causing a problem in the replication from ATLAS to PIC. We requested that the apply process be enabled; once enabled, replication started and caught up (a sketch follows this round table).

  • CERN Castor: nothing more than what is already in the ATLAS section.

  • CERN PES: myproxy-fts.cern.ch was finally retired this morning, after years of sterling service.
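
Note on the CERN DB item above: re-enabling a Streams apply process that was left disabled after an intervention is a short operation at the destination database. A hedged sketch using cx_Oracle; credentials, DSN and grants are placeholders, not the commands actually run by the DB team.

    import cx_Oracle

    # connect to the destination (replica) database as the Streams administrator
    conn = cx_Oracle.connect("strmadmin", "change_me", "atlas_pic_dest")  # hypothetical DSN
    cur = conn.cursor()

    # list the apply processes and their current state
    cur.execute("SELECT apply_name, status FROM dba_apply")
    for name, status in cur.fetchall():
        print(name, status)
        if status != "ENABLED":
            # DBMS_APPLY_ADM.START_APPLY re-enables a stopped apply process
            cur.callproc("DBMS_APPLY_ADM.START_APPLY", [name])
    conn.close()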

AOB:

  • ATLAS: request to PIC / CNAF to push on the transfer problem ( GGUS:59791 )

Friday

Attendance: local(Gav[Grid/Batch], Jacek[DB], Ignacio[Storage]); remote(Roger[NDGF], Onno[NL-T1], Jon[FNAL], Michael[BNL], Gang[ASGC], Federico[LHCb], Peter[ATLAS], Rolf[IN2P3], Kyle[OSG], Alessandro[INFN], Jos[KIT]).

Experiments round table:

  • ATLAS reports -
    • CNAF->PIC transfer failures GGUS:59791 in progress (since Wed morning)

  • LHCb reports -
    • Experiment activities:
      • Test reconstruction production stopped, new test production launched on smaller files. No MC ongoing.
    • T0/T1 site issues: NTR
    • T2 sites issues:
      • Shared area quota not sufficient at INFN-TORINO, site now out of the mask: GGUS:59912
      • Shared area problem at INFN-LNS, site out of the mask: GGUS:59917

Sites / Services round table:

  • NDGF, FNAL, BNL, IN2P3, OSG, ASGC: NTR

  • NL-T1: Update on SRM: running tests on 2 servers (write, flush, checksum). No errors after several hours - will continue the tests over the weekend. Issues with the mail server - you may have problems getting emails through to SARA.

  • KIT: Question for ATLAS: on Tuesday the minutes mentioned ATLAS would be reprocessing from tape at KIT - we didn't see anything - were there any problems? Peter: will check with the responsible person.

  • CERN services: NTR

AOB:

  • Shortest meeting on record?

-- JamieShiers - 02-Jul-2010
