Week of 120402

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Alex, Daniele, Dirk, Jarka, Maarten, Massimo, Simone, Stefan);remote(Alexander, Gonzalo, Jeremy, Jhen-Wei, Lisa, Paolo, Rob, Roger, Tiju, Xavier).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • NTR
    • T1s
      • CNAF: FTS problem reported in the evening of Friday (/var/log full). Fixed immediately after.

  • CMS reports -
    • LHC machine / CMS detector
      • Fri A: collisions (no stable beams). Computing OK. Only a minor problem with a T0 machine (high load), fixed before collisions came
      • Mon M: access
      • Tue: first STABLE BEAMS 3x3 foreseen with 2 bunches colliding in CMS
    • CERN / central services
      • CASTOR_SRM-CMS availability drops in SLS (Saturday, plus twice this morning). No ticket.
        • Massimo: errors are not understood, some are cured by a restart, looking into it
    • Tier-0:
      • CMS T0 availability drop in SLS (here). CMS expert confirmed all the T0 subcomponents are up and running. Investigating.
        • Daniele: issue was on the SLS side, now fixed
    • Tier-1:
    • Tier-2:
      • business as usual

  • LHCb reports -
    • Stripping jobs (both Re-Stripping 17b & Pre-Stripping 18) are ~complete at all sites except GridKa
    • MC simulation at Tier-2s ongoing
    • New GGUS (or RT) tickets
    • T0
      • NTR
    • T1
      • GridKa: staging progressing very slowly since last week (GGUS:80794)
        • Xavier: LHCb staging rate is as it was last year, will look into possible increase
        • Stefan: one third below target?
        • Xavier: no, 150 MB/s being delivered as agreed last year
        • Stefan: should be OK for normal operation, only the ongoing backlog processing is impacted, but that is expected to finish by Easter
        • Xavier: please submit any stage requests as early as possible, to allow the best optimization
      • Upgrade of LFC server versions needed at GRIDKA (GGUS:80777), RAL (GGUS:80775), SARA (GGUS:80782)
      • Gridka conditions DB not accessible by CERN/SLS sensors (GGUS:79800)
        • Stefan: Oracle server version cannot be retrieved
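The back-and-forth above about the GridKa staging rate (150 MB/s as agreed, backlog expected to finish by Easter) boils down to simple drain-time arithmetic. A minimal sketch; the backlog size used in the comment is hypothetical, not from the minutes:

```python
def drain_days(backlog_tb, rate_mb_s):
    """Days needed to stage a tape backlog of `backlog_tb` terabytes
    at a sustained staging rate of `rate_mb_s` megabytes per second."""
    seconds = backlog_tb * 1e6 / rate_mb_s  # 1 TB = 1e6 MB
    return seconds / 86400                  # seconds per day

# At the agreed 150 MB/s, a hypothetical 100 TB backlog drains in
# roughly a week -- which is why the delivered rate can be "OK for
# normal operation" yet still stretch out a large backlog.
```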

Sites / Services round table:

  • ASGC - ntr
  • CNAF
    • FTS 2.2.8 upgrade went OK last week
  • FNAL - ntr
  • GridPP - ntr
  • KIT
    • at-risk downtime tomorrow 08:30-10:00 CEST for FTS 2.2.8 upgrade
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC
    • LFC for LHCb upgraded to 1.8 this morning
    • big downtime tomorrow because of yearly electrical maintenance, draining tonight
  • RAL
    • batch service was not starting new jobs this morning; cured by a restart, except for CMS; looking into it

  • dashboards - ntr
  • GGUS
    • service was down on April 2 between 04:00 and 06:00 UTC for emergency update of core routers at KIT
  • grid services - ntr
  • storage - ntr

AOB:

Tuesday

Attendance: local(Alessandro, Alex, Daniele, Jarka, Maarten, Maria D, Massimo, Michail, Przemek, Stefan);remote(Gonzalo, Jhen-Wei, Lisa, Marc, Paolo, Pavel, Rob, Roger, Rolf, Ronald, Tiju).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
    • T1s/CalibrationT2s
      • INFN-NAPOLI-ATLAS GGUS:80831. Answer from the site: it was not a site error; the problem was in LHCOne routing at GARR. Transfers on other FTS channels did not fail, and transfers from CERN to Napoli are now also going well.

  • CMS reports -
    • LHC machine / CMS detector
      • Today: loss maps at injection, chromaticity at flat top and squeeze, 50 ns injection commissioning
      • Tomorrow: Stable Beams rescheduled for the morning. Intensity ramp-up will follow immediately
    • CERN / central services
      • VOMS SLS status down to 25%, then 0% (from Apr 2, ~6pm, to Apr 3, ~10am). The same (likely related) behaviour was observed for the CRAB servers
      • we read on the IT-SSB (here) about an incident: "Oracle Test and Dev database and application services not accessible". FYI: we have not collected any evidence that we were impacted by it.
    • Tier-0:
      • CMS T0 availability drop in SLS observed yesterday (here). CMS experts confirmed that all T0 subcomponents are up and running. Investigation showed it was a false positive, caused by the cmsprod account webservice not responding (error: "Attempting to reach the xml files through the browser returns a 101 error: Error 101 (net::ERR_CONNECTION_RESET): The connection was reset."), which prevented the SLS pages from fetching the updated xmls (SNOW here, SOLVED). All back to normal now.
    • Tier-1:
      • [follow-up] FNAL: errors in transfers to Imperial College T2 (Savannah:127552). Update: they were doing some clean-up which caused transient errors. SOLVED.
      • [follow-up] RAL: JobRobot errors, possibly related to BLAH (Savannah:127566). Update: there were problems with the batch server overnight which affected the CMS JobRobot jobs. Everything returned to normal this morning, and there have been many successful JobRobot jobs since. SOLVED.
      • [follow-up] IN2P3: JobRobot errors, same as above (Savannah:127569, GGUS:80972). Update: at the Computing Ops meeting yesterday, it became clear that CMS is mistakenly sending JR jobs to the T1-colocated T2, and a fix is needed on the CMS side. Issue not to be addressed to IN2P3 T1, we will follow up.
    • Tier-2:
      • business as usual

  • LHCb reports -
    • Final validation of 2012 workflows starting
    • Stripping jobs (both Re-Stripping 17b & Pre-Stripping 18) are ~complete at all sites except GridKa
    • MC simulation at Tier-2s ongoing
    • New GGUS (or RT) tickets
    • T0
      • NTR
    • T1
      • GridKa: staging progressing very slowly since last week (GGUS:80794)
      • Upgrade of LFC server versions needed at GRIDKA (GGUS:80777), RAL (GGUS:80775)
      • Gridka conditions DB not accessible by CERN/SLS sensors (GGUS:79800)
      • PIC in downtime today, site banned

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • FTS 2.2.8 upgrade went OK
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC
    • The scheduled intervention at PIC is progressing as planned. After the maintenance tasks, power and cooling were restored in the building about an hour ago. We are now starting to power on all the services; in about two hours all of them should be back online.
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • FTS
    • each T1 should ensure they have checked the known issues for FTS 2.2.8 and apply workarounds as described
    • fixes will come incorporated in EMI-1 Update 15
    • Alessandro: BNL upgrade to v2.2.8 went OK yesterday, only the monitoring info is not published yet
      • Michail: they should contact FTS Support for that
  • GGUS/SNOW - ntr
  • grid services - ntr
  • storage
    • CASTOR ALICE disk servers very full
      • big reprocessing campaign ongoing
    • 1 LHCb disk server died during a short intervention; list of affected files will be made available and the admins will try to rescue the data

AOB:

Wednesday

Attendance: local(Alessandro, Alex, Ignacio, Jarka, Luca C, Maarten, Maria D, Massimo);remote(Burt, Daniele, Giovanni, Jhen-Wei, Joel, Marc, Michael, Rob, Roger, Ron, Tiju, Xavier).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • LFC registration is slower than usual. Issue tracked in SNOW: https://cern.service-now.com/service-portal/view-incident.do?n=INC118478 . Number of threads per FE set to 90.
        • Ignacio: 4th node added; the DB can handle the additional connections; could there be a problem with leaked sessions?
          • after the meeting: the load balancer configuration will be changed to have the best 3 nodes behind the alias (was 2)
        • Alessandro: we are trying to work around a glibc misfeature; on the central services we already resolve the alias ourselves and pick a random host; could we let the load balancer run more frequently?
          • probably not
      • CERN-PROD: ~0.1% of srmbringonline calls return an invalid request identifier, which then cannot be used to query the bring-online request. GGUS:80865
        • Massimo: new SRM version to be deployed in June, is it early enough?
        • Alessandro: issue does not look critical so far
      • BNL-ATLAS VOMS server not accessible GGUS:80902
    • T1s/CalibrationT2s
      • TRIUMF tape system writing got stuck: from the site: "Thanks for notice. I had a quick look. System looks ok, but just one tape reached nearly too full, it's still below the threshold to mark the tape as 'full', but the coming files are too big. Thus, the writing got stuck. Now, everything looks ok to me."
        • ATLAS suggests Tier-1s pay particular attention to their tape systems now that data taking is restarting: this TRIUMF issue is the second one ATLAS has observed in the past 15 days. Tier-1s, please check the request that was made at the WLCG MB to monitor tape metrics
      • FZK-LCG2 FTS server errors. Error message "proxy expired" (already appeared at RAL GGUS:80471) GGUS:80899

  • CMS reports -
    • LHC machine / CMS detector
      • First fill with stable beams 3x3 has been moved to tonight. The plan is to keep it for 30 min to 1 hr
    • CERN / central services
      • CRAB server availability in SLS dropped to 70% this morning at ~9am, then again at ~1pm. Fixed once, but it recurred. Needs more checks.
    • Tier-0:
      • An rfstat hanging on Castor caused the CMS T0 to hang for a while. Alarm ticket GGUS:80905 opened. Now fixed.
      • CMS T0 availability drops in SLS observed again (here), now in CMSTO-permanent-failures and in CMSTO-long-jobs. Again, not really correlated with anything wrong in the crucial T0 components. Let the first collisions get started; we will focus later on hunting more root causes for these "false positives"
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • LHCb reports -
    • Final validation of 2012 workflows starting
    • Stripping jobs (both Re-Stripping 17b & Pre-Stripping 18) are ~complete at all sites except GridKa
    • MC simulation at Tier-2s ongoing
    • New GGUS (or RT) tickets
    • T0
      • CERN : loss of one disk server. No notification to LHCb.
        • Massimo: the vendor changed a controller, to no avail; we will continue our attempts until some time tomorrow and stay in touch with LHCb
        • Joel: OK, we would like to resolve the situation (recovery or loss) before the long weekend; more than 40k files affected...
    • T1
      • GridKa: staging progressing very slowly since last week (GGUS:80794)
      • Upgrade of LFC server versions needed at GRIDKA (GGUS:80777), RAL (GGUS:80775)
      • Gridka conditions DB not accessible by CERN/SLS sensors (GGUS:79800)
    • T2
      • IN2P3-T2: number of jobs per user on GE too high !!!! (GGUS:80850)
        • Marc: will discuss issue with batch system experts and update ticket

Sites / Services round table:

  • ASGC - ntr
  • BNL - nta
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • last night the batch system developed a problem; at 11:00 CEST today it was restarted and currently looks stable
  • KIT - nta
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW
    • see AOB
  • grid services - ntr
  • storage - ntr

AOB: (MariaDZ) The interface developers of local ticketing systems were informed about some GGUS fields being withdrawn (Savannah:127148) and others becoming mandatory (Savannah:127146). Countries/organisations concerned are DE, ES, FR, IT, CERN and OSG. The changes will take effect at the next GGUS Release on 2012/04/25. Please send GGUS issues for tomorrow's T1SCM a.s.a.p.

Thursday

Attendance: local(Alessandro, Alex, Andrea, Eddie, Ignacio, Jan, Maarten, Stefan, Stephane);remote(Andreas M, Daniele, Gareth, Jhen-Wei, Joel, Kyle, Lisa, Marc, Michael, Paolo, Roger, Ronald).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • LFC registration is much better now, thanks to the additional machine included and to the fact that 3 machines are now returned by the LB.
        • Alessandro: to lower the impact of the ordered alias resolution by glibc, the central services resolve the LFC alias themselves to pick a random host explicitly; investigation of further improvements will continue, e.g. in the DB usage
    • T1s/CalibrationT2s
      • BNL-ATLAS FTS issue GGUS:80912
        • Michael: as the issue occurs very rarely (20 failures for 25k transfers) and an affected transfer then succeeds on retry, ATLAS operations do not suffer; the experts will have a look
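The workaround mentioned in the ATLAS report above — resolving the LFC alias on the client side and picking a random host, instead of relying on glibc's ordered getaddrinfo() result — can be sketched roughly as follows (the alias name in the usage comment is hypothetical):

```python
import random
import socket

def pick_random_host(alias, port=80):
    """Resolve a DNS alias and pick one of its hosts at random.

    glibc's getaddrinfo() returns addresses in a preference order
    (RFC 3484), so clients that always take the first entry tend to
    pile onto the same node; a random choice spreads the load over
    all nodes currently behind the alias.
    """
    infos = socket.getaddrinfo(alias, port, 0, socket.SOCK_STREAM)
    # infos entries are (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string
    addresses = sorted({info[4][0] for info in infos})
    return random.choice(addresses)

# Usage (hypothetical alias, for illustration only):
# host = pick_random_host("prod-lfc-atlas.example.org")
```

This moves the load-spreading decision to the client, which is why the ATLAS central services are less sensitive to how the load balancer orders or limits the hosts behind the alias.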

  • CMS reports -
    • LHC machine / CMS detector
      • First stable beams with 3x3 and 2 bunches colliding in CMS last night at 00:38.
      • This morning at 4:50 stable beams again, with 48 bunches.
    • CERN / central services and T0
      • rfstat stuck caused issues at the CMS T0 (GGUS:80905)
        • Daniele: ticket kept open for followup details
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • Final validation of 2012 workflows starting
    • Stripping jobs (both Re-Stripping 17b & Pre-Stripping 18) are ~complete at all sites except GridKa
    • MC simulation at Tier-2s ongoing
    • New GGUS (or RT) tickets
    • T0
      • CERN : loss of one disk server. No notification to LHCb (GGUS:80973)
    • T1
      • GridKa: staging progressing very slowly since last week (GGUS:80794)
      • Upgrade of LFC server versions needed at GRIDKA (GGUS:80777), RAL is done (GGUS:80775)
      • Gridka conditions DB not accessible by CERN/SLS sensors (GGUS:79800)
      • Joel: all Tier-1s should please check that their file class / tape family definitions are OK for this year's data, to prevent file collections from getting scattered over many tapes
    • T2

Sites / Services round table:

  • ASGC - ntr
  • BNL - nta
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • unscheduled downtime 13:30-15:30 UTC for operations on the batch system
    • LHCb ticket GGUS:80850 was already verified, hence we could not add to it that the maximum on the total number of jobs per user is there to protect the batch system
      • Joel: it is OK now, we have adjusted the DIRAC configuration for those CEs accordingly
  • KIT - nta
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW - ntr
  • grid services - ntr
  • storage
    • CMS ticket GGUS:80905 is not specific to CMS: 1 Name Service daemon was in a funny state, which led to the loss of 3 CMS files; incident report in preparation
    • LHCb disk server #1: vendor intervention did not happen so far; checkpoint at 16:00 CEST, else after Easter
    • LHCb disk server #2: short downtime for battery replacement in BBU

AOB:

Friday - No meeting - CERN closed for Easter

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 22-Mar-2012

Topic revision: r12 - 2012-04-05 - MaartenLitmaath
 