Week of 111107

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Massimo, MariaDZ, Ricardo, Doug, Jhen-Wei, David, Jacek);remote(Rolf, Onno, Lisa, Michael, Mette, Gonzalo, Gareth, RobQ, Dimitri, Paolo).

Experiments round table:

  • ATLAS reports -
    • T0
      • CERN-PROD transfer failures: "SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] [srm2__srmPrepareToGet". GGUS:76031. Recovered within one hour and no further errors appeared; possibly a temporary problem with threads being busy during that period.
      • CERN-PROD slow LSF response time. GGUS:76039. Hardware problem affecting the LSF master machine; now running on the secondary master while the RAID array on the primary master rebuilds. Response times to LSF commands already look good (a timing sketch follows this report).
      • Transparent upgrade of CASTOR 2.1.11-8 on CASTOR Atlas. 7 Nov. 10:00 - 16:00 UTC. Completed at 10:30 UTC.
      • Patching of ATLAS online production database (ATONR) - rolling intervention. 7 Nov. 10:30 - 12:30 UTC. Completed at 10:45 UTC.
      • WLCG production database (LCGR) rolling intervention. 7 Nov. 11:00 - 13:00 UTC.
    • T1 sites
      • BNL scheduled downtime - Three days of facility maintenance for all services. 7-9 November.
      • Taiwan-LCG2 scheduled downtime - Two days maintenance for Castor upgrade. 7-8 November.
    • T2 sites
      • ntr
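
A quick way to sanity-check the LSF response times mentioned in the T0 report above is to time a simple batch query. This is only a sketch, assuming an LSF client environment; the choice of the "bjobs -u all" command is illustrative and not taken from these minutes.

    # Sketch: time an LSF client command to gauge scheduler responsiveness.
    # Assumes an LSF client environment; "bjobs -u all" is an illustrative choice.
    import subprocess
    import time

    def time_command(cmd, timeout=60):
        """Run cmd, discard its output and return (elapsed seconds, return code)."""
        start = time.time()
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=timeout)
        return time.time() - start, proc.returncode

    if __name__ == "__main__":
        elapsed, rc = time_command(["bjobs", "-u", "all"])
        print("bjobs took %.1f s (rc=%d)" % (elapsed, rc))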

  • LHCb reports -
    • Experiment activities
      • Reprocessing is now being ramped up
      • Stripping is continuing at CERN:
    • T1
      • PIC: GGUS:76028: FTS transfer problem. Fixed.
      • PIC: space shortage on the pool behind the LHCb-tape space token.

Sites / Services round table:

  • ASGC: ntr
  • BNL
    • Intervention at BNL started at 13:00 UTC. Services at US ATLAS Tier-1 expected to be restored by Wednesday, November 9 at 18:00 UTC.
  • FNAL: ntr
  • PIC: LHCb's tickets are fixed except the one still open, which is being monitored: FTS transfers appeared blocked due to a dCache problem and the dCache developers have been contacted.
  • NDGF: There was a network problem today with a fibre that cut off the University. It seems to be solved now.
  • NL_T1: ATLAS dark data was found at SARA; it appears to be old data and is now being cleaned up.
  • CNAF: ntr
  • RAL: A patch was installed this morning to fix the over-reporting of tape capacity.
  • KIT: There will be a scheduled intervention on the ATLAS LFC instances from next Monday to Wednesday; in addition, changes will be applied to the ATLAS dCache next Tuesday.
  • OSG: A maintenance session will take place tomorrow (Tuesday) between 14:00 and 18:00 UTC and will affect several OSG services. One of the resulting improvements will be the possibility to include ticket attachments (important for the interface with GGUS).
  • IN2P3: There are 4 CMS tickets on network issues; CMS should add performance numbers to the tickets to help the debugging. GGUS:75514 is waiting for a reply from LHCb. There is a problem with the IN2P3 ticketing system's interface to GGUS, tracked via GGUS:76041.
  • CERN:
    • Grid Services: Busy checking the LSF problem reported by ATLAS.
    • Dashboards: ntr
    • Storage Services: There is an intervention today and another scheduled for this Wednesday, both concerning the name service; they should be transparent.
    • Databases: Oracle Security patches being applied now.

AOB: (MariaDZ) ALARM drills for 4 weeks are attached at the end of this page. The GGUS-SNOW interface was not working from last Friday until this morning; tracked via GGUS:76052. A SIR is in preparation.

Tuesday:

Attendance: local(Adam, Douglas, Eddie, Maarten, Maria D, Massimo, Nicolo, Steve);remote(Burt, Gareth, Gonzalo, Jhen-Wei, Joel, Kyle, Mette, Michael, Paco, Paolo, Rolf, Xavier).

Experiments round table:

  • ATLAS reports -
    • Tier-0
      • Rolling upgrades of Oracle machines today, affecting the ATLR and ADCR databases. The upgrades happened between 10:00 and 11:30, and no problems were reported.
      • Problems with SRM access to CERN-PROD, with failures like "failed to contact on remote SRM [httpg://srm-eosatlas.cern.ch:8443/srm/v2/server]". This was a problem yesterday and was fixed with a system reboot at 10:50 pm, but the problem has returned this afternoon. This is now a daily occurrence and was also reported a few times in each of the previous few weeks. GGUS:76123. (A minimal reachability sketch follows this report.)
    • Tier-1
      • BNL outage continues for today. The offline status is correctly set in ATLAS systems; no problems to report at this time.
      • Taiwan CASTOR outage continues today; the status is correctly set in ATLAS systems, nothing to report.
        • Jhen-Wei: more time is needed, downtime extended until Wed 11:00 UTC
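
As context for the recurring "failed to contact on remote SRM" errors in GGUS:76123 above, one trivial check is whether the SRM port accepts TCP connections at all. The sketch below does only that (it does not speak the SRM protocol) and uses the endpoint quoted in the ticket; it merely separates "host/port unreachable" from failures at the SRM level.

    # Sketch: TCP-level reachability check for the SRM endpoint from GGUS:76123.
    # It does not speak SRM; it only tells "port unreachable" apart from
    # failures higher up in the SRM protocol stack.
    import socket

    ENDPOINT = ("srm-eosatlas.cern.ch", 8443)  # endpoint quoted in the ticket

    def port_open(host, port, timeout=10):
        """Return True if a TCP connection to (host, port) succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        host, port = ENDPOINT
        state = "accepts connections" if port_open(host, port) else "is NOT reachable"
        print("%s:%d %s" % (host, port, state))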

  • CMS reports -
    • LHC / CMS detector
      • Technical stop
      • Preparing for HI run
    • CERN / central services
      • LSF master node issues affecting T0 during the weekend, GGUS:76045 TEAM opened. VOBOXes used for CMS T0 submission were reconfigured to contact LSF secondary master on Monday. Ticket escalated to ALARM when LSF stopped responding again around 16:00, understood to be caused by switching back to the primary master. T0 recovered after that, ticket closed.
      • No issue observed during T0 FTS upgrade
      • CASTOR SRM degraded: https://sls.cern.ch/sls/history.php?id=CASTOR-SRM_CMS&more=availability&period=24h - but minimal impact on transfers. (An SLS liveness sketch follows this report.)
    • T1 sites:
      • Processing and/or MC running at all sites.
      • [T1_IT_CNAF] 2 files waiting for migration to tape for 26 days, GGUS:76090
      • [T1_DE_KIT]: Debugging network issues with US T2s GGUS:75985
      • [T1_FR_IN2P3]: investigation on networking issues in progress
    • T2 sites
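
For the degraded CASTOR SRM availability noted above, a minimal liveness check is simply to poll the SLS page referenced in the report. This sketch only verifies that the page responds; extracting the actual availability number would require SLS's data interface, which these minutes do not describe, so no parsing is attempted.

    # Sketch: liveness check of the SLS availability page referenced above.
    # It only verifies that the page responds; it does not parse availability.
    import urllib.request

    SLS_URL = ("https://sls.cern.ch/sls/history.php"
               "?id=CASTOR-SRM_CMS&more=availability&period=24h")

    def fetch_status(url, timeout=15):
        """Return (HTTP status code, number of bytes received)."""
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            return resp.status, len(body)

    if __name__ == "__main__":
        status, size = fetch_status(SLS_URL)
        print("SLS responded with HTTP %d (%d bytes)" % (status, size))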

  • LHCb reports -
    • Experiment activities
      • Reprocessing is now being ramped up
      • Stripping is continuing at CERN:
    • T0
    • T1
    • Steve: last Friday's FTS authorization problem was due to a simple configuration input mistake, viz. LHCb not being in the list of VOs to support!

Sites / Services round table:

  • ASGC - nta
  • BNL
    • maintenance work progressing according to schedule
    • Steve: notice - when the CERN-BNL FTS channel is switched on again, it will be using the upgraded FTS agent for the first time; all other channels are looking OK so far
    • Nicolo: in PhEDEx the FTS agent upgrade was invisible
  • CNAF
    • CMS ticket: the main issue is fixed; the residual issue is being followed up.
  • FNAL - ntr
  • IN2P3
    • various tickets are waiting for a reply
    • unscheduled downtime of the tape robot tomorrow between 8:00 and 15:00 UTC to apply fixes for the issues that caused the Oct 31 incident (internal switch + microcode update); file staging will be unavailable
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • SARA dark data mentioned yesterday affects both ATLAS and LHCb; it is being cleaned up
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • CASTOR/EOS
    • transparent upgrade of CASTOR name server front-end nodes on Thu
    • CMS SLS issue: for each experiment the SLS requests now go to the proper SRM and the default pool - perhaps a better pool can be chosen per experiment
      • Nicolo: the t1transfer pool would make SLS show better what CMS experiences as the state of the CASTOR SRM
  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW - ntr
  • grid services - ntr

AOB:

Wednesday

Attendance: local(David, Douglas, Jhen-Wei, Maarten, Massimo, Nicolo, Ricardo, Steve);remote(Dimitri, Gonzalo, Joel, John, Lisa, Mette, Michael, Paolo, Rob, Rolf, Ron).

Experiments round table:

  • ATLAS reports -
    • Tier-0
      • SRM problems at CERN-PROD happened yet again; the ticket was updated with the new errors. These stability issues now occur more than daily. GGUS:76123.
        • Massimo: the EOS SRM was restarted after some time was spent to try and understand why it was stuck; we also are looking into a more robust configuration for it; more news in the coming days
      • Upgrades to Oracle continue today, with the ATLARC database servers.
      • Dashboard intervention today, and service is out from 14:00-15:00.
      • VOMS DB intervention to restore the ATLAS VO's ability to update the database. We have not heard whether this is done, or how well it went.
        • Steve: the operation is ongoing, the progress is fine; more news later today
        • Maarten: can the other experiments also run into the problem?
        • Steve: CMS already ran into it and recently got the same fix, viz. a dump + restore of their DB to obtain an automatic counter reset, after which the DB should again be good for a few years; ALICE and LHCb are still very far below the limit in question. (A generic counter-headroom sketch follows this report.)
    • Tier-1
      • BNL outage continues today, but should end this evening at 21:00 CERN time. Is this still the case?
        • Michael: looking OK so far
      • Taiwan CASTOR outage is over today, ATLAS systems are set to on-line for the site.
        • Jhen-Wei: will check the current state
      • The name for the site IN2P3 was changed from "LYON" to "IN2P3-CC" in internal ATLAS systems today.
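
The VOMS DB discussion above describes the underlying issue as a counter approaching a limit, fixed for CMS by a dump + restore. The sketch below is not that procedure; it is a generic, hedged illustration of monitoring how close Oracle sequence counters are to their MAXVALUE, assuming the counter in question is an Oracle sequence (not stated in the minutes). The connection string is a placeholder.

    # Hedged sketch: report Oracle sequences that are close to their MAXVALUE.
    # NOT the CERN procedure (the minutes describe a dump + restore); this only
    # illustrates monitoring counter headroom in general.
    import cx_Oracle  # assumes the cx_Oracle client library is available

    def sequence_headroom(conn):
        """Yield (name, last_number, max_value, fraction_used) per sequence."""
        cur = conn.cursor()
        cur.execute(
            "SELECT sequence_name, last_number, max_value FROM user_sequences")
        for name, last, maxv in cur:
            yield name, last, maxv, float(last) / float(maxv)

    if __name__ == "__main__":
        conn = cx_Oracle.connect("user/password@dbhost/service")  # placeholder DSN
        for name, last, maxv, frac in sequence_headroom(conn):
            if frac > 0.8:  # warn when a counter has used 80% of its range
                print("%s at %.0f%% of MAXVALUE" % (name, 100 * frac))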

  • CMS reports -
    • LHC / CMS detector
      • Technical stop
      • Cosmic runs
    • CERN / central services
      • NTR
    • T0
      • Deployed config for HI running, in testing
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing starting.
      • [T1_IT_CNAF] 2 files waiting for migration to tape for 26 days, GGUS:76090 - fixed. Several more files are stuck in migration because they were produced by CMS with wrong LFN pattern: asked CNAF to define migration policy for these files.
      • [T1_IT_CNAF]: Transient SRM SAM test failures tonight, GGUS:76154 - closed.
      • [T1_DE_KIT]: Debugging network issues with US T2s GGUS:75985
      • [T1_FR_CCIN2P3]: investigation on networking issues in progress GGUS:75983 GGUS:75919 GGUS:71864
      • [T1_FR_CCIN2P3]: GGUS:75397 about FTS channel config: configuration of shared FTS channel now looks OK, network issues on slow links to be followed up in other tickets, can be closed.
      • [T1_FR_CCIN2P3]: GGUS:75829 about proxy expiration in JobRobot test: happened again on Nov 5th on cccreamceli02.in2p3.fr
      • [T1_TW_ASGC]: GGUS:75377 about read errors on CASTOR: pending, investigation to resume after end of downtime.

  • LHCb reports -
    • Experiment activities
      • Reprocessing is progressing well
      • Stripping ongoing
    • T0
    • T1
      • IN2P3: close GGUS:75703 ?
        • Rolf: let's follow up in the ticket
      • KIT: CVMFS deployment plans?
        • Dimitri: will ask and report tomorrow
      • SARA: CVMFS deployment plans?
        • Ron: planned for the next 2 weeks

Sites / Services round table:

  • ASGC - nta
  • BNL
    • expect to be back on time
    • tried one extra network intervention (multiple spanning trees), but ran into problems and had to back out
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • robot downtime proceeding OK
    • network problems affecting CMS transfers: Renater and Geant network providers are investigating, there appear to be packet losses, cause unknown; transfers to a Japanese T2 are affected as well
    • incomplete storage dumps for LHCb: are the missing space tokens the ones that were removed at the request of LHCb?
      • Joel: will check and resolve the confusion in the ticket
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • CASTOR/EOS - nta
  • dashboards
    • upgrade of ATLAS DDM dashboard went OK
  • grid services - nta

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

  • ATLAS reports -
    • Tier-0
      • No problems seen with the CERN-PROD SRM today.
    • Tier-1
      • BNL outage continues today, but should be over very soon.

  • CMS reports -
    • LHC / CMS detector
      • Technical stop
      • Cosmic runs
    • CERN / central services
    • T0
      • Deployed config for HI running, in testing
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing starting.
      • [T1_IT_CNAF] GGUS:76090 - files stuck in migration because they were produced by CMS with wrong LFN pattern: asked CNAF to define migration policy for these files.
      • [T1_IT_CNAF] GGUS:76175 - degraded imports from other T1s in PhEDEx Debug instance
      • [T1_DE_KIT]: Debugging network issues with US T2s GGUS:75985
      • [T1_DE_KIT]: JobRobot failures - jobs expired after staying in WMS queue with "no compatible resources" - GGUS:76191
      • [T1_FR_CCIN2P3]: investigation on networking issues in progress GGUS:75983 GGUS:71864
      • [T1_FR_CCIN2P3]: GGUS:75829 about proxy expiration in the JobRobot test: happened again on Nov 5th and Nov 9th on cccreamceli02.in2p3.fr - also affecting T2_BE_IIHE (SAV:124597, SAV:124471) - possible CREAM bug? (A proxy-lifetime check is sketched after this report.)
      • [T1_TW_ASGC]: GGUS:75377 and GGUS:76204 about read errors on CASTOR
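
Since GGUS:75829 above is a recurring proxy expiration in the JobRobot tests, one simple guard on the submission side is to check the remaining proxy lifetime before submitting. This is a minimal sketch, assuming a VOMS client environment where voms-proxy-info accepts the -timeleft option (printing the remaining seconds); the 6-hour threshold is arbitrary.

    # Sketch: refuse to submit when the VOMS proxy has too little lifetime left.
    # Assumes a VOMS client environment; "-timeleft" prints remaining seconds.
    import subprocess

    MIN_SECONDS = 6 * 3600  # arbitrary threshold: require at least 6 hours

    def proxy_seconds_left():
        """Return the remaining proxy lifetime in seconds, or 0 on any error."""
        try:
            out = subprocess.run(["voms-proxy-info", "-timeleft"],
                                 capture_output=True, text=True, check=True)
            return int(out.stdout.strip())
        except (subprocess.CalledProcessError, ValueError, OSError):
            return 0

    if __name__ == "__main__":
        left = proxy_seconds_left()
        verdict = "OK to submit" if left >= MIN_SECONDS else "renew the proxy first"
        print("proxy lifetime: %d s -> %s" % (left, verdict))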

  • LHCb reports -
    • Experiment activities
      • Reprocessing is progressing well
      • Stripping ongoing
    • T0
    • T1
      • IN2P3 : (GGUS:75158) : migration of files to the correct space token.
      • GRIDKA: (GGUS:75851): benchmark problem on some nodes

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 31-Oct-2011

Topic attachments
  • ggus-data_MB_20111108.ppt (PowerPoint, 2508.0 K, 2011-11-07 14:41, MariaDimou)