Week of 111107

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Massimo, MariaDZ, Ricardo, Doug, Jhen-Wei, David, Jacek);remote(Rolf, Onno, Lisa, Michael, Mette, Gonzalo, Gareth, RobQ, Dimitri, Paolo).

Experiments round table:

  • ATLAS reports -
    • T0
      • CERN-PROD transfer failures: "SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] [srm2__srmPrepareToGet". GGUS:76031. Recovered within one hour and no further errors appeared. Possibly a temporary problem with threads being busy during that period.
      • CERN-PROD slow LSF response time. GGUS:76039. Hardware problem affecting the LSF master machine. Now running on the secondary master while the RAID array is rebuilding on the primary master. The response times to the commands already look good.
      • Transparent upgrade of CASTOR 2.1.11-8 on CASTOR Atlas. 7 Nov. 10:00 - 16:00 UTC. Completed at 10:30 UTC.
      • Patching of ATLAS online production database (ATONR) - rolling intervention. 7 Nov. 10:30 - 12:30 UTC. Completed at 10:45 UTC.
      • WLCG production database (LCGR) rolling intervention. 7 Nov. 11:00 - 13:00 UTC.
    • T1 sites
      • BNL scheduled downtime - Three days of facility maintenance for all services. 7-9 November.
      • Taiwan-LCG2 scheduled downtime - Two days maintenance for Castor upgrade. 7-8 November.
    • T2 sites
      • ntr

  • LHCb reports -
    • Experiment activities
      • Reprocessing is now being ramped up
      • Stripping is continuing at CERN
    • T1
      • PIC : GGUS:76028: FTS transfer problem. Fixed
      • PIC : problem of space on pool behind LHCb-tape space token.

Sites / Services round table:

  • ASGC: ntr
  • BNL
    • Intervention at BNL started at 13:00 UTC. Services at US ATLAS Tier-1 expected to be restored by Wednesday, November 9 at 18:00 UTC.
  • FNAL: ntr
  • PIC: LHCb's tickets are fixed, but the one still open is being monitored because FTS transfers appeared blocked due to a dCache problem. The dCache developers have been contacted.
  • NDGF: There was a network problem today with a fibre that cut off the University. It seems to be solved now.
  • NL_T1: ATLAS dark data was found at SARA; it seems to be old data and is now being cleaned up.
  • CNAF: ntr
  • RAL: A patch was installed this morning to fix the over-reporting of tape capacity.
  • KIT: There will be a scheduled intervention next Monday to Wednesday on the ATLAS LFC instances; on the Tuesday specifically, changes will be applied to the ATLAS dCache.
  • OSG: A maintenance session will take place tomorrow (Tue) between 14:00 and 18:00 UTC that will affect several OSG services. One of the resulting improvements will be the possibility to include ticket attachments (important for the interface with GGUS).
  • IN2P3: There are 4 CMS tickets on network issues. Performance numbers should be added to the CMS tickets to help debugging! GGUS:75514 is waiting for a reply from LHCb. There is a problem with the IN2P3 ticketing system's interface to GGUS, tracked via GGUS:76041.
  • CERN:
    • Grid Services: Busy checking Atlas' reported LSF problem.
    • Dashboards: ntr
    • Storage Services: There is an intervention on the name service today and another scheduled for this Wednesday; both should be transparent.
    • Databases: Oracle Security patches being applied now.

AOB: (MariaDZ) ALARM drills for 4 weeks are attached at the end of this page. The GGUS-SNOW interface was not working from last Friday until this morning. Tracked via GGUS:76052. SIR in preparation.

Tuesday:

Attendance: local(Adam, Douglas, Eddie, Maarten, Maria D, Massimo, Nicolo, Steve);remote(Burt, Gareth, Gonzalo, Jhen-Wei, Joel, Kyle, Mette, Michael, Paco, Rolf, Xavier).

Experiments round table:

  • ATLAS reports -
    • Tier-0
      • Rolling upgrades of Oracle machines today, affecting the ATLR and ADCR databases. Upgrades happened between 10:00 and 11:30, and no problems were reported.
      • Problems with SRM access to CERN-PROD, failures like "failed to contact on remote SRM [httpg://srm-eosatlas.cern.ch:8443/srm/v2/server]". This was a problem yesterday, and was fixed with a system reboot at 10:50pm, but the problem has returned this afternoon. This is a daily occurrence this week, and was also reported a few times each in the previous few weeks. GGUS:76123.
    • Tier-1
      • BNL outage continues for today. Offline status correctly set in ATLAS systems, no problems to report at this time.
      • Taiwan CASTOR outage continues today, status is correctly set in systems, nothing to report.
        • Jhen-Wei: more time is needed, downtime extended until Wed 11:00 UTC

  • CMS reports -
    • LHC / CMS detector
      • Technical stop
      • Preparing for HI run
    • CERN / central services
      • LSF master node issues affecting T0 during the weekend, GGUS:76045 TEAM opened. VOBOXes used for CMS T0 submission were reconfigured to contact LSF secondary master on Monday. Ticket escalated to ALARM when LSF stopped responding again around 16:00, understood to be caused by switching back to the primary master. T0 recovered after that, ticket closed.
      • No issue observed during T0 FTS upgrade
      • CASTORSRM degraded: https://sls.cern.ch/sls/history.php?id=CASTOR-SRM_CMS&more=availability&period=24h - but minimal impact on transfers.
    • T1 sites:
      • Processing and/or MC running at all sites.
      • [T1_IT_CNAF] 2 files waiting for migration to tape for 26 days, GGUS:76090
      • [T1_DE_KIT]: Debugging network issues with US T2s GGUS:75985
      • [T1_FR_IN2P3]: investigation on networking issues in progress
    • T2 sites

  • LHCb reports -
    • Experiment activities
      • Reprocessing is now being ramped up
      • Stripping is continuing at CERN
    • T0
    • T1
    • Steve: last Friday's FTS authorization problem was due to a simple configuration input mistake, viz. LHCb not being in the list of VOs to support!

Sites / Services round table:

  • ASGC - nta
  • BNL
    • maintenance work progressing according to schedule
    • Steve: notice - when the CERN-BNL FTS channel is switched on again, it will be using the upgraded FTS agent for the first time; all other channels are looking OK so far
    • Nicolo: in PhEDEx the FTS agent upgrade was invisible
  • CNAF
    • CMS ticket: main issue fixed, residual matter in progress
  • FNAL - ntr
  • IN2P3
    • various tickets waiting for reply
    • unscheduled downtime of tape robot tomorrow between 8:00 and 15:00 UTC to apply fixes for issues that caused Oct 31 incident (internal switch + microcode update); file staging will be unavailable
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • SARA dark data mentioned yesterday affects both ATLAS and LHCb; it is being cleaned up
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • CASTOR/EOS
    • transparent upgrade of CASTOR name server front-end nodes on Thu
    • CMS SLS issue: for each experiment the SLS requests now go to the proper SRM and the default pool - maybe a better pool can be decided per experiment
      • Nicolo: the t1transfer pool would make SLS show better what CMS experiences as the state of the CASTOR SRM
  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW - ntr
  • grid services - ntr

AOB:

Wednesday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 31-Oct-2011

Topic attachments
ggus-data_MB_20111108.ppt (PowerPoint, r1, 2508.0 K, 2011-11-07 14:41, MariaDimou)
Topic revision: r9 - 2011-11-09 - MaartenLitmaath
 