Week of 110103

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Tuesday:

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Wednesday

Attendance: local(Alessandro, Cedric, David, Edoardo, Jamie, Jan, Julia, Maarten, Manuel, Maria G, Massimo, Peter, Roberto, Stefan);remote(Daniele, Dimitri, Jeremy, John, Jon, Michael, Onno, Rob, Roger, Rolf).

Experiments round table:

  • ATLAS reports -
    • Xmas break main issues
      • RAL-LCG2_MCDISK full. UK cloud set to brokeroff.
      • SRMV2STAGER errors: many errors reported by shifters. In fact the Site Services staging machines do not show successes ("ONLINE GUID"), only failures ("FAILED GUID"), often due to retry periods that are too short.
      • Some files lost on CASTOR at CERN after the power cut
    • Jan 05 (Wed)
      • Disk space problem on RAL-LCG2_MCDISK.
        • Deletion backlog exhausted. More deletions will be submitted soon to free more space (ESDs that were copied to tape before the Xmas break).
        • In parallel ~200 TB moved from DATADISK to MCDISK
      • One Site Services bug quickly fixed.
    • David: all the CE tests appear as if they are not running
      • Alessandro: the tests actually run, but the results are not propagated; to be investigated by the SAM team; no major issues seen at the sites

  • CMS reports -
    • Experiment activity
      • Shutdown activities
      • Computing Operations and Shifts ran in reduced mode during the CERN Xmas break; normal operations resumed today.
    • CERN and Tier0
      • Tier-0 still down
      • CMS Dashboard: issue with the CMS Site Status Board not working properly, affecting CMS severely, in particular for getting the Site Readiness tables/plots updated (first reported on Dec 23, see Savannah:118483)
        • Identified by the Dashboard team as being caused by the high update rate of some CMS metrics in the Oracle SSB table, which affects the update efficiency of all CMS metrics in the SSB
        • The data collector was temporarily modified during the break by Pablo, but the long-term solution will be to have separate DB tables for each metric to be monitored in the CMS SSB
        • It is not clear whether the current fix is satisfactory for CMS: at least the CMS CRC is not able to monitor the usual historical Site Readiness plots (to be discussed further on the above ticket)
        • The CMS Computing and Dashboard teams need to put into practice the plans discussed on Dec 1, 2010: reinforcement of the critical coverage of the CMS Dashboard production instances plus a coordinated upgrade schedule
    • Tier1
      • Good Site Availability during the break : CMS T1 SAM dashboard
        • Monitoring issue: FNAL is marked as being in maintenance in the CMS Site Status Board; however, this is incorrect! See Savannah:118536
          • Julia: SAM returns FNAL SRM as being in maintenance
          • Jon: the SRM that is in maintenance is not for production, but only for tests/debugging
      • Data and MC Production (CMSSW 3_9_7) during XMas break :
        • Data Re-Reco/Skimming : 755 M events produced
        • MC Re-Reco : >700 M events produced
    • Tier-2
      • Stable Site Availability situation during break (only a few sites with issues or in downtime) : CMS T2 SAM dashboard
      • MC Production during XMas break :
        • Spring11 8TeV GEN-SIM production is ongoing (CMSSW 3_10_0): 228/333 M events produced
        • Fall10 production is closed; the last workflows finished up over Xmas

  • ALICE reports -
    • T0 site
      • The AliEn v2.19 upgrade launched Dec 15-17 presented some issues that left the grid unusable for ALICE users during the Xmas break (production did run, though). The remaining issues are being worked on with high priority as people return from the break.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • Huge MC production during the Xmas period (at a pace of 40K jobs per day). Suffering from problems with the logSE and in general with the MC space tokens at CERN and at various T1s. The stripping of a couple of streams has to be redone, and the reprocessing of 2010 data still has to be completed (30 problematic files remain to be processed).
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • The intervention on the downstream capture database (impacting the replication of the LFC to the T1 sites) was done this morning. Tomorrow there will be a patch of the whole RAC serving CASTOR, LFC and the ConditionDB
        • During the closure period there was a problem with all our VOBOXes due to the netlog process filling up the /var partition with zillions of messages. Joel will follow this up with the IT people.
      • T1 site issues:
        • NTR
      • Roberto: IN2P3 now looks close to a solution for the AFS problem affecting the "setup project" command; this would not solve the SW installation problem, though

Sites / Services round table:

  • BNL
    • upgrades of dCache, network, power, ... foreseen for Sat Jan 15 through Tue Jan 18, overlapping with the quiet time for the ATLAS DB reorganization at CERN Jan 17-18
  • FNAL
    • see CMS report
  • GRIDPP
    • working on space issues at ATLAS T2 sites
    • APEL grid accounting broker at RAL currently down
  • IN2P3 - ntr
  • KIT
    • downtime of internet connection on Jan 24 for some hours in the morning
    • full-day maintenance foreseen for Jan 26, affecting many services
    • Thu Jan 6 is a holiday in Germany, only critical matters may get a response
  • NDGF - ntr
  • NLT1
    • just before Xmas, ATLAS ESD files were copied between T1 sites at high rates, overloading the storage system at SARA and filling up the DATATAPE space; the transfer rate to tape has now been increased, but we need to work with ATLAS to avoid this kind of problem
      • Alessandro: DATATAPE enlargements will be needed at all T1 sites; the numbers will be determined in the near future
    • in some circumstances data could get corrupted when files were stored on tape, due to a problem with the intermediate file system buffer (CXFS/DMF); checksums are now calculated from disk instead of from memory
  • OSG
    • BDII v5 upgrade blocked because the graceful restart does not seem to work; awaiting information from Ricardo and/or Laurence
  • RAL
    • see ATLAS report

  • CASTOR
    • still some open tickets since the power cut
    • tomorrow (Thu Jan 6) between 09:00 and 12:00 there will be a non-transparent intervention to upgrade the DB for the name server
    • next week there will be non-transparent interventions for CMS and the tape library, all have been agreed with the affected parties
  • dashboards
    • see ATLAS and CMS reports
  • grid services - ntr
  • networks
    • on Mon Jan 10 one central router will be stopped and on Tue Jan 11 it will be replaced with a new router; this should be transparent, otherwise mail "neteng" or phone Edoardo at 160046

AOB:

  • happy new year!

Thursday

Attendance: local(Alessandro, Cedric, David, Jamie, Maarten, Maria G, Miguel, Nicolo, Peter, Roberto, Stefan);remote(Gareth, Jeremy, Jon, Michael, Rob, Rolf, Ronald).

Experiments round table:

  • ATLAS reports -
    • RAL completely back online (was brokeroff)
    • A few T2 problems affecting storage: RO-14-ITIM GGUS:65872, INFN-ROMA1 GGUS:65854. Files unavailable at CA-ALBERTA-WESTGRID-T2 (GGUS:65869)
    • David: SAM CE tests OK now
      • Alessandro: one submission machine needed a fix to support the BNL VOMS server; on the other machine the framework was stuck

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Tier-0 still down
      • The CMS HyperNews server was migrated to SLC5, hence the HN service was down for the whole day (and the e-mail traffic in our mail clients was surprisingly low...)
      • CMS Dashboard issues: still being followed up, see Wednesday's report
      • CASTORCMS :
        • lost 641 "temp" files (one complete filesystem) from the "CMSCAF" pool after a RAID failure. CMS needs to evaluate internally whether this is relevant, but in principle it is OK, since this is production data that should also have another custodial site.
    • Tier1
      • Busy Data and MC Re-Reco processing (CMSSW_3_9_7)
      • No outstanding issues
    • Tier-2
      • Busy Spring11 8TeV production (CMSSW 3_10_0)
      • Growing analysis activities
      • No outstanding issues

  • ALICE reports -
    • T0 site
      • Ongoing efforts to make AliEn v2.19 work for users. Production continues.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • Ongoing MC productions
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • Intervention on the Oracle DB today affecting CASTOR and other Oracle-based services (BKK, LFC)
      • T1 site issues:
        • NTR

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • NLT1 - ntr
  • OSG
    • awaiting reply from Ricardo about BDII v5 restart issue
  • RAL
    • yesterday there was a 10 minute network outage affecting the T1 services
    • yesterday's APEL problem has been fixed

  • CASTOR
    • name server DB intervention finished OK around noon, all looks OK
      • Cedric: there were intermittent problems around 13:00 CET
      • Miguel: there was a rush of transfers, some must have timed out
  • dashboards
    • see ATLAS report
  • grid services
    • 1 LFC daemon was blocked after this morning's DB intervention and needed to be restarted; being investigated

AOB:

Friday

Attendance: local(Alessandro, Cedric, David, Jamie, Julia, Maarten, Manuel, Peter, Roberto, Simone, Stefan);remote(Alessandro, Andreas, Jeremy, John, Jon, Michael, Onno, Rolf).

Experiments round table:

  • ATLAS reports -
    • In the early afternoon a problem started appearing with the FTS at KIT, affecting certain transfers
      • Andreas: one agent refuses to start after an Oracle hiccup; FTS support have been contacted directly and via GGUS:65927

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Tier-0 still down
      • CMS Dashboard issues: reported for the 3rd day in a row (sorry), to underline that this is critical for CMS and that we need to get the various issues understood and fixed ASAP, namely:
        • FNAL wrongly tagged as being in maintenance in the CMS SSB: answer from the SAM/SFT group: apparently the downtime concerned (and 3 others) was inserted by mistake into the SAM database 1 year ago (2010-02-17T16:26:25Z) and never deleted! The SAM/SFT admins are checking with OSG to find out why that happened (see GGUS:65868).
          • Jon: for FNAL the source of downtime information is OSG, which does not have the site/service in downtime; after the SAM side has been fixed, the monthly availability calculations also need to be corrected; it is a long-standing problem that downtime deletions are not propagated from OSG to SAM, and this needs to be addressed
          • GGUS:65868 has been solved now, availabilities will be corrected
          • RFE opened for deletion propagation: https://tomtools.cern.ch/jira/browse/SAM-1124
        • The issue with the SSB collectors affecting CMS Site Readiness monitoring is still not solved: here we probably suffer from Pablo's absence; however, this is a rather big issue for CMS, since we cannot evaluate sites via our usual monitoring tools, see Savannah:118483
        • T2_RU_RRC_KI is now (correctly) back in Maintenance in the CMS SSB. However it is not yet clear what happened, see Savannah:118569 / BUG:76760
    • Tier1
      • Busy Data and MC Re-Reco processing (CMSSW_3_9_7)
      • No outstanding issues
    • Tier-2
      • Busy Spring11 8TeV production (CMSSW 3_10_0)
      • Growing analysis activities
      • No outstanding issues

  • ALICE reports -
    • T0 site
      • AliEn v2.19 expected to work for users early next week. Production continues and CAF is available for analysis on limited, pre-staged data sets.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • MC productions only.
      • DIRAC job output distribution policies are being investigated, because some sites have become full while others have space
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • NTR
      • T1 site issues:
        • NTR

Sites / Services round table:

  • BNL - ntr
  • CNAF - ntr
  • FNAL
    • see CMS report
  • GridPP - ntr
  • IN2P3
    • problem with LHCb job setup timeouts is considered solved:
      • WNs with 24 logical cores now have only 21 job slots
      • AFS client was upgraded to a development version
      • looks OK also for ATLAS
      • SIR will follow
      • Roberto: the setup timeout will be kept at 1 hour; we can check the improvements through the SAM tests
    • problem with LHCb SW installation is not yet solved:
      • for now we suggest that LHCb handle AFS at IN2P3 the same way as at CERN
      • CVMFS approach is being investigated and looks promising
      • Jon: CVMFS deployment at FNAL has proved very efficient
      • Maarten: also NIKHEF is happy with it
  • KIT
    • see ATLAS report
  • NLT1 - ntr
  • OSG
    • SAM team have been informed that the old downtimes mentioned in the CMS report can be deleted
    • no progress on the BDII v5 issue yet
  • RAL - ntr

  • dashboards
    • see CMS report
  • grid services - ntr

AOB:

-- JamieShiers - 14-Dec-2010
