Week of 140929

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions to join the phone conference can be found here.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Andrea Sciabà, Hervé Rousseau (IT-DSS), Eric Vaandering (CMS), Alberto Rodríguez Peón (IT-PES), Tsung-Hsun Wu (ASGC)
  • remote: Alessandro Di Girolamo (ATLAS), Dea-Han Kim (KISTI), Dmitry Nilsen (KIT), Lisa Giacchetti (FNAL), Michael Ernst (BNL), Onno Zweers (NL-T1), Sonia Taneja (CNAF), Rolf Rumler (IN2P3-CC), Tiju Idiculla (RAL), Ulf Tigerstedt (NDGF), Vladimir Romanovskiy (LHCb), Rob Quick (OSG)
Experiments round table:

  • ALICE -
    • NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr. Congratulations to CERN for its 60th anniversary!
  • OSG: ntr
  • PIC: we had some instabilities in the dCache doors last Friday, which caused occasional SAM test failures here. We think this is due to high load from ATLAS; it is being investigated and corrections are being applied.
  • RAL: ntr
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services:
    • CVMFS mounts within CERN are disrupted today. This is being investigated.
    • After a successful period in QA, site-bdii.cern.ch, lcg-bdii.cern.ch & sam-bdii.cern.ch will be upgraded to 5.2.23. SSB Entry.
    • WMS decommissioning will happen on Wednesday.
  • CERN storage services: Upgrade of Castor to 2.14-14 has been completed successfully with the namespace update today
  • Databases:
  • GGUS: New release on the 1st of October. The service will be in downtime from 6:00 to 8:00 am. This release includes several new EGI Support Units and several enhancements to the CMS ticket submit form. The test alarms will be sent after the new release is deployed. Alarms for FNAL will be executed at 16:00 UTC.
  • Grid Monitoring:
  • MW Officer:
AOB:

Thursday

Attendance:

  • local: Stefan (SCOD), Akos (Grid Services), Lorena (DB), Alessandro, Herve (DSS), Andrea (MW Office), Maarten (ALICE), Christophe (CMS), Pablo (GGUS+Monitoring), Maria (WLCG), Felix (ASGC)
  • remote: Andrej (ATLAS), Dennis (NL-T1), Lisa (FNAL), John (RAL), Michael (BNL), Jeremy (GridPP), Rolf (IN2P3), Sang Un (KISTI), Rob (OSG), Lucia (CNAF), Vladimir (LHCb), Dimitri (KIT), Pepe (PIC), Pavel (KIT)

Experiments round table:

  • ATLAS
    • NTR

  • CMS
    • CMS dashboard applications have some problems because of MonALISA; see also the monitoring report below.

  • ALICE
    • NTR

  • LHCb
    • MC and User jobs.
    • T0: NTR
    • T1: NTR
    • Services: NTR

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NTR
  • FNAL: NTR
  • GridPP: NTR
  • IN2P3: NTR
  • JINR: NR
  • KISTI: NTR
  • KIT: NTR
  • NDGF: NR
  • NL-T1: NTR
  • OSG: Question: are the new VOMS servers unavailable because of some firewall protection? Is the deadline towards the end of November? Maarten: we shall discuss this in the WLCG ops coordination meeting at 15:30 (see minutes there).
  • PIC: NTR
  • RAL: The disk service for ATLAS is back in production after an outage.
  • RRC-KI: NR
  • TRIUMF: NR

  • CERN batch and grid services:
    • WMS decommissioning was done on Wednesday.
    • CvmFS within CERN was very faulty on Monday into Tuesday morning.
      • RAL stratum 1 was hanging from time to time.
      • One site squid server at CERN was faulty (full partition)
        • It is unclear at this time which of the above was the underlying cause.
      • In addition, a reconfiguration of CvmFS clients at CERN to remove RAL from the list of stratum ones resulted in a cvmfs_config reload being killed after 5 minutes by Puppet. This left many stale cvmfs mounts (the normal time for a reload is < 1 minute). The hang was always on /cvmfs/atlas.cern.ch in particular.
      • Bugs are open with the CvmFS developers. Clients appeared to hang rather than fail against RAL (through the CERN proxy).
  • CERN storage services:
  • Databases: The standby DB for LCGCR had to be stopped. It will be restarted within 2 weeks on new hardware.
  • GGUS:
    • Release done on the 1st of October. Test alarms were sent to all the T1s, and some took quite some time to be acknowledged (KR-KISTI-GSDC-01: 22 hours, RRC-KI-T1: 3.5 hours). GGUS is not allowed to send to INFN-T1.
    • Scheduled outage on the 16th of October, 7:30 to 9:30, to switch to the failover instance of REMEDY.
  • Grid Monitoring:
    • CMS Job Monitoring information was missing from Friday until Wednesday morning. Issue has been understood and solved
    • At the end of the month, SAM3 will replace SAM2. There is also one important change in the availability formula: if a site provides several storages, all of them have to be up for the site to be considered available.
  • MW Officer:
    • For sites still running dCache 2.2.x: dCache 2.6.x is working fine with Enstore tape backend (PIC is running it).
    • An issue is preventing the installation of many grid components (CREAM, WN, WMS, L&B), because one dependency (classAds) has been removed from EPEL (no maintainer). The missing dependency is going to be included in the EMI/UMD third-party repos for the moment, while waiting for a new maintainer.
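
The availability-formula change noted under Grid Monitoring can be illustrated with a minimal sketch. This is not SAM3 code: the function name `site_available` and the endpoint names are hypothetical, and the sketch only expresses the rule as stated in the minutes (every storage of a site must be up for the site to count as available).

```python
def site_available(storage_status):
    """Return True only if every storage endpoint of the site is up.

    storage_status maps a (hypothetical) endpoint name to True (up)
    or False (down). An empty mapping yields True, because all() of
    no items is True; how SAM3 treats a site without storage is not
    covered by the minutes.
    """
    return all(storage_status.values())


# Hypothetical example data: one site with two storage endpoints.
one_down = {"se01.example.org": True, "se02.example.org": False}
all_up = {"se01.example.org": True, "se02.example.org": True}

print(site_available(one_down))  # False: a single failing storage is enough
print(site_available(all_up))    # True: every storage is up
```

Under this rule a site with several storages is only as available as its least available storage, which is why the change matters for multi-storage sites.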

AOB:

Topic revision: r17 - 2014-10-02 - StefanRoiser