Week of 150720

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Maria Alandes (Chair), Maarten Litmaath (ALICE), Nacho Barrientos (Grid and Batch), Jan Iven (Storage), Emil Pilecki (DB)
  • remote: Ulf Tigerstedt (NDGF), Alexander (NL-T1), Michael Ernst (BNL), Salvatore Tupputi (CNAF), Eric Vaandering (CMS), Josep Flix (PIC), Pavel Weber (KIT), Renato Santana (LHCb), Di Qing (TRIUMF), Gareth Smith (RAL), Dario Barberis (ATLAS), Rob Quick (OSG)

Experiments round table:

  • CMS reports -
    • DB problems at Wigner do not appear to have affected us
    • Quiet week otherwise

  • ALICE -
    • CERN: raw data copies from Point 2 to CASTOR have been timing out since last night
    • NDGF reported inefficient data transfers and noise in their logs
      • this is due to failed attempts with 2 methods before the 3rd succeeds
      • to be followed up with the dCache devs

Jan reports that stuck transfers in CASTOR are due to offline transfers using 3rd party copies. The issue is still under investigation.

  • LHCb reports -
    • Data Processing: "new data" automatic production in place. User and MC jobs
    • T0
      • NTR
    • T1
      • NTR
    • T2
      • NTR

Sites / Services round table:

  • ASGC: NA
  • BNL: NTR
  • CNAF: NTR
  • FNAL: NA
  • GridPP: NA
  • IN2P3: NA
  • JINR: NA
  • KISTI: NA
  • KIT: NTR
  • NDGF: it is very likely that the ALICE report on inefficient data transfers is due to a bug in dCache. It would be very useful to get the command-line commands that could reproduce the problem in the NDGF test environment (see the reproduction sketch after this round table).
  • NL-T1: NTR
  • NRC-KI: NA
  • OSG: NTR
  • PIC: A downtime to upgrade dCache is scheduled for next week. The exact dates will be entered in GOCDB.
  • RAL: NTR
  • TRIUMF: NTR

  • CERN batch and grid services: NTR
  • CERN storage services:
    • A restart of the CASTOR head nodes is scheduled for next Thursday, 23 July, from 9:00 to 16:00, due to changes on the DB backend. This should be transparent. See ITSSB for more details.
    • AFS servers are affected by the Wigner issue. No new AFS users can be created at the moment. See ITSSB for more details.
    • Old CASTOR SRM machines will be retired. An entry listing the affected nodes will be created in the SSB so users can check, but this should be transparent since nobody should be using these nodes any more.
  • Databases:
    • The problem at Wigner is affecting the standby DBs. Only CMS could be affected, since it uses Data Guard; the other experiments only use these DBs for taking backups. Production DBs are not affected. Connectivity is partially restored, but not everything is working fine yet. Backups are being synchronised with production. See ITSSB for more details.
    • No Oracle DB patching is scheduled in the next 3 months.
  • GGUS: NA
  • Grid Monitoring: NA
  • MW Officer: NA
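
For reference, a minimal sketch of one way the ALICE/NDGF transfers could be reproduced with verbose client-side logging, so that the two failed method attempts preceding the successful one become visible. This is only an illustration: it assumes the gfal2 Python bindings and a valid grid proxy, the endpoint and paths below are placeholders, and the copy mechanism actually used by ALICE is not stated in these minutes.

    # Sketch only: gfal2 Python bindings assumed; endpoint and paths are placeholders.
    import gfal2

    # Raise the gfal2 log level so any failed protocol/method attempts that
    # precede the successful one show up in the client output.
    gfal2.set_verbose(gfal2.verbose_level.debug)

    ctx = gfal2.creat_context()
    params = ctx.transfer_parameters()
    params.overwrite = True
    params.timeout = 300

    src = "file:///tmp/testfile"              # local test file (placeholder)
    dst = "srm://srm.ndgf.org/ops/testfile"   # placeholder NDGF endpoint and path
    ctx.filecopy(params, src, dst)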

AOB: NA

Thursday

Attendance:

  • local: Maria Dimou (Chair), Olivier Gutsche (CMS), Alberto Peon (T0 Grid Services), Stefan Roiser (LHCb), Maarten Litmaath (ALICE), Andrea Manzi (MW Officer), Jan Iven (T0 Storage).
  • remote: Ulf Tigerstedt (NDGF), David Bouvet (IN2P3), Dennis van Dok (NL-T1), Lisa Giacchetti (FNAL), Thomas Hartmann (KIT), John Kelly (RAL), Salvatore Tupputi & Antonio Falabella (CNAF), Michael Ernst (BNL), Dario Barberis (ATLAS), Elisabeth Prout (OSG), Di Qing (TRIUMF), Josep Flix (PIC).

Experiments round table:

  • ATLAS reports -
    • Nothing special to report
      • EOS was down Tuesday evening for 3 hours. SNOW and then GGUS alarm tickets were sent. Quick reaction from IT.
      • An unannounced change to the SOAP library for e-groups on Tue 21st broke the overnight authorisation group synchronisation scripts. Fixed on the ATLAS side by changing the SOAP calls and output parsing in the scripts. Luckily the developer is around at the end of July. Related ticket SNOW:RQF0482624.

  • CMS reports -
    • quiet week
    • GGUS:115120: AI servers "slow" for T0 jobs. The ticket was opened on 17 July and last commented on 20 July, asking the admins whether they can observe the symptoms reported by the ticket creator. Maria suggested that Oli escalate the ticket.
    • old: GGUS:114712: "xrdcmsglobal01.cern.ch hangs on attempts to contact via xrootd". Opened 29 June, re-opened 3 July; a detailed answer with instructions was received on 7 July. Following up with CMS to reply and/or close.

  • ALICE -
    • high activity
    • CERN: job submissions became really slow in the course of Mon (team ticket GGUS:115153)
      • new submissions could not keep up with old jobs finishing
      • the site got steadily drained of ALICE jobs
      • the Argus service is the main suspect, but no clear evidence was found
      • 2 nodes were again made available for use in the alias (now 5/6, was 4/6)
      • the problem was gone by Tue evening
Maarten and Jan agreed that the work to fix GGUS:115145, reported on Monday, is ongoing.

  • LHCb reports -
    • Data Processing: "new data" stripping. User and MC jobs
    • T0
      • NTR
    • T1
      • IN2P3 (GGUS:115169): issues reading RAW/MDF files. David confirmed that the 'Related issue' appearing in the GGUS ticket is an identical instance in the French ticketing system. The two systems are synchronised.
    • T2
      • NTR

Sites / Services round table:

  • ASGC: not connected
  • BNL: ntr
  • CNAF: Updated the machines hosting WebDAV, GridFTP and xrootd to SL6. They will do the same for ATLAS and CMS after agreeing on which version of the services to install.
  • FNAL: ntr
  • GridPP: not connected
  • IN2P3: nothing to add
  • JINR: not connected
  • KISTI: not connected
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • NRC-KI: not connected
  • OSG: ntr
  • PIC: The next scheduled downtime will be on Monday morning, to upgrade dCache. All VOs are informed. Oli is asked to check with CMS about the disk area getting full quickly and often.
  • RAL: ntr
  • TRIUMF: ntr

  • CERN batch and grid services: NTR
  • CERN storage services:
    • Apologies for the EOS-ATLAS crash last Tuesday. It was hoped that the restart would take only ~30 minutes, but some issues were overlooked.
    • CASTOR instances have been restarted today; it appears to have been transparent, as planned.
    • AFS servers crashed; they are back now, but some servers hosting user directories might still be a bit slow.
  • Databases:
  • GGUS: There will be a 'summer' release next Wednesday 29th of July with ALARM tests as usual.
  • Grid Monitoring:
  • MW Officer: The end of support for dCache 2.6.x was May 2015. The deadline for decommissioning is 21/09/2015, and starting from 31/08/2015 sites still running dCache 2.6.x will be ticketed (more details at https://wiki.egi.eu/wiki/Software_Calendars#dCache_v._2.6.x). ~20 instances are still running 2.6.x (no T1s).
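
As a rough illustration of how the remaining 2.6.x instances could be enumerated, the sketch below queries a top-level BDII for the published storage-element implementation and version. It assumes the GLUE 1.3 schema attributes and the ldap3 Python library; the endpoint and the actual procedure behind the ~20 count are assumptions, not taken from these minutes.

    # Sketch only: assumes a reachable top-level BDII publishing GLUE 1.3 attributes.
    from ldap3 import Server, Connection

    server = Server("lcg-bdii.cern.ch", port=2170)
    conn = Connection(server, auto_bind=True)  # anonymous bind

    # Find storage elements publishing a dCache implementation and report
    # the ones whose published version is still in the 2.6.x series.
    conn.search("o=grid",
                "(&(objectClass=GlueSE)(GlueSEImplementationName=dCache))",
                attributes=["GlueSEUniqueID", "GlueSEImplementationVersion"])

    for entry in conn.entries:
        version = str(entry.GlueSEImplementationVersion)
        if version.startswith("2.6"):
            print(entry.GlueSEUniqueID, version)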

AOB:
