Week of 140728

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria Alandes (chair, minutes), Nacho Barrientos (Grid&Batch), Maria Dimou (GGUS), Alessandro Di Girolamo (ATLAS), Kate Dziedziniewicz-Wojcik (Databases), Felix Lee (ASGC), Andrew McNab (LHCb)
  • remote: Sang-Un Ahn (KISTI), Michael Ernst (BNL), Kyle Gross (OSG), Lisa Giacchetti (FNAL), John Kelly (RAL), Matteo Manzali (CNAF), Dmitry Nilsen (KIT), Christian Sottrup (NDGF), Emmanouil Vamvakopoulos (IN2P3), Alexander Verkooijen (NL-T1)

Experiments round table:

  • ATLAS reports (raw view) -
    • Tier0/1
      • FZK-LCG2: scheduled network maintenance ended with a network hardware failure (Sat-Sun), now fixed (GGUS:107262)
      • RRC-KI-T1: all transfers fail, both as source and as destination, since late Sunday (GGUS:107271)

Alessandro also reminds the meeting that GGUS:107259, describing tape problems at SARA, is still unsolved. Alexander explains that the ticket is now waiting for an answer from the dCache developers, as noted in the ticket itself. MariaA asks whether the dCache developers are aware of the problem and have acknowledged it, and Alexander confirms that this is the case.

  • CMS reports (raw view) - (Tommaso Boccali sent the report and excused his absence)
    • No major issues, processing and production are continuing
    • CSA14 is under way; the major part is analysis tests with CRAB3
    • Only real issue was GGUS unavailability (~10 am - 2 pm Sat), not foreseen (at least to me). As discussed in June, we tried to page the KIT emergency number (I think it was +49 721 608 43369), but it was close to when the problem was resolved, so I am not sure it worked. As in June, we also tried to write to Support@ggus.eu, helpdesk@ggus.org, but the mail bounced back with a not very useful "please use web interface".

MariaD reminds everyone that when GGUS/ALARM tickets cannot be opened because of network problems, users can call the KIT emergency number, which is published in GOCDB (the one currently published is +49 721 608 28383). MariaD also reminds users that email submission to GGUS has been disabled, as announced many times in previous meetings.

  • ALICE - (Not present)
    • KIT: ongoing network issues since Sat

  • LHCb reports (raw view) -
    • MC and User jobs mostly
    • T0:
      • Problem with another VO box, lbvobox11, (Alarm GGUS:107269) ongoing, due to overloading.
    • T1: Overnight problems with transfers to RRC-KI (DNS related?) were ended by a scheduled downtime in the morning.
    • We are having a one-hour downtime tomorrow (29 Jul) at 2:30pm CERN time for a database migration and some other updates. Almost all our central services will be unavailable for part of this time.

Andrew adds that LHCb would like to know whether IT operators could be allowed to restart VMs, to avoid a repeat of the problem faced with lbvobox11. This could be very useful when the authorised users are on holiday. MariaA will follow this up.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NTR
  • FNAL: NTR
  • GridPP: Not present
  • IN2P3: NTR
  • JINR: Not present
  • KISTI: NTR
  • KIT: Dmitry explains that the scheduled network maintenance intervention on Saturday ended with some issues, as already reported by the experiments. Most of the services (around 90%) are now back, but some are still having issues. These will hopefully be fixed in the next few hours.
  • NDGF: NTR
  • NL-T1: NTR
  • OSG: NTR
  • PIC: Not present
  • RAL: NTR
  • RRC-KI: Not present
  • TRIUMF: Not present

  • CERN batch and grid services:
    • FTS3 pilot (fts3-pilot.cern.ch): migration to the new database completed in the last 10 minutes.
  • CERN storage services: Not present
  • Databases: NTR
  • GGUS:
    • MariaD reports that a JIRA ticket, GGUS-1292, has been opened for the GGUS developers to reach a consensus on ALARM test times across American sites.
    • MariaD also explains that there is now a JIRA ticket, GGUS-1299, to track some search engine problems reported by Maria Alandes, where the system returns no results.
  • Grid Monitoring: Not present
  • MW Officer: Not present

AOB:

Thursday

Attendance:

  • local: Maria Alandes (chair, minutes), Maria Dimou (GGUS), Kate Dziedziniewicz-Wojcik (Databases), Luca Mascetti (Storage), Andrew McNab (LHCb)
  • remote: Sang-Un Ahn (KISTI), Jeremy Coles (GridPP), Dennis Van Dok (NL-T1), Michael Ernst (BNL), Kyle Gross (OSG), Lisa Giacchetti (FNAL), Thomas Hartmann (KIT), John Kelly (RAL), Christian Sottrup (NDGF), Emmanouil Vamvakopoulos (IN2P3)

Experiments round table:

  • ALICE - (Not present)
    • CERN: short EOS incident this afternoon has impacted ALICE jobs
    • KIT: network issues resolved since Tue morning

  • LHCb reports (raw view) -
    • MC and User jobs mostly
    • T0:
      • Loss of the sandboxes physical machine (volhcb15: INC0612499 - disk failure?) led to the loss of many jobs; now migrated to a VM-based replacement with an OpenStack-managed volume for the sandboxes. (If sandboxes aren't accessible, many user jobs have effectively failed.)
      • Loss of the physical machine for lhcb-logs.cern.ch (INC0612499 - disk failure again?) has also disrupted our monitoring; again migrating to a VM instance as a replacement.
      • We've increased the number of LHCb ops team people with admin access to the OpenStack tenancy following the lbvobox11 scenario which CERN ops can't yet handle themselves.
    • T1:
      • Tuesday downtime of central LHCb DIRAC services for database migration etc. completed successfully.
    • Follow-up on Monday's request to know whether operators will have the rights to reboot VMs: Tim Bell reported that IT is in the process of adding support for operators to reboot servers based on a ticket. This will also be available to sysadmins. There is no planned delivery date yet (due to absences). Tim suggests that the people running the LHCb cloud server management be added to the administration e-group concerned. Functions such as rebooting, power on/off and viewing the console log are available through self-service, using either the CLI or the web GUI, so there is no need for a ticket. This is especially useful for problem determination, as in GGUS:107269, where the console logs showed that the problem was due to an out-of-memory condition and listed some of the largest processes.

Sites / Services round table:

  • ASGC: Not present
  • BNL: NTR
  • CNAF: Not present
  • FNAL: NTR
  • GridPP: NTR
  • IN2P3: There will be a small intervention on the tape management system on Tuesday 5th August from 14h to 18h. This will be a complete outage and tape won't be available during this time. Access to disk will remain available though.
  • JINR: Not present
  • KISTI: NTR
  • KIT: After the network problems that happened during the past weekend, all services are now back and running. The emergency contact details have been verified and are up to date in GOCDB.
  • NDGF:
    • Problems with network access to DBs in Norway, now fixed (GOCDB downtime)
    • Ongoing problems with RAID controllers in Copenhagen. ATLAS and ALICE data may be unavailable. Expected to be fixed within 1h (GOCDB downtime)
  • NL-T1: NTR
  • OSG: Ongoing HW problems affecting the CVMFS server, which remains unavailable. Fix expected before tomorrow.
  • PIC: Not present
  • RAL: NTR
  • RRC-KI: Not present
  • TRIUMF: Not present

  • CERN batch and grid services: (Not present)
    • FTS2 TERMINATION: As per this ITSSB entry, the FTS2 services at CERN will be terminated tomorrow, Friday 1st of August. This includes: fts22-t0-export.cern.ch, fts-t1-import.cern.ch, fts-pilot-service.cern.ch, fts-t2-service.cern.ch and fts-monitor.cern.ch. For the sake of clarity, this has nothing to do with either fts3.cern.ch or fts3-pilot.cern.ch.
    • Maria Dimou suggests avoiding commissioning or decommissioning services on a Friday, if possible, so that any resulting problems do not arise over the weekend, when support is more complicated. Maria Alandes reminds the meeting that, in any case, this termination has been announced for quite some time now.
  • CERN storage services: EOS-ALICE was unavailable from 13h15 to 14h30 after a namespace crash. It was restarted and recovered.
  • Databases: NTR
  • GGUS: The JIRA ticket GGUS-1299, on allowing the GGUS search engine to return more results, was discussed yesterday, accepted and put in the pipeline for the next release. There will be no GGUS release in August.
  • Grid Monitoring: Not present
  • MW Officer: Not present

AOB:

Topic revision: r13 - 2014-07-31 - MariaALANDESPRADILLO
 