Week of 140127

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Stefan (SCOD), Alexandre (Monitoring), Felix (ASGC), Raja (LHCb), Jan (Storage), Maarten (ALICE)
  • remote: Sang-Un (KISTI), Kai (ATLAS), Dimitri (KIT), Onno (NL-T1), Michael (BNL), Tiju (RAL), Matteo (CNAF), Pepe (CMS/PIC), Rolf (IN2P3)

Experiments round table:

  • ATLAS
    • Central services/T0
      • RAL switched to FTS3 again this morning
        • Transfers failed briefly because of the switch
        • Working now
      • Cern-prod_SL6: Jobs dying due to memory limits (GGUS:100531)
        • One fix concerns the fast-kill procedure: jobs might get killed when swap is full because of other jobs
          • The fix will take some days to become active
        • There might be another cause of failure: still under investigation
    • T1
      • RRC-KI-T1: PostgreSQL cluster problems led to failing transfers (GGUS:100540)
        • Fixed
      • Nikhef: Some inaccessible files: under investigation (GGUS:100542)
      • Still some instabilities on the TW T1 SE; the renaming campaign is interfering with regular activity (GGUS:100268)
        • but decreasing since last night
        • Site can optimize for renaming OR normal transfers, but not both

  • CMS
    • Central services/T0
      • INC480945 "Degraded E-mail server quality" during the weekend. The SLS alarm showed a degradation, but this was caused by a problem in the SLS monitoring agent, which was not displaying the availability correctly. There was no issue with the mail service itself, and the monitoring problem was fixed by the CERN team.
      • GGUS:100538 ("ARGUS not responding"): Problems observed with the ARGUS service during the weekend. There was a high load on the ARGUS servers at around 6pm on 25 Jan, which matched the time stamp of the CMS error report. Some errors were seen in the log files, but the service recovered by itself within a few hours. Ticket closed, but to be followed up.
    • T1/T2/Others: Business as usual. Smooth running.

  • ALICE -
    • KIT
      • more than 700 file corruptions discovered on disk SE
      • problem looks correlated with 4 new server machines
      • they have been set read-only late Thu evening
      • the matter is being investigated further
      • Dimitri: problem under investigation. Latest news from the storage experts: soft links pointing to the same file sometimes cause problems, but the real source of the problem is not yet known.

  • LHCb
    • Mostly simulation and user jobs. Smooth running over most of the grid.
    • T0: NTR
    • T1: Brief scheduled downtime of IN2P3 for "node reconfiguration"
    • T2: Downtime of CBPF (Brazil) due to a power cut. Admins still trying to bring up services there.

Sites / Services round table:

  • ASGC: site personnel on vacation from Jan 30th to Feb 4th (Chinese New Year); response time is expected to be slower during that time
  • KISTI: NTR
  • KIT: NTA
  • NL-T1: NTR
  • BNL: NTR
  • RAL: FTS3 version was upgraded; some 512-bit proxies needed to be deleted
  • OSG: Received email from Alessandro Di Girolamo & Maria Alandes that some ATLAS sites are not published in the CERN BDII; the problem is being sorted out right now
  • CNAF: NTR
  • PIC: NTR
  • IN2P3: NTR
  • Monitoring: NTR
  • Storage: NTR
  • GGUS: The monthly GGUS release will be deployed this Wednesday, including the usual "test alarm tickets". It has been announced in GOCDB and on the GGUS homepage.

AOB:

Thursday

Attendance:

  • local: Simone (SCOD), Alex, Pablo (CERN Monitoring), Felix (ASGC), Ben (CERN Grid Services), Xavi (CERN Storage), Raja (LHCb)
  • remote: Michael (BNL), Eric (CMS), Dennis (NL-T1), Kyle (OSG), Saverio (CNAF), Pepe (PIC), Jeremy (GridPP), Gareth (RAL), Rolf (IN2P3), Alessandro (ATLAS)

Experiments round table:

  • ATLAS reports (raw view) -
    • CERN-PROD: a low percentage of failures due to exceeded memory observed for various workflows. Trying to get in touch with CERN-PROD experts to understand the error

  • CMS reports (raw view) -
    • T1/T2/Others: Business as usual. Smooth running.
    • Preparing for DBS (data catalog) upgrade on Feb 10. That week will see little to no central processing
  • ALICE -
    • KIT
      • occasional data corruption was due to silent PFN clashes when multiple Xrootd servers create files concurrently
        • PFN now based on a time stamp (at least by default), whereas it used to be a hash of the LFN
        • the Xrootd servers at KIT all see the whole name space
      • affected ~0.9% of data written since new servers were configured in Sep
        • 47,773 out of 5,303,211 files
        • ~1.05 TB
      • new servers read-only for now
      • cleanup of SE and catalog in preparation
      • various solutions being checked and compared
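
The clash mechanism and the quoted fraction above can be illustrated with a minimal sketch. It is not the actual Xrootd or ALICE naming code: the function names, the `/data/` prefix, the one-second resolution and the sample LFNs and times are all assumptions made up for illustration. It only shows why a name derived from a coarse time stamp can collide when several servers sharing one name space create files at nearly the same moment, whereas a name derived from a hash of the LFN does not, and it re-derives the ~0.9% figure from the file counts reported above.

```python
import hashlib


def pfn_from_lfn_hash(lfn: str) -> str:
    """Hypothetical scheme: PFN derived from a hash of the LFN."""
    return "/data/" + hashlib.sha1(lfn.encode()).hexdigest()


def pfn_from_timestamp(created_at: float, resolution: float = 1.0) -> str:
    """Hypothetical scheme: PFN derived from the creation time,
    truncated to `resolution` seconds (an assumed granularity)."""
    return "/data/" + str(int(created_at // resolution))


def count_clashes(pfns: list[str]) -> int:
    """Number of files whose PFN duplicates one already in use."""
    return len(pfns) - len(set(pfns))


# Four distinct files (made-up LFNs), created by different servers;
# the first three are created within the same second.
lfns = [f"/alice/sim/run{i}/file.root" for i in range(4)]
times = [1390000000.10, 1390000000.35, 1390000000.60, 1390000001.20]

hash_clashes = count_clashes([pfn_from_lfn_hash(lfn) for lfn in lfns])
time_clashes = count_clashes([pfn_from_timestamp(t) for t in times])

# Fraction of affected files, from the counts reported in the minutes.
affected_fraction = 47_773 / 5_303_211

print(f"hash-based clashes: {hash_clashes}")
print(f"time-based clashes: {time_clashes}")
print(f"affected fraction:  {affected_fraction:.2%}")
```

In this toy setup the clash is silent only if the second writer overwrites or shares the first file without an error, which matches the "silent PFN clashes" wording above; how the real servers behave in that case is not stated in these minutes.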

  • LHCb reports (raw view) -
    • Mostly simulation and user jobs. Smooth running over most of the grid.
    • T0: NTR
    • T1: Brief problem at SARA on 28th when two rogue worker nodes caused a lot of jobs to fail (GGUS:100576, GGUS:100577). Fixed quickly.
    • T2: Failed pilots at ARAGRID-CIENCIAS (Spain - GGUS:100625).

Sites / Services round table:

  • ASGC: the Chinese New Year holiday in Taiwan started today (1 week)
  • NL-T1: issue with the tape backend at SARA. Reading from and writing to tape is currently impossible. Under investigation. Downtime has been declared.
  • OSG: RSV2SAM was shut down yesterday for the integration instance. Rob is on leave for the next month; Kyle will represent OSG at WLCG meetings.
  • RAL: problem on the CMS CASTOR instance. The issue was spotted in the SAM test results, and a large number of internal timeouts in CASTOR have been observed. Under investigation.
  • CERN: LFC/FTS/VOMS issue due to the central DB. It also affected the dashboards.
  • Storage: a CASTOR downtime needs to be scheduled for a hardware upgrade of the Oracle DB. An email will be sent to the CASTOR announcement mailing list; experiments, please reply. The interventions on the namespace and stager DBs cannot be merged into the same day. Proposed schedule:
    • 4th of March from 8 to 12: CASTOR nameserver (all instances)
    • 11th of March from 8 to 12: CMS instance
    • 12th of March from 8 to 12: ATLAS instance
    • 13th of March from 8 to 12: ALICE instance
    • 25th of March from 8 to 12: LHCb instance
  • Dashboards: the DB outage affected dashboards and SSB instances. For SSB, up to 3 hours of data were lost for some metrics. For dashboards, no data will be lost, but crons are still catching up on the backlog.

AOB:

  • Release of GGUS yesterday, everything smooth. All alarms were acknowledged.
Topic revision: r13 - 2014-01-30 - MaartenLitmaath
 