Week of 140825

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Stefan (SCOD), Maarten (ALICE), Tsung-Hsun Wu (ASGC), Luca (Storage), Belinda (Storage), Akos (Grid Services)
  • remote: You-Jin (KISTI), Philippe (LHCb), Rolf (IN2P3), Alexey (ATLAS), Onno (NL-T1), Jeremy (GRIDPP), Michael (BNL), Thomas (NDGF), Pepe (CMS & PIC), Dimitri (KIT), Rob (OSG), Lisa (FNAL)
  • apologies: RAL

Experiments round table:

  • ATLAS
    • Central Services - Tier0/1 issue
      • Nothing to report today
      • Luca: Today around 1pm EOS-ATLAS was degraded; the cause is not known yet. EOS-ATLAS was in read-only mode until 2pm; while investigating the issue, a compaction of the data was also performed. The issue still needs to be understood. Back in read-write mode since 2pm.

  • CMS
    • No major issues, processing and production are continuing
      • CSA14 ongoing (extended by two weeks until mid-September; details provided at last week's WLCG OpsCoord meeting).
      • MWGR5 expected at the end of this week (Wed-Fri).
    • Degraded data transfer quality from CNAF-T1 and FNAL-T1. A few files were detected as corrupted; file invalidations/re-transfers were made (GGUS:107836 [done], GGUS:107851 [done]). A sketch of this kind of corruption check follows below this list.
    • KIT degraded transfer exports, most likely due to high load on the site (significantly high fraction of gridftp transfer timeouts; GGUS:107580 [under investigation])
    • netstat.cern.ch was unresponsive during the weekend (GGUS:107831); an HTTPD restart solved the issue.
    • Note: transition from Savannah to GGUS (CMS Computing Operations): September 1st - submission of new tickets disabled; September 30th - Savannah closed (issues still open will be transferred to GGUS)
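
A minimal sketch of the kind of corruption check referred to above, assuming adler32 checksums are recorded for each transferred replica; the file paths and expected checksums are hypothetical, and this is not the actual CMS tooling:

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Compute the adler32 checksum of a local file, reading in chunks."""
        value = 1  # adler32 seed value
        with open(path, "rb") as f:
            while True:
                block = f.read(chunk_size)
                if not block:
                    break
                value = zlib.adler32(block, value)
        return format(value & 0xFFFFFFFF, "08x")

    # Hypothetical catalogue of expected checksums for transferred replicas.
    expected = {
        "/data/store/file1.root": "01fa23bc",
        "/data/store/file2.root": "9d80aa11",
    }

    # Replicas whose on-disk checksum disagrees with the catalogue would be
    # invalidated and re-transferred, as was done in the tickets above.
    for path, checksum in expected.items():
        if adler32_of(path) != checksum:
            print("corrupted replica, invalidate and re-transfer:", path)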

  • ALICE
    • NTR

  • LHCb
    • MC and User jobs: still at a low level due to holidays (few MC requests)
    • T0: Oracle intervention tomorrow morning. We shall do nothing but warn users. A few jobs may fail, but nothing worth taking drastic action for if the intervention is short (~2 min).
    • T1:

Sites / Services round table:

  • ASGC: NTR
  • BNL: GGUS ticket GGUS:107789 was filed last Thursday against BNL concerning staging errors. The issue was investigated and found not to be a site issue, because ATLAS had rapidly sent more than 40k requests, which could not be completed within the time limits defined in DDM. The staging performance itself was found to be rather good: 134k files staged in 24 hours, i.e. about 1.5 files/s (see the rate estimate after this list). The GGUS ticket was closed on Friday and a JIRA ticket was created; the DDM ops team is asked to look into the timeout issue, which affected 4 ATLAS Tier-1 centers. This Twiki has been updated (M. Ernst); two plots showing the staging performance at BNL have been attached to this page (below).
  • CNAF: NR
  • FNAL: NTR
  • GridPP: Bank holiday today in the UK
  • IN2P3: NTR
  • JINR: NR
  • KISTI: NTR
  • KIT: NTR
  • NDGF: Tomorrow downtime for electrical maintenance; some dCache pools for ATLAS & ALICE will be unavailable 09:00-12:00.
  • NL-T1: NTR
  • OSG: NTR
  • PIC: NTR
  • RAL: NR
  • RRC-KI: NR
  • TRIUMF: NR
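
For orientation, a back-of-the-envelope check of the BNL staging numbers above; the only inputs are the figures quoted in the report:

    # 134k files staged in 24 hours corresponds to roughly 1.5 files/s.
    files_staged = 134_000
    seconds = 24 * 3600
    rate = files_staged / seconds  # ~1.55 files/s

    # At that rate a burst of 40k requests takes ~7 hours to drain end to end,
    # so requests queued at the tail can exceed per-request time limits such
    # as those defined in DDM, even though the site itself performs well.
    burst = 40_000
    print(f"rate = {rate:.2f} files/s, drain time = {burst / rate / 3600:.1f} h")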

  • CERN batch and grid services: NTR
  • CERN storage services: NTR
  • Databases: NR
  • GGUS: NR
  • Grid Monitoring: NR
  • MW Officer: NR

AOB:

Thursday

Attendance:

  • local: Stefan (SCOD), Akos (Grid Services), Felix (ASGC), Maria (WLCG), Andrea (WLCG MW), Belinda (Storage)
  • remote: Philippe (LHCb), Lisa (FNAL), Dennis (NL-T1), Michael (BNL), Eugene (KISTI), Rolf (IN2P3), Tiju (RAL), Thomas (NDGF), ThomasH (KIT), Jeremy (GridPP), Stefano (CMS), Rob (OSG), Saverio (CNAF)

Experiments round table:

  • ATLAS
    • Central services/T0
    • T1
      • high number of transferring jobs to FZK-LCG2 (ELOG:50846)

  • CMS
    • No major issues, processing and production are continuing
    • CSA14 ongoing

  • ALICE
    • very low activity during the last few days; new productions are being prepared

  • LHCb
    • MC and User jobs: average 15,000 concurrent jobs, peaks at 35,000
    • Data transfers: DM operations for cleaning dataset placement (transfers and removals)
    • T0: NTR
    • T1: we again have problems with file transfers for users with a Brazilian certificate at two sites, GridKa and PIC; it works at other dCache sites. The dCache developers are involved, as it seems related to the use of UTF-8 by the Brazilian CA (a minimal illustration follows after this list). It would be worth comparing the releases used e.g. at SARA or IN2P3 with those used at GridKa and PIC.
    • Services: no GOCDB e-mails have been received since August 7th, which is problematic for the operations team (GGUS:107812)
      • Rolf: The same problems are observed with information sent to ALICE
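
A minimal illustration of the suspected failure mode; the DN below is made up, and this only demonstrates the general encoding pitfall, not the actual dCache code path:

    # The same subject name serialises to different bytes under UTF-8 and
    # Latin-1, so services comparing DNs byte-wise can disagree even though
    # the subjects are identical.
    dn = "/C=BR/O=ExampleOrg/CN=João Silva"  # hypothetical DN, non-ASCII char

    utf8_bytes = dn.encode("utf-8")      # 'ã' becomes two bytes
    latin1_bytes = dn.encode("latin-1")  # 'ã' stays one byte

    print(utf8_bytes == latin1_bytes)  # False: byte-wise comparison fails
    print(utf8_bytes)
    print(latin1_bytes)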

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NR
  • FNAL: NTR
  • GridPP: NTR
  • IN2P3: Pre-announcement of an all-day outage on Sept 23rd; most services will be affected.
  • JINR: NR
  • KISTI: NR
  • KIT: NTR
  • NDGF: NTR
  • NL-T1: NTR
  • OSG: Monday 1 Sept is a holiday; will connect again next Thursday.
  • PIC: NTR
  • RAL: NTR
  • RRC-KI: NR
  • TRIUMF: NR

  • CERN batch and grid services: NTR
  • CERN storage services: NTR
  • Databases: NR
  • GGUS: NR
  • Grid Monitoring: NR
  • MW Officer: NTR
  • Information System: A validation campaign is being carried out for ATLAS sites to compare BDII vs SRM values; all the details can be found in this twiki. After some GGUS tickets were opened for T1s and T2s, sysadmins have already helped to understand some of the differences between the BDII and SRM values. In some cases the differences were due to bugs in the comparison script, or to the fact that SRM values are produced once per day, several hours before the comparison script is run, which may explain a few TB of difference. The comparison script will be brought in sync with the SRM values to get rid of this issue, and some of the opened tickets are being closed due to these known causes. Thanks to the sysadmins for their feedback and clarifications. A sketch of such a comparison follows below.
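
A minimal sketch of such a BDII-vs-SRM comparison with hypothetical numbers; a relative tolerance absorbs the few-TB differences caused by SRM values being produced several hours before the script runs:

    TB = 1e12  # bytes per terabyte, decimal convention

    # Hypothetical published totals for one site / space token.
    bdii_total = 2450 * TB  # value published in the BDII
    srm_total = 2447 * TB   # value reported by the SRM earlier the same day

    # Tolerate small relative differences, e.g. from the time skew between
    # when the SRM values are produced and when the comparison is run.
    tolerance = 0.005  # 0.5%

    diff = abs(bdii_total - srm_total) / max(bdii_total, srm_total)
    if diff > tolerance:
        print(f"mismatch of {diff:.2%} between BDII and SRM, worth a ticket")
    else:
        print(f"consistent within tolerance ({diff:.2%})")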

AOB:

  • Several remote attendees experienced problems connecting to the Alcatel phone conference system. Ticket INC:0627357 has been opened.
Topic attachments

  • BNL-Staging-Performance-Aug-21.png (PNG, 30.1 K, 2014-08-25, MichaelErnstExCern): illustrates the staging performance of BNL in response to ATLAS requests associated with a reprocessing task
  • BNL-Staging-Performance-Aug-22.png (PNG, 24.2 K, 2014-08-25, MichaelErnstExCern)