Week of 140804

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria Alandes (chair, minutes), Zbigniew Baranowski (Databases), Belinda Chan Kwok Cheong (Storage), Ben Jones (Grid&Batch), Andrew McNab (LHCb)
  • remote: Sang-Un Ahn (KISTI), Stefano Belforte (CMS), Thomas Belleman (NDGF), Jeremy Coles (GridPP), Michael Ernst (BNL), Kyle Gross (OSG), Tiju Idiculla (RAL), Dmitry Nilsen (KIT), Emmanouil Vamvakopoulos (IN2P3), Alexander Verkooijen (NL-T1)

Experiments round table:

  • ALICE - (Not present)
    • NTR

  • LHCb reports -
    • MC and User jobs mostly
    • Due to allowing more than one job manager, had 55000 jobs running concurrently over the weekend.
    • T1:
      • PIC CRLs problem fixed.
      • More RRC-KI DNS problems but again fixed.
      • Unable to use RAL currently because of ARC CE client library fixes needed on our side.

Sites / Services round table:

  • ASGC: Not present
  • BNL: NTR
  • CNAF: Not present
  • FNAL: Not present
  • GridPP: NTR
  • IN2P3: NTR
  • JINR: Not present
  • KISTI: There was a 1h outage due to a network intervention that resulted in a fiber cut with Chicago. This is now fixed.
  • KIT: NTR
  • NDGF: A series of issues over the weekend mainly affecting ATLAS data (maybe also some ALICE data):
    • The two broken RAID controllers reported last Thursday are still waiting for replacement and the associated disk pools are only accessible in read-only mode for the time being.
    • Two tape pools had problems during the weekend. One of them was related to a SELinux problem that has been understood and fixed. The other one is not yet understood.
    • dCache nodes were overloaded due to a problem that is not yet fully understood. Experts have found a workaround and the overload is now fixed.
  • NL-T1: NTR
  • OSG: NTR
  • PIC: Not present
  • RAL: NTR
  • RRC-KI: Not present
  • TRIUMF: Not present

  • CERN batch and grid services:
    • FTS2 service now fully decommissioned as announced last week. See IT SSB entry for more details.
    • SLC5 ce20[1-7] removed from the BDII due to the reduction of SLC5 capacity at CERN. Note that it is still possible to submit jobs to SLC5 from SLC6 CEs.
    • CVMFS stratum 0 services upgrade for AMS and SFT. See IT SSB entry for more details.
  • CERN storage services: NTR
  • Databases: Connections to the CMS Offline DB from outside CERN are now restricted, as requested by CMS. There is a white list of nodes that can connect from outside CERN. Stefano adds that this has been announced on the PhEDEx operational list and that anyone who has problems should get in touch with them or with Oracle support at CERN. In principle all PhEDEx agents have been included.
  • GGUS: Not present
  • Grid Monitoring: Not present
  • MW Officer: Not present

AOB:

Thursday

Attendance:

  • local: Maria Alandes (chair, minutes), Zbigniew Baranowski (Databases), Belinda Chan Kwok Cheong (Storage), Maria Dimou (GGUS), Andrew McNab (LHCb), Alessandro di Girolamo (ATLAS), Tsung-Hsun Wu (ASGC), Kuo-Hao Ho (ASGC), Lorena Pardavila (Databases)
  • remote: Sang-Un Ahn (KISTI), Thomas Belleman (NDGF), Jeremy Coles (GridPP), Michael Ernst (BNL), Rob Quick (OSG), John Kelly (RAL), Thomas Hartmann (KIT), David Bouvet (IN2P3), Dennis van Dok (NL-T1), Burt Holzman (FNAL)

Experiments round table:

  • ATLAS reports -
    • SAM tests for CEs at US sites were in "UNKNOWN" status from 19th July till 6th August. This was due to an upgrade of the SAM nodes that was not performed properly. It has been fixed. Availability/Reliability numbers should be checked to make sure they are not affected.
    • Tier0/1
      • NDGF-T1: one disk server unavailable GGUS:107442 under resolution.
      • RRC-KI-T1 file transfer/job failures GGUS:107447 solved. Due to DNS issues.

Alessandro explains that the absences of ATLAS in the last Ops meeting were because ATLAS central services were experiencing a lot of problems and all effort had to go into solving them.

  • ALICE - (Not present)
    • NTR

  • LHCb reports -
    • MC and User jobs mostly
    • T1:
      • Our ARC CE support fixed, so RAL in use again.
      • Users have been reporting problems with timeouts from files on dCache at T1s. Investigating one by one with help of LHCb contacts at T1s. Will write tickets about any remaining problems.

Andrew gives more details on the dCache problems that do not seem to have any correlation with a particular dCache version. Even different endpoints in the same site could fail or work OK. This is all related to xrootd endpoints and some sites have solved the issues that seem to be caused by misconfigurations on their site. LHCb is following up on this and may give a more detailed report at the next Ops Coord meeting if necessary.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: Not present
  • FNAL: NTR
  • GridPP: NTR
  • IN2P3: NTR
  • JINR: Not present
  • KISTI: NTR
  • KIT: A problem with a mail server has caused mails to be delivered with some delay. The problem is not yet understood but it has been fixed and the backlog of mails will be delivered in the next few hours.
  • NDGF:
    • The broken RAID controllers cannot be replaced until 25th August at the earliest, according to the manufacturer's estimate. The solution for the time being is to move data to two identical machines. In principle this should be ready by next Monday. See GOCDB downtime for more details.
    • There has also been a problem with a tape that was unreadable. This is now fixed.
  • NL-T1: NTR
    • Alessandro explains that high memory queues for ATLAS will be requested from the T1s that do not yet provide them. Alessandro would like to know whether this would be an issue for NL-T1. Dennis replies that he doesn't see any problem but it would be better to contact NIKHEF and SARA on this matter.
  • OSG: NTR
  • PIC: Not present
  • RAL: NTR
  • RRC-KI: Not present
  • TRIUMF: Not present

  • CERN batch and grid services: Not present
  • CERN storage services: NTR
  • Databases:
    • ATLAS PVSS archive replication between online and offline was unavailable from 17:30 on 06.08 to 12:00 on 07.08 due to a user mistake. More details in IT-SSB.
  • GGUS: General email problem at KIT around 1pm CEST today affecting GGUS email notifications as well.
  • Grid Monitoring:
  • MW Officer:
    • New EMI Update released today affecting APEL, caNl library, CREAM GE, dCache and DPM ARGUS. All details can be found in the release notes.
      • It is worth mentioning the dCache fix whereby NFS protocol is no longer being published in the BDII. This is something that was giving problems in the past to LHCb, as reported in previous meetings, and was tracked as a known MW issue.
    • A new minor release of several Information System packages is now available for Readiness Verification. More details in the release notes. The release is expected to be included in the next EMI Update planned in September.

AOB:

Topic revision: r13 - 2014-08-07 - AleDiGGi
 