Week of 130909

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • GGUS Information (GgusInformation)
  • LHC Machine Information (Sharepoint site - LHC Page 1)
  • CERN IT status board
  • WLCG Baseline Versions
  • WLCG Blogs


Monday

Attendance:

  • local: Simone (SCOD), Tommaso (CMS), Stefan (LHCb), Luca (CERN-DB)
  • remote: Ulf (NDGF), Michael (BNL), John (RAL), Rolf (IN2P3), Xavier (KIT), Lucia (CNAF), Lisa (FNAL), Ron (NL-T1), Wei-Jen (ASGC), Rob (OSG), Maria (GGUS), Sang-Un Ahn (KISTI)

Experiments round table:

  • ATLAS reports (raw view) -
    • T0
      • CERN-PROD GGUS:97124: Fatal in <TROOT::InitSystem>: HOME directory not set
    • T1:
      • NDGF: UD power cut has downed a SAN
      • NDGF: suffering from checksum mismatches (files produced at Cyfronet); many files declared bad
      • RAL-LCG: GGUS:97117 SRM failure during weekend: Solved
      • SARA-Matrix: GGUS:97092 certificate error for connection to FZK: Solved

  • CMS reports (raw view) -
    • Quiet during the week; one problem during the weekend.
    • Weekend: the Frontier launchpads went down around 7 am on Saturday. They were restored a couple of hours later, but the Squids kept dying minutes after each restart. The problem was eventually tracked down to large MadGraph MC input files being sent to virtually all CMS sites. We asked for that particular processing workflow to be stopped, and since then (early Sunday morning) there have been no more issues. Investigations are ongoing; site availability will need manual fixes for all the sites.
    • GGUS tickets still open:
      • GGUS:96816 ("Debug transfer failing from RAL_Disk to Warsaw", lcg-support@gridpp.rl.ac.uk now involved, network checks being done)
      • GGUS:96843 ("Failing transfer from Wisconsin to IIHE", SSL handshakes being investigated... could be a peculiar SE SSL configuration)
      • GGUS:96912 ("CVMFS problems", waiting for reply, assigned to ICM) - NO REPLY AT ALL (even after pinging)
      • GGUS:96990 ("Bristol->RAL transfers") - in progress; it seems to be just high load on a single GridFTP server, and a second one is being added.
    • New GGUS
      • GGUS:97141 ("Seems an additional CVMFS black node case") - T2_RU_SINP, just opened
      • GGUS:97154 ("Zero sized files in RAL-LCG SE Castor for VO cms") - just opened. At first glance the zero-sized files can all be deleted, but waiting for an official reply from CMS.
    • We still have an open SNOW ticket (INC:365019) on the CERN Argus servers, which have overload problems and are not balancing properly. The solution is ready to be deployed; final confirmation is needed.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Main activity is MC productions
    • T0:
      • NTR
    • T1:
      • GRIDKA: interaction with the site FTS server was failing (alarm ticket GGUS:97119 opened yesterday). After a restart of the web service at the GridKa instance, which had a problem reading CRLs, the issue was fixed.
        • KIT: the FTS configuration had been changed to increase the log level to "debug" for an investigation. The verbosity was such that it filled up the local file system.
      • RAL: failing to create the runtime environment on several worker nodes because of the missing environment variable $HOME (GGUS:97122)
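
The KIT incident above is the classic failure mode of leaving debug logging enabled: size-based rotation caps the damage. A minimal logrotate sketch, assuming FTS logs live under /var/log/fts (path and limits are illustrative, not the KIT configuration):

```
/var/log/fts/*.log {
    size 500M      # rotate as soon as a log exceeds 500 MB
    rotate 4       # keep at most four old copies
    compress
    missingok
    notifempty
}
```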

Sites / Services round table:

  • KIT: on Friday evening the dCache write buffer in front of tape for CMS got full (because writing to tape was not working). Fixed this morning; the backlog of tape migrations is now being digested.
  • NDGF: one data center lost power today and was in scheduled downtime. Another scheduled downtime tomorrow for the upgrade of the SRM DB; higher performance is expected afterwards.
  • RAL: it is still not clear what the problem for ATLAS was. Still investigating.
  • ASGC: the upgrade of WNs to SLC6 has started and is basically done. However, only one SLC6 CREAM CE is available; the site is trying to add two more SLC6 CREAM CEs soon.

AOB:

Thursday

Attendance:

  • local: Simone (SCOD), Maarten (Alice), Ueda (ATLAS), Robert (CERN Dashboards), Luca (CERN Databases), Luca (CERN Storage), Vitor (CERN-PES)
  • remote: Tommaso (CMS), Ulf (NDGF), Xavier (KIT), Sang-Un (KISTI), Michael (BNL), David (CMS), Wei-Jen (ASGC), Ronald (NL-T1), Jeremy (GridPP), Rolf (IN2P3-CC), John (RAL), Rob (OSG), Lisa (FNAL), Pepe (PIC)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • NTR
    • T0
      • CERN-PROD (GGUS:97124) Fatal in <TROOT::InitSystem>: HOME directory not set. According to the pilot log the HOME environment variable is set, but ROOT probably cannot resolve the uid; nodes with a bad /etc/passwd or LDAP lookup failures could be the cause. No response from CERN since 2013-09-08; jobs are still failing
    • T1
      • INFN-T1: Prod jobs failing with "Grid proxy not valid" (GGUS:97255)
      • BNL-OSG2: Log area for the FTS server is full (GGUS:97261)
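
The uid/HOME hypothesis in GGUS:97124 can be tested directly on a suspect worker node. A minimal diagnostic sketch (the helper name is invented; this is not the ATLAS pilot code):

```python
import os
import pwd

def check_job_env():
    """Return a list of the environment problems GGUS:97124 points at:
    a missing HOME variable, or a uid with no passwd/LDAP entry
    (which would make TROOT::InitSystem fail)."""
    problems = []
    if "HOME" not in os.environ:
        problems.append("HOME is not set")
    try:
        pwd.getpwuid(os.getuid())
    except KeyError:
        problems.append("uid %d has no passwd entry" % os.getuid())
    return problems
```

An empty list means the node looks healthy by these two criteria.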

  • CMS reports (raw view) -
    • Mostly quiet
    • Wednesday afternoon: two incidents at CERN, affecting CMS SiteDB, CouchDB and the production server for a couple of hours
    • New(ish) GGUS tickets:
    • Old still open tickets:
      • GGUS:96990 Bristol->RAL transfers -- new server
      • GGUS:96912 CVMFS problems, due to a CE at Warsaw that was not fully taken out of service. Reply received from the site; CMS site support asked to follow up on their questions.
      • GGUS:96843 IIHE transfers to Wisconsin -- in progress
      • GGUS:96816 Transfers between RAL_Disk & Warsaw -- still in progress, now GGUS:96989 added

  • ALICE -
    • KIT: the long-standing instability of the local SE was resolved a few days ago, after switching off a configuration option that looked unrelated!

Sites / Services round table:

  • CNAF: regarding the upgrade to SL6, 50% of the resources have been migrated with no problems observed. The migration will be complete by the end of the month.
  • ASGC: network uplink now at 20 Gbps
  • OSG: LBNL accounting issue is being worked on.
  • PIC: next monday there will be the dCache upgrade. Batch system and FTS queues will be drained in advance, all published in GOCDB.

  • CERN DB: on Tuesday there was a problem with the CASTOR DB, which needed a reboot. On Wednesday a query on the public stager had to be fixed because it was generating high load.
  • Dashboards: some CREAM CE direct job submission results are missing between the 7th and the 9th for ALICE at RAL; Maarten will investigate. The ATLAS SRM deletion test was failing at INFN-T1 while the other ATLAS SRM tests (put and get) at the same site were working; Alessandro will investigate.
  • CERN Grid Services: 2 additional CEs at CERN pointing to SL6 WNs have been installed

AOB:

Topic revision: r15 - 2013-09-12 - SimoneCampana
 