Week of 150810

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Luca (Storage + SCOD), Andrea V, Jiri (ATLAS), Sabine (ATLAS), Asa (ASGC), Cheng-Hsi (ASGC), Akos (Batch&Grid)
  • remote: Mark (LHCb), Michael (BNL), Sebastian (IN2P3), Dimitri (KIT), Onno (NLT1), Rob (OSG), John (RAL), Di Qing (TRIUMF)

Experiments round table:

  • ALICE -
    • [ these items were uploaded Sun late evening and may not cover this Monday ]
    • CERN: raw data reading by jobs was finally OK late last week
      • the number of activity slots per disk server was greatly increased by the CASTOR team
      • to be seen if that is OK also during data taking
    • New record of 95k concurrently running jobs reached on Sun

  • LHCb reports (raw view) -
    • Data Processing: data "stripping" productions ongoing and almost complete; ready to process new data. User and MC jobs running.
    • T0:
      • Nothing to report
    • T1
      • RAL: Network issues over the weekend
      • GRIDKA: Problems over the weekend resolving TURLs from SURLs, possibly an overloaded SRM. Difficult to track down, as the jobs retry or go to failover and complete (see the sketch below).
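
      For reference, the SURL-to-TURL resolution step that was failing can be exercised with the gfal2 Python bindings. This is only a minimal sketch, assuming gfal2 and a valid grid proxy are available; the SURL below is a hypothetical placeholder, not a real LHCb file:

        import gfal2

        # Create a gfal2 context; credentials are taken from the environment (grid proxy).
        ctx = gfal2.creat_context()

        # Hypothetical SURL on an SRM endpoint (placeholder, not an actual GridKa path).
        surl = "srm://srm.example-gridka.de/pnfs/gridka.de/data/lhcb/some.file"

        # For SRM URLs, the 'user.replicas' extended attribute asks the SRM
        # for a transport URL (TURL); this is the step that times out when
        # the SRM is overloaded.
        turl = ctx.getxattr(surl, "user.replicas")
        print(turl)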

Sites / Services round table:

  • ASGC: Due to the typhoon in Taiwan the cooling system was down, which caused the systems to overheat. During the weekend many services were unavailable, but they have now recovered.
  • BNL: NTR
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3: NTR
  • JINR:
  • KISTI:
  • KIT: NTR
  • NDGF:
  • NL-T1: The ATLAS staging pool is affected by a problem in dCache version 2.10.35: dCache now disables a pool more readily when it finds problems on it, including for a specific kind of single-file problem. In this particular case, when the pool is under some load, timeouts cause the pool to be disabled. The site is in touch with the dCache developers and is considering a possible downgrade.
  • NRC-KI:
  • OSG: NTR
  • PIC:
  • RAL: Network outage on Saturday morning from 7:30 to 10:00 (local time) due to a problem on a core switch. During this period GOCDB was unavailable.
  • TRIUMF: NTR

  • CERN batch and grid services: NTR
  • CERN storage services: NTR
  • Databases:
  • GGUS:
  • Grid Monitoring: NTR
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Luca (SCOD), Michal (ATLAS), Akos (Batch&Grid), Andrea M (Middleware)
  • remote: Vladimir (LHCb), Asa and Cheng-Hsi (ASGC), Michael (BNL), Renaud (IN2P3), Thomas (KIT), John (RAL), Di Qing (TRIUMF), Rob (OSG)

Experiments round table:

  • ATLAS reports (raw view) -
    • Taiwan-LCG2 - deletion, pilots and transfers have recovered from the typhoon
    • NDGF - network issue on Monday, now solved
    • FZK - queues not online for a few days because of SSL issues

  • ALICE -
    • high activity
    • a record of 97k concurrently running jobs was briefly reached on Mon
      • taking advantage of resources normally occupied by other VOs
    • CERN: 0.4% of raw data files could not be read from CASTOR by reconstruction jobs
      • possibly due to a new host that should not have been running CASTOR daemons yet (fixed now)
      • the failed jobs will be resubmitted

  • LHCb reports (raw view) -
    • T0: nothing to report
    • T1:
      • GRIDKA: We still have problems resolving TURLs from SURLs.

Sites / Services round table:

  • ASGC: Nothing to report, services recovered after the typhoon
  • BNL: NTR
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3: NTR
  • JINR:
  • KISTI:
  • KIT:
    • Found the root cause of the network issue: a segment of the network was losing packets, creating problems with the SSL handshake. The routing was rearranged to fix the issue.
    • The site is asking LHCb to add more information to the ticket to help debug the SRM issue.
  • NDGF:
  • NL-T1:
  • NRC-KI:
  • OSG: NTR
  • PIC:
  • RAL: NTR
  • TRIUMF: NTR

  • CERN batch and grid services: NTR
  • CERN storage services: NTR
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:
