Week of 150921

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Maria Dimou (SCOD), Maria Alandes (WLCG Ops), Alessandro Fioro (T0 Storage), Prassanth Kothuri (T0 DB).
  • remote: Dmytro Karpenko (NDGF), Francesco Noferini (CNAF), John Kelly (RAL), Lisa Giacchetti (FNAL), Sang Un Ahn (KISTI), Rolf Rumler (IN2P3), Kyle Gross (OSG), Christoph Wissing (CMS), Dimitri (KIT), Onno Zweers (NL-T1).

Experiments round table:

  • ALICE -
    • [ These items have been uploaded late Sunday evening and may not cover this Monday. ]
    • Normal to high activity
    • For the past 10+ days the reco jobs have been downloading raw data files from CASTOR instead of streaming them (see the sketch after this list).
      • Looks good so far.
    • Further meetings with CASTOR and EOS experts to discuss short- and longer-term ideas to stabilize usage of CASTOR.
      • Various scenarios are available to help ensure a successful heavy-ion data taking period.
    • ALICE operations experts are participating in the workshop on network evolution in Asia at KISTI this week.
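A minimal sketch of the "download instead of stream" access mode mentioned above, for illustration only: this is not ALICE's actual reconstruction code, and the CASTOR URL and local path are placeholders. The idea is simply to stage the whole raw file to local scratch with xrdcp and let the job read the local copy, rather than keeping a streaming connection open to the CASTOR disk server.

    import os
    import subprocess

    # Placeholder CASTOR URL; a real job would receive this from the workload
    # management system.
    SRC = "root://castoralice.cern.ch//castor/cern.ch/alice/raw/run0000/file.raw"
    DST = os.path.join("/tmp", os.path.basename(SRC))

    # Stage the full file locally ("download" mode) ...
    subprocess.check_call(["xrdcp", "-f", SRC, DST])
    # ... and then the reconstruction opens DST like any local file.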

Sites / Services round table:

  • ASGC: not connected
  • BNL: not connected
  • CNAF: We plan (to be confirmed at Thursday's WLCG meeting) to replace the broken part of one of the two electrical supplies affected by the fire next Tuesday (29/9/2015). During the operation, as agreed with LHCb, a downtime will be declared only for the LHCb filesystem, which is the one considered at risk.
  • FNAL: NTR
  • GridPP: not connected
  • IN2P3: Scheduled downtime tomorrow 22/9. dCache will be down for 1 hour (10:00-11:00). Batch and mass storage systems will be down for the whole day.
  • JINR: not connected
  • KISTI: NTR
  • KIT: The site has observed CMS test jobs failing since last week and asks for the experiment's help to understand the reasons. Christoph promised to open a GGUS ticket and select "Notify Site: KIT" to debug this.
  • NDGF: The Oslo site is down for maintenance today and tomorrow. Data will be read only and access to the data might be slow.
  • NL-T1: NTR
  • NRC-KI: not connected
  • OSG: NTR
  • PIC: not connected
  • RAL: There will be an Oracle intervention on the standby databases tomorrow that shouldn't affect production. A problem with the latest FTS3 version is observed on the production system but not on the test one, which is used by ATLAS. The developer sees a memory leak and is still debugging. The issue may be due to a non-LHC VO that uses FTS3, but it is too early to say.
  • TRIUMF: not connected

  • CERN batch and grid services: not present
  • CERN storage services: NTR
  • Databases: NTR
  • GGUS: not present
  • Grid Monitoring:
    • Final Availability reports for August sent to the lcg office
  • MW Officer:
    • An issue has been opened in GGUS:116274, reporting that the new version of HTCondor (8.4.0) cannot be installed on WNs running UMD3 because of dependency problems. The ticket has been assigned to UMD support for investigation.
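A rough sketch of how such dependency problems can be inspected on an RPM-based UMD3 worker node, assuming the HTCondor 8.4.0 RPM has been downloaded locally; the filename is a placeholder and this is only an illustration, not the procedure followed in the ticket.

    import subprocess

    RPM_FILE = "condor-8.4.0-1.el6.x86_64.rpm"   # placeholder package file

    def package_requires(rpm_file):
        """Return the capability names required by the (not yet installed) package."""
        out = subprocess.check_output(
            ["rpm", "-qp", "--requires", rpm_file], text=True)
        # Keep only the capability name; ignore version constraints for simplicity.
        return sorted({line.split()[0] for line in out.splitlines() if line.strip()})

    def locally_provided(capability):
        """True if some installed package provides this capability."""
        return subprocess.call(
            ["rpm", "-q", "--whatprovides", capability],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

    for cap in package_requires(RPM_FILE):
        if cap.startswith("rpmlib("):        # internal rpmlib features
            continue
        if not locally_provided(cap):
            print("unresolved dependency:", cap)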

AOB:

Thursday

Attendance:

  • local: Maria Dimou (SCOD), Alessandro Fioro (T0 Storage), Gavin McCance (T0 Grid Services).
  • remote: Dmytro Karpenko (NDGF), Matteo (CNAF), Gareth Smith (RAL), Lisa Giacchetti (FNAL), Sang Un Ahn (KISTI), Rolf Rumler (IN2P3), Chris Pipes (OSG), Christoph Wissing (CMS), Pavel Weber (KIT), Andrew Pickford (NL-T1), Zoltan (LHCb), Michael Ernst (BNL), Di Qing (TRIUMF).

Experiments round table:

  • ATLAS reports (raw view) -
    • Apologies - Software and Computing week
    • BNL:
      • FTS issues with 1 of 3 servers after updating machine certs on all three machines. The site responsible is in contact with the FTS developers.
Michael further explained the BNL setup at the meeting: there are 3 load-balanced servers sharing the same database. One of the 3 hosts misbehaves, causing core dumps during URL copying. The experts at BNL are in touch with the developer at CERN.

  • ALICE -
    • the workshop on network evolution in Asia at KISTI this week has concluded successfully
    • CERN:
      • timeouts and failures hampered data transfers from P2 to CASTOR starting Mon afternoon
      • data taking had to be stopped Tue morning when all P2 disks were full
      • the trouble was due to new CASTOR disk servers having an incorrect setup
        • quickly fixed
      • we thank the CASTOR team for their prompt actions and followup!

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of the proton-proton and proton-helium data has been started.
      • Data processing, MC and user jobs
    • T0
      • The investigation of the slow worker is being followed up internally; the ticket is closed (GGUS:116023)
      • Permission denied on EOS for some users (in progress) (GGUS:116243)
      • Failed transfers to CERN-RAW (GGUS:116321)

Sites / Services round table:

  • ASGC: not connected
  • BNL: nothing to add
  • CNAF: An intervention is now ongoing to add memory to the WNs. They should all be back tomorrow as part of an LSF9 cluster. The planned downtime for Tuesday 29/9 is now confirmed.
  • FNAL: Scheduled downtime on Tuesday 29/9 to upgrade the GUMS service. It will start at 10am CST and will probably last for about one hour but it is announced till 3pm in GOCDB to be on the safe side.
  • GridPP: not connected
  • IN2P3: NTR
  • JINR: not connected
  • KISTI: NTR
  • KIT: Scheduled downtime on Tuesday 29/9 for the whole GridKa site, from 2am UTC until Thursday Oct 1st. By default, jobs still running at the beginning of the intervention will be aborted. If experiments wish the queues to be drained instead, please open a GGUS ticket and select "Notify Site: KIT".
  • NDGF: NTR
  • NL-T1: (Content by Onno Zweers) On Tuesday, SURFsara updated the SRM component of the dCache cluster to fix a security vulnerability. The SRM door had to be restarted, which may have interrupted some transfers or SRM operations.
  • NRC-KI: not connected
  • OSG: Now completing a series of planned OS upgrades and reboots with success and no service interruptions.
  • PIC: not connected
  • RAL: FTS3 looks OK now at RAL; detailed report further down on this page. Last Tuesday the site was put at risk to perform a CASTOR database upgrade. There will be more such interventions, all published in GOCDB.
  • TRIUMF: NTR

  • CERN batch and grid services: FTS3 at CERN ran out of inodes in the last few days because of the many log files on disk. PES is cleaning them up and has reduced the log history to 3 days; see the report further down on this page.
  • CERN storage services: NTR
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer: Text provided by the FTS3 developer Alejandro Alvarez Ayllon:
    • BNL and RAL upgraded to the latest version, 3.3.1. Almost at the same time, the RAL production instance started to run out of memory, while the RAL test instance did not, even though it was also upgraded. FTS3 at BNL did not seem to show this symptom. Initially this was thought to be related to the upgrade, but later FTS3 at CERN, which had not been upgraded, started having memory exhaustion too.
      • After some debugging, new, heavier usage of the monitoring appeared to be the culprit.
      • Django stores query results in memory and never manages to return that memory to the system, probably because of fragmentation.
      • A new, smaller update with configuration changes triggers a kill & respawn of a separate Django process to return the memory to the system. This solved the situation at RAL; memory usage is back to normal.
      • In summary, the outages were not caused by the upgrade; they just happened at the same time.
Also, FTS3 at CERN ran out of inodes in the last few days because of the many log files on disk; PES is cleaning them and has reduced the log history to 3 days.
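As an illustration of the kill & respawn approach described above (a sketch only, not the actual FTS3/Django change; run_monitoring_queries() is a hypothetical placeholder): the memory-heavy work is executed in a short-lived child process, so whatever memory Python/Django fragments is returned to the OS when the child exits.

    import multiprocessing

    def run_monitoring_queries():
        # Placeholder for the Django monitoring queries that accumulate
        # fragmented memory in the real service.
        results = [x * x for x in range(1000000)]
        return len(results)

    def run_in_fresh_process(target):
        """Run `target` in a child process that is respawned for every call.

        All allocations happen in the child, and the OS reclaims them
        entirely when the child terminates, so the parent never grows.
        """
        proc = multiprocessing.Process(target=target)
        proc.start()
        proc.join(timeout=300)       # don't let a stuck worker hang forever
        if proc.is_alive():
            proc.terminate()         # the "kill" part of kill & respawn
            proc.join()

    if __name__ == "__main__":
        for _ in range(3):           # each round gets a brand-new worker
            run_in_fresh_process(run_monitoring_queries)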
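Similarly, a minimal sketch of a 3-day log retention clean-up like the one applied to the FTS3 logs; the directory and the exact mechanism used by PES are assumptions, so this is illustration only.

    import os
    import time

    LOG_DIR = "/var/log/fts3"        # placeholder location of the FTS3 logs
    MAX_AGE = 3 * 24 * 3600          # keep only the last 3 days

    now = time.time()
    for root, _dirs, files in os.walk(LOG_DIR):
        for name in files:
            path = os.path.join(root, name)
            try:
                if now - os.path.getmtime(path) > MAX_AGE:
                    os.remove(path)  # frees the inode as well as the space
            except OSError:
                pass                 # file vanished or not removable; skip it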

AOB:
