WLCG Operations Coordination Minutes - Virtual meeting April 23rd, 2015

Agenda

Attendance

  • local: Maria Dimou (minutes & organisation. This is a virtual meeting)
  • remote: This is a virtual meeting

Operations News

WLCG Workshop at Okinawa & CHEP-specific feedback

Those who were present, please edit this section or send it to Maria Dimou for homogeneous editing.

Middleware News

  • MW Issues:
    • NTR

  • T0 and T1 services
    • CERN
      • FTS3 upgrade to v3.2.33 planned for next week
    • CNAF
      • All StoRM instances moved to the new virtualized environments
    • JINR-T1
      • Minor dCache upgrade to 2.10.25
    • NDGF
      • Minor dCache upgrade to 2.12.5
    • TRIUMF
      • Major dCache upgrade to 2.10.24

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • normal to very high activity
    • new job records on Sunday Apr 19: 82k maximum, 80k+ for a few hours!
    • taking advantage of opportunistic resources
  • CASTOR at CERN: file access instabilities for re-reco jobs (GGUS:113106)
    • ad-hoc cure applied by CASTOR operations team
    • code fix being implemented

ATLAS

CMS

  • Main production activities
    • DIGI-RECO of Upgrade Monte Carlo - Tier1s
  • reached the quota limit on the EOS unmerged area, since the T0 now also writes there. New quotas are being worked out and set.
  • some sites are complaining about "hot files" being requested much more often than average. This is understood at our SW level: the first pile-up file is always opened in order to read metadata. The processing agent has been patched to randomize the opening.
  • Some problems with CASTOR at CERN, likely saturating number of connections
  • CMS streamer files now are written to EOS (and not to CASTOR)
  • Tickets about mis-behaving CEs (see action items)
    • The CMS Site Status Board marks a site red only when all of its CEs are failing
    • A site can also go red when some CEs are red and the others are 'unknown' (a question of the implemented logic)
    • Ambitious people might still send tickets when they see individual CEs failing
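
The site-level rule described above can be sketched as follows. This is only an illustration of the logic as stated in these minutes, not the actual Site Status Board code: the names `CEState` and `site_is_red` and the flag `unknown_counts_as_failing` are hypothetical, and the treatment of a site whose CEs are all 'unknown' is an assumption.

```python
from enum import Enum


class CEState(Enum):
    """Hypothetical per-CE states, named after the colours in the minutes."""
    OK = "ok"
    FAILING = "failing"   # shown red
    UNKNOWN = "unknown"


def site_is_red(ce_states, unknown_counts_as_failing=True):
    """Return True if the site as a whole should be shown red.

    Assumed rule: a site is never red while at least one CE is OK.
    With unknown_counts_as_failing=True, a mix of failing and unknown
    CEs also turns the site red (the behaviour noted in the minutes).
    """
    states = list(ce_states)
    if not states:
        return False  # assumption: a site with no CEs is not flagged
    if CEState.OK in states:
        return False  # one working CE is enough to keep the site green
    if unknown_counts_as_failing:
        # No CE is OK; red as soon as at least one is actually failing.
        return CEState.FAILING in states
    # Strict variant: every single CE must be failing.
    return all(state is CEState.FAILING for state in states)
```

Under this sketch a shifter would open a ticket only when `site_is_red(...)` is true, so a single red CE next to a working one would not trigger a ticket.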

LHCb

  • Operations dominated by Monte Carlo productions and user analysis
  • Few operational issues:
    • RAL network problems on 8/9 April
    • NIPNE FTS transfer failures out of the site

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA:
    • testing campaign covers all ATLAS analysis sites that currently have gLExec (61 out of 94)
    • the issues at RAL and RALPP appear to be related to the use of the ARC CE and/or Condor
      • similar errors appeared at TW-FTT when they switched to such a setup
      • more debugging enhancements have been submitted for inclusion in the next pilot release

SHA-2

  • REMINDER: the old VOMS server aliases (lcg-)voms.cern.ch are to be removed in a few days!
  • Rationale:
    • The only reason for keeping the old VOMS server aliases was to allow the old VOMRS registration URLs to forward users to the new VOMS-Admin services:
    • While the lcg-voms.cern.ch alias still exists (and voms.cern.ch too), we need special exceptions in the CERN routers to prevent remote clients from hanging if they still try to access the VOMS daemons that used to run on those hosts.
    • Of course, nothing should be doing that anymore, but you can be sure there still are plenty of places where the old VOMS endpoints are configured along with the new ones. In such cases we want clients to fail over quickly, instead of hanging and timing out.
    • We have now made the transition from VOMRS to VOMS-Admin, and although not everything is perfect yet, we will not need to go back.
    • Therefore we would finally like to remove those old aliases and get rid of those routing exceptions.

Machine/Job Features

  • in contact with more sites that would like to try out MJF on their batch clusters, but all at low priority
  • currently preparing a technical report on MJF for the HSF

Middleware Readiness WG

Multicore Deployment

IPv6 Validation and Deployment TF

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report this week -- one of the main developers is going on paternity leave soon

Network and Transfer Metrics WG

HTTP Deployment TF

Action list

  • The network incident (degradation) between TRIUMF and RAL reported by ATLAS will serve as a case to test the procedure put in place by the network metrics WG.
  • CMS instructions to shifters to be changed so that tickets are not opened if just one CE is red.
  • Maarten to follow up with the experiments to ensure that the dates for removing the VOMS server aliases, as reported in the SHA-2 TF section above, are kept.

AOB

  • The next meeting is on May 7th.

-- MariaDimou - 2015-04-21

Topic revision: r25 - 2015-04-27 - MaartenLitmaath
 