Week of 150302

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally until 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Alessandro (ATLAS), Jan (storage), Maarten (SCOD + ALICE), Steve (grid services), Zbyszek (databases)
  • remote: Christian (NDGF), Christoph (CMS), Di (TRIUMF), Felix (ASGC), Lisa (FNAL), Onno (NLT1), Pepe (PIC), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI), Tiju (RAL), Vladimir (LHCb)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Central Services/T0/T1
      • FTS3 servers upgrade: we would like BNL, RAL and CERN to upgrade to the latest FTS server release to fix the activity share issue.
        • Steve: the downtime will be less than 5 min; do the upgrade tomorrow (Tue)?
        • Christoph: OK for CMS
        • Vladimir: OK for LHCb
        • Alessandro: we will only re-enable the shares once all 3 FTS for ATLAS have been upgraded; we will coordinate this through the FTS-3 mailing list

  • ALICE -
    • high activity
    • RAL now allows up to 6k concurrent ALICE jobs when other VOs have little activity - thanks!

  • LHCb reports ( raw view) -
    • Distributed computing dominated by Monte Carlo and user activities.
    • T0: NTR
    • T1: NTR
    • Vladimir: there are fewer jobs than usual due to a minor issue that appeared after the migration to the DIRAC File Catalog and is being worked on

Sites / Services round table:

  • ASGC: ntr
  • BNL:
  • CNAF:
  • FNAL:
    • the issue of GGUS alarm tickets paging the wrong number has been resolved
  • GridPP:
  • IN2P3:
    • reminder of the downtime tomorrow; the ATLAS tape buffer space will be more than sufficient during the intervention
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF:
    • downtime tomorrow morning for dCache head nodes upgrade
    • there was a power supply problem at the Copenhagen site; the problem has disappeared, but the root cause has not been found yet
  • NL-T1:
    • on Mon March 9 the dCache upgrade from 2.6 to 2.10 will start; the downtime has been declared for 2 days because the DB upgrade might not finish in 1 day
  • NRC-KI:
  • OSG: ntr
  • PIC:
    • on Tue March 10 there will be a 5h downtime for network upgrades and maintenance as well as a dCache 2.10 patch
  • RAL: ntr
  • RRC-KI:
  • TRIUMF: ntr

  • CERN batch and grid services:
    • the migration from VOMRS to VOMS-Admin is proceeding as planned; the new service is expected to be available by 17:00 CET (and it is)
  • CERN storage services:
    • CASTOR-LHCb will be down for an upgrade this Wed
    • CASTOR-ALICE will be down for an upgrade next Mon
    • a bunch of rolling DB updates will soon be announced; they should be fairly transparent from the client side; some daemons may need to be restarted on the server side
  • Databases:
    • tomorrow there will be rolling updates of WLCG integration DBs (INT6R and INT11R)
      • the corresponding production DB updates will be done in 1 or 2 weeks
    • tomorrow there will be a full-day transparent intervention on back-end storage serving half of all DBs
      • CASTOR, WLCG and experiments are affected in principle
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Alberto (grid services), Andrea M (MW Officer), Joel (LHCb), Maarten (SCOD + ALICE)
  • remote: Christian (NDGF), Dennis (NLT1), Di (TRIUMF), Felix (ASGC), John (RAL), Jose (CMS), Lisa (FNAL), Michael (BNL), Pepe (PIC), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI), Thomas (KIT), Vladimir (LHCb)

Experiments round table:

  • CMS reports ( raw view) -
    • Nothing to report. No significant problems since Monday.

  • ALICE -
    • NTR

  • LHCb reports ( raw view) -
    • Distributed computing dominated by Monte Carlo and user activities.
    • T0: NTR
    • T1:
      • GRIDKA: problem with pilot submission; WARNING Downtime
        • see KIT report below
      • IN2P3: Last downtime did not include SEs, but they were unavailable
        • Rolf: the SE services should have been available as far as we could see. We did notice some LHCb disk space getting full during the downtime and alerted our LHCb contact.
        • Joel: the user space got full indeed, but our issue was not with that.
        • Vladimir: the SE remained affected after all supposedly relevant downtimes had finished. We did not open a ticket yet. It probably is OK now.
        • Rolf: we will look further into it.
    • Joel: the VOMRS to VOMS-Admin migration went OK and the new service was working fine until Tue around lunchtime, when DIRAC was no longer able to talk to it.
      • The service had picked up a new version of Java that disables SSLv3 by default.
      • On the DIRAC side TLSv1 was tried, but that failed with another error.
      • Proposal: try downgrading Java or re-enabling SSLv3 on the old voms.cern.ch as a test?
        • Update: re-enabling SSLv3 did the trick for now. The DIRAC developers will be asked to look into the SSLv3 dependency with some urgency.
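
For reference, the SSLv3 deactivation described above corresponds to the jdk.tls.disabledAlgorithms security property introduced in the JDK updates of early 2015 as the POODLE mitigation. The minutes do not record the exact change made on voms.cern.ch; a sketch of the workaround, assuming it was done by editing $JAVA_HOME/jre/lib/security/java.security and restarting the service:

```properties
# After the January 2015 JDK updates (POODLE mitigation), SSLv3 is
# disabled JVM-wide by this default line in java.security:
#   jdk.tls.disabledAlgorithms=SSLv3
#
# Stopgap sketch: remove SSLv3 from the list (and restart the service)
# so legacy clients such as DIRAC can still negotiate SSLv3.
# This re-opens the POODLE vulnerability and should only be temporary,
# until the client is fixed to use TLS.
jdk.tls.disabledAlgorithms=
```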

Sites / Services round table:

  • ASGC: ntr
  • BNL:
    • a complex intervention on Tue was completed successfully: maintenance on dCache, network and SE HW
  • CNAF:
  • FNAL: ntr
  • GridPP:
  • IN2P3:
    • the scheduled downtime for batch took much longer than planned:
      • an unplanned emergency power cut interfered
      • the new version could not be made to work and had to be rolled back in the end
      • now the site is working again (still with the previous version)
  • JINR:
  • KISTI:
    • KISTI will have a scheduled downtime from 9 March 2015 00:00 (UTC) to 10 March 2015 09:00 (UTC) for re-cabling work on nodes. Tier-1 services are not expected to be affected by this intervention.
  • KIT:
    • there have been network issues lasting ~2 days:
      • a second 10G link to CERN was added
      • this caused intermittent routing issues
      • they were thought to be fixed yesterday evening, but further problems were seen this morning
      • for now the traffic only uses the first link
      • Vladimir: your site looks OK now for LHCb
  • NDGF:
    • the dCache intervention on Tue went OK
    • still power problems in Copenhagen:
      • 1 phase is OK and the SE nodes only use that one
      • the other phases are used for CE services
  • NL-T1: ntr
  • OSG: ntr
  • PIC: ntr
  • RAL: ntr
  • TRIUMF: ntr

  • CERN batch and grid services:
    • VOMS-admin migration happened on Monday.
      • The migration went well, but a few groups of users lost their certificates during the migration. Most of them have been recovered; work on the remainder is still in progress.
      • Please, report any strange behavior to GGUS:110227
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer: ntr

AOB:

Topic revision: r10 - 2015-03-05 - MaartenLitmaath
 