WLCG Operations Coordination Minutes - November 20th, 2014

Agenda

Attendance

  • local: Andrea Sciabà (chair), Nicolò Magini (secretary), Tsung-Hsun Wu (ASGC), Marian Babik, Andrea Manzi (MW Officer), Luca Canali (IT-DB), Ignacio Reguero (Tier-0), Maarten Litmaath (ALICE), Stefan Roiser (LHCb), Alessandro Di Girolamo (ATLAS), Simone Campana (ATLAS), Maria Dimou, Alberto Aimar
  • remote: Alessandra Forti, Alessandro Cavalli (INFN-T1), Andrej Filipcic (ATLAS), Catherine Biscarat, Christoph Wissing (CMS), Di Qing (TRIUMF), Frederique Chollet, Jeremy Coles (GridPP), John Kelly (RAL), Michael Ernst (BNL), Renaud Vernet (CCIN2P3), Thomas Hartmann (KIT), Ulf Tigerstedt (NDGF-T1), Yuri Lazin (NRC-KI-T1), Burt Holzman (FNAL)

Operations News

  • Workshop on the future of Argus on December 11 (agenda)
  • WLCG critical services: updated information received from three experiments. Propose to discuss all input received at the next meeting.
  • WLCG survey: finalising the web form.
  • First meeting in January: on the 8th or the 22nd?

  • Agreed to hold a VIRTUAL meeting on January 8th.

Middleware News

  • the WLCG repository has been signed since Nov 11
    • the change has been announced to all WLCG sites
    • it is backward-compatible: old configurations ignore the added signatures
    • no issues were reported

  • Baselines:
    • new release of UMD (3.9.0):
      • gfal2/gfal2-utils (replacement for gfal/lcg-utils) are available
      • BDII Core 1.6.0, containing small fixes for GLUE2 and configuration changes to improve performance in the top BDII
      • WN 3.1.0 and UI 3.1.0 metapackages also available

  • MW Issues:
    • RHEL6.6 kernel fuse bug affecting CVMFS NFS installations. This affects all sites that move to SL6.6 and use CVMFS with an NFS installation. The problem was discovered at CERN and a patch has been sent to Red Hat. We recommend that all sites with this type of installation do not upgrade to SL6.6.
    • gridftp logging too verbose on DPM 1.8.9: the latest version of DPM has an issue with gridftp logging, which is always at the maximum level even when disabled. The problem was discovered in the context of the MW Readiness activity; the issue has been fixed and the fix is in EPEL-testing. In parallel, EGI already broadcast last week a request to wait for a new SAM update (end of the month) before upgrading to DPM 1.8.9, because there is an incompatibility with a SAM probe, so the problem hit 21 installations (according to the information in the BDII).

  • Andrea Manzi explains that the EGI ops SAM probe is failing against DPM 1.8.9 because DPM is now publishing WebDAV and the probe tries to test it with SRM. This does not affect the experiment SAM probes.

  • T0 and T1 services:
    • NDGF
      • dCache upgrade to 2.10.10, planning to update to next dCache with xrootd fixes
    • JINR-T1
      • dCache upgrade from 2.2.27 to 2.10.10
      • Updated xrootd configuration for AAA to meet EU privacy policy.
    • CERN
      • FTS upgrade to v 3.2.30

Oracle Deployment

  • IT-DB new hardware installations in: CERN computer centre and Wigner.
  • Timeline: testing in October, move to production by the end of 2014. Schedule will be updated accordingly.
  • Following table includes only those DB services that concern WLCG

| Database | Comment | Destination | Upgrade plans, dates |
| ATONR | Data Guard for Atlas Online | Wigner | Done |
| ATLR | Data Guard for Atlas Offline | Wigner | Done |
| ADCR | Data Guard for ADC | Wigner | Done |
| CMSR | Data Guard for CMS offline | Wigner | Done |
| LHCBR | Data Guard for LHCB Offline | Wigner | Done |
| LCGR | Data Guard for WLCG | Wigner | Done |
| CMSONR | Data Guard for CMS Online | Wigner | Done |
| LHCBONR | Data Guard for LHCB Online | Wigner | Done |
| CASTORNS | Data Guard for Castor Nameserver | Wigner | Done |
| ATONR | Active Data Guard for Atlas Online | CERN CC | Done |
| ALIONR | Active Data Guard for Alice Online | CERN CC | Done |
| ADCR | Active Data Guard for ADC | CERN CC | Done |
| CERNDB1 | CERNDB1/EDMSDB/ACCDB | CERN CC | Done |

  • Luca Canali explains that, as discussed with Dave Dykstra, the CMSONR Data Guard in Wigner can be queried by FronTier as a fallback for the CMSONR Active Data Guard in Meyrin, though performance will be lower.

Tier 0 News

  • Ignacio Reguero adds that the new myproxy host is already in the alias as of today, so it will be automatically exposed to real load.
  • Alessandro Di Girolamo comments that Alessandro De Salvo (ATLAS VO manager) already provided feedback months ago. Maarten asks to try again since the critical issues uncovered at the time were fixed.
  • ACTION on Alberto Peon and IT-PES to provide a summary of the experiment feedback on voms-admin at the next meeting.

Tier 1 Feedback

  • NDGF-T1: two new tape systems are being deployed (Oslo and Copenhagen) with serious load testing. At the same time one old system is being retired (Stockholm). Net effect is +1 ALICE tape system and +1 ATLAS tape system.

  • Ulf Tigerstedt asks about current ALICE activity. Maarten answers that ALICE is going to continue writing to tape at NDGF-T1 for the next 2 weeks.

  • Renaud Vernet (CCIN2P3) asks about the issue affecting SAM availability for many sites on Nov 18th. Marian Babik answers that availability will be recomputed for all sites and experiments; there were some network issues, not fully understood, between the SAM hosts and the sites.

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity during the last 2 weeks
  • RAL ARC CEs:
    • SAM tests go via the RAL WMS for the time being
    • direct job submission being worked on by an ALICE colleague from UA-BITP
      • good progress so far
  • KIT:
    • raw data reprocessing was affected by staging issues
    • OPN link to CERN thus saw high traffic for remote reading instead

  • Maarten confirms that the SAM probe for direct submission to ARC-CE under development is generic and can also be used by LHCb.

ATLAS

  • ATLAS Central Service status (migration to AI)
    • Quattor decommissioning ongoing; we are on schedule with respect to the agreements with CERN of the beginning of October: we have produced a list of nodes that won't be migrated for now (they will be destroyed at the end of the year), namely the old ATLAS DDM (DQ2) nodes plus a few other nodes in ProdSys-1.
    • migration to AI monitoring for all the ATLAS Central Services: ongoing. A series of coffee chats has been set up with other experiments and experts from various IT groups. The goal is to produce, together with various other "customers", a list of requirements for the new dashboards based on Kibana.
  • ProdSys2 and Rucio migration
    • the timeline for the migration has been defined and agreed. This should have no impact on the sites (but it does have an impact on ATLAS)
    • migration of the first analysis users foreseen for today, then ramp-up in the next days
    • stopping production in ProdSys1: low number of jobs for the next ~2 weeks
  • New VOMS servers have been fully tested and now activated for ATLAS since 19th of Nov, no issues observed. Good!
  • Multicore - passing Job requirements news: see MultiCore TF

CMS

  • Processing overview:
    • Phys14 DIGIRECO mainly on T1 sites
    • MC production for Upgrade (2023) configurations (T1 and T2)
    • Production campaign of MINIAOD (Tier-1 and Tier-2 sites) in the tails
  • Data transfer test T0->Tier-1 last week
    • Rather successful - target rates met
    • Continue this week
  • Tape staging exercise end of November or beginning of December
    • Details will be communicated to sites via CMS contacts (and relevant CMS HN)
  • Testing of new VOMS server infrastructure
    • Quite some testing concluded successfully: CRAB3 servers, central Phedex
    • Still ongoing: Glidein infrastructure
  • In the process of moving CRAB and central production into a single global Condor pool
  • VOMS-admin testing feedback
    • "so.. works (as expected) but a bit less smooth than I'd like." (S. Belforte)
    • "we should look for a smooth transition over a period of time, rather then try to be ready to switch on a given day." (S.Belforte)
  • Operational Issue
    • Nov 12th - FTS3 pilot briefly unavailable, affecting PhEDEx Debug transfers
    • Some failures of SAM test submission at the beginning of this week, not yet understood - condor_g submission issue?
  • Pushing for some site configurations
    • Adapt site-local-config.xml to include <phedex-node value="Tx_CO_Site{_type}"/> in the <local-stage-out> section and the same format (but with the PhEDEx name for the fallback endpoint) in <fallback-stage-out>
    • Phedex Space monitoring: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SpaceMonSiteAdmin
    • Will open (low priority) tickets in a few weeks to track progress
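
The requested site-local-config.xml change lends itself to a mechanical check. What follows is a minimal Python sketch, not a CMS tool: the sample document, the site names T2_XX_Example and T2_XX_Fallback, the surrounding element layout and the helper missing_phedex_nodes are all hypothetical; only the phedex-node element and the two stage-out section names are taken from the item above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed site-local-config.xml. Only the two
# stage-out sections and the phedex-node element come from the minutes;
# the site names and the enclosing elements are placeholders.
SAMPLE = """
<site-local-config>
  <site name="T2_XX_Example">
    <local-stage-out>
      <phedex-node value="T2_XX_Example"/>
    </local-stage-out>
    <fallback-stage-out>
      <phedex-node value="T2_XX_Fallback"/>
    </fallback-stage-out>
  </site>
</site-local-config>
"""

SECTIONS = ("local-stage-out", "fallback-stage-out")


def missing_phedex_nodes(xml_text):
    """Return the stage-out sections lacking a phedex-node with a value.

    A section that is absent altogether is reported as missing too.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for section in SECTIONS:
        found = [
            node
            for sec in root.iter(section)
            for node in sec.findall("phedex-node")
            if node.get("value")
        ]
        if not found:
            missing.append(section)
    return missing


print(missing_phedex_nodes(SAMPLE))
```

An empty list means both sections carry the element; any section name in the result still needs to be adapted.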

  • The SAM test issue was a network issue as discussed earlier.

LHCb

  • MC and user jobs in the last 2 weeks
  • Stripping 21
    • validation revealed a problem which delays the start of the campaign to early next week
    • still continuing to pre-stage input data onto disk-only BUFFER storage
  • VOMS2 servers
    • LHCbDIRAC released with all 4 VOMS servers (old+new), all functionality has been executed and works properly also with the new servers (as seen in the service logs)
  • Finalizing the update of critical services for Run 2

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA:
    • testing campaign ongoing
    • some issues (being) debugged at a few sites

Machine/Job Features

  • NR

Middleware Readiness WG

The WG met yesterday, Nov 19th. Agenda: http://indico.cern.ch/e/MW-Readiness_7 Summary:
  • The discovery of the DPM 1.8.9 bug via the MW Readiness verification process was the proof that this effort is needed, useful and actually working as it should.
  • When a MW package version is proved to work via the workflow of a given experiment and a new version is out for verification, other experiments which started later should go directly to the most recent version at hand.
  • FTS3 is listed as a desired product to verify for Readiness for ATLAS and CMS, but as it runs at very few sites it is not a priority: even if a bug is discovered in production, there is no risk of causing operational trouble for a large part of the community.
  • Tasks overview also linked from the agenda.
  • Following technical discussions between the MW Package Reporter and Pakiti developers and those responsible for WLCG and EGI security, a commonly agreed technical solution was adopted by which each site will be given the option to enable Pakiti only, the Package Reporter only, or both. Thus security concerns are addressed and site independence is respected. A release along these lines is expected during the 1st quarter of 2015.

  • Maria Dimou notes that, from the sites, only Jeremy Coles (GridPP) was connected at the Nov 19th meeting and encourages more participation from the sites.

Multicore Deployment

  • Passing parameters to the batch system: GDB report.
    • Currently completing the table of parameters from the batch systems and testing the CREAM-CE capabilities. The final summary of this and whatever we decide will be on this page.
      • FZK tested SGE scripts with what is in the current script and it works.
      • Manchester tested sending random strings to Torque: they do get accepted with direct job submission, so potentially we could use Glue2 for CREAM-CE if we really want. Tested also with the different operators >= and ==: BLAH appends a _Min suffix in the first case and nothing in the second. So we can restrict ourselves to using ==, as it should be, provided the scripts are adapted.
        • Adapting the scripts and how to distribute them hasn't been discussed yet.
      • The CERN script for LSF is heavily used but adopts a different method, grouping all the requests in one LSFResource parameter with no reference to any Glue schema.
      • ARC-CE uses RSL directly, no Glue.
      • HTCondor CE: not developed yet. The developers joined the TF this week.
  • Accounting: WLCG MB report
    • EGI is now looking into accelerating the progress of the new development portal
    • For correct accounting, the following is also needed at the sites:
      • EMI-3 CREAM CEs have to enable multicore support
      • Sites using SSM1.2 should move to SSM2
      • Sites using DGAS should move to the APEL client

  • Andrea Sciabà reminds everyone that the sites also need to enable a new option in APEL for the multicore accounting, as documented in the MB presentation.

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • LHCb done
    • ATLAS done
    • CMS - almost there...
  • retirement plans for the old servers
    • the old servers can be used until Nov 26, 14:00 UTC
      • the maximum proxy lifetime for the old servers will be as low as 2 days by then
    • the old VOMS ports will refuse connections from then on
    • we foresee that remote clients will not hang when trying the old servers
      • voms-proxy-init should immediately try the next server listed in its configuration files
    • experiments should get the old servers removed from the VOMS client configuration on the UI instances they use
      • for lxplus that will happen "automatically"
    • a broadcast will be done in the first week of Dec to inform sites that only the new servers need to be supported by then
    • VOMRS will keep running on the old lcg-voms for the time being
      • all registration URLs keep working
      • we want to move to VOMS-Admin in the near future...

  • Maarten clarifies that the VOMS servers cannot provide proxies with a longer lifetime than their own host certificate.
  • Agreed to switch off the VOMS daemons on the old VOMS servers even earlier (e.g. on Monday), to avoid getting proxies with a lifetime shorter than ~4 days. Alessandro Di Girolamo and Stefan Roiser confirm that this is OK for ATLAS and LHCb; Christoph Wissing also agrees for CMS pending the final tests.
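
The effect of the host certificate on proxy lifetimes is simple arithmetic. A minimal sketch in Python, assuming the old servers' host certificate expires around Nov 28, 14:00 UTC (inferred from the 2-day figure quoted for Nov 26, 14:00 UTC, not stated in the minutes); the function name, the requested one-week lifetime and the Monday time of day are illustrative only:

```python
from datetime import datetime, timedelta, timezone

# Assumption: expiry inferred from "as low as 2 days" at the
# Nov 26, 14:00 UTC cutoff; the minutes do not give the actual date.
ASSUMED_HOST_CERT_EXPIRY = datetime(2014, 11, 28, 14, 0, tzinfo=timezone.utc)


def max_proxy_lifetime(now, requested, cert_expiry=ASSUMED_HOST_CERT_EXPIRY):
    """Cap the requested proxy lifetime at the time left on the host certificate."""
    remaining = max(cert_expiry - now, timedelta(0))
    return min(requested, remaining)


cutoff = datetime(2014, 11, 26, 14, 0, tzinfo=timezone.utc)  # planned retirement
monday = datetime(2014, 11, 24, 14, 0, tzinfo=timezone.utc)  # earlier switch-off
week = timedelta(days=7)  # hypothetical requested lifetime

print(max_proxy_lifetime(cutoff, week))  # 2 days, 0:00:00
print(max_proxy_lifetime(monday, week))  # 4 days, 0:00:00
```

Under this assumption, switching off on Monday means that the last proxies issued by the old servers still have about 4 days of lifetime, matching the figure agreed above.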

  • Andrea Manzi asks about gridmapfile generation. Maarten answers that voms-admin can be kept running on the old servers for a while, to allow edg-mkgridmap to keep using the old servers (together with the new ones). The sites will be asked to remove the old servers from their configurations, but that may take a while to get done. This will be followed up.

  • The TF will provide a final report after the VOMS server migration at the next meeting.

IPv6 Validation and Deployment TF

  • NR

Squid Monitoring and HTTP Proxy Discovery TFs

  • No progress. Alastair Dewhurst has promised to make the updates to the monitor before the end of the month.

Network and Transfer Metrics WG

  • 107 instances updated to 3.4.1 following the WLCG and EGI broadcasts sent with the new install/update instructions
  • Second broadcast to be sent next week, deadline to update will be 8th January 2015
  • Planning to start validation of the existing 3.4.1 sonars next week
  • perfSONAR data store configured in ITB; stress testing to start next week
  • Metrics area meeting to be held next week (http://doodle.com/ezrfh8eybu7iybxyqzrcbze9)

  • Marian Babik explains that December 8th is the recommended deadline for the perfSONAR upgrade as previously announced, and January 8th is the hard deadline.

Action list

  1. ONGOING on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
    • Maite to provide full updated statistics after lxplus5 closure.
  2. ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor-CE. Status: HTCondor-CE tests enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM). The number of CMS sites publishing HTCondor-CE is increasing.
    • Ongoing discussions on publication in AGIS for ATLAS.
  3. ONGOING on experiment representatives - report on voms-admin test feedback
  4. NEW on Alberto Peon and IT-PES - summarize feedback on voms-admin received so far at the next meeting
  5. ONGOING on Andrea Sciabà - review the critical services table
    • Andrea will provide an update at the next meeting.
  6. CLOSED on Alessandro Di Girolamo - suggest to monitoring team to organize common meeting with experiments to collect requirements for new SLS dashboards.
    • The first meetings were held, now an ongoing activity.

  • On HTCondor-CE, Alessandro Di Girolamo comments that, as he reported to the HTCondor developers, he considers the publication of HTCondor-CEs an opportunity to go in a common direction, avoiding quick fixes that may be manageable in the short term but not in the long term: the solution adopted for CMS of publishing the sam_uri in OIM is not enough for all the ATLAS workflows.

AOB

  • Next meeting on December 4th.

-- NicoloMagini - 2014-11-03

Topic revision: r32 - 2018-02-28 - MaartenLitmaath