WLCG Operations Coordination Minutes, January 26th 2017

Highlights

  • The new WLCG accounting portal has been validated.
  • Please check the new accounting reports: if no major problems are reported, they become official as of January.
  • Please check the baseline news and issues in the MW news.
  • Long downtimes proposal and discussion.
  • Tape staging test presentation and discussion.

Agenda

Attendance

  • local: Alberto (monitoring), Alejandro (FTS), Alessandro (ATLAS), Andrea M (MW Officer + data management), Andrea S (IPv6), Andrew (LHCb + Manchester), Jérôme (T0), Julia (WLCG), Kate (WLCG + databases), Maarten (WLCG + ALICE), Maria (FTS), Marian (networks + SAM), Vincent (security)
  • remote: Alessandra (ATLAS + Manchester), Catherine (IN2P3 + LPSC), Christoph (CMS), David B (IN2P3-CC), Di (TRIUMF), Frédérique (IN2P3 + LAPP), Gareth (RAL), Kyle (OSG), Marcelo (LHCb), Oliver (data management), Renaud (IN2P3-CC), Ron (NLT1), Stephan (CMS), Thomas (DESY-HH), Vincenzo (EGI), Xin (BNL)
  • Apologies: Nurcan (ATLAS), Ulf (NDGF-T1)

Operations News

  • Next WLCG Ops Coord meeting will be on March 2nd

Middleware News

  • T0 and T1 services
    • CERN
      • check T0 report
      • FTS upgrade to v. 3.5.8 and C7.3 planned for next week
    • FNAL
      • FTS upgrade to v. 3.5.7
    • IN2P3
      • dCache upgrade 2.13.32 -> 2.13.49 in Dec 2016
    • JINR
      • dCache minor upgrade 2.13.49 -> 2.13.51, xrootd minor upgrade 4.4.0 -> 4.5.0-2.osg33
    • NDGF-T1
      • Upgrade to dCache 3.0.5 next week to fix a rare communication bug.
    • NL-T1
      • SURFsara upgraded dCache from 2.13.29 to 2.13.49 on Dec 1
    • RAL
      • Castor upgrade to v 2.1.15-20 ongoing, Tape servers upgraded to v 2.1.16
      • Almost completed migration of LHCb data to T10KD drives.
      • An update of the SRMs to version 2.1.16-10 has been planned
    • RRC-KI-T1
      • Upgraded dCache for tape instance to v 2.16.18-1

Tier 0 News

  • Apex 5.1, released in December, has been installed on the development instance; the upgrade of the production environments is scheduled for February.
  • The main compute services for the Tier-0 are now provided by 31k cores in HTCondor, 70k cores in the LSF main instance, 12k cores in the dedicated ATLAS Tier-0 instance, and 21.5k cores in the CMST0 compute cloud. Some CC7-based capacity is available in HTCondor for user testing. We are in contact with major user groups to provide consultancy on migrating to HTCondor.
  • 2016 LHC data taking ended with the p-Pb run; about 5 PB have been collected in December. Since then there has been a lot of consolidation activity.
  • The p-Pb run for ALICE was performed using a 1.5 PB Ceph-based staging area, without incident or slow-down. This configuration, offering increased flexibility and easier operation, is now considered production-ready for CASTOR.
  • Mostly transparently to users, EOS and CASTOR instances have been rebooted. On 23 January the failover of an EOS ATLAS headnode failed, causing service disruption; the root cause is being investigated. Some EOS CMS instability requiring a headnode to be rebooted has been due to heavy user activity.
  • New storage hardware will be installed in February as soon as it becomes available to the service.
  • The FTS service has been upgraded to v3.5.8 and moved to a refreshed VM image, based on CentOS 7.3, that fixes vulnerabilities.
  • In January the mark of one billion files in EOS at CERN was crossed.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • High to very high activity since the end of the proton-ion run
    • Also during the end-of-year break
    • In particular for Quark Matter 2017, Feb 5-11
    • Thanks to the sites for their performance and support!
  • On Wed Jan 4 the AliEn central services suffered a scheduled power cut
    • Normally such cuts are short and do not present a problem
    • This time the UPS was exhausted many hours before the power was restored
    • All ALICE grid activity started draining away
    • Fall-out from the incident took ~2 days to resolve
  • A recent MC production is using very high amounts of memory
    • Large numbers of jobs ended up killed by the batch system at various sites
    • Experts are looking into reducing the memory footprint further

ATLAS

  • Very smooth production operations during the winter break; MC simulation, derivations and user analysis have been running at full speed since then. Derivations use up to 100k slots to finish the full processing of data15+data16 and their MC samples. Very high user analysis activity for Moriond and the spring conferences.
  • The ATLAS Sites Jamboree took place at CERN on January 18-20, with good feedback and discussions from sites. There is interest from sites in virtualization (Docker containers and Singularity); a dedicated meeting is being planned before the Software & Computing week in March.
  • Global shares are being implemented to better manage the required resources among the different workflows (MC, derivations, reprocessing, HLT processing, Upgrade processing, analysis). More intensive productions will run in the future. Sites will be asked to remove the fair shares set at the site level. This is currently under revision and some sites that ran only evgensimul (or part of it) in the past will be affected.
  • A tape staging test was run at 3 Tier-1's two weeks ago to understand the performance in case derivations ever need to run from tape input. Results will be presented today: 20170126_Tape_Staging_Test.pdf

CMS

  • good, steady progress on Monte Carlo production for Moriond 2017
    • tape backlog at sites worked down
    • JINR looking into tape setup to improve performance
  • PhEDEx agents were updated in time at all but two sites, ready to switch to FTS3 and new API
    • Christoph, Stephan: the old SOAP interface can be switched off now
    • FTS team: will be done in the intervention next Tue
  • dedicated high-performance VMs for GlideinWMS are under evaluation using a Global Pool scalability test
  • slots overfilled at sites due to an HTCondor bug (triggered by nodes with high network utilization/errors); the bug will be addressed in v8.6, to be released in early February
    • Stephan: a fraction of the jobs used more cores than reserved; so far this could be mitigated
  • CentOS 7 plans: no general upgrade planned; sites are free to upgrade and provide SL6 via containers, etc.; CMS software is also released for CentOS 7; physics validation is expected soon; discussions and tests are ongoing to move the HLT farm to CentOS 7
  • pilots sent to Tier-1 sites consolidated: one pilot with role "pilot" instead of two with roles "pilot" and "production"

LHCb

  • Andrew: high activity, nothing special to report

Discussion

  • David B: what is the general readiness for CentOS7 WNs?
  • Maarten: ALICE and LHCb can use such resources today
  • Christoph, Stephan: the CMS report has the details for CMS
  • Alessandro: ATLAS production can run OK, analysis to be checked;
    we will check if SL6 containers are OK
  • Alessandra after the meeting: the situation for ATLAS was presented in the ATLAS Sites Jamboree held on Jan 18-20
    • All SL6 SW workflows have been validated
    • Site admins should read the presentation for details, considerations and advice

Ongoing Task Forces and Working Groups

Accounting TF

  • New WLCG accounting portal (http://accounting-next.egi.eu/wlcg) has been validated and people are welcome to start using it
  • Migration to the new accounting reports has started. Two sets of accounting reports (current and new) have been sent for November and December. If no major problems are reported, the new reports become official starting from January.

Information System Evolution


  • Next meeting Feb 2.

IPv6 Validation and Deployment TF


  • NTR

Machine/Job Features TF

  • Further updates to DB12-at-boot in the mjf-scripts distribution. See the discussion at the HEPiX benchmarking WG meeting. These values are made available as $MACHINEFEATURES/db12 and $JOBFEATURES/db12_job.
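In the MJF scheme, each feature is published as a small plain-text file named after its key, inside the directory that $MACHINEFEATURES or $JOBFEATURES points to. A minimal sketch of how a payload could pick up the DB12 values named above (the helper function is illustrative, not part of mjf-scripts):

```python
import os

def read_feature(dir_env_var, key):
    """Read one Machine/Job Feature value, e.g. 'db12'.

    Each feature is a plain file named after the key, located in the
    directory that the environment variable points to. Returns None
    if the feature is not published on this node.
    """
    base = os.environ.get(dir_env_var)
    if base is None:
        return None
    try:
        with open(os.path.join(base, key)) as f:
            return f.read().strip()
    except OSError:
        return None

# Per-machine and per-job DB12 scores as published by mjf-scripts
machine_db12 = read_feature("MACHINEFEATURES", "db12")
job_db12 = read_feature("JOBFEATURES", "db12_job")
```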

Monitoring

  • NTR

MW Readiness WG


This is the status of Actions from the 19th MW Readiness WG meeting of 20161102.

  • 20161102-05: Christoph to investigate EL7 UI testing by CMS. Keep Andrea S. informed as maintainer of the workflow twiki.
  • 20161102-01: Andrea S. to update the CMS workflow twiki.

This is the status of jira ticket updates since the last Ops Coord of 20161201:

  • MWREADY-138 DPM 1.9.0 on C7 at GRIF_LLR for CMS - completed
  • MWREADY-104 DPM 1.9.0 SRM-less for ATLAS at LAPP Annecy - on-going. ATLAS verified the modification to the pilot to use dav for stage-in/out
  • MWREADY-142 FTS 3.5.8 for ATLAS & CMS at CERN - on-going
  • MWREADY-140 ARC-CE 5.2.1 on C7 for CMS at Brunel - on-going (ARC-CE 5.2.0 completed)
  • MWREADY-135 WN for C7/SL7 at TRIUMF for ATLAS - on-going
  • MWREADY-141 dCache 3.0.2 at PIC for CMS - on-going

Network and Transfer Metrics WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • CERN now has the first site-specific http://grid-wpad/wpad.dat service, and CMS is using it for their jobs
    • hosted on the same 4 physical 10gbit/s servers that CMS uses for squid service
    • supports both IPv4 and IPv6 connections in order to determine whether squids in Wigner or Meyrin should be used first
    • for CMS destinations (Frontier or cvmfs), it directs to the CMS squids, otherwise it defaults to the IT squids
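The dispatch rule described above can be illustrated with a toy selector (host names and the matching rule are hypothetical placeholders; the real service serves a standard wpad.dat PAC file evaluated by the client):

```python
def choose_proxies(dest_host, cms_squids, it_squids):
    """Toy version of the grid-wpad dispatch described above:
    CMS destinations (Frontier or cvmfs servers) are directed to the
    CMS squids; anything else falls back to the IT squids.
    The substring match is a simplification for illustration only.
    """
    cms_markers = ("frontier", "cvmfs")
    if any(marker in dest_host for marker in cms_markers):
        return cms_squids
    return it_squids

# Example: a Frontier destination selects the (hypothetical) CMS squids
proxies = choose_proxies("cmsfrontier.cern.ch",
                         cms_squids=["cms-squid1", "cms-squid2"],
                         it_squids=["it-squid1"])
```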

Traceability and Isolation WG

  • A tool, 'singularity', was identified as a possible solution for isolation (without traceability):
    • WG now evaluating the tool: a small test cluster is being built now
    • A security review of this tool is needed (transition plan might require SUID)
  • Next meeting: Wednesday 1 Feb 2017 (https://indico.cern.ch/event/604836/)

Theme: Downtimes proposal followup

See the presentation

  • Andrew: shifting a downtime can upset the experiment planning, it should not be done lightly
  • Stephan: only explicit policies can be programmed automatically
  • Maarten: we can have SAM apply the numbers of the new policies,
    whereas the clauses in purple (see presentation) would be "best practice";
    we can always do manual corrections as needed
  • Julia: manual operations should only be done in exceptional cases,
    e.g. when a downtime had to be extended or postponed
  • Maarten: today's feedback plus the outcome of a discussion with the SAM team
    will be incorporated into v3 to be presented in the MB

Theme: Tape usage performance analysis

See the presentation

  • Alessandro: can the FTS handle batches of 90k requests per T1, times 5 sites?
  • FTS team: v3.6 should be able to handle 500k per link
  • CMS: we want to understand the tape usage performance through a similar exercise;
    afterwards we can do a common challenge together with ATLAS
  • Alessandro: the common challenge would be at a shared T1
  • CMS: the monitoring is tricky; a file could first be read from tape
    and then multiple times from disk; some files could already be on disk;
    such cases would have to be disentangled
  • Alessandro: that is why in the ATLAS exercise we took very old files
  • Alessandro: the throughput was measured between submission of the request and
    the file being present in the ATLAS_DATA_DISK space
  • Alessandro: new files could be made bigger (not trivial), existing files cannot
  • Renaud: w.r.t. the number of files per request, we will check what limitations exist on our side
  • Alessandro: a typical campaign would cover O(100) TB and O(100 k) files;
    we can customize that per site
  • Renaud: what do you think of the observed performance?
  • Alessandro: it has not changed over the past years, because the experiments did not work on it!
    it also depends on site-specific tape handling aspects
  • Julia: shouldn't also the recording be optimized?
  • Alessandro: indeed, e.g. through new tape families
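The throughput figure Alessandro describes (measured between submission of the request and the file being present in the destination disk space) can be sketched for one bulk staging request as follows; the field names are illustrative, not an actual monitoring schema:

```python
from datetime import datetime

def staging_throughput(files):
    """Average staging throughput in MB/s for one bulk request.

    'files' is a list of dicts holding each file's size in bytes,
    the timestamp of the request submission, and the timestamp of
    the file becoming visible in the destination disk space.
    The elapsed time spans the whole request, first submission to
    last file on disk, matching the definition in the discussion.
    """
    total_bytes = sum(f["size_bytes"] for f in files)
    start = min(f["submitted"] for f in files)
    end = max(f["on_disk"] for f in files)
    elapsed = (end - start).total_seconds()
    return total_bytes / elapsed / 1e6  # MB/s

# Hypothetical two-file request: 6 GB staged within one hour
files = [
    {"size_bytes": 4_000_000_000,
     "submitted": datetime(2017, 1, 12, 10, 0),
     "on_disk": datetime(2017, 1, 12, 10, 40)},
    {"size_bytes": 2_000_000_000,
     "submitted": datetime(2017, 1, 12, 10, 0),
     "on_disk": datetime(2017, 1, 12, 11, 0)},
]
rate = staging_throughput(files)
```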

Action list

Creation date Description Responsible Status Comments
01 Sep 2016 Collect plans from sites to move to EL7 WLCG Operations Ongoing The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites, e.g. TRIUMF, are working on a container solution that could mask the EL7 environment for the experiments which cannot use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress.
Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which are reported above.
03 Nov 2016 Review VO ID Card documentation and make sure it is suitable for multicore WLCG Operations Pending Jan 26 update: needs to be done in collaboration with EGI
03 Nov 2016 Discuss internally on how to follow up long term strategy on experiments data management as raised by ATLAS WLCG Operations DONE Jan 26 update: merged with the next action
03 Nov 2016 Check status, action items and reporting channels of the Data Management Working Group WLCG Operations Pending  
26 Jan 2017 Create long-downtimes proposal v3 and present it to the MB WLCG Operations Pending  

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Comments Deadline Completion
29 Apr 2016 Unify HTCondor CE type name in experiment VO feeds all InfoSys Proposal to use HTCONDOR-CE.   Ongoing
01 Dec 2016 Open tickets to sites for moving to FTS3 client CMS - There are PhEDEx prerequisites Jan 2017 DONE

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Comments Deadline Completion
01 Dec 2016 Proposal for advance warning of long site downtimes All - Please, give feedback to the Dec GDB proposal 20th January 2017 DONE

AOB


Topic revision: r20 - 2018-02-28 - MaartenLitmaath