ADCResourcesBenchmarking

Introduction

Benchmarking is an important and pervasive activity: it underpins computing procurements, accounting, pledges, monitoring, brokering and production planning. A benchmark has to be representative of the performance of the applications on the resources, easy to use, and understood by the vendors. Up until now we have used HEP-SPEC06 (HS06) for all of the above. HS06 is a subset of the industry-standard SPEC CPU2006 suite. It is usually measured by sites when they buy new hardware, according to a procedure agreed a long time ago by the HEPiX benchmarking WG and the experiments.

Sites have heterogeneous resources and what they publish is dictated by the attributes in the BDII, some of which are overloaded, so at best the number they publish is a weighted average; using it for both accounting and brokering creates problems even when it is calculated with care. To avoid compressing this information into a single number, MJF (Machine/Job Features) was developed. MJF is a set of files on each grid node describing the status of the node. For the benchmark it now contains the measured HS06 and DB12 short-benchmark values for that node, which the pilot can read and send back to the WMS to decide what payload to run. LHCb is building its payload brokering around this. However, it may not be applicable to all resources.

The experiments that have started to use the benchmark heavily for payload brokering claim HS06 is no longer representative of their applications. Partly this is due to how it is published, partly to the fact that HS06 is ageing and no longer representative. However, there is no industry-standard replacement in view, and some experiments have started to look at replacing it, for brokering, with what we call short benchmarks. Short benchmarks can also be used on resources where it is not possible to measure HS06, for example some HPC resources or commercial clouds, and they give the experiments the ability to run them along with the job. Until now it was understood that experiments could use whatever they liked for their brokering while leaving HS06 for pledges and procurements; lately there is talk of replacing HS06 for everything, either with one of these short benchmarks or with a combination of favourite experiment applications. Not everyone agrees, and sites in particular are wary of changing HS06 for pledges and procurements, but the discussion on finding a suitable replacement is ongoing. The new benchmarks, and where and how to use them, will be the topic of the WLCG questions addressed below.

Status

Running the benchmark

  • Currently running the python DB12 and Whetstone benchmarks in the pilot on resources that allow it (a minimal illustrative sketch of such a pilot-side fast benchmark follows this list).
    • Standard grid: Yes / Done
    • Nordugrid: Yes / Done
    • BOINC: Yes / Done
    • Non restrictive HPC: Yes / Done
    • Cloud: Yes / Done
    • Restrictive HPC: No
      • Payload doesn't run on the same node as the pilot. Benchmark needs to be integrated in the payload. Will need dedicated jobs.
      • Don't have outbound connectivity. Cannot write values in ES directly.
    • Analysis resources: No
      • Not considered useful.
  • Benchmark is run from the ATLAS CVMFS area
    • A new DB12 written in C, or new container-run benchmarks, should be added when available. As explained below in the WLCG questions, the latter is important because the suggestion is to use containers in case we move from HS06 to a more custom benchmark built on experiment applications.
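
As an illustration of what the pilot-side fast benchmark amounts to, here is a minimal DB12-style sketch in Python. The real DB12 and Whetstone implementations used by the pilot come from the ATLAS CVMFS area; the function name, iteration count and normalisation below are assumptions for the sketch only.

  # Illustrative sketch only: a DB12-style single-core fast benchmark.
  # The real DB12/Whetstone code used by the pilot lives in the ATLAS CVMFS
  # area; the normalisation constant here is an arbitrary placeholder.
  import random
  import time

  def fast_cpu_benchmark(iterations=1000000):
      """Time a fixed CPU-bound pure-Python loop and return a score
      proportional to iterations per second (higher means a faster core)."""
      random.seed(1)                               # fixed seed -> reproducible work
      start = time.time()
      total = 0.0
      for _ in range(iterations):
          total += random.normalvariate(10, 1)     # cheap FP + RNG work per iteration
      elapsed = time.time() - start
      return iterations / elapsed / 1.0e5          # placeholder normalisation

  if __name__ == "__main__":
      # one quick measurement; a pilot could average a few runs to reduce noise
      print("fast benchmark score: %.2f" % fast_cpu_benchmark())

The whole measurement takes only seconds, which is what makes it feasible to run inside the pilot rather than only in dedicated procurement-style campaigns.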

JEDI Brokering & Harvester

  • Currently using HS06 to evaluate the average time a task needs on each resource.
    • A better analysis of the failure codes is needed. Work started by Ryu.
  • For Harvester the plan is to have a semi-static WN → benchmark map; a sketch of such a map follows this list. Which benchmark to use, and how to measure it, is still under discussion.
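
As a rough illustration of the semi-static map idea (not the agreed Harvester design), the sketch below keys benchmark scores by worker node and uses them to scale a reference s/event; all names, fields and numbers are placeholders.

  # Illustrative sketch: a semi-static WN -> benchmark map for brokering.
  # Keys, fields, default values and refresh policy are assumptions.
  import json

  WN_BENCHMARKS = {
      # worker node (or CPU model) -> last measured fast-benchmark score
      "wn001.example.site": {"db12": 11.2, "measured": "2017-06-01"},
      "wn002.example.site": {"db12": 9.8,  "measured": "2017-06-01"},
  }

  def score_for(wn, default=10.0):
      """Return the cached benchmark score for a worker node, falling back
      to a site-average default when the node has never been measured."""
      entry = WN_BENCHMARKS.get(wn)
      return entry["db12"] if entry else default

  def estimate_walltime(events, sec_per_event_ref, wn, ref_score=10.0):
      """Scale a reference s/event by the relative speed of the target node."""
      return events * sec_per_event_ref * ref_score / score_for(wn)

  if __name__ == "__main__":
      print(json.dumps(WN_BENCHMARKS, indent=2))
      print("estimated walltime: %.0f s" % estimate_walltime(1000, 300.0, "wn001.example.site"))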

Machine Learning & Monitoring

  • Currently using the UC ES instance, collecting data directly from the pilots. The benchmark is run every 100 jobs; a minimal sketch of this sampling and reporting follows below.
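
A minimal sketch of the sampling and reporting just described, assuming the pilot keeps a running job counter and posts one JSON document per measurement; the Elasticsearch URL and index name are placeholders, not the actual UC ES configuration.

  # Illustrative sketch: sample the fast benchmark every Nth job and ship the
  # result to an Elasticsearch instance.  URL and index name are placeholders.
  import json
  import socket
  import urllib.request

  SAMPLE_EVERY = 100    # the pilot currently samples roughly every 100 jobs
  ES_URL = "https://es.example.org:9200/pilot-benchmarks/_doc"   # placeholder

  def maybe_report(job_counter, run_benchmark):
      """Run the fast benchmark on every SAMPLE_EVERY-th job and POST the score."""
      if job_counter % SAMPLE_EVERY != 0:
          return None
      doc = {
          "host": socket.gethostname(),
          "score": run_benchmark(),            # e.g. the DB12-style function sketched above
          "sample_every": SAMPLE_EVERY,
      }
      req = urllib.request.Request(
          ES_URL,
          data=json.dumps(doc).encode(),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req, timeout=10) as resp:
          return resp.status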

Accounting

  • Currently done with HS06 (32-bit). There were discrepancies between the APEL and ATLAS accounting, but they have been ironed out. See the Accounting FAQ for more explanation and information; further explanations are in the Jan 2017 Jamboree presentation. A sketch of the HS06-weighted accounting calculation follows this list.
  • As discussed in the WLCG accounting TF, ATLAS would find it useful to have the values of different benchmarks stored in the accounting portal, to allow comparisons in the future.
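
For orientation only, this is the usual shape of an HS06-weighted accounting calculation: wall-clock (or CPU) time multiplied by the number of cores and by the per-core HS06 score of the node. The function and argument names are illustrative, not the APEL/ATLAS record schema.

  # Illustrative sketch of HS06-normalised accounting for a single job.
  def hs06_hours(wall_seconds, cores, hs06_per_core):
      """Return the accounted work in HS06-hours."""
      return (wall_seconds / 3600.0) * cores * hs06_per_core

  # e.g. an 8-core job running 12 hours on 10 HS06/core hardware -> 960 HS06-hours
  print("%.1f HS06-hours" % hs06_hours(12 * 3600, 8, 10.0))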

Current ATLAS position

  1. ATLAS wants a stable benchmark for pledges, accounting and resource planning in the same way HS06 has been so far.
  2. ATLAS will run a fast benchmark (< XXX seconds/job) on the worker node at job runtime, but will not run such a benchmark with every job. Already now it is run every 100 jobs, and although this makes it possible to run it on most resources, a first analysis of 2 months of data shows that the predictive value of a benchmark run like this is only marginally better than the manipulated HS06 values currently in use, and that the CPU name has the best correlation with the s/event for almost all the applications (see plot, and the sketch after this list).
  3. Fast benchmark cases:
    • Monitoring: the initial analysis of the data in ES shows that this can be used to find misconfigured sites whose HS06 values are completely wrong, since the short benchmarks and HS06 are otherwise more or less in line.
    • Brokering: the evolution of the WMS should have a WN → benchmark map. How to build the mapping is still under discussion, but it could be done with short campaigns of pilot benchmarking every few months. ATLAS will not do any runtime payload brokering and is not using MJF.
    • Benchmarking opportunistic resources
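
A minimal sketch of the kind of per-application correlation check behind point 2, assuming the ES-derived sample has been loaded into a pandas DataFrame with columns application, cpu_name, sec_per_event, db12 and hs06; the column names and the CPU-name proxy are assumptions of the sketch.

  # Illustrative sketch: compare how well each predictor tracks s/event,
  # separately per application.  Column names are assumptions.
  import pandas as pd

  def predictor_correlations(df):
      """For each application, correlate s/event with the candidate predictors.
      The categorical CPU name is turned into a numerical proxy by taking the
      mean s/event per CPU model."""
      results = {}
      for app, grp in df.groupby("application"):
          cpu_proxy = grp.groupby("cpu_name")["sec_per_event"].transform("mean")
          results[app] = {
              "db12": grp["sec_per_event"].corr(grp["db12"]),
              "hs06": grp["sec_per_event"].corr(grp["hs06"]),
              "cpu_name_proxy": grp["sec_per_event"].corr(cpu_proxy),
          }
      return pd.DataFrame(results).T

  # usage: predictor_correlations(pd.read_csv("benchmark_samples.csv"))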

WLCG Questionnaire

Fast benchmarks

  1. Does the experiment need to access benchmarking information in the job slot? For which purpose?
    • Running a fast benchmark in the pilot was considered a way to verify the HS06 values published by sites. The values are also closer to the running conditions of the job, which have long been blamed for the discrepancies between running times and HS06 values; an initial study on a limited sample did reveal misconfigured sites a couple of months ago. However, an initial study over 2 months of data shows that in these conditions the best predictor is the name of the CPU rather than any benchmark. While the short benchmarks give marginally better results than HS06, for most applications this is not enough to warrant running them in each pilot.
[Figure: benchmark-r2.png — applications vs. benchmark correlation, April-May 2017 data.]
Global numbers on limited samples for now; more numbers can be calculated using the Jupyter notebooks.
  2. Expectation: have a pessimistic benchmark score, based on a fully loaded server (what can be obtained with MJF) or on running fast benchmarks in pilot jobs.
    • Currently not using MJF. However, ATLAS wants to build a WN database with the benchmark values for brokering. Which benchmark, and how it is measured, is still under discussion.
  3. What is the state of the art for the adoption of fast benchmarks in the pilot framework?
    • Currently the pilot runs DB12 and Whetstone, and the numbers are then written to an ES instance at UC.
  4. What are the preferred fast benchmarks from the experiment point of view? Is it still python DB12? Are there other benchmarks being evaluated (DB12 in C++)?
    • Currently still python DB12 and Whetstone. DB12 in C++ may be used in the future.

HS06

  1. Is the issue of the HS06 score vs. time for the Simulation workload confirmed, within the accuracy you need?
    • An initial study on the error rates also has some information about brokering done with the current averaged HS06 values; see this presentation. Needs more digging.
  2. Is the correlation still good for reconstruction jobs?
    • Same as above
  3. When and how was it studied in the recent years? Isolated machines or job slots?
    • A comprehensive study hasn't been done.
  4. Would it be better if HS06 were compiled at 64 bit?
    • It would be closer to the application setup. However, such a change would have to last for a reasonable period; we shouldn't change just for the sake of change.

Preparation for the new long-running benchmark (Successor of HS06)

  1. We need to prepare a suite of experiment workloads to compare their execution time with respect to the future proposed benchmarks.
    • What are the suggested workloads from the experiments?
      • Full simulation is ~37% of the ATLAS CPU power (HS06) used, followed by event generation at 22% and then all the others. For some other experiments, simulation represents 80% of the CPU used. One of the arguments for the short benchmark is that it correlates better with simulation, which for those experiments consumes most of the CPU; for ATLAS it does not, so it is important to measure the correlation of the short benchmarks with all the applications.
        [Figure: resourceutilization_individual.png — used power (HS06), June 2016 - June 2017.]
      • Characteristics of simulation workloads
      • Characteristics of reconstruction workloads
  2. Actions that the experiments can take to make such workloads available in containers (I will present an example next Friday).
    • Domenico is already using KV, but it is an old version and should be updated to a more recent one. The container method is explained in this presentation and is based on work done by ATLAS to run analysis.
  3. Collection of results: how shall we collect results? Is there a need for a common DB of hardware models? N.B. I do not refer here to the accounting use case, but just to the approaches to run the benchmarks and the WLCG workload suite in a reproducible way and to collect and share the results.
    • From the WLCG point of view it would be ideal to have all the results together. However, when we tried to write into their infrastructure from the pilot we couldn't, because it is not certificate-aware. If we want to include ATLAS results in whatever WLCG common database is chosen, it will need some asynchronous copying from whatever ATLAS instance we are going to use (a sketch of such a copy follows this list); the alternative is to change the communication method and use what we use at UC.
  4. Looking at the future:
    • What is the status of the adoption of multi-threading? This will impact the selection of benchmarks.
      • ATLAS is currently running multi-process jobs in production, but not multi-threaded ones. However, work is ongoing to use multi-threaded Geant4 (see Steve Farrell's talk at the Valencia S&C week). It is not yet validated, but the plan is to move to MT simulation by 2018. For reconstruction we have a lot more work to do; in addition, this is targeted at Run 3, so it will never run in Release 21 and is thus unlikely to swing into action in a big way before 2020. That would be the right time to add a multi-threaded benchmark for sites; we could even offer ATLAS G4 MT if that's interesting. (Graeme 20/6/2017)
    • What is the set of new architectures where the WLCG workloads will run (that then need to be benchmarked?). What is the status of adoption of GPUs?
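
As an illustration of the asynchronous copying mentioned under point 3 above, the sketch below reads recent benchmark documents from an ES instance and re-posts them to a common collection point; both URLs, the index name and the document schema are placeholders.

  # Illustrative sketch: periodically copy benchmark documents from the ATLAS
  # ES instance to a common WLCG collection point.  Endpoints are placeholders.
  import json
  import urllib.request

  ATLAS_ES_SEARCH = "https://es.example.org:9200/pilot-benchmarks/_search?size=500"
  WLCG_COLLECTOR = "https://wlcg-collector.example.org/api/benchmarks"   # placeholder

  def copy_recent_results():
      """Fetch up to 500 documents from ES and POST each one to the collector."""
      with urllib.request.urlopen(ATLAS_ES_SEARCH, timeout=30) as resp:
          hits = json.loads(resp.read().decode())["hits"]["hits"]
      for hit in hits:
          req = urllib.request.Request(
              WLCG_COLLECTOR,
              data=json.dumps(hit["_source"]).encode(),
              headers={"Content-Type": "application/json"},
          )
          urllib.request.urlopen(req, timeout=30).close()
      return len(hits)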


Major updates:
-- AlessandraForti - 2017-06-12

Responsible: AlessandraForti
Last reviewed by: Never reviewed
