Hyperparameter Optimization with PanDA and iDDS



PanDA and iDDS allow users to run large-scale hyperparameter optimization (HPO) on geographically distributed GPU/CPU resources. The idea is to provide a uniform resource pool and a single monitoring view to end-users while dynamically tailoring workloads to heterogeneous resources.

Each HPO workflow is typically composed of iterations of three steps:

Sampling step
To choose hyperparameter points in a search space.
Training step
To evaluate each hyperparameter point with an objective function and a training dataset.
Optimization step
To redefine the search space based on loss values of evaluated hyperparameter points.
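As an illustration, the three steps can be sketched as a single loop. Everything below is a toy stand-in (a random sampler, a quadratic objective, and a naive space-shrinking rule), not the actual steering logic:

```python
import random

def sample(search_space, n_points):
    """Sampling step: draw hyperparameter points from the current search space."""
    return [random.uniform(*search_space) for _ in range(n_points)]

def train(x):
    """Training step: evaluate one point with an objective function (toy quadratic)."""
    return (x - 3.0) ** 2

def optimize(search_space, evaluated):
    """Optimization step: redefine the search space around the best point so far."""
    best_x, _ = min(evaluated, key=lambda p: p[1])
    low, high = search_space
    width = (high - low) / 2.0
    return (best_x - width / 2.0, best_x + width / 2.0)

search_space = (-10.0, 10.0)
evaluated = []
for _ in range(5):                       # iterations of the workflow
    for x in sample(search_space, 4):    # steering: sampling
        evaluated.append((x, train(x)))  # evaluation: training
    search_space = optimize(search_space, evaluated)  # steering: optimization

best_x, best_loss = min(evaluated, key=lambda p: p[1])
```

In the real system the inner loop is distributed: the training calls run as PanDA jobs on remote resources, while sampling and optimization run centrally.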
In the system, the sampling and optimization steps are executed on central resources, while the training step is executed on distributed resources. The former is called steering and the latter evaluation, i.e., steering = sampling + optimization and evaluation = training. Users can submit HPO tasks to PanDA using a new client tool, phpo, which is available in panda-client-1.4.24 or higher.

Once tasks are injected into the system, iDDS orchestrates JEDI, PanDA, Harvester, the pilot, and DDM to execute HPO steps on the relevant resources in a timely manner, as shown in the figure below. Users can follow what is going on in the system using PanDA monitoring. The iDDS document explains how the system works, but end-users do not need to know all the details. One important point is that a single PanDA job evaluates one or more hyperparameter points, so it is worth looking at the log files in PanDA monitoring if something goes wrong.



The main tricks are the separation of steering and evaluation and their asynchronous execution, as shown in the figure below. Users need two types of containers, one for steering and the other for evaluation. For steering, users can use predefined containers or their own. Note that when making steering containers, users need to use HPO packages that support the ask-and-tell pattern, such as skopt and nevergrad.
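The ask-and-tell pattern means the optimizer hands out points on request (ask) and absorbs results whenever they arrive (tell), instead of driving the training loop itself; this is what allows steering and evaluation to run asynchronously. A minimal sketch of the interface (a toy random-search optimizer for illustration; skopt and nevergrad expose ask/tell methods in this spirit):

```python
import random

class RandomSearchOptimizer:
    """Toy optimizer exposing the ask-and-tell interface."""
    def __init__(self, low, high):
        self.low, self.high = low, high
        self.history = []                 # (point, loss) pairs told so far

    def ask(self, n_points=1):
        """Propose hyperparameter points without evaluating them."""
        return [random.uniform(self.low, self.high) for _ in range(n_points)]

    def tell(self, point, loss):
        """Absorb one evaluation result; a real optimizer updates its model here."""
        self.history.append((point, loss))

opt = RandomSearchOptimizer(0.0, 1.0)
points = opt.ask(3)                  # steering produces points...
for x in points:
    opt.tell(x, (x - 0.5) ** 2)      # ...and later absorbs losses, possibly out of order
```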


The iDDS document explains how containers communicate with the system.

For steering, users provide execution strings to specify what is executed in their containers. Each execution string needs to contain several placeholders to dynamically receive actual parameters through command-line arguments when the containers are executed. Input and output are done through json files in the initial working directory ($PWD), so that directory needs to be mounted.

For evaluation, users also provide execution strings to specify what is executed in their containers. There are two files for input (one for the hyperparameter point to be evaluated and the other for training data) and three files for output (the first to report the loss, the second to report job metadata, and the last to preserve training metrics). The input file for the hyperparameter point and the output file to report the loss are mandatory, while the other files are optional. See the iDDS document for the details of their formats. Note that evaluation containers are executed in read-only mode, so file-writing operations have to be done in the initial working directory (/srv/workDir), which is bound to the host working directory where containers and the system communicate using the input and output files. It is better to get the path of the initial working directory dynamically, using os.getcwd(), echo $PWD, and so on, rather than hard-coding /srv/workDir in the applications, since the convention might change in the future.
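As a concrete sketch, an evaluation payload along these lines could look as follows. The filenames match the phpo defaults (input.json, output.json), but the flat {"loss": ...} output schema and the dummy input written at the top are illustrative assumptions; see the iDDS document for the authoritative formats:

```python
import json
import os

# Resolve the working directory dynamically rather than hard-coding /srv/workDir.
work_dir = os.getcwd()

# In production the system places input.json in the working directory; a dummy
# one is written here only so this sketch runs stand-alone.
with open(os.path.join(work_dir, "input.json"), "w") as f:
    json.dump({"learning_rate": 0.02}, f)

# Read the hyperparameter point to be evaluated (phpo default: input.json).
with open(os.path.join(work_dir, "input.json")) as f:
    point = json.load(f)

# Stand-in for the real training; this loss formula is purely illustrative.
loss = (point["learning_rate"] - 0.01) ** 2

# Report the loss (phpo default: output.json). The exact schema is defined by
# iDDS; a flat {"loss": ...} object is assumed here for illustration.
with open(os.path.join(work_dir, "output.json"), "w") as f:
    json.dump({"loss": loss}, f)
```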

Client tool for HPO task submission

phpo is a dedicated panda-client tool for HPO task submission. It is available once panda-client is set up through CVMFS or local installation. For example,
$ setupATLAS
$ lsetup panda

The following options are available in addition to the usual grid options. All options can be loaded from a json file if preferred.

  --nParallelEvaluation NPARALLELEVALUATION
                        The number of hyperparameter points being evaluated
                        concurrently. 1 by default
  --maxPoints MAXPOINTS
                        The max number of hyperparameter points to be
                        evaluated in the entire search. 10 by default.
  --maxEvaluationJobs MAXEVALUATIONJOBS
                        The max number of evaluation jobs in the entire
                        search. 2*maxPoints by default. The task is terminated
                        when all hyperparameter points are evaluated or the
                        number of evaluation jobs reaches MAXEVALUATIONJOBS
For steering,
  --nPointsPerIteration NPOINTSPERITERATION
                        The max number of hyperparameter points generated in
                        each iteration. 2 by default. Simply speaking, the
                        steering container is executed
                        maxPoints/nPointsPerIteration times when
                        minUnevaluatedPoints is 0. The number of new points is
                        nPointsPerIteration-minUnevaluatedPoints when
                        minUnevaluatedPoints is non-zero
  --minUnevaluatedPoints MINUNEVALUATEDPOINTS
                        The next iteration is triggered to generate new
                        hyperparameter points when the number of unevaluated
                        hyperparameter points goes below minUnevaluatedPoints.
                        0 by default
  --steeringContainer STEERINGCONTAINER
                        The container image for steering run by docker
  --steeringExec STEERINGEXEC
                        Execution string for steering. If --steeringContainer
                        is specified, the string is executed inside of the
                        container. Otherwise, the string is used as command-
                        line arguments for the docker command
  --searchSpaceFile SEARCHSPACEFILE
                        External json filename to define the search space.
                        None by default
For evaluation,
  --evaluationContainer EVALUATIONCONTAINER
                        The container image for evaluation
  --evaluationExec EVALUATIONEXEC
                        Execution string to run evaluation in singularity.
  --evaluationInput EVALUATIONINPUT
                        Input filename for evaluation where a json-formatted
                        hyperparameter point is placed. input.json by default
  --evaluationTrainingData EVALUATIONTRAININGDATA
                        Input filename for evaluation where a json-formatted
                        list of training data filenames is placed.
                        input_ds.json by default. Can be omitted if the
                        payload directly fetches the training data using wget
                        or something
  --evaluationOutput EVALUATIONOUTPUT
                        Output filename of evaluation. output.json by default
  --evaluationMeta EVALUATIONMETA
                        The name of metadata file produced by evaluation
  --evaluationMetrics EVALUATIONMETRICS
                        The name of metrics file produced by evaluation
  --trainingDS TRAININGDS
                        Name of training dataset

  --checkPointToSave CHECKPOINTTOSAVE
                        A comma-separated list of files and/or directories to
                        be periodically saved to a tarball for checkpointing.
                        Note that those files and directories must be placed
                        in the working directory. None by default
  --checkPointToLoad CHECKPOINTTOLOAD
                        The name of the saved tarball for checkpointing. The
                        tarball is given to the evaluation container when the
                        training is resumed, if this option is specified.
                        Otherwise, the tarball is automatically extracted in
                        the working directories
  --checkPointInterval CHECKPOINTINTERVAL
                        Frequency, in minutes, at which files are checked for
                        checkpointing. 5 by default
For the full list of options,
$ phpo --helpGroup ALL

How to submit HPO tasks

The most important options of phpo are --steeringContainer, --steeringExec, --evaluationContainer, and --evaluationExec, i.e., the container names for steering and evaluation and what is executed in those containers. Here is an example showing what those options look like.
$ cat config_dev.json
{"steeringExec": "run --rm -v \"$(pwd)\":/HPOiDDS gitlab-registry.cern.ch/zhangruihpc/steeringcontainer:0.0.1 /bin/bash -c \"hpogrid generate --n_point=%NUM_POINTS --max_point=%MAX_POINTS --infile=/HPOiDDS/%IN  --outfile=/HPOiDDS/%OUT -l nevergrad\"", "evaluationExec": "bash ./exec_in_container.sh", "evaluationContainer": "docker://gitlab-registry.cern.ch/zhangruihpc/evaluationcontainer:mlflow", "evaluationMetrics": "metrics.tgz”}
where users cannot specify steeringContainer for now, until iDDS is updated to allow the container name to be specified in a separate attribute. This is why the steering container is included in steeringExec in this example. Note that the execution string for the evaluation container is written in a file, exec_in_container.sh. All files matching *.json, *.sh, *.py, or *.yaml in the local current directory are automatically sent to the remote working directory, so users don't have to specify a complicated execution string in evaluationExec. E.g.
$ cat exec_in_container.sh
export CALO_DNN_DIR=/ATLASMLHPO/payload/CaloImageDNN
curl -sSL https://cernbox.cern.ch/index.php/s/HfHYEsmJNWiefu3/download | tar -xzvf -;
python $CALO_DNN_DIR/scripts/make_input.py input.json input_new.json
cp -r $CALO_DNN_DIR/exp_scalars $CURRENT_DIR/
python /ATLASMLHPO/payload/CaloImageDNN/run_model.py -i input_new.json --exp_dir $CURRENT_DIR/exp_scalars/ --data_path $CURRENT_DIR/dataset/event100.h5 --rm_bad_reco True --zee_only True -g 0
rm -fr $CURRENT_DIR/exp_scalars/
tar cvfz $CURRENT_DIR/metrics.tgz mlruns/*
rm -fr mlruns dataset
The initial search space can be described in a json file.
$ cat search_space_example2.json
    "auto_lr": {
      "method": "categorical",
      "dimension": {
        "categories": [
        "grid_search": 0
    "batch_size": {
      "method": "uniformint",
      "dimension": {
        "low": 10,
        "high": 30
    "epoch": {
      "method": "uniformint",
      "dimension": {
        "low": 5,
        "high": 10
    "cnn_block_depths_1": {
      "method": "categorical",
      "dimension": {
          "categories": [1, 1, 2],
          "grid_search": 0
    "cnn_block_depths_2": {
      "method": "uniformint",
      "dimension": {
        "low": 1,
        "high": 3
$ phpo --loadJson config_dev.json --site XYZ --outDS user.blah.`uuidgen`
Once tasks are submitted, users can see what's going on in the system by using PanDA monitoring.
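Incidentally, the search-space json shown above maps each hyperparameter name to a sampling method and its dimension. Here is a hedged sketch of how such a description might be sampled; the semantics of "categorical" and "uniformint" are inferred from their names, not taken from the system's actual sampler:

```python
import random

def draw_point(search_space):
    """Draw one hyperparameter point from a search-space dict like the json above.
    The interpretation of "categorical" and "uniformint" is an illustrative
    guess, not the system's actual sampling code."""
    point = {}
    for name, spec in search_space.items():
        dim = spec["dimension"]
        if spec["method"] == "categorical":
            point[name] = random.choice(dim["categories"])
        elif spec["method"] == "uniformint":
            point[name] = random.randint(dim["low"], dim["high"])
    return point

# Two entries copied from the example search space above.
space = {
    "batch_size": {"method": "uniformint", "dimension": {"low": 10, "high": 30}},
    "cnn_block_depths_1": {"method": "categorical",
                           "dimension": {"categories": [1, 1, 2], "grid_search": 0}},
}
p = draw_point(space)
```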


Protection against bad hyperparameter points

Each hyperparameter point is evaluated at most 3 times. If all attempts time out, the system considers the hyperparameter point hopeless and registers a very large loss, so that the task can continue.

Visualization of the search results

It is possible to upload a tarball of metrics files to a grid storage when evaluating each hyperparameter point. For example, the example above uses MLflow for logging parameters and metrics, collects all files under ./mlruns into tarballs, and uploads them to grid storages. The filename of the tarball needs to be specified using the --evaluationMetrics option. Tarballs are registered in the output dataset so that they can be downloaded using the rucio client. It is easy to combine MLflow metrics files. The procedure is as follows:

$ rucio download --no-subdir <output dataset>
$ tar xvfz *
$ tar xvfz metrics*
$ mlflow ui
Then accessing the MLflow UI (http://localhost:5000 by default) with your browser will show something like the picture below.


There is an ongoing development activity to dynamically spin up MLflow services on PanDA monitoring or similar, which would perform the above procedure on behalf of users and centrally provide the MLflow UI.

Relationship between nPointsPerIteration and minUnevaluatedPoints


The relationship between nPointsPerIteration and minUnevaluatedPoints is illustrated in the figure above. Steering is executed to generate new hyperparameter points every time the number of unevaluated points goes below minUnevaluatedPoints. The number of new points is nPointsPerIteration-minUnevaluatedPoints. The main idea of setting a non-zero value for minUnevaluatedPoints is to keep the task running even if some hyperparameter points take a very long time to be evaluated.
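The bookkeeping described above can be written down explicitly (illustrative arithmetic, not iDDS code):

```python
def new_points_per_iteration(n_points_per_iteration, min_unevaluated_points):
    # When steering is re-triggered, it tops the pool of unevaluated points back
    # up to nPointsPerIteration, so this many new points are generated.
    return n_points_per_iteration - min_unevaluated_points

# phpo defaults: nPointsPerIteration=2, minUnevaluatedPoints=0.
# With maxPoints=10 the steering container then runs 10/2 = 5 times.
iterations_with_defaults = 10 // new_points_per_iteration(2, 0)

# With minUnevaluatedPoints=1, each re-trigger adds only one new point,
# keeping a point in reserve so slow evaluations do not stall the task.
points_added = new_points_per_iteration(2, 1)
```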

What "Logged status: skipped since no HP point to evaluate or enough concurrent HPO jobs" means

PanDA jobs are generated every 10 min when the number of active PanDA jobs is less than nParallelEvaluation and there is at least one unevaluated hyperparameter point. The logging message means that there are already enough PanDA jobs running/queued in the system, or that the system has evaluated or is evaluating all hyperparameter points generated so far. Note that there is a delay before iDDS triggers the next iteration after enough hyperparameter points have been evaluated in the previous iteration.
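The condition described above can be summarized in a few lines (an illustrative paraphrase, not actual JEDI code):

```python
def job_generation_allowed(n_active_jobs, n_unevaluated_points, n_parallel_evaluation):
    """A new evaluation job is generated in the ~10 min cycle only if both
    conditions hold; otherwise the 'skipped since no HP point to evaluate or
    enough concurrent HPO jobs' message is logged."""
    return n_active_jobs < n_parallel_evaluation and n_unevaluated_points > 0
```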


Checkpointing

If evaluation containers support checkpointing, it is possible to terminate evaluation in the middle and resume it afterward, which is typically useful for running long training on short-lived and/or preemptible resources. Evaluation containers need to
  • periodically produce checkpoint file(s) in the initial working directory, or in sub-directories under it, using relevant functions of ML packages (e.g., the Keras checkpointing example), and
  • resume the training if checkpoint file(s) are available in the initial working directory, and otherwise start a fresh training.
Users can specify the names of the checkpoint files and/or sub-directories using the --checkPointToSave option. The system periodically checks the files and/or sub-directories and saves them in a persistent location if any of them were updated after the previous check cycle. The check interval is defined by --checkPointInterval, which is 5 minutes by default. Note that the total size of checkpoint files must be less than 100 MB. When PanDA jobs are terminated while evaluating hyperparameter points, they are automatically retried, and the latest saved checkpoint files are provided to the retried PanDA jobs. If the --checkPointToLoad option is specified, the checkpoint files/directories are archived into a tarball which is placed in the initial working directory; otherwise, they are copied to the initial working directory with their original file/directory names.
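A minimal resume-or-start pattern along these lines might look as follows. This is a pure-Python stand-in for a real ML checkpoint; the filename checkpoint.json is a hypothetical choice that would be passed via --checkPointToSave:

```python
import json
import os

# Keep the checkpoint relative to the working directory, as required.
ckpt_path = os.path.join(os.getcwd(), "checkpoint.json")

# Resume from a checkpoint if one was restored into the working directory,
# otherwise start a fresh training (the behaviour required of containers).
if os.path.exists(ckpt_path):
    with open(ckpt_path) as f:
        state = json.load(f)
else:
    state = {"epoch": 0}

total_epochs = 5
for epoch in range(state["epoch"], total_epochs):
    # ... one epoch of training would go here ...
    state["epoch"] = epoch + 1
    with open(ckpt_path, "w") as f:      # checkpoint after every epoch
        json.dump(state, f)
```

If the job is preempted mid-training, the retried job finds checkpoint.json in its working directory and the loop continues from the recorded epoch.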

Topic revision: r13 - 2020-06-26 - TadashiMaeno