HammerCloud v4

1. Introduction.

HammerCloud (HC) is a Distributed Analysis testing system. It can test your site(s) and report the results obtained from that test. It is used to perform basic site validation, help commission new sites, evaluate software changes, and compare site performance.

HC comes in three flavours:

ATLAS
responsibles: Daniel Colin Van der Ster, Johannes Elmsheuser, Ramón Medrano.
CMS
responsible: Andrea Sciaba.
LHCb
responsible: Mario Ubeda Garcia.

If you have any comment or question about the project, contact us through Savannah.

SVN repository: accessible inside CERN.

1.1. Publications.

In case you want more information, you can take a look at the latest publications and related articles:

  • Distributed analysis functional testing using GangaRobot in the ATLAS experiment.
  • HammerCloud: A Stress Testing System for Distributed Analysis.
  • ...

2. Workflow.

First of all, we have to understand the basic use cases, depicted in the following picture.

HammerCloud's basic use cases

There are two types of tests, functional and stress:
Functional (automated) testing: tests configured by experts/admins, which frequently submit a few "ping" jobs. They are auto-scheduled, which means configure once, run forever. Used for basic site validation.
Stress (on-demand) testing: tests configured by any user using a template system, which submit a large number of jobs concurrently during a time slot. They are on-demand, which means configure once, run once. Used to help commission new sites, evaluate changes to the site infrastructure, evaluate software changes, compare site performance and track site evolution.

Apart from the scheduling difference and the number of jobs submitted, there is no difference between a functional and a stress test. In fact, both share the same components.

  • Start time: defines the test's start time.
  • End time: defines the test's end time.
  • User code*: real analysis job, usually taken from the physicists community.
  • Option file*: some optional configuration parameters needed for submission.
  • Dataset patterns*: data to be processed.
  • Test options*: submission node set-up options.
  • Ganga bin: type/version of Ganga.
  • Hosts: machines eligible to host the test.
  • Sites: target site(s) for the test.
  • Users: persons authorized to modify the test.

* Some test parameters have a different meaning/behaviour depending on the application (ATLAS, CMS, LHCb). Check the specification carefully.

To recap: there are two different ways to schedule a test, automatic or on-demand; the remaining mechanisms in the HC engine are common. The following image (left) shows the steps required to run a HC test once it has been scheduled. The second image (right) shows all test states and their transitions.

HammerCloud's basic workflow

A few hints:

  • Test create: represents the test's scheduling, either on-demand or automatic. First the test is created and stored with
    state DRAFT. Tests in this state must be sent for approval if they are on-demand (this is done automatically for functional tests).
    If the test fulfills a certain set of conditions, it is approved automatically; otherwise it is redirected to the administrators. In the
    meantime, until a decision is taken about the test, its state is moved to UNAPPROVED. The administrators can either modify/reject
    it or approve it. In any case, an approved test gets a new state, TOBESCHEDULED. After approval, the other application admins
    and test users are notified.
  • Test generate: there are several agents waiting for approved tests (TOBESCHEDULED). The test is still not assigned to any
    of the eligible hosts. Among the agents on the different machines, the one living on the machine with the lowest load is picked.
    Then the host is assigned to the test and all the scaffolding necessary to run the test is generated (mainly, writing the Ganga jobs).
    The test's status is updated to SCHEDULED and an agent will wake up at Start time to run the test.
  • Test submit: at Start time, an agent named run-test-$TESTID moves the test's status to SUBMITTING and starts a Ganga session.
    Submission is multithreaded in Ganga, but it can take a while. After submission, if there were no errors, the test state is set to RUNNING,
    otherwise to ERROR.
  • Test monitoring: if there were no errors, the execution becomes multi-threaded: one monitoring thread, one resubmission thread.
    The monitoring thread updates job attributes (status, different metrics, times, etc.) in the DB, obtained from the Ganga plugin used.
  • Test resubmission: this second thread has one main task: control the jobs' statuses and handle job resubmission according to the
    configuration set by the user.
  • Test statistics: at End time, both threads end, and the test state is moved to COMPLETED. Plotting and summarising results are
    the two last tasks performed on every test execution. Histograms and pie charts are used to understand global efficiencies
    and timings. Means and standard deviations are used in comparisons between tests, days or sites. All of them are available
    through the web site once calculated. As finalisation, users are (will be -- TODO) notified with the most remarkable metrics
    obtained in the test.
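The state transitions described in the steps above can be sketched as a small lookup table. This is an illustration of the workflow, not HammerCloud code; the state names come from the text, the helper itself is hypothetical:

```python
# Sketch of the HammerCloud test state machine described above.
# State names are taken from the workflow text; the helper is illustrative.
TRANSITIONS = {
    "DRAFT": ["UNAPPROVED", "TOBESCHEDULED"],  # sent for approval, or auto-approved
    "UNAPPROVED": ["TOBESCHEDULED"],           # approved by an administrator
    "TOBESCHEDULED": ["SCHEDULED"],            # picked up by the least-loaded host
    "SCHEDULED": ["SUBMITTING"],               # agent wakes up at Start time
    "SUBMITTING": ["RUNNING", "ERROR"],        # Ganga submission succeeds or fails
    "RUNNING": ["COMPLETED"],                  # both threads end at End time
}

def can_move(state, new_state):
    """Return True if the transition is allowed by the workflow above."""
    return new_state in TRANSITIONS.get(state, [])
```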

HammerCloud's states flow

This is an initial contact with the workflow, necessary to understand the next chapters.

3. Architecture.

HammerCloud was originally designed as a monolithic application. Its design was good enough to deal with one application (ATLAS). However, the addition of CMS to the list of applications showed that the amount of time needed to maintain both applications was high. The addition of a third application -- LHCb -- showed that it was in fact worth restructuring HammerCloud to deal with an arbitrary number of different applications.

The basic motivation of v4 sits on the following:

  • All applications were VERY similar, but orthogonal enough to defeat any cheap workaround to make them work together.
  • A common platform was wanted, where changes to the core would apply to every application. Hence, the common parts were extracted.
  • For the non-common, but very similar, components a "plug-in" system was created.
  • There are different machines running different applications, but each should be ready to run any other application (although ideally this will never happen).
  • Avoid VO-installed software libraries: 3 different VOs, all with different packages. We provide our own packages, no more headaches.
  • Tidy up a little, rename & refactor, add some functionality...

NodesConfig.png (node configuration diagram)

Machine roles: each machine may act as master server node, app server node and/or app submit node.

Production machines:

ATLAS
voatlas49 -- app submit node, app server node, master server node
voatlas73 -- app submit node
CMS
vocms57 -- app submit node, app server node
LHCb
volhcb29 -- app submit node, app server node

Others:

ATLAS
voatlas ?

The current production schema has four main machines: two used by ATLAS, one CMS machine and one LHCb machine. You may be wondering what the roles mean, so there we go!

3.1. Roles.

There are three roles, master server node, app server node and app submit node.

Master server node: the machine with this role hosts a central database storing, basically, users and permissions. Here are stored the users and their permissions to access any of the applications, in our case: ATLAS, CMS and LHCb. It also hosts the Apache server, which serves the portal. Yes, the portal is common, but the data served on it comes from different sources. As you may imagine, there is ONLY ONE machine with this role.

App submit node: the main purpose of a machine with this role is to run tests. In fact, this machine must be registered in the app-specific database, otherwise it will not exist for the system.

App server node: the machine with this role hosts the app-specific database, where all information regarding tests, sites, results, etc. is stored. This means that any request from the portal to the CMS app will query the CMS-specific database, which is hosted on vocms57. But also, any submit node will request information from that database in order to run tests, and of course it will push the test results there. The app server node also hosts the configuration file needed to access any other node or database, which makes this role indispensable on every single machine. If, as in the case of ATLAS, there is more than one machine, we will have more than one app server node. This does not mean that we will have two ATLAS-specific databases, one per machine; we set up one database and point the configuration files to it.

Summarising: one master database, one Apache server, one app-specific database per application, one app server node per machine and as many different submit nodes as we want per machine. But to be fair, we only use XYZ_nodes on voXYZ machines.
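The production deployment described in this chapter can be summarised as a small mapping. This is purely illustrative; the machine names come from the table above:

```python
# Roles per production machine, as described in the architecture section.
ROLES = {
    "voatlas49": {"master server", "app server", "app submit"},
    "voatlas73": {"app submit"},
    "vocms57":   {"app server", "app submit"},
    "volhcb29":  {"app server", "app submit"},
}

# Exactly one machine may hold the master server role.
masters = [m for m, roles in ROLES.items() if "master server" in roles]
```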

4. The first test

4.1. Getting an account

If you want to send tests you need a HammerCloud account. Please get in contact through the Savannah project page, stating the VO where you want to send tests.

4.2. Creating your first test

Access the HammerCloud page corresponding to your VO ( http://hammercloud.cern.ch/hc/app/###_VO_###) and log in to the Administration page (obvious, isn't it?). To create a test, select the desired start and end times, and the test template needed. You can choose any stress template.

If you need more details about the test templates, please keep reading first and if your question has not been answered there, please ask a HammerCloud operator.

Save it, and your test will be created.

4.3. Modifying your first test

All parameters from the template have been copied into your first test. Note that you might not have enough permissions to click all links (you can modify neither a template nor files).

You can add the sites or clouds you desire to your test, and configure the submission algorithm. For your first test, leave as follows for every site or cloud you add:

  • Resubmit enabled: checked.
  • Resubmit force: unchecked.
  • Num datasets per bulk: 1.
  • Min queue depth: 5.
  • Max running jobs: 5.

You can add hosts, dspatterns and users as well, but that will be explained in further chapters. Anyway, when you submit a test your user is added automatically.

Two more steps, and your test will start running. Save your modifications and check your test in the tests list. Find 'Send selected tests for approval' in the actions selectable list and click 'Go'. That's all! A HammerCloud operator will check it and approve/modify or reject it.

Go to the main page and select your test.

5. Code structure.

The HammerCloud source code directory tree looks like this:

This directory is typically (always) installed under:

  /data/hc

5.1. apps

5.2. bin

Not used.

5.3. doc

Not used. Intended to host pyDoc files, but just intended.

5.4. external

It should look like this:

  • atlas
  • cms
  • lhcb
  • bin
  • lib
  • django
  • ganga

where we have app-specific folders. We can put there whatever external software that app needs, but only there. There are also two folders, bin and lib, used to store locally compiled code, mainly numpy and GChartWrapper. As we need to compile the code with the libraries that run HammerCloud, we have to store the compiled code somewhere. This is also the place to check out the Ganga code, and to untar the latest Django version.

5.5. logs

Not used.

5.6. python

5.7. scripts

5.8. test

Not used. This is an important TODO.

5.9. web

In the web subdirectory we can find three main folders: media (used to store css, js and images), src (which contains the server logic) and templates (which contains the html files that are going to be rendered by Django). The most interesting module is src, so let's explain it first.

5.9.1. src

It has a django project called 'hc', generated with the command:

    django-admin.py startproject hc
  
In the hc folder, that command has generated:
  • manage.py, a command-line utility.
  • settings.py, a configuration file.
  • urls.py, URL declarations for this Django project.

Apart from that, you can find:

  • atlas, ATLAS specific folder.
  • cms, CMS specific folder.
  • core, where all the 'meat' is.
  • lhcb, LHCb specific folder.
  • router.py, multi-database router. *Be careful with it*
  • ssl.py, adds an ssl layer to http connections (https).

5.9.1.1. settings.py

We can see that there are four different databases: the central one with users (default), and one per application. Right now, the names of the databases start with dev_. Also notice that media is taken from media2: due to the coexistence of HCv3 and HCv4, we had to create a parallel directory. Just do not forget it.

    MEDIA_URL = 'http://hammercloud.cern.ch/media2/'
    ADMIN_MEDIA_PREFIX = '/media2/admin/'
  
We are using a router to deal with multi databases.
    DATABASE_ROUTERS = ['hc.router.PrimaryRouter']
  
We have added the ssl middleware:
    MIDDLEWARE_CLASSES = (
      ...
      'hc.ssl.SSLRedirect',
    )
  
And finally, the applications:
    INSTALLED_APPS = (
      'hc.core',
      'hc.atlas',
      'hc.cms',
      'hc.lhcb',
      ...
  

5.9.1.2. router.py

This file has four methods, but it is better to explain first how the Meta property app_label works. Whenever a model is created within a models.py file, the application name is added to the Meta dictionary; in our case, we are adding either 'atlas', 'cms' or 'lhcb'. The names are taken from the application top directories, see INSTALLED_APPS in settings.py.

  • db_for_read : if the requesting model's app label is in settings.INSTALLED_APPS (we only accept atlas, cms and lhcb with this setup), we return it, which is, by the way, the name of the database in the DATABASES dictionary in settings.py. It is easier to read the code than to explain it. Otherwise, we return None, which is the same as returning 'default'.
  • db_for_write : same procedure as above.
  • allow_relation : we don't need anything special.
  • allow_syncdb : we only synchronize the model with the database if the model's app_label is in INSTALLED_APPS and if the app_label and the database match.
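A minimal sketch of such a router, consistent with the description above (the real hc/router.py may differ in details; the APP_LABELS set stands in for the check against settings.INSTALLED_APPS):

```python
# Illustrative Django-style multi-database router following the behaviour
# described above. APP_LABELS stands in for the INSTALLED_APPS check.
APP_LABELS = {"atlas", "cms", "lhcb"}

class PrimaryRouter(object):
    def db_for_read(self, model, **hints):
        # Route reads for app models to the database of the same name;
        # returning None lets Django fall back to the 'default' database.
        label = model._meta.app_label
        return label if label in APP_LABELS else None

    def db_for_write(self, model, **hints):
        # Same procedure as reads.
        return self.db_for_read(model, **hints)

    def allow_relation(self, obj1, obj2, **hints):
        return None  # nothing special

    def allow_syncdb(self, db, model):
        # Sync only when the app label and the target database match.
        label = model._meta.app_label
        if label in APP_LABELS or db in APP_LABELS:
            return db == label
        return None
```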

5.9.1.3. ssl.py

Read the snippet; basically, we are adding ssl to the admin pages.

5.9.1.4. manage.py

You can open it, you can read it, you can execute it. Do not modify it unless you know what you are doing. Two commands can be highlighted, but first do not forget to set up the needed environment:

      export PATH=/afs/cern.ch/sw/lcg/external/Python/2.5.4p2/x86_64-slc5-gcc43-opt/bin/:$PATH
      export PYTHONPATH=/afs/cern.ch/sw/lcg/external/mysql_python/1.2.2-mysql5.0.18-python2.5/x86_64-slc5-gcc43-opt/lib/python2.5/site-packages/:$PYTHONPATH
      export PYTHONPATH=$PYTHONPATH:/data/hc/external/django/ ##DJANGO VERSION##

      # Now ##DJANGO VERSION## = Django-1.2.3

      cd /data/hc/web/src/hc
    

DB synchronization ( being ##DATABASE## = ['atlas','cms','lhcb','default']):

      python2.5 manage.py syncdb --database=##DATABASE##
    

Interactive django-python console:

      python2.5 manage.py shell
    

5.9.1.5. urls.py

The urls file adds the URLs from:

  • hc.core.urls
  • hc.core.base.rss.urls
  • hc.core.base.xhr.urls
  • hc.core.base.xmlrpc.urls

and sets https instead of http for any request that goes to admin/.

5.9.2. templates

5.9.3. media

6. Tests

Tests are the human-readable part of HammerCloud. A test instance defines all the parameters necessary to submit jobs to the GRID. The basic parameters are:

  • id: unique among tests.
  • startime: when the test starts.
  • endtime: when the test ends.
  • state: state of the test. List of possible states below.
  • host: which machine is submitting the jobs.
  • sites: where the jobs are sent.

HammerCloud provides users with two different test types, stress and functional. Each has a special purpose, but the main objective of HammerCloud stays intact: both test types provide metrics and results drawn from the submission. On its way to improvement, HammerCloud has become a tool for testing, analyzing, evaluating and comparing sites with a few clicks.

If you want to know more about the two different flavors, keep on this chapter.

6.1. Stress test

It is a Distributed Analysis on-demand test: a large-scale stress test using real analysis jobs to test one or many sites simultaneously. This can be seen as a brute-force test. A list of possible usages:
  • Help commission new sites
  • Evaluate changes to site infrastructure
  • Evaluate SW changes
  • Compare site performances

6.2. Functional test

Functional tests perform basic site validation by frequently submitting a few jobs over long periods of time. This can be seen as a "ping" test. A list of possible usages:
  • Robot testing

6.3 Test Templates

A Test Template contains all the information needed to make the HammerCloud logic behind it work. Some of the parameters are completely orthogonal and have no dependencies on each other, but others are coupled, and A without B will not work. That is why test templates are configured beforehand. Let's see what's in a test template.

6.3.1. Parameters

Some of the parameters are common to every VO. They are the skeleton of the template, and loosely coupled. Others are not... The VO-dependent parameters are highly coupled and must be explained from different points of view depending on the VO.
6.3.1.1. Common
  • Type information
    • category: either stress or functional.
    • description: probably the most important field. This is what users see when selecting a test template. The more accurate, the better.
    • lifetime: only meaningful for active functional templates. For how long the test will run (in days).
    • active: only meaningful for functional templates. If it is active, a functional test will be started by the HammerCloud robot using this template if there is no other test running with this template. If it is not active, users can use it as a normal stress template; the lifetime parameter will be ignored and the user will be able to set up a start and end time.
  • Files
    • test script: the main HammerCloud script. Mainly used for development. Nothing to worry about.
    • gangabin: the Ganga version used to submit the jobs.
    • extraargs: probably empty. Some jobs need extra parameters; this is where they are added.
    • hosts
    • clouds
    • sites
    • dspatterns
    • users

6.3.1.2. ATLAS
[coming soon]

6.3.1.3. CMS
  • Files
    • Jobtemplate: file used by the GangaCMS plugin to create crab.cfg. Contains parameters like events_per_job and number_of_jobs. For example, 100_1000.tpl means that 100 jobs are created with 1000 events each.
    • Usercode: contains the Python configuration for CMSSW.
    • Inputtype: CMSSW version. It must match Testoption.
    • Testoption: used to choose the CMSSW version and the VOMS role. It must match Inputtype.

  Value                Explanation
  1_10000.tpl          1 job with 10000 events
  10_10.tpl            10 jobs with 10 events each
  10_100.tpl           10 jobs with 100 events each
  10_1000.tpl          10 jobs with 1000 events each
  100_100.tpl          100 jobs with 100 events each
  100_1000.tpl         100 jobs with 1000 events each
  1000_10.tpl          1000 jobs with 10 events each
  JPE.py               A simple test program that plots eta and pt of CaloJets
  MTR3.py              A test program reading 42 branches and information from CaloJets, GenJets, PFJets, Muons, Electrons, Photons and Tracks
  *_CHLD_CACHE20.py    A configuration using Lazy Download, read-ahead-buffered and a 20 MB cache
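The jobs_events naming convention of the .tpl templates above can be decoded mechanically. A small illustrative helper (hypothetical, not part of HammerCloud):

```python
def parse_jobtemplate(name):
    """Decode a '<jobs>_<events>.tpl' name into (jobs, events_per_job).

    Follows the naming convention of the CMS job templates listed above,
    e.g. 1_10000.tpl is 1 job with 10000 events.
    """
    stem, ext = name.rsplit(".", 1)
    if ext != "tpl":
        raise ValueError("not a jobs/events template: %s" % name)
    jobs, events = stem.split("_")
    return int(jobs), int(events)
```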

6.3.1.4. LHCb
[coming soon]

6.3.2. Rules

Not many rules are related to test templates.
  1. You cannot create/modify templates.
  2. Only HammerCloud operators can, so ask them if you still have questions after reading this.
  3. Stress templates NEVER have sites or clouds associated.
  4. You cannot create tests with ACTIVE functional templates.

APPENDIX 1 ( API )

A1.1. GET-JSON

HammerCloud has a simple JSON dumper that can be used to request certain values from the DB, specifically results and templates -- more options will come. Given an app, for instance CMS, we have four different actions against the server:

    http://hammercloud.cern.ch/hc/app/cms/xhr/json/
  

  • Number of results (per test): action number_of_results

      ?action=number_of_results&test=400
      ?action=number_of_results&test=400&test=401

  • Results (per test): action results

      ?action=results&test=400
      ?action=results&test=400&detailed=1
      ?action=results&test=400&test=401&detailed=1

  • Results (per test and site): action results_at_site

      ?action=result_at_site&test=400&site=T2_US_Caltech
      ?action=result_at_site&test=400&site=T2_US_Caltech&detailed=1
      ?action=result_at_site&test=400&test=401&site=T2_US_Caltech
      ?action=result_at_site&test=400&test=401&site=T2_US_Caltech&detailed=1
      ?action=result_at_site&test=400&test=401&site=T2_US_Caltech&site=T2_BR_SPRACE
      ?action=result_at_site&test=400&test=401&site=T2_US_Caltech&site=T2_BR_SPRACE&detailed=1

  • Templates: action templates

      ?action=templates
      ?action=templates&detailed=1
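A query against this endpoint can be assembled from Python. A minimal sketch, using only the endpoint and actions listed above (the helper itself is hypothetical and only builds the URL; the result could then be fetched with urllib.request.urlopen):

```python
from urllib.parse import urlencode

# Base GET-JSON endpoint as given above; 'cms' is one of the supported apps.
BASE = "http://hammercloud.cern.ch/hc/app/cms/xhr/json/"

def hc_query(action, **params):
    """Build a GET-JSON query URL. List values (e.g. several tests) become
    repeated parameters, matching the examples above."""
    query = urlencode([("action", action)] + sorted(params.items()), doseq=True)
    return BASE + "?" + query

url = hc_query("number_of_results", test=[400, 401])
```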

A1.2. XMLRPC

APPENDIX 2 ( Site exclusion )

For PanDA, HammerCloud can automatically disable queues when FUNCTIONAL tests are failing. Automatic exclusion follows a specific policy:

  • This policy applies only to sites named ANALY_*
  • Panda queue status is measured by HammerCloud: If the status is "online" or "brokeroff", the site is tested by HammerCloud. Sites in "offline", "test" or any other status are not tested.
  • One or more HC FUNCTIONAL test templates are used to evaluate the site functionality. The current test template(s) used for enforcement is http://hammercloud.cern.ch/atlas/80/testtemplate/, http://hammercloud.cern.ch/atlas/67/testtemplate/
  • If an "online" Panda site fails the most recent 5 consecutive test jobs (in the past 12 hours) of the above test template(s), the site will be excluded:
    • The Panda queue status will be changed from "online" to "brokeroff"
    • The cloud support mailing list will be notified of the automatic exclusion
  • If a "brokeroff" Panda site succeeds in running the 5 most recent test jobs (in the past 12 hours) of the above template(s), the site will be considered for re-inclusion:
    • The cloud support mailing list will be notified that the queue is again passing the tests
    • Note that the site is not automatically set online by HC. This is to ensure that sites do not start getting user jobs before the site admin has re-validated the site.
  • If a "brokeroff" Panda site succeeds in running the 5 most recent test jobs (in the past 12 hours) of the above template(s), the site will be automatically set online:
    • The cloud support mailing list will be notified that the queue was set online
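The exclusion criterion above (the most recent 5 consecutive test jobs failed, all within the past 12 hours) can be sketched as follows. This is an illustration of the policy, not the production code:

```python
from datetime import datetime, timedelta

def should_exclude(jobs, now, window_hours=12, threshold=5):
    """Decide exclusion per the policy above.

    jobs: list of (finish_time, succeeded) tuples, most recent first.
    Returns True when the most recent `threshold` jobs all failed and all
    finished within the last `window_hours` hours.
    """
    recent = jobs[:threshold]
    if len(recent) < threshold:
        return False  # not enough evidence to exclude
    cutoff = now - timedelta(hours=window_hours)
    return all(t >= cutoff and not ok for t, ok in recent)
```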

A2.1. Disabling the HC Auto-Exclusion Service for a Site

It is possible to disable the HC auto-exclusion service on a site-by-site basis. A list of sites "disabled" for the auto-exclusion service is available at http://hammercloud.cern.ch/atlas/autoexclusion

Sites in the "disabled" state will not be affected by the auto-exclusion procedure above.

If you wish to disable the auto-exclusion service for a specific site, use the following command:

    curl http://hammercloud.cern.ch/atlas/autoexclusion/disable/ANALY_XYZ

where ANALY_XYZ is the queue you wish to change.

To re-enable the auto-exclusion service, use the command:

    curl http://hammercloud.cern.ch/atlas/autoexclusion/enable/ANALY_XYZ
Topic revision: r9 - 2011-05-10 - MarioUbedaGarcia