HammerCloud Architecture

In this twiki we introduce the HammerCloud system, starting with a generic overview of the technology and the use cases currently supported. The second section describes the test workflow and its lifecycle. The third section gives a general overview of the architecture, summarizing and complementing this part of the available documentation. The fourth section provides a preliminary overview of the implementation of each building block, together with some key concepts of the HC implementation paradigm.

HC from 15 km away: its technology and how it is used today

HammerCloud architecture diagram:
HammerCloud_achitecture_diagram.png

HammerCloud (HC) is a service to test sites and report the results. It is used to perform basic site validation, to help commission new sites, to evaluate SW changes and to compare site performance.

The HammerCloud system is composed of:

  • Apache web server, complemented by varnish (an HTTP accelerator) and memcached (a caching system)
  • Django framework: a Django application was created for every experiment, extending a common core application.
  • 4 MySQL databases, hosted by the DBonDemand service at CERN. There is one database for every application (all with the same schema) and one database containing the information needed by the Django admin site.
  • Ganga: it provides the interface to the experiments' WMS and is used by HammerCloud to submit, monitor and retrieve the results of the jobs that are part of the tests. HammerCloud uses a local instance.
  • A set of cronjobs, used to run commands that take care of the main functionality of HammerCloud:
    • test generation,
    • test submission,
    • proxy management,
    • summary creation,
    • experiment-specific functionality: site blacklisting, SW release checks and dataset lookups for ATLAS, generation of the robot summary plots for CMS
HammerCloud is implemented using different languages:
  • Python, to develop Django applications and scripts
  • Javascript for the web interface, to produce tables and manage user input
  • Shell scripting
Currently there are three available HC applications: ATLAS, CMS and LHCb. Regarding the supported test types, it is important to clarify that HammerCloud can run:

  1. Functional (automated) testing: tests configured by experts/admins, which frequently submit a few "ping" jobs. They are auto-scheduled, which means configure once, run forever. Used for basic site validation.
  2. Stress (on-demand) testing: tests configured by any user through a template system, which submit a large number of jobs concurrently during a time slot. They are on-demand, which means configure once, run once. Used to help commission new sites, evaluate changes to the site infrastructure, evaluate SW changes, compare site performance, check site evolution, etc.

HC Test Workflow

HC implements basically the same workflow for both functional and stress tests. From the HC perspective there are two main differences between the two types of test:

  1. Stress tests are created on demand by users and administrators and are run only once.
  2. Functional tests are configured through a functional test template and are automatically generated (added to the DB) by HC (the server node) so that they run continuously.
NOTE: in this section we refer to stress tests, but the description also fully applies to functional tests. Starting from the very beginning, including the test definition, the main steps of the HC workflow are:
  1. The user delegates the proxy to myproxy, allowing HC to retrieve it
  2. The user defines a JobTemplate and a (Test)Template; a given test template is bound to a specific job template. In a nutshell, the job template represents the actual payload of the user job, while the test template represents the backend configuration to be used by a specific test (e.g. the CRAB configuration/Panda configuration the user wants to use in the test), plus some HC configuration parameters.
    • The definition of both job and test templates also requires some manual intervention, like editing/creating specific files in a well-defined location of the HC filesystem
  3. Once the above steps are successfully completed, the user creates a test using the specific form in the HC GUI
  4. Once the test is created and approved, HC (which polls the HC DB) sees a test 'to be submitted'. At that point:
    • HC starts a process which takes care of consecutively executing the test generation, test submission and test tracking. The tracking phase proceeds until the end time of the test (as defined by the user at test creation time). The following subsection is dedicated to the description of the test lifecycle.

HC Test Lifecycle

As introduced just above, a HammerCloud test has 4 major phases that are performed consecutively:

  1. Test scheduling
    • This is not a proper phase of the test run, since it just involves the creation of the scripts that will run the test, the reservation of the space for the logs, and the creation of the at job that will run the test script at the time requested by the user in the test configuration.
  2. Test generate
    • This is the first proper test phase. It starts at the configured start time and has the objective of generating the Ganga jobs that will constitute the test itself. Every site that is set for testing will have one Ganga job file that will be processed by the GangaRobot module of Ganga. Please note that this step may be experiment specific.
  3. Test submission
    • The first batch of jobs is sent to the sites in this phase. The submission is performed by the GangaRobot, which only monitors the jobs until they are fully submitted. Each site receives its custom job (the jobs are mostly identical, but the input datasets are set to those hosted at that site to ensure correct brokerage).
  4. Test report
    • Usually this is the longest phase of the test. The GangaRobot runs a thread that monitors the status of the jobs and, as soon as they finish (completed or failed), stores statistics and generates the plots that allow users to track the test in real time. Resubmission is also performed at this time: based on a set of parameters, new jobs are sent to the sites (always with the same configuration per site) to keep a constant load on each site.
In addition to the lifecycle and workflow just presented, it is useful to have an idea of the state machine of a test. The following diagram shows the possible states a test can have.

StateFlow.png

HC Building Blocks: The Architecture

So far the HC workflow has been fully described. In the following we illustrate the HC architecture designed to enable this test workflow.

Disclaimer:

  • It is important to note that this section describes the architecture based on the building blocks that we have identified. The reader will notice a slightly different view with respect to the "official" documentation. The main difference is that in our view HC is a four-tier system composed of a Web service, a Server node and a Submission node, all having a DB as their core. The official documentation, on the contrary, bundles the Web service with the server host. The fact that the server node can host the DB and/or the Web service together with the template system is a deployment aspect and not a technical design constraint. (To some extent, all the HC services could be deployed together on a single node.)
As introduced, HammerCloud's design can be viewed as a four-layer infrastructure with a front end, a back end and middleware in between, plus a database. In terms of roles we can summarize as follows:

  1. The Web Service (the front end): provides an abstraction of the domain logic underneath, simplifying the underlying components by offering a user-friendly interface. It is based on the open source web framework Django. It is a "quasi pure" Django application which exposes the Admin/User GUIs.
  2. The Server Node: this node provides the infrastructure needed to run HammerCloud as a service. It takes care of the load balancing. All its inputs come from the DB and its output goes to the DB. This is where the auto-scheduling module for functional tests runs.
  3. The DB: a MySQL DB bundled with its Django interface. In fact every DB access (as well as sanity checks and security backups) is managed through the Django interface. Among other things, the role of the DB is to allow communication between the various tiers of the HC stack. More details can be found in the next section as well as in the official documentation (schema description).
  4. The Submit Node: this is where the bulk of the work is performed. The submit node handles Grid interaction through Ganga, a job front end for accessing the Grid, where each experiment makes use of its own plugin. The submit node comprises the blocks Ganga, GangaRobot, Ganga plugin and Grid (and thus eventually all the experiment-specific software). The node gets work assigned to it by polling the DB and schedules an at job for each approved test. Each at job implements the following workflow:
    • Job Generation -> Job submission -> Job reporting -> Job monitoring
This is a schema which aims to summarize the above description:
HC_Dia_Review.png

A first look at the implementation

Having identified and described the four building blocks, we now provide a first description of the related implementation, trying to highlight the main aspects of each block. Where we feel it may help the reader, we have also added snapshots of the actual code.

Web Service

The HammerCloud web service has been implemented using the Django framework as a set of Django applications. The admin site has been developed through the standard admin application available in Django. To extend HammerCloud there are instructions that document how to properly create a new application extending the base design:

  • how to create a new app: here
  • application's structure and customization: here
Creating or customizing applications mainly consists of creating new views to extract data from the database and JavaScript code to generate the visualization through tables and plots. Inside a Django application, models are used to define the objects that translate into the database schema; in HammerCloud, applications inherit these models from a common abstract definition but can also customize them. Being mostly standard Django, we decided to drop all the details and related paradigms; a minimal sketch of the inheritance pattern is shown below.
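As an illustration only (the class names, field names and module paths below are assumptions and do not necessarily match the actual HC models), the inheritance pattern looks roughly like this:

from django.db import models

# Common abstract definition shared by all experiment applications
# (hypothetical module, e.g. hc/core/base/models.py).
class TestBase(models.Model):
    state     = models.CharField(max_length=32)
    starttime = models.DateTimeField()
    endtime   = models.DateTimeField()

    class Meta:
        abstract = True      # no table is created for the abstract base class


# Experiment application (hypothetical module, e.g. hc/atlas/models.py):
# it inherits the common fields and can add experiment-specific ones.
class Test(TestBase):
    cloud = models.CharField(max_length=32, blank=True)  # example of a customization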

DataBase

As already introduced, HammerCloud uses MySQL databases hosted by the DBonDemand service at CERN. Every application has its own database instance with a common schema (with a few differences for experiment-specific features, like the fields containing the information returned by the jobs). The schema design is available here. In addition, HammerCloud uses another DB instance for the database needed by the Django admin site to manage users, authentication and sessions.

As mentioned before, all the components of HammerCloud access the database through the interface provided by Django. This means that no queries are written directly in SQL; they are performed by accessing the model objects. For example, this is how the Server service retrieves the list of active functional templates from the database:

template       = custom_import('hc.%s.models.Template'%(app))
templates = template.objects.filter(category='FUNCTIONAL').filter(active=1)

The above call is equivalent to the following SQL statement:

SELECT * FROM template WHERE category = 'FUNCTIONAL' AND active = 1;
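Writes go through the same interface. As a minimal sketch (the model and field names used here are assumptions for illustration, not necessarily the exact HC schema), assigning a test to a submission host and persisting the change could look like this:

# Same custom_import helper as above; 'Test', 'host' and 'state' are hypothetical names.
test_model = custom_import('hc.%s.models.Test' % (app))

t = test_model.objects.get(id=test_id)
t.host  = chosen_host          # submission host selected by the server
t.state = 'tobescheduled'
t.save()                       # Django issues the corresponding UPDATE statement

The save() call maps to an SQL UPDATE statement in the same way as the filter above maps to a SELECT.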

Server Node

This service is made of a set of components, all implemented as cronjobs executing shell scripts that act as wrappers for the actual Python code.

The server has three main roles:

  • Run the server_main component, a script that takes care of the load balancing and of the generation of functional tests. In more detail, it has the following responsibilities (a minimal sketch of this logic is shown after this list):
    1. To retrieve from the db all the active functional templates
    2. To check if there are running or scheduled tests for each template
    3. To create a new test from the template if there are none,
    4. To assign the test to the submission host with the least load
    5. To check if there are stress tests in the tobescheduled state without an assigned host, and to assign them to the submission host with the least load
  • Check the user proxy validity. HammerCloud needs a valid proxy to interact with the distributed resources.
    • For ATLAS it uses the gangarbt robot certificate, from which a 1-year-long proxy is created. The proxy check generates a short-lifetime proxy, with the proper extensions, from the 1-year-long one.
    • For CMS it uses a personal grid certificate credential (Andrea Sciaba'). Andrea uses a script that runs every day to generate a proxy and copy it to a specific location in the HammerCloud FS. The proxy check copies this file to a specific location, renaming it and giving it the right permissions so that it can be used by the various components.
      • The HC CRAB3 instance has recently been enabled to rely completely on the myproxy server.
  • Act as a container for experiment-specific components. Other components of this service implement functionality that is experiment specific, for example:
    • For ATLAS: the blacklist-main script blacklists sites based on their efficiency in the HC tests.
    • For CMS: the robot-main script generates the summaries used in the robot web page to show site efficiency and availability.
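A minimal sketch of the server_main scheduling logic described above, using the Django model interface. The model names, field names, the state names beyond those mentioned in the text and the load metric are assumptions for illustration and do not necessarily match the HC code:

# Hypothetical sketch of the server_main scheduling loop for one application.
template_model = custom_import('hc.%s.models.Template' % (app))
test_model     = custom_import('hc.%s.models.Test' % (app))
host_model     = custom_import('hc.%s.models.Host' % (app))

def least_loaded_host():
    # Pick the enabled submission host with the lowest load
    # (the 'load' field and its meaning are assumptions).
    return host_model.objects.filter(enabled=1).order_by('load')[0]

# Responsibilities 1-4: keep one scheduled/running test per active functional template.
for tpl in template_model.objects.filter(category='FUNCTIONAL', active=1):
    alive = test_model.objects.filter(template=tpl,
                                      state__in=['tobescheduled', 'scheduled', 'running'])
    if not alive.exists():
        test_model.objects.create(template=tpl,
                                  state='tobescheduled',
                                  host=least_loaded_host())

# Responsibility 5: assign orphan stress tests to the least loaded host.
for t in test_model.objects.filter(state='tobescheduled', host__isnull=True):
    t.host = least_loaded_host()
    t.save()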

Submit Node

As previously introduced, this is the service responsible for the interaction with the distributed environment. Everything happens through Ganga, which interfaces with the experiment WMS, which in turn knows about the grid/cloud. The submit node has also been implemented around a main cronjob. It is called submit_main and implements the following workflow (a sketch of step 3 is shown after this list):

  1. update the host load on the database
  2. check the database for tests in tobescheduled state assigned to the host
  3. for every such test, it creates a work area on the HC FS, generates a run-test script and schedules its execution on the host at the test's start time with an at job. The test state is set to scheduled
  4. perform health checks on the running tests of the host and try to take action in case of problems
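As an illustration of step 3 only (the paths, directory layout and helper names below are assumptions, not the exact HC code), scheduling a run-test script with at could look like this:

# Hypothetical sketch: prepare the work area and schedule run-test with 'at'.
import os
import subprocess

# Hypothetical locations and identifiers; the real layout lives in the HC configuration.
HC_DIR = '/opt/hc/core'
HC_APP = '/opt/hc/apps/cms'
APP    = 'cms'

def schedule_test(test):
    workdir = os.path.join(HC_APP, 'testdirs', 'test_%d' % test.id)
    os.makedirs(os.path.join(workdir, 'jobs'))

    # Generate a minimal run-test wrapper that calls the test-run executor
    # (the environment setup performed by the real run-test script is omitted here).
    script = os.path.join(workdir, 'run-test.sh')
    with open(script, 'w') as f:
        f.write('#!/bin/bash\n')
        f.write('%s/scripts/test-run.sh %s %d\n' % (HC_DIR, APP, test.id))
    os.chmod(script, 0o755)

    # POSIX 'at -t' takes the time as [[CC]YY]MMDDhhmm: run the script at the test start time.
    subprocess.check_call(['at', '-f', script,
                           '-t', test.starttime.strftime('%Y%m%d%H%M')])

    test.state = 'scheduled'
    test.save()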
When executed, the run-test script:
  • sets up the proper environment (specific to the application)
  • calls the test-run script.
test-run is the actual test executor and it performs the following sequential actions:

# Set up the core environment.
source $HCDIR/scripts/config/config-main.sh $APP $*

# Set up the submission environment.
source $HCDIR/scripts/config/config-submit.sh $GANGABIN

# Launch the test_generate action.
echo 'Launching the test_generate action...'
python $HCDIR/python/scripts/dispatcher.py -f test_generate -t $TESTID -m $HC_MODE

# Launch the test_submit action.
echo 'Launching the test_submit action...'
python $HCDIR/python/scripts/dispatcher.py -f test_submit -t $TESTID -m $HC_MODE

# Launch the test_report action.
echo 'Launching the test_report action...'
python $HCDIR/python/scripts/dispatcher.py -f test_report -t $TESTID -m $HC_MODE

  1. Test generate: the result of this step is to create a Ganga job spec for every site to which the test has to submit. The job spec is generated using the test configuration information gathered from the database. This step is customized by each application, as the Ganga job spec structure depends on the plugin used by the application and might contain experiment-specific functionality like, for example, performing data discovery.
  2. Test submit: the generated Ganga jobs are submitted through Ganga with the GangaRobot application:
    echo 'Launching Ganga for submission...'
    $GANGABIN --config=$HCAPP/config/$APP-ROBOTStart.INI.50 -o[Logging]_format=VERBOSE -o[Configuration]gangadir=$HCAPP/testdirs/test_$1/gangadir -o[Logging]_logfile=$HCAPP/testdirs/test_$1/ganga.log $2 robot run $HCAPP/testdirs/test_$1/jobs/*
          
  3. Test report: this step starts the Ganga engine again, passing the runtest_default script to be run within the Ganga session. This runs until the end time of the test is reached and performs the following tasks:
    echo "Reading $HCAPP/testdirs/test_$1/gangadir"
    
    echo 'Launching Ganga for reporting...'
    umask 022
    $GANGABIN --config=$HCAPP/config/$APP-ROBOT.INI.50 -o[Logging]_format=VERBOSE -o[Configuration]gangadir=$HCAPP/testdirs/test_$1/gangadir -o[Logging]_logfile=$HCAPP/testdirs/test_$1/ganga.log $2 $HCAPP/python/scripts/runtest_$HC_MODE.py $1
          
    • Job monitoring: this is performed by the Ganga monitoring thread, which polls the WMS through the proper plugin; Ganga then updates its local job repository (implemented in SQLite) with the updated status. It also takes care of retrieving the job results and storing them in its repository.
    • Site load generation (this and the following tasks are implemented inside the runtest_default script, which defines the test management logic): a copy thread is used to check the submitted and running jobs per site. It takes care of job resubmission (through Ganga) according to the site load configuration provided by the user.
    • Job reporting: a loop in the script updates the result table of the database with the job information gathered by accessing the Ganga job objects. When the test end time is reached, the script kills all the leftover jobs (those neither completed nor failed). A minimal sketch of this loop is shown after this list.
    • Job summaries: a thread is used to generate the summaries of the results and the plots to be visualized on the web interface.
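The following is a minimal sketch of the job reporting idea, not the actual runtest_default code: the Result model, its fields, the polling interval and the way the Ganga jobs registry is accessed are assumptions for illustration.

# Hypothetical sketch of the reporting loop executed inside the Ganga session.
import time
from datetime import datetime

result_model = custom_import('hc.%s.models.Result' % (app))   # hypothetical model

while datetime.now() < test.endtime:
    for j in jobs:                                   # Ganga jobs registry of the session
        r, _ = result_model.objects.get_or_create(test=test, ganga_jobid=j.id)
        r.site   = getattr(j.backend, 'site', '')    # backend attribute depends on the plugin
        r.status = j.status
        r.save()
    time.sleep(300)                                  # polling interval is an assumption

# End time reached: kill whatever is not in a final state.
for j in jobs:
    if j.status not in ('completed', 'failed'):
        j.kill()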

Valentina's and Daniele's notes

Here is a collection of open questions/issues, personal observations and other items we would like to discuss at some point. There is no order of priority/importance.

  • While integrating CRAB3 and doing some early scale tests we realized two main things:
    • The Ganga plugin for CRAB3 requires serious optimization, otherwise Ganga will consume too much memory (due to internal reasons), which may limit the usage of HC.
    • With the aim of setting a target for the optimization, we looked for HC benchmarks and/or scale tests. We did not really find any complete literature (e.g. a setup to reproduce, etc.).
  • The official documentation is only partially up to date; without some know-how about Ganga, extending HC to a new customer requires quite some effort (more on the optimization than on the implementation).
  • Log management (currently the test work areas are hosted and served through a web server separate from the main one).
  • The current deployment also foresees CMS software (CRAB and CMSSW) locally installed on the submission nodes.
  • The previous deployment had the HammerCloud code deployed on every node and synced with rsync from a central location. In the current deployment (the result of the migration to the IT hammercloud OpenStack project), the code is deployed on a central host and accessed by the other hosts via NFS.
  • The plots on the Web service are created with GChartWrapper (Python Google Chart Wrapper).
  • SLS sensor component on the server
  • Sentry + Raven are used for exception logging; the Sentry server is installed on the services machine and uses the core database for logging.
Minutes following the discussion during Daniele's and Valentina's presentation on the overall architecture

-- SpigaDaniele - 10 Apr 2014

-- JuliaAndreeva - 17 Apr 2014

Topic attachments
  • HC_Dia_Review.png (94.2 K, 2014-04-17, JuliaAndreeva) - Hammer Cloud Diagram
  • HammerCloud_achitecture_diagram.png (76.8 K, 2014-04-17, JuliaAndreeva) - Hammer Cloud Architecture diagram
  • StateFlow.png (24.7 K, 2014-04-17, JuliaAndreeva) - State flow