
HammerCloud Architecture

This twiki introduces HammerCloud (HC), starting with a generic overview that provides basic information about the technology and the use cases currently supported. The second section describes the test workflow and its lifecycle. The third section gives a general overview of the architecture, summarizing and complementing this part of the available documentation. The fourth section provides a preliminary overview of the implementation of each building block, as well as some key concepts of the HC implementation paradigm.

HC from 15 km away: its technology and how it is used today

HammerCloud_achitecture_diagram.png

HammerCloud (HC) is a service to test sites and report the results. It is used to perform basic site validation, to help commission new sites, to evaluate software changes, and to compare site performance.

The HammerCloud system is composed of:

  • Apache web server, augmented with Varnish (HTTP accelerator) and memcached (caching system)
  • Django framework: a Django application was created for every experiment, extending a common core application.
  • 4 MySQL databases, hosted by the DBonDemand service at CERN: one database per application (all with the same schema) and one database containing the information needed by the Django admin site.
  • Ganga: it provides the interface to the experiments' WMS and is used by HammerCloud to submit, monitor and retrieve the results of the jobs that are part of the tests. HammerCloud uses a local instance.
  • A set of cronjobs, used to run the commands that take care of the main functionality of HammerCloud:
    • test generation,
    • test submission,
    • proxy management,
    • summary creation,
    • experiment-specific functionality: site blacklisting, software release checks, dataset lookups for ATLAS, generation of the robot summary plot for CMS.

HammerCloud is implemented using several languages:

  • Python, to develop Django applications and scripts
  • Javascript for the web interface, to produce tables and manage user input
  • Shell scripting

Currently there are three available HC applications: ATLAS, CMS and LHCb. Regarding the supported test types, it is important to clarify that HammerCloud can run:

  1. Functional (automated) tests: tests configured by experts/admins, which frequently submit a few "ping" jobs. They are auto-scheduled, which means configure once, run forever. Used for basic site validation.
  2. Stress (on-demand) tests: tests configured by any user through a template system, which submit a large number of jobs concurrently during a time slot. They are on-demand, which means configure once, run once. Used to help commission new sites, evaluate changes to the site infrastructure, evaluate software changes, compare site performance, check site evolution, etc.

HC Test Workflow

HC implements essentially the same workflow for both functional and stress tests. From the HC perspective there are two main differences between the two types of test:

  1. Stress tests are created on demand by users and administrators and are only run once.
  2. Functional tests are configured through a functional test template and automatically generated (added to the DB) by HC (the server node), so that they are continuously running.

NOTE: In this section we refer to stress tests, but the description fully applies to functional tests as well. Starting from the very beginning, including the test definition, the main steps of the HC workflow are:

  1. The user delegates their proxy to myproxy, allowing HC to retrieve it.
  2. The user defines a JobTemplate and a (Test)Template; a given test template is bound to a specific JobTemplate. In a nutshell, the job template represents the actual payload of the user job, while the test template represents the template of the backend configuration to be used by a specific test (e.g. the CRAB configuration/PanDA configuration the user wants to use in the test).
    • In order to create JobTemplates and Templates, specific files must be edited/copied into specific locations of the file system, in addition to filling in specific forms in the HC GUI.
  3. Once the above steps are successfully completed, the user creates a Test, simply using the specific form in the HC GUI.
  4. Once the Test is created and approved, HC (which polls the HC DB) sees a test ready to be submitted. At that point:
    • HC starts a process which takes care of consecutively executing the test generation, test submission and test tracking. The tracking proceeds until the end time of the test. The following subsection expands on the test lifecycle.

HC Test Lifecycle

As introduced just above, a HammerCloud test has four major phases that are performed consecutively:

  1. Test scheduling
    • This is not a proper phase of the test run, since it just involves the creation of the scripts that will run the test, the reservation of the space for the logs, and the creation of the at job that will run the test script at the time chosen by the user in the test configuration.
  2. Test generation
    • This is the first proper test phase. It starts at the configured start time and has the objective of generating the Ganga jobs that will constitute the test itself. Every site that is set for testing will have one Ganga job file that will be processed by the GangaRobot module of Ganga. Please note that this may be experiment specific.
  3. Test submission
    • The first batch of jobs is sent to the sites in this phase. The submission is performed by the GangaRobot, which only monitors the jobs until they are fully submitted. Each site receives its custom job (the jobs are mostly identical, but the input datasets are set to those hosted at that site to ensure correct brokerage).
  4. Test report
    • This is usually the longest phase of the test. In this phase, the GangaRobot runs a thread that monitors the status of the jobs and, as soon as they finish (completed or failed), stores statistics and generates the plots that allow the users to track the test in real time. The resubmission is also performed at this time: based on a set of parameters, it sends new jobs to the sites (always with the same configuration per site) to keep a load on each site.

In addition to the lifecycle and workflow just presented, it is useful to have an idea of the state machine of a test. The following diagram shows the possible states a test can have.

StateFlow.png

HC Building Blocks: The Architecture

So far the HC workflow has been fully described. In the following we illustrate the HC architecture that was designed to enable this test workflow.

Disclaimer:

  • It is important to note that we present the architecture based on the building blocks we have identified. This will appear as a slightly different view with respect to the "official" documentation. The main difference is that in our view HC is a four-tier system: Web service, Server node and Submission node, all having a DB as their core. On the contrary, the official documentation tends to bundle the Web service with the server host, which from our perspective is more a deployment aspect. The fact that the server node can usually host the DB and/or the web service together with the template system is a deployment aspect and not a technical design constraint.

As introduced, HammerCloud's design can be viewed as a four-layer infrastructure with a front-end, a back-end and a middleware layer in between, plus a database. In terms of roles we can summarize it as follows:

  1. The Web Service, the frontend: provides an abstraction of the domain logic underneath, simplifying the underlying components by exposing a user-friendly interface. It is based on the open source web framework Django. It is a "quasi pure" Django application which exposes the Admin/User GUIs.
  2. The Server Node: this is where the auto-scheduling functional test module runs. This node also provides the infrastructure needed to run HammerCloud as a service and takes care of the load balancing. All its input comes from the DB, and its output goes to the DB.
  3. The DB: a MySQL DB, always accessed through the Django interface; no direct access is foreseen. All DB access (as well as sanity checks and security backups) is managed through the Django interface. Among other things, the DB's role is to allow communication between the various tiers of the HC stack. More details are given in the next section as well as in the official documentation (schema description).
  4. The Submit Node: this is where the bulk of the work is performed. The submit node handles the Grid interaction through Ganga, a job front end for accessing the Grid, where each experiment makes use of its own plugin. The submit node comprises the blocks Ganga, GangaRobot, Ganga Plugin and Grid (and thus all the experiment-specific layers). The node gets work assigned to it by polling the DB and schedules an at job for each approved test. Each at job implements the following workflow:
    • Job Generation -> Job submission -> Job reporting -> Job monitoring

The following schema aims to summarize the above descriptions:
HC_Dia_Review.png

A first look at the implementation

Having identified and described the four building blocks, we now provide a first description of the related implementation, trying to highlight the main aspects of each block. Where we feel it may help the reader, we have also added snapshots of the actual code.

Web Service

The HammerCloud web service has been implemented using the Django framework as a set of Django applications. The admin site has been developed through the standard admin application available in Django. To extend HammerCloud, there are instructions that document how to properly create a new application extending the base design:

  • how to create a new app: here
  • application's structure and customization: here

Creating or customizing applications mainly consists in creating new views to extract data from the database and JavaScript code to generate the visualization through tables and plots. Inside a Django application, models are used to define the objects that translate into the database schema; in HammerCloud, applications inherit these models from a common abstract definition but can also customize them.
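To make the shared-model pattern more concrete, the fragment below is a minimal sketch of an abstract base model and an experiment application specializing it. The class and field names are hypothetical and do not correspond to the real HammerCloud schema:

# Hypothetical sketch of the shared abstract model pattern described above;
# names and fields are illustrative, not the actual HammerCloud schema.
from django.db import models

class TemplateBase(models.Model):
    """Common template definition inherited by every experiment application."""
    description = models.CharField(max_length=256)
    category    = models.CharField(max_length=32)   # e.g. 'FUNCTIONAL' or 'STRESS'
    active      = models.IntegerField(default=0)    # 1 = active, matching the filter(active=1) usage below

    class Meta:
        abstract = True                              # no table is created for the base class

# In an experiment application (e.g. hc.atlas.models) the model becomes
# concrete and can add experiment-specific fields.
class Template(TemplateBase):
    extra_config = models.TextField(blank=True)      # hypothetical experiment-specific customization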

DataBase

As already said, HammerCloud uses MySQL databases hosted by the DBonDemand service at CERN. Every application has its own database instance with a common schema (with a few differences regarding experiment-specific features, such as the fields containing the information returned by the jobs). The schema design is available here. In addition, HammerCloud uses another database instance for the database needed by the Django admin site, to manage users, authentication and sessions.

As mentioned before, all the components of HammerCloud access the database through the interface provided by Django. This means that no queries are written directly in SQL; they are performed by accessing the model objects. For example, this is how the Server service retrieves the list of active functional tests from the database:

# import the Template model of the given application, then select
# all templates that are functional and active
template  = custom_import('hc.%s.models.Template' % (app))
templates = template.objects.filter(category='FUNCTIONAL').filter(active=1)

The above call is equivalent to the following SQL:

SELECT * FROM template WHERE category = 'FUNCTIONAL' AND active = 1;

Server Node

The components of this service are implemented as cronjobs executing shell scripts that act as wrappers around Python code.

The server has three main roles:

1. Run the server_main script. This script is the component that takes care of the load balancing and of the generation of functional tests (a simplified sketch of this logic is given at the end of this section). It:

  • retrieves from the DB all the active functional templates,
  • checks whether there are running or scheduled tests for each template,
  • creates a new test from the template if there are none,
  • assigns the test to the submission host with the least load,
  • checks whether there are stress tests in the tobescheduled state without an assigned host and assigns them to the submission host with the least load.
2. Check user proxy validity. HammerCloud needs a valid proxy with the credentials to submit to the various resources.
  • For ATLAS it uses the gangarbt robot certificate, from which a 1-year-long proxy is created. Check-proxy generates a short-lifetime proxy, with the proper extensions, from the 1-year-long one.
  • For CMS, HammerCloud uses a personal grid certificate credential (Andrea Sciaba'). Andrea uses a script that runs every day to generate a proxy and copy it to a specific location in the HammerCloud file system. Check-proxy copies this file to a specific location, renaming it and giving it the right permissions to be used by the various components.
  • Recently the HC CRAB3 instance was enabled to rely completely on a myproxy server.
3. Act as a container for experiment-specific functionality. Other components of this service implement functionality that is experiment specific, for example:
  • for ATLAS: the blacklist-main script blacklists sites based on their efficiency in the HC tests,
  • for CMS: the robot-main script generates the summaries used in the robot web page to show site efficiency and availability.
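The following fragment is the simplified sketch of the server_main logic announced above (template polling, test creation and least-loaded host assignment). It is an illustration only: the model usage mirrors the earlier ORM example, but the helper functions (create_test_from_template, submission_hosts, get_host_load) and the state values other than tobescheduled/scheduled are hypothetical:

# Simplified, hypothetical sketch of the server_main loop described above.
def server_main(app):
    Template = custom_import('hc.%s.models.Template' % (app))
    Test     = custom_import('hc.%s.models.Test' % (app))

    # retrieve all active functional templates
    for template in Template.objects.filter(category='FUNCTIONAL', active=1):
        # check whether a test for this template is already scheduled or running
        pending = Test.objects.filter(template=template,
                                      state__in=['tobescheduled', 'scheduled', 'running'])
        if not pending.exists():
            # create a new test from the template and assign it to the least loaded host
            test = create_test_from_template(template)
            test.host = min(submission_hosts(), key=get_host_load)
            test.state = 'tobescheduled'
            test.save()

    # assign any stress test in 'tobescheduled' state that has no host yet
    for test in Test.objects.filter(state='tobescheduled', host__isnull=True):
        test.host = min(submission_hosts(), key=get_host_load)
        test.save()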

Submit Node

As introduced, this is where the interaction with the distributed environment takes place. Everything happens through Ganga, which interfaces with the experiment workload management system, which in turn knows about the grid/cloud. It is implemented as a main cron job called submit_main. Its workflow can be summarized as:
  1. update the host load in the database,
  2. check the database for tests in the tobescheduled state assigned to the host,
  3. for every such test, create a work area, generate a run-test script and schedule its execution on the host at the test's start time with at (see the sketch after this list); the test state is set to scheduled,
  4. perform health checks on the running tests of the host and try to take action in case of problems.
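As an illustration of step 3, scheduling the run-test script with at could look roughly as follows. The work-area path, the script name and the model fields are hypothetical and not taken from the HammerCloud code; only the use of at itself is from the description above:

# Hypothetical sketch of scheduling a test run with `at` (step 3 above).
import subprocess

def schedule_test(test):
    workdir = '/data/hc/testdirs/test_%d' % test.id       # hypothetical work area
    runtest = '%s/run-test.sh' % workdir                   # generated run-test script (hypothetical name)
    at_time = test.starttime.strftime('%Y%m%d%H%M')        # POSIX `at -t` timestamp ([CC]YYMMDDhhmm)

    # pipe the command to `at`, which will execute it at the test start time
    proc = subprocess.Popen(['at', '-t', at_time], stdin=subprocess.PIPE)
    proc.communicate(input=('sh %s\n' % runtest).encode())

    test.state = 'scheduled'
    test.save()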

When executed, the run-test script sets up the proper environment (specific to the application) and calls the test-run script. Test-run is the actual test executor and performs the following sequential actions:

  1. Test generation: the result of this step is a Ganga job spec for every site to which the test has to submit. The job spec is generated using the test configuration information gathered from the database. This step is customized by the application, since the Ganga job spec structure depends on the plugin used by the application, and might contain experiment-specific functionality such as, for example, performing data discovery.
  2. Test submission: the generated Ganga jobs are submitted through Ganga with the gangarobot application.
  3. Test report: this step starts the Ganga engine again, passing the script runtest_default to be run within the Ganga session. This runs until the end time of the test is reached and performs the following tasks (a simplified sketch of the load/reporting loop is given after this list):
    • job monitoring: this is performed by the Ganga monitoring thread, which polls the WMS through the proper plugin; Ganga then updates the local job repository (implemented in SQLite) with the new status. It also takes care of retrieving the job results and storing them in its repository.
    • site load generation: (the following tasks are implemented inside the runtest_default script, which defines the test management logic) a copy thread is used to check the submitted and running jobs per site. It takes care of job resubmission (through Ganga) according to the site load configuration provided by the user.
    • job reporting: a loop in the script updates the result table of the database with the job information gathered by accessing the Ganga job objects. When the test end time is reached, the script kills all the leftover jobs (those not completed or failed).
    • job summaries: a thread is used to generate the summaries of the results and the plots to be visualized on the web interface.
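The fragment below is the simplified sketch of the site-load and reporting loop announced above, under heavily simplified assumptions: the helper functions (running_jobs, submit_job, update_result_table, kill_leftover_jobs) and the test attributes (endtime_epoch, target_load) are hypothetical and do not reflect the real runtest_default implementation:

# Heavily simplified, hypothetical sketch of the site-load / reporting loop described above.
import time

def keep_site_load(test, sites, poll_interval=300):
    while time.time() < test.endtime_epoch:                   # hypothetical end-time attribute
        for site in sites:
            active = running_jobs(test, site)                  # hypothetical: jobs still submitted/running at the site
            # resubmit enough copies of the site's job to keep the configured load
            for _ in range(max(0, test.target_load - len(active))):
                submit_job(test, site)                         # hypothetical: resubmission goes through Ganga
            update_result_table(test, site)                    # hypothetical: push job info to the DB result table
        time.sleep(poll_interval)
    kill_leftover_jobs(test)                                   # hypothetical: kill jobs not completed/failed at end time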

Valentina's and Daniele's note:

Here is a collection of open questions/issues, personal observations and other points we would like to discuss at some point:

  • While integrating CRAB3 and doing some early scale tests we realized two main things:
    • The Ganga plugin for CRAB3 requires serious optimization, otherwise Ganga will consume too much memory (due to internal reasons), which may limit the usability of HC.
    • In order to target our optimization work, we looked for HC benchmarks and/or scale tests. We did not really find any complete literature (e.g. a setup to reproduce, etc.).
  • The official documentation is only partially up to date; without some know-how about Ganga, extending HC to a new customer requires quite some effort (more on the optimization than on the implementation).
  • Log management (currently the test work areas are hosted and served through a web server separate from the main one).
  • The current deployment also foresees CMS software (CRAB and CMSSW) locally installed on the submission nodes.
  • The previous deployment had the HammerCloud code deployed on every node and synced with rsync from a central location, while in the current deployment (the result of the migration to the IT hammercloud OpenStack project) the code is deployed on a central host and accessed by the other hosts via NFS.
  • The plots on the Web service are created with WChartWrapper (Python Google Chart Wrapper).
  • SLS sensor component on the server.
  • Sentry + Raven for exception logging; the Sentry server is installed on the services machine and uses the core database for logging.

-- SpigaDaniele - 10 Apr 2014
