Certification process for LHCbDIRAC releases

[This process has moved to a Trello board and to a Slack channel]

Please see https://trello.com/lhcbdiraccertificationteam and lhcbdirac.slack.com #lhcb-certification

Certifying a release is a process. A number of steps are needed before we can finally say that a release is at production level. Within LHCbDIRAC, we are trying to streamline and automate this process as much as possible. Even so, some tests still require manual intervention.

We can split the process into a series of incremental tests.

Within the following sections we describe, step by step, all the actions needed.

For the general ideas behind certification of a (VO)DIRAC release, please see http://dirac.readthedocs.io/en/integration/DeveloperGuide/CodeTesting/index.html

The philosophy behind the whole process is to start by running tests in isolation, then move to progressively larger scopes: first tests of the integration between systems, then "smoke tests", and finally system tests, which verify that the whole system works well together.

Repositories of tests

There are 2 repositories, one for DIRAC and one for LHCbDIRAC (which extends the first), that contain various tests or code used for the certification process.

Jenkins

We run what can be automated via Jenkins.

Pylint

Pylint is a useful tool for every developer. Each developer should ALWAYS run pylint on their code BEFORE committing. We also run pylint within Jenkins, which produces graphical results that are easy to check.

We give pylint a custom configuration file. This changes a few things that are specific to DIRAC, such as the indentation width, which is 2 spaces instead of the default 4.

Needless to say, the number of pylint warnings should decrease from an older release to a newer one. Any "+" sign, that is, any increase in a warning count, should be carefully investigated.
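As an illustration of the comparison described above, here is a minimal sketch that counts pylint messages per message code in two text reports and lists the codes whose count went up. It assumes the default text output format of recent pylint versions ("path:line:column: CODE: message (symbol)"); the function names are made up for this example and are not part of any Jenkins job.

```python
import re
from collections import Counter

# Assumed line format: "path:line:column: W0611: Unused import os (unused-import)"
MESSAGE = re.compile(r"^.+:\d+:\d+: ([CRWEFI]\d{4}): ")


def count_messages(pylint_output):
    """Count pylint messages per message code in a text report."""
    counts = Counter()
    for line in pylint_output.splitlines():
        match = MESSAGE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def regressions(old_output, new_output):
    """Return {code: increase} for every code that became more frequent."""
    old = count_messages(old_output)
    new = count_messages(new_output)
    return {code: new[code] - old[code] for code in new if new[code] > old[code]}
```

An empty result from regressions() means that no message code became more frequent between the two reports.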

Unit tests

Unit tests are written by the developers and, like pylint, should be run by the developers before committing any code. They are also run by the same Jenkins jobs that run pylint. Coverage information is not shown at the moment, due to a technical issue that is still to be investigated.

Unit test results should be checked by the developers and by the certification managers, who should notify the developers of any failing tests.

The Jenkins jobs do a good job of showing whether tests are failing, and which ones.
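For readers new to the idea, the following is a small sketch of what a unit test "in isolation" can look like: the code under test talks to a DB object, and the test replaces that object with a mock so that no DB is needed. The function count_waiting_jobs and the method getJobsByStatus are hypothetical names chosen for this example; only the {"OK": ..., "Value": ...} return structure follows the usual DIRAC convention.

```python
import unittest
from unittest.mock import MagicMock


def count_waiting_jobs(job_db):
    """Hypothetical function under test: count the jobs in 'Waiting' status."""
    result = job_db.getJobsByStatus("Waiting")
    if not result["OK"]:
        return 0
    return len(result["Value"])


class CountWaitingJobsTest(unittest.TestCase):
    def setUp(self):
        # The DB is mocked, so the test needs no running DB or service.
        self.db = MagicMock()

    def test_counts_jobs(self):
        self.db.getJobsByStatus.return_value = {"OK": True, "Value": [1, 2, 3]}
        self.assertEqual(count_waiting_jobs(self.db), 3)
        self.db.getJobsByStatus.assert_called_once_with("Waiting")

    def test_error_gives_zero(self):
        self.db.getJobsByStatus.return_value = {"OK": False, "Message": "DB down"}
        self.assertEqual(count_waiting_jobs(self.db), 0)
```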

Integration tests for clients, services, and DBs

Within the repositories, there are a number of integration tests that all share the same structure: they test the interaction between clients, services, and DBs. There is, in fact, a pattern used in several places within DIRAC: a (MySQL) DB whose functionality is exposed through a myDB.py class, a service (my_handler.py) that exposes that functionality via XMLRPC, and a client that instantiates an RPCClient object pointing to that service. More than one integration test exercises this interaction. To run these tests, the DB has to be present, the service has to be up and running, and the proxy has to be valid. They also have to run within a (LHCb)DIRAC installation.
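The DB, service, and client layering described above can be sketched without DIRAC at all. The example below is only a stand-in: it uses an in-memory SQLite DB instead of MySQL, calls the handler directly instead of going over the network, and all class and method names are invented for illustration. The "export_" prefix and the {"OK": ..., "Value": ...} return structure mimic DIRAC conventions.

```python
import sqlite3


class MyDB:
    """Stand-in for a myDB.py class (in-memory SQLite instead of MySQL)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE items (name TEXT PRIMARY KEY, value TEXT)")

    def add_item(self, name, value):
        self.conn.execute("INSERT INTO items VALUES (?, ?)", (name, value))

    def get_item(self, name):
        row = self.conn.execute(
            "SELECT value FROM items WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None


class MyHandler:
    """Stand-in for the service: exposes the DB functionality to callers."""

    def __init__(self, db):
        self.db = db

    def export_addItem(self, name, value):
        try:
            self.db.add_item(name, value)
        except sqlite3.IntegrityError as exc:
            return {"OK": False, "Message": str(exc)}
        return {"OK": True, "Value": None}

    def export_getItem(self, name):
        return {"OK": True, "Value": self.db.get_item(name)}


class MyClient:
    """Stand-in for the client; a real one would hold an RPCClient instead."""

    def __init__(self, handler):
        self._handler = handler

    def __getattr__(self, name):
        # Forward client.addItem(...) to handler.export_addItem(...).
        return getattr(self._handler, "export_" + name)
```

An integration test of this pattern drives only the client and checks that the values make the full round trip through the service and the DB.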

Integration tests of this type are written for some of the existing systems, but not for all. Developers write these integration tests, and should use them while developing a service, a DB, or a client. The usual assumption is that the DB is empty when the test starts, and that the test empties it again as its last operation. This procedure is documented within the DIRAC developer guide, for example at http://dirac.readthedocs.io/en/integration/DeveloperGuide/AddingNewComponents/index.html

We have set up a Jenkins job that does the following:
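The "empty at the start, emptied at the end" convention maps naturally onto the setUp and tearDown hooks of unittest. The sketch below illustrates this with an in-memory SQLite table as a stand-in for the real test DB; the table and test names are hypothetical.

```python
import sqlite3
import unittest

# Stand-in for the dedicated test DB instance (in-memory SQLite, not MySQL).
CONNECTION = sqlite3.connect(":memory:")
CONNECTION.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")


def count_rows():
    """Return the number of rows currently in the jobs table."""
    return CONNECTION.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]


class JobsTableIntegrationTest(unittest.TestCase):
    def setUp(self):
        # Fail early if a previous run left rows behind.
        self.assertEqual(count_rows(), 0, "DB must be empty before the test starts")

    def tearDown(self):
        # Empty the DB as the last operation, whether the test passed or failed.
        CONNECTION.execute("DELETE FROM jobs")

    def test_insert_and_read(self):
        CONNECTION.execute("INSERT INTO jobs (status) VALUES ('Waiting')")
        self.assertEqual(count_rows(), 1)
```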

  • install (LHCb)DIRAC
  • install all the MySQL DBs that are found in the system, using dirac commands, on a MySQL instance used for this purpose only
  • install all the DIRAC services that are found in the system, using dirac commands
  • run all the integration tests of this type

We have Jenkins jobs both for DIRAC and for LHCbDIRAC. These jobs should be started by hand. Because they all read from and write to the same DB, take care never to have 2 of them running at the same time.

These Jenkins jobs do not do a good job of showing whether tests are failing, and which ones. The only way to know is to inspect the Jenkins job output and look for possible failures.
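One generic way to guard against two runs sharing the DB at the same time is a lock file. The sketch below is an assumption-laden illustration, not something the existing Jenkins jobs are known to do: it only helps if both runs see the same filesystem, and the lock path is whatever the caller chooses.

```python
import os
from contextlib import contextmanager


@contextmanager
def exclusive_run(lock_path):
    """Hold a lock file for the duration of a run; fail if one already exists."""
    try:
        # O_EXCL makes the creation atomic: it fails if the file is already there.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError("another run already holds %s" % lock_path) from None
    try:
        # Record the owner's process ID to help debug a stale lock.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A run would wrap its test phase in "with exclusive_run(path):", so that a second run started by mistake stops immediately instead of writing to the shared DB.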
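Inspecting the job output can be partly mechanised. The sketch below scans console text for the "FAIL:" and "ERROR:" header lines that Python's unittest runner prints for failing tests. It assumes that the tests are run with that runner and that the console output has been saved as text; other runners print different markers.

```python
import re

# Header lines printed by Python's unittest runner, for example:
#   FAIL: test_name (module.ClassName)
FAILURE_LINE = re.compile(r"^(FAIL|ERROR): (\S+) \((.+)\)")


def failing_tests(console_text):
    """Return (kind, test name, test location) for each failure in the output."""
    found = []
    for line in console_text.splitlines():
        match = FAILURE_LINE.match(line.strip())
        if match:
            found.append(match.groups())
    return found
```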

Integration tests for jobs

LHCb Jobs are run using the workflow package. This makes it possible to run a job locally. The point is that, if you have the correct configuration in your cfg file(s), you can emulate running a job on a worker node. For example, if your configuration is the following:

DIRAC
{
  Setup = LHCb-Certification
  Configuration
  {
    Servers = dips://lbvobox18.cern.ch:9135/Configuration/Server
  }
}
LocalInstallation
{
  Setup = LHCb-Certification
  ConfigurationServer = dips://lbvobox18.cern.ch:9135/Configuration/Server
  SkipCADownload = True
  SkipVOMSDownload = True
  SkipCAChecks = True
  UseServerCertificate = True
  SiteName = DIRAC.Jenkins.ch
  CEName = jenkins.cern.ch
  LocalSE = CERN-SWTEST
}
LocalSite
{
  ReleaseProject = LHCb
  ReleaseVersion = v8r2-pre2
  GridMiddleware = DIRAC
  Site = DIRAC.Jenkins.ch
  GridCE = jenkins.cern.ch
  CEQueue = jenkins-queue_not_important
  Architecture = x86_64-slc6
  CPUScalingFactor = 9.4
  CPUNormalizationFactor = 9.4
  LocalSE = CERN-SWTEST
}

you can then run https://github.com/DIRACGrid/DIRAC/blob/integration/tests/Workflow/Integration/Test_UserJobs.py which is just a "hello world" type of job. Obviously, you'll need a proxy.
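Before launching a local job it can be useful to check that the configuration really points to the certification setup. The sketch below is a deliberately minimal parser for the section and option layout shown above, followed by one illustrative check. DIRAC ships its own cfg parser, which a real check should prefer; parse_cfg and setup_problems are hypothetical names used only here.

```python
def parse_cfg(text):
    """Parse 'Name { ... }' sections and 'key = value' options into nested dicts."""
    root = {}
    stack = [root]
    pending = None  # last bare name seen, waiting for its opening brace
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line == "{":
            section = stack[-1].setdefault(pending, {})
            stack.append(section)
            pending = None
        elif line == "}":
            stack.pop()
        elif "=" in line:
            key, value = line.split("=", 1)
            stack[-1][key.strip()] = value.strip()
        else:
            pending = line
    return root


def setup_problems(cfg, expected="LHCb-Certification"):
    """Illustrative check: report sections whose Setup is present but unexpected."""
    problems = []
    for section in ("DIRAC", "LocalInstallation"):
        if section in cfg and cfg[section].get("Setup") != expected:
            problems.append("%s/Setup is not %s" % (section, expected))
    return problems
```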

More complex tests are in LHCbTestDirac.Integration.Workflow.[Test_ProductionJobs.py,Test_UserJobs.py] (some of these are way more complex, some do dataReconstruction, stripping, mc...).

We have set up a Jenkins job that does the following:

  • runs the dirac-pilot script, but skips the "LaunchAgent" command, so that we re-create the same environment that one can find on a worker node
  • run the workflow integration tests

The DIRAC one is https://buildlhcb.cern.ch/jenkins/view/DIRAC/job/dirac_pilot_certification_vXrY/, the LHCbDIRAC one (which takes much longer to complete) is at https://buildlhcb.cern.ch/jenkins/view/LHCbDIRAC/job/lhcbdirac_pilot_certification/

These Jenkins jobs should be started by hand.

As with the previous ones, these Jenkins jobs do not do a good job of showing whether tests are failing, and which ones. The only way to know is to inspect the Jenkins job output and look for possible failures.

Regression tests for jobs

Regression tests for jobs are very similar to the integration tests for jobs. The only difference is that the job description is not created anew: an old description, stored as an XML file, is used instead. These tests are also run within the same Jenkins jobs that run the integration tests for jobs.

System tests

System tests are a different thing: they mean going large and starting to interact with the Grid, something that we have not done up to this point. With system tests we run jobs on the Grid, write and read files on SEs, replicate and delete them, write to the certification Bookkeeping, and so on.

To proceed from here, we need a running server with the latest pre-release installed. It also needs a Grid behind it, and this Grid is in fact the same one used by the production setup. Everything else is specific to the LHCb-Certification setup, including the Bookkeeping and the DFC.

Jenkins tests

There is only one Jenkins job that we can consider a system test: https://buildlhcb.cern.ch/jenkins/view/LHCbDIRAC/job/lhcbdirac_submitAndMatch/ This job:

  • submits a real job to the LHCb-Certification setup, targeting the "DIRAC.Jenkins.ch" site
  • starts a pilot, fully, so including the JobAgent
  • tries to match, and run, the job previously submitted
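To make the submit-then-match idea above concrete, here is a toy model only: a queue that stores submitted jobs with their target site and hands the oldest suitable one to a pilot that asks for work. It does not reflect how DIRAC actually matches jobs, and apart from "DIRAC.Jenkins.ch" every name in it is hypothetical.

```python
class ToyTaskQueue:
    """Toy stand-in for the queue that a pilot asks for work."""

    def __init__(self):
        self._jobs = []

    def submit(self, job_id, site):
        """Store a job together with the site it targets."""
        self._jobs.append({"JobID": job_id, "Site": site})

    def match(self, pilot_site):
        """Hand out the oldest waiting job that targets the pilot's site."""
        for index, job in enumerate(self._jobs):
            if job["Site"] == pilot_site:
                return self._jobs.pop(index)
        return None
```

In this model, the test passes when a job submitted for "DIRAC.Jenkins.ch" is returned by match("DIRAC.Jenkins.ch") and is then no longer in the queue.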

Client tests

From now on, tests should be run only after making sure that we are pointing to the LHCb-Certification setup, so edit the file ~/.dirac.cfg as follows:

DIRAC
{
  Setup = LHCb-Certification
}
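Since running system tests against the wrong setup is an easy mistake to make, a small guard can read the file back and refuse to continue otherwise. The sketch below is a minimal illustration that only understands the simple layout shown above; the function names are hypothetical.

```python
import os


def dirac_setup(cfg_path):
    """Return the Setup option of the top-level DIRAC section, or None."""
    section, depth = None, 0
    with open(cfg_path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line == "{":
                depth += 1
            elif line == "}":
                depth -= 1
            elif depth == 0:
                section = line  # name of the next top-level section
            elif depth == 1 and section == "DIRAC" and "=" in line:
                key, value = line.split("=", 1)
                if key.strip() == "Setup":
                    return value.strip()
    return None


def require_certification_setup(cfg_path=os.path.expanduser("~/.dirac.cfg")):
    """Raise RuntimeError unless the file selects the LHCb-Certification setup."""
    setup = dirac_setup(cfg_path)
    if setup != "LHCb-Certification":
        raise RuntimeError("refusing to run: Setup is %r in %s" % (setup, cfg_path))
```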

The shell script "LHCbDIRAC.tests.System.Client.client_test.csh" calls a number of dirac scripts to verify that they run correctly.
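The idea of such a script, running a list of commands and reporting those that fail, can be sketched in a few lines. This is not a translation of client_test.csh, whose actual list of scripts is not reproduced here; it only illustrates the approach, with the commands passed in by the caller.

```python
import subprocess


def run_commands(commands):
    """Run each command and return [(command, exit code)] for those that failed."""
    failed = []
    for command in commands:
        completed = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
        if completed.returncode != 0:
            failed.append((command, completed.returncode))
    return failed
```

Each command is given as a list of arguments, and an empty result means that every command exited with status 0.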

User jobs

The following script....

Replicating and deleting files through a transformation

[ADD]

Running a MCSimulation transformation

This test involves the whole production system. Nothing is scripted: the only things to be done are "hand operations":

  • create steps, using the web portal (usually, take from production)
  • create a production request
  • launch it
  • verify that it correctly goes through testing phase, and so on

Usually, Vladimir runs this test.

Running other types of data processing transformations

This is not usually necessary, so it is discretionary.

Specific tests

The Jira board https://its.cern.ch/jira/secure/RapidBoard.jspa?rapidView=604 shows, on its left, all the LHCbDIRAC tasks that are marked as completed ("Resolved"). A "Resolved" task is a task that still needs to be tested within certification. On the right are the tasks in the "Closed" state. Once a task has been verified within the certification setup, drag and drop it from left to right to move it from one state to the other.

Schematically, the operations to perform:

  • when a new pre-release of DIRAC is out
  • when a new LHCbDIRAC pre-release is out:
    • From https://buildlhcb.cern.ch/jenkins/view/LHCbDIRAC/ start "lhcbdirac-pre-release", "lhcbdirac_certification", "lhcbdirac_pilot_certification", "lhcbdirac_submitAndMatch" and check the results
    • Run the client tests
    • Run the log checker script
    • Run the system tests of submitting user jobs to the Grid
    • Extend an existing MC transformation, and do basic checks
    • Create a new production request for MC, launch it, wait until results appear
    • Do all the system tests that are specific for the release (listed in https://its.cern.ch/jira/secure/RapidBoard.jspa?rapidView=604) and update the tasks accordingly

What to do if the tests reveal some issues.

The bugs should be tracked down. Use the slack channel.

-- FedericoStagni - 2015-07-06

Topic revision: r6 - 2017-08-01 - FedericoStagni