Technical Description of the LHCb Nightly Build System
This is a brief summary of the technical details of the LHCb Nightly Build System implementation, useful to understand what the various components of the system are and how they interact with each other.
Elements
The LHCb Nightly Build System consists of a few main components, described in the sections below: the scheduling layer (Jenkins), the artifacts' archive, the build and test scripts, and the dashboard.
These components interact so that, for each slot, project and platform, we check out, build, test, and report the results of the builds and tests.
The Scheduling
The scheduling of the operations (checkout, build and test) is achieved through the continuous integration (CI) system Jenkins, which can be seen as a cron on steroids.
Jenkins Basics
Jenkins is an extensible continuous integration system written in Java. Its main goal is to allow developers to regularly build and validate their projects. The great flexibility of Jenkins makes it possible to use the system in very different scenarios, ranging from the simple build and test of a project to very complex workflows, as in the LHCb case.
The key component of Jenkins is the Job: the prototype for a unit of work run by the system. When Jenkins executes a Job, the operations defined in its configuration are run. For example, in a simple case, a Job can be configured to check out the sources of a project from a repository, build the project and run the tests. In more complex scenarios, Jobs can be executed in different environments, or trigger other Jobs according to some conditions.
To effectively help in the testing of projects, Jenkins allows execution of Jobs on other machines (slaves), identified by name or by labels. For example, a build-and-test Job could be run on a Linux, a Windows and a MacOS machine to ensure that the code always works.
Jobs in Jenkins can be triggered by different conditions, for example by a change in the source code repository, by a change in the content of a URL, or regularly (like with the UNIX cron daemon).
Scheduling of LHCb Jobs in Jenkins
The LHCb Nightly Build System uses a few interdependent Jobs in a Jenkins instance, divided into two groups: build and test.
The builds are handled by four generic Jobs that, depending on their parameters, operate on one slot or another (a minimal triggering sketch is shown after the list):
Execution flow of Jenkins Jobs used for the builds.
- nightly-slot: run once a night to start the checkout and build of every enabled slot
- nightly-slot-checkout: triggered once per slot by nightly-slot, checks out the projects defined in the slot configuration; if there are preconditions to check (e.g. wait for the LCG nightly builds), the Job nightly-slot-precondition is triggered (once per requested platform), otherwise it directly triggers nightly-slot-build-platform (once per requested platform)
- nightly-slot-precondition: triggered only if needed by nightly-slot-checkout, it checks the preconditions defined in the slot configuration (e.g. waiting for the LCG builds); when/if the precondition is met, it triggers nightly-slot-build-platform
- nightly-slot-build-platform: when started, declares to the test jobs which builds are expected (slot, project, platform), then starts the build of all the projects in the specified slot for the specified platform
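As an illustration of how such a parameterized chain could be driven, here is a minimal sketch that uses the python-jenkins package to start one nightly-slot-checkout run per slot. In the production system the triggering is handled by the Jobs' own Jenkins configuration; the URL, credentials and parameter names below are assumptions.

```python
# Minimal sketch: start one checkout Job per slot through the Jenkins REST
# API (python-jenkins).  URL, credentials, parameter names and slot names are
# hypothetical; the real Jobs trigger each other via their own configuration.
import jenkins

server = jenkins.Jenkins('https://jenkins.example.cern.ch',  # assumed URL
                         username='builder', password='secret')

for slot in ['lhcb-head', 'lhcb-gaudi-head']:  # example slot names
    # Passing the slot as a build parameter is what makes the four Jobs
    # "generic": the same Job definition operates on one slot or another.
    server.build_job('nightly-slot-checkout', {'slot': slot})
```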
Execution flow of Jenkins Jobs used for the tests.
The tests are run asynchronously with respect to the build, so that they can be started as soon as the build of the project to be tested is ready. This is achieved using two dedicated Jobs:
- nightly-slot-test-poll: run every 10 minutes, starting from 4:00 AM, checks which of the builds declared by nightly-slot-build-platform are ready to be tested, and triggers one nightly-slot-test for each slot, project and platform (with a maximum of 20 triggered jobs per execution, to avoid overloading Jenkins)
- nightly-slot-test: triggered by nightly-slot-test-poll, takes the project binaries to be tested (produced by nightly-slot-build-platform) and runs the tests
All jobs are configured to retry the execution in case of failure and to send a mail to lhcb-core-soft-alarms@cern.ch (the mail is also sent when a Job succeeds after a failed execution).
See LHCbNightliesTroubleshooting for more information on the failure conditions as well as on manual operation of the builds.
Distributing the Load on Different Machines
As already mentioned, Jenkins has a mechanism to execute Jobs on different machines, either to access more computing power or to run on different configurations and architectures.
Mapping between virtual machines and Jenkins slaves in LHCb Nightly Build System.
This functionality relies on the concepts of slaves and executors: in the simplest configuration, slaves map to machines and executors to CPUs, but nothing prevents having several slaves on one machine or mapping executors (logically) to multiple CPUs. Slaves can be grouped logically with labels, which can be used to declare where a Job should be executed.
The LHCb Nightly Build System uses a pool of 12 8-core SLC6 virtual machines and 3 8-core SLC5 ones (with more platforms to come), plus another machine to run Jenkins and CouchDB.
The machines are organized in:
- 12 slaves * 1 executor, labeled slc6-build
- 12 slaves * 6 executors, labeled slc6
- 3 slaves * 1 executor, labeled slc5-build
- 3 slaves * 6 executors, labeled slc5
Build Jobs are sent to the slaves labeled slc6-build or slc5-build, while checkout and test Jobs are sent to the plain slc6 and slc5 labels.
Since the release procedure uses the Nightly Build System, to ensure that release builds can run even when the system is very busy, we added an extra label (-release), attached to all the -build slaves plus one additional slave, so that, for example, release builds for SLC6 have access to 12 (shared) + 1 (exclusive) executors.
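To make the label scheme concrete, the small sketch below derives the node label a build Job could request from an LHCb platform string; the helper function and the release label naming are assumptions for illustration, not part of the actual Job configuration.

```python
# Hypothetical helper: derive a Jenkins node label from an LHCb platform
# string such as "x86_64-slc6-gcc48-opt".  In the real system the labels are
# set in the Jobs' configuration; this only illustrates the naming scheme.
def build_node_label(platform, release=False):
    os_name = platform.split('-')[1]      # e.g. "slc6" or "slc5"
    if release:
        return os_name + '-release'       # release builds: shared + 1 exclusive slave
    return os_name + '-build'             # nightly builds: 1 executor per slave

assert build_node_label('x86_64-slc6-gcc48-opt') == 'slc6-build'
assert build_node_label('x86_64-slc5-gcc46-opt', release=True) == 'slc5-release'
```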
The Artifacts' Archive
The archive of artifacts is essentially a storage space that can serve the contained files via HTTP. We store there all the products of the nightly builds, from source and binary tarballs to test results summaries.
This archive has a crucial role in the system because it is used for deployment of the builds to AFS and CVMFS as well as to exchange files between the different Jenkins Jobs.
The implementation is very simple: we use ssh to write to an HTTP-served directory on lhcb-archive.cern.ch. There is one directory per nightly build flavour, containing one directory per slot. Each slot directory hosts the build data in a directory named after the numeric unique id of the slot build, with moving soft links to these directories named with the corresponding date (YYYY-MM-DD), the day name (three-letter abbreviation), and Today.
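The following sketch illustrates this layout and the "moving" soft links; the paths, flavour and slot names are examples, not the actual ones used on lhcb-archive.cern.ch.

```python
# Sketch of the per-slot layout in the artifacts' archive and of the moving
# soft links (date, day name, Today).  Paths and names are examples only.
import os
from datetime import date

def publish_build(archive_root, flavour, slot, build_id):
    slot_dir = os.path.join(archive_root, flavour, slot)
    build_dir = os.path.join(slot_dir, str(build_id))  # numeric unique id
    os.makedirs(build_dir, exist_ok=True)

    today = date.today()
    for name in [today.isoformat(),     # e.g. "2015-07-01"
                 today.strftime('%a'),  # e.g. "Wed"
                 'Today']:
        link = os.path.join(slot_dir, name)
        if os.path.lexists(link):
            os.remove(link)             # move the link to the new build
        os.symlink(str(build_id), link)

publish_build('/data/archive', 'nightly', 'lhcb-head', 1042)
```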
The Nightly Build Steps
The Scripts
The Jenkins Jobs use a collection of scripts to perform their tasks. These scripts are in a Git repository hosted at gitlab.cern.ch: https://gitlab.cern.ch/lhcb-core/LbNightlyTools.
The package is organized in a few subdirectories, including:
- python: Python modules containing the actual code of the scripts
- scripts: mainly small scripts delegating the actions to the Python modules
- jenkins: wrapper scripts used to call the main scripts from the Jenkins Jobs
Each Jenkins Job with a role in the LHCb Nightly Build System is configured to get the latest version of LbNightlyTools from the main repository before executing its main actions. Then, depending on the operation it must perform, it calls the corresponding script in the jenkins directory (checkout.sh for the checkout, build.sh for the build, etc.).
To simplify the maintenance, reduce code duplication and improve readability, the action scripts in jenkins rely on (bash) functions defined in scripts under jenkins/utils.d.
An important feature of the scripts used in the Nightly Build System is that they can be run by hand to build and test a private copy of the LHCb software stack or of part of it.
Start
Every night, just after midnight, a Jenkins Job is run to start all the enabled slots.
First we download (from Git) the files describing the configuration of the slots, then we identify the slots that are enabled and trigger a checkout job for each of them.
The checkout jobs are described in the next section.
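A minimal sketch of this selection of the enabled slots is shown below; the JSON layout of the configuration files (one file per slot with a disabled flag) is an assumption used only for illustration.

```python
# Sketch of selecting the enabled slots from the slot configuration files.
# The JSON layout (one file per slot, a "disabled" flag) is assumed here.
import glob
import json

def enabled_slots(config_dir):
    slots = []
    for path in sorted(glob.glob(config_dir + '/*.json')):
        with open(path) as config_file:
            cfg = json.load(config_file)
        if not cfg.get('disabled', False):
            slots.append(cfg['slot'])
    return slots

# Each name returned here would be passed as the "slot" parameter of a
# nightly-slot-checkout run (see the triggering sketch earlier).
print(enabled_slots('./configs'))
```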
Checkout
For each slot, we run a checkout job, which retrieves (again) the slot configuration files, then checks out from the repositories the source code of the projects to be built in the current slot.
After the checkout, the sources are patched to form a consistent set (fixing dependencies), with the applied changes stored in a patch file for reference.
Sources are finally packed in compressed tarballs and sent to the artifacts' archive together with a copy of the relevant configuration file and some checksum files.
If the slot requires some preconditions to be checked, then we trigger jobs to check/wait for the preconditions (one per platform to be built), otherwise we directly trigger the build jobs (one per platform).
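As an illustration of the packing step, here is a small sketch producing a compressed source tarball and a checksum file; the file names and the checksum format are assumptions.

```python
# Sketch of packing the checked-out sources of one project into a compressed
# tarball and writing a checksum file next to it, before both are uploaded
# to the artifacts' archive.  Names and formats are illustrative only.
import hashlib
import tarfile

def pack_sources(project_dir, tarball):
    with tarfile.open(tarball, 'w:bz2') as tar:
        tar.add(project_dir, arcname=project_dir.rstrip('/').split('/')[-1])
    # one "<hash>  <filename>" line, in the style of md5sum output
    with open(tarball, 'rb') as packed:
        digest = hashlib.md5(packed.read()).hexdigest()
    with open(tarball + '.md5', 'w') as checksum_file:
        checksum_file.write('%s  %s\n' % (digest, tarball))

pack_sources('./Gaudi', 'Gaudi.src.tar.bz2')
```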
Preconditions Check (optional)
At the moment, this step is used only to wait for the completion of SFT builds, but it's a very flexible mechanism that could be used to throttle builds, or to disable slots/platforms depending on some special conditions.
This job gets from the artifacts' archive the configuration files of the current slot and runs the precondition function, which will define when (or if) to trigger the build job.
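A sketch of what such a precondition function could look like, assuming (hypothetically) that the availability of the LCG nightly build is signalled by a marker file; the path, timeout and polling interval are examples.

```python
# Hypothetical precondition: the build Jobs of a slot are triggered only once
# the corresponding LCG/SFT nightly build has been published, modelled here
# as the appearance of a marker file.  Path and timings are examples.
import os
import time

def wait_for_lcg(marker_path, timeout=6 * 3600, poll=300):
    """Return True when the marker file appears, False after the timeout."""
    waited = 0
    while waited < timeout:
        if os.path.exists(marker_path):
            return True   # precondition met: trigger nightly-slot-build-platform
        time.sleep(poll)
        waited += poll
    return False          # give up: the build Job is not triggered

# wait_for_lcg('/cvmfs/sft.cern.ch/lcg/nightlies/dev4/Today/done')
```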
Build
When the build job is started, it gets configuration and source tarballs for the current slot from the artifacts' archive. Then we record in a special directory the list of expected build artifacts, needed to know which test jobs should be started.
Each project is built in dependency order, and, project by project, the binary files are packed and sent to the artifacts' archive together with build log summaries.
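The dependency ordering can be illustrated with a small sketch; the dependency table is an example, while in the real system the order is derived from the slot configuration.

```python
# Sketch of building the projects of a slot in dependency order (a simple
# topological sort).  The dependency dictionary is an example only.
def build_order(deps):
    """deps maps project -> list of projects it depends on."""
    done, order = set(), []
    def visit(project):
        if project in done:
            return
        for dep in deps.get(project, []):
            visit(dep)
        done.add(project)
        order.append(project)
    for project in deps:
        visit(project)
    return order

deps = {'Gaudi': [], 'LHCb': ['Gaudi'], 'Lbcom': ['LHCb'], 'Rec': ['LHCb']}
for project in build_order(deps):
    print('building', project)  # then pack and upload the binaries and logs
```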
Test Polling
Every 10 minutes, starting from 4:00 AM, we check if any of the projects declared by the build step is ready to be tested. Then we trigger a test job for each slot/project/platform ready for testing.
To avoid congestion in Jenkins, we trigger at most 20 test jobs every 10 minutes.
The reason why we wait until 4:00 AM before starting the tests is that in this way we essentially dedicate the full CPU power of the build farm to the builds for the first few hours. By then, a good fraction of the builds will be done and we can use (almost) all the CPU power for the tests. This proved to give a better build throughput than the other scheduling policies we tested.
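A minimal sketch of the throttled polling logic, assuming hypothetical helpers for the list of expected builds and for the readiness check, and the same python-jenkins connection as in the earlier sketch:

```python
# Sketch of the polling logic: compare the builds declared as expected by
# nightly-slot-build-platform with those already available, and trigger at
# most 20 test Jobs per cycle.  Helpers and parameter names are hypothetical.
import jenkins

MAX_TRIGGERS = 20

def expected_builds():
    # (slot, project, platform) tuples declared by the build Jobs
    return [('lhcb-head', 'LHCb', 'x86_64-slc6-gcc48-opt')]

def is_ready(slot, project, platform):
    # e.g. check the artifacts' archive for the binary tarball
    return True

server = jenkins.Jenkins('https://jenkins.example.cern.ch')
triggered = 0
for slot, project, platform in expected_builds():
    if triggered >= MAX_TRIGGERS:
        break  # avoid overloading Jenkins
    if is_ready(slot, project, platform):
        server.build_job('nightly-slot-test',
                         {'slot': slot, 'project': project,
                          'platform': platform})
        triggered += 1
```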
Test
The test job gets the build artifacts for the project and platform to be tested (using the same tool used to install the nightly builds on AFS and CVMFS) and runs the tests.
The test results summaries are sent to the artifacts' archive, from where they can be checked by the developers.
The Dashboard
The results of the Nightly Builds are reported on a web page, the dashboard. The information is organized in one table per slot, with one row per built project and two columns per platform, for builds and tests.
Jenkins jobs store details about the build and test results in a CouchDB database in the form of JSON documents. Map-reduce algorithms in the CouchDB instance extract and format the information in these documents so that it can be used to fill the tables of the dashboard.
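As an illustration, the sketch below stores a build summary as a JSON document via the plain CouchDB HTTP API; the database name, document id scheme and document fields are assumptions.

```python
# Sketch of storing a build summary in CouchDB as a JSON document, using the
# standard CouchDB HTTP API.  Database name, id scheme and fields are assumed.
import requests

def store_summary(couchdb_url, slot, build_id, project, platform, summary):
    doc_id = '%s.%s.%s.%s' % (slot, build_id, project, platform)
    doc = {'type': 'build-result',
           'slot': slot, 'build_id': build_id,
           'project': project, 'platform': platform,
           'summary': summary}
    # PUT /<database>/<doc_id> creates the document (409 on conflict)
    r = requests.put('%s/nightlies/%s' % (couchdb_url, doc_id), json=doc)
    r.raise_for_status()

store_summary('http://localhost:5984', 'lhcb-head', 1042,
              'LHCb', 'x86_64-slc6-gcc48-opt', {'warnings': 3, 'errors': 0})
```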
The HTML, JavaScript and CouchDB views are available in the couchdb directory of https://gitlab.cern.ch/lhcb-core/LbNightlyTools, in the form of CouchApps managed with the tool erica.
More Details
Configuration Details of Jenkins Jobs
-- MarcoClemencic - 2015-07-01
-- MarcoCattaneo - 2016-08-26 - updated obsolete URLs, fixed typos