LHCb Nightly Build System Troubleshooting and Operation

It is the responsibility of the Deployment Shifters to check that the nightly builds are functional.

Introduction

First of all, make sure you have read LHCbNightliesImplementation to understand how the various pieces fit together.

Ideally it would not happen, but in practice Jenkins jobs can fail for a number of reasons. Shifters receive a mail for every failed Jenkins job related to the nightly builds, as well as for the first successful job after a failure.

In the most common cases the failure is due to a glitch in the infrastructure (communication between Jenkins and its slaves, connection to the git/svn servers), and the automatic retry we use for most jobs is enough to recover.

How to Read Jenkins Mails

There are two types of mails: failure and back to normal.

Failed Builds

In case of failure, the shifter will get a mail with a subject like:

Build failed in Jenkins: <job name> <build name or id>

where <job name> is the name of one of the jobs described in LHCbNightliesImplementation, and <build name or id> is either a numeric id (if the job failed very early) or a human-readable name for the build, such as <flavour>.<slot>.<id>.

The body of the mail starts with a link to the failed build in the Jenkins web interface, followed by an excerpt of the console output of the build.
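As an illustration of the subject format described above, the following is a minimal Python sketch that splits a failure subject into its parts. It is not part of the nightly build system. It assumes that the job name contains no whitespace (so the build name or id is the last token of the subject) and that the id in <flavour>.<slot>.<id> is numeric; the job and slot names in the usage example are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Subject prefix as quoted in the text above.
FAILED_PREFIX = "Build failed in Jenkins:"

# Assumed shape of a human-readable build name: <flavour>.<slot>.<id>,
# with a numeric <id> (assumption, not confirmed by the source).
BUILD_NAME_RE = re.compile(r"(?P<flavour>[^.]+)\.(?P<slot>.+)\.(?P<id>\d+)")


class FailedBuild(NamedTuple):
    job_name: str
    build: str                 # raw <build name or id>
    flavour: Optional[str]     # None when only a bare id is available
    slot: Optional[str]
    build_id: Optional[str]


def parse_failed_subject(subject: str) -> Optional[FailedBuild]:
    """Return the parts of a failure subject, or None if it is another kind of mail."""
    subject = subject.strip()
    if not subject.startswith(FAILED_PREFIX):
        return None
    rest = subject[len(FAILED_PREFIX):].strip()
    # Assumption: the build name or id is the last whitespace-separated token.
    parts = rest.rsplit(None, 1)
    if len(parts) != 2:
        return None
    job_name, build = parts
    match = BUILD_NAME_RE.fullmatch(build)
    if match is None:
        # Early failure: only an id, no <flavour>.<slot>.<id> name.
        return FailedBuild(job_name, build, None, None, None)
    return FailedBuild(job_name, build, match["flavour"], match["slot"], match["id"])
```

For example, with a hypothetical job called example-job, parse_failed_subject("Build failed in Jenkins: example-job nightly.example-slot.123") yields the flavour, slot and id separately, while a subject ending in a bare id leaves those three fields empty, which hints that the job failed very early.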

Build Back to Normal

For the first successful build after a failure, the shifter will receive a mail with a subject like:

Jenkins build is back to normal : <job name> <build name or id>

The body of the mail consists only of a link to the successful build in the Jenkins web interface.

It is important to follow the link and look at the preceding build ("Previous Build" link) to find out why it failed, because sometimes a build fails so early that the failure mail is not sent.
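The case where no failure mail was sent can be spotted from the mail subjects alone. The sketch below is a hypothetical helper, not part of the nightly build system: given the subjects in chronological order, it lists the "back to normal" mails that were not preceded by a failure mail for the same job. It assumes that the job name contains no whitespace and that the build name or id is the last token of the subject. It does not replace following the "Previous Build" link; it only highlights the mails where that check is the only source of information.

```python
from typing import Iterable, List

# Subject prefixes as quoted in the text above.
FAILED_PREFIX = "Build failed in Jenkins:"
NORMAL_PREFIX = "Jenkins build is back to normal :"


def job_of(subject: str, prefix: str) -> str:
    """Extract the job name, assuming the build name or id is the last token."""
    rest = subject[len(prefix):].strip()
    parts = rest.rsplit(None, 1)
    return parts[0] if len(parts) == 2 else rest


def recoveries_without_failure_mail(subjects: Iterable[str]) -> List[str]:
    """Return 'back to normal' subjects with no earlier failure mail for the same job.

    `subjects` must be in chronological order (oldest first).
    """
    jobs_with_failure_mail = set()
    unexplained = []
    for subject in subjects:
        subject = subject.strip()
        if subject.startswith(FAILED_PREFIX):
            jobs_with_failure_mail.add(job_of(subject, FAILED_PREFIX))
        elif subject.startswith(NORMAL_PREFIX):
            job = job_of(subject, NORMAL_PREFIX)
            if job in jobs_with_failure_mail:
                # The failure was already reported by mail; the job has now recovered.
                jobs_with_failure_mail.discard(job)
            else:
                # No failure mail seen: the failed build must be inspected by hand.
                unexplained.append(subject)
    return unexplained
```

With subjects for two hypothetical jobs, where job-a has both a failure and a recovery mail and job-b has only a recovery mail, the function returns just the job-b subject.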

Jenkins Build Farm

Jenkins uses a pool of machines called slaves to actually perform the tasks, plus one special machine called master.

In our configuration, the master is used for jobs that need little or no CPU (e.g. waiting for LCG nightly builds, polling), and the slaves are used for builds and tests. Note that there is no one-to-one mapping between machines and slaves; instead we use a special partitioning (see LHCbNightliesImplementation).

master

The _master_ node is crucial and must always be up: if it goes offline, the whole system is blocked.

slaves

-- MarcoClemencic - 2015-07-06

Topic revision: r2 - 2015-07-07 - MarcoClemencic
 