Running LHCb software on the ARM architecture

LHCb is the first experiment (that we know of :-)) to have (almost) its full software stack running on the ARM architecture.

Goals:

  • Benchmarking the performance of LHCb software on ARM-based servers
  • Measuring the power consumption of the Boston Viridis server while running LHCb software

Current Status

  1. The first complete HEP computing stack running on ARM!! (that we know of :))
  2. The full LHCb software stack is compiled, up and running on the "CARMA" development board AND the "Boston Viridis" server (details below)
  3. Test results are consistent with x86_64 (well within acceptable precision)
  4. Timing measurements are being collated; ideally the tests will be deployed on ARM servers

Mini-goals:

  1. Build ARM cross-compiler
  2. Compile LHCb software on CARMA board
  3. Run benchmarks using Brunel (Event reconstruction) test runs on CARMA board
  4. Package the test-suite for easy deployment
  5. Repeat Brunel tests on Boston Viridis ARM servers [may involve partial rebuild]
  6. Collect numbers for running-time and power consumption

Build ARM cross-compiler

  1. Used lab11 to build the cross-compiler (we can also use the same machine for cross-compilation later)
  2. Built gcc-4.6.2 to cross-compile for the ARMv7 target architecture. The target triplet used is arm-redhat-linux-gnueabi
  3. crosstool-ng was used to generate the toolchain
  4. Toolchain is available at /group/online/arm/crutches/arm-redhat-linux-gnueabi/
    1. Salient features are (see the compile-time check sketched after this list) -
      1. hard float ABI
      2. FPU = vfpv3-d16 (NOT NEON)
      3. CPU = cortex-a8 (NOT cortex-a9)
    2. Most relevant libraries (binutils, mpfr, gmp, mpc, elf, ppl and cloog) are the same versions as in /sw/lib/lcg
    3. glibc is "relatively" old (= 2.14.1)
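
A quick way to confirm that a given ARM toolchain really targets the hard-float ABI and a VFP-only (non-NEON) FPU is to compile a tiny translation unit that checks the GCC predefined macros for these settings. The following is only an illustrative sketch (the file name and the cross-compiler invocation in the comment are examples, not part of the LHCb build):

    // abi_check.cpp - compile with e.g. arm-redhat-linux-gnueabi-g++ -c abi_check.cpp
    #ifndef __arm__
    #error "expected an ARM target"
    #endif
    #ifndef __ARM_PCS_VFP
    #error "expected the hard-float ABI (-mfloat-abi=hard)"
    #endif
    #ifdef __ARM_NEON__
    #error "expected a VFP-only FPU (e.g. vfpv3-d16), not NEON"
    #endif
    int main() { return 0; }

If any of the #error lines fires, the toolchain was configured with a different ABI or FPU than intended.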

Compile LHCb software on CARMA board

CARMA = CUDA on ARM [Shipped by SECO]

[Figure: carmaboard.png - the CARMA development board]

CARMA board = development kit for GPU programming, with the following on board

  1. CPU = NVIDIA Tegra 3, a Quad-core ARM Cortex-A9 CPU running at around 1.3 GHz
  2. GPU = NVIDIA Quadro 1000M
  3. Onboard RAM = 2 GB
  4. Dedicated GPU memory = 2 GB

We are not using the development board for GPU programming; we limit ourselves to the ARM CPU.

Development environment on the CARMA board -

  1. Ubuntu 11.04
  2. Custom kernel and filesystem shipped with the board. Kernel and arch are :
    1. carma-devkit:~$ uname -a
      Linux carma-devkit 3.1.10-carma #2 SMP PREEMPT Fri Aug 31 15:28:42 PDT 2012 armv7l armv7l armv7l GNU/Linux
  3. Not a hard-float kernel - which is not ideal

Compile LHCb software on Boston Viridis server

Viridis = Low-power ARM-based "microserver" shipped by Boston Ltd.

[Figure: bostonviridis.png - the Boston Viridis server]

Boston Viridis = 2U chassis with a maximum of 48 compute nodes

  1. CPU = Calxeda EnergyCore (SoC), with one Quad-core ARM Cortex-A9 CPU (per node) running at around 1.1 GHz
  2. Onboard RAM per node = 4 GB
  3. Claimed 5W per node maximum power consumption

Development and running environment on the Boston Viridis -

  1. Fedora 18
  2. Hard-float kernel:
    1. cloud12 ~$ uname -a
      Linux cloud12 3.6.10-8.fc18.armv7hl.highbank #1 SMP Tue Jan 29 14:01:38 EST 2013 armv7l armv7l armv7l GNU/Linux

Build the LHCb software stack

Stack to build

  1. The following projects needed to be built
    1. MOORE (8 packages) : 150,000 LOC (Physical lines of code)
    2. BRUNEL (7 packages) : xxxxxx LOC
    3. HLT (26 packages) : 320,000 LOC
    4. PHYS (32 packages) : 51,000 LOC
    5. REC (62 packages) : 500,000 LOC
    6. LBCOM (33 packages) : 410,000 LOC
    7. LHCB (95 packages) : 1.2 million LOC
    8. GAUDI (21 packages) : 976,000 LOC
    9. LCGCMT (18 packages, including GCC, Boost, ROOT, COOL, CORAL, Qt, SQLite among other external packages)
    10. DBASE and PARAM (8 packages): Data packages

[Figure: depgraph2.png] Partial view of the LHCb software stack, up to the event reconstruction packages collectively called "Brunel"

The LHCb software stack adds up to a total of approximately 3.6 million lines of code

This does not include the external software tools that needed to be built for ARM (GCC, Qt, CLHEP etc.)

Compile toolchain

  1. We built the toolchain with GCC 4.7.2 natively on the CARMA board to -
    1. Stay current with the compiler installed on the Boston Viridis node
    2. Use updated libstdc++
  2. A side-effect of using gcc-4.7 is that Boost-1.48 does not play nicely with it, so we have built Boost-1.51 for our use (Boost-1.48 is used in the current LHCb software, which is compiled with gcc-4.6)

Building ROOT on ARM

  1. Native compile of ROOT (v5.34.05)
    1. Need to force the ROOT build-system to build all required features (like Cintex, Reflex, RooFit etc.) for ARM
    2. The trampoline code for particular function-calls in Cintex seems to be a grey area for non-x86 architectures (Cintex is automatically disabled on the iOS build supported by ROOT)
      1. CMS posted a patch on the ROOT savannah webpage
      2. After several trial builds of ROOT with the patch and tuning of other build parameters, we have now decided not to use this patch
      3. "make test-cintex" fails, but the tests under ROOT/tests run fine - TO BE SEEN

Building GAUDI

  1. Needed to patch requirements and build policy to -
    1. build with updated Boost (not anticipated in this Gaudi version)
    2. continue accepting source-code constructs that were only warnings with earlier GCC versions but are flagged as errors by gcc-4.7
  2. Needed to patch source code to -
    1. update Boost-related calls in functions
    2. clear compiler errors (around function templates, among other things); one typical class of such errors is sketched after this list
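
It is not certain that this is exactly the class of error hit in the Gaudi code, but a typical example of code that earlier GCC versions accepted and gcc-4.7 rejects comes from the stricter two-phase name lookup: an unqualified call inside a function template to a function that is only declared after the template no longer compiles. A minimal sketch of the problem and the fix (hypothetical code, not taken from Gaudi):

    // lookup_demo.cpp - hypothetical illustration, not actual Gaudi code.
    // gcc-4.6 tolerated calling a plain function from a template even if the
    // function was only declared after the template; gcc-4.7 removed that extra
    // instantiation-time lookup, so such code now fails with
    // "'report' was not declared in this scope".

    void report(int);          // fix: declaration moved before the template

    template <typename T>
    void process(T value) {
      report(value);           // OK now: the name is visible at definition time
    }

    void report(int) {}

    int main() {
      process(42);
      return 0;
    }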

Building LHCb and upwards in the software stack

  1. Similar tweaks (as Gaudi) necessary for successful builds of most projects
  2. Needed to pick out and disable the packages/algorithms that used NeuroBayes (Since it is only available as a binary for x86/x86_64)
  3. A problem that occurred repeatedly must be noted here -
    1. Build fails with: "Bad instruction: fwait"
    2. The compiler seems to be emitting this instruction, and the assembler refuses to accept it on the grounds that it is not part of the ISA (!?)
    3. Perhaps due to a bad ABI configuration of the compiler? Related to the chosen FPU?
    4. *Need to check why this happens*
    5. Temporarily avoiding this problem by removing the fwait instruction by hand mid-compile so that the assembler accepts the input! (Requires manually intervening in the build and "relaunching" with modified build parameters)
      1. A quick glance seems to indicate that an explicit fwait has been made redundant in recent architecture versions (a possible guard is sketched after this list)
  4. Another very architecture-specific problem is -
    1. Build fails with a Boost static assert in home-made function code
    2. The static assert in the source code relies on the size difference between an empty struct and a struct containing two chars in order to branch (if-else). On ARM both structs turn out to have the same size, so the wrong branch was always taken, with disastrous consequences. Changing the second struct to hold two ints solved the problem, since that struct is guaranteed to have a different size from the empty one (a sketch of this pattern follows this list).
    3. This was found on multiple occasions - such functions need to be made more architecture-independent
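
Two sketches to illustrate the problems above. First, if the stray fwait turns out to come from an explicit inline-assembly statement in the source (for example, code that synchronises the x87 FPU before inspecting floating-point exceptions) rather than from the compiler itself, guarding it with an architecture check would keep it out of ARM builds. This is purely a hypothetical sketch, not actual LHCb code:

    // fwait_guard.cpp - hypothetical sketch, not actual LHCb code.
    inline void sync_fpu() {
    #if defined(__i386__) || defined(__x86_64__)
      __asm__ __volatile__("fwait");   // x87-only instruction, meaningless on ARM
    #endif
      // On other architectures (including ARM) nothing is needed here.
    }

    int main() {
      sync_fpu();
      return 0;
    }

Second, a reconstruction of the sizeof-based dispatch pattern described in point 4 and the change that made it work on ARM. Again this is a hypothetical sketch (the marker struct names are made up, and Boost is assumed to be available), not the actual LHCb code:

    // sizeof_dispatch.cpp - hypothetical reconstruction of the pattern above.
    #include <boost/static_assert.hpp>

    struct small_marker {};                 // the "empty" struct
    // struct large_marker { char c[2]; };  // original: size was not distinct on ARM
    struct large_marker { int i[2]; };      // fix: size now clearly differs

    // The dispatch only works if the two markers really have different sizes.
    BOOST_STATIC_ASSERT(sizeof(small_marker) != sizeof(large_marker));

    template <typename T>
    bool is_large() { return sizeof(T) == sizeof(large_marker); }

    int main() {
      return is_large<large_marker>() ? 0 : 1;
    }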

Testing

Brunel test runs: Event reconstruction

  1. Running the Brunel test using raw data from 2012 -
    1. Data specs: 2012 proton-proton collision data with 4000 GeV beams, VeLo closed and "Down" magnet polarity.
    2. Use the options file from the regular x86 tests (via PRConfig), modified to remove all dependencies on NeuroBayes
    3. Read data from NFS instead of CASTOR
    4. Brunel results seem to match the x86_64 test run results (with acceptable/minor differences)
    5. The results are now automatically zipped up neatly for insertion into a common database together with results from other targets and configurations (cf. the Jenkins-based build/test suite)
    6. Time taken: Each test -
      1. Processes 1000 events
      2. Runs on one core (!)
      3. Takes ~0.5 hours on an Intel Xeon server
      4. Takes ~2.5 hours on the Boston Viridis server
      5. Takes ~10 hours on the CARMA board
        1. Reasons for 10 hours on the CARMA board?
          1. Tegra 3 + CARMA board inherently slow (HEPSPEC/Coremark?)
          2. No hard-float ABI in compiler (or in the compiled kernel) for the CARMA board

[Figure: timecomparison.png] Table showing the time taken by different processors to run the same Brunel test job.

[Figure: timeincreasewithjobs.png] Chart showing the adverse effect of running concurrent jobs on the event reconstruction time (on a Boston Viridis node).

[Figures: brunelcomparison02.png, brunelcomparison01.png] Tables showing the minor differences between results obtained on ARMv7 and x86_64. The differences are thought to arise from precision differences between the 32-bit and 64-bit architectures.

Tasks for the immediate future

  1. Run the Brunel tests for x86 (instead of x86_64) to make a more relevant 32-bit to 32-bit comparison
  2. Package the software stack and the Brunel tests for easy/quick deployment on other ARM machines
  3. Revisit the cross-compile toolchain, build the entire stack using the current cross-compiler (or build a gcc-4.7 cross-compiler)
  4. Push patched code to LHCb repository for future builds on ARM - goal is to be able to include ARM as a target for nightlies

Takeaway

  1. The LHCb software stack is fully compiled for ARM and is now running on both the CARMA board and the Boston Viridis server
  2. LHCb software/build may be reviewed in light of portability and architecture-independence
  3. Compilers are becoming stricter (at least in the case of GCC), so obsolete constructs in the source code should be rewritten to clear compile-time errors and warnings
  4. Brunel tests show results matching x86_64 Brunel runs
  5. Processing time is longer than desired - it needs to be seen if this can be lowered/tweaked by -
    1. GCC m-flags
    2. Kernel config
    3. Boston Viridis servers [UPDATE: Processing time for ARM is cut down by a factor of around 3.5-4 with a Boston Viridis node as compared to the CARMA board]