Data Processing Project in CMS: a historical perspective

Introduction

At the end of 2004 the "prototyping phase" for a Data Processing system for the CMS experiment at LHC was declared over and a new Computing Project was initiated with the aim of delivering the system for the operation phase of the experiment, at that time foreseen for summer 2007. The years 2004 and 2005 also corresponded to a major influx of new human resources made available from projects that were closing down at Fermilab, DESY and SLAC. This was therefore also the opportunity for a restructuring of the project itself and of its management structure. The project is described in broad terms in the CMS Computing Technical Design Report (June 2005), as well as in a part of the LHC Computing Grid Technical Design Report (June 2005), and in much more detail in the CMS Physics Technical Design Report Volume I: Detector Performance and Software (February 2006).

On the technical side the new project encompassed:

  • A restructuring of the physical organization of the code
  • A re-design and re-implementation of the processing frameworks and the I/O and storage layers
  • The porting of all scientific algorithms to the new infrastructure
  • A new software development process, including a new set of tools to support it
  • A new infrastructure for software validation and verification
  • A new framework for Data Quality Monitoring
All these components further evolved in the following years and are still in active development today (year 2011).

The new system was ready to support the detector commissioning in spring 2006 and the scientific work in preparation for the Physics Performance TDR, and was fully tested in the Computing, Software, Analysis Challenge CSA06 (a full end-to-end test). These activities were followed by more commissioning runs and new CSA challenges in 2007 and 2008, to be finally ready for beam in August 2008. At the same time a series of "service challenges" (grid end-to-end tests) were organized that culminated in the Combined Computing Readiness Challenge (CCRC'08) in spring 2008. The rest is history, including an additional commissioning year in 2009. Among the many further developments and activities after the start of physics at LHC, it is worth noting how, due to the impressive performance of the LHC machine, which in summer 2011 delivered a much higher intensity (luminosity) than initially foreseen, a dedicated effort was mounted to improve the computational performance of the prompt reconstruction and gain a further factor of two in both speed and memory usage.

The evolution of the project can be tracked in the reports to the various annual reviews (access restricted) and in conference reports.

Data Processing at LHC

A general account of the status in 2010 for all four experiments can be found in The Experiment Offline Systems after One Year, CHEP 2010.

The description of the CMS Data processing system in the article CMS Data Processing Workflows during an Extended Cosmic Ray Run still applies.

To get an idea of the data volume and resource usage in 2011 one can look at the Data Processing Status in CMS as of October 17:

  • 0.5 PB of RAW data
  • A total of 2 PB of data as output of prompt processing (and distributed to at least one more site)
  • Up to 5000 jobs (one per core) running simultaneously at CERN for prompt-reconstruction
  • 1 PB of data re-processed worldwide
  • 3 PB of simulated data produced and distributed
  • An average of 12.6K jobs running simultaneously worldwide

The testing and commissioning campaigns

The various CSA and grid service challenges, as well as the detector commissioning runs with cosmic rays, should be considered full end-to-end tests of the data-processing system in an operational state. All challenges also included human and scientific components, with operators and scientists acting as planned for the first days of beam.

Details about the planning and the outcome of those tests can be found in various internal reports (restricted access) as well as in contributions to conferences and journals such as The CMS Computing, Software and Analysis Challenge, CHEP'09, and CMS Data Processing Workflows during an Extended Cosmic Ray Run, and references therein.

Computing Service challenges and commissioning

Besides general end-to-end tests, a whole set of tests was organized to commission and verify the computing infrastructures and services, which comprise a mixture of generic LHC-grid and experiment-specific components. During these tests no real data processing was required. Job robots were used to test the processing farms and the submission and scheduling infrastructures (batch queues). Data storage and data transfers were tested using existing files. The decoupling of scientific data processing from these large-scale tests made it possible to run them independently of the release cycle of the software and of the availability of resources to actually perform end-to-end tests. Indeed we ran many more service challenges than CSA exercises. A full task force was dedicated to the debugging and commissioning of the data transfer system (see for instance Debugging Data Transfers in CMS at CHEP'09).

The Software Development Process

The current status is well described in Release Strategies: The CMS approach for Development and Quality Assurance at CHEP'10.

Release schedule and continuous integration

Verbatim from Liz's slides:

With the advent of data taking there is a balance to be struck between software release stability for operations and the need to improve the physics and technical performance of the code.

  • The feature set for a development release is known within the first week or two of the cycle => avoids mission creep
  • Pre-release schedule is set in advance and is usually 6 weeks; however some accommodation is made for the major CMS milestones, and the seasonal variation of manpower.
  • New developments are continuously integrated 2 times a day in the IB, release validation tests are run by our data operations team once a week. Developers are asked to check the results.
  • The cycle ends with a high statistics data processing which is signed off by the Physics Validation Team.

Once a release is signed off it can go into production use. This decision is made by the whole collaboration and depends on events like accelerator technical stops and physics conference deadlines. A new development cycle is started immediately. By continuously integrating we avoid the problems of a "big bang" integration and reassure our developers that their work is valued and will eventually be put into production depending on its urgency. Once a release has gone into production, its feature set is frozen: bug fixes and a limited number of back-ports of validated features from the development release, required for data taking, are allowed. CMS has many train cycles per year but only 2 or 3 will be deployed for data taking or reprocessing.

QA

Quality Assurance procedures, infrastructures and tools are also described in the above talk.

  • Many of the "static" QA components are embedded directly in the build tools and developers are immediately notified: many QA defects produce build errors.
  • The full release is built twice a day and the full test suite (including integration tests based on realistic full data processing examples) is run.
  • The detailed performance measurement suite is run once a day.
  • Large-statistics tests (RelVal) are run for each pre-release (weekly), including the full DQM. Scientists are expected to run their own checks and report.
  • RelVal includes regression testing based on DQM.
All results are available on the web.

The validation infrastructure is described in Validation of software releases for CMS, CHEP'09.

The latest version of the Release Validation System is described in this poster presented at the ACAT 2011 conference.

Data Quality Monitoring and its use in Data Processing QA

Information for Data Certification is collected for each event as a final step during "prompt reconstruction", stored in histograms and saved to local files by each job. Histograms are then "harvested", organized by time segments and saved into a "database". During harvesting, histograms are compared to a reference sample and the differences are qualified according to predefined criteria. All these histograms are available through a web-based GUI that can be customized for the various data quality monitoring use cases. Scientists, detector experts and shifters use such a GUI to verify the status of the detector and the quality of the data processing. "Alarms" are generated when the comparison with the reference sample fails.
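
To make the comparison step concrete, the following is a minimal, hypothetical sketch (not the actual CMS DQM harvesting code) of a quality test that compares a monitored histogram to a reference using the standard ROOT compatibility tests; the file names, histogram name and threshold are invented for illustration.

    // dqm_compare.C -- hypothetical sketch, not the actual CMS DQM harvesting code.
    // Compares a monitored histogram to a reference sample and flags large deviations,
    // in the spirit of the quality tests described above.
    #include "TFile.h"
    #include "TH1.h"
    #include <cstdio>

    void dqm_compare(const char* monFile  = "run_monitoring.root",  // assumed file names
                     const char* refFile  = "reference.root",
                     const char* histName = "trackPt")              // assumed histogram name
    {
      TFile mon(monFile);
      TFile ref(refFile);
      TH1* hMon = dynamic_cast<TH1*>(mon.Get(histName));
      TH1* hRef = dynamic_cast<TH1*>(ref.Get(histName));
      if (!hMon || !hRef) { std::printf("histogram %s not found\n", histName); return; }

      // Shape comparison: Kolmogorov-Smirnov and chi2 probabilities ("UU" = both unweighted).
      double ksProb   = hMon->KolmogorovTest(hRef);
      double chi2Prob = hMon->Chi2Test(hRef, "UU");

      // Predefined criterion (threshold chosen arbitrarily for illustration).
      const double threshold = 0.01;
      bool alarm = (ksProb < threshold) || (chi2Prob < threshold);
      std::printf("%s: KS prob = %.3g, chi2 prob = %.3g -> %s\n",
                  histName, ksProb, chi2Prob, alarm ? "ALARM" : "OK");
    }

In the real system the choice of test, the thresholds and the reference sample are configured per histogram, and the outcome of such tests is what drives the "alarms" mentioned above.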

A description of the DQM system and its integration in the data processing can be found in the "Data Processing Workflows" paper, as well as in the CHEP'09 contributions CMS data quality monitoring: systems and experiences and CMS data quality monitoring web service.

The very same system is used for Release Validation: the reference samples in this case are just the results from the previous pre-release (one week old) and production-release (previous cycle).

Performance Improvements

The first CSA and service challenges uncovered various performance issues that affected almost all components of the system. Even if, a posteriori, we may say that this was not unexpected, it clearly pointed out a missing culture of "performance quality", mostly guided by the infamous motto "Premature optimization is the root of all evil" by D. Knuth himself (somewhere in here). Without entering into historical considerations, contextualization and philosophical quarrelling (enough to add one more quote from the same paper: "(Most programs are probably only run once; and I suppose in such cases we needn't be too fussy about even the structure, much less the efficiency, as long as we are happy with the answers)"), and while giving full credit to the advice that "we should strive most of all for a program that is easy to understand and almost sure to work", it must be recognized that the complexity of today's large applications, such as a scientific data processing system, requires that "performance awareness" be built into the software development process early on.

At the end of 2006, CMS decided to establish a "Performance Task Force" with the mandate to

  1. benchmark the CPU, memory and I/O performance of typical CMSSW applications (reconstruction, simulation, analysis, HLT) in CMSSW_1_2_x, using SLC4/gcc345
  2. build an infrastructure for automated evaluation and documentation of the CPU, memory and I/O performance of CMSSW applications from release to release (see the sketch below)
  3. work with CMSSW and external software developers in the release series CMSSW_1_3_x and CMSSW_1_4_x to improve the CPU, memory and I/O performance of CMSSW applications
  4. provide a roadmap for further enhancements at the end of the task force
  5. document the basic tools, and how they are used with CMSSW, for developers in the workbook
  6. document the advanced tools, and how they are used with CMSSW, for specialists
The initial mandate was to run for three to four months. It ended up lasting one year, successfully completing all six "work packages".
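
As an illustration of the kind of benchmarking that was automated (work packages 1 and 2), here is a minimal, hypothetical sketch of a per-job CPU-time and peak-memory probe using standard POSIX and C++ facilities; the real CMSSW timing and memory services are framework modules and are not shown here.

    // perf_probe.cpp -- hypothetical sketch of a per-job CPU/memory probe,
    // illustrating the kind of measurement automated by the Performance Task Force
    // (the real CMSSW Timing/Memory services are framework modules, not this).
    #include <sys/resource.h>
    #include <chrono>
    #include <cstdio>

    struct PerfProbe {
      std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();

      void report(const char* label) const {
        using namespace std::chrono;
        double wall = duration_cast<duration<double>>(steady_clock::now() - t0).count();

        rusage ru{};
        getrusage(RUSAGE_SELF, &ru);
        double cpu = ru.ru_utime.tv_sec + ru.ru_stime.tv_sec
                   + 1e-6 * (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec);
        long maxRssMB = ru.ru_maxrss / 1024;   // ru_maxrss is in kB on Linux

        std::printf("[%s] wall %.2f s, cpu %.2f s, max RSS %ld MB\n",
                    label, wall, cpu, maxRssMB);
      }
    };

    int main() {
      PerfProbe probe;
      // ... run the workload to be benchmarked here (e.g. N reconstruction-like events) ...
      probe.report("job");
      return 0;
    }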

After the completion of the work of the "Performance Task Force", systematic performance improvements were made part of the software development cycle itself. Summaries of this continuous work can be found in

all three of which contain plenty of anecdotes and examples.

A more recent presentation, "Measuring CMS Software Performance in the First LHC Years" at the IEEE NSS 2011 conference (report, slides), includes many screenshots that are not easy to obtain from outside CMS.

Quoting from the first talk:

  • Strategy 1: As a consequence of our Performance Task Force, several groups now routinely include work on improving performance into their development plans in addition to “new feature” development/changes (some of which increase the CPU requirements)
  • Strategy 2: Monitor in detail the performance over time, watching for increases and also looking for new possibilities for code performance improvements, improving things continuously.
    • No “crisis, boom and bust” cycle for performance!

As a measure of the achievements we may note how, over the year 2008, the performance of reconstruction improved from 17.1 seconds/event to 6.0 seconds/event (with a minimum down to 5.0), and in 2010 it improved from 4.1 seconds/event to 3.1 seconds/event in the span of six months, despite continued development in the reconstruction algorithms to improve their functionality.

Most of the work of the task force concerned the "scientific software" and we will come back to it later on. We will first discuss other issues, more related to infrastructure and "external components", that were tackled in collaboration with the "providers", i.e. the Worldwide LHC Computing Grid project and related teams.

Computing Services

A summary of the current status can be found for instance in Experience with CMS Offline and Computing from Commissioning to Collisions, ICHEP 2010.

Computing services were affected mostly by robustness and reliability problems rather than by computational issues. Job submission and scheduling, as well as data transfers, were initially affected by all sorts of failures, so that the global efficiency of data processing was significantly reduced. A consequence was also a major overload of the operation teams, called to frequent manual operations for unforeseen cleanup, retry and debug sessions, which introduced even more delay and inefficiency. Strictly software issues were scalability problems, often related to use patterns differing from those predicted and tested (a typical case was requests arriving in large bursts instead of being uniformly distributed in time). The intrinsic complexity of the system is such that one cannot rely on the full and continuous functioning of all its components. Besides general improvements of the software of those services, the global performance was therefore improved by increasing the resilience of the system to faults by

  • continuous monitoring of the various components
  • making the top-level software aware of faulty services and able to find alternative "paths"
  • increasing the robustness of the system by introducing automatic mitigating actions such as retries and delays (a minimal sketch follows)
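
As an illustration of the last point, here is a minimal, generic sketch of a retry-with-backoff wrapper around a call to an unreliable service; it is not the actual CMS workflow-management or transfer code, and the attempt counts and delays are arbitrary.

    // retry.cpp -- hypothetical sketch (not actual CMS workflow-management code) of the
    // "retry with delays" mitigation applied around calls to an unreliable service.
    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <thread>

    // Calls `action` up to maxAttempts times, doubling the wait after each failure.
    bool retryWithBackoff(const std::function<bool()>& action,
                          int maxAttempts = 5,
                          std::chrono::seconds initialDelay = std::chrono::seconds(2))
    {
      auto delay = initialDelay;
      for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (action()) return true;                       // success: stop retrying
        std::printf("attempt %d failed, retrying in %lld s\n",
                    attempt, static_cast<long long>(delay.count()));
        std::this_thread::sleep_for(delay);
        delay *= 2;                                      // exponential backoff
      }
      return false;                                      // give up, escalate to operators
    }

    int main() {
      // Stand-in for e.g. a file transfer or job submission that fails transiently.
      int calls = 0;
      bool ok = retryWithBackoff([&calls] { return ++calls >= 3; });
      std::printf("final result: %s after %d calls\n", ok ? "success" : "failure", calls);
      return 0;
    }
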
see for instance

Data Access

Data storage in CMS (and at LHC, indeed in the whole of HEP) is file based. ROOT is used as the software I/O layer. The main storage system at CERN and the other T1 centers is tape based, with a "large" disk staging cache. Several different server-client systems (RFIO, dCache, Xrootd) are used to serve data from the file servers to the data-processing jobs. A recent comparison of these various systems for secondary data processing (analysis) can be found in Optimization and performance measurements of ROOT-based data formats in the ATLAS experiment, CHEP 2010.

We can observe how the fact that different data centers chose different solutions for data access and storage turned out to be very beneficial from an optimization point of view. The comparison of the various systems for different use cases has eventually allowed everybody to evolve toward an optimal solution, well adapted to the size and use of the data center itself. Large data centers, like CERN, have finally deployed multiple (layered) solutions to serve the very different use cases they are asked to support.

Primary scientific data processing (reconstruction) is not in itself I/O bound (CPU usage is 100%). It should be noted, though, that the client I/O software layer (ROOT + CMS specific) was at some point using as much as 20% of the total processing time, and a non-negligible fraction of the memory, for operations like object streaming, buffer management and data compression. Scientific analysis, instead, can easily be I/O bound and its performance optimization requires tuning of the data access layer (see above).

Typical problems have been:

  • Complex data structures that make object streaming slow and often increase the size of their representation on file
  • Sub-optimal placement of objects in the "logical I/O buffers", causing an increase of the data volume in I/O
  • Sub-optimal size and placement of the "logical I/O buffers" and of their "metadata" in the file, causing mostly an increase of the seek time

Performance improvements to data access have involved:

  • Simplification of the data model
  • Better (dynamic) management of data buffers in memory
  • Explicit asynchronous I/O (pre-fetching, read-ahead, local caches, proxies); see the sketch below
  • Parallelization
  • Reordering of data buffers in the file to make I/O more sequential
  • Use of SSDs when seek time really dominates

These improvements have involved changes and tuning in all layers (scientific code, framework, ROOT and the data access services), requiring collaboration and synergies to be developed between the various development (and operation) teams to avoid "stepping on each other's feet".
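
As a concrete illustration of the read-ahead and caching item above, here is a small sketch using the public ROOT TTreeCache interface on the client side; the file, tree and branch names are invented, and the cache size and learning phase are arbitrary examples rather than CMS settings.

    // read_with_cache.C -- sketch of ROOT-level read-ahead tuning on the client side
    // (file, tree and branch names are made up for illustration).
    #include "TFile.h"
    #include "TTree.h"

    void read_with_cache()
    {
      // The file may sit behind any of the storage back-ends mentioned above.
      TFile* f = TFile::Open("root://someserver//store/data/events.root");
      if (!f || f->IsZombie()) return;

      TTree* tree = 0;
      f->GetObject("Events", tree);
      if (!tree) return;

      // Enable the TTreeCache: a single large buffer is filled with all the baskets
      // needed by the cached branches, turning many small reads into a few large ones.
      tree->SetCacheSize(20 * 1024 * 1024);      // 20 MB cache
      tree->AddBranchToCache("*", true);         // cache all branches read below
      tree->SetCacheLearnEntries(100);           // learn the access pattern on 100 entries

      const Long64_t n = tree->GetEntries();
      for (Long64_t i = 0; i < n; ++i)
        tree->GetEntry(i);                       // reads now go through the cache

      f->Close();
    }

The point of the cache is that the baskets needed by the selected branches are fetched in a few large, mostly sequential reads instead of many small scattered ones, which is exactly what helps when the file is served over the network by one of the storage systems mentioned above.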

For a description of the recent improvements in ROOT see

Quoting from the conclusions of this last report:

Over the last year (2010), the main focus of the ROOT I/O team has been to significantly enhance the I/O performance both in terms of CPU and real time. We have consolidated the code by leveraging old and new tools to reduce the number of defects, including Coverity, Valgrind and traditional test cases, often provided by users via the ROOT forum and the bug tracking system. Thanks to our collaboration with the LHC software framework teams (as highlighted by CMS), these enhancements resulted in substantially more efficient use of the available CPU resources for end-users.

Database (Oracle at CERN) use is restricted to calibration and other "conditions" constants (a few hundred MB loaded by a data-processing application, a few TB in total). Access to those constants is done through a hierarchy of standard http caches (squids) resident both on the LAN and on the WAN. This avoids server overload and data races even in the not uncommon case of thousands of jobs performing the very same set of queries at the very same time (see for instance CMS Conditions Data Access using FroNTier, at CHEP'07). Databases are of course also used as backends for most of the computing services and for bookkeeping (see the previous section and The CMS dataset bookkeeping service, CHEP'07).
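
The following is a generic libcurl sketch, not the actual FroNTier client, of how such a query can be routed through a site-local squid cache; the query URL and proxy address are placeholders. When thousands of jobs issue the same query, only the first one reaches the database server and the others are answered from the cache.

    // conditions_via_proxy.cpp -- generic libcurl sketch (not the actual FroNTier client)
    // of fetching a conditions payload through a local squid cache; URL and proxy host
    // are placeholders.
    #include <curl/curl.h>
    #include <cstdio>
    #include <string>

    static size_t appendToString(char* data, size_t size, size_t nmemb, void* userp) {
      static_cast<std::string*>(userp)->append(data, size * nmemb);
      return size * nmemb;
    }

    int main() {
      curl_global_init(CURL_GLOBAL_DEFAULT);
      CURL* curl = curl_easy_init();
      if (!curl) return 1;

      std::string payload;
      // Placeholder query URL for a conditions payload.
      curl_easy_setopt(curl, CURLOPT_URL, "http://conditions.example.org/query?tag=BeamSpot");
      // Route the request through the site-local squid; identical queries issued by
      // thousands of jobs are then answered from the cache, not by the database.
      curl_easy_setopt(curl, CURLOPT_PROXY, "http://localhost:3128");
      curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, appendToString);
      curl_easy_setopt(curl, CURLOPT_WRITEDATA, &payload);

      CURLcode rc = curl_easy_perform(curl);
      if (rc == CURLE_OK)
        std::printf("received %zu bytes of conditions data\n", payload.size());
      else
        std::printf("request failed: %s\n", curl_easy_strerror(rc));

      curl_easy_cleanup(curl);
      curl_global_cleanup();
      return rc == CURLE_OK ? 0 : 1;
    }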

Scientific software

Performance improvements in scientific code can be divided into three broad classes:

  1. pure "technical" changes not affecting the numerical results of the application
  2. modifications to the implementation of algorithms that may modify their numerical accuracy
  3. changes in the algorithms, or even just in their parameters, that affect the results of the application and its scientific accuracy ("physics performance" in CMS jargon)

Changes in category 1 may be quite invasive, even in user code, when for instance they affect APIs or data structures. Still, in CMS they are conducted as "whole code cleanup campaigns" executed by the core team, with just a prior notification to the code owners and maintainers.

Changes in category 2 are also usually initiated and executed by the core team or by "performance experts". They require minimal involvement by the code maintainers and scientists, who inspect the new version of the code and verify/validate the physics performance. In our experience we had a few cases of serious delays in the approval of such changes and one or two cases of requests for rollback. It should be noted that changes which are apparently technical, such as changes of compiler or OS, often end up in this second category as the numerical accuracy is affected.

Changes in category 3 need the full involvement of the development and scientific teams. Often they require full validation and approval by the whole scientific analysis team, as they will eventually affect the final scientific output of the experiment. They are also the ones that usually bring the largest performance improvements.

As an example we will go through the latest cycle of improvements, performed in summer 2011. For more details see the presentation at the CMS general weekly meeting.

The new "Task Force" was initiated directly by top-level management who mandated all involved parities (software developers, detector experts and physics analysis teams) to collaborate in solving the problem. In this case the general finding was that the most time consuming reconstruction algorithms were not scaling linearly (neither in time nor in memory) with detector occupancy due their combinatorics nature.

  • The original full combinatorial algorithm had already been changed in the past in favor of an iterative scheme, in which the very same combinatorial algorithm is run multiple times in sequence with different parameters, each time starting from the "leftovers" of the previous step.
    • A first set of technical changes was applied to avoid copying data structures when going from one step to another of the iterative reconstruction, and to perform the cleanup of candidates (retaining only the best) during the search itself and not just at the end.
      • These changes, even if quite invasive in the algorithms themselves, were implemented mostly by the experts in the core team.
  • In another case the quadratic behavior of an algorithm based on nested loops was "linearized" by adopting a search approach based on a k-d tree (see the sketch after this discussion).
    • The new algorithm was developed by software-skilled scientists (students).
  • The main time improvements came from changes in the parameters of the algorithms themselves, reducing tolerances and search intervals and optimizing the sequence of the "steps".
    • These changes were implemented directly by the scientists responsible for those "packages".
Those last changes did modify the physics performance of the event reconstruction, including the "Physics-Object" identification efficiencies, and therefore required full validation by the scientific teams (Physics Analysis in CMS jargon). The risks involved, in particular the risk of delaying the availability of scientific results due to the need to retune analyses and eventually to reprocess simulated data, led to the decision not to deploy those changes in the on-going prompt reconstruction: it was preferred to accept the risk of a short delay in the availability of the reconstructed datasets by continuing to use the less performant version in use up to that point. The new version will therefore be deployed in production in 2012. In the meantime, additional hardware resources were secured by an earlier commissioning of new nodes with higher memory per core and by "squatting" user queues. At the same time an optimization of the LHC running, which produces on average fewer high-occupancy events, helped speed up the prompt data processing.
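
To illustrate the "linearization" idea mentioned in the list above, here is a small, self-contained sketch of replacing a quadratic nested-loop search for nearby hits with a 2-D k-d tree range query; it is purely illustrative and is not the CMS tracking code (the point structure, names and numbers are invented).

    // kdtree_match.cpp -- illustrative sketch (not the CMS code) of replacing a quadratic
    // nested-loop search for nearby points with a 2-D k-d tree range query.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Point { float x, y; int id; };

    // Build an implicit k-d tree in place: the median of the current range becomes the
    // node, splitting alternately on x and y.
    void buildKdTree(std::vector<Point>& pts, int lo, int hi, int depth = 0) {
      if (hi - lo <= 1) return;
      int mid = (lo + hi) / 2;
      auto less = [depth](const Point& a, const Point& b) {
        return (depth % 2 == 0) ? a.x < b.x : a.y < b.y;
      };
      std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi, less);
      buildKdTree(pts, lo, mid, depth + 1);
      buildKdTree(pts, mid + 1, hi, depth + 1);
    }

    // Collect the ids of all points within `radius` of (qx,qy), pruning subtrees whose
    // splitting plane is farther away than the radius.
    void searchRadius(const std::vector<Point>& pts, int lo, int hi, int depth,
                      float qx, float qy, float radius, std::vector<int>& result) {
      if (hi <= lo) return;
      int mid = (lo + hi) / 2;
      const Point& p = pts[mid];
      float dx = p.x - qx, dy = p.y - qy;
      if (dx * dx + dy * dy <= radius * radius) result.push_back(p.id);
      float planeDist = (depth % 2 == 0) ? dx : dy;
      // Always descend into the near side; descend into the far side only if the
      // query circle crosses the splitting plane.
      if (planeDist >= -radius) searchRadius(pts, lo, mid, depth + 1, qx, qy, radius, result);
      if (planeDist <=  radius) searchRadius(pts, mid + 1, hi, depth + 1, qx, qy, radius, result);
    }

    int main() {
      std::vector<Point> hits = {{0.1f, 0.2f, 0}, {0.5f, 0.4f, 1}, {2.0f, 2.0f, 2}, {0.15f, 0.25f, 3}};
      buildKdTree(hits, 0, static_cast<int>(hits.size()));

      std::vector<int> nearby;
      searchRadius(hits, 0, static_cast<int>(hits.size()), 0, 0.1f, 0.2f, 0.3f, nearby);
      for (int id : nearby) std::printf("hit %d is within the search window\n", id);
      return 0;
    }

Building the tree costs O(n log n) and each range query visits only the relevant part of the tree, which is what removes the quadratic blow-up at high detector occupancy.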

This last "exercise" confirms few facts in performance improvement projects (at least in HEP)

  • It is not difficult to gain a factor of 2 in performance once the developers of the scientific algorithms are brought in
  • Validating the results for physics analysis takes about 4 weeks (including the feedback loop due to regressions), which can easily double due to various human factors
  • A factor of 2 in hardware resources is also not difficult to find, at very little cost

The situation is likely to repeat itself in 2012, with the additional constraints that the detector and software groups will be engaged in the upgrade program (see below) and the physics teams will be more and more focused on discovery physics. A probable scenario for the optimization of data processing may therefore include:

  • mild technical improvements mostly to fix regressions
  • re-prioritization of hardware resources and eventual use of opportunistic computing including commercial clouds
  • re-prioritization of data processing preferring discovery physics and delaying precision measurements
  • partial descoping of prompt reconstruction in favor of a more efficient and precise full reprocessing

R&D: planning for a future already with us

In 2007 it was realized that neither the computing model nor the software architecture would be well adapted to the novel hardware and software technologies coming to the market.

A first set of R&D projects was therefore proposed to investigate multi-core architectures and virtualization. They were followed by projects investigating data access, data storage and databases.

see for instance

The current data-taking phase will end in 2012. It will be followed by a "long shutdown"; data taking will resume in 2014. All experiments are planning to perform detector upgrades as well as upgrades of their software and computing infrastructures. The goal will be to be able to run scientific applications efficiently on CPUs with more than 50 cores each (200 threads?) with no more than 1 GB of memory allocated per thread. Projects also exist to try to exploit coprocessors such as GPUs or other forms of accelerators.
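
As a toy illustration of the kind of intra-process parallelism this implies, the sketch below shares the large read-only structures (geometry, conditions) among worker threads that pull event indices from a common counter, so that the per-thread memory footprint stays small; it is an assumption-laden example and not the CMSSW multi-threaded framework.

    // parallel_events.cpp -- toy sketch of many-core event processing: worker threads
    // share read-only data (geometry, conditions) and pull event indices from a common
    // counter, keeping the per-thread memory footprint small. Not the CMSSW framework.
    #include <algorithm>
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct SharedConditions { /* large read-only calibration data, shared by all threads */ };

    void processEvent(long event, const SharedConditions&) {
      // Stand-in for the reconstruction of one event.
      std::printf("processed event %ld on thread %zu\n", event,
                  std::hash<std::thread::id>()(std::this_thread::get_id()) % 100);
    }

    int main() {
      const long nEvents = 32;
      const unsigned nWorkers = std::max(2u, std::thread::hardware_concurrency());
      SharedConditions conditions;           // allocated once, shared read-only
      std::atomic<long> next{0};             // next event index to process

      std::vector<std::thread> workers;
      for (unsigned w = 0; w < nWorkers; ++w)
        workers.emplace_back([&] {
          for (long ev; (ev = next.fetch_add(1)) < nEvents; )
            processEvent(ev, conditions);
        });

      for (auto& t : workers) t.join();
      return 0;
    }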

-- VincenzoInnocente - 12-Oct-2011
