-- RosamariaVenditti - 2021-02-18

Description of L3 Position, Duties and Skills

Position Description Duties Skills

DQM Online Shift Manager

Online detector and data quality monitoring (DQM) in CMS involves many details. A core part of the process is the work of the online DQM primary shifters, who inspect the observables related to detector performance in the DQM-GUI, report results in the Run Registry (the data quality bookkeeping tool) and in the e-logs, and warn the operation crew and the sub-systems in case of problems. In addition, any unexpected behavior of the DQM infrastructure should be reported to the relevant DQM experts. During data taking, shifts run 24/7 for the whole period (3 x 8h shifts per day, including weekends). The primary DQM shifters are drawn from across the CMS collaboration. In addition to the primary shifters, DQM experts take DQM on-call shifts (DQM DOC), 7 consecutive days at a time, to support the primary shifters and the DQM infrastructure. The shift manager should organize shifts, training shifts and tutorials to ensure that the primary shifters are well trained, and support the work of the shifters in coordination with the DQM conveners.

Duties:

  • organize online DQM shifts
  • organize training shifts
  • organize and provide tutorials
  • deal with possible problems (exchange of shifts, uncovered slots, sudden cancellation of shifts)
  • support the work of the shifters, in coordination with DQM conveners and subsystem DQM experts
  • keep documentation and short-term instructions up to date

The duties listed here apply to both primary and DQM DOC shifters.

Skills:

  • experience with and good understanding of CMS DQM operations, shift organization and DQM shifter duties
  • good knowledge of CMS shift tools
  • good organizational, communication and problem-solving skills

Online DQM Operation Manager

The process of online detector and data quality monitoring in CMS is rather complex. The DQM primary shifters inspect the histograms related to detector performance in the DQM-GUI, report results in the Run Registry (RR) and the e-log, and warn the sub-systems in case of problems. The Online DQM operation manager is responsible for running the online DQM software infrastructure (used by the primary shifters and by all the CMS subsystems to monitor the relevant histograms) and for ensuring its smooth and robust operation. The main software components that must run during operations are the Online DQM-GUI and the Online Run Registry. Part of the DQM GUI software (the applications that produce the per-subsystem histograms) runs in the CMSSW environment and is continually subject to integration requests from subsystem experts; these changes should be integrated, through a dedicated test process, before the start of a given round of data taking. Another part of the DQM GUI software is dedicated to plot visualization and analysis, and should be kept up to date with the help of the software expert. The Run Registry is a JavaScript application running on the CERN DB, whose goal is to keep track of data quality. It is very stable, but is sometimes affected by access or functional problems that you may be required to understand and bring to the attention of the relevant experts.

Duties:

  • deployment of new CMSSW releases on the DQM machines
  • PR integration and testing in playback
  • software upgrades (including the DQM GUI)
  • code debugging
  • communication with the people responsible for DQM in each of the subsystems
  • communication with system admins
  • keeping the related documentation up to date
  • communication with the DQM on-call

Skills:

  • good (demonstrated) knowledge of the DQM software infrastructure, or good knowledge of bash/C++
  • good knowledge of GitHub
  • good communication skills

Offline DQM ML infrastructure developer

The process of offline detector and data quality monitoring in CMS is rather complex. One of the key points of the process is the work of the DOC3 shifters who, for each run, inspect by eye dozens of histograms related to detector performance in the DQM-GUI, compare the observed distributions with the reference ones, report results in the RR and e-log, and warn the sub-systems in case of problems. This process runs continuously during data taking and involves tens of shifters per week; it is prone to human error and presently does not allow monitoring the performance at the lumi-section (LS) level.
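As a rough illustration of the histogram-versus-reference comparison that shifters currently do by eye, the sketch below computes a shape-only chi-square between an observed histogram and a reference with the same binning. It is not part of the DQM software: the function names `chi2_shape` and `chi2_per_bin` are hypothetical, the inputs are assumed to be plain lists of non-negative bin counts, and the choice of a chi-square score is an assumption made only for this example.

```python
from math import sqrt
from typing import Sequence, Tuple


def chi2_shape(observed: Sequence[float],
               reference: Sequence[float]) -> Tuple[float, int]:
    """Shape-only chi-square between two histograms of raw counts.

    Both histograms must use the same binning. Returns the chi-square and
    the number of bins that entered the sum. Bins that are empty in both
    histograms carry no information and are skipped.
    """
    if len(observed) != len(reference):
        raise ValueError("histograms must have the same number of bins")
    n_obs, n_ref = sum(observed), sum(reference)
    if n_obs <= 0 or n_ref <= 0:
        raise ValueError("both histograms must contain entries")

    # Scale factors that compensate for the different total number of entries.
    k_obs = sqrt(n_ref / n_obs)
    k_ref = sqrt(n_obs / n_ref)

    chi2, bins_used = 0.0, 0
    for o, r in zip(observed, reference):
        if o == 0 and r == 0:
            continue
        chi2 += (k_obs * o - k_ref * r) ** 2 / (o + r)
        bins_used += 1
    return chi2, bins_used


def chi2_per_bin(observed: Sequence[float],
                 reference: Sequence[float]) -> float:
    """Chi-square divided by the number of bins used (hypothetical score)."""
    chi2, bins_used = chi2_shape(observed, reference)
    return chi2 / bins_used
```

A histogram whose score is far above 1 would be a candidate for closer inspection; the actual cut value is a tuning choice that is not specified here.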

The core DQM team has been studying the implementation of a package based on the latest machine learning (ML) techniques that would allow the whole monitoring and certification process to be performed in a semi-automated way (also increasing the inspection granularity to the per-LS level). Some prototype ML models have been developed for single sub-systems, with successful results.

The Offline DQM ML infrastructure developer will be responsible for extending these prototypes to all the sub-systems, creating a software toolkit able to define the proper signal and background datasets for each subsystem, implement, test and validate the model, combine the results obtained on several histograms, and put the final flag in the Run Registry. She/he should collect input from the subsystems about the relevant histograms to be used as input for the ML model, define the proper training/test/validation datasets with the DQM core team, and coordinate the inclusion of the final discriminator in the GUI and of the final flag (per run or per LS) in the Run Registry.
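The step of combining results from several histograms into a final per-LS or per-run flag could be sketched as follows. This is only a minimal illustration under stated assumptions: the class `HistogramScore`, the functions `flag_lumisection` and `flag_run`, the "GOOD"/"BAD" labels and the fraction-based combination rule are all hypothetical, and no real Run Registry or DQM-GUI interface is used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class HistogramScore:
    """Anomaly score of one histogram in one lumi-section (hypothetical)."""
    name: str
    score: float      # higher means more anomalous
    threshold: float  # per-histogram cut, assumed to come from validation


def flag_lumisection(scores: List[HistogramScore],
                     max_bad_fraction: float = 0.0) -> str:
    """Flag an LS as GOOD if few enough histograms exceed their threshold."""
    if not scores:
        raise ValueError("no histogram scores for this lumi-section")
    n_bad = sum(1 for s in scores if s.score > s.threshold)
    return "GOOD" if n_bad / len(scores) <= max_bad_fraction else "BAD"


def flag_run(ls_flags: Dict[int, str],
             min_good_fraction: float = 1.0) -> str:
    """Flag a run as GOOD if enough of its lumi-sections are GOOD."""
    if not ls_flags:
        raise ValueError("run has no lumi-sections")
    n_good = sum(1 for flag in ls_flags.values() if flag == "GOOD")
    return "GOOD" if n_good / len(ls_flags) >= min_good_fraction else "BAD"


# Example with made-up scores for two lumi-sections and two histograms.
example_scores = {
    1: [HistogramScore("hist_a", 0.1, 0.5), HistogramScore("hist_b", 0.2, 0.5)],
    2: [HistogramScore("hist_a", 0.9, 0.5), HistogramScore("hist_b", 0.2, 0.5)],
}
example_ls_flags = {ls: flag_lumisection(s) for ls, s in example_scores.items()}
example_run_flag = flag_run(example_ls_flags)
```

Keeping the per-LS flags alongside the per-run flag preserves the finer granularity that the ML-based approach is meant to provide; how the flags are actually combined and stored would be decided with the subsystems and the DQM core team.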

Duties:

  1. take care of the extension of the machine learning models to all the CMS sub-systems

  2. implement a general ML-based DQM framework (it should include tools for training on the proper dataset, implementing, testing and validating the model, combining results obtained on different distributions, and providing a final flag per LS/run)

  3. coordinate with the experts to put the final output in the DQM-GUI and RR

Skills:

  • good knowledge of Python and C++
  • good knowledge of machine learning techniques
  • experience with the standard DQM and certification workflow

Topic revision: r2 - 2021-02-19 - RosamariaVenditti
 