Title, authors
Notes -- none
References
The damn tracker poster
The HPU taskforce paper
NOTE: Reviewed up to section 4!
1
Text
One of the biggest challenges in the CMS detector is the precise reconstruction of particle tracks. This is done by very complex algorithms, which makes it a CPU-intensive task. At the scale of the LHC, understanding how the algorithm behaves as a function of event complexity is one of the key factors in processing workflows in a more uniform and efficient way. This analysis makes it possible, based on previous observations, to estimate what the event reconstruction time will look like for incoming data.
Figures
Maybe show this poster?
https://indico.cern.ch/getFile.py/access?contribId=8&sessionId=4&resId=0&materialId=poster&confId=189524
2
Text
The complexity of track reconstruction comes from the number of tracks and how much they overlap, which makes the algorithm iterate more before it can distinguish the tracks. This has a direct relation to the instantaneous luminosity, or to the number of pile-up interactions per bunch crossing. The latter is not measured directly, but is a function of the accelerator running conditions and the instantaneous luminosity; for this reason, this study focuses on instantaneous luminosity, although pile-up is the more intuitive quantity.
Below we can see, in the CMS event display, a high pile-up event compared to a less complex one.
Figures
3
Text
Less talk, more data: this is the effect of the luminosity/pile-up variation on real data taking.
The CMS Fill Report already provides interesting plots of instantaneous luminosity and pile-up over time for a given fill. Here we can compare those with the reconstruction time per event for the same data, as observed at the Tier-0.
Figures
- Luminosity over time
- PU over time
- TpE over time
(displayed vertically, so one can visualize how they evolve over time)
4
Text
The following is a curve of CMSSW performance for a given release and primary dataset (type of event). It varies significantly with the type of event, but it is a very good reference to estimate, based on observed values, what the time per event will look like. There is an important systematic error in this measurement: the workflows run on non-uniform farms, and different CPU models result in different processing times for the same event. The advantage is that, as a general curve, it better covers the whole range of CPU models present in the farms we use, which in the end is what is most useful for central operations.
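As a rough illustration of how such a curve can be used, here is a minimal sketch of fitting time per event (TpE) against instantaneous luminosity for one release and primary dataset. The sample points and the quadratic shape are assumptions for illustration, not real Tier-0 measurements.

```python
# Sketch: fit an assumed TpE-vs-luminosity curve and use it to estimate
# the time per event at a luminosity between the observed points.
import numpy as np

# (inst. luminosity [arbitrary units], observed TpE [s]) -- hypothetical values
lumi = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
tpe = np.array([2.1, 3.0, 4.3, 6.1, 8.4, 11.2])

# A second-degree polynomial is an assumption; the real curve shape may differ.
coeffs = np.polyfit(lumi, tpe, deg=2)
predict_tpe = np.poly1d(coeffs)

print(predict_tpe(4.5))  # estimated TpE at an intermediate luminosity
```

Within the measured interval such an interpolation is well constrained; the systematic spread from different CPU models still applies on top of it.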
Figures
Usual curve in the known interval. Supposedly low chi2 -- report chi2?
5
Text
Some measurements were done on PromptReco workflows to observe how close the estimate can get, or how far it can deviate from the real value. Please consider the error introduced by the CPU speed fluctuation: in the Tier-0 farm there is a 37.75% difference in HEPSpec 2006 performance between the fastest and the slowest CPU. In figure 37 we can observe the distribution of CPU speeds in the farm, and in figure 47 the error distribution of the TpE prediction for 35 different workflows. Table 42 shows different specific cases.
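A minimal sketch of the two quantities discussed above. The HEPSpec06 scores and TpE values are made-up placeholders, chosen only so the spread matches the quoted 37.75%; they are not the real farm numbers.

```python
# Relative spread between the fastest and slowest CPU model in the farm,
# using hypothetical HEPSpec06 scores.
fastest_hs06 = 13.775  # placeholder score of the fastest CPU model
slowest_hs06 = 10.0    # placeholder score of the slowest CPU model
spread = (fastest_hs06 - slowest_hs06) / slowest_hs06 * 100.0
print(f"farm CPU spread: {spread:.2f}%")  # farm CPU spread: 37.75%

# Relative TpE prediction error for one workflow, with illustrative values.
predicted_tpe, observed_tpe = 7.5, 8.2  # seconds per event
rel_error = (predicted_tpe - observed_tpe) / observed_tpe
print(f"TpE prediction error: {rel_error:+.1%}")
```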
Figures
Little table =D
See this post on HN, which is now dead.
6
Text
One of the uses of this measurement is to get an idea of what the reconstruction time will look like at higher luminosities, for example by extrapolating up to the Run2 (2015) luminosity. Obviously some things will change that will improve the curve parameters, so we should look at it as a guideline for the kind of challenge that lies ahead, not as a precise report.
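A hedged sketch of the extrapolation idea, assuming a quadratic fit with placeholder coefficients; evaluating the curve far outside the measured range gives a guideline, not a prediction.

```python
# Sketch: evaluate an assumed fitted curve both inside the observed
# luminosity range and at a higher, Run2-scale luminosity.
import numpy as np

# Hypothetical fit: TpE(L) = 0.2*L^2 + 0.5*L + 1.0 seconds
predict_tpe = np.poly1d([0.2, 0.5, 1.0])

observed_max = 7.0  # top of the measured luminosity range (arbitrary units)
run2_lumi = 14.0    # illustrative Run2-scale luminosity, 2x the observed max

print(predict_tpe(observed_max))  # TpE at the edge of the known interval
print(predict_tpe(run2_lumi))     # extrapolated TpE: treat with caution
```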
Figures
7
Text
Due to the wide range of luminosity, and its effect on the time per event, we can observe here some distributions from a multi-run reconstruction workflow. The consequence is the famous effect where 95% of the processing gets done in 50% of the workflow's total time, and a considerable amount of time is spent on high-luminosity jobs, which can take up to 48h to finish if they don't have to retry.
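The long-tail effect described above can be sketched with made-up job lengths, assuming one slot per job so the workflow ends only when its longest job does:

```python
# Sketch: a skewed job-length distribution (hours); most jobs are short,
# a few high-luminosity jobs dominate the workflow's wall time.
job_hours = [2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 24, 48]

workflow_wall_time = max(job_hours)  # 48h: set by the slowest job
halfway = workflow_wall_time / 2     # the 50% mark of the total wall time
done_by_halfway = sum(1 for h in job_hours if h <= halfway)

# With these placeholder numbers, 19 of 20 jobs (95%) are done at the 50% mark.
print(f"{done_by_halfway}/{len(job_hours)} jobs done at 50% of wall time")
```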
Figures
TpE distribution from the DQM, job-length distribution from the ReReco
8
Text
In order to monitor this behavior automatically, automated ways to generate this plot were developed. At the end of a reconstruction workflow, the Workload Management Agent harvests the performance information and uploads it to a central database in the CMS DashBoard. This information is used in monitoring interfaces (figures 2 and 4), and it can also be queried by automated systems and scripts through a DataService.
Figures
Dashboard tools
9
Text
A work in progress is to change the way we split jobs in a workflow in CMS. Today we define either a number of events or a number of lumi-sections per job; operators choose it based on how many are needed for an average job length of 6h. A new splitting algorithm is being written in which operators specify the expected job length; the system will query DashBoard's performance database, estimate the time per event, and balance the job inputs (number of events per job) so as to get more uniform running times, by taking into account the luminosity of the data being processed.
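A minimal sketch of the splitting idea, with hard-coded placeholder estimates standing in for the DashBoard performance query; the real algorithm works on events and lumi-sections of actual data.

```python
# Sketch: pack lumi-sections into jobs so each job's estimated running
# time stays near an operator-supplied target.
target_job_seconds = 6 * 3600  # operator-supplied target: 6h

# (lumi-section id, estimated seconds to process it) -- illustrative values,
# standing in for a TpE estimate from the performance database
sections = [(1, 4000), (2, 5000), (3, 9000), (4, 12000), (5, 3000),
            (6, 15000), (7, 7000), (8, 2000)]

jobs, current, current_time = [], [], 0
for ls, secs in sections:
    if current and current_time + secs > target_job_seconds:
        jobs.append(current)  # close the job before it overruns the target
        current, current_time = [], 0
    current.append(ls)
    current_time += secs
if current:
    jobs.append(current)

print(jobs)  # [[1, 2, 3], [4, 5], [6], [7, 8]]
```

High-luminosity sections (long estimated times) end up in jobs with fewer inputs, which is what evens out the running times.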
Figures
?? Maybe not