Towards an Analysis Model for CMS Computing

The Problem

We don't have a proper analysis model for CMS. We understand how one user submits CRAB jobs to analyse one dataset, more or less, but beyond that, we have nothing. We don't know how many datasets a user interacts with, how many users use each dataset, how widely any given analysis group uses CMS resources, or anything like that.

When it comes to optimising the use of resources, we need to be looking at the global picture. We have many tools that are under development that will optimise different parts of the system, independently (Dynamic Data Placement, AAA, multi-core pilots, opportunistic resources and so on), but we don't take an integrated approach to the whole system.

An integrated approach would have to take account of CPU, storage, network, and variations in the workflow of analysis and production. The goals would be several:

  1. Optimise resource-use in the existing system, by considering all components together
  2. Provide robust response to changes in the daily workflow (e.g. unexpected downtimes, deadlines for conferences or new discovery announcements)
  3. Provide a tool for testing the effect of planned changes (e.g. is it worth buying more disk to make data more available, or more CPU per disk, or shifting the managed vs. WAN data-placement ratio, or...?)
  4. Provide a model that can give more realistic long-term predictions of the needs of CMS, for budgetary and planning purposes.

N.B. While I refer to an 'analysis model', it should include scheduled processing too, such as Monte-Carlo generation and (re-)processing of raw data. These are already far better understood, but as things that consume resources and therefore compete with analysis, they need to be included in the overall model.

Towards a solution

Building a complete model is not going to be easy; it will likely take several years. We can break the process down into a few steps:
  • analysing existing sources of data
  • extending monitoring to cover types of information that are not yet recorded
  • providing long-term storage of, and access to, this information
  • providing tools that can analyse that data to pick out trends on timescales ranging from hours to months or years.

We can and should get started with the first two points as soon as we can. The last two are longer-term projects.

Analysing existing sources of data

We have several sources of information about how we use data today:
  • PhEDEx knows about data-placement and data-movement
  • AAA knows about WAN-access to data
  • CRAB knows about who submits which type of job where, and what data it runs on
  • The Popularity DB and Dashboard integrate a lot of this information already, so there's no need to go back to each source individually

This is certainly enough to get started. Below I list some of the things that come to mind to help kick-start a discussion. This is far from complete, and is meant only as a starting point.

Data from PhEDEx

We know that much of our traffic is between T2's, and not the scheduled T0->T1 or T1->T2 traffic. We don't know things like:
  • what's the activity profile for T2's?
    • how do we characterise the amount of data they pull down from other sites on the timescale of days, weeks, or a year?
    • how much data do they export to other sites on the same timescales?
    • how many sites do they pull data from? Where are they situated?
    • how many sites do they send data to? Where are they?
    • how do these profiles vary across geographical regions of CMS?
    • how do these profiles vary with time?
  • how does transfer activity depend on physics group?
    • are all the sites that host a given physics group equally active or inactive at the same time?
    • are all physics groups equally active/inactive at the same time?
    • are physics groups well distributed geographically, or do they cluster by region?
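
As a starting point for the activity-profile questions above, here is a minimal sketch of the kind of aggregation involved. The record layout (time, source, destination, bytes) and the site names are hypothetical, chosen for illustration only; they are not a PhEDEx schema, and the real data would have to be extracted from PhEDEx first.

```python
from collections import defaultdict

# Hypothetical transfer records: (seconds since a reference time,
# source site, destination site, bytes transferred).
transfers = [
    (0, "T2_A", "T2_B", 5000),
    (3600, "T2_C", "T2_B", 2000),
    (90000, "T2_B", "T2_A", 1000),
]

def site_profile(records, bin_seconds=86400):
    """Sum imported/exported bytes per site per time bin, and collect partners."""
    imported = defaultdict(int)       # (site, bin) -> bytes pulled in
    exported = defaultdict(int)       # (site, bin) -> bytes sent out
    sources = defaultdict(set)        # site -> sites it pulled data from
    destinations = defaultdict(set)   # site -> sites it sent data to
    for t, src, dst, nbytes in records:
        time_bin = t // bin_seconds
        imported[(dst, time_bin)] += nbytes
        exported[(src, time_bin)] += nbytes
        sources[dst].add(src)
        destinations[src].add(dst)
    return imported, exported, sources, destinations

imported, exported, sources, destinations = site_profile(transfers)
print(imported[("T2_B", 0)], sorted(sources["T2_B"]))
```

Changing bin_seconds gives the same profile per day, week, or year. Grouping sites by region or physics group would only need an extra lookup table.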

On a more global level, there are many other questions to consider:

  • what do our errors look like?
    • are errors clustered by source, by destination, by link, in time, or what?
    • does the presence of errors between A and B say anything about errors between C and D?
  • replica distribution, for datasets that have more than one replica:
    • are the replicas grouped or anti-grouped by geographical region?
    • are the replicas owned by the same analysis groups, or shared?
    • what's the time-profile for creating replicas? Do they all happen at once, or are they distributed? If so, how? Does it depend on the dataset?
  • link stability
    • we have never, to my knowledge, looked at the history of link stability. We know that entropy cuts in from time to time, but we don't know if there are short (~few hours) wobbles in stability. We could look at the RouterHistory API to see how stable the rates and latencies calculated by PhEDEx are.
  • transfer latency: For some time now PhEDEx has been accumulating latency measurements for blocks, but nobody has looked at it.
    • how do we characterise blocks that show latency for transfer-completion? Are they totally different from other transfers, or are they a tail in a distribution?
    • how is block latency affected by source site, destination, number of files, size of dataset, dataset-type (RAW/RECO/MC?), etc.?
    • when latency occurs, is it sporadic (one link) or systemic (several links)?
    • if systemic, what characterises the clustering?
  • network topology
    • in general, I'd like to know standard network metrics for PhEDEx, just for curiosity
    • network degree distribution: i.e. how many partners each node has
    • network diameter: how fully-connected is the mesh?
    • these are simple things to calculate, but could be made more interesting in many ways:
      • compare these metrics over time, from the beginning of the information in PhEDEx
      • compare over different timescales: an hour, a day, a week, a month. What does the topology look like at different timescales?
      • compare Production to Debug: are the two similar or not?
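
The two topology metrics mentioned above can be sketched in a few lines. This is only an illustration: the link list is made up, and I assume links can be treated as undirected, which may not be appropriate for PhEDEx, where a link can exist in one direction only.

```python
from collections import Counter, defaultdict, deque

# Hypothetical list of links between nodes, treated as undirected.
links = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]

def build_graph(links):
    """Build an adjacency map: node -> set of partner nodes."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def degree_distribution(graph):
    """Map each degree (number of partners) to the number of nodes with that degree."""
    return Counter(len(partners) for partners in graph.values())

def diameter(graph):
    """Longest shortest path in hops; None if the graph is disconnected."""
    longest = 0
    for start in graph:
        # Breadth-first search gives shortest hop counts from 'start'.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        if len(dist) < len(graph):
            return None
        longest = max(longest, max(dist.values()))
    return longest

graph = build_graph(links)
print(degree_distribution(graph), diameter(graph))
```

Running the same functions over the links that were active in a given hour, day, week, or month would give the comparison over timescales. The same goes for Production versus Debug.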

Sample PhEDEx queries

There is now a separate page where I collect some sample PhEDEx queries for getting basic information from TMDB.

Data from AAA

From AAA it would be interesting to see how often the same file is accessed, and how many replicas of that file exist to choose from. This would give us a measure of how well the data is being distributed around CMS, for example.
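
A rough sketch of that measure, assuming we can obtain a list of remote reads (one logical file name per access) and a replica count per file. Both inputs and the file names are hypothetical, and I'm not assuming any particular AAA monitoring format.

```python
from collections import Counter

# Hypothetical inputs: one logical file name per remote read,
# and the number of replicas known for each file.
accesses = ["/store/a.root", "/store/a.root", "/store/b.root", "/store/a.root"]
replicas = {"/store/a.root": 1, "/store/b.root": 3}

def accesses_per_replica(accesses, replicas):
    """Accesses divided by replica count, for files with a known non-zero replica count."""
    counts = Counter(accesses)
    return {
        lfn: counts[lfn] / replicas[lfn]
        for lfn in counts
        if replicas.get(lfn)
    }

print(accesses_per_replica(accesses, replicas))
```

Files with many accesses per replica are candidates for more replicas; files with many replicas and few accesses are candidates for cleanup.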

One interesting project around AAA would be to analyse the performance of users' analysis jobs and look for differences in performance when running locally or in overflow, i.e. when accessing data over AAA. If we can spot which types of job, which datasets, and which users take greater performance hits when running over the WAN, maybe we can schedule those jobs differently, and choose other jobs to run in overflow.
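
To make the idea concrete, here is a minimal sketch that compares CPU efficiency (CPU time over wall-clock time) for local and overflow jobs, per dataset. The job records and field names are hypothetical; CPU efficiency is just one possible performance measure, chosen here for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records; the field names are assumptions for illustration.
jobs = [
    {"dataset": "/X/v1/AOD", "overflow": False, "cpu_s": 900, "wall_s": 1000},
    {"dataset": "/X/v1/AOD", "overflow": True, "cpu_s": 600, "wall_s": 1000},
    {"dataset": "/Y/v1/AOD", "overflow": False, "cpu_s": 800, "wall_s": 1000},
    {"dataset": "/Y/v1/AOD", "overflow": True, "cpu_s": 780, "wall_s": 1000},
]

def overflow_penalty(jobs):
    """Mean local CPU efficiency minus mean overflow CPU efficiency, per dataset."""
    eff = defaultdict(lambda: {"local": [], "overflow": []})
    for job in jobs:
        mode = "overflow" if job["overflow"] else "local"
        eff[job["dataset"]][mode].append(job["cpu_s"] / job["wall_s"])
    # Only datasets seen in both modes can be compared.
    return {
        dataset: mean(e["local"]) - mean(e["overflow"])
        for dataset, e in eff.items()
        if e["local"] and e["overflow"]
    }

print(overflow_penalty(jobs))
```

In this made-up sample, jobs on /X/v1/AOD lose far more efficiency in overflow than jobs on /Y/v1/AOD, so the latter would be the better choice to run over the WAN. The same grouping could be done by user or job type instead of dataset.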

AFAICT, the monitoring that would allow us to do this is not yet in place, but it could be soon. Then we should just be able to harvest the data from the dashboard.

Data from ASO

Data from the Popularity DB

The Popularity DB has been collecting data for a couple of years now, based only on information from CRAB2. This is a treasure-trove of useful information:
  • what's the time-profile for accessing a dataset?
    • does it grow, plateau, decay? Does it start high and decay?
    • does it sit there for months untouched and then experience a burst of use?
    • are datasets accessed in their entirety over time, or are only newer blocks accessed as they appear?
    • is there a single activity-peak per dataset, or is it multi-modal?
    • does any of that depend on the dataset? E.g. signal samples might be shorter-lived or less active than background samples, which are shared.
  • are access-profiles constant across re-processings?
    • i.e. if dataset Xv1 is replaced by Xv2, is there a change in the activity of the sum of those two datasets compared to similar datasets which are not reprocessed?
    • do older versions of a dataset fall off in popularity faster when there's a newer version?
  • does the activity in a dataset come from only one physics group, or many?
    • if many, do they each have different sub-profiles?
    • does each physics group have a characteristic access-profile, or are they all the same?
  • does the activity come from mostly one user, or many?
    • how does that vary with dataset?
  • Domenico had the idea of analysing the popularity information with neural networks. I think that's a great idea, not only for what we can learn about popularity but also because I think that this type of non-linear approach is going to be needed for modeling the full system. Simple 'if-then' logic just won't cut it.
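
Well before neural networks, a very simple sketch can address the single-peak versus multi-modal question above. It assumes we have, per dataset, a series of access counts per time bin (e.g. per week). The series here are invented, and the threshold of half the maximum is an arbitrary choice.

```python
def count_peaks(series, min_fraction=0.5):
    """Count strict local maxima that reach at least min_fraction of the global maximum.

    Flat plateaus are not counted, so noisy real data would need
    smoothing or coarser binning first.
    """
    if not series:
        return 0
    threshold = min_fraction * max(series)
    # Pad with zeros so that the first and last bins can also be peaks.
    padded = [0] + list(series) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] >= threshold
        and padded[i] > padded[i - 1]
        and padded[i] > padded[i + 1]
    )

# Hypothetical weekly access counts for two datasets.
print(count_peaks([1, 8, 3, 2, 9, 1]))   # two separate bursts of activity
print(count_peaks([10, 6, 3, 1, 0]))     # starts high and decays
```

Counting peaks per dataset, and then per physics group or per user, would give a first crude classification of access profiles.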

Other sources of data

Operations-related sources

One interesting source is Savannah tickets. It would be interesting to have an analysis of how many tickets of different types were created (e.g. 'agents down' vs. 'transfers stuck' vs. 'invalidate files' or whatever). Break this down by things like:
  • number of people who answer on the ticket
  • time from opening to resolution of the ticket
  • site/region clustering
  • source/destination (link) clustering, by site or region (i.e. do we have more problems within a region or between regions?)
This will give us a handle on which problems will benefit from more robust monitoring or can otherwise be improved by technology.
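
A sketch of the simplest such breakdown: the number of tickets per type and the median time to resolution. The ticket records are hypothetical and assume that type, opening time, and resolution time have already been extracted from Savannah, which is probably the harder part.

```python
from collections import defaultdict
from statistics import median

# Hypothetical ticket records, with times in hours since a reference point.
tickets = [
    {"type": "transfers stuck", "opened_h": 0, "resolved_h": 30},
    {"type": "transfers stuck", "opened_h": 10, "resolved_h": 20},
    {"type": "agents down", "opened_h": 5, "resolved_h": 7},
]

def summarise(tickets):
    """Return {ticket type: (number of tickets, median hours to resolution)}."""
    durations = defaultdict(list)
    for ticket in tickets:
        durations[ticket["type"]].append(ticket["resolved_h"] - ticket["opened_h"])
    return {kind: (len(hours), median(hours)) for kind, hours in durations.items()}

print(summarise(tickets))
```

Replacing the ticket type with site, region, or link as the grouping key gives the other breakdowns in the list.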

I also think it will be interesting to see how we turn over personnel. E.g., track the history of how long someone remains a data-manager at a site, or a site-admin at a site, or a central operator. I don't think we know how rapidly different classes of staff turn over, but if we did, we might be able to target training or operations better. Maybe do the same for developers?

Extending monitoring to cover types of information that are not yet recorded

There may well be many questions we would like to address that we can't, because we're not measuring or correlating the right information. E.g. we have information about what a user submits to CRAB in terms of where the job goes, what data it runs on, and which files it opens, but do we have things like the number of threads it spawns, the CPU, memory, and local disk space it uses? Probably at some level we do, but we will doubtless find things that we aren't monitoring today and will need to monitor if we are to answer all our questions.

Providing long-term storage of, and access to, this information

Much of our information is stored in databases, either Oracle or CouchDB. Some is archived, some is compacted and archived, some is not archived at all. We need to review how each source of information is treated and plan accordingly. It's not unreasonable to suggest we should design some sort of data-vault for archiving this stuff.

Providing tools that can analyse the information over different timescales

This is a completely open topic at the moment. Clearly tools such as ROOT and R can be used for statistical analysis, while other approaches (cluster analysis, neural networks) will probably also be needed. As we start analysing the data we have, we should look for opportunities to standardise on tools that allow the analysis to be run routinely, so we can see how our understanding changes with time.

A course of action...

The first thing to do is to start analysing some of the data we have. Summer students, or their equivalent, are a good bet for this. Get some tools that we can use regularly to update results, and get in the habit of looking at them.

Once we have some basic analyses, look at how to correlate them. Is the information from different sources compatible? Can we put it together to get a larger picture?

From the experience we gain with that, we'll get an idea of how much information is missing to complete the picture, or how much needs to be recorded differently to the way we do it today.

Then we can design an information infrastructure to make this information more useful. This may mean modifications in many parts of the DM/WM software to provide information in an appropriate format.

Finally, when we have properly maintained information sources, we can look at modeling the behaviour of the system in a robust manner, one which allows predictions to be made.

I'm not sure how to do all that, but after all, "research is what I'm doing when I don't know what I'm doing".

Some references

I. Sfiligoi, "Estimating job runtime for CMS analysis jobs", J. Phys.: Conf. Ser. 513 (2014) 032087, doi:10.1088/1742-6596/513/3/032087
Topic revision: r8 - 2014-11-11 - TonyWildish