The Future of Data-Management, in CMS and maybe beyond

This page holds some of my ideas about data-management, how it could evolve in CMS, and how our tools could be made useful to communities outside LHC or HEP.

The goals are:

  1. consider how to manage data in CMS at all the scales we use. These scales are:
    • Data-volume: from two CMS users sharing root files between their laptops up to the entire collaboration managing the sum total of CMS data. So it covers the range from a few GB of data up to hundreds of PB of data.
    • Number of nodes: from two laptops up to the set of managed storage elements in CMS, and beyond to every computer used by any user in CMS, including opportunistic resources such as clouds or loaned clusters available on different timescales (hours, days, weeks).
    • Network topology, which ranges from the LHCOPN/LHCONE environment out to home DSL networks, mobile device networks, and direct personal area networks (PANs).
  2. look at external technologies that can be used either directly (as components of a larger system) or as inspiration for their architecture
  3. consider how we can re-factor CMS data-management tools to support a broader range of use-cases, either in CMS or beyond

Some sample use-cases

These are just for illustrative purposes, and are not all meant to reflect any concrete needs of CMS today. Some are clearly real, some are just made up.
  • Custodial storage of experiment RAW data. The classic reason for PhEDEx. Data must be transferred reliably, with reasonably low latency (to avoid backlog at the Tier-0 or online cluster), and bookkeeping must be impeccable.
  • WAN access to data from a worker node. This is what AAA does, allowing a batch job to access data from anywhere in CMS rather than fail when local access doesn't work.
  • Migration of user-generated files to safe storage. This is what ASO does, marshaling and storing files produced by analysis jobs around CMS.
  • A physicist wants data to help debug analysis code while on a long plane flight. They would like to browse the data-catalogue and request the transfer of 200 GB of a particular dataset to their laptop. They don't care which files they get, as long as they get a reasonable sample.
  • A physicist wants to monitor detector performance by always having a sample of the latest files from a given (long-lived) dataset available on their desktop. They want something like a cron-job to pull down 200 GB per night and analyse it, purging the data before fetching new files.
  • A detector group produces private monitoring data at the pit. They want to archive it reliably on tape at CERN. The volume is only a few GB per day.
  • Two (or a few) physicists want to share some private ntuples. They are both working on variations of the ntuple, and both want to have all versions synchronised between their desktops or laptops.
  • A cloud resource is available to CMS for 8 hours every night (e.g. a company lends us their computing farm when it's idle). We need to pump data in fast when we have access, then out again fast before the end of our slot, in order to make maximal use of the CPU.
  • A site at the 'edge' of the CMS network (i.e. with relatively low bandwidth connectivity) needs to shape its traffic so that its users can get priority during working hours, but bulk transfers take priority outside that time.
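The nightly-monitoring use-case above (pull down 200 GB of recent files, any reasonable sample will do) can be sketched in a few lines. This is a minimal illustration only: `pick_sample` and `TARGET_BYTES` are invented names, and the catalogue query, fetch, and purge steps are all elided.

```python
# Sketch of the "any reasonable sample within a byte budget" selection
# logic for the nightly monitoring pull. Hypothetical helper, not a
# real CMS API; input is a list of (name, size_in_bytes) pairs.
import random

TARGET_BYTES = 200 * 10**9   # nightly budget: 200 GB

def pick_sample(files, budget=TARGET_BYTES):
    """Return a random subset of file names whose total size fits the budget."""
    files = list(files)
    random.shuffle(files)            # the user doesn't care which files
    chosen, total = [], 0
    for name, size in files:
        if total + size <= budget:
            chosen.append(name)
            total += size
    return chosen, total
```

A cron entry would then purge the previous night's sample, call something like this against the data-catalogue, and fetch the chosen files.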

Dimensions of Data-Management

A useful way to look at data-management is to consider the different dimensions of any data-management problem. Different use-cases have different characteristics that we can use to group them into classes. Some obvious dimensions are:
  • Data-volume: How much data is being handled? We can subdivide this into the total pool of data being managed and the amount in use at any given time. So this will range from the entire CMS dataset down to a single data-block, or to a set of ntuples shared among a group of physicists.
  • Network structure. This has several components, such as:
    • Number of nodes. Is the network peer-to-peer or hierarchical? Even peer-to-peer may have special characteristics, in that buffers may need to be managed at either the source, the destination, or both.
    • Node topology. Is it static or dynamic? Do nodes regularly join/leave/join/leave the network (such as a laptop that is often paused)? Do nodes join, leave, and never come back (such as transient cloud resources)? Or are they static, such as storage elements in computing centres? For dynamic nodes, in particular laptops, they may join or leave the network in different locations as the user travels around CMS with the device.
    • Types of data-flows. Are the nodes in a network all involved in the same data-flow, or in several simultaneous data-flows, or in several sequential data-flows? Are data-flows bursty (e.g. raw data from the detector) or continuous (Monte Carlo production)? How often do new data-flows come along?
  • Throughput: What rates of movement are involved? Are there deadlines, and if so, what are the penalties or actions to be taken if they are missed? Is there a benefit from managing traffic? (This could be true on any network, from PAN to WAN.)
  • Latency: Does it matter if the data arrives today, tomorrow, next week?
  • Reliability: is it important that every file be delivered, or will some fraction be OK? If only a fraction, which fraction (newest, random, uniform sample)? Should the files arrive in some pre-determined order, or does that not matter?
  • Users and security: How many users does the system deal with? What kind of authentication, authorisation, and accounting are needed?
  • Metadata management: What bookkeeping is needed, with what lifetime? How are the list of input files determined? Who owns the files? Bookkeeping could simply be done by the existence of the file on the destination ('processing by dropbox') or by full replica management such as PhEDEx does today. Related to this is the user-interaction with the system, how they request data-movement, how they monitor it, how they manipulate it (pause, resume, cancel, change priority...)?

Not all of these are orthogonal. For example, guaranteeing high reliability means you can't guarantee low latency, since source files may become unavailable for a while for various reasons.

Note that I don't regard transport protocol (HTTP, FTS etc) as a dimension of data-management. There is no use-case for which the user can validly specify the protocol to be used as a formal requirement. The practical realities of the situation may constrain the choice of protocol today, but if we're looking forward then that should not be allowed to constrain us.

So, to summarise, these dimensions are:

  1. Data-Volume (small, large)
  2. Network Structure (#nodes, topology)
  3. Types of Data-Flow (number of flows, burst/constant)
  4. Latency (does it matter or not)
  5. Reliability (guaranteed or not)
  6. Users and security (single-user no security, multi-user full authentication...), and
  7. Metadata requirements (input file-discovery, transfer monitoring, output file bookkeeping)
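The seven dimensions above can be made concrete by encoding them as a record, so that use-cases can be compared programmatically. This is a sketch only: the field names are this page's terms, not an existing CMS data structure, and the two example profiles are taken from the use-cases listed earlier.

```python
# Sketch: the seven data-management dimensions as a record.
# Field names follow this page's terminology; purely illustrative.
from dataclasses import dataclass

@dataclass
class DataFlowProfile:
    volume: str            # 'small' or 'large'
    topology: str          # 'static' or 'dynamic'
    flows: str             # 'bursty' or 'continuous'
    latency_critical: bool
    guaranteed: bool       # reliability: every file must arrive
    multi_user: bool
    bookkeeping: str       # 'dropbox' ... 'full-replica'

# Custodial RAW data transfer (the classic PhEDEx case):
raw_custodial = DataFlowProfile('large', 'static', 'bursty',
                                latency_critical=True, guaranteed=True,
                                multi_user=False, bookkeeping='full-replica')

# The physicist pulling a 200 GB sample to a laptop:
laptop_sample = DataFlowProfile('small', 'dynamic', 'bursty',
                                latency_critical=False, guaranteed=False,
                                multi_user=False, bookkeeping='dropbox')
```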

CMS Data-Management tools

PhEDEx

PhEDEx is the flagship data transfer-management tool for CMS. It manages large data-volumes on the WAN with a static topology. Data-flows are many and varied, latency is not guaranteed but reliability is high. Transfers are driven entirely through the knowledge of replicas that are statically managed, maintained in TMDB.

PhEDEx has been developed for over ten years now, and in recent years we have pushed it to become a more re-usable framework, rather than a set of customised tools. Despite this, it still isn't fully factored along the lines of the dimensions listed above.

  • Input-file discovery and data-movement are inextricably bound with replica management, so PhEDEx can only move files that are registered in its database and PhEDEx then serves as the source of knowledge of what files exist. Separating the file-movement layer from the replica-management layer would make sense. This would require a PhEDEx-transfer-task and a PhEDEx-transfer-report interface to be defined, so clients can provide their own list of files and do their own bookkeeping afterwards.
  • The nodes in PhEDEx are statically registered, with permitted links defined in the database and link performance monitored/recorded purely internally by PhEDEx. Adding a new node to PhEDEx and commissioning links is a lengthy process. For a site that's going to exist for a long time, and store and serve data, that's OK, but for a site that only wants to pop into existence long enough to pull down a block of data it's not good enough. Ideally we will have an API that allows a machine to declare itself to PhEDEx (or any other data transfer management tool) and to manage its links, its lifetime, and its local data catalogue.
  • PhEDEx has a very poor view of the network topology, in that it has its own naming convention for sites and these names are totally decoupled from the physical structure of the network. Relating the PhEDEx topology to IP numbers, for example, is non-trivial. This makes it hard to use external sources of network performance data, like PerfSONAR. This also makes it harder for PhEDEx to interact with the network fabric, such as setting up and managing virtual circuits in order to guarantee an efficient data-path (incidentally, first results from ANSE+PhEDEx were shown at the 2014 ISGC conference in Taipei in March).
  • PhEDEx attempts to be reliable, in that if you ask for a file to be transferred, it will not give up. In the event of failures it will back off and retry, indefinitely, until the file is transferred or the request is cancelled by an operator. This is of course what you want for custodial data transfer, but does not match the low-latency/low-reliability use-case of someone who simply wants some of the data in a dataset to be able to test and debug their code. This could be more flexible if we had an API to tell PhEDEx what characteristics the user needs, and of course the code to implement them. I'm not sure what that actually means; we'd need to develop some use-cases to explore it.
  • Likewise, PhEDEx also determines the order in which files are sent - this is not always simply chronological or alphabetic - and it decides when to schedule each particular data-flow among the set of data-flows to a destination. Beyond crude high/normal/low priority switches, there's nothing to say that you want a certain amount of data every so often, or to filter the set of files you want. Again, we need to develop use-cases to explore this possibility.
  • PhEDEx has relatively long lead-times before transfers start. The design is based on keeping queues full, rather than on making any particular transfer happen as fast as possible. So the time from a request being made to the time the first file transfer is attempted can be 30 minutes. It would be good to refactor this and bring down the latency, to less than one minute.
  • Related to the previous point, PhEDEx manages 'file-routing', which means the selection of source sites for a destination. Transfers are all driven by the destination, which pulls data, but the central agents determine what source replica to use for each destination. This is based on the internal statistics and knowledge of which site agents are up, on the assumption that sites whose agents are down are probably in downtime and therefore not accessible. There may be an advantage to changing this model, and letting the site (i.e. the FileDownload agent) determine the source replica to transfer from by itself. This would fall out naturally from separating replica management from transfer management.
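The first two refactoring points above (decoupling file-movement from replica management via transfer-task and transfer-report interfaces) might look something like the following. `TransferTask`, `TransferReport`, and `execute` are hypothetical interfaces, not existing PhEDEx code; the point is that the client supplies its own file list and receives a report it can book-keep however it likes, with no TMDB lookup involved.

```python
# Sketch of the proposed split between file-movement and replica
# management. All names here are hypothetical; node names are
# illustrative CMS-style site names.
from dataclasses import dataclass, field

@dataclass
class TransferTask:
    source: str
    destination: str
    files: list            # client-provided list, not taken from TMDB

@dataclass
class TransferReport:
    task: TransferTask
    succeeded: list = field(default_factory=list)
    failed: list = field(default_factory=list)

def execute(task, transfer_backend):
    """Move files; return a report rather than updating any internal DB.

    transfer_backend(source, destination, filename) -> bool is a
    stand-in for the actual transfer machinery (FTS, xrootd, ...)."""
    report = TransferReport(task)
    for f in task.files:
        ok = transfer_backend(task.source, task.destination, f)
        (report.succeeded if ok else report.failed).append(f)
    return report
```

The caller then does its own bookkeeping from the report, which is exactly the separation the bullet above argues for.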

Use outside CMS

In terms of adoption by other communities, this is a topic that comes up roughly every year or two. PhEDEx is only loosely coupled to the CMS data model, in that it requires a hierarchical description of data in Dataset -> Block -> File, but nothing else. There is no requirement on the file type, parentage, or metadata; PhEDEx never looks at it. It cares about size and checksum, but doesn't care which checksum it gets (adler32, cksum, anything will do).

While PhEDEx works best given large files organised into groups of 100-1000 files per block, it has no internal constraints that enforce this. A block may contain a single small file, or thousands of very large files. In simulations we've handled a 1 PB block before now without difficulty.

The main stumbling block for adoption of PhEDEx by other communities has always been the database. PhEDEx is based on Oracle, and makes heavy use of its features:

  • constraints are used to ensure the integrity of the data
  • triggers are used for bookkeeping, to update summary tables when transfers complete etc.
  • PL/SQL procedures are used to perform sensitive actions, such as registering a new node. The alternative would be to grant operators direct access to the tables, which would be much less secure
  • Database roles are used for access-control. This is a key element of our security, in fact we've been moving more and more in this direction in recent years.

For communities that don't wish to buy an Oracle license, PhEDEx could only be adopted if it were to be ported to another RDBMS. If not all the required features are available we would need to find a new way of doing things that gives the same result, or live with a reduced set of functionality.

For some use-cases, it may be that these features are not all needed. For example, the movement of ad-hoc files between two endpoints might not need a database at all, with the input and output dropboxes serving as the bookkeeping. If replica-management, history and monitoring are to be maintained, constraints and triggers will be needed.

PL/SQL and database roles are more important for the security model, which would need to be assessed by each potential customer. If they have a security model that is radically different to that of PhEDEx, maybe these features would not be needed.

Related to the security model, user identities are managed by SiteDB, another Oracle database. Specifically, this maps a user's public grid certificate to their roles. This ought to be fairly simple to port to another RDBMS if required, or to implement in other ways.

Porting the PhEDEx schema to a different RDBMS would be a very interesting, but major, project. If it's to be done we should get interested external parties to participate, so we can drive the work with explicit use cases. That would be far more productive than simply guessing what someone else wants or needs. The schema should be split in layers with dependencies such that core functionality can be implemented without relying on other features, and more functionality can be added as required. Obviously this would require the agent code to be refactored too, to take advantage of the layering.


AAA

(this needs work!)

AAA is radically different to PhEDEx. It operates at the level of single-file transfers, so has no concept of 'data-flow'. It transfers files with very low latency, does its own source-discovery in real-time, and does no replica management at the destination (i.e. the file is delivered and forgotten about). The network structure is determined by the map of xrootd redirectors, so is not coupled to PhEDEx.

AAA does not, to my knowledge, have any form of user-identity management. Like PhEDEx, it's just a service that delivers the file without regard for ownership.

One interesting short-term idea is to use AAA as a transfer backend for PhEDEx. That bypasses the source-selection mechanism and would let PhEDEx get the file from the first site that replies to the request. This isn't suitable for large-scale use in its current form, since PhEDEx would have wrong statistics about the source-destination link performance if the file were served from somewhere else, but it could be used for the first retry on transfer failure. That would not mess with the statistics too much, but would hopefully solve most of the transfer-failures we see.
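The first-retry-via-AAA idea above can be sketched as follows. `phedex_transfer` and `aaa_fetch` are stand-ins for the real machinery: the former uses PhEDEx's routed source replica, the latter lets the xrootd redirector pick whichever site answers first.

```python
# Sketch of 'AAA as first-retry backend' for PhEDEx. Both transfer
# functions are hypothetical placeholders returning True on success.
def transfer_with_aaa_fallback(filename, phedex_transfer, aaa_fetch):
    if phedex_transfer(filename):
        return 'phedex'      # normal path: link statistics stay correct
    # First retry goes via AAA instead of re-queueing in PhEDEx, so a
    # single failure doesn't distort source-destination statistics much.
    if aaa_fetch(filename):
        return 'aaa'
    return 'failed'          # hand back to the normal PhEDEx retry loop
```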

Other useful ideas are to cache the files, either at the endpoints, or internally in the network itself (see the section on caching, below)


ASO

(this needs work!)

ASO, Asynchronous StageOut, is a new component in CRAB3, designed to work also as a standalone tool. It relies only on a central NoSQL database, CouchDB, for input and data storage. Implementation was rapid with this technology, since no detailed database design was required, and the schema-less nature of NoSQL makes it easy to constantly and rapidly incorporate new types of data as applications like ASO evolve. CouchDB natively exposes a REST interface, and its integrated web server made it easy to prototype and implement the ASO web monitoring by serving the application directly to the browser. Furthermore, no separate deployment of the monitoring application is required, since it is encapsulated in CouchDB.

ASO is designed as a modular application. It has progressed from a limited prototype to a highly adaptable service that manages all the stages of CMS user file handling. The core ASO components are:

  • AsyncTransfer: retrieves the user proxy/group/role, groups user files into jobs, and submits them to FTS
  • Monitor: interacts with FTS to monitor transfer status and provides the results to the Reporter component
  • Reporter: updates the document database, CouchDB, with transfer status
  • RetryManager: retries failed files up to max_retry times (a configurable parameter)
  • FilesCleaner: removes transferred files from the temp area of the site where the job ran
  • DBSPublisher: publishes transferred files
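The component chain above can be caricatured as one document's life-cycle. This is a deliberately simplified sketch: the real CouchDB documents and state machine are much richer, and the three callback functions are stand-ins for FTS submission, FTS polling, and DBS publication.

```python
# Toy sketch of one file-document flowing through the ASO components
# listed above. All three callbacks are hypothetical stand-ins.
def aso_lifecycle(doc, fts_submit, fts_status, publish):
    # AsyncTransfer: group into a job and submit to FTS (grouping elided)
    fts_submit(doc)
    doc['state'] = 'submitted'
    # Monitor + Reporter: poll FTS and record the outcome in CouchDB
    doc['state'] = fts_status(doc)
    # DBSPublisher: publish only files that transferred successfully
    # (a failed state would instead go to the RetryManager)
    if doc['state'] == 'done':
        publish(doc)
        doc['published'] = True
    return doc
```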

Typically, users write their output files to local storage on the farm their jobs run on, and ASO then deals with staging the files from there to their home site for further analysis. Estimated rates are 200K-300K files per day, for more than 100 users per day. Files are then grouped and published in bulk into local-scope DBS for further analysis. Files vary in size from o(MB) to o(GB), depending on the size of the input dataset, the user's code, and the splitting algorithm parameters.

ASO sits in a different corner to both PhEDEx and AAA. It deals with multiple users, and each user has one or more data-flows (the set of files from their analysis jobs) that must be kept separate. File-discovery consists of CRAB3 pushing file information into its database via a REST API, and bookkeeping is done in the same local database. Transfers are queued and throttled to a level that won't overload storage elements, so latency can grow if the system is saturated. Reliability, on the other hand, has to be high. The network topology is defined by the set of analysis centres (the T1/T2/T3 sources) and the T2s/T3s that host users' personal data (the destinations). Transfers are performed across full-mesh links not necessarily commissioned for high-bandwidth/high-reliability transfer.

This is very PhEDEx-like in many respects. The multi-user aspect (user proxies, priorities between users, priorities between a user's files...) is different, the scheduling logic is simpler, and there is no complex replica management; otherwise they are very similar. That is why the integration of the PhEDEx monitor into ASO was quite easy. This integration is a proof-of-concept toward a common DM system.

Cessy -> Meyrin transfer system

For completeness, this is worth mentioning, since it's a dedicated transfer system. It's a bit different to everything else in that it's LAN, not WAN, which means it can realistically aim for low latency and high reliability at the same time. It's also single-user and single-path, with deterministic data-flows.

While the primary system is for detector data, there's at least one other, separate system, for sub-detector data that also needs archiving.

Non CMS-specific tools & technologies

(this needs work!)

PerfSONAR


PerfSONAR knows more about the state of the network than PhEDEx ever will. We should push for it to be adopted fully and supported everywhere, and also investigate what it means to use it outside the static grid topology of LHC computing sites. E.g. does it make sense for me to use it at home, or from my laptop at a conference?

Content-delivery networks and Caches

We currently store data at the end-points of our networks, the storage elements at CMS or LCG sites. That's not the only place we could store data, and we should consider the alternatives. Broadly speaking, these fall under Content-delivery networks (CDNs) and caching strategies.

CDNs are extensively used by services such as Amazon and NetFlix, and clearly work well for them. They rely on pre-placing multiple copies of data at strategic points around the network, close to clients. This is well suited to situations where data-flows are not large (e.g. a single movie), or where only a fraction of your total data store is popular, so you don't have to replicate the whole lot too many times. It fits well where the same file is needed by many users, so you get a high hit-rate from successive clients. It's also well-suited to networks where latency is an issue, again movies being a good example. CDNs are also very useful when you can predict your traffic reliably; when a new movie is released, for example, it's clearly a good idea to replicate it in advance at your CDN nodes.

Caching is very similar to CDNs. If there's a distinction to be made, it's probably that caching is reactive, while CDNs can be pro-active. Caches can be placed at the destination node, to serve only the local users, or in the network itself, to serve an entire region from a hub. One interesting variation on caching is 'Named Data Networks', which address (and cache) objects by their own metadata, not by their source description. E.g., instead of caching 'this copy of that file', it caches 'this copy of that object', which may be a file, a block of files, or even single events. There's a detailed paper on NDN at

It's less clear how well CDNs and caches would work for us. We may have single clients that need large volumes of data (tens of TB) which nobody else is using, or which are used only infrequently so do not justify an opportunistic replica. Or we may have two clients on opposite sides of our network who are using the same files, but have no common network resource on which a cached copy could be kept. In short, we may end up thrashing the cache and achieving a very low hit-rate unless we get smart.
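The thrashing concern can be illustrated with a toy LRU cache: under a uniform, uncorrelated access pattern over a catalogue much larger than the cache, the hit-rate collapses to roughly cache_size/n_files, whereas a working set that fits in the cache hits almost every time. All numbers here are made up for illustration.

```python
# Toy illustration of cache thrashing under uncorrelated access.
# A plain LRU cache over uniformly random file requests.
from collections import OrderedDict
import random

def lru_hit_rate(n_files, cache_size, n_accesses, seed=1):
    random.seed(seed)
    cache, hits = OrderedDict(), 0
    for _ in range(n_accesses):
        f = random.randrange(n_files)
        if f in cache:
            hits += 1
            cache.move_to_end(f)          # mark as most recently used
        else:
            cache[f] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / n_accesses
```

With a 100-file cache and a 10,000-file catalogue the hit-rate is near 1%; when the working set fits the cache, it approaches 100%. This is why a predictive analysis model matters: correlating accesses is the only way to keep the effective working set small.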

Making better use of caches means having a better understanding of how analysis works in CMS, and of planning analysis activities around the opportunities for caching. These are both very hard problems. Today, we barely understand analysis at the level of one user submitting jobs via CRAB to run on one dataset. We know nothing about correlations between users, between the datasets a single user uses, or among the sites and topology that hosts and processes the data. We will need to produce a proper analytical model of analysis before we can design an effective caching strategy that doesn't require vast amounts of storage. I have some ideas on how to go about producing such an analysis model on another page.

Once we have an analysis model, we can examine how to shape our analysis traffic to make best use of the network. Sending jobs for the same dataset to the same location is one option, as is scheduling jobs for the same dataset to run within the cache-lifetime of the data they are accessing. This can, of course, also lead to overloading resources by creating demand peaks. Without an analysis model which has predictive power, we're blind.

Virtual circuits, bandwidth on demand

Apart from caching in its various guises, the actual movement of data can be improved. We currently have more bandwidth than we strictly need for our most important data-flows, but we are not rich in bandwidth everywhere in CMS. Some sites are poorly connected, and will remain so relative to the rest of CMS for some time. We run the risk of having our own 'digital divide' among groups which can access data easily and those that cannot.

Technologies such as virtual circuits and bandwidth on demand (BoD) would allow us to control our traffic better. In some cases, using circuits can give us access to network routes that would otherwise not be available. A combination of these approaches could make file-transfers much more deterministic than they are today. E.g. to transfer a large dataset today may take 24 +/- 6 hours, due to varying network conditions. With controlled traffic, this might be changed to 12 hours +/- 30 minutes. That makes it sensible to co-schedule the CPU with the storage, which can help optimise the overall system.
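The 24-versus-12-hour comparison above can be made concrete with a back-of-the-envelope calculation. The dataset size, link rate, and efficiency figures below are illustrative assumptions, not measurements: the point is that a reserved circuit mainly changes the achievable *efficiency* (and its variance) on a given path.

```python
# Back-of-the-envelope transfer-time estimate. All figures are
# illustrative assumptions chosen to match the 24h vs 12h example.
def transfer_hours(dataset_tb, rate_gbps, efficiency):
    bits = dataset_tb * 8 * 10**12            # TB -> bits
    seconds = bits / (rate_gbps * 10**9 * efficiency)
    return seconds / 3600

# 100 TB over a nominal 20 Gb/s path:
shared  = transfer_hours(100, 20, efficiency=0.45)  # contended, variable
circuit = transfer_hours(100, 20, efficiency=0.9)   # reserved bandwidth
```

With these assumed numbers, the shared path comes out near 25 hours and the circuit near 12, and in practice the circuit's figure would also have far less spread, which is what makes CPU/storage co-scheduling feasible.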

Use of virtual machines for deploying services

We already use VMs extensively in production services, but not much outside that. A typical modern laptop is capable of running one or two virtual machines already, and this is only going to get easier with time. Rather than requiring users to install software on their laptops (and therefore having to support multiple operating systems) it should be easy to deploy services as turnkey solutions in VM images.

It's not hard to envisage a system where users install a VM manager on their private machines and we supply the local grid services to them as VM images. With tools like VagrantUP it's very easy to roll out an image that can be used everywhere. PhEDEx has already made some steps in this direction for ANSE (see the PhEDEx Testbed for ANSE page for details).

The hard part is auto-configuring the services, so they can connect to information sources, be properly authenticated and authorised, and run straight out of the box. Doing this securely implies using the user's grid certificate to identify them and having robust knowledge of what they are and are not allowed to do, stored in a database somewhere (SiteDB is the obvious choice today).

Related to this, and a natural stepping-stone towards it, is the idea of a testbed-on-demand. We now deploy all our production services in ways that are becoming more and more standardised. Yet for developers, there are often several extra hoops to jump through to get a working setup. For example, testing new versions of the PhEDEx web suite often requires connecting to private databases (where I run small emulations of network traffic) or to production or pre-production services. Likewise, testing DAS requires connecting to production services for PhEDEx and DBS. It would be very useful to be able to deploy a set of VMs as a coherent testbed, with either production, pre-production, or private instances of each service, selected as the user desires.

As a specific use-case for this, in order to test and debug transfer problems, it would be useful to have a standalone testbed consisting of a private Oracle database, a CMSWEB server with PhEDEx configured to point at the private Oracle instance, and two FTS servers with a small disk pool and a set of associated PhEDEx endpoints serving as the PhEDEx nodes. The TFCs and FileDownload agent configurations should be derived automatically, and the Oracle database populated with the relevant topological information.

Possible Action Items

for PhEDEx

  • factor out the network topology, so sites can be added and removed dynamically.
  • decouple file-movement from replica discovery at the level of the database, so file-movement can be triggered by sources other than those known in the database.
  • decouple transfer result reporting too, so PhEDEx can report transfer results to services other than itself.
  • reduce latencies for start of transfer from ~30 minutes to less than a minute, assuming the destination is not over-booked at the time.
  • it would be good to port the database schema to MySQL, Postgres, or some other RDBMS which is cheaper than Oracle!

-- TonyWildish - 31 Mar 2014

Topic revision: r11 - 2014-07-25 - HassenRiahi