Some ideas on network-aware data-management for CMS

Background

Network data-management for CMS means primarily PhEDEx, but recently also xrootd. PhEDEx is almost ten years old now, and was designed at a time when the network was considered the least reliable component of our distributed computing system. Today the network is the most reliable component: more transfer failures are caused by storage-element failures or misconfiguration than by the network itself.

We also anticipated links of ~100 MB/sec; now we have links up to three orders of magnitude faster than that, pervading the entire hierarchy of the network. This makes much of the core logic of PhEDEx sub-optimal. PhEDEx backs off and retries with algorithms similar to TCP itself, which was also designed for unreliable networks. The result is that we can efficiently transfer datasets on the scale of multiple TB per day, but transfers on shorter timescales are not so predictable; there are too many latencies built into the system.
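For illustration, that back-off-and-retry pattern looks roughly like the sketch below: a generic exponential back-off with jitter, not the actual PhEDEx code (the names and timings are invented). On a network this reliable, every one of those waits is pure added latency.

    import random
    import time

    def transfer_with_backoff(do_transfer, max_retries=5, base_delay=60.0):
        """Retry a transfer with exponential back-off plus jitter.
        'do_transfer' returns True on success. Illustrative only:
        neither the names nor the timings come from PhEDEx itself."""
        for attempt in range(max_retries):
            if do_transfer():
                return True
            # wait ~60s, ~120s, ~240s, ... before the next attempt
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
        return False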

PhEDEx also uses internal statistics to measure link performance, rather than any external metric. There are pros and cons to both approaches, but there is no reason not to combine the two. We currently run LoadTest transfers in the debug instance specifically to maintain knowledge of working links; this constitutes about half of our traffic CMS-wide, just for that monitoring information. (N.B. the debug and production instances do not communicate except via the minds of the operations staff.)
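As a minimal sketch of what combining the two could look like: blend the internally observed rate on a link with an external measurement of the same path (e.g. from perfSONAR). The function and the weighting scheme are hypothetical; nothing like this exists in PhEDEx today.

    def link_quality(internal_rate, external_rate, alpha=0.5):
        """Blend PhEDEx's own recent rate on a link (MB/s) with an
        external measurement of the same path. alpha=1.0 reproduces
        today's behaviour (internal history only). Hypothetical."""
        if external_rate is None:   # no external data: fall back to history
            return internal_rate
        return alpha * internal_rate + (1.0 - alpha) * external_rate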

PhEDEx transfers about 1 PB/week CMS-wide, a rate that has been flat for over 3 years. That corresponds to transferring the entire (current) CMS dataset about 5 times in that interval. We are far from saturating the existing network bandwidth (it is interesting to ask why that is; see below).
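As a back-of-the-envelope check of what those numbers imply:

    weeks = 3 * 52                   # roughly 3 years
    total_moved_pb = 1.0 * weeks     # 1 PB/week -> ~156 PB moved in total
    dataset_pb = total_moved_pb / 5  # the dataset was moved ~5 times over
    print(dataset_pb)                # ~31 PB: the implied CMS dataset size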

Given that our CPU needs will increase dramatically post-LS1, it makes sense to see how we can use the network to improve things by making data more promptly available. The network is reliable and over-provisioned compared to our current usage, so it should be relatively easy to make better use of it. One obvious target is co-scheduling of jobs with data, something we should have been doing a long time ago.

Network-awareness

Fabric-level monitoring and control

Network-aware systems offer a few new possibilities:
  • better monitoring: perfSONAR, among other tools, can give us a clearer view of the state of the network. That can guide us in choosing sources for transfers better than relying solely on internal history. PhEDEx already has hooks to make use of such information, and is awaiting a source of information to plug into them. This is the easy bit.
  • network-control at the link level can make transfers much more deterministic (N.B. careful of the terminology overload: 'link' here is a layer-2 concept, not a PhEDEx link). By reserving a chunk of network bandwidth between two endpoints, we could say with a high degree of certainty that a given transfer will complete within a given time window. This is the 'Bandwidth-on-Demand' (BoD) concept; a sketch of how it might be used follows this list.
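To make the BoD idea concrete, here is a minimal sketch of how a transfer system might use it: given a dataset size and a deadline, compute the bandwidth to reserve and ask the network for a circuit. The reserve_circuit call stands in for whatever provisioning interface the network actually offers; it is entirely hypothetical.

    def schedule_transfer(size_tb, deadline_s, src, dst, reserve_circuit):
        """Reserve enough layer-2 bandwidth between two endpoints to move
        'size_tb' terabytes within 'deadline_s' seconds. 'reserve_circuit'
        is a hypothetical BoD provisioning call, not an existing API."""
        required_gbps = size_tb * 8 * 1000 / deadline_s  # TB -> Gb, per second
        headroom = 1.2                                   # margin for overheads
        return reserve_circuit(src, dst,
                               bandwidth_gbps=required_gbps * headroom,
                               duration_s=deadline_s)
    # with the circuit held, the completion time becomes near-deterministic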

Analysis throughput

One thing we do not measure in CMS is analysis throughput. We measure transfer throughput and job performance, even at the level of groups of jobs in a given CRAB analysis, but we do not measure the end-to-end analysis throughput. By that I mean we do not consider the set of datasets needed to produce a single physics number (signal, background, MC). Doing that requires a higher-level view of the system than we have at the moment, one in which the network becomes an integral part of the data processing.
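As a sketch of what such a metric might look like: the analysis is done only when the slowest of its input datasets has been delivered and processed, so the natural measure is the critical path over the whole set, not any single transfer. Everything below is hypothetical, purely to illustrate the definition.

    def time_to_physics(request_time, datasets, processed_time):
        """Turnaround for one physics number: from the moment the inputs
        (signal, background, MC) were requested until the last of them
        has been delivered and processed. 'processed_time' is a
        hypothetical lookup returning a completion timestamp per dataset."""
        return max(processed_time(d) for d in datasets) - request_time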

Advanced Network Services for Experiments (ANSE)

There are many regional projects that (claim to) provide BoD over limited parts of the network (OSCARS, DYNES, AutoBAHN), but none that provides end-to-end services across a significant chunk of the CMS network topology. An ongoing project, ANSE (Harvey Newman et al.), working in conjunction with the LHCONE/LHCOPN WLCG networking group, aims to fill that gap by providing high-level 'network middleware' for the experiments. They have hired a developer to work on PhEDEx for ANSE over the next 18 months; that developer will work (with me) on the FileRouter algorithms to exploit these capabilities, as well as with the ANSE team on the middleware framework that makes it all run.

ANSE will provide a usable prototype if we invest enough effort in guiding it. There may then be a follow-up project to make a production-scale product; that is yet to be determined.

PhEDEx and the network

Apart from the somewhat obsolete attitude towards the network, PhEDEx has other features that can be re-examined. In particular, the static network topology can be revised.

A PhEDEx node today serves two purposes. It owns the data at a site, with responsibility for maintaining it and making it available for use. It also serves as a network endpoint, a source and/or sink of data. Much of the overhead of managing a PhEDEx node comes from the static view, that of the node as something that maintains data over time. There is no reason we cannot factor that out and create a totally dynamic type of PhEDEx node, one which appears, manages data transfer to/from a point in the network, then disappears. This opens the possibility of using opportunistic resources for serious data movement too, but only if we can achieve the goal of moving TBs of data on timescales of tens of minutes, not hours.

This isn't difficult; in fact it's essentially what we already do when scale-testing PhEDEx, submitting a batch job to run the agents for a given site.
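A sketch of what that lifecycle could look like, essentially automating the scale-test procedure: bring the agents up at some (possibly opportunistic) endpoint, move the data, tear everything down again. The class, its methods, and the site/dataset names are all invented for illustration.

    class EphemeralNode:
        """A PhEDEx-like endpoint that exists only for the duration of a
        transfer: no long-term stewardship, just data movement. The method
        bodies are placeholders for the real provisioning steps (batch
        submission, agent start-up, ...)."""

        def __init__(self, site):
            self.site = site

        def __enter__(self):
            print(f"{self.site}: register endpoint, start transfer agents")
            return self

        def transfer(self, dataset, source):
            print(f"{self.site}: pull {dataset} from {source}")

        def __exit__(self, *exc):
            print(f"{self.site}: stop agents, deregister endpoint")

    # the node appears, moves the data, then disappears
    with EphemeralNode("T3_Opportunistic_XX") as node:
        node.transfer(dataset="/Some/Dataset/RAW", source="T1_Somewhere")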

The bottom line is that PhEDEx would become less like Asimov's MULTIVAC and more like a lightweight personal service, like a CDN.

Conclusions

  • the network is far more reliable and performant than we anticipated a decade ago. We should exploit that.
  • there is effort available via ANSE and LHCONE/LHCOPN to provide us with much more functionality from our networks than we use today. We should exploit that.
  • if we succeed, we may significantly change the way we work on the 4-6 year timescale.
  • we should consider re-factoring PhEDEx to separate the static data-stewardship role from the dynamic data-movement role, with a view to making data movement more dynamic in terms of both network speed and network topology.

See also...

https://twiki.cern.ch/twiki/bin/view/CMS/PHEDEXSupportForDynamicCircuits