Support for dynamic circuits and network-aware topologies

Notes from an email

I have had a few email discussions about my ideas on how PhEDEx could interact with a network where bandwidth can be reserved on demand. Here's a summary of the ideas I came up with:

PhEDEx is the CMS data-placement management tool; it is responsible for scheduling the transfer of CMS data across the grid, using FTS, SRM, FTM, or whatever transport package people prefer. PhEDEx typically queues data in blocks of several tens of TB, up to about 12 PB, for a given source and destination. As such, it is a natural candidate for using dynamic circuits, and could be extended to make use of such an API.

The first question for me is where in the stack such a dynamic reservation would be made. It could, in principle, be in the guts of PhEDEx, or at some lower level (an FTS server?) such that PhEDEx remains oblivious to its existence. It depends on whether the circuit is to be built before the transfer starts, or in response to the observed traffic on the wire.

In either case, I need to know how the experiment can set policies for reservation. IIUC, the following statements are true:

  • reserved circuits may give access to bandwidth which is not available to non-reserved traffic, at least in some circumstances, depending on the endpoints in question.
  • whether using a reserved circuit or not, there is competition for bandwidth. On a non-reserved circuit, users fight it out in the usual TCP manner. On a reserved circuit, users compete to make a reservation - I may not manage to reserve a circuit if the resources are fully booked.

If those are true, then there has to be a concept of policy, budget, or penalty for under-utilised reservations. Otherwise, my best strategy is to book as much as I can, everywhere, always, whether I use it or not. When I do use it, I will get maximum performance. However, that will hurt other users, who will then be unable to book their own circuits, so they will attempt to follow the same strategy. It's a classic tragedy of the commons.

So I need a policy for deciding when to book a circuit, and the simplest policy I can imagine is: use a circuit if the performance I would get with a circuit significantly exceeds what I would get without one.

That immediately implies I need performance figures for transfers with and without circuits. I can manage that myself, providing I know when a circuit is in use. The interesting bit comes when I do decide that it's worth using a circuit.

Take, for example, the case where I have 1 PB of data to transfer between two points. If I don't reserve a circuit, I expect to transfer at 1 Gbps (for example), so it will take 8 million seconds to arrive. Let's assume I can book a circuit with bandwidth from 0 to 10 Gbps. Clearly booking a circuit below 1 Gbps is not worthwhile (unless I prefer determinism over speed, which I don't). But what if my storage elements cannot handle more than 5 Gbps? I may well have an upper limit too.
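
To make the arithmetic explicit, here is a quick back-of-the-envelope sketch in Python, using just the numbers from the example above:

    volume_bits  = 1e15 * 8   # 1 PB expressed in bits
    baseline_bps = 1e9        # expected rate without a circuit: 1 Gbps
    se_limit_bps = 5e9        # storage-element ceiling: 5 Gbps

    print(volume_bits / baseline_bps)   # 8e6 seconds, roughly 3 months
    # A circuit is only worth booking for bandwidth in the (1, 5] Gbps window.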

So, for this transfer, I want to book a circuit which will give me more than 1 Gbps, less than 5 Gbps, and keep that circuit long enough to transfer 1 PB. I want to call an API:

book_circuit(bw_min=1 Gbps, bw_max=5 Gbps, volume=1 PB)

If I can't book a circuit with those constraints, I may be better off not booking the circuit at all. My transfer will be slower, but I will not be locking up resources by booking a circuit which will not improve my situation.
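
In rough Python, the decision might look like the sketch below. Note that book_circuit, its keyword arguments, and the idea that it returns None on failure are all my own assumptions here, not a real API:

    def plan_transfer(volume_bits, gpn_bps, se_limit_bps, book_circuit):
        """Book a circuit only if it can beat the shared network."""
        # Hypothetical call, with the constraints described above.
        reservation = book_circuit(bw_min=gpn_bps, bw_max=se_limit_bps,
                                   volume=volume_bits)
        if reservation is None:
            # Nothing satisfies the constraints: fall back to the general
            # network rather than lock up resources for no gain.
            return None
        return reservation

    # Example with a stub network that can never reserve anything:
    print(plan_transfer(8e15, 1e9, 5e9, lambda **kw: None))   # -> None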

A variant on that: it may be worthwhile to book the best circuit I can, within the bandwidth constraints, that will get a significant fraction of the data through. The rest can then fight for bandwidth on the general network, or I may be able to book a new circuit later (if someone else has freed a resource by then). So I may want to call an API:

book_circuit(bw_min=1 Gbps, bw_max=5 Gbps, volume_min=100 TB, volume_max=1 PB)

Whichever API I call, I expect to get back a structure which tells me (see the sketch after this list):

  • what bandwidth I actually have reserved
  • how long it is reserved for
  • an ID, or some handle that I can use to cancel this reservation (useful, for example, if my SE goes down)
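
A minimal sketch of that structure in Python; the field names are my own invention, not an agreed interface:

    from dataclasses import dataclass

    @dataclass
    class Reservation:
        circuit_id: str        # handle for cancelling or re-booking later
        bandwidth_bps: float   # the bandwidth actually granted
        expires_at: float      # when the reservation ends (epoch seconds)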

N.B. the API may return a circuit optimised for maximum bandwidth or for maximum data-volume it can carry. That's another discussion!

Once I get that functionality, there's another question that pops up: how often should I try to book a circuit? Should I make long-lasting bookings, or a series of short ones?

In the example above, if I book a circuit and get a 1 Gbps reservation, it will take me 8 million seconds for the entire transfer. That's 3 months! I may well be better off if, instead of holding that reservation for the full 8 million seconds, I try to re-book it after a while, when someone else has freed some bandwidth. But if I can't re-book, I want to keep the previous reservation. So, after transferring the first 100 TB, I might want to try this API:

rebook_circuit(bw_min=1 Gbps, bw_max=5 Gbps, volume_min=100 TB, volume_max=900 TB, circuit_id=MyID)

Alternatively, I may wish to specify an upper time limit in my original booking, after which I will try again:

book_circuit(bw_min=1 Gbps, bw_max=5 Gbps, volume_min=100 TB, volume_max=1 PB, duration_max=1 million seconds)

then, after a million seconds, I try to book another circuit.

For obvious reasons, I prefer the first strategy: book a circuit for a long time and try to re-book for more after a while. If I only book for a short time, I may not be able to get anything at all later, in which case I'm worse off. A sensible policy would be to book the best circuit I can get that covers the whole transfer and, if I'm not happy with it, try every hour or so to see if I can get something better.
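
That policy could be sketched roughly as follows, re-using the hypothetical Reservation structure from above; rebook_circuit and the one-hour retry interval are again assumptions, not a real interface:

    import time

    RETRY_INTERVAL = 3600   # try for a better circuit every hour or so

    def transfer_with_rebooking(reservation, volume_bits, rebook_circuit):
        """Keep the current reservation, but periodically try to upgrade."""
        remaining = volume_bits
        while remaining > 0:
            time.sleep(RETRY_INTERVAL)
            # Assume the circuit has been running flat out in the meantime.
            remaining -= reservation.bandwidth_bps * RETRY_INTERVAL
            # Only interested in strictly more bandwidth than we have now.
            better = rebook_circuit(bw_min=reservation.bandwidth_bps,
                                    bw_max=5e9, volume_max=remaining,
                                    circuit_id=reservation.circuit_id)
            if better is not None:
                reservation = better   # upgrade; otherwise keep what we have
        return reservation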

A variation on rebooking, which would be easier for me to use, is the ability to 'stack' bookings. If I have a long-lived booking for a low bandwidth and can make a second booking for another slice of bandwidth, overlapping with the first, would I get the sum of the two bandwidths for the duration of the overlap? I could use that for sure!
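
If stacking works that way, the effective bandwidth at any moment is simply the sum over the overlapping reservations. A trivial sketch of that bookkeeping, assuming each reservation also carries a start time:

    def effective_bandwidth(reservations, t):
        """Sum the bandwidth of all reservations active at time t."""
        return sum(r.bandwidth_bps for r in reservations
                   if r.start_at <= t < r.expires_at)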

Finally, another question: do I need to book before the transfer, or can I book a circuit for a transfer that's already on the wire and see it benefit from the circuit? My policy will clearly be influenced by this. If I must book before the transfer, can I book in advance? How does it work?

So, hopefully this gives you an idea of how I foresee using dynamic circuits in PhEDEx. I cannot imagine a scenario in which I do not have to apply a policy for the minimum and maximum bandwidth and data-volume, and for the total duration of the circuit lifetime. Any such scenario implies that there is sufficient bandwidth that I do not need a 'dynamic' circuit, and can simply make a static booking with no regard for the consequences.

Possible approaches within PhEDEx

There are essentially four possible approaches within PhEDEx for booking dynamic circuits:
  1. do nothing, and let the fabric take care of it
  2. book a circuit for each transfer-job
  3. book a circuit at each download agent, use it for multiple transfer jobs
  4. book circuits at the router level, where a global view is maintained

Each of these has its pros and cons:

Let the fabric take care of it

This is trivial, a NOP for PhEDEx. The result is likely to be sub-optimal in terms of network performance, as it may allow the fabric to reserve all the bandwidth on a given link for a low-priority transfer, thereby starving higher-priority transfers of resources. In short, this is only an option if bandwidth is effectively infinite, as discussed above.

Book a circuit for each transfer-job

This is how the FDT module currently works. A transfer-job is executed by a wrapper that first attempts to book a circuit for the transfer, taking as much bandwidth as it can get. This is fine if there is only one transfer job, and it has a lot of data in it. If, on the other hand, we have many transfer jobs launched in parallel (the usual case for us), then the first transfer job to get a circuit will starve the rest, which will fall back to using the GPN (the general-purpose network).

Obviously there are possibilities to tune this, but since it will only ever be a local optimisation, it may always be somewhat less than optimal.

Book a circuit for each download agent

This has advantages over the previous approaches in that it can maintain a stable circuit that can be used for all the transfers on a given link. It is also fairly trivial to implement: inspect the queue to determine how much bandwidth you will need, and for how long. The downside is that it is still a local optimisation, using only information available at the destination site, so it may not be globally optimal for CMS. Using it intelligently may require changes in the way the router allocates transfers.
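
A sketch of that queue inspection; the task list and its fields are invented for illustration, and the real PhEDEx queue structures differ:

    def booking_request(queued_tasks, gpn_bps, se_limit_bps):
        """Derive a single booking request from a download agent's queue."""
        total_bits = sum(task["bytes"] for task in queued_tasks) * 8
        # Ask for anything that beats the shared network, up to the SE ceiling.
        return dict(bw_min=gpn_bps, bw_max=se_limit_bps, volume=total_bits)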

Book a circuit at the router level

The router agent is the one place that knows about the entire CMS dataflow. It is therefore clearly the place to add a global optimisation of bandwidth use. However, it cannot know when the download agents will be ready for a given transfer task, so it will effectively be booking a slot far in advance (up to a day) of when it is actually used.

Proposal for a prototype

Booking a circuit for each download agent is a reasonable first step. That makes the booking just in time, avoiding issues of predicting network behaviour several hours into the future. For a second step, the router can be tuned to:
  • optimise the queue for a site to utilise fewer sources, so a bandwidth reservation can be used to the maximum effect.
  • provide hints to the site (through the agent_message table?) to tell it how much bandwidth it should reserve, in the case that there is competition among several sites on the same links.
  • optimise the queue for a site by optimising queues for 'adjacent' sites, to avoid competition. Thus, instead of choosing a source-node because it is the 'best' choice for transfers to several sites, select it only for the transfer with the highest priority (which therefore books itself a fat pipe), and take the second choice for transfers of other data to other destinations; a toy version of this heuristic is sketched below.
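
A toy version of that last heuristic, to make it concrete; the data structures are invented, and a real implementation would rank sources per destination:

    def assign_sources(requests, ranked_sources):
        """Give the best source only to the highest-priority request;
        lower priorities take the next free choice."""
        # requests: list of (priority, destination) pairs
        assignment, taken = {}, set()
        for priority, dest in sorted(requests, reverse=True):
            for src in ranked_sources:   # best source first
                if src not in taken:
                    assignment[dest] = src
                    taken.add(src)
                    break
        return assignment

    # Example: T1_A is 'best' for both transfers, but only the
    # high-priority one gets it; the other takes the second choice.
    print(assign_sources([(10, 'T2_X'), (5, 'T2_Y')], ['T1_A', 'T1_B']))
    # -> {'T2_X': 'T1_A', 'T2_Y': 'T1_B'}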