White paper that describes Congestion Control (CC) and Quality of Service (QoS) in InfiniBand: http://www.mellanox.com/pdf/whitepapers/deploying_qos_wp_10_19_2005.pdf

Deploying Quality of Service and Congestion Control in InfiniBand-based Data Center Networks

Summary: The Mellanox description (with frequent comparisons against Ethernet) of QoS and CC for InfiniBand. Detailed description of the queueing schemes which enable QoS: Virtual Lanes with WRR VL arbitration (for QoS), plus packet marking and source rate limitations (for CC).
The link-level credit-based system (also part of CC) is not explained. Fat-tree topology routing is briefly explained.

InfiniBand Architecture Basics

- QoS implementation is based on the concept of traffic classes or flows.
- Congestion management schemes raise the priority of a flow by queuing and servicing queues in different ways. Policing and shaping give priority to a flow by limiting the throughput of other flows. Finally, the rate of packet injection of lower-priority flows can be controlled at the source nodes.

Compared to Ethernet: InfiniBand is lossless (no packet drops, no timeouts & retransmissions).
It allows routing algorithms other than spanning tree (although Ethernet also has alternatives, e.g. SPB) that can exploit redundant links. This allows bandwidth to grow indefinitely with fat-tree topologies.
Switches need fewer buffers (and are therefore cheaper) because buffering is mainly done on the source host. This is a consequence of the credit-based system, which is not explained here.

- Queue Pair (QP): IBA utilizes a memory-based user-level communication abstraction. The communication interface in IBA is the Queue Pair (QP), which is the logical endpoint of a communication link. A QP is implemented on the host side of an InfiniBand channel adapter (such as an HCA).
- Virtual Lanes (VL): The port side of the channel adapter implements what are called Virtual Lanes. Virtual Lanes enable multiple independent data flows to share the same link, with separate buffering and flow control for each flow. A VL Arbiter is used to control link usage by the appropriate data flow. InfiniBand links are logically split into Virtual Lanes (VLs); each VL has its own dedicated set of associated buffering resources.

- Packet Marking: IBA provides two fields for marking packets with a class of service: the service level (SL) field in the LRH and the traffic class field (TClass) in the GRH.
- SL to VL mapping table: selects the output port virtual lane based on the packet's SL, the port on which the packet was received, and the port to which the packet is destined.
- Virtual lane arbitration: is the mechanism an output port utilizes to select from which virtual lane to transmit. IBA specifies a dual priority weighted round robin (WRR) scheme. In this scheme, each virtual lane is assigned a priority (high or low) and a weight. Packets from the high priority
virtual lanes are always transmitted ahead of those from low priority virtual lanes. Within a given priority, data is transmitted from virtual lanes in approximate proportion to their assigned weights (excluding, of course, virtual lanes that have no data to be transmitted).
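As a toy illustration of the SL-to-VL mapping described above, the lookup can be modeled as a table keyed by (input port, output port) whose entries map each of the 16 possible SLs to a VL. The table contents and function name below are invented for illustration; they are not from the white paper:

```python
# sl2vl[(in_port, out_port)] is a 16-entry list indexed by the packet's SL (0-15).
# Example contents only; a real subnet manager programs these tables.
sl2vl = {
    (1, 2): [0, 0, 1, 1, 2, 2, 3, 3] + [0] * 8,   # spread SLs over four VLs
    (2, 1): [0] * 16,                              # everything on VL 0
}

def select_output_vl(in_port: int, out_port: int, sl: int) -> int:
    """Return the output VL for a packet with service level `sl`,
    received on `in_port` and destined for `out_port`."""
    return sl2vl[(in_port, out_port)][sl]

print(select_output_vl(1, 2, 5))   # SL 5, port 1 -> port 2, maps to VL 2
```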

Link Level Flow Control
InfiniBand Link Level Flow Control (LLFC) is implemented per VL. LLFC works as a first-
degree mechanism to deal with congestion without dropping packets. Transient congestion is
effectively dealt with by LLFC. Feedback-based mechanisms cannot deal with the time constants
of transient congestion. Since it is on a per VL basis, LLFC maintains complete isolation of
different traffic from each other. Transient congestion on one VL does not have any impact on
traffic on another VL.
Subnet management packets have a dedicated VL, are not subject to LLFC, and are treated as the
highest priority (thus guaranteeing management traffic progress regardless of fabric state).

VL Arbitration

Packets on different VLs share the same physical link. Arbitration is done through a dual priority
WRR scheme. The scheme provides great flexibility and was designed with a HW
implementation in mind.
VL Arbitration is controlled through a VL Arbitration table on each InfiniBand port. The table
consists of three components: High-Priority, Low-Priority and Limit of High-Priority. The High-
Priority and Low-Priority components are each a list of VL/Weight pairs. Each list entry contains
a VL number (values from 0-14), and a weighting value (values 0-255), indicating the number of
64-byte units which may be transmitted from that VL when its turn in the arbitration occurs. The
same VL may be listed multiple times in either the High or Low-Priority component list, and it
can be listed in both lists (see VL Arbitration example below). The Limit of High-Priority
component indicates the amount of high-priority packets that can be transmitted without an
opportunity to send a low priority packet.

The High-Priority and Low-Priority components form a two-level priority scheme. Each of these
components (or tables) may have a packet available for transmission. Upon packet transmission, the following logic is used to determine which table will be enabled to transmit the
next packet:
o If the High-Priority table has an available packet for transmission and the HighPriCounter
has not expired, then the High-Priority is said to be active and a packet is sent from the
High-Priority table.
o If the High-Priority table does not have an available packet for transmission, or if the
HighPriCounter has expired, then the HighPriCounter is reset, and the Low-Priority table
is said to be active and a packet is sent from the Low-Priority table.
Weighted fair arbitration is used within each High or Low Priority table. The order of entries in
each table specifies the order of VL scheduling, and the weighting value specifies the amount of
bandwidth allocated to that entry. The table entries are processed in order.
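The dual-priority scheme above can be sketched as a small simulation. This is a toy model under simplifying assumptions: the weight is treated as a packet count rather than 64-byte units, within-table weighting is reduced to in-order scanning, and all class and attribute names are mine, not from the IBA spec:

```python
from collections import deque

class VLArbiter:
    """Toy model of dual-priority WRR VL arbitration (simplified sketch)."""

    def __init__(self, high, low, limit_high):
        self.high, self.low = high, low    # lists of (vl, weight) entries
        self.limit_high = limit_high       # Limit of High-Priority component
        self.high_count = 0                # HighPriCounter analogue
        self.queues = {}                   # vl -> deque of pending packets

    def enqueue(self, vl, packet):
        self.queues.setdefault(vl, deque()).append(packet)

    def _scan(self, table):
        # Entries are processed in order; an entry is skipped if its VL
        # has no data or zero weight.
        for vl, weight in table:
            q = self.queues.get(vl)
            if weight and q:
                return q.popleft()
        return None

    def transmit(self):
        # High priority goes first unless the counter has expired.
        if self.high_count < self.limit_high:
            pkt = self._scan(self.high)
            if pkt is not None:
                self.high_count += 1
                return pkt
        # Counter expired, or no high-priority data: reset and serve low.
        self.high_count = 0
        return self._scan(self.low)

arb = VLArbiter(high=[(0, 1)], low=[(1, 1)], limit_high=2)
for p in ["h0", "h1", "h2", "h3", "h4"]:
    arb.enqueue(0, p)
for p in ["l0", "l1"]:
    arb.enqueue(1, p)
print([arb.transmit() for _ in range(7)])
```

With a high-priority limit of 2, every two high-priority packets are followed by one low-priority opportunity, so low-priority traffic cannot be starved.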

Link Level Flow Control (question: who assigns the credits?)
IBA is, in general, a lossless fabric. To achieve this, IBA defines a credit-based link-level flow control scheme. Credits are issued on a per-virtual-lane basis; consequently, if the receive resources of a given virtual lane are full, that VL will cease transmission until credits are available again. However, data transmission on the other virtual lanes may continue.
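A minimal sketch of the per-VL credit mechanism, assuming a receiver that advertises a fixed initial credit per VL and returns credits as its buffers drain (in real IBA, credits travel in link-level flow-control packets; everything below is an invented simplification):

```python
class Link:
    """Per-VL credit bookkeeping on the sender side (toy model)."""

    def __init__(self, vls, credits_per_vl):
        # Each VL starts with the credits advertised by the receiver.
        self.credits = {vl: credits_per_vl for vl in range(vls)}

    def send(self, vl):
        if self.credits[vl] == 0:
            return False           # this VL stalls; no packet is ever dropped
        self.credits[vl] -= 1
        return True

    def credit_update(self, vl, n):
        # Receiver freed n buffers on this VL and advertises new credits.
        self.credits[vl] += n

link = Link(vls=2, credits_per_vl=1)
assert link.send(0)        # VL 0 consumes its only credit
assert not link.send(0)    # VL 0 is now blocked...
assert link.send(1)        # ...but VL 1 still transmits: per-VL isolation
```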

Congestion Control in InfiniBand Architecture

Read the section completely in the white paper. It is the detailed explanation of how the FECN and BECN bits are used to mark congested packets, and how the source limits its sending rate.
Everything is configured with different parameters via the Congestion Control Manager (CCM).

The credit-based system is not explained.
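A toy sketch of the FECN/BECN feedback loop (my own simplification): a congested switch sets FECN on a forwarded packet, the destination echoes it back toward the source as BECN, and the source reduces its injection rate. The halving policy and the threshold below are invented; the real rate-reduction parameters are configured by the CCM:

```python
class Source:
    def __init__(self, rate=10.0):
        self.rate = rate           # injection rate, arbitrary units

    def on_becn(self):
        self.rate /= 2             # back off (invented policy, not the CCM's)

def switch_forward(packet, queue_depth, threshold):
    # A congested switch marks the forward packet.
    if queue_depth > threshold:
        packet["FECN"] = True
    return packet

def destination_receive(packet):
    # The destination echoes congestion back toward the source.
    return {"BECN": packet.get("FECN", False)}

src = Source()
pkt = switch_forward({"payload": "x"}, queue_depth=8, threshold=4)
ack = destination_receive(pkt)
if ack["BECN"]:
    src.on_becn()
print(src.rate)   # 5.0: the source halved its rate after the BECN
```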

Use of Fat Tree Topologies in InfiniBand Based Clusters

Fat Tree topologies = Clos networks = Constant Bisectional Bandwidth (CBB) networks

- In a fat tree, the link bandwidth increases going upward from the leaves to the root, which relieves congestion problems at the root.
- Routing in a fat tree is relatively easy since there is a unique shortest path between any two processing nodes.

- The use of fat tree topologies in InfiniBand clusters enables bandwidth to scale as needed and provides packet forwarding along the shortest available path.
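As a toy illustration of shortest-path routing in a two-level fat tree (leaf switches fully connected to spine switches), the unique shortest path between hosts on different leaves is host → leaf → spine → leaf → host, and the spine can be chosen deterministically from the destination to spread load. The topology parameters and function below are my own illustration, not from the paper:

```python
def route(src, dst, hosts_per_leaf, spines):
    """Shortest path between hosts in a toy 2-level fat tree.
    Hosts are numbered consecutively; host h hangs off leaf h // hosts_per_leaf."""
    src_leaf, dst_leaf = src // hosts_per_leaf, dst // hosts_per_leaf
    if src_leaf == dst_leaf:
        # Same leaf: no need to go up to a spine.
        return [("host", src), ("leaf", src_leaf), ("host", dst)]
    spine = dst % spines            # destination-based spine selection
    return [("host", src), ("leaf", src_leaf), ("spine", spine),
            ("leaf", dst_leaf), ("host", dst)]

print(route(0, 5, hosts_per_leaf=2, spines=2))
```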

Interconnect Requirements for the Data Center Network
High level general requirements. Not so interesting technically.

-- MatiasAlejandroBonaventura - 2016-07-08
