Some notes on InfiniBand, focused on the simulation for PhaseI. The emphasis is on congestion control (CC) and QoS.
CC depends on several parameters that must be set appropriately. An interesting paper models this with OMNeT++ to study the parameters and their effects.

The full specification: InfiniBand™ Architecture Volume 1 and Volume 2. Chapters to read:

  • Chapter 3: Architectural Overview
  • Annex A10: Congestion Control
  • 7.6 Virtual Lanes Mechanisms
  • 7.9 Flow Control
  • 8.2 Packet Routing
  • A13: Quality of Service (QoS)
  • 9.7 Reliable Service
  • 9.10 Static Rate control

General concepts


  • Channel Adapters (CA) minimize interaction with the kernel (interrupts) by writing directly into the buffers. Remote Direct Memory Access (RDMA) can be used.
  • A CA contains several ports, which are logically divided into Virtual Lanes (VL).
  • At the application level, communication goes through Queue Pairs (QP), each made up of a Send work queue and a Receive work queue. A connection is established by connecting a local QP to a remote QP. A QP queues instructions that are executed by the Host Channel Adapter (HCA), and is identified by a QP number.
    QPs use as buffer an amount of memory limited only by the physical memory. "The queue pair is the mechanism by which you define quality of service, system protection, error detection and response, and allowable service. Each queue pair is independently configured for a particular type of service: 1) Reliable connection or 2) Unreliable connection or 3) Reliable datagram or 4) Unreliable datagram"
  • The Subnet Manager configures and maintains fabric operation (one master, multiple slaves). It is the central repository of all the information required to set up and bring up the InfiniBand fabric.
    The Subnet Manager discovers the topology and the end nodes, and configures their IDs. It configures the switch forwarding tables and keeps sweeping the network for changes.
  • A Subnet Manager Agent is provided in each host and processes packets from the Subnet Manager. Only agents from the master subnet manager are active.
  • Slave Subnet Managers can synchronize their information with the master to take over in case of failover. Database synchronization occurs in 2 stages: 'cold synchronization' to copy unsynchronized tables and 'transactional synchronization' for the slave to replicate each operation of the master.

Flow Control mechanisms at different levels

Infiniband implements several mechanisms that control the flow of packets:

  • Link-level flow control: Refer to "7.9 Flow Control" section in InfiniBand™ Architecture Volume 1 and Volume 2.

    Flow control is used to manage the flow of data between links in order to guarantee a lossless fabric. Each receiving end of a link (Virtual Lane) supplies credit to the sending device to specify how much data can be received without loss of data. A dedicated link packet manages the credit passing between devices in order to update the number of packets the receiving device can accept. No data is transmitted unless the credit that is announced by the receiving end indicates sufficient buffer space to receive the entire message.

  • End-to-End (Message Level) Flow Control: Refer to "END-TO-END (MESSAGE LEVEL) FLOW CONTROL" section in InfiniBand™ Architecture Volume 1 and Volume 2.
    Only for reliable connections, and disabled for any QP associated with a shared receive queue.
    Used by a responder to optimize the use of its receive resources. Essentially, a requester cannot send a request message unless it has appropriate credits to do so. Encoded credits are transported from the responder to the requester in an acknowledge message. Each credit represents one WQE posted to the receive queue. Credits are issued on a per message basis, without regard to the size of the message.

  • Quality of Service (QoS): refer to "ANNEX A13: Quality of Service (QoS)" section in InfiniBand™ Architecture Volume 1 and Volume 2.
    Support for this Quality of Service framework is optional.
    It allows resources to be managed so that each workload receives a portion of the resources commensurate with its relative goals and importance. Packets on different VLs share the same physical link. Packets may contain a Service Level (SL) field which maps them to a VL (using the SL to VL mapping table in each switch). The switch arbitrates through a dual-priority Weighted Round Robin (WRR) scheme: it first serves all high-priority VLs and, when they are empty, serves the low-priority VLs. Within each of the high- and low-priority VL sets, WRR chooses the VL to use.

  • Congestion control: refer to "ANNEX A10: CONGESTION CONTROL" section in InfiniBand™ Architecture Volume 1 and Volume 2.
    An optional function for InfiniBand devices.
    Switches detect congestion on a VL and mark packets with Forward Explicit Congestion Notification (FECN). The receiver of the marked packets sends a Backward Explicit Congestion Notification (BECN) to notify the source. The source of the congested packets reacts by temporarily reducing its injection of packets into the network. The injection rate reduction is based on a table that indicates increasing delays to apply when delivering each packet. The more BECNs are received, the larger the delay used; delays decrease over time when no BECN is received.
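
Of the mechanisms listed above, the link-level credit scheme is the easiest to sketch in a few lines. The following is a minimal illustration, not the specification's algorithm: the class and method names are hypothetical, and credits are counted in whole packets for simplicity.

```python
from collections import deque


class VirtualLaneReceiver:
    """Receiving end of one VL with a fixed number of buffer slots (assumed unit: packets)."""

    def __init__(self, buffer_slots):
        self.free_slots = buffer_slots
        self.buffer = deque()

    def advertise_credits(self):
        # Plays the role of the dedicated link packet that carries credits.
        return self.free_slots

    def accept(self, packet):
        assert self.free_slots > 0, "sender violated flow control"
        self.free_slots -= 1
        self.buffer.append(packet)

    def drain(self, n=1):
        """Consume up to n buffered packets, freeing their slots."""
        drained = 0
        while self.buffer and drained < n:
            self.buffer.popleft()
            self.free_slots += 1
            drained += 1
        return drained


class VirtualLaneSender:
    """Sending end of one VL; it never transmits without a credit."""

    def __init__(self):
        self.credits = 0
        self.pending = deque()

    def update_credits(self, credits):
        self.credits = credits

    def transmit(self, receiver):
        """Send pending packets while credits last; return how many were sent."""
        sent = 0
        while self.pending and self.credits > 0:
            receiver.accept(self.pending.popleft())
            self.credits -= 1
            sent += 1
        return sent
```

With a two-slot receiver and three pending packets, the sender delivers two packets, stalls, and resumes only after the receiver drains its buffer and advertises a new credit. No packet is ever dropped, which is the lossless property the link-level mechanism is meant to guarantee.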



  • Routing:
    • Step 1 - The Subnet Manager discovers all the InfiniBand switch chips in the network.
    • Step 2 - The Subnet Manager groups the internal switch chips within each chassis into a switch element.
    • Step 3 - The Subnet Manager process continues until all the InfiniBand switches are grouped into switch elements.
    • Step 4 - After all the switch chips are grouped, the Subnet Manager routes the switch elements according to a routing algorithm.
    • Step 5 - The internal network of each InfiniBand switch is then routed based on the best algorithm for each switch element.
  • Routing Algorithm: Minimum Contention, Shortest Path, and Load Balancing algorithm
    • Step 1 - The shortest path for each of the host ports is calculated.
    • Step 2 - Contention is calculated for all the available paths that are within the (shortest path + tolerance) distance.
      • a. The path with the least contention is selected.
      • b. If two paths have the same contention, the path with less distance is selected.
      • c. If two paths have the same contention and the same distance, the port usage count is used to provide load balancing over the two paths. The usage count is a measure of how many LIDs have been configured to use that particular port.
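
The selection rules of the Minimum Contention, Shortest Path, and Load Balancing algorithm above amount to a filter followed by a three-level tie-break. The sketch below is only an illustration under assumptions: `CandidatePath`, `select_path`, and the way contention is expressed as a single integer are hypothetical, since the notes do not say how contention is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePath:
    port: int          # output port that starts this path
    distance: int      # hop count to the destination
    contention: int    # assumed contention metric (lower is better)
    usage_count: int   # number of LIDs already configured to use this port


def select_path(paths, tolerance=0):
    """Pick a path among those within (shortest path + tolerance) hops."""
    shortest = min(p.distance for p in paths)
    eligible = [p for p in paths if p.distance <= shortest + tolerance]
    # Tie-break order: least contention, then least distance, then least-used port.
    return min(eligible, key=lambda p: (p.contention, p.distance, p.usage_count))
```

Encoding the three rules as one sort key keeps their priority order explicit: usage count only matters when both contention and distance are equal.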
OpenSM now offers nine routing engines (for details on each one go to the link):

1. Min Hop Algorithm - based on the minimum hops to each node where the path length is optimized.
2. UPDN Unicast routing algorithm - also based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and deadlock may occur due to a loop in the subnet.
3. DNUP Unicast routing algorithm - similar to UPDN but allows routing in fabrics which have some CA nodes attached closer to the roots than some switch nodes.
4. Fat Tree Unicast routing algorithm - this algorithm optimizes routing for congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types, not just K-ary-N-Trees: non-constant K, not fully staffed, any Constant Bisectional Bandwidth (CBB) ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.
5. LASH unicast routing algorithm - uses Infiniband virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free topology-agnostic routing algorithm to the non-minimal UPDN algorithm avoiding the use of a potentially congested root node.
6. DOR Unicast routing algorithm - based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock free routes for hypercubes when the fabric is cabled as a hypercube and for meshes when cabled as a mesh (see details below).
7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free routing while supporting two quality of service (QoS) levels. In addition it is able to route around multiple failed fabric links or a single failed fabric switch without introducing deadlocks, and without changing path SL values granted before the failure.
8. DFSSSP unicast routing algorithm - a deadlock-free single-source-shortest-path routing, which uses the SSSP algorithm (see algorithm 9.) as the base to optimize link utilization and uses Infiniband virtual lanes (SL) to provide deadlock-freedom.
9. SSSP unicast routing algorithm - a single-source-shortest-path routing algorithm, which globally balances the number of routes per link to optimize link utilization. This routing algorithm has no restrictions in terms of the underlying topology.

OpenSM also supports a file method which can load routes from a table. See 'Modular Routing Engine' for more information on this.

The basic routing algorithm is comprised of two stages:

1. MinHop matrix calculation: how many hops are required to get from each port to each LID?
   The algorithm to fill these tables is different if you run standard (min hop) or Up/Down routing.
   For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches.
   For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would go up after a down step was used.

2. Once MinHop matrices exist, each switch is visited and for each target LID a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned LIDs is selected.
   If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
   a. use only ports which have the same MinHop
   b. first prefer the ones that go to a different systemImageGuid (than the previous LID of the same LMC group)
   c. if none, prefer those which go through another NodeGuid
   d. fall back to the number of paths method (if all go to the same node).
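
The two stages above can be sketched for the standard (min hop) case as follows. This is a simplified illustration, not OpenSM's code: the function names and the topology encoding are assumptions, and both Up/Down routing and the LMC > 0 checks are left out.

```python
INF = float("inf")


def min_hops(links, lids):
    """Stage 1: best[switch][lid] = minimum hop count from switch to lid.

    links maps switch name -> {port number: neighbor}, where a neighbor is
    either another switch name (str) or the LID (int) of an attached end port.
    """
    best = {sw: {lid: INF for lid in lids} for sw in links}
    for sw, ports in links.items():
        for nbr in ports.values():
            if nbr in best[sw]:           # neighbor is a directly attached LID
                best[sw][nbr] = 1
    changed = True
    while changed:                        # relax until no table improves
        changed = False
        for sw, ports in links.items():
            for nbr in ports.values():
                if nbr not in links:      # only switches forward traffic
                    continue
                for lid in lids:
                    if best[nbr][lid] + 1 < best[sw][lid]:
                        best[sw][lid] = best[nbr][lid] + 1
                        changed = True
    return best


def build_forwarding_tables(links, lids):
    """Stage 2: per switch, pick the min-hop port with the fewest assigned LIDs."""
    best = min_hops(links, lids)
    tables = {}
    for sw, ports in links.items():
        load = {port: 0 for port in ports}    # per-port counter of target LIDs
        table = {}
        for lid in sorted(lids):
            candidates = []
            for port, nbr in ports.items():
                if nbr == lid:
                    hops = 1
                elif nbr in links:
                    hops = best[nbr][lid] + 1
                else:
                    hops = INF
                if hops != INF and hops == best[sw][lid]:
                    candidates.append(port)
            if candidates:
                port = min(candidates, key=lambda p: (load[p], p))
                load[port] += 1
                table[lid] = port
        tables[sw] = table
    return tables
```

On a small two-leaf, two-spine fabric, a leaf switch has two equal-cost uplinks towards each remote LID, and the per-port counter spreads consecutive LIDs across them.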

QoS and CC: Deploying QoS in InfiniBand based Data Center Networks

White paper that describes Congestion Control (CC) and Quality of Service (QoS) in InfiniBand:

Summary: Mellanox's description of QoS and CC for InfiniBand (it leans heavily on comparisons with Ethernet). It gives a detailed description of the queueing schemes that enable QoS, namely Virtual Lanes with WRR VL arbitration, and of packet marking and source rate limiting for CC.
The link-level credit-based system (also part of CC) is not explained. Fat-tree topology routing is briefly explained.
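
As a rough illustration of the WRR VL arbitration mentioned in the summary, the sketch below serves the high-priority VLs first and falls back to the low-priority VLs only when every high-priority queue is empty. It follows only the description in these notes: `VLArbiter` is a hypothetical name, and weights are counted in packets per round, which is a simplifying assumption.

```python
from collections import deque


class VLArbiter:
    """Dual-priority weighted round robin over per-VL packet queues."""

    def __init__(self, high_weights, low_weights):
        # Each argument maps VL number -> packets allowed per round (assumed unit).
        self.high = high_weights
        self.low = low_weights
        self.queues = {vl: deque() for vl in {**high_weights, **low_weights}}

    def enqueue(self, vl, packet):
        self.queues[vl].append(packet)

    def _wrr_round(self, weights):
        sent = []
        for vl, weight in weights.items():
            for _ in range(weight):
                if not self.queues[vl]:
                    break
                sent.append((vl, self.queues[vl].popleft()))
        return sent

    def next_round(self):
        """Return the (vl, packet) pairs sent in one arbitration round."""
        if any(self.queues[vl] for vl in self.high):
            return self._wrr_round(self.high)
        return self._wrr_round(self.low)
```

With strict priority between the two sets, a steady stream of high-priority traffic starves the low-priority VLs in this sketch, so the weights only shape sharing within a set.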

Complete review here.

M&S of Congestion Control in Infiniband

from this paper: InfiniBand Congestion Control, modeling and validation

It is closely related to this other paper, which runs the same experiments in more depth but only on hardware: First Experiences with Congestion Control in InfiniBand Hardware (it has very interesting plots for the parameter sweep!)

- Simulation of CC in OMNeT++
- Compares the model against reality, with a very good match
- Scalability: they simulate 0.5 s of a 650-node network in 2 days.
- Based on a model provided by Mellanox, to which they add CC according to the specification: packet marking in the switches and rate limiting at the source.
- Very interesting use of simulation to scan the space of parameters that can be configured in InfiniBand CC
- "The IB CC mechanism is based on a closed loop feedback control system where a switch detecting congestion marks packets contributing to the congestion by setting a specific bit in the packet headers"
- Parameters governed by a Congestion Control Manager
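
The closed loop described above (marking in the switch, feedback from the destination, rate limiting at the source) can be sketched as below. This is a toy model and not the paper's OMNeT++ code: the function and class names, the queue-depth threshold, and the `increase` parameter are illustrative assumptions standing in for the parameters set by the Congestion Control Manager.

```python
def switch_forward(packet, queue_depth, threshold):
    """Switch side: set FECN on packets forwarded while the VL queue is congested."""
    marked = dict(packet)
    marked["fecn"] = packet.get("fecn", False) or queue_depth > threshold
    return marked


def receiver_wants_becn(packet):
    """Destination side: a FECN-marked packet triggers a BECN back to the source."""
    return packet.get("fecn", False)


class CongestionControlledSource:
    """Source side: an index into a table of injection delays."""

    def __init__(self, delay_table, increase=1):
        self.delay_table = delay_table   # increasing delays, one per congestion level
        self.increase = increase         # index steps added per received BECN
        self.index = 0

    def on_becn(self):
        # More BECNs move further down the table, i.e. to larger delays.
        self.index = min(self.index + self.increase, len(self.delay_table) - 1)

    def on_timer(self):
        # Assumed to be called periodically: one step of recovery per tick.
        self.index = max(self.index - 1, 0)

    def injection_delay(self):
        return self.delay_table[self.index]
```

Sweeping values such as the threshold, the table contents, `increase`, and the timer period is the kind of parameter scan that the simulation makes cheap compared with hardware experiments.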

Source code of the model:

Other papers on CC in InfiniBand:

Topic revision: r11 - 2016-07-13 - MatiasAlejandroBonaventura