My notes on the CMS fedbuilder
Misc.
 Switches seem to be twelve CLOS256+256 (see e.g. http://www.myri.com/myrinet/product_list.html). The whole network seems to be implemented in two identical copies (for bandwidth doubling). One instance consists of three switches (crates) in USC and three in SXC.
 The line cards are probably M3SW3216F. Each of these has one xbar32 (crossbar) chip which can route 16 ports (front panel) on 16 ports (backplane) where each port seems to consist of an input and an output.
 A clos256+256 consists of (up to) 16 M3SW3216F connected to the FRLs (USC) or Readout Units (SXC) and four M34SW3216Q which make 256 connections to another myrinet switch (i.e. the connect the switches in USC to those in SXC).
 Latest snapshot of the UCSD myrinet routing software: click here (Java webstart and Java 1.5 or newer required. If you can run EVO, this link should probably work).
Thoughts on routing
 in order to avoid collisions, the 16 inputs of an xbar must be distributed to the 16 outputs (and vice versa) of the xbar
Thus one category (corresponding to one SCX output xbar) must be distributed across the 16 SCX input xbars.
Thus each category must only cross an SCX input xbar once.
 for the moment, one category also corresponds to an USC input xbar, thus the same argument (for category crossing) applies for the USC output xbars as well.
Links
20100316
20100531
 routing seems to work as follows:
 FedBuilderMultiCrate.route(..) seems to loop over all USC output xbars by calling AssignInterXbar(..)
 AssignInterXbar(..) calls FedBuilder.doAssignL1(..) with the first USC input xbar (and the given USC output xbar) to assign routes. This method then calls itself with the next USC input xbar.
 FedBuilder.doAssignL1(..) calls itself and FedBuilder.doAssignL2(..) which assigns a single line.
 FedBuilder.doAssignL2(..) essentially calls SetUSCOutput(..), performs some consistency checks and undoes the assignment if there are conflicts.
 user specified routing constraints (apart from physical configuration such as cabling and broken fibers):
 for assigning/unassigning: the only thing which should be unassigned is the actual route, other auxiliary quantities (such as the number of used/unused outputs etc.) should be determined on the fly from the routing...
20100601
 another thing which is not clear to me: why for example input lines can be marked as 'done' while they don't have a category assigned ?
20100607
 the nameserver seems to know 640 ru's.
Talking to Sham:
20100702 myrinet monitoring
* some notes by Sham are here: https://twiki.cern.ch/twiki/bin/view/CMS/CDAQOnCallHowTOs#Monitoring_Myrinet_Optical_Links
(from a mail sent on the cmsdaqdaqoncall list on this date)
20100715

 single myrinet link goes down, single myrinet transceiver (NIC or FRL) card goes down: can be masked because there is a parallel link (gives reduced bandwidth though)
 RU (entire PC, e.g. power supply) goes down: what happens then ? Must mask the slice until the PC is repaired ?
 FRL goes down: must fix this immediately...
 counted the number of RU's per slice: 69 of them. This gives 69*8 = 552 links. On the other hand, we have 12 linecards per switch on the Myrinet router, this gives 12 linecards * 3 crates * 16 lines = 576 links. This means we have 24 more links than currently needed by the RUs.
20100810
20100811
 Jim pointed out that the Java program (with the classic algorithm) routes 4 USC line cards to each SCX crate which is not optimal (because there are twice the links to the same number crate as there are for nonsame number crates).
 Marco has sent the table used at point 5 and the program which he used to produce this table (the version I was using before did not have the actual constraints used at point 5).
20100812
 discussion on (amongst others) the requirements for the routing program
 Frans (or somebody else) will have to implement into the Myrinet driver / firmware flexibility for configuring the entire route (not compile it into the driver/firmware)
 for the moment, still route linecards to linecards. In the future, one could be more flexible.
 One Fedbuilder seems to be 116 FRLs (on the FRL side) and N RUs on the RU side (where N is the number of slices which is currently 8). Note that a Fedbuilder on the FRL side can have more than eight FRLs.
grep i frlpc hostnames  grep v CNAME  cut d. f1
in
~/hostnames
(on the online cluster), I see 3
dvfrlpc
and 50 FRL PCs
20100813
New routing algorithm fails with the following four links broken:
failed to find a route with the following list of broken links:
edu.ucsd.hep.fedbuilderroute.routingalgorithms.RoutingAlgorithmException: too many categories to be routed between USC crate 2 and SCX crate 0 (available: 2, needed: 3
(USC to SCX link: USC crate=2,USC xbar=8,USC line=2 <> SCX crate=0,SCX xbar=9,SCX line=0)
(USC to SCX link: USC crate=2,USC xbar=11,USC line=13 <> SCX crate=0,SCX xbar=10,SCX line=12)
(USC to SCX link: USC crate=2,USC xbar=14,USC line=8 <> SCX crate=0,SCX xbar=15,SCX line=8)
(USC to SCX link: USC crate=2,USC xbar=15,USC line=14 <> SCX crate=0,SCX xbar=13,SCX line=12)
(and other examples). Note that the link breaking recipe was:
 break four links between nonsamenumber crates
 make sure no xbar has more than one broken link
So in principle, the routing algorithm should still find a solution...
20100826
 the problem of finding the 4x4 routing categories (with the presence of broken links) can be formulated as a Latin Square problem:
 columns are USC xbar
 rows are SCX xbar
 small squares are the links between an USC and an SCX xbar
 colors are memberships to a 4route
 broken links are assigned their own color. If there is more than one broken link per xbar, more than one color (i.e. more than one category for broken links) is needed
Not clear how to generalize to the case of 32 links between 4 and 4 xbars (between same numbered crates)
 From the Latin Square wikipedia page:
The problem of determining if a partially filled square can be completed to form a Latin square is NPcomplete
which means that it is in NP which in turn means that there is no polynomial time algorithm known (see
http://en.wikipedia.org/wiki/NP_%28complexity%29).
 Sudoku can be solved by a backtracking algorithm, which seems to be essentially a tree (of all combinations) with some pruning (i.e. when it becomes obvious that the current solution can not be completed). This seems to be what Elliott did (with the recursion, but on all 16 xbars, not the 4x4 ones).
* Note that the Latin Square of size 4 has only 576 different squares (that corresponds to the
case of nonequal crate numbers). And there are only 2 isotopy classes (where isotopic
is defined as equal after permutation of rows/columns and the label names).
20100827
 in fact the FRL PCs seem to have several 'myrinet network cards' according to lspci.
/sbin/lspci  grep i myrinet
on frlpcS1D0601 gives:
05:0a.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
05:0b.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
05:0c.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
05:0d.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
05:0e.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
05:0f.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
06:08.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
06:0a.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
06:0b.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
06:0c.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
06:0d.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
06:0e.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
06:0f.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
07:0d.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
07:0e.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
07:0f.2 Network controller: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect (rev 05)
 the driver struct seems to be of type
myrfb_dev_t
.
20100903
 The chromatic polynomial gives the number of possible colorings of a graph as function of the number of colors to be used. When setting 'color' = category route and the graph has vertices which corresponds to myrinet links and edges which mean 'go to the same USC or SCX xbar', the number of colorings is the number of different ways of grouping the links into categories.
 There are also efficient algorithms to construct this polynomial for a given graph.
20100906
 Calculating Chromatic Polynomials in Mathematica is described here. This function seems to count all permutations of colors, so in order to get the number of different possible distributions of categories, one would have to divide by the number of permutations for the given number of categories.
The example shown on the Mathematica page shows 114 possible colorings for three colors which is what one also gets from the corresponding Chromatic polynomial:
def poly(z):
from math import pow
return pow(z,8)  12 * pow(z,7) + 66 * pow(z,6)  214 * pow(z,5) + 441 * pow(z,4)  572 * pow(z,3) + 423 * pow(z,2) 133*z
 starting Mathematica on lxplus:
/afs/cern.ch/project/parc/math70/bin/Mathematica
NOTE: the Chromatic Polynomial does not say ANYTHING about how many vertices ('links') are assigned to a given color. In our case, we need to have exactly (at least)
four vertices assigned to the same color, but the chromatic polynomial does not care about this.
20100907
 one could formulate e.g. the 32 link routing problem between 4+4 xbars as a linear programming problem (make sure that each tuple of (USC xbar,SCX xbar) does not occur more than twice). But then it's an INTEGER linear programming problem which is also NPhard according to Wikipedia.
 20100907quadxbarroutingexample.png:
20101025
 found the following line of code:
int myriAddr = (int) (1024 + 16 * outputLineCardId + 2 * getSliceNumberFromName(sliceName));
in
http://isscvs.cern.ch/cgibin/cvsweb.cgi/TriDAS/RunControl/framework/utilities/hwcfg/fillers/src/rcms/utilities/hwcfg/fillertools/EVBOptimizer.java?rev=1.3;contenttype=text%2Fxcvswebmarkup;cvsroot=tridas
20101031
For the 'between same crates' routing problem: each of the links can be used at most twice (or less if some of them are broken). We can formulate this as an integer programming problem (see also http://en.wikipedia.org/wiki/Linear_program#Integer_unknowns ):
maximize
c^T * x
with constraint
A * x <= b
where x is the vector of 'how many times a pattern in the above list of 24 patterns is taken' (i.e. has length 24), b is the number of times a link is available (i.e. has length 16) and A_ij is 1 if link i is used in pattern j (or zero if it isn't).
Note that this matrix is NOT square. The vector c consists of all ones such that c^T * x is the number of fourpatterns included in the solution.
See e.g. also http://www.cs.brown.edu/courses/csci1490/slides/CS149TUM.pdf
20110321
New CMS wide twiki page created: https://twiki.cern.ch/twiki/bin/view/CMS/MyrinetRoutingProgram
 AndreHolzner  09Jan2010