Saturday, 26 July 2014

What's Up With Number Bases?

Humans evolved to have 10 fingers, so it's no surprise that the arithmetic surrounding our day-to-day lives uses numbers in base 10. What does this mean? In representing numbers, we really have two options. The first is to have a single symbol for every number, for example you could go 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P... and after you're out of English letters you can use Greek, then Chinese, and eventually elaborate paintings of assortments of fruit as you count higher and higher. The second option is to use a limited set of symbols which carry extra meaning when placed alongside each other as digits. Base 10 uses 10 symbols: 0,1,2,3,4,5,6,7,8 and 9. You could decide that each digit holds the same weight, i.e. 139 could mean 1+3+9, but then we would obtain the same number using 1+4+8, i.e. 148, meaning testing for equality wouldn't be easy, not to mention large numbers would require many, many digits. However, we can solve both these problems by saying the first digit in a number (starting from the right) represents [symbol's value] * [the base]^0, the second digit represents [symbol's value] * [the base]^1 and so on, with the whole number being all of these added up. In base 10 this means that the number 139 is actually:

9 * 10^0 = 9 * 1 = 9
+
3 * 10^1 = 3 * 10 = 30
+
1 * 10^2 = 1 * 100 = 100
=
100 + 30 + 9 = 139
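
To make the positional formula concrete, here's the same sum written as a couple of lines of Python (a throwaway sketch, nothing more):

digits = "139"
# each digit's value is scaled by the base raised to its position (rightmost digit is position 0)
print(sum(int(d) * 10 ** i for i, d in enumerate(reversed(digits))))   # 139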

The reason this setup allows every integer to be represented is that each digit to the left of the core digit (I made that term up, but I'm sure you can intuit that I'm referring to the 9 in the above example) can be expressed as just a heap of 'ones'. Consider the number 39:

There's certainly nothing special about using a base of 10 as opposed to any other base. All the above rules apply to something like base 8 (octal numbers). Base 8 uses the symbols 0,1,2,3,4,5,6 and 7. If you're working with base 8 and you see the number 135 it means 5 * 8^0 + 3 * 8^1 + 1 * 8^2 = (in base 10) 5 + 24 + 64 = 93. Just like base 10, base 8 can represent every integer, and each integer has a unique representation, which is pretty cool in my opinion. We could have evolved to have 8 fingers rather than 10, and all the little tricks we learn for multiplication like 'when multiplying by 5, just halve the number and append a 0' would be gone, and we'd have other tricks to easily work our way around arithmetic in that base.

You may wonder 'why would we ever need to worry about bases other than 10 if we can do everything we need with the base we have?' The answer is that computers can only represent data in the form of bits (1s and 0s), meaning that if you want to store the number 139 (in base 10) in a computer, you'll need a way to convert it into 1s and 0s. Well, how about a base which has two symbols, 0 and 1, called 'binary'? A lot of this post will be about converting between the two.

First off let's visualise what's going on with different bases. The following picture shows what it looks like when a handful of bases represent the number that base 10 calls 39. On the left of each level is the size of that level's 'chunks', i.e. what one chunk represents in terms of ones. The numbers on the right-hand side of each level are the symbols that represent the number of chunks in the level. Reading top-down through the symbols, you get the number that the base spits out when you ask it for its representation of 39. For base infinity, i.e. the base that writes every number as a single digit, I just made up some crazy-looking symbol.


As you can see, the bottom level of base 4 is never going to have 4 chunks, because that can be represented by a single chunk in the level above it, meaning it can only ever have a max of 3 chunks. The same can be said for the second (purple) level; it can only have a max of 3 chunks because if it had 4, those could be represented by a single chunk in the green level. Not only 'can' it be represented like that, it must be, as base 4 only has 4 symbols: 0, 1, 2 and 3, meaning you cannot have 4 chunks on one level, and if you do, you just hand them up to the level above.

Let's look at how you could construct these numbers if you just started with the row of ones from base infinity. We'll do base 3 first: you know you can't represent all 39 of the chunks on the bottom row because your max symbol is 2, so you tell all the chunks on the bottom row to find two friends and form a chunk on the row above. Luckily for us there aren't any leftovers, so we can just go to the next row and see if we've got few enough chunks this time. Looking at the purple row, we have 13 chunks, which is still too many, so we ask the chunks to form groups of 3 and jump to the next row. This time we have 1 leftover (a remainder) who didn't have the social skills to get in early with the grouping, so he gets left on that level (no problem for us, because there's 1 chunk and we can use the symbol '1' to represent that). So we had 13 chunks, divided them into groups of 3 and ended up with 1 remainder, giving us 4 groups on the green level. Again, that's too many, so we get them to form groups of 3 again, and again 1 gets left behind. We go up to the cyan level, and because we have a number of chunks (1) that we have a symbol for, we can stop there (in fact we couldn't go up another level if we wanted to, because 1 chunk on the level above represents 3^4 = 81 ones, which we clearly don't have on the cyan level).

So we end up with the number 1110 in base 3, with each '1' representing the chunks that didn't go on to the next level. We know each chunk contains 3 times as many ones as a chunk in the next level down, and that on the bottom level 1 chunk = 1 one, so going right to left we can count in base 10 to see that we have
0*1 + 1*3 + 1*9 + 1*27 = 3 + 9 + 27 = 39,
which is what we started with, meaning our method works.

So what exactly was our method? We start off with no digits and a string of ones (which we talk about as 39 in base 10 terms because we're so familiar with it). We add a digit which is the remainder when we divide the 39 by our base (in this case 3), then we actually divide the 39 by 3 and continue the process with the quotient (the non-remainder part of the division), until the result of the division is 0 (because that's the point where there are no more chunks that can group up and go to the next level).

If we were to make an algorithm for this, starting with the number n and using base b it would look like this:

def Convert_From_Base_Ten(n, b):
    digits = "0123456789ABCDEF"              # symbols for bases up to 16
    number = ""
    while n != 0:
        number = digits[n % b] + number      # the remainder is the next digit, right to left
        n = n // b                           # integer division keeps the quotient
    return number if number else "0"

Above we did the reverse process, where you take each digit and multiply it by its associated value, and then sum them all up, e.g. 0*1 + 1*3 + 1*9 + 1*27 = 3 + 9 + 27 = 39. Let's see how the algorithm for that would look:

def Convert_To_Base_Ten(number, b):
    digits = "0123456789ABCDEF"
    n = 0
    for i in range(len(number)):
        # the leftmost digit carries the highest power of the base
        n = n + digits.index(number[i]) * b ** (len(number) - 1 - i)
    return n
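
Here's a quick round trip through the two functions, as a sanity check:

print(Convert_From_Base_Ten(39, 3))    # 1110
print(Convert_To_Base_Ten("1110", 3))  # 39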

So we've taken care of integer conversions, BUT unfortunately we've left out one whole half of the digits: the ones to the right of the decimal point. Just as the core digit represents your base to the power of 0 and digits to the left have ascending powers, digits to the right, i.e. the fractional digits, have descending powers. The number 0.325 in base 10 is 3*10^-1 + 2*10^-2 + 5*10^-3. Likewise the number 0.101 in base 2 is 1*2^-1 + 0*2^-2 + 1*2^-3 = (in base 10) 0.5 + 0.125 = 0.625. This calculation should be evidence enough that going from any base to base 10 isn't difficult, but the same cannot be said for the reverse process.
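
The fractional sum translates just as directly into a couple of lines of Python, if you want to check it:

bits = "101"
# fractional digits carry negative powers: the first digit after the point is base^-1
print(sum(int(d) * 2 ** -(i + 1) for i, d in enumerate(bits)))   # 0.625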

To understand the process of going from base 10 to some other base, we must first realise that the digital number system lets us shift digits left or right by multiplying or dividing by the base. For example, 104 in base ten becomes 1040 if you multiply by ten, or 10.4 if you divide. If you consider the visualisation above, when you multiply a number by its base you're going to each level and giving each chunk (base - 1) friends, which is precisely enough to jump up to the next level, so everything gets shifted up one level, and the reverse is true for division.
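
Binary makes this shifting literal: multiplying by 2 moves every digit up a position, which Python will happily confirm:

n = 0b1101            # 13 in base 10
print(bin(n * 2))     # 0b11010: multiplying by the base shifts every digit up a level
print(bin(n >> 1))    # 0b110: dividing by the base shifts them down (the fraction is dropped)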

I'll focus on binary because that's really the one big base we care about. If you want to know what the fractional digits of a binary number are but you can only see the core digit, you can just multiply the number by 2 and write down the digit that shows up in the core digit, then repeat that process until you're left with just 0s coming in, or some repeating sequence of digits. BEAR WITH ME. Alright, say you wanted to find the binary equivalent of the number 0.582 in decimal. You can follow the exact same process: just multiply by 2 and you'll get 1.164. The thing here is that the core digit doesn't lie, i.e. it means the same thing regardless of what base you're talking about, meaning that 1 would have shown up if you had started with 0.582 in binary, so you just follow the process and write it down. Now I would say multiply 1.164 by two, but the problem here is that the first '1' isn't going anywhere; in binary it would be shifted to the left and out of our way by another multiplication by 2. Considering that's not going to happen here, we'll just need to manually get it out of our way by subtracting it off. So once we're left with 0.164 we can multiply by 2 again, giving us 0.328. We write down the digit (0), which means we now have '10' written down, corresponding to 0.10 in binary. Repeat the process: now we have 0.656, so write down another 0. After a while I ended up with 0.100101001111, which didn't look like it planned on terminating any time soon.

So that's our process for going from decimal to binary. Note that this works with all bases; however, if you want to use something like hexadecimal (base 16) then you'll need to consider the two digits to the left of the decimal point rather than just the first (because if you multiply, say, 0.9 by sixteen, the 14.4 you're left with overflows a single decimal digit). Remember what I said above: the core digit is the only one which doesn't lie, so if you're dealing with hexadecimal you'll need to do a conversion from decimal to hexadecimal on those two digits, i.e. a conversion within a conversion. Luckily this isn't difficult because you can just look at the two digits and go straight to the hexadecimal equivalent, i.e. 01 --> 1, 02 --> 2 ... 10 --> A, 11 --> B ... 15 --> F. I would give an example, but you would be surprised how hard it is to find a fractional number in base 10 which doesn't give a recurring sequence in base 16.

The general algorithm for converting a fractional number n from base 10 to base b is:

def Convert_Fractional(n, b, max_digits=12):
    # n is the fractional part (0 <= n < 1); the digit cap stops us
    # looping forever when the expansion repeats
    digits = "0123456789ABCDEF"
    number = "0."
    while n != 0 and len(number) - 2 < max_digits:
        n = n * b
        d = int(n)                  # whatever lands left of the point is the next digit
        number = number + digits[d]
        n = n - d                   # subtract the digit off and keep going
    return number
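
Running this on the example from before reproduces the digits we worked out by hand (floating-point rounding will eventually pollute later digits, which is another reason for the digit cap):

print(Convert_Fractional(0.582, 2))   # 0.100101001111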

Anyway, so now we know how to convert to and from base 10, in terms of both integer and fractional parts. In truth we can convert between any two bases, because our algorithms just happened to be doing their arithmetic in base 10; you could do the same arithmetic in any base.

Now, going straight from binary to decimal and back is pretty laborious because it takes a lot of steps, but going from hexadecimal to decimal and back is much easier (far fewer digits), meaning it would be very convenient if converting between binary and hexadecimal were easy, because it would allow us to use hexadecimal as a middleman between the two bases. Let's see if it is:

Four binary digits can cover the same range of numbers as a single hexadecimal digit:

0000 = 0    0001 = 1    0010 = 2    0011 = 3
0100 = 4    0101 = 5    0110 = 6    0111 = 7
1000 = 8    1001 = 9    1010 = A    1011 = B
1100 = C    1101 = D    1110 = E    1111 = F

meaning the number 1101 1010 0101 . 0100 0011 1111 is the same as DA5.43F. We could have used the conventional techniques to work that out, but obviously the ability to group the digits into sequences of 4 and convert each one makes our job a lot easier. The story is similar with binary and base 8, except there are three binary digits to each octal digit. It works because the bases are the same number raised to different powers: with hexadecimal, each digit (2^4 symbols) encapsulates the information of 4 binary digits (4 digits of 2 symbols each, giving 2^4 total permutations).
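
Here's the grouping trick as a few lines of Python (a helper I just made up, handling the integer part only; a fractional part would need padding with zeros on the right rather than the left):

def binary_to_hex(bits):
    digits = "0123456789ABCDEF"
    bits = bits.zfill((len(bits) + 3) // 4 * 4)   # left-pad to a multiple of 4
    # each group of 4 bits maps straight to one hexadecimal digit
    return "".join(digits[int(bits[i:i+4], 2)] for i in range(0, len(bits), 4))

print(binary_to_hex("110110100101"))   # DA5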

So we know that converting between binary and hexadecimal is trivial, and we know converting between hexadecimal and decimal is easier than converting between binary and decimal, so it's clear that converting between binary and decimal is best done with the hexadecimal middleman.

So for example we can use the number from above, 1101 1010 0101 . 0100 0011 1111. We get to the hexadecimal number DA5.43F, and from there we want to be in decimal, so we first look at the DA5 part:

5 * 16^0 + 10 * 16^1 + 13 * 16^2 = 5 + 160 + 3328 = 3493

Looking at the 43F part:

4 * 16^-1 + 3 * 16^-2 + 15 * 16^-3 = 4/16 + 3/256 + 15/4096 = 0.265380859375 in decimal.
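
Python agrees with both halves, if you want a sanity check (int's built-in base-16 parsing covers the integer part for us):

print(int("DA5", 16))                                                  # 3493
print(sum(int(c, 16) * 16 ** -(i + 1) for i, c in enumerate("43F")))   # 0.265380859375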

So our final answer is 3493.265380859375. Very cool. Going back the other way, we can start with 3493.26538 (truncating a little) and try to get back to hexadecimal:

3493 % 16 = 5
3493 / 16 = 218
218 % 16 = 10 = A
218 / 16 = 13
13 % 16 = 13 = D
13 / 16 = 0

So we get DA5 (good thing too, because it's what we started with).

now for the fractional part:

0.26538 * 16 = [4].24608
0.24608 * 16 = [3].93728
0.93728 * 16 = [14].99648 = [E].99648

We can stop there if we think we're accurate enough (the fact that we truncated the value earlier means any further digits wouldn't be very representative anyway). We ended up with 43E, which isn't too far off the 43F that we started with; obviously the cause of this was the truncation above. Nonetheless we're left with DA5.43E. The next step is getting this into binary, which digit for digit is just:

1101 1010 0101 . 0100 0011 1110
Not too bad! Well, making this post has done wonders for my intuition on the subject, and hopefully you can say the same. Next up I'll hopefully get around to posting about ways to speed up Prim's and Kruskal's Algorithms.

Thursday, 17 July 2014

What's Up With Minimum Spanning Trees?!

If you take a graph and select certain edges to form a tree which includes all of the graph's vertices, you've got a spanning tree. If this tree, compared to all the possible spanning trees you can make from the graph, has the minimum total weight, then you've got a minimum spanning tree (MST). We're going to talk about two methods for going from a graph to its MST: Prim's Algorithm and Kruskal's Algorithm. For simplicity we'll consider graphs where each edge weight is unique, meaning any graph has exactly one MST.

Prim's Algorithm:
This was the algorithm used by AT&T and other notable telecommunications companies for years, as such companies often need to find the cheapest network of lines that connects all of their points (note this is a minimum-cost connection problem, not a shortest-path problem). The algorithm goes vertex by vertex.

We'll start at vertex A (which one you choose is arbitrary, because every vertex ends up in the MST no matter what). We then add the adjacent vertex with the lowest edge weight (in this case B), along with the edge itself.
So now we've got two vertices, and we can choose the next vertex from either. The next smallest weight connecting A or B to an adjacent vertex is 2, so we can add that edge and K to our MST.

Here's a good time to acknowledge why we don't just add all the edges adjacent to our initial vertex, e.g. the one linking A and K: the edge we end up choosing (the one linking A and B in this case) will be lower in weight than all the other edges adjacent to the initial vertex, so it may be possible to add K to the MST without needing the edge of weight 4, via some cheaper route. But then you might ask: when we added the edge of weight 2 to the MST, how did we know it wasn't cheaper to link K via J with the edge of weight 1 rather than use the edge of weight 2? The answer is that from B we chose the cheapest edge, so reaching K through J would be a more expensive path than just going from B to K, and if the edge of weight 1 is going to be added to the MST it'll be via our vertex K rather than the other way around, meaning we never need the comparatively expensive edge of weight 5 connecting B and J.

Speaking of which, the next cheapest edge connecting to an unvisited adjacent vertex is the edge {K,J}, so we add it and J. Then we can add H and {J,H}, then C and {B,C}, then D and {C,D}.

Looking at I, we have two choices: 10 or 13. Obviously we're going to pick 10 because it's lower, so we add I and {D,I}. A trend we're seeing here is that Prim's Algorithm always makes the best decision even when just looking locally. This makes it a greedy algorithm, and as the problem has optimal substructure (i.e. the optimal solution to a local subproblem will be optimal in terms of the whole problem) this is very convenient. Notice that if we had started the MST at I, we would have chosen the edge 10 as well, and we would have made our way around the graph until J had been added, but at no point would the edge 13 be added, as the method of choosing the cheapest edges would take us from D to C to B to K to J (with a side-stop at A along the way).
So now all we have left to worry about is E and F. Forgetting about the algorithm for a second, we know we can finish this tree by adding any two of the edges 9, 11 and 12 (throughout this post I'll refer to edges by either their weight or their two endpoints). As we want a minimum spanning tree, it makes sense to choose the two with the lowest weights, as this ensures the 'minimum' part of the spanning tree is retained. Because the algorithm has access to the edges 11 and 12 (they're adjacent to D and H, which are already in the MST), we add the cheaper one, which is 11, and then we can add edge 9, meaning our algorithm has given us the two smallest edges naturally. You might be thinking it was lucky that both D and H were already part of the MST so we didn't get stuck with edge 12, e.g. if D wasn't yet in the MST. But just like above, if that were the case, starting from H and picking all the cheapest edges would naturally form the partial MST above, with D in it, meaning under no circumstances would the edge 12 ever be added.

So there's our final MST. Note that it is indeed a tree and that also note how poorly that green colour stands out against the white background. Oh well, I blame Paint for its limited colour palette.

So here's the general algorithm in words:

Pick some vertex and add it to the MST
While not all vertices are in the MST:
    Add the unvisited vertex connected to the MST by the cheapest edge, along with that edge
return the MST

Too easy.
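
For the curious, here's that pseudocode fleshed out as a minimal Python sketch. It assumes the graph comes as a dict mapping each vertex to a dict of {neighbour: weight}, that the graph is connected, and (as in this post) that edge weights are unique; the toy graph below reuses the A/B/K corner of our picture, with a made-up weight of 3 for the edge {A,B}:

def prims_mst(graph, start):
    in_tree = {start}
    mst_edges = []
    while len(in_tree) < len(graph):
        # the cheapest edge crossing the cut (in_tree, everything else)
        w, u, v = min((w, u, v)
                      for u in in_tree
                      for v, w in graph[u].items()
                      if v not in in_tree)
        mst_edges.append((u, v, w))
        in_tree.add(v)
    return mst_edges

graph = {'A': {'B': 3, 'K': 4},
         'B': {'A': 3, 'K': 2},
         'K': {'A': 4, 'B': 2}}
print(prims_mst(graph, 'A'))   # [('A', 'B', 3), ('B', 'K', 2)]

Each step literally scans every edge crossing the current cut and takes the cheapest; the heap-based speed-up I'll cover next post replaces that slow scan.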

Now let's look at Kruskal's Algorithm:

With Prim's Algorithm, we started a tree and grew it, with each iteration adding the cheapest edge which didn't connect two vertices already in the tree (i.e. we ensured we never made a cycle). With Kruskal's Algorithm, we create a forest, i.e. several trees which all grow and eventually merge together to make the MST. Instead of looking for the next cheapest edge adjacent to our partial MST, we simply go for the cheapest edge regardless of where it is in the graph, and this obviously can create more than one tree in the initial stages of the algorithm. Just like in Prim's Algorithm, it is important not to make any cycles (as then it wouldn't be a tree). But unlike Prim's Algorithm, it's not as easy to know whether you'll be making a cycle: in Prim's you could just check whether the two vertices incident to the edge were already both in the MST, in which case adding the edge would be redundant. In Kruskal's Algorithm you may need to link two trees together, meaning you'll need to link two vertices already in the forest, which means you need a way of knowing whether they're in the same cluster or not.

Starting with the same graph, our smallest edge is 1, connecting J and K. Obviously adding this edge won't make a cycle so we add the edge and the two vertices to the MST.

Now we go to the next smallest edge: 2. Adding this won't make a cycle so we do it. Then we can add 3, but we can't add 4 because it will make a cycle. We can't add 5 either for the same reason, so we continue on to 6 which we can add.

The next cheapest edge is 7, so we've finally got a new cluster. The next is 8, and here we have an interaction between our two clusters. Although it's visible to us that no cycle will be made by adding the edge, the algorithmic approach would be to check all of C's neighbours, and those neighbours' neighbours, and so on until you know all the vertices in that cluster; if B isn't among them, it's okay to add the edge. There are two parts to Kruskal's Algorithm: creating/expanding a tree, and linking two trees. The creating/expanding part's correctness is provable by analogy to Prim's Algorithm: if the next globally cheapest edge happens to extend an already existing cluster, then Prim's Algorithm would have added that edge anyway, because the globally cheapest available edge is certainly also the locally cheapest one. As for the linking part, two clusters must be linked one way or another, and Kruskal's Algorithm always opts for the next cheapest edge, so there cannot be a cheaper way of linking the two clusters: if there were a cheaper path linking them, the individual edges along that path would all be cheaper than the edge in question, in which case the algorithm would have already considered (and added) them.

Next up we add 9, so again we have two clusters in our forest. Then we go to 10, which gets added, then 11, which links the two clusters into one, as we know no cycle can be made. We then go to 12, which would produce a cycle because F and H are now in the same cluster, so we ignore it. Finally we look at 13, which would also create a cycle, as J and I are already in the same cluster. Our algorithm wouldn't even need to consider these last two steps if we tell it to stop as soon as all the vertices are in the MST, which they are by the time we've added 11.
There you have it: we've ended up with the exact same MST that we did with Prim's Algorithm. Here's the pseudocode for Kruskal's Algorithm:

While not all vertices are in the MST:
    take the cheapest edge not yet considered
    if adding it would not produce a cycle:
        add the edge (and its endpoints) to the MST
return the MST
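
And again, a minimal Python sketch of the same idea. Cluster membership is tracked with a plain dict and merged by relabelling, which is slow but honest; the Union-Find structure mentioned at the end of this post is the fast replacement for exactly that part. Edges are assumed to come as (weight, u, v) tuples, and the toy graph is the same made-up one from the Prim's sketch:

def kruskals_mst(vertices, edges):
    cluster = {v: v for v in vertices}   # every vertex starts in its own cluster
    mst_edges = []
    for w, u, v in sorted(edges):        # consider edges cheapest-first
        if cluster[u] != cluster[v]:     # different clusters, so no cycle is possible
            mst_edges.append((u, v, w))
            old, new = cluster[u], cluster[v]
            for x in cluster:            # merge u's cluster into v's
                if cluster[x] == old:
                    cluster[x] = new
        if len(mst_edges) == len(vertices) - 1:
            break                        # n - 1 edges means the tree is spanning
    return mst_edges

edges = [(3, 'A', 'B'), (4, 'A', 'K'), (2, 'B', 'K')]
print(kruskals_mst(['A', 'B', 'K'], edges))   # [('B', 'K', 2), ('A', 'B', 3)]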

I've given some wordy explanations of why certain things about these algorithms work, but here I want to take a more formal approach and prove that both algorithms are indeed 'correct', i.e. that they do produce minimum spanning trees.

I'll start by defining a 'cut'. Recalling that the set of vertices in our graph is V, a cut is a partition of V into two non-empty sets: you decide for each vertex whether it's in group A or group B, and you call the cut (A,B). You can think of the cut as the boundary between the two groups or as the two groups themselves, whichever suits your needs. Each edge will either link two vertices in A, two vertices in B, or a vertex from each group. The edges which touch both groups are said to 'cross the cut'.
How many cuts can you get out of a graph with n vertices? Well, we can go to each vertex and decide whether it's in group A or B. Because there are two possibilities for each vertex, we end up with 2*2*2*... n times, or 2^n. But we need to enforce the condition that neither set can be empty, meaning the possibility that all vertices are in A and the possibility that they're all in B must be subtracted, leaving us with 2^n - 2 possible cuts. (For example, a graph with 3 vertices has 2^3 - 2 = 6 possible cuts.)

We'll do Prim's Algorithm first, and before we prove that it produces a minimum spanning tree, we must prove that it produces a spanning tree.

We'll start with a simple lemma (a lemma is a stepping stone towards a proof) called the empty-cut lemma: if a graph is disconnected, you can make a cut which has no edges crossing it, and vice versa. Recall that by definition a graph is connected if there is a path from every vertex to every other vertex.

To prove it, select two vertices u and v, and assume there is no path from u to v (the graph is disconnected). It's easy to show you can make a cut with no edges crossing it (i.e. an empty cut) by defining A as all the vertices reachable from u, and B as everything else (which contains v): you've got two non-empty sets (a cut) which no edges cross, because any crossing edge would make its endpoint in B reachable from u. Going the other way, assume there is a cut with u on one side and v on the other, for which no edges cross; then there cannot be a path from u to v, because such a path would require an edge crossing the cut. I am aware of how blatantly obvious this lemma is, but we must play by the mathematical proof-making rules if we are to create a rigorous proof.

So now we've got a way of talking about connectedness in terms of whether you can make an empty cut from the graph.

Second lemma: the double crossing lemma:

This one states that if you have a cut and a cycle which includes vertices in both A and B, then more than one edge of that cycle must cross the cut. The reason is that a cycle is a path that begins and ends at the same vertex without re-using edges. This means that if your cycle starts in A and goes over to B, it must come back to A via another edge (and vice versa if you consider it to start in B). This lemma is actually a weaker version of the statement that an even number of the cycle's edges cross the cut, because each time you cross the cut, you need to cross back again.

This leads us to the Lonely Cut Corollary (a corollary is a proposition which follows from something already proven), which states that if there is only one edge crossing a cut, it is not part of a cycle.

Alright, so we've set up the two basic lemmas we need; now let's get a bit of terminology. The set of vertices which our algorithm has added to the MST so far is X, and the set of edges is T. This means the vertices not yet added are in the set V-X. If a set of edges 'spans' a set of vertices, then the graph produced from those edges and vertices is connected (no empty cuts).

1) T always spans X. This is because the algorithm adds one vertex at a time, and when it does, it connects that vertex to the MST via an edge; so considering the algorithm never leaves the MST to go and start a new tree (the way Kruskal's Algorithm does), we'll always have a connected graph.

2) When the algorithm's done, X = V. This is because if we didn't have all the vertices in X by the end, then at some point we'd have had an empty cut (X,V-X), and if that's the case then our graph is disconnected, but we only take connected graphs as input. Combining 1 and 2 tells us that by the end we'll have a connected graph containing all the vertices.

3) No cycles ever get created in T. This is because at each point in the algorithm you're adding an edge that crosses the cut (X,V-X), and in doing so you grow X by one vertex and shrink V-X by one vertex, so next time you're doing exactly the same thing with the new cut (X,V-X). As we stated in the double crossing lemma, a cycle requires two edges crossing the same cut, so to create a cycle you would need to add an edge to your MST without changing the two vertex sets; thus we always get a lonely cut and never create a cycle.

Pictorially let's look at why cycles don't get made:
Recalling that Prim's Algorithm only considers edges which cross the cut, let's say that the edge to j is the next cheapest one.
So now we have two possible edges leading to k. If we were to add both we'd get a cycle. Because the choice is arbitrary, let's say the edge from j to k is the cheapest:

And now the edge that went from i to k is black. Why? Because X expanded to include k, meaning the edge no longer crosses the cut and thus will never be considered again by the algorithm, so the cycle cannot be made. The same can be said for all edges which could complete a cycle: you need two edges across a cut to make a cycle, and with each iteration the algorithm chooses one edge and expands X, forgetting about previous candidate edges which once crossed the cut to the same vertex.

So there's our long-winded proof that Prim's Algorithm indeed makes a spanning tree. Speaking of trees this reminds me of the scene in Lord of the Rings where Treebeard tells the two hobbits after a very long discussion with fellow trees (okay they're actually 'ents' for the lotr fans) that they've all decided the hobbits aren't orcs. Nonetheless let's power on to finding out why the algorithm gives us a minimum spanning tree.

So the first thing we need to do is prove the 'cut property', which states that if you have a cut (A,B) and, of the edges crossing it, e is the cheapest, then e will be in the MST. We'll prove it by contradiction: we assume we have an optimal solution that doesn't include e, then exchange an edge for e to get a better solution, which contradicts the assumption that our original solution was optimal.

So let's say in our supposed MST, which we'll call T*, we have a cut (A,B), and although e is the cheapest edge crossing that cut, we instead have the more expensive f in T*. Obviously we need some edge crossing the cut, or otherwise we'd end up with a disconnected output, which wouldn't be a spanning tree (by the empty-cut lemma).
It would be easy here to make the exchange argument that we can just swap f for e and get a spanning tree with a lower weight than T*, thus contradicting the assumption that it was the MST in the first place. However, the double crossing lemma tells us it's possible to create a cycle in doing so: it's possible that some edge g is also in T*, and you can imagine that carelessly exchanging f for e could create a cycle that wasn't present before.

Not only this, but if we exchanged f for e we'd leave a vertex in B disconnected. This tells us we can't just swap any two edges and hope to be left with an MST. Having said that, it is always possible to exchange e with some edge, which in this case is g. If we took g out and put e in, meaning we'd have e and f crossing the cut, we wouldn't have any cycles or disconnects, and as e is the cheapest edge crossing the cut, we would have reduced the total weight, thus contradicting the assumption that T* was an MST.
But how do we know that if T* is not an MST, there is always an edge g that can be exchanged with e in some cut to give a lower-weight spanning tree without any cycles or disconnects?

Observe that when you have a spanning tree, adding any extra edge will produce a cycle: if the endpoints of the added edge are u and v, then as they already had a path from one to the other (which is part of the definition of a tree: each vertex has a path to every other vertex), the new path given by the edge produces a cycle (as two separate paths from u to v create a cycle). So if we add e to our spanning tree, we know a cycle will be made. Using the double crossing lemma, we know that a cycle which crosses a cut at least once must cross it at least twice, meaning there will be another crossing edge (in this case g) which we can remove to break the cycle. In doing so we will certainly reduce the total weight of the spanning tree, because we knew at the beginning that e was the cheapest edge crossing the cut.

Now we need to consider the chance of any disconnects after exchanging the two edges. Because the vertices in the cycle had two paths between them, after removing g and thus breaking the cycle, there will still be one path between them, meaning we don't lose any connectivity in the spanning tree.

So if we have a T* for which such an exchange is possible, then T* is not a minimum spanning tree, and if no such exchange is possible, it is a minimum spanning tree. This gives us the cut property: for any cut in a graph, the cheapest crossing edge will be included in the MST. As Prim's Algorithm always chooses the cheapest edge crossing the cut (X,V-X) (where X=A and V-X=B in the above exchange approach), it will end up with no opportunities for such an exchange, meaning it indeed produces a minimum spanning tree.

Now onto Kruskal's Algorithm. Again we need to prove that it produces a spanning tree, i.e. that it has no cycles and no disconnects. Again, we'll say T is the set of edges produced by the algorithm at any point in the algorithm's runtime.

The cycles part of this proof is easy, because Kruskal's Algorithm specifically prohibits adding any edge which would produce a cycle, so this is self-evident.

The connectedness part is a little more tricky. We can use the characterisation of connectedness that for any cut (A,B) in our graph G, there must be an edge crossing it in the MST. Assume for now that Kruskal's Algorithm considers every edge at some point, and that our input graph G is connected (and thus has at least one edge across every cut). Then by the lonely cut corollary, the algorithm will have no problem adding an edge across any given cut to T when T previously had no edge crossing that cut, because doing so can't create a cycle (a cycle which crosses a cut requires two edges to cross it). So as each cut in our output has at least one edge crossing it, we know the output is connected.

That argument assumed we consider every edge, which in practice we don't need to do, as we can stop as soon as we've added all the vertices. We'll know we've added all n vertices when we've added n - 1 edges. Why? Consider a tree and pick a random vertex; to reach each of the other vertices you need one edge each, hence you end up with n - 1 edges. And if you have n - 1 edges and no cycles (which this algorithm guarantees), then you necessarily have a spanning tree.


As for Kruskal's Algorithm producing a minimum spanning tree, it's not as simple as it was with Prim's. With Prim's, the pseudocode explicitly appeals to the cut property: you add the cheapest edge crossing the cut (X,V-X). Here the algorithm is just adding the cheapest edge which doesn't create a cycle, so we need to show that it inadvertently appeals to the cut property anyway.

So let's say we're about to add an edge e which links u and v. Obviously u and v must be in different clusters, or otherwise there would already be a path between them, and adding the edge would produce a second path and hence a cycle. This means we can manufacture a cut between the two clusters which our edge crosses (as for the other clusters, whether they land in A or B is irrelevant). The fact that the algorithm is considering e before any of the grey edges means e is the cheapest of them, and the fact that e is the cheapest edge not yet added to T means it must be the cheapest edge crossing this cut; by the cut property, the cheapest edge across any cut belongs in the MST. The only reason Kruskal's Algorithm would have not to include the edge is if it created a cycle, which is impossible, because the cut was empty at the beginning of the current iteration, and adding an edge across an empty cut cannot form a cycle, by the lonely cut corollary.
So we know that Kruskal's Algorithm doesn't produce any cycles, doesn't output a graph with disconnects, and satisfies the cut property, meaning it indeed produces a minimum spanning tree.

I'll leave the post there because it's gotten quite long, and next post I'll talk about faster implementation methods such as heaps in Prim's Algorithm and the Union-Find data structure in Kruskal's Algorithm.

Special thanks to the YouTube channel YogiBearian for its great explanations of minimum spanning trees.