Tuesday, 10 June 2014

What's Up With Graph Representations?!


This post was going to be all about trees, but It got pretty long with the discussion on adjacency matrices and lists so I decided to just make this a post on representing graphs. I'm going to handle the two basics here and there are some modified versions which make use of priority queues but as I haven't discussed that yet it will need to wait for later. Here we go!

Adjacency Matrix (it gets its name due to the fact it shows which vertices are adjacent i.e. have an edge connecting them):

  FROM  0  1  2  3  4
TO
0       0  0  2  0  0 
1       3  0  0  9  0 
2       0  0  0  0  0 
3       0  8  0  0  0 
4       7  0  0  5  0 

The above 2D array gives tells us everything we need to know about a directed, weighted graph. The indices 0-4 represent the indices of the vertices (which I'm going to call nodes from now on because it sounds cooler and is faster to type). Scanning along the top row I can see that there's an edge going from node 2 to node 0 with a weight (or 'length') of 2. If I read the whole thing, I'll be able to actually draw the graph:

If you're dealing with an undirected graph, you can either duplicate the edge values like so (we'll ignore the edge going from 3 to 1 because you only need 1 edge between nodes in an undirected graph, also 0 means there's no edge):

  FROM  0  1  2  3  4
TO
0       0  3  2  0  7 
1       3  0  0  8  0 
2       2  0  0  0  0 
3       0  8  0  0  5 
4       7  0  0  5  0

Notice the sweet symmetry, but also notice it's fairly wasteful to duplicate values like that because we're using twice the space that we need. When adjacency matrices get enormous (and I mean enormous), it's a good idea to use a triangular matrix. We can just get rid of the bottom-left repeated elements and keep the top-right ones (i.e. we'll use an upper triangular matrix). I'm going to keep the diagonal line of 0's there in case for some reason certain nodes had loops (edges that lead to themselves).

  FROM  0  1  2  3  4
TO
0       0  3  2  0  7 
1          0  0  8  0 
2             0  0  0 
3                0  5 
4                   0

Now we just need to find a way to store this in a clean array. We can use a 1D array (called E) with each row head-to-tail like this:

0  3  2  0  7  0  0  8  0  0  0  0  0  5  0

If there are n nodes, the first row has n and the second row has n-1 and then n-2 and so on until there's only one left. You could also see it as starting from 1 at the bottom, to 2, then 3, and eventually n. What's the formula for adding up 1+2+3+4...+n? If you can recall from my post about this exact formula, it's n(n+1)/2. So in this example where n = 5 we'll have 5(5+1)/2 = 15 elements in our 1D array.

The next step is to make a function which lets us put in the row number (i) and the column number (j) and get the value the edge's weight. After some whiteboard wizardry, here's the algorithm I came up with to do it:

0 GetWeight(i,j):
1     if (i > j) swap(i,j)
2     k <-- 1
3     index <-- 0 //this is the index in the 1D array
4     while (k <= i) {
5         index += n - k + 1
6         k += 1
7     }
8     index += j - i
9     return E[index]

Line 1 checks whether you're referring to the bottom-half of the array and if so, swaps i and j to redirect you to the upper-half of the array where the actual edge weight is stored. Whenever you're referring to somewhere in the bottom-half, i is going to be greater than j (check for yourself).

As for the while-loop, the idea is that for every row you go down the number of elements decreases by 1 so if you're looking at the second row i.e. i=1 you need to add n or n-1+1 to your index but if you want to go to the next row i.e. i=2 you need to add another n - 1 or n - 2 + 1 to your index. The relationship here is that you're adding n - i + 1 for each extra row you go down (and 'k' in the code takes the place of i because we need to preserve i for line 8). As for the column position, let's say i = 2 and j = 3, well the while loop will get you up to the first element in the row i=2 but because of the missing elements to the left you need to re-calibrate your column index by subtracting i from j. You don't want to go adding 3 to the index you're already at, you just want to add 1 to go from the element [2][2] to [2][3] and to do that you can just add j-i to the index.

I am very much hoping that's the standard way of going about this problem, it seems like a pretty tight algorithm to me (so long as the user ensures i and j are less than n i.e. they don't go outside the bounds of the matrix). If you don't want the diagonal line of 0's in there you can just replace the i with i + 1 because effectively you'll be dealing with the same number of elements as if you had ignored the row i = 0. Another way to look at it is to just chuck in a -1 in line 5 and 8 because you've got one less element in each row above the current element (line 5) and one less element in the element's row (line 8):

0 GetWeight(i,j):
1     if (i > j) swap(i,j)
2     k <-- 1
3     index <-- 0 //this is the index in the 1D array
4     while (k <= i) {
5         index += n - k
6         k += 1
7     }
8     index += j - i - 1
9     return E[index]

Alright so we've handled adjacency matrices and specifically triangular matrices, now onto the next way of representing graphs:

Adjacency Lists:


Instead of storing our edges in a table we can just have a 1D array of all the nodes and each element will be a pointer to a linked list of edge weights and destination nodes. What is a linked list you say? It's a very flexible data structure which is all about modifying after creation. In an array you can modify the values of each element but you can't change the size of the array itself (and if you can, your programming language is lying to you about it being an array; it's just some variation of a linked list). A linked list stores each element in wherever the next accessible chunk of memory is with a 'pointer' which will lead to the next element. Let's say I want to make some linked list which describes the outwardly directing edges that my node 0 shares with with other nodes. Considering that in a linked list, the elements plus the pointer form a 'node', I'm going go back on my original statement and call my actual graph's nodes 'vertices' from here on. So I need two elements in each node, the index of the destination vertex and the weight of the edge.



...can be represented as...

A pointer is just a number referencing the next node's address in RAM. The RAM is just the one array to rule them all so when I say 'address' I really mean 'index'. The null pointer could be anything from a special command to just a '0' as it often is in C++ (the compiler won't do anything to those parts of the RAM because the operating system takes up all of the early addresses). The elements in each node (the destination vertex's index, the edge weight and the pointer to the next node) are all contiguous i.e. adjacent in the RAM but nodes are not necessarily contiguous. If you put vertex 0's first node (the '1 |3 | -|->' one) into memory and then let your program do some other stuff, more memory could have been placed after it. And if then you decided to add vertex 0's second node to memory, you would need to find the next available space for 3 integers i.e. 12 bytes (recalling the pointer is also an integer) and then start from the pointer at index 0 in the array of linked lists and follow the chain until you get to the first null pointer and then set the value of that pointer to the address of the new node. This kind of memory allocation is called 'dynamic' because it can deal with the changes to memory that occur over time.

The array of indices is static (i.e. not dynamic) and won't change size but if you instead used a linked list of vertices then you would be able to add and remove different vertices whenever you wanted.

All you really need to deal with a data structure like that is a pointer which points you to the address of the top-left node (i.e. the first element in the top left node) and a knowledge of what each element position in each node represents i.e. the second element in the vertex nodes (on the left) is a pointer to the next vertex, and the third position is a pointer to the first outward edge coming from that vertex.

Slight aside about how memory addresses work: For all the RAM conventions (definitely for 32bit processors) that I've heard of, each address is the index of a 'word' or 4 bytes or 32 bits which is the standard size of an integer. So if my pointer's value is 2000 it's pointing to the integer at the index of 2000 i.e. the 2001st word (there's a word when the address is 0). If I know the next word is an element referring to a vertex index I can just add 1 to my address and if I read that value I'll find out what the vertex's index is. It's just like working with smaller arrays. All this talk of pointers has made me think I should have just done a post on them earlier but instead I'm going to leave that for later and charge ahead!

Let's compare these two approaches, using basic adjacency matrices (not the triangular ones) and the first adjacency list I showed (where the list of source vertices is an array). The reason I'm choosing these versions is that neither can add vertices so we can focus purely on the edges.

Adjacency Matrices are great when you have a lot of edges because lots of the grid spaces are being utilised whereas if you have 100 vertices and only 3 edges, you're using 3 out of 100^2 elements in your matrix which is a huge waste of space because those extra 9997 elements don't tell you anything extra about the graph. An adjacency list on the other hand will take up barely any space with just three edges because the space required is just 1 integer for each vertex (plus the array's header) and another 3 for each edge. On the other hand if you have a huge amount of edges (like an almost complete graph) then those elements on the adjacency matrix will be well utilised and the extra baggage that comes with the adjacency list will add up (i.e. all those pointers which aren't present in the randomly-accessible adjacency array).

In terms of big O notation, with an adjacency matrix you will require a space of O(|V|^2) (where V is the set of vertices and so |V| is the length of this set i.e. the number of vertices) and with an adjacency list it's O(|V| + |E|). Remember that there can be |V|^2 edges and the coefficients are larger for the latter so if there are a lot of edges compared to vertices an adjacency list is a good idea. Also, adding, modifying or deleting an edge from the adjacency list can all be done in constant time whereas with an adjacency list, removing or modifying an edge requires you to find it which takes O(|K|) time where K is the set of all vertices adjacent to a particular vertex. However like with adjacency lists, adding an edge can be done in constant time if you have stored the address of the null pointer at the end of the edge list somewhere (meaning you can store the new edge in the next available place in memory and set the pointer at the end of the list to the address of the new edge). If you want to count the number of adjacent vertices to a particular vertex (i.e. find |K|) with an adjacency matrix you'll need to scan the whole row  (or column or both) meaning the time complexity is O(|V|) whereas with the adjacency list you will only need to scan along the edge list for a particular vertex i.e. O(|K|). OK? For everything other than finding |K| for some vertex, the adjacency matrix outperforms on time but the adjacency list wins on space, assuming the graph doesn't have too many edges.

If you really wanted you could have just made a single-chain linked list of edges including the source vertex, destination vertex and edge weight but that doesn't sound very efficient to me. Also these edge lists can have pointers that allow you to go in the reverse direction (which would make them doubly linked lists) but there is rarely a need (that I know of) when it comes to graphs like this.

Alrighty next up will be spanning tree algorithms and then binary trees.

No comments:

Post a Comment