Description

Published date: 10 October 2014

When Google launched in 1998, a prime ingredient in their not-so-secret sauce was the question: if a user randomly clicked links where on the web might they end up?

They called the answer PageRank. This involved treating the web as a network rather than a bunch of isolated documents containing keywords. The outcome was a new verb and the near destruction of their competitors. Could repeating and refining ‘the Google trick’ help cultural bodies with research, collection care or digitisation?

Matthew Pearce, from The National Archives, works on public sector information – in particular, its economics. He is researching the statistics and algorithms needed for personalised predictions.

Transcription

This came about last year, I took part in a competition online, I don’t know if anyone here has heard of Kaggle? It’s kind of like a ‘turn stats into a sport’ and you take part in a competition, you try and predict things. And I did one that was about recommending events to users of a website.

So I did ok in that, I came in the top 25 percent. So I thought, what could we do that was useful for TNA [The National Archives] and the cultural stuff, as well as kind of interesting from the statistical perspective?

At the same time I had a conversation…there was a group about what innovative stuff could be done at TNA and I kind of casually said at one point, ‘Well you know we could look at how you do search rankings in the same way that Google did.’

And I thought about it, and thought actually, how could we do that? And then kind of went further and further with the idea. So this is generally what came out of it.

So, looking at how users move around the records. And when I mean the records I mean people actually ordering documents here on the site, and taking them out on the site, not necessarily looking at things on the web, although that’s possible. You know you have got to choose which particular bit of the problem you tackle first and what you look at. And I thought, documents at the site is probably more interesting because it’s not been done as much. Obviously, with websites, lots of people study how people click around websites, but not so much in an institution like this, maybe.

So why should we care about this? Why should we bother doing it? Well the usefulness of looking at how people navigate around sets of documents [has] mainly come out through search. If we cast our minds back to around 1998 or 1999, do you remember using sites like Lycos and Ask Jeeves? And how many people still use them? Not very many. I mean, you might still use Yahoo sometimes, it’s got elements of things that are still useful in it – like the finance information is pretty good. And then you have got user-curated, sort of person edited, search engines like DMOZ [directory.mozilla.org], so actually getting a human being to go along and index every web page. It kind of got a little bit out-run by the way the web works, although it is still around and I think it is still useful and uses a signal for other people.

So how did Google come to dominate the search market? Well there is probably a bunch of reasons. A really simple one is that the results were better then those of a lot of the other sites around at the time. And they also found a really good way to make money out of people searching for information, so they found a good way of selling their adverts.

But in terms of how they improved their search results, they had this model [shows image]. So the diagram you are looking at, at the moment is just a load of coloured blobs. But I want to persuade you to imagine that these coloured blobs are webpages, and these arrows between them are links between the webpages.

So Google’s model back in 1998 – I am sure they have much more sophisticated techniques now – was based on the idea that, say if somebody materialises at page E, that’s where they start off. Then they randomly choose one of the outbound links from E. And then say they went to D, they would then have to bounce to A, because that’s the only way to get out of A. And then Google also said, well people will randomly go to somewhere else on the entire internet with a certain level of probability.

And they ordered the search results. If you can imagine somebody bouncing around this system for a long time, where would they end up? What would the final destination be? And you can attach a probability to somebody ending up on, say page B, after bouncing around the system.

So in this particular graph [shows image] the explanation on Wikipedia – you can all go and look at it again – so B has the highest chance of somebody ending up there after a long time.

So if this sounds drastically kind of simple, somebody just randomly chooses any link on a webpage when you’re there, I mean imagine if you go to Amazon – how many links there are on a single webpage? Their model was just based on you just click any one of them completely randomly. And it works really well. So it worked well enough to beat all those competitors. Which is quite surprising really, but nobody else was doing it at the time. They were doing things like counting up numbers of keywords and how often they are used on web pages.

This is just what I have described to you [shows image]. Again, so just to emphasise, the size of those discs represents, in the long run, how often would somebody land on one of those particular web pages.

The maths behind this – and I promise not to use any equations in this presentation! – although there are probably going to be quite a few pictures for people to look at. So the maths behind this idea of people bouncing around the system for random changes in what we call the ‘state of the system’ is quite old.

Here is another simple example: it’s a two-state version of the weather. It can either be sunny or rainy. And if it’s sunny it will stay sunny with probability of 0.9. Or it will become rainy with a small probability.

Now, this way of looking at the world has proved to be quite useful. It has some limitations, but you can do all kinds of things with it. And so, some of the things it’s involved with at the moment:

Speech recognition – so if you make one noise, what is the probability that the next noise you make will be this other noise?

Detection of security intrusion in computer systems – so if you are a legitimate user you will behave in one particular way, if you are not you might behave in a different way – so it can be used to help detect those kind of attacks.
It’s used in demographic analysis – so if somebody is employed what is the chance in the next period they will become unemployed, et cetera.

It is used all over the place. It’s also used as a component of other techniques. So one of the first things they did with a computer, after they’d invented it, was simulate explosions of the nuclear bomb with it. That’s where they got the money from to build the computer…

What can we use these techniques for? So Google was looking at the web, and the web has visible links. So you can actually see a particular document will say: I am a document, I am linked to this other document over there. Public records don’t have that in the same way. Obviously, a lot of colleagues have put effort into curating the content and saying ‘you know if you are looking at this you might also want to look at that’. And inside the documents themselves, there is not the link that you can just click to teleport to another document.

So, how can we do a similar thing? And the idea as it came to me, was we could look at how our users navigated records in the past. So there’s a few different sets of data I think around the organisation about this. You can look at how people moved around the website and the documents that are already scanned. Or, you could look at how they moved around the internet catalogue for things that haven’t been scanned. Or, you could look at which documents they actually took out.

So I decided to work with the data about the documents that people actually took out. One reason for this was I thought it would be interesting to look at paper stuff, and the other reason was, it seems to me, that taking out a box of documents is quite a strong signal that somebody is interested in the contents. It is very cheap to click, and then click back again, and go to another webpage and decide that it wasn’t for you. And so, it seemed like perhaps there would be a bit more investment of the users interest in what they have taken out physically, because it takes more time and effort to do that.
Obviously, I’d really appreciate colleagues’ opinions on, you know, whether that assumption is valid and to sort of challenge that and say: well, actually the information about how people use the web catalogue is just as valuable.

So what we need to do then, using this form of data, is to look at how many individuals went from one series to another series, to another letter code to another piece – and you can analyse it at different levels. So say for example from the Duchy of Lancaster to the Palatinate of Lancaster – I have been looking at a few of these letter codes for the last few days, so that’s just one example. So you might count the number of different transitions that people make.

But if we have got a different kind of data, maybe we can do a different kind of analysis? Do we have to do exactly the same thing as say, Google did?

The first thing that I thought would be interesting is: can we ditch the assumption that everyone behaves in the same way? Now, it seems to me that there are lots of different kinds of people that use the archives. If we cast our minds back to this diagram [shows image] this is saying everybody bounces around this diagram, imagine this is the whole of the internet – I am really stretching your imaginations – imagine this is the whole of the internet and everyone bounces around with the same probabilities.

Actually there are lots of different people; maybe some of the people bounce around in a different kind of way. How do you even model that? And the second thing is, instead of thinking about where in the long run would somebody end up, you ask the question: ok what’s the next thing that they are going to do, rather than where might they end up in a thousand time steps out. It’s looking at the next thing they do, rather then the long run.

So to give some kind of visual clue about why you might want to model people as behaving differently to each other, rather then thinking everybody does the same thing, I am going to show you an example.
This idea gets bandied around quite a lot which is that: does more data beat better algorithms? Some people might have heard this idea, and I thought it would be interesting to test it.

In some circumstances the answer is no. So this is one of those graphs I threatened to show you [shows image]. Here we have data – these black bars – it’s just a histogram, most people will have seen these before. So, the black bars are the real data that are in our sample, and this red line is the model that we are building out of the data that we have got in our sample.

So we start off with ten observations, and then we get 1,000 observations, and then, to you know really go over the top, we can take a million observations. You can probably see from the graph that we have two blocks of data, and the model only has one block in it.

So we are saying that all this space is really likely – this number zero is the most likely thing to happen – but unfortunately it has not happened in the data at all. This isn’t a complete failure because at least you know that the number is likely to be somewhere near zero, it’s not going to be off a million, or minus two billion or something crazy. But it just looks like you’d want a model that followed the data rather then failed to do so.

So the point here is, that if the models you are using don’t fit the data very well, just getting more and more data sometimes will have a limited benefit. If this is a really obvious problem – you know you can just see these two lumps in the data – why can’t you just sort that out? Well it has been fixed, I don’t know if anybody here has heard of regression? So in regression you would say this was the outcome variable Y, and we will use some other data, like people’s age, or their shoe size, or whatever, to try and predict things. So you will end up with people with a small shoe size, and people with a large shoe size, and that will help you get over this problem of having two lumps in the data and only one in the model.

So there’s some problems with that, first, can you get hold of the data that you need to help you? So it might be impossible to collect, unless the government can force people to disclose their personal wealth then they don’t have access to that information. (I don’t know if anyone has heard these debates around Thomas Piketty at the moment, that’s one of the problems – can we actually get ahold of the data?) Another is: so X might be personal – you might be trying to predict something using personal data. And it’s best not to do that if you don’t have to. Obviously, for many applications that’s going to be essential and you have to take real care there. So it would be good to be able to not have to use X if you don’t have to, if it’s personal.

Another thing would be [that] collection is expensive. So you might have to go out and do surveys – this can get quite pricey. And then you might encounter the second problem, which is: is the data that you have – so this would be your shoe size or your height – is that actually going to help, or do you also need somebody’s…hat size and leg length as well? And if you didn’t collect that you could have a problem.

There’s other issues to do with things, that if you have another type of data that causes both the data you have and the X-variable – that’s quite abstract, sorry guys. Here’s an example: There was a study suggesting that people that take short breaks get cancer. Of course there is another variable that causes short breaks and cancer, which is smoking. So do you know that you have included all the things that are equivalent to smoking in your dataset? You might do, you can have arguments about that, but it can be a difficult problem to solve. In many cases it is absolutely essential to do that, but can we dispense with that here for our problem?

The answer I am going to try and convince you is: yes we can. We can live without any personal data. We can just model how an anonymous user moves around the records without knowing anything else about the person. We don’t need to know the shoe size or anything else about them!

So this is just looking for what people do and not why they do it. So this is the idea of dispensing with those kind of explanatory variables and just looking at how people behave, and not why they are behaving like that.

So this I am going to show you is called mixture modelling, and I don’t expect you to imagine that just from the name – so I am going to show you an example [shows image]. Imagine there are three different colours, and each of the colours is linked to a kind of behaviour.

So we have got a red kind of behaviour, and a blue kind of behaviour and a green kind of behaviour. And the data that people generate, so how they move around the records – or how tall they are, plus how wide they are, and how heavy, and what their shoe size is – can all be characterised by a red kind of behaviour. And everybody that’s got a red behaviour would be similar, and people that have a blue behaviour would be similar, and people that had a green behaviour would be similar.

Imagine that we only have three people, we have got to choose how to group those people into different, what we call, classes. So in the first row we have put all three people in the same class. We are saying that they all have the same characteristic behaviour. So this is the same as the first diagram that I showed you, where we were assuming everybody does the same thing. Then, in the next three rows, we have got two different kinds of behaviour involved: we have got a red behaviour and a blue behaviour.

And just for example, in the second row we are saying that the first and the second person share a similar behaviour [but] the third person is doing something different, they have a different kind of behaviour. And on the last row we are saying that everybody does something completely unique and they have their own characteristic kind of behaviour.

So next, once we have decided what kind of behaviours the people have, we can generate some data for them [shows image]. In this case, the people are flowers and we have got the sepal length – this is the observed data – and their petal width. This is a really famous dataset, and as you can see there is a characteristic green blob and a red blob and a blue blob.

So that’s what is really going on behind the scenes when we know what species the flower things are and we have decided to give them colours. If we didn’t already know that – because the dataset was collected by a famous biologist in that case, we would just see something like this [shows image] which is much less informative. And then we would be faced with the problem of: ok how do we recover these coloured groupings, these kinds of behaviour, just from seeing this kind of unlabelled data?

So this is what we have got, what we are analysing in terms of user behaviour, something where we don’t have the labels to begin with, we don’t know how to group people and we are trying to find that out.

So there is ways of analysing these things. This is an attempt for modelling this data [shows image] – it is the same dataset we were looking at, just a different angle. This is assuming that all the data behaves in the same way, so this is equivalent to the first row here – everything is doing the same thing. In this second diagram, we’re saying that there is two different kinds of characteristic behaviour. And it picks out one characteristic behaviour here, and another here. And that’s equivalent to these kinds of rows – its tried to sort people into groups. And in the last diagram [shows image] there’s three characteristic kinds of behaviour, corresponding to this last row.

And this happens to be the true model, so it is not surprising that it has picked out the correct sort of placing of the different kinds of behaviour. So this at the bottom here is an equivalent to this kind of model, but instead we have got the red line following the contours of the observed data exactly, which is nice.

But we have got a problem. Here the dataset was from the famous biologist – he knows his three species involved in the data – how many are there in our data, could somebody tell me please? It’s actually it’s not easy, so we have to take different steps to figure out how many different kinds of behaviour or clusters there are in the data.

So I am not actually going to try and explain to you the technicalities of how this works, but essentially instead of going with a model that says everybody does the same thing, or there is two kinds of behaviour, or there is three, we average over what would the world be like if there was only one kind of behaviour. What if there was two, what if there was three, what if there was 1000, what if there was a million kinds of different behaviour?

So we avoid having to answer the question directly, just by saying: what would the world be like if that was the case. This also, as you have more data you can have more different kinds of behaviour enter the model. So it effectively grows in richness as you get more data.

Ok, so that was all scattered diagrams and I have been talking – I was meant to be talking – to you about time series, about how people move around the records over time. So it’s nice, I can find some blobs on a scatter chart – does it work for the stuff that we have got? Or what have we got?

Our data, as I am alleging to you looks something like this: a user looks at Home Office record and then another one and another one, and then a War Office record. Or maybe they look at something from the Met [Metropolitan] Police then the [UK] Atomic Energy Authority, then you know several other records. So that’s the kind of data that I am analysing here. What is the equivalent of the different kinds of behaviour for that kind of data?

If you remember back to this diagram I showed you before [shows image], what we are saying is there is , for example, two characteristic kinds of behaviour again: there is blue behaviour and a red behaviour. What’s different between them is differences in the probability of moving between different states.

So maybe this one at the bottom is somewhere like Manchester, where I grew up, where it is quite likely to be sunny and then go to rainy. And maybe the blue behaviour is more like the south of France, where once it gets sunny it stays that way for quite a long time. So that is the kind of equivalent that we are running here.

This is some randomly generated data using three different kinds of behaviour, there is red, blue and green again – to help people get some sort of continuity in what we are talking about.

So the numbers on this axis are the different states, so like; sunny, or rainy, or Amazon, Google, Bing, whatever. And then we are looking at how the series moves around over time, so what’s the chance of it being sunny and then going to rainy. And then over here we have…a matrix that shows the similarity between different time series.

So all these really are, say, the blue behaviour time series, and this block represents the red behaviour time series, and this is the green behaviour time series. So we have not plugged into the model the fact that there are three different behaviours; it has just figured that out. It works with two, three, four, five – however many you like.

The advantage of using some simulated data is just to check whether your software works. Because you know what the answer should look like if you have used three different types of behaviour, you should find three different blocks of similarity between the different time series. The other thing that this is telling you is that what we get out of this is the similarity between one person and another person, or one flower and another flower, or one time series and another time series, rather then any answer saying this person has a red behaviour or a blue behaviour.

That was quite ugly, that last slide. One of the reasons is that the time series were like 30, so on these slides [shows image] this is two-dimensional…I would have to have something with 30-axis on it to plot them on a scatter plot, and the human visual system doesn’t work that way. Then also, there’s around about 110 different parameters so you can’t plot those either on one plot.

So that’s quite difficult. So if we apply this technique to user data, how people have moved around the records, what do we get out? We can show you what the network looks like for an individual. So here [shows image] I will show you…this is the matrix. If we can remember back about those sort of movements between something being sunny and it being rainy, this is kind of the equivalent for public records, as seen here for a particular individual. So if I use a different set of time series, if I use a different individuals data as an input to this, you would get out a different map. But for this particular time series this is the map that you get out.

My computer screen is off so it’s difficult to get the names of these things, but here you have got things like: the railways, and then you have got the Rail[way] Commission and these things are from the British Rail libraries. This is from British Rail libraries and they have got the Coal Authority’s, and then we have got the Ministry of Power down here. So you get out these thematic groupings.

Here is another sort of grouping that is quite kind of easy – or easy-ish – to see what is going on. So I had a look at these before and they all seem to be to do with old court records, really… old documents, I am not sure – somebody will have to tell me. So this is the [Court of] King’s Bench; you have got the Court of Common Pleas; and assizes; and King’s Wardrobe – which I believe is quite an important station at a previous point in history; this is to do with Durham – which I think was an important legal site; that’s the Star Chamber – which I wouldn’t want to visit too often. So it has figured out that these different sets of records should be fairly close to one another.

Of course, when you have got a visualisation like this there is 315 different sets of records looking at this level of depth, so the actual matrix of numbers – its like a spreadsheet 315 rows, 315 columns – is nearly 100,000 numbers. So a lot of information gets lost when you squish things down into two-dimensions to look at them.

So I was kind of pleased that you can find many groups that look like they have coherence, semantic coherence, at all. Because when we are looking at these diagrams you could be looking at many things that are arranged in parallel, and you simply happen to be staring at them all in a line – so everything gets squished together, but you can see some areas of common meaning there.

And just to convince you that these are different, here’s one for a different time series [shows image]. So here you will see the situation where you have got the Ministry of Energy, and this is labour relations or something, and then… what was AB… it’s the [UK] Atomic Energy Authority, so it is another power related one. And I think that’s Pensions, and what was F? It was…Trade and Industry.

So at the bottom here we see a cluster that’s heavy industry and power generation things, so it might be that which ever time series we put into this, the person had been interested in things to do with energy policy. But it could be just that they were interested in something completely elsewhere in the chart, and they just happened to look like other people that liked documents to do with coal and atomic energy. And if they did happen to go and look at something from the Ministry of Energy, then we would expect them to go and look at something like coal documents next…

So this was just about how long it takes to do this, obviously there is a trade off between how rich your model is and how long it takes to run it. It takes quite awhile to run one of these things, you have to use parallel computing. All these stickers that you get on your computer that say like [intel core] i5, or whatever, a lot of software doesn’t actually use them – a lot of the off-the-shelf analytical software you get doesn’t always use it. So you have to explicitly do that, it’s quite tricky.

The other thing is, when you have got these huge numbers of different edges, if you have 2,000 documents you don’t want to fill in – if you have a big spreadsheet you don’t want to fill it with zeros so you basically have a spreadsheet where you only have the cells that have something in them, and the rest don’t kind of exist so your not filling up your computer with nonsense.

So the other issue was, if we look back at the beginning…these numbers…how big is the circle, was just the long run probability that someone ends up looking at those documents. And we want to look at: what’s the chance at the next thing they do being removed from the War Office to Ministry of Defence?

And for example, if you have got a dataset where everybody ends up looking at War Office documents, maybe you have got a researcher who really wants to look at Rail documents and they would really hate it if everything ended up getting weighted way towards War Office documents, maybe…

So if we can just think about the next thing they do, it saves us from that problem. So we are just focusing on the choice that somebody faces here and now, like, what do I do next, rather then what do I do at some point in the far distant future.

So this model, doing the clustering, looking at how similar users are to each other, does explain the test dataset. So if we put a 1,000 time series to one side, and look at how likely they are under the original kind of ‘Google style’ model (I am sure they have massively, massively more sophisticated things now).

So this is versus the kind of model used from 1998, and the clustering kind of model does explain the data a lot better. And the here-and-now question does save some computer resources, because once you’ve done this analysis – it takes a long time to run – it’s easier to get the answer to the question: what are people doing next?

Conclusion. Is this data science? It’s something that has been talked about a lot recently. I don’t know maybe it is. The journals I used to write the software et cetera were in statistics, and machine learning and computer science. So you tend to find there’s a lot of people that kind of move between different disciplines rather then something else out there called data science.

The other thought that I had: you can do this kind of work because we have got a log of different transactions that take place in the past. It is all quite routine stuff. I think perhaps in the future you will find that there is other routine stuff, that at the moment we suggest people get rid of it, because it doesn’t seem particularly interesting, but it could be interesting to some kind of computer equipped analyst.

So like emails between different departments saying how many biscuits shall I order for the meeting? That tells you something, that tells you that one department is talking to the other department and that they go to the same meetings.

So if you can imagine looking at some kind of graph like this in the future, where instead of it being to do with time series and records, it’s to do with different teams in a ministry and how close they are to each other – based on how often they communicate – and you can look at how things changed over time. So you can do all kinds of interesting things with that.

There are plenty of people around that do things like that now. Like when Enron went the wrong way, people looked at how emails were exchanged within the organisations. And I think there will be more and more people doing that kind of research in the future. And kind of having access to low level kind of trivial information can actually be quite useful in that situation, so that’s my closing thought.

Transcribed by Nikki Vickaryous as part of a volunteer project, April 2015

Leave a comment

Help

You can find help on how to download and listen to our podcasts in our quick guide to getting started. If you wish to re-use any part of a podcast, please note that copyright in the podcasts and transcripts in some cases belongs to the speakers, not to the Crown. Please contact the Copyright Officer with queries. If commenting, please be aware of our moderation policy.

Subscribe

Select an option to receive our free podcast series, using either RSS or iTunes. See our help guide for more information on podcast subscription.