
Duration 00:41:14

Big Ideas: Big Data for Law

Big data is big news. Did you know an estimated 90 per cent of the world’s data was created in the last two years? Insights gleaned from large datasets are increasingly driving business innovation and economic growth. Underpinning this ‘big data revolution’ is a powerful combination of low cost cloud computing, open source analytics software and new research methodologies. These are enabling us to move from simply storing large sets of data to extracting real value from them. Big data analysis can now tell us everything from the most borrowed library books in 2013 to the most overweight areas in England.

John Sheridan, Head of Legislation Services, introduces the Big Data for Law project. Why does data matter in law? What are we doing to transform legal research? Can you imagine what an annual ‘census’ of the statute book might look like and what it could be used for? If you care about law, how it works and how we can make legislation clearer and more accessible, this talk is unmissable.

This event took place as part of Big Ideas, a series of monthly talks on big ideas coming out of The National Archives’ research programme.


So for those of you who don’t know me, my name is John Sheridan. I’m the Head of Legislation Services here. And welcome to this talk about Big Data for Law in the series of Big Ideas. Anyway, I’m going to talk to you a little bit about legislation and big data and those sorts of things.

So there are some times, I do these… I do quite a lot of talks. And in a way, it’s quite indulgent. If I draw a Venn diagram of my areas of interest, I usually end up with a kind of exclusive intersection of people who are interested in legislation, information technology – and particularly data – and the music of JS Bach. And I put me right in the middle! And it’s all I can manage to not slip in bits and pieces of other things that I’m interested in. Not to so much entertain you, but as to keep me interested in what I’m saying – which is, I suspect that’s one of the things you don’t see when you do the ‘how to present’ training course.

Anyway, depending on who you are in the organisation, you may have a mental model for the statute book that looks like this [shows image]. So The National Archives has long been a home for a thing called the Chancery Roll. And one of the really terrific things to do is when we have visitors here – if you’re in the Legislation Services team – and you can give people the opportunity to see legislation as it was made. So you can see the Act of Settlement. We can see the Act of Union from 1706. And you realise that there is quite a deep relationship between the form of the content and its function. And this is one of the really important changes that we’re going through at the moment with what’s happening to primary sources of law.

Anyhow, The National Archives – alongside being the home for the Chancery Roll and one of the two vellum copies of each Act of Parliament that is produced – is also the home to the Legislation Services team. And that gives us a very particular insight.

Now it’s quite fun from the perspective of looking back at our nation’s history – and partly from the perspective of looking back at our history in terms of literature – to see how things that have been topical at one point in time or other in the country have also been topical for Parliament to legislate on too.

So you can go back to the 19th century, and Charles Dickens is busily writing Oliver Twist. Parliament, meanwhile, is passing law relating to the poor. You can track forwards to Virginia Woolf – so we’re now into the [19]20s, actually 1918. Parliament is looking at extending suffrage and widening the franchise.

There was then this wonderful period of time in the [19]50s where people got very excited in mediums and talking to the dead beyond the grave – and Noël Coward beautifully satirises some of that in Blithe Spirit. And lo and behold, you have the Fraudulent Mediums Act of 1951 – this being [a] sufficiently important matter of public discourse for [the] Parliament of the day to legislate:

‘Certain provisions for the punishment of persons who fraudulently purport to act as spiritualistic mediums or to exercise powers of telepathy, clairvoyance or other similar powers’.

One wonders how on earth you can claim to be a spiritualistic medium and for it not to be fraudulent. But anyway, that’s by the by.

And right the way through to the 1980s – Martin Amis, Money. And oh, 1990s, Dangerous Dogs Act. So there’s always been this kind of connection between sets of things that have been happening in the country and sets of things that have been legislated on in Parliament. And our statute book in some deep and important sense reflects an enormous part of our nation’s history.

There’s also this relationship between the form of legislation and its function. Now we have – amongst many responsibilities – the responsibility for officially publishing new legislation. And it, kind of, looks like books, numbered paragraphs, subsection one and sections 475 to 476 apply – blah, blah, blah. And it’s an important observation, the extent to which this manner of legislating is in no small part a function of the introduction of printed technology. And it just wouldn’t have been possible to construct such an interwoven set of concepts and ideas without the introduction more recently of word processing technology.

Look at an act of 50 years ago and look at an act of five years ago and see the difference in terms of the numbers of cross-references. And think to yourself, however could the legislation have been drafted the way it is today with so many cross-references were it not for the fact that we have computers who can take care of matching all of the references? So there’s a real relationship between tools, function and form.

Now we have the great pleasure of running a service called []. And I won’t say too much about it, other than it’s something [that] occupies the minds and work and life of many of the people in this room. But in doing that, we tend to think of legislation not as simply numbered paragraphs, but we tend to necessarily think of it in terms of data. So it’s common in my team to talk about not so much numbered paragraphs, but the data. It’s all about the data. And we make the data that we have for UK legislation open and available.

So, legislation: part of our nation’s history; a printed artefact. Something that people access on the web; data. What’s its true nature? Well, some of it is about a change that we have seen over the last 30 years in terms of the role of sources of law and how consumers of legal services have interacted with them. And there was a world 30 years ago where access to sources of law was principally constrained to those people who could afford to buy the books. And we played our part in that, helping to produce some of the books. But if you had a legal problem, you would go to a lawyer. And the lawyers would consult the sources of law.

We’ve moved to a much more rich and interesting place today. For sure, consumers – people like us, those of us who are not lawyers – still go and talk to lawyers when we have legal problems and lawyers consult sources of law. These days they more typically consult online electronic legal databases rather than consulting books. But there are some other big trends at work.

One is that consumers have an interest and an appetite to directly access sources of law for themselves. And these are the people that we see who work with and use And those people are mainly not lawyers.

They’re looking to comply with the law – and in some sense they’re trying to resolve or solve their legal problem – and they expect to be able to go to Google and type the name of the Equality Act 2010 and to be able to find it. And they expect – if for example they’re in some kind of dispute with their employer about whether or not sufficient modification has been made because they have a disability for them at work to be able to go about their work – to be able to read the relevant provisions in the Equality Act and to make some sense of it.

This audience is also enormous. We see something like two million unique visitors a month to We will serve something like half a billion page views a year of legislation.

Now, but that isn’t the only thing that consumers are doing. They’re also talking to each other about how they work with or how they understand law. And one of the things – if you happen to use Twitter – if you do a search on Twitter for, you can see the conversations people are having about different parts of the statute book. And you can see the conversations that people are having about sources of law.

And finally, there’s the rise of new types of digital legal service – from services about making wills to services that [are] for buying or selling your house. Government too is busily involved in making new forms of digital legal service. From a certain point of view, you can view GOV.UK as being a kind of digital legal service for transacting with Government that sits on top of the statute book as a source of law.

Now, a shift from the old world to the new world starts begging some really profound questions about: what is it that we need our sources of law to do. How do they need to be constructed? Not just to be useful to lawyers who are providing a mediating layer, but explicable to consumers who are directly working with those sources of law and that can facilitate the rise of a variety of different types of digital legal service.

So what are we then thinking about? We’re thinking about sources of law being not just sets of numbered paragraphs in published, printed documents – or even websites – but sources of law being data. What does it mean to plug into the statute book if you’re providing a digital legal service? And maybe it means thinking about sources of law becoming increasingly, themselves, computer code.

And we see some of that too in the last 18 months or so, with departments like the Department for Work and Pensions passing the Universal Credit Regulations using a technology called Oracle Policy Automation – which is a kind of constrained form of natural language where the rules have been framed to be directly turned into machine-processable rules.

So when Government’s got a stake in the game, it says, ‘Oh yeah, wouldn’t it be great if we turn a bit of our source of law into computer code to underpin our digital legal service so that we can more efficiently process the three million immigration applications we see a year? We can more effectively process claims for benefits and so on and so on.’

So this is a little bit of [information] just to orientate you in the landscape – the web fundamentally changing who accesses sources of law, our role in that at The National Archives in terms of – but also having a sense for a much wider game that’s afoot.

That’s like, I don’t know. If it were an opera, act one. Act two, big data – which is, kind of, like buzzword bingo. And people talk a lot about big data… I don’t know how many of you are kind of like IT type people? Ah yes, hello brothers. I am too. I’m like an IT person whose career has gone horribly wrong [laughter].

So, big data… People talk about it being in terms of volume and velocity, so it’s about very large amounts of information. And basically it’s IT people sort of gradually catching up with where statisticians have been for quite a long time, which is: you can develop an understanding about a thing by measuring it, by counting it and by applying analytical methods to it.

The difference between big data and what we have been doing for 150 years in statistics is many of the statistical methods were designed for an era where you couldn’t process all of the available information about something. So you had to sample. And then how can you deduce from your sample whether the conclusions that you are arriving at stand up?

With big data the idea is that you can now process all of the information that you have. And therefore you have different types of method. And it’s become an area that the research councils are very interested in. It’s become an area that big business is pretty interested in – in fact, anyone who has quite a lot of data. But it’s also a terrible, terrible Buzzword Bingo name for essentially processing data!

What kinds of things can people do with big data? This is one of my favourite examples. This is possibly a little creepy actually. It’s a scientist at MIT who created a map of his house. And he has a toddler. And the map is [the] frequency where the child is talking about ‘water’ in terms of both context and location in the house. But essentially it’s this observation… now, it wouldn’t have been possible before to have gathered all of the information – every utterance that this four year old makes is processed by the computer and then mapped against their location to get some sense of the context, frequency and location of the use of the word ‘water’.

Now, I mean it’s kind of interesting to see where a four year old uses the word ‘water’. But what’s much more interesting is that it’s possible to do such a thing. And imagine the potential if you took these types of capability and you applied it to a dataset that really meant something. You applied it to something like the statute book that exists as, if you like, the operating system for our country.

[Shows image]. Now some of you will have seen this diagram before. We know for absolutely sure that it makes sense to think of the statute book as a data resource. This is a diagram that shows what in my team are called ‘legislative effects’ – or amendments if you like. So it’s how one piece of legislation changes another. And it’s a diagram that shows you how much of the statute book you would need to know about in order to know the current, in-force state of one particular act – which, buried in the middle of that, is the Companies (Audit, Investigations and Community Enterprise) Act.
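[Illustration: the idea of "how much of the statute book you would need to know about" can be sketched as a graph traversal over legislative effects. The following is a minimal sketch, not the project's actual tooling; the data, the names `AMENDED_BY` and `legislation_needed`, and the act titles are all hypothetical.]

```python
from collections import deque

# Hypothetical toy data: each key is a piece of legislation and each value
# lists the pieces of legislation that amend it. All titles are invented.
AMENDED_BY = {
    "Act A": ["Act B", "Order C"],
    "Act B": ["Act D"],
    "Order C": [],
    "Act D": ["Act B"],  # cycles can occur: two acts may amend each other
    "Act E": ["Act A"],
}


def legislation_needed(target, amended_by):
    """Return every item reachable from `target` by following amendments.

    This is a breadth-first search; the `seen` set guards against cycles.
    """
    seen = {target}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for amender in amended_by.get(current, []):
            if amender not in seen:
                seen.add(amender)
                queue.append(amender)
    return seen


needed = legislation_needed("Act A", AMENDED_BY)
print(sorted(needed))
```

[In this toy example, understanding "Act A" pulls in three further items, while "Act E", which Act A itself amends, is not needed.]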

So we know that it makes sense to think about and manage facets or aspects of the statute book as data. And we know that it can give some insights. We can also see that the statute book is a pretty complicated and deeply interwoven place – potentially beyond, in some sense, our ability to be able to curate it and manage it as a large and complex system.

So big data potentially has an important role. In a world where you can measure or understand the use of the word ‘water’ by a four year old in a house, what about legally significant words in terms of the statute book?

Which brings me onto the project that I’m very fortunate to be leading – and we’re very fortunate to be able to run in my team – which is a research project entitled Big Data for Law. Now, not coming from a legal research background, I’ve been very struck by how legal research has been done. [Shows image]. This is one of the partner organisations for our research project. This is the Institute of Advanced Legal Studies in London. And if you’ll remember that slide of the old world of sources of law being in books – that’s where these guys are.

So legal researchers, it turns out, tend not to think in computational terms. They tend to want to answer legal research questions in the same way that lawyers answer many questions, which is in particular. But thinking about what the statute book as a system as a whole looks like and how you manage it or curate it – where there are patterns, what’s its shape – that’s next to impossible to do if your resources look like this [shows image]- books in the library. Just next to impossible to do.

So the ambition we have is to try and create some new instruments for legal research. And it’s always been the case in the history of science that when you make new instruments, you get new insights. You see new things.

[Shows image] So this is also a little indulgence for me because I saw these two pictures, and they’re…this picture and the next one are both from NASA and I love them. And it’s like a kind of little frustrated, ‘Ah, why can’t I get to work on something like that?’ So that’s the Hubble Telescope. And it’s one of the most amazing new instruments that humanity has ever created. And it allows us to see things like this [shows image]. This is a cloud nebula photographed in infra-red. And there…those bright spots are stars being born; a new instrument that gives us an insight that we never had before into a system that’s large and complex and adaptive and evolving.

So what are the instruments that will allow us to understand and better work with and process and manage our system of statute law? What’s the equivalent of our Hubble Telescope? And what’s our equivalent of…? Well, there probably isn’t anything in the statute book that’s that beautiful! Looking at people who work with legislation every day, and I suspect …I mean, it’s a longshot, right? [Laughter]. We probably would’ve spotted it. Anyhow, what’s our equivalent of being able to stand back far enough and see something deep and profound?

So that’s the ambition for the research project that we’re leading. We’ve been funded in significant part by the Arts and Humanities Research Council to construct a new instrument for legal research.

And The National Archives is very fortunate to be an independent research organisation, as well as being a government department. It leaves us really ideally placed to be able to bring together people and organisations that wouldn’t normally work together.

And for the research project we have both partners from the private sector, so companies like LexisNexis; from the third sector, so people like the Incorporated Council of Law Reporting – who produce the law reports for all of the higher courts and crucially have the dataset where words and phrases in legislation have been interpreted by the courts; and also to work with our colleagues elsewhere in government, in particular with the Office of the Parliamentary Counsel – so the team of elite government lawyers who draft all government bills.

So we’re really uniquely placed – partly in government and partly as a research organisation – to bring together this kind of collaboration and to put in place some of the new instrumentation that can advance legal research. And that in a sense is our big idea. What happens when you start thinking about the statute book as a whole? What happens when you start thinking about the statute book as data? What happens when you start introducing new capability that makes it easy for people to download the statute book and to run their own experiments? What happens when you provide the right theoretical framework for thinking about the system of statute law that we have as an adapting and evolving system?

So we’re creating this thing at the moment at Of course, we’re very proud of trying to design services that meet people’s needs. So we’re going away and understanding the needs of researchers in [the] arts and humanities. And that’s proven very interesting.

I was saying earlier about the dearth of empirical evidence that’s used in legal research. Very much, the end users of the capability of the tool we’re putting in place look like they’re going to fall into two types of people: some people who will be very competent and capable to be able to access and download and process very large amounts of data, and other people who are going to need something that’s very packaged and very polished in terms of what kind of analysis has been done and why, simply because they lack not only the IT capability but also the statistical knowledge to be able to do anything more sophisticated with data.

So anyway, we’re going to do a lot about understanding the needs of researchers in [the] arts [and] humanities around this new kind of capability. We’re going to find ways of taking some of the closed data that we have – not least, I was talking earlier about the volume of use and the numbers of users we have of So that creates a huge amount of usage data – information about people who visit statute X also visit statute Y, or people who visit section X also visit regulation Y.

Is there some kind of new topology there? Not based on the connection of different pieces of legislation based on their subject or based on the power or based on ‘these terms were defined here and have been used over there’, but a topology of the statute book based on its usage and how might it help us to see and understand that. And by making some of our usage data…finding ways in which we can make that available to the research community can really help. And finally, coming up with a different way of thinking about the architecture of the statute book that maybe will give us a better chance of starting to map what’s going on.
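[Illustration: one simple way to turn "people who visit statute X also visit statute Y" into data is to count how often two items appear in the same visit. This is a minimal sketch under assumed inputs; the session format, the name `co_visit_counts` and the sample sessions are hypothetical, not a description of the real usage logs.]

```python
from collections import Counter
from itertools import combinations

# Hypothetical usage data: each inner list is the legislation one visitor
# looked at during a single session.
SESSIONS = [
    ["statute X", "statute Y"],
    ["statute X", "statute Y", "regulation Z"],
    ["statute X", "regulation Z"],
    ["statute Y"],
]


def co_visit_counts(sessions):
    """Count, for every unordered pair of items, the sessions containing both."""
    counts = Counter()
    for session in sessions:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return counts


for pair, count in co_visit_counts(SESSIONS).most_common():
    print(pair, count)
```

[The resulting pair counts can be read as weighted edges of a graph, which is one possible starting point for a usage-based topology.]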

So not just a new instrument – not just data, not just tools – but understanding who’s going to use those tools, finding ways of making new material available and coming up with some new ways of thinking about how the statute book as a system works.

A big observation here is – if you look at other spheres – the power of abstraction. When you create an abstract representation of something, you can start to understand it and work with it in a different way. And we have seen this in other realms, other spheres. So there was a time where we only understood topological space in terms of our direct personal experience of it. You went walking. And then people are saying ‘Ah, but I can create an abstract representation of that space and I can draw a map’. Now, the statute book is in the same place as the pre-map era, right? If you want to go around and explore it, you read it. You read it.

How does it fit together? Well, maybe you can go and visit Halsbury’s Statutes and it’ll be organised by subject. So you’ve got some road signs, but essentially you’re left with reading it. You can’t step back and see what the pattern and shape is of the thing as a whole. But even when you look at early maps, you realise that in a sense they’re kind of naïve. The language of the abstraction is not terrifically sophisticated. You know, people are like, ‘Well, we don’t have symbols, we’re just going to draw little pictures of how we see the buildings as being.’ And then you end up with – which is a huge contrast – maps that are far more abstract and far more useful.

Now, what’s the equivalent abstraction for concepts contained in laws? What does that look like? And in order to process and manage the statute book as data, what are the ideas that are going to provide the props that will allow you to compute over that system and produce representations that are meaningful [and] useful when people have been very familiar with living in a world like this?

So there are two things that we’re going to try and do – we’ve been thinking about this really carefully. And one is this idea of – it’s not that clever – what happens if we just start counting stuff in the statute book? Just counting how many words are there.

How many of those words are verbs? How many words typically are there in provisions and how’s that changed over time? How many internal references are there in a piece of legislation?

And what’s the distribution of internal and external references? And what does that tell you about the modularity of the statute book?

How often is legislation amended and what does that tell you about whether or not pieces of legislation are wearing out? Whether they’re good or bad?

What happens when we change our political leaders and we have different parties in parliament? Do pieces of legislation passed by Labour governments get disproportionately repealed by Conservative governments? I don’t think anybody knows.
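[Illustration: a minimal sketch of "just counting stuff" for a single provision, counting words and separating internal from external references. The sample text, the name `census` and the regular expressions are assumptions made for illustration; real legislative references are far more varied than this pattern handles.]

```python
import re

# Hypothetical provision text, invented for illustration.
PROVISION = (
    "Subject to section 3, a person must comply with section 4. "
    "This is without prejudice to section 12 of the Example Act 2001."
)

WORD = re.compile(r"[A-Za-z0-9']+")
# Matches "section N". The reference is treated as external when it is
# followed by "of the <Title> Act <year>"; otherwise it counts as internal.
REFERENCE = re.compile(
    r"section \d+(?P<external> of the [A-Z][\w ]*? Act \d{4})?"
)


def census(text):
    """Return simple counts for one piece of text."""
    internal = external = 0
    for match in REFERENCE.finditer(text):
        if match.group("external"):
            external += 1
        else:
            internal += 1
    return {
        "words": len(WORD.findall(text)),
        "internal_refs": internal,
        "external_refs": external,
    }


print(census(PROVISION))
```

[Run over every provision and grouped by year, counts of this kind would give the sort of time series the questions above are asking for.]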

No one has been able to count any of these things. So not only will we be making available some data for people to experiment with, we’re going to conduct a census, if you like, of the statute book – so identify indices, things that we can count for the first time. Some of those things will give us insights into how the system as a whole is evolving and adapting. Some will give us insights into the nature and form of language that’s being used. Some will tell us quite a lot about the relationship between form of legislation and how the law has been constructed as you go through those big changes of format shift – from scrolls of vellum through print through the introduction – essentially to support print workflows of word processing technology and what may be coming with the web that we see today. So we’re going to count lots of things.

Counting things always gives you great insights. There’s [a] particular dataset called Ngrams, which are words and their frequency of occurrence. And you can try Ngrams out for yourself if you go to Google Ngram Service. Then you can do a search for different words and phrases and they’ll give you frequency of occurrence of those words and phrases from their digitised collection of books. So we can certainly do a kind of calculation of Ngrams for the statute book.
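[Illustration: an n-gram count is a tally of every run of n consecutive words. The following is a minimal sketch using whitespace tokenisation; the name `ngram_counts` and the sample sentence are hypothetical, and a real analysis would need more careful tokenisation.]

```python
from collections import Counter


def ngram_counts(text, n):
    """Count each run of `n` consecutive lower-cased, whitespace-split tokens."""
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Invented sample text in the style of legislative drafting.
SAMPLE = (
    "the secretary of state may by order amend "
    "the order of the secretary of state"
)

for ngram, count in ngram_counts(SAMPLE, 2).most_common(3):
    print(" ".join(ngram), count)
```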

But we’re going to also need some new ways of thinking about some of these things. And this is the second bit if you like. So we have data tools and methods. We have users – some of whom are very good with working with data, some of whom are not very good with working with data. And for those who are not very good with data, we’re going to conduct, if you like, a census, with a bundle of indices that are pre-packaged so that they can start to use empirical evidence in their research. And it will give us some new insights.

But if we’re trying to tackle the system and understand how it works, can we come up with better ways of abstracting it so that we can see beyond acts and statutory instruments and enabling powers and provisions? That we can see how law is being designed? And maybe we can help the designers of legislation make better law.

I did a short talk last week at the Heads of Analysis. And someone came up to me – and I was talking a bit about this concept of a pattern language for legislation – and gave the most brilliant analogy that I wish I’d thought of, which is, I mean he said, ‘Have you ever seen that work by I think the Swedish academic who came up with the notion that there are essentially 13 stories in all novels?’ You know, the quest story or the whatever it might be. But there’s basically…you can distil all novels down to essentially 13 stories. So in my mind, each one of those stories you could imagine as being a pattern. And the set of 13 would be your pattern language. So what are the 13 essential laws from which all laws subsequently derive?…That’s the idea.

So, a little bit of…well, this is nonsense on stilts. Well, maybe not. The architects dreamt some of this up. A guy called Christopher Alexander came up with a pattern language for architecture in the late 1970s. And then my crowd – the software engineers – really borrowed those ideas for helping us understand how we manage our large, complex, adaptive systems – our large, complex, often legacy software systems – by having an appropriate level of abstraction that would allow us to understand where those systems were working well and where they were working badly and what looked like good design.

So concretely, a pattern language consists of design patterns. A design pattern is the description of a problem and a core solution to that problem, with both the problem and the solution reoccurring in many different places. So it’s like a generalised problem to which there is a generalised solution. And the value of it is that you can use the solution in a number of different contexts. When you’re looking at your system, you can think about what problems components of, parts of, your system are solving and whether or not the solutions that you have in those areas are good or bad or indifferent to the context of use.

What do they consist of concretely? Well, a name is a big thing. Alright? So when someone says ‘Ah, there’s a quest pattern in literature’, you immediately [respond]: ‘I think I know what that would be doing’. So naming patterns is really important in terms of keying into what they are and what they might be about – describing the problem that they solve, describing the solution, describing the consequences of using the pattern.

Do such things exist? Well, when you think about the sorts of problems that legislation solves, it does a big piece of figuring out who decides the whatever-it-might-be. So if I want to put a conservatory on the back of my house, the legislation won’t say, ‘Yes you can’ or ‘No you can’t’. But it will say your local authority gets to decide, and here’s the process you have to go through and here are the things they have to consider in making that decision. So what most legislation does is set up decision making mechanisms. So when we’re talking about patterns for legislation, we’re talking about patterns of decision making.

And there are a number of things you can ask yourself. So, how much flexibility is a decision maker given? Are they left to decide whether or not to let your planning application go through on the nod? Or do they have to go through a particular process?

So, let’s do a little recap. [We] can think about legislation as data. We can make available a new capability. And we need to think about, in order to process and manage the statute book, we need some new concepts about how the system as a whole is working. One of those is the idea of a pattern language. The patterns are patterns of decision making. That’s what most legislation is doing – it’s setting up decision making systems.

So we’re looking for patterns of decision making. What about an example?

So an example would be in regulation. The problem is you have, it’s called polycentric decision making, where – so thinking about water, gas, electricity – you have the needs of an industry on one side, you have the need of society for people in that industry to invest and you have the needs of consumers, basically, not to be ripped off. How do you frame a solution in law that protects the interests of consumers on one side, whilst allowing investors to have a reasonable return on investment on the other? And remembering that the courts are not particularly good at these sorts of subtle balancing problems. They’re much better at you win/you lose, you’re innocent/you’re guilty.

So the pattern for solving this type of problem in legislation – what you could call a regulator pattern – is that you establish a regulator. And the regulator decides – they issue a licence – that if you want to supply electricity, or telephony, or gas, you need to get a licence. If you try and do that activity and you don’t have a licence then you get in trouble. In the licence are a bundle of conditions that the regulator can determine, and they have the power to determine what the conditions are and evolve that over time. So you make them quite powerful. The licence can be modified on an on-going basis as the world change[s] by consent of the parties. And you have only a limited right of appeal back to the courts. But essentially everything is hanging off the regulator who can issue a licence to allow you to do a particular type of activity.

Now that pattern of things hanging off a licence is quite common in legislation. [It’s] quite often that you set up a system where you have a decision maker who can issue a licence and in order to do an activity you need to get a licence from the decision maker. It’s maybe one of our 13 essential laws. Or 15 essential laws.
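[Illustration: the four parts of a pattern described in the talk (a name, the problem, the solution and the consequences) map naturally onto a small record type. This is a minimal sketch; the class name `LegislativePattern` and its field names are hypothetical, and the wording of the entry paraphrases the regulator example from the talk rather than any agreed catalogue.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegislativePattern:
    """One entry in a pattern language: name, problem, solution, consequences."""
    name: str
    problem: str
    solution: str
    consequences: tuple = ()


# The regulator pattern, paraphrased from the talk.
REGULATOR = LegislativePattern(
    name="Regulator",
    problem=(
        "Polycentric decision making: balancing the needs of an industry, "
        "its investors and its consumers."
    ),
    solution=(
        "Establish a regulator that issues licences, with conditions the "
        "regulator can set and modify over time."
    ),
    consequences=(
        "Carrying on the activity without a licence gets you in trouble.",
        "Licence conditions can evolve as the world changes.",
        "There is only a limited right of appeal to the courts.",
    ),
)

print(REGULATOR.name)
for consequence in REGULATOR.consequences:
    print(" -", consequence)
```

[A collection of such records, one per named pattern, would be one way to hold the "13 essential laws" if they can be identified.]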

Finding out will give us a way of thinking about the statute book as a system as a whole, rather than as sets of isolated individual acts and statutory instruments. It will give us an insight that we’ve not had before; combined with the facilities that we have with data with our ability to be able to count and measure and process; combined with our capacity to enable, not just to do this activity for ourselves, but to enable other people to do this type of activity.

We have within our grasp, not just in the work that we’re doing in The National Archives, but for the nation if you like, the chance to begin to manage this large, complex, adaptive system that is our law in a way that is a little bit more contemporary. You go back to that earlier piece where [there’s] the relationship between the function of law and its form. When you start to be able to process and manage a system of law as a whole and when you start having the abstract ideas – if you like the equivalent of the maps that allow you to see what the patterns are that are happening – then this can both benefit users [and] consumers. Because the minute you know that a piece of legislation is following this regulator pattern, suddenly you’ve got real clues into what’s going on. You’re kind of like ‘oh, I understand what’s happening.’ So it’s useful for users, but it’s also very useful for policy makers.

So yeah, and a big part of it is about trying to find ways of abstracting what’s difficult in order to make it explicable. And by stepping a level up from words and phrases into thinking about the design of our system of law and using the computing power that we have to be able to deduce that: find those patterns; know where they exist; and understand how they’re evolving and adapting over time. So it’s quite insane. It’s quite insane.

Anyway, the Arts and Humanities Research Council have given us a reasonable slug of money to go away and make said thing. We are having a crack at both the infrastructure building and doing some of the thinking that will allow us to really exploit and use the potential of the kind of big data technology, the kinds of information that we have, the kind of partnership we can build, to help not just to deliver a better end user experience, but maybe give everyone a better understanding and insight into how our system of law itself is evolving and adapting.

Transcribed by Emily Duis as part of a volunteer project, April 2015