How to make British politicians slightly more accountable with two lines of code

The Register of MPs’ Interests is published on the Parliament website roughly every two weeks.

Helpfully, anything added since the last update is highlighted in yellow. But the register is spread across 650 pages – one for each MP – so to see what’s new you need to click through all of those pages.

Wouldn’t it be great if we could see all the latest additions to the Register in one place?

Here’s how I generated a ‘what’s new?’ page for the Register of MPs’ Interests in two lines of code at the command line.

Prerequisites: wget, python, lxml, cssselect, scrape

You can install everything you need to run this code on Ubuntu with:

sudo apt-get install wget python python-lxml python-pip
sudo pip install cssselect
wget https://github.com/jeroenjanssens/data-science-at-the-command-line/raw/master/tools/scrape
chmod +x scrape

If you don’t have Ubuntu, check out the Data Science Toolbox.
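
Either way, it can be worth confirming the command-line tools are on your PATH before starting. The snippet below is a minimal sketch rather than part of the original two lines; `check_tools` is a hypothetical helper name, and it only checks that each command can be found, not which version is installed.

```shell
# Hypothetical helper: report any command-line tools that cannot be found.
# Returns 0 if every tool is available, 1 otherwise.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (commented out so that nothing runs on load):
# check_tools wget python
```

The two Python modules can be checked separately with `python -c "import lxml, cssselect"`, which exits with an error if either one is not installed.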

Now for the code:

1. Download the current edition of the Register of MPs’ Interests.

wget -r -np -w 1 -t 3 -l 1 http://www.publications.parliament.uk/pa/cm/cmregmem/141110/part1contents.htm

(The URL points to the latest edition at the time of writing. To get the URL for a more recent edition, head to the Parliament website.)
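
If you expect to rerun this for later editions, the edition folder can be pulled out into a parameter. This is only a sketch: `register_url` is a hypothetical helper, and it assumes later editions keep the same URL pattern, with only the folder name (141110 here) changing. That is worth confirming on the Parliament website before relying on it. The comments also spell out what each wget flag does.

```shell
# Hypothetical helper: build the contents-page URL for a given edition.
# Assumption: only the edition folder (141110 in this post) changes
# between editions; the rest of the URL stays the same.
register_url() {
  local edition="$1"
  echo "http://www.publications.parliament.uk/pa/cm/cmregmem/${edition}/part1contents.htm"
}

# wget flags used in step 1:
#   -r    download recursively
#   -np   never ascend to the parent directory
#   -w 1  wait one second between requests
#   -t 3  try each download up to three times
#   -l 1  follow links one level deep only
# The command is printed rather than run, so it can be checked first.
echo "wget -r -np -w 1 -t 3 -l 1 $(register_url 141110)"
```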

2. Search the files you’ve downloaded for new entries and output them to a file.

for file in www.publications.parliament.uk/pa/cm/cmregmem/141110/*; do
  # Only output pages that contain highlighted (new) entries.
  if [[ -n "$(./scrape -e ".highlight" < "$file")" ]]; then
    echo '<div class="mp-entry">'
    # The MP's name (h2) followed by the new entries.
    ./scrape -e "h2,.highlight" < "$file"
    echo '<a href="http://'"$file"'" target="_blank">See these additions in context at the Register of MPs&rsquo; Interests &gt;</a></div>'
  fi
done > output.htm

You can see the result of these commands here.

The second line of code uses Jeroen Janssens’ handy scrape Python script, which is powered by lxml and cssselect, to extract the highlights from the Register of Interests. Here we’re using the .highlight CSS selector to pick out the new entries, and the h2 selector to pick out the MP’s name.

The code is quite inefficient: it calls scrape once to check whether there are any highlights and, if so, calls it a second time to output the MP’s name and the new entries. The job would run faster if we wrote all the logic in Python. But the code only needs to run every couple of weeks and takes a minute or so to complete on my laptop, so this shell one-liner suffices as a quick solution.
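
If the double call bothers you, one way to stay in the shell is to call scrape once per page, keep the output in a variable, and check that for highlights. This is a sketch under assumptions, not what I ran: `mp_entry` and the `SCRAPE` override are hypothetical names, and the check assumes scrape's output keeps each element's class attribute, so that the word "highlight" only appears when something new was matched.

```shell
# SCRAPE is a hypothetical override, handy for trying the function
# without the real script; by default it points at ./scrape.
SCRAPE="${SCRAPE:-./scrape}"

# Print one MP's entry, or nothing if the page has no highlighted additions.
mp_entry() {
  local file="$1" found
  # A single scrape call: the heading plus any highlighted elements.
  found="$("$SCRAPE" -e "h2,.highlight" < "$file")"
  # Assumption: the output contains "highlight" only when a highlighted
  # element was matched (a heading containing that word would fool it).
  case "$found" in
    *highlight*) ;;
    *) return 0 ;;
  esac
  printf '<div class="mp-entry">\n%s\n' "$found"
  printf '<a href="http://%s" target="_blank">See these additions in context at the Register of MPs&rsquo; Interests &gt;</a></div>\n' "$file"
}

# Example loop (commented out so that nothing runs on load):
# for file in www.publications.parliament.uk/pa/cm/cmregmem/141110/*; do
#   mp_entry "$file"
# done > output.htm
```

This halves the number of scrape calls, but it swaps an exact check (did the .highlight selector match anything?) for a text match on the output, so it is worth comparing its output with the original loop before trusting it.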

TheyWorkForYou publish a more detailed analysis of changes to the Register of Interests. This shows text removed and added since the last update. There’s more text to trawl through, but it’s possible their method might catch changes which haven’t been highlighted.
