Get Covid19 Data from the Public Health England API

Monitoring the spread of corona virus will be an additional challenge for educators this autumn on top of the usual challenges of making a successful start to the academic year. This post explains how to get Covid19 data from Public Health England (PHE) into BI via Power Query. In subsequent posts I’ll talk about what you can do with it in BI. If you know of other sources of Covid19 data (with a publicly accessible API) please let me know.

The Data Source

PHE publish data for six area types, overview (whole UK), nation (England, Scotland, Wales and Northern Ireland), Region, NHSRegion, Upper Tier Local Authority and Lower Tier Local Authority. I haven’t quite worked out the difference between Upper and Lower LA but Lower Tier seems to correspond with London Borough/Local Education Authority and is the most granular so that’s the level I’ll be focusing on.

(There is a map here https://www.arcgis.com/apps/webappviewer/index.html?id=47574f7a6e454dc6a42c5f6912ed7076with smaller areas (Middle Super Output Areas) but I can’t find a way to access its underlying data )

The API endpoint we’ll be using is

https://api.coronavirus.data.gov.uk/v1/data

A developer’s guide (which explains what’s there and how to connect) is here.

https://coronavirus.data.gov.uk/developers-guide

If you know Python you can do cooler stuff than I’ll cover and there are some pointers about it and Software Development Kits available. I’m sticking with bog standard http requests, where essentially, you write your query in the URL to get the data you want.

Planning your queries

The whole dataset for local authorities is around 60k rows right now, but the maximum records per request is 1k so you need a way of breaking the data up into several queries. There are basically three options for doing this:

1.    Split the records up by area – ie have a query for each area you’re interested in.

2.    Split the records up by date – ie have one for 2 days ago, one for 3 days ago etc.

3.    Paginate – write code in M to loop through successive requests pulling 1000 records at a time

Of these I recommend number 1 because although you can write code to dynamically change the url for option 2 (I’ll explain how below) and although it will refresh fine from Power BI desktop, when you upload your file to the BI web service it won’t let you schedule refresh. Same problem affects option 3.

Note that in the url filtering dates are strings – so you can query for a specific date but not a date range.

Get a list of areas

To start off with lets do a simple query to pull just the names of the lower tier local authority areas so we know how to filter to them. To make our API url we need to add two things to the API endpoint, filters, where we specify what records we want, and structure, where we specify what fields from those records.

Our first request is this:

https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=ltla;date=2020-08-13&structure={"areaName":"areaName","areaCode":"areaCode"}

Note that filters are separated by semi colons, dates are in the format YYYY-MM-DD and that the request is case sensitive. Postman is a handy piece of free software you can use for checking your syntax. https://www.postman.com/downloads/

In BI desktop, click ‘Transform Data’ to open power query. Click New Source then Blank query.



We’ll be using the Json.Document function so paste this into the formula bar

= Json.Document(Web.Contents("https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=ltla;date=2020-08-13&structure={"areaName":"areaName","areaCode":"areaCode"}"))

 

It will fail. The reason is that it thinks the quote marks of ={"areaName are ending the webaddress. To fix this problem you need to make all the quote marks inside the url into double quote marks "" 


Click list to access the data, then 'To Table' on the List Tools Ribbon.

 

That takes you to a list of the records, click the arrows in the column header to expand it and that gives you your table with one row per LTLA.


Actual Covid19 Data

Once you’ve looked up what areas you want data for you can write an expanded query with actual Covid19 data in it. The list of available fields is here (note that not all of them are available at every areaType.)

 



The quickest way to make multiple queries for different areas after making a new blank query is copy, paste and amend the code in the advanced editor. Here is the power query code for a single area query for New Cases, Cumulative Cases and the Cumulative Case rate per 100K people.

let

    Source = Json.Document(Web.Contents("https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=ltla;areaName=Aylesbury Vale&structure={""areaName"":""areaName"",""areaCode"":""areaCode"",""date"":""date"",""newCasesBySpecimenDate"":""newCasesBySpecimenDate"",""cumCasesBySpecimenDate"":""cumCasesBySpecimenDate"",""cumCasesBySpecimenDateRate"":""cumCasesBySpecimenDateRate""}")),

    data = Source[data],

    #"Converted to Table" = Table.FromList(data, Splitter.SplitByNothing(), null, null, ExtraValues.Error),

    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"areaName", "areaCode", "date", "newCasesBySpecimenDate", "cumCasesBySpecimenDate", "cumCasesBySpecimenDateRate"}, {"areaName", "areaCode", "date", "newCasesBySpecimenDate", "cumCasesBySpecimenDate", "cumCasesBySpecimenDateRate"})

in

    #"Expanded Column1"


So if you create a blank query, open the advanced editor and paste that text in you should get the data. Then change "Aylesbury Vale" in the url to whatever area you want.

Query by relative date

This is the M code you need in your query to put relative dates in the url. (hat tip to Sam Jukes (@eljukes) who taught me this. Remember that it will mess with your scheduled refreshes tho.

 

let

    DayShift = -1,

    RefDate = Date.AddDays(DateTime.LocalNow(),DayShift),

    YearN = Date.Year(RefDate),

    MonthN = Date.Month(RefDate),

    DayN = Date.Day(RefDate),

    MonSep = if MonthN < 10 then "-0" else "-",

    DaySep = if DayN < 10 then "-0" else "-",

    DateString = Text.From(YearN) & MonSep & Text.From(MonthN) & DaySep & Text.From(DayN),

    Source =  Json.Document(Web.Contents("https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=ltla;date="&DateString&"&structure={""areaName"":""areaName"",""areaCode"":""areaCode"",""newCasesBySpecimenDate"":""newCasesBySpecimenDate"",""newTestsByPublishDate"":""newTestsByPublishDate""}"))

in

Source

 

Where DayShift is the number of days back you want to go.

Combine the Queries into a single table

To improve the experience in BI the last thing we want to do is combine our separate LA queries into a single table which we'll bring through to the BI model. To do that create a new blank query and type in the formula bar: 

= Table.Combine({Aylesbury,Basildon,..etc})

Where Aylesbury, Basildon are the names of your separate queries.

Finally right click on those individual queries and untick 'Enable Load'. Their names will turn italic and this will mean that they won't come through to BI as individual tables, you'll just have the combined one with all the data.




Note on Data publishing

In my experience records are usually present for the previous day for ltla but with 0 cases reported and these records are updated the following day with case numbers. Therefore I recommend treating two days ago as the most recent date for data.

 

 

 

 

 

Comments

Popular posts from this blog

Attainment 8 in BI - Part 1

Subject - Class - Pupil Table

Data modelling for a large MAT