Get Covid19 Data from the Public Health England API
Monitoring the spread of corona virus will be an additional challenge for educators this autumn on top of the usual challenges of making a successful start to the academic year. This post explains how to get Covid19 data from Public Health England (PHE) into BI via Power Query. In subsequent posts I’ll talk about what you can do with it in BI. If you know of other sources of Covid19 data (with a publicly accessible API) please let me know.
The Data Source
PHE publish data for six area types, overview (whole UK), nation (England,
Scotland, Wales and Northern Ireland), Region, NHSRegion, Upper Tier Local Authority and
Lower Tier Local Authority. I haven’t quite worked out the difference between Upper
and Lower LA but Lower Tier seems to correspond with London Borough/Local Education
Authority and is the most granular so that’s the level I’ll be focusing on.
(There is a map here https://www.arcgis.com/apps/webappviewer/index.html?id=47574f7a6e454dc6a42c5f6912ed7076with
smaller areas (Middle Super Output Areas) but I can’t find a way to access its
underlying data )
The API endpoint we’ll be using is
https://api.coronavirus.data.gov.uk/v1/data
A developer’s guide (which explains what’s there and how to
connect) is here.
https://coronavirus.data.gov.uk/developers-guide
If you know Python you can do cooler stuff than I’ll cover
and there are some pointers about it and Software Development Kits available. I’m
sticking with bog standard http requests, where essentially, you write your
query in the URL to get the data you want.
Planning your queries
The whole dataset for local authorities is around 60k rows right now, but the maximum
records per request is 1k so you need a way of breaking the data up into
several queries. There are basically three options for doing this:
1. Split
the records up by area – ie have a query for each area you’re interested in.
2. Split
the records up by date – ie have one for 2 days ago, one for 3 days ago etc.
3. Paginate
– write code in M to loop through successive requests pulling 1000 records at a
time
Of these I recommend number 1 because although you can
write code to dynamically change the url for option 2 (I’ll explain how below) and
although it will refresh fine from Power BI desktop, when you upload your file
to the BI web service it won’t let you schedule refresh. Same problem affects
option 3.
Note that in the url filtering dates are
strings – so you can query for a specific date but not a date range.
Get a list of areas
To start off with lets do a simple query to pull just the names of the lower tier local authority areas so we know how to filter to them. To make our API url we need to add two things to the API endpoint, filters, where we specify what records we want, and structure, where we specify what fields from those records.
Our first request is this:
Note that filters are separated by semi colons, dates are
in the format YYYY-MM-DD and that the request is case sensitive. Postman is a handy
piece of free software you can use for checking your syntax. https://www.postman.com/downloads/
In BI desktop, click ‘Transform Data’ to open power query. Click
New Source then Blank query.
We’ll be using the Json.Document function so paste this
into the formula bar
=
Json.Document(Web.Contents("https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=ltla;date=2020-08-13&structure={"areaName":"areaName","areaCode":"areaCode"}"))
It will fail. The reason is that it thinks the quote marks
of ={"areaName are ending the webaddress. To fix this problem you need to
make all the quote marks inside the url into double quote marks ""
Click list to access the data, then 'To Table' on the List Tools Ribbon.
That takes you to a list of the records, click the arrows
in the column header to expand it and that gives you your table with one row
per LTLA.
Actual Covid19 Data
Once you’ve looked up what areas you want data for you can
write an expanded query with actual Covid19 data in it. The list of available
fields is here (note that not all of them are available at every areaType.)
The quickest way to make multiple queries for different
areas after making a new blank query is copy, paste and amend the code in the
advanced editor. Here is the power query code for a single area query for New
Cases, Cumulative Cases and the Cumulative Case rate per 100K people.
let
Source =
Json.Document(Web.Contents("https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=ltla;areaName=Aylesbury
Vale&structure={""areaName"":""areaName"",""areaCode"":""areaCode"",""date"":""date"",""newCasesBySpecimenDate"":""newCasesBySpecimenDate"",""cumCasesBySpecimenDate"":""cumCasesBySpecimenDate"",""cumCasesBySpecimenDateRate"":""cumCasesBySpecimenDateRate""}")),
data =
Source[data],
#"Converted
to Table" = Table.FromList(data, Splitter.SplitByNothing(), null, null,
ExtraValues.Error),
#"Expanded
Column1" = Table.ExpandRecordColumn(#"Converted to Table",
"Column1", {"areaName", "areaCode",
"date", "newCasesBySpecimenDate",
"cumCasesBySpecimenDate", "cumCasesBySpecimenDateRate"},
{"areaName", "areaCode", "date",
"newCasesBySpecimenDate", "cumCasesBySpecimenDate",
"cumCasesBySpecimenDateRate"})
in
#"Expanded Column1"
So if you create a blank query, open the advanced editor and paste that text in you should get the data. Then change "Aylesbury Vale" in the url to whatever area you want.
Query by relative date
This is the M code you need in your query to put relative
dates in the url. (hat tip to Sam Jukes (@eljukes) who taught me this. Remember
that it will mess with your scheduled refreshes tho.
let
DayShift = -1,
RefDate =
Date.AddDays(DateTime.LocalNow(),DayShift),
YearN =
Date.Year(RefDate),
MonthN =
Date.Month(RefDate),
DayN =
Date.Day(RefDate),
MonSep = if
MonthN < 10 then "-0" else "-",
DaySep = if DayN
< 10 then "-0" else "-",
DateString =
Text.From(YearN) & MonSep & Text.From(MonthN) & DaySep &
Text.From(DayN),
Source =
Json.Document(Web.Contents("https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=ltla;date="&DateString&"&structure={""areaName"":""areaName"",""areaCode"":""areaCode"",""newCasesBySpecimenDate"":""newCasesBySpecimenDate"",""newTestsByPublishDate"":""newTestsByPublishDate""}"))
in
Source
Where DayShift is the number of days back you want to go.
Note on Data publishing
In my experience records are usually present for the
previous day for ltla but with 0 cases reported and these records are updated
the following day with case numbers. Therefore I recommend treating two days
ago as the most recent date for data.
Comments
Post a Comment