Monday, November 18, 2013

Parsing USGS river conditions in JSON

Over the summer, I wrote a post about writing an application for OpenShift by Red Hat to display current USGS river levels on an interactive map. That post went into quite a bit of detail about populating the MongoDB database, using MongoDB's geo features, displaying the map, and so forth. However, in the interests of keeping the post to a manageable length, I only lightly touched on the topic of retrieving and interpreting the data from the USGS in the first place. That will be the topic of this post. It's a good case study because this particular data set is relatively complex.

Historically, it could be fairly difficult to retrieve USGS data. As the site itself notes: "Most data from the USGS Water Data for the Nation site are currently downloaded as tab-delimited (rdb) data files. While this approach works, it uses 20th century approaches rather than 21st century approaches." Fortunately, for our purposes here, the current conditions data can now be retrieved using the USGS instantaneous values web service in either XML or JSON (JavaScript Object Notation) format. This post will describe the procedure for using JSON.

Determining the syntactically correct URL to use with the USGS Instantaneous Values REST web service

The first step is to determine the right URL or URLs to use with the web service. The USGS provides a nice tool that allows you to figure this out interactively. Note that the USGS won't let you pull data for all of its gauges at once; you have to specify at least one "major filter." For my purposes, the best approach seemed to be to create a list of two-letter lowercase state abbreviations (plus DC and Puerto Rico) and build a URL for each individual state. I also observed that the service didn't always return a result; sometimes it simply timed out. Querying state by state has the added benefit that a failed request costs you only that state's update rather than the entire data set.

Retrieving the data

The code below shows how to retrieve the current conditions for each state in Python 2.7 using urllib2:

import json
import urllib2

# Two-letter state abbreviations plus DC and Puerto Rico
statelist = ["al","ak","az","ar","ca","co","ct","de","dc","fl","ga","hi","id","il","in","ia","ks","ky","la","me","md","ma","mi","mn","ms","mo","mt","ne","nv","nh","nj","nm","ny","nc","nd","oh","ok","or","pa","ri","sc","sd","tn","tx","ut","vt","va","wa","wv","wi","wy","pr"]
for i in statelist:
    # Request discharge (00060) and gage height (00065) for all stream gauges in the state
    requesturl = "http://waterservices.usgs.gov/nwis/iv/?format=json,1.1&stateCd=" + i + "&parameterCd=00060,00065&siteType=ST"
    req = urllib2.Request(requesturl)
    opener = urllib2.build_opener()
    f = opener.open(req)
    entry = json.loads(f.read())

Each request returns all the gauges in the state of type "ST" (stream).
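
Since the service sometimes times out, it's worth isolating each request so that one failure doesn't abort the whole run. Here's a minimal sketch along those lines; fetch_state is a hypothetical helper, not part of the original application, and it relies on urlopen's timeout parameter:

import json
import socket
import urllib2

def fetch_state(state, timeout=30):
    # Return the parsed JSON for one state, or None if the request fails
    url = ("http://waterservices.usgs.gov/nwis/iv/?format=json,1.1"
           "&stateCd=" + state + "&parameterCd=00060,00065&siteType=ST")
    try:
        f = urllib2.urlopen(url, timeout=timeout)
        return json.loads(f.read())
    except (urllib2.URLError, socket.timeout, ValueError):
        # A timeout or malformed response only costs us this state's update
        return None

With something like this in place, the loop can skip any state whose fetch returns None and pick it up again on the next update cycle.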

For each iteration (that is, for each state), we now need to parse the contents of the entry variable. Before doing so, however, let's look at what the USGS has sent to our computer.

The JSON

If you just type an appropriate URL into your browser, you'll get back a big block of text. For example, the URL:

http://waterservices.usgs.gov/nwis/iv/?format=json,1.1&stateCd=ma&parameterCd=00060,00065&siteType=ST

returns a string that starts with the following fragment:

{"name":"ns1:timeSeriesResponseType","declaredType":"org.cuahsi.waterml.TimeSeriesResponseType","scope":"javax.xml.bind.JAXBElement$GlobalScope","value":{"queryInfo":{"creationTime":null,"queryURL":"http://waterservices.usgs.gov/nwis/iv/","criteria":{"locationParam":"[]","variableParam":"[00060, 00065]","timeParam":null,"parameter":[],"methodCalled":null},"note":[{"value":"[ma]","type":null,"href":null,"title":"filter:stateCd","show":null},{"value":"[ST]","type":null,"href":null,"title":"filter:siteType","show":null},{"value":"[mode=LATEST, modifiedSince=null]","type":null,"href":null,"title":"filter:timeRange","show":null},{"value":"methodIds=[ALL]","type":null,"href":null,"title":"filter:methodId","show":null},{"value":"2013-11-18T17:59:12.444Z","type":null,"href":null,"title":"requestDT","show":null},{"value":"2172fbb0-507b-11e3-9141-6cae8b6642ea","type":null,"href":null,"title":"requestId","show":null},{"value":"Provisional data are subject to revision. Go to http://waterdata.usgs.gov/nwis/help/?provisional for more information.","type":null,"href":null,"title":"disclaimer","show":null},{"value":"sdas01","type":null,"href":null,"title":"server","show":null}],"extension":null},"timeSeries":[{"sourceInfo":{"siteName":"NORTH NASHUA RIVER AT FITCHBURG, MA","siteCode":[{"value":"01094400","network":"NWIS","siteID":null,"agencyCode":"USGS","agencyName":null,"default":null}],"timeZoneInfo":{"defaultTimeZone":{"zoneOffset":"-05:00","zoneAbbreviation":"EST"},"daylightSavingsTimeZone":{"zoneOffset":"-04:00","zoneAbbreviation":"EDT"},"siteUsesDaylightSavingsTime":true},"geoLocation":{"geogLocation":

Not so easy to parse, as you can see. The problem is that the returned JSON is deeply nested, which makes it very hard to interpret just by eyeballing it. I found that using a JSON viewer really helped. There are a lot of different ones out there, but I used the one at chris.photobooks.com. (I'm including some screenshots below, but I'd encourage you to retrieve your own JSON from the USGS using your browser, paste it into this or another JSON viewer, and play along at home.)
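
If you'd rather not paste data into a web page, Python's standard json module can pretty-print the same structure. This is just an alternative sketch, not something the workflow requires:

import json
import urllib2

url = ("http://waterservices.usgs.gov/nwis/iv/?format=json,1.1"
       "&stateCd=ma&parameterCd=00060,00065&siteType=ST")
entry = json.loads(urllib2.urlopen(url).read())

# Indented output makes the nesting visible, much like the web-based viewer
print json.dumps(entry, indent=2)[:2000]  # first chunk only, to keep it readable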

I'm not going to go through how to access every field you may be interested in, but I'll hit a few that demonstrate interesting points about the data structure.

Interpreting the data

The first point to notice is that the returned data has a single top-level object; you have to go two levels deep, under [value][timeSeries], before you reach the array of data structures that contain the current conditions for each of the gauge locations.

So let's say we want to iterate over this data and get to the gauge number. Here's what that looks like:

[Screenshot: JSON viewer drilling down to an individual gauge number]

count = len(entry['value']['timeSeries']) - 1
while count >= 0:
    # We construct an array of the relevant values associated with a gauge number
    # Note that gage height and discharge are in separate entries
    # Right here we're just filling out the "permanent" values
    # Gauge number. This will be the dictionary index
    agaugenum = entry['value']['timeSeries'][count]['sourceInfo']['siteCode'][0]['value']
    # ... (the remaining fields are filled in here) ...
    count -= 1

Per the screenshot that drills down to the individual gauge level, we're first iterating over the data structure that holds the individual gauge conditions (['value']['timeSeries'][count]). Then we drill down another four levels until we get to the 'value' field holding the gauge number. This is why this data is very hard to parse just by looking at the raw text.
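
As an aside, if you don't need the index, the same traversal reads a bit more naturally as a forward for loop; a minimal sketch:

for ts in entry['value']['timeSeries']:
    # ts is one time series record; drill down to its gauge number
    agaugenum = ts['sourceInfo']['siteCode'][0]['value']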

[Screenshot: JSON viewer showing the values and variable entries for a single time series]

There's also a wrinkle to be aware of when working with this data. I've been talking as if the individual "records" were for gauges. But, actually, they're not. They're for values. 

The value can be found here:

entry['value']['timeSeries'][count]['values'][0]['value'][0]['value']

But what that value means is defined here:

entry['value']['timeSeries'][count]['variable']['variableCode'][0]['variableID']

What this means in practice for stream data is that most (but not all) gauges have two records: one for the current river height (in feet) and one for the current river flow (in cubic feet per second, or cfs). A variableID of 45807202 corresponds to height and a variableID of 45807197 corresponds to flow.
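
Putting those two paths together, here's a hedged sketch of how one iteration might sort a reading into height or flow. The gaugedict dictionary is a hypothetical container for illustration, and comparing variableID as an integer is an assumption worth verifying against your own JSON; only the key paths come from the structure shown above:

ts = entry['value']['timeSeries'][count]
agaugenum = ts['sourceInfo']['siteCode'][0]['value']
varid = ts['variable']['variableCode'][0]['variableID']
reading = ts['values'][0]['value'][0]['value']

record = gaugedict.setdefault(agaugenum, {})  # gaugedict is hypothetical
if varid == 45807202:
    record['height'] = reading  # gage height, in feet
elif varid == 45807197:
    record['flow'] = reading    # discharge, in cfs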

I should probably mention that my code assumes these are the only two types of variables and that their units don't vary from site to site. I haven't seen any exceptions, but for any sort of critical application, I'd probably use additional data encoded in the JSON rather than just assuming it's all consistent and invariable. (My application also just updates the current conditions rather than looking for changes to any of the parameters associated with the gauge itself, such as location.)
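
As a small step in that direction, here's a sketch of a defensive check that uses only the variableID path shown above; the set of known IDs is just what I've observed, and anything unexpected gets flagged rather than silently dropped:

# The only two variableIDs I've observed for stream gauges
KNOWN_VARIABLES = {45807202: 'height', 45807197: 'flow'}

varid = ts['variable']['variableCode'][0]['variableID']
if varid not in KNOWN_VARIABLES:
    # Flag anything unrecognized instead of silently assuming
    print "Unexpected variableID %s at gauge %s" % (varid, agaugenum)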

The same basic process I've described here can be applied to data from a wide range of sources. For example, you can read about the structure associated with tweets on the Twitter developer site. It's easy to get started playing around. And when you get to the point where you want to run an application with a database or other components, give OpenShift Online by Red Hat a try if you're not already doing so. It's free and easy to get up and running with the language of your choice.
