So you want to write an R client for a web API? This document walks through the key issues involved in writing API wrappers in R. If you’re new to working with web APIs, you may want to start by reading “An introduction to APIs” by zapier.
APIs vary widely. Before starting to code, it is important to understand how the API you are working with handles important issues so that you can implement a complete and coherent R client for the API.
The key features of any API are the structure of the requests and the structure of the responses. An HTTP request consists of the following parts:
GET
, POST
, DELETE
,
etc.)?foo=bar
)An API package needs to be able to generate these components in order to perform the desired API call, which will typically involve some sort of authentication.
For example, to request that the GitHub API provides a list of all issues for the httr repo, we send an HTTP request that looks like:
-> GET /repos/hadley/httr HTTP/1.1
-> Host: api.github.com
-> Accept: application/vnd.github.v3+json
Here we’re using a GET
request to the host
api.github.com
. The url is /repos/hadley/httr
,
and we send an accept header that tells GitHub what sort of data we
want.
In response to this request, the API will return an HTTP response that includes:
An API client needs to parse these responses, turning API errors into R errors, and return a useful object to the end user. For the previous HTTP request, GitHub returns:
<- HTTP/1.1 200 OK
<- Server: GitHub.com
<- Content-Type: application/json; charset=utf-8
<- X-RateLimit-Limit: 5000
<- X-RateLimit-Remaining: 4998
<- X-RateLimit-Reset: 1459554901
<-
<- {
<- "id": 2756403,
<- "name": "httr",
<- "full_name": "hadley/httr",
<- "owner": {
<- "login": "hadley",
<- "id": 4196,
<- "avatar_url": "https://avatars.githubusercontent.com/u/4196?v=3",
<- ...
<- },
<- "private": false,
<- "html_url": "https://github.com/hadley/httr",
<- "description": "httr: a friendly http package for R",
<- "fork": false,
<- "url": "https://api.github.com/repos/hadley/httr",
<- ...
<- "network_count": 1368,
<- "subscribers_count": 64
<- }
Designing a good API client requires identifying how each of these API features is used to compose a request and what type of response is expected for each. It’s best practice to insulate the end user from how the API works so they only need to understand how to use an R function, not the details of how APIs work. It’s your job to suffer so that others don’t have to!
First, find a simple API endpoint that doesn’t require
authentication: this lets you get the basics working before tackling the
complexities of authentication. For this example, we’ll use the list of
httr issues which requires sending a GET request to
repos/hadley/httr
:
library(httr)
<- function(path) {
github_api <- modify_url("https://api.github.com", path = path)
url GET(url)
}
<- github_api("/repos/hadley/httr")
resp
resp#> Response [https://api.github.com/repositories/2756403]
#> Date: 2022-05-03 20:41
#> Status: 200
#> Content-Type: application/json; charset=utf-8
#> Size: 6.21 kB
#> {
#> "id": 2756403,
#> "node_id": "MDEwOlJlcG9zaXRvcnkyNzU2NDAz",
#> "name": "httr",
#> "full_name": "r-lib/httr",
#> "private": false,
#> "owner": {
#> "login": "r-lib",
#> "id": 22618716,
#> "node_id": "MDEyOk9yZ2FuaXphdGlvbjIyNjE4NzE2",
#> ...
Next, you need to take the response returned by the API and turn it into a useful object. Any API will return an HTTP response that consists of headers and a body. While the response can come in multiple forms (see above), two of the most common structured formats are XML and JSON.
Note that while most APIs will return only one or the other, some, like the colour lovers API, allow you to choose which one with a url parameter:
GET("http://www.colourlovers.com/api/color/6B4106?format=xml")
#> Response [http://www.colourlovers.com/api/color/6B4106?format=xml]
#> Date: 2022-05-03 20:41
#> Status: 403
#> Content-Type: text/html; charset=UTF-8
#> Size: 12 kB
#> <!DOCTYPE html>
#> <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
#> <!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
#> <!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
#> <!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
#> <head>
#>
#> <title>Please Wait... | Cloudflare</title>
#>
#> <meta charset="UTF-8" />
#> ...
GET("http://www.colourlovers.com/api/color/6B4106?format=json")
#> Response [http://www.colourlovers.com/api/color/6B4106?format=json]
#> Date: 2022-05-03 20:41
#> Status: 403
#> Content-Type: text/html; charset=UTF-8
#> Size: 12 kB
#> <!DOCTYPE html>
#> <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
#> <!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
#> <!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
#> <!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
#> <head>
#>
#> <title>Please Wait... | Cloudflare</title>
#>
#> <meta charset="UTF-8" />
#> ...
Others use content
negotiation to determine what sort of data to send back. If the API
you’re wrapping does this, then you’ll need to include one of
accept_json()
and accept_xml()
in your
request.
If you have a choice, choose json: it’s usually much easier to work with than xml.
Most APIs will return most or all useful information in the response
body, which can be accessed using content()
. To determine
what type of information is returned, you can use
http_type()
http_type(resp)
#> [1] "application/json"
I recommend checking that the type is as you expect in your helper function. This will ensure that you get a clear error message if the API changes:
<- function(path) {
github_api <- modify_url("https://api.github.com", path = path)
url
<- GET(url)
resp if (http_type(resp) != "application/json") {
stop("API did not return json", call. = FALSE)
}
resp }
NB: some poorly written APIs will say the content is type A, but it will actually be type B. In this case you should complain to the API authors, and until they fix the problem, simply drop the check for content type.
Next we need to parse the output into an R object. httr provides some
default parsers with content(..., as = "auto")
but I don’t
recommend using them inside a package. Instead it’s better to explicitly
parse it yourself:
jsonlite
package.xml2
package.<- function(path) {
github_api <- modify_url("https://api.github.com", path = path)
url
<- GET(url)
resp if (http_type(resp) != "application/json") {
stop("API did not return json", call. = FALSE)
}
::fromJSON(content(resp, "text"), simplifyVector = FALSE)
jsonlite }
Rather than simply returning the response as a list, I think it’s a good practice to make a simple S3 object. That way you can return the response and parsed object, and provide a nice print method. This will make debugging later on much much much more pleasant.
<- function(path) {
github_api <- modify_url("https://api.github.com", path = path)
url
<- GET(url)
resp if (http_type(resp) != "application/json") {
stop("API did not return json", call. = FALSE)
}
<- jsonlite::fromJSON(content(resp, "text"), simplifyVector = FALSE)
parsed
structure(
list(
content = parsed,
path = path,
response = resp
),class = "github_api"
)
}
<- function(x, ...) {
print.github_api cat("<GitHub ", x$path, ">\n", sep = "")
str(x$content)
invisible(x)
}
github_api("/users/hadley")
#> <GitHub /users/hadley>
#> List of 32
#> $ login : chr "hadley"
#> $ id : int 4196
#> $ node_id : chr "MDQ6VXNlcjQxOTY="
#> $ avatar_url : chr "https://avatars.githubusercontent.com/u/4196?v=4"
#> $ gravatar_id : chr ""
#> $ url : chr "https://api.github.com/users/hadley"
#> $ html_url : chr "https://github.com/hadley"
#> $ followers_url : chr "https://api.github.com/users/hadley/followers"
#> $ following_url : chr "https://api.github.com/users/hadley/following{/other_user}"
#> $ gists_url : chr "https://api.github.com/users/hadley/gists{/gist_id}"
#> $ starred_url : chr "https://api.github.com/users/hadley/starred{/owner}{/repo}"
#> $ subscriptions_url : chr "https://api.github.com/users/hadley/subscriptions"
#> $ organizations_url : chr "https://api.github.com/users/hadley/orgs"
#> $ repos_url : chr "https://api.github.com/users/hadley/repos"
#> $ events_url : chr "https://api.github.com/users/hadley/events{/privacy}"
#> $ received_events_url: chr "https://api.github.com/users/hadley/received_events"
#> $ type : chr "User"
#> $ site_admin : logi FALSE
#> $ name : chr "Hadley Wickham"
#> $ company : chr "@rstudio "
#> $ blog : chr "http://hadley.nz"
#> $ location : chr "Houston, TX"
#> $ email : NULL
#> $ hireable : NULL
#> $ bio : chr "Chief Scientist at @rstudio"
#> $ twitter_username : chr "hadleywickham"
#> $ public_repos : int 175
#> $ public_gists : int 178
#> $ followers : int 22416
#> $ following : int 0
#> $ created_at : chr "2008-04-01T14:47:36Z"
#> $ updated_at : chr "2022-04-09T12:45:00Z"
The API might return invalid data, but this should be rare, so you can just rely on the parser to provide a useful error message.
Next, you need to make sure that your API wrapper throws an error if the request failed. Using a web API introduces additional possible points of failure into R code aside from those occurring in R itself. These include:
You need to make sure these are all converted into regular R errors.
You can figure out if there’s a problem with http_error()
,
which checks the HTTP status code. Status codes in the 400 range usually
mean that you’ve done something wrong. Status codes in the 500 range
typically mean that something has gone wrong on the server side.
Often the API will provide information about the error in the body of the response: you should use this where available. If the API returns special errors for common problems, you might want to provide more detail in the error. For example, if you run out of requests and are rate limited you might want to tell the user how long to wait until they can make the next request (or even automatically wait that long!).
<- function(path) {
github_api <- modify_url("https://api.github.com", path = path)
url
<- GET(url)
resp if (http_type(resp) != "application/json") {
stop("API did not return json", call. = FALSE)
}
<- jsonlite::fromJSON(content(resp, "text"), simplifyVector = FALSE)
parsed
if (http_error(resp)) {
stop(
sprintf(
"GitHub API request failed [%s]\n%s\n<%s>",
status_code(resp),
$message,
parsed$documentation_url
parsed
),call. = FALSE
)
}
structure(
list(
content = parsed,
path = path,
response = resp
),class = "github_api"
)
}github_api("/user/hadley")
#> Error: GitHub API request failed [404]
#> Not Found
#> <https://docs.github.com/rest/reference/users#get-a-user>
Some poorly written APIs will return different types of response based on whether or not the request succeeded or failed. If your API does this you’ll need to make your request function check the
status_code()
before parsing the response.
For many APIs, the common approach is to retry API calls that return something in the 500 range. However, when doing this, it’s extremely important to make sure to do this with some form of exponential backoff: if something’s wrong on the server-side, hammering the server with retries may make things worse, and may lead to you exhausting quota (or hitting other sorts of rate limits). A common policy is to retry up to 5 times, starting at 1s, and each time doubling and adding a small amount of jitter (plus or minus up to, say, 5% of the current wait time).
While we’re in this function, there’s one important header that you should set for every API wrapper: the user agent. The user agent is a string used to identify the client. This is most useful for the API owner as it allows them to see who is using the API. It’s also useful for you if you have a contact on the inside as it often makes it easier for them to pull your requests from their logs and see what’s going wrong. If you’re hitting a commercial API, this also makes it easier for internal R advocates to see how many people are using their API via R and hopefully assign more resources.
A good default for an R API package wrapper is to make it the URL to your GitHub repo:
<- user_agent("http://github.com/hadley/httr")
ua
ua#> <request>
#> Options:
#> * useragent: http://github.com/hadley/httr
<- function(path) {
github_api <- modify_url("https://api.github.com", path = path)
url
<- GET(url, ua)
resp if (http_type(resp) != "application/json") {
stop("API did not return json", call. = FALSE)
}
<- jsonlite::fromJSON(content(resp, "text"), simplifyVector = FALSE)
parsed
if (status_code(resp) != 200) {
stop(
sprintf(
"GitHub API request failed [%s]\n%s\n<%s>",
status_code(resp),
$message,
parsed$documentation_url
parsed
),call. = FALSE
)
}
structure(
list(
content = parsed,
path = path,
response = resp
),class = "github_api"
) }
Most APIs work by executing an HTTP method on a specified URL with some additional parameters. These parameters can be specified in a number of ways, including in the URL path, in URL query arguments, in HTTP headers, and in the request body itself. These parameters can be controlled using httr functions:
modify_url()
query
argument to
GET()
, POST()
, etc.add_headers()
body
argument to GET()
,
POST()
, etc.RESTful
APIs also use the HTTP verb to communicate arguments (e.g.,
GET
retrieves a file, POST
adds a file,
DELETE
removes a file, etc.). We can use the helpful httpbin service to show how to send
arguments in each of these ways.
# modify_url
POST(modify_url("https://httpbin.org", path = "/post"))
# query arguments
POST("http://httpbin.org/post", query = list(foo = "bar"))
# headers
POST("http://httpbin.org/post", add_headers(foo = "bar"))
# body
## as form
POST("http://httpbin.org/post", body = list(foo = "bar"), encode = "form")
## as json
POST("http://httpbin.org/post", body = list(foo = "bar"), encode = "json")
Many APIs will use just one of these forms of argument passing, but others will use multiple of them in combination. Best practice is to insulate the user from how and where the various arguments are used by the API and instead simply expose relevant arguments via R function arguments, some of which might be used in the URL, in the headers, in the body, etc.
If a parameter has a small fixed set of possible values that are
allowed by the API, you can use list them in the default arguments and
then use match.arg()
to ensure that the caller only
supplies one of those values. (This also allows the user to supply the
short unique prefixes.)
<- function(x = c("apple", "banana", "orange")) {
f match.arg(x)
}f("a")
#> [1] "apple"
It is good practice to explicitly set default values for arguments
that are not required to NULL
. If there is a default value,
it should be the first one listed in the vector of allowed
arguments.
Many APIs can be called without any authentication (just as if you called them in a web browser). However, others require authentication to perform particular requests or to avoid rate limits and other limitations. The most common forms of authentication are OAuth and HTTP basic authentication:
“Basic” authentication: This requires a username and
password (or sometimes just a username). This is passed as part of the
HTTP request. In httr, you can do:
GET("http://httpbin.org", authenticate("username", "password"))
Basic authentication with an API key: An alternative provided by many APIs is an API “key” or “token” which is passed as part of the request. It is better than a username/password combination because it can be regenerated independent of the username and password.
This API key can be specified in a number of different ways: in a URL
query argument, in an HTTP header such as the Authorization
header, or in an argument inside the request body.
OAuth: OAuth is a protocol for generating a user- or
session-specific authentication token to use in subsequent requests. (An
early standard, OAuth 1.0, is not terribly common any more. See
oauth1.0_token()
for details.) The current OAuth 2.0
standard is very common in modern web apps. It involves a round trip
between the client and server to establish if the API client has the
authority to access the data. See oauth2.0_token()
. It’s ok
to publish the app ID and app “secret” - these are not actually
important for security of user data.
Some APIs describe their authentication processes inaccurately, so care needs to be taken to understand the true authentication mechanism regardless of the label used in the API docs.
It is possible to specify the key(s) or token(s) required for basic or OAuth authentication in a number of different ways. You may also need some way to preserve user credentials between function calls so that end users do not need to specify them each time. A good start is to use an environment variable. Here is an example of how to write a function that checks for the presence of a GitHub personal access token and errors otherwise:
<- function() {
github_pat <- Sys.getenv('GITHUB_PAT')
pat if (identical(pat, "")) {
stop("Please set env var GITHUB_PAT to your github personal access token",
call. = FALSE)
}
pat }
One particularly frustrating aspect of many APIs is dealing with paginated responses. This is common in APIs that offer search functionality and have the potential to return a very large number of responses. Responses might be paginated because there is a large number of response elements or because elements are updated frequently. Often they will be sorted by an explicit or implicit argument specified in the request.
When a response is paginated, the API response will typically respond with a header or value specified in the body that contains one of the following:
These values can then be used to make further requests. This will either involve specifying a specific page of responses or specifying a “next page token” that returns the next page of results. How to deal with pagination is a difficult question and a client could implement any of the following:
The choice of which to use depends on your needs and goals and the rate limits of the API.
Many APIs are rate limited, which means that you can only send a
certain number of requests per hour. Often if your request is rate
limited, the error message will tell you how long you should wait before
performing another request. You might want to expose this to the user,
or even include a call to Sys.sleep()
that waits long
enough.
For example, we could implement a rate_limit()
function
that tells you how many calls against the github API are available to
you.
<- function() {
rate_limit github_api("/rate_limit")
}rate_limit()
#> <GitHub /rate_limit>
#> List of 2
#> $ resources:List of 4
#> ..$ core :List of 5
#> .. ..$ limit : int 60
#> .. ..$ remaining: int 30
#> .. ..$ reset : int 1651612531
#> .. ..$ used : int 30
#> .. ..$ resource : chr "core"
#> ..$ graphql :List of 5
#> .. ..$ limit : int 0
#> .. ..$ remaining: int 0
#> .. ..$ reset : int 1651614075
#> .. ..$ used : int 0
#> .. ..$ resource : chr "graphql"
#> ..$ integration_manifest:List of 5
#> .. ..$ limit : int 5000
#> .. ..$ remaining: int 5000
#> .. ..$ reset : int 1651614075
#> .. ..$ used : int 0
#> .. ..$ resource : chr "integration_manifest"
#> ..$ search :List of 5
#> .. ..$ limit : int 10
#> .. ..$ remaining: int 10
#> .. ..$ reset : int 1651610535
#> .. ..$ used : int 0
#> .. ..$ resource : chr "search"
#> $ rate :List of 5
#> ..$ limit : int 60
#> ..$ remaining: int 30
#> ..$ reset : int 1651612531
#> ..$ used : int 30
#> ..$ resource : chr "core"
After getting the first version working, you’ll often want to polish the output to be more user friendly. For this example, we can parse the unix timestamps into more useful date types.
<- function() {
rate_limit <- github_api("/rate_limit")
req <- req$content$resources$core
core
<- as.POSIXct(core$reset, origin = "1970-01-01")
reset cat(core$remaining, " / ", core$limit,
" (Resets at ", strftime(reset, "%H:%M:%S"), ")\n", sep = "")
}
rate_limit()
#> 30 / 60 (Resets at 16:15:31)