HTTP with Python – PycURL by Example

A colleague of mine recently remarked something along the lines of “whenever I need to do HTTP client stuff in any language, I usually go look for cURL bindings straight away”, and I’m starting to agree on that one. HTTP is hard to do right, and cURL has a track record of doing HTTP right. If you don’t know what cURL is, take a look at http://curl.haxx.se

Faced with some serious HTTP work in a recent little project, and with Python as the language of choice, I naturally went looking for the most popular HTTP client implementation in Python. urllib2 seemed to be the most popular library in that domain. I started fiddling with it for a bit, and while I got more and more confused and annoyed (they tend to go hand in hand, don’t they?) with the API design of this particular module, I ran into an astonishing shortcoming: it doesn’t currently let you do HTTPS over a proxy. Seriously? This was the moment I hit Google to search for Python cURL bindings. I felt lucky, and this brought me to PycURL.

Using PycURL, I was able to implement my use cases in a snap. Then I wondered – why isn’t this module more popular among Python developers? The PycURL website already states the reasons – it doesn’t have a very Pythonic interface as it is a thin layer over libcurl (which is implemented in C), and it takes some knowledge of the cURL C API to put it to effective use. The first reason I can’t really help with, but I’ll be focusing on the second – gaining some knowledge of the underlying API. I’ll try to explain how this API works by implementing a number of use cases using PycURL.

The simplest use case possible would probably be to retrieve some content from a URL, so let’s get our feet wet.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://news.ycombinator.com')
c.perform()

This will perform a GET request on the given URL, and spit out the contents of the response body to stdout. In case it isn’t obvious enough to you already, we instantiate a new Curl object, call setopt() on it to influence cURL’s behavior, and call perform() to execute the actual HTTP request. This is a typical example of how using PycURL works: instantiate, set cURL instructions, perform.

We probably want to catch the output so we can do interesting things with it before dumping it to stdout, so here I’ll show you how to catch the output into a variable instead.

import pycurl
import cStringIO
 
buf = cStringIO.StringIO()
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://news.ycombinator.com')
c.setopt(c.WRITEFUNCTION, buf.write)
c.perform()
 
print buf.getvalue()
buf.close()
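
The listing above is Python 2. Under Python 3 there is no cStringIO module, and PycURL hands the response body to the write callback as bytes, so a rough equivalent would use io.BytesIO instead. The sketch below assumes Python 3; fetch_body is a hypothetical helper name (not part of PycURL), and it expects the caller to pass in a pycurl.Curl() object.

```python
from io import BytesIO


def fetch_body(handle, url):
    """Perform a GET on `url` and return the response body as bytes.

    `handle` is assumed to be a pycurl.Curl() instance created by the caller.
    """
    buf = BytesIO()
    handle.setopt(handle.URL, url)
    # Under Python 3 the callback receives bytes, which BytesIO.write accepts.
    handle.setopt(handle.WRITEFUNCTION, buf.write)
    handle.perform()
    return buf.getvalue()
```

With PycURL installed, fetch_body(pycurl.Curl(), 'http://news.ycombinator.com') would return the raw bytes; how to decode them (for instance with .decode('utf-8')) depends on the charset the server actually sent.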

In the cStringIO example we use a string buffer for cURL to write the response into. By setting the WRITEFUNCTION option to the write method of our string buffer, the output ends up in the buffer once perform() is called. You can just as well use StringIO instead of cStringIO, but the latter is faster. Again, pretty much any behavior is defined using setopt(). Need to have the request go through a proxy, with a connect timeout and a limit on the total time the transfer may take? Here goes:

import pycurl
import cStringIO
 
buf = cStringIO.StringIO()
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://news.ycombinator.com')
c.setopt(c.WRITEFUNCTION, buf.write)
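c.setopt(c.CONNECTTIMEOUT, 5)  # give up if connecting takes more than 5 seconds
c.setopt(c.TIMEOUT, 8)  # give up if the whole transfer takes more than 8 seconds
c.setopt(c.PROXY, 'http://inthemiddle.com:8080')  # send the request through this proxy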
c.perform()
 
print buf.getvalue()
buf.close()

Not necessarily Pythonic, but pretty simple, right? Wait, how do we know what options we can use, and know what they do? Easy, you go to http://curl.haxx.se/libcurl/c/curl_easy_setopt.html, find the option you’re looking for, and set it using setopt(), minus the ‘CURLOPT_’ part (the PycURL module sets it for you so you won’t get bored of typing ‘CURLOPT_’ all the time).
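
To make the naming rule concrete, here is a small sketch that takes option names exactly as they appear on the libcurl page and applies them to a handle. Both to_pycurl_name and apply_options are hypothetical helpers, not part of PycURL. The sketch assumes the handle exposes the option constants as attributes, the way c.URL and c.PROXY are used in the examples above.

```python
def to_pycurl_name(libcurl_name):
    """Turn a libcurl option name such as 'CURLOPT_PROXY' into 'PROXY'."""
    prefix = 'CURLOPT_'
    if not libcurl_name.startswith(prefix):
        raise ValueError('not a CURLOPT_ option name: %r' % (libcurl_name,))
    return libcurl_name[len(prefix):]


def apply_options(handle, options):
    """Call handle.setopt() once for each (libcurl option name, value) pair.

    `handle` is assumed to be a pycurl.Curl() instance.
    """
    for libcurl_name, value in options.items():
        # Look the constant up on the handle, e.g. handle.CONNECTTIMEOUT.
        handle.setopt(getattr(handle, to_pycurl_name(libcurl_name)), value)
```

With a real handle this would look like apply_options(pycurl.Curl(), {'CURLOPT_URL': 'http://news.ycombinator.com', 'CURLOPT_CONNECTTIMEOUT': 5}). An option that your PycURL build does not expose would raise AttributeError.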

We’ve just scratched the surface of what cURL can do. Here is how you perform a POST request.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://myfavpizzaplace.com/order')
c.setopt(c.POSTFIELDS, 'pizza=Quattro+Stagioni&extra=cheese')
c.perform()
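
The POSTFIELDS value in the request above is a hand-written, already URL-encoded string. For anything less trivial, that string could be built from a dictionary, roughly as sketched below. The sketch assumes Python 3, where urlencode lives in urllib.parse (in Python 2 it is urllib.urlencode); build_post_body is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_post_body(fields):
    """URL-encode a mapping of form fields into a POSTFIELDS-style string."""
    return urlencode(fields)


order = build_post_body({'pizza': 'Quattro Stagioni', 'extra': 'cheese'})
print(order)  # pizza=Quattro+Stagioni&extra=cheese
```

The resulting string can be handed to cURL as is: c.setopt(c.POSTFIELDS, order).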

By setting the POSTFIELDS option, the request automatically becomes a POST request. The POST data is obviously set in the value for this option, in the form of a query string containing the variables to send; their values need to be URL-encoded (using urllib.urlencode() helps in demanding cases). So you think something is going wrong, and you want to take a look at the raw request as it’s being carried out? The VERBOSE option will help you there.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://myfavpizzaplace.com/order')
c.setopt(c.POSTFIELDS, 'pizza=Quattro+Stagioni&extra=cheese')
c.setopt(c.VERBOSE, True)
c.perform()

Setting the VERBOSE cURL option will print verbose information to stderr, so you can see exactly what is going on: from setting up the connection and creating the HTTP request, to the headers and the response that come back. This is very useful while programming with cURL, maybe even more so with sequences of requests between which you want to preserve cookie state. Preserving that state is actually fairly simple to implement; let’s take a look.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://myappserver.com/ses1')
c.setopt(c.COOKIEFILE, '')
c.setopt(c.VERBOSE, True)
c.perform()
 
c.setopt(c.URL, 'http://myappserver.com/ses2')
c.perform()

Let’s assume here that myappserver.com/ses1 initializes a session by using a session cookie, and possibly sets other cookies as well. These are needed to proceed to myappserver.com/ses2, so they have to be sent back to the server with the second request.

What happens in the code above is interesting, in the sense that we create a Curl object once and use it to perform two requests. When used like this, all options set on the object stay in effect for subsequent requests, for as long as the handle exists or until they are overridden. In this case we’re setting the COOKIEFILE option, which is normally used to provide the path to a cookie file whose contents are sent in an HTTP Cookie header. However, we set its value to an empty string, which makes cURL cookie-aware without reading any file: it stores the cookies it receives and sends them back on subsequent requests. Hence, state is kept intact between requests on the same cURL handle. Since we use the VERBOSE option as well, the code above will show you exactly what happens during the requests.

This approach can be used to simulate, for example, login sessions or other flows through a web application that need multiple HTTP requests, without having to maintain cookie (including session cookie) state yourself. Performing multiple HTTP requests on one cURL handle also has the convenient side effect that the TCP connection will be reused when targeting the same host multiple times, which can give you a performance boost. In production code you want to set more options, e.g. for handling timeouts and providing some level of fault tolerance. Here’s a more extensive version of the previous example.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://myappserver.com/ses1')
c.setopt(c.CONNECTTIMEOUT, 5)
c.setopt(c.TIMEOUT, 8)
c.setopt(c.COOKIEFILE, '')
c.setopt(c.FAILONERROR, True)
c.setopt(c.HTTPHEADER, ['Accept: text/html', 'Accept-Charset: UTF-8'])
try:
    c.perform()
 
    c.setopt(c.URL, 'http://myappserver.com/ses2')
    c.setopt(c.POSTFIELDS, 'foo=bar&bar=foo')
    c.perform()
except pycurl.error, error:
    errno, errstr = error
    print 'An error occurred: ', errstr

In this example we explicitly specify timeouts (which you should always do in real-world situations), set some custom HTTP headers, and set the FAILONERROR cURL option to let cURL fail when the server returns an HTTP status code of 400 or higher. PycURL will raise an exception when this is the case, which allows you to deal with such situations gracefully.
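
The except clause in the listing above is Python 2 syntax. Under Python 3 it becomes "except pycurl.error as exc:", and the exception can no longer be unpacked directly, so the code and message have to be taken from exc.args. Below is a small sketch, assuming Python 3 and the two-element (code, message) form that a failed perform() carries; describe_curl_error is a hypothetical helper, not part of PycURL.

```python
def describe_curl_error(exc):
    """Format a pycurl.error-style exception for display.

    Assumes exc.args is (code, message), which is what a failed perform()
    supplies; anything else falls back to the plain string form.
    """
    if len(exc.args) == 2:
        code, message = exc.args
        return 'An error occurred: %s (cURL error %d)' % (message, code)
    return 'An error occurred: %s' % (exc,)
```

Inside the handler this would be print(describe_curl_error(exc)). With FAILONERROR set, a 404 response would typically surface as cURL error 22 (CURLE_HTTP_RETURNED_ERROR).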

This should be enough information to get you started on using cURL through Python. There’s a whole world of functionality inside cURL that you can use; for example, one of its very powerful features is the ability to execute multiple requests in parallel which can (when using PycURL) be done by using the CurlMulti object (http://pycurl.sourceforge.net/doc/curlmultiobject.html). I hope you’ll agree that using PycURL is fairly simple, really powerful and pretty fast.

Guess who’s back

I haven’t posted on this weblog for ages. Why? For a few reasons, mainly because I switched jobs from development to operations, which has been taking a lot of my time. I made this switch consciously, since I reasoned that if I spent some time in operations I’d learn more about how delivered code behaves in production. So far, that has been paying off. This is not to say that I’m not programming anymore; I just write code from a different angle and with different goals. I hope to blog a bit more about the things that have kept me busy lately, such as troubleshooting, performance tuning, monitoring and programming in general (which still interests me the most). Hopefully I’ll find enough time to write new posts, elaborate on questions and comments, and so on. If not, you can safely assume that I’m busy keeping sloppy applications running in high-traffic environments.