HTTP with Python – PycURL by Example

A colleague of mine recently remarked something along the lines of “whenever I need to do HTTP client stuff in any language, I usually go look for cURL bindings straight away”, and I’m starting to agree. HTTP is hard to do right, and cURL has a track record of doing HTTP right. If you don’t know what cURL is, take a look at http://curl.haxx.se

Facing some serious HTTP work in a recent little project, with Python as the language of choice, I naturally went looking for the most popular HTTP client implementation in Python. urllib2 seemed to be the most popular library in that domain. I started fiddling with it for a bit, and while I got more and more confused and annoyed (they tend to go hand in hand, don’t they?) with the API design of this particular module, I ran into an astonishing shortcoming: it doesn’t currently let you do HTTPS over a proxy. Seriously? That was the moment I hit Google to search for Python cURL bindings. I felt lucky, and this brought me to PycURL.

Using PycURL, I was able to implement my use cases in a snap. Then I wondered – why isn’t this module more popular among Python developers? The PycURL website already states the reasons – it doesn’t have a very Pythonic interface as it is a thin layer over libcurl (which is implemented in C), and it takes some knowledge of the cURL C API to put it to effective use. The first reason I can’t really help with, but I’ll be focusing on the second – gaining some knowledge of the underlying API. I’ll try to explain how this API works by implementing a number of use cases using PycURL.

The simplest use case possible would probably be to retrieve some content from a URL, so let’s get our feet wet.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://news.ycombinator.com')
c.perform()

This will perform a GET request on the given URL and spit out the contents of the response body to stdout. In case it isn’t obvious enough already: we instantiate a new Curl object, call setopt() on it to influence cURL’s behavior, and call perform() to execute the actual HTTP request. This is how using PycURL typically works: instantiate, set cURL options, perform.

We probably want to catch the output so we can do interesting things with it before dumping it to stdout, so here I’ll show you how to catch the output into a variable instead.

import pycurl
import cStringIO
 
buf = cStringIO.StringIO()
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://news.ycombinator.com')
c.setopt(c.WRITEFUNCTION, buf.write)
c.perform()
 
print buf.getvalue()
buf.close()

Here we’re using a string buffer for cURL to write the response to. By setting the WRITEFUNCTION option to the write method of our string buffer, the output is captured in the buffer once perform() is called. You can just as well use StringIO instead of cStringIO – the latter is simply faster. Again, pretty much any behavior is defined using setopt(). Need the request to go through a proxy, with connect and socket read timeouts added? Here goes:

import pycurl
import cStringIO
 
buf = cStringIO.StringIO()
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://news.ycombinator.com')
c.setopt(c.WRITEFUNCTION, buf.write)
c.setopt(c.CONNECTTIMEOUT, 5)
c.setopt(c.TIMEOUT, 8)
c.setopt(c.PROXY, 'http://inthemiddle.com:8080')
c.perform()
 
print buf.getvalue()
buf.close()

Not necessarily Pythonic, but pretty simple, right? But wait – how do we know which options we can use, and what they do? Easy: go to http://curl.haxx.se/libcurl/c/curl_easy_setopt.html, find the option you’re looking for, and set it using setopt(), minus the ‘CURLOPT_’ prefix (the PycURL module adds it for you, so you won’t get bored of typing ‘CURLOPT_’ all the time).
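
For instance, the option documented as CURLOPT_USERAGENT becomes USERAGENT in PycURL, available both as a module-level constant and as an attribute on the handle. A minimal sketch (the user agent string is just a made-up value):

```python
import pycurl

c = pycurl.Curl()
# CURLOPT_USERAGENT in the libcurl docs maps to USERAGENT here;
# pycurl.USERAGENT and c.USERAGENT refer to the same option constant
c.setopt(c.USERAGENT, 'my-client/1.0')
```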

We’ve just scratched the surface of what cURL can do. Here is how you perform a POST request.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://myfavpizzaplace.com/order')
c.setopt(c.POSTFIELDS, 'pizza=Quattro+Stagioni&extra=cheese')
c.perform()
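
As an aside: that query string can be built with the standard library instead of by hand. A minimal sketch (written to work on both Python 2 and 3; the fields are the ones from the example above):

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# A list of tuples keeps the field order deterministic;
# spaces and special characters are escaped for us
fields = urlencode([('pizza', 'Quattro Stagioni'), ('extra', 'cheese')])
# fields is now 'pizza=Quattro+Stagioni&extra=cheese', ready for POSTFIELDS
```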

By setting the POSTFIELDS option, the request automatically becomes a POST request. The POST data goes in the value of this option, in the form of a query string containing the variables to send; their values need to be URL-encoded (urllib.urlencode() helps in demanding cases). So you think something is going wrong, and you want to take a look at the raw request as it’s being carried out? The VERBOSE option will help you there.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://myfavpizzaplace.com/order')
c.setopt(c.POSTFIELDS, 'pizza=Quattro+Stagioni&extra=cheese')
c.setopt(c.VERBOSE, True)
c.perform()

Setting the VERBOSE cURL option will print verbose information to stdout, so you can see exactly what is going on – from setting up the connection to creating the HTTP request, to the headers and the response that comes back after the request. This is very useful while programming with cURL – maybe even more so with sequences of requests between which you want to preserve cookie state. This is actually fairly simple to implement; let’s take a look.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://myappserver.com/ses1')
c.setopt(c.COOKIEFILE, '')
c.setopt(c.VERBOSE, True)
c.perform()
 
c.setopt(c.URL, 'http://myappserver.com/ses2')
c.perform()

Let’s assume here that myappserver.com/ses1 initializes a session using a session cookie, and possibly sets other cookies as well, which are needed to proceed to myappserver.com/ses2 – so they need to be sent back to the server with the second request. What happens in the code above is interesting in the sense that we create a Curl object once, and use it to perform two requests. When used like this, all options set on the object stay in effect for subsequent requests, until they are overridden, for as long as the handle exists.

In this case we set the COOKIEFILE option, which is normally used to provide a path to a cookie file containing data to be sent as an HTTP Cookie header. However, we set its value to an empty string, which makes cURL cookie-aware: it will capture cookies and re-send them on subsequent requests. Hence, we can keep cookie state between requests on the same cURL handle intact. Since we use the VERBOSE option as well, the code above will show you exactly what happens during the requests.

This approach can be used to simulate, for example, login sessions or other flows through a web application that need multiple HTTP requests, without having to bother with maintaining cookie (and session cookie) state yourself. Performing multiple HTTP requests on one cURL handle also has the convenient side effect that the TCP connection to the host is reused when targeting the same host multiple times, which can obviously give you a performance boost. In production code you obviously want to set more options, e.g. for handling timeouts and providing some level of fault tolerance. Here’s a more extensive version of the previous example.

import pycurl
 
c = pycurl.Curl()
c.setopt(c.URL, 'http://myappserver.com/ses1')
c.setopt(c.CONNECTTIMEOUT, 5)
c.setopt(c.TIMEOUT, 8)
c.setopt(c.COOKIEFILE, '')
c.setopt(c.FAILONERROR, True)
c.setopt(c.HTTPHEADER, ['Accept: text/html', 'Accept-Charset: UTF-8'])
try:
    c.perform()
 
    c.setopt(c.URL, 'http://myappserver.com/ses2')
    c.setopt(c.POSTFIELDS, 'foo=bar&bar=foo')
    c.perform()
except pycurl.error, error:
    errno, errstr = error
    print 'An error occurred: ', errstr

In this example we explicitly specify timeouts (which you should always do in real-world situations), set some custom HTTP headers, and set the FAILONERROR cURL option to make cURL fail when an HTTP error code of 400 or greater is returned. PycURL raises an exception in that case, which allows you to deal with such situations gracefully.

This should be enough information to get you started on using cURL through Python. There’s a whole world of functionality inside cURL that you can use; for example, one of its very powerful features is the ability to execute multiple requests in parallel, which in PycURL is done using the CurlMulti object (http://pycurl.sourceforge.net/doc/curlmultiobject.html). I hope you’ll agree that using PycURL is fairly simple, really powerful and pretty fast.

23 thoughts on “HTTP with Python – PycURL by Example”

  1. @Oren: Yes I know Requests, I knew this one would come up :) It’s certainly more pythonic than PycURL, but I’m not sure if it’s more powerful or even that much easier to use. Like I said in the post, HTTP is hard to get right and I guess I put some trust in a library that has a track record like cURL has.

  2. If you like requests and curl, you should check out human_curl on github. It supports more use cases than requests, with the same syntax, and relies on curl for HTTP awesomeness.

  3. PycURL absolutely rocks. I played with urllib, urllib2, httplib and everything possible before I used PycURL.

    Each of the native Python options works, but they are a real mess when it comes to lower versions of Python. The HTTPS support in particular sucks in Python.

  4. I’ve just moved from requests to pycurl.
    The customer wanted the info supplied by curl, which he had found in the PHP library.

    At first glance, pycurl causes about 10 times less system load than requests. I just call them from apscheduler. On the other hand, I’m actually having problems with memory protection faults. I haven’t tried to create the pool with multi.

    Kacper

  5. Very nice example. Although I’ve been working with Python for the last 3 years or so, I have never touched PycURL. Will give it a spin this weekend.

    Nice work.

  6. Hi,

    Suppose I want to post values from variables, say

    x = 10

    y = 100

    Now I want to post those values – how can I do that?

  7. Thanks so much for this tutorial!! One thing I’ve been unable to find anywhere is how to make PycURL use the credentials from the store… for example:

    $ curl https://some.secure.site.com/ -u : --negotiate -k

    This works awesome, but I can’t figure out how to do it in PycURL… so far, I’ve got this:

    c = pycurl.Curl()
    c.setopt(c.URL, 'https://some.secure.site.com')
    c.setopt(c.SSL_VERIFYPEER, False)
    ......
    c.perform()

    Thanks so much!!

  8. I use .curlrc for authentication credentials when using curl (for example: -H "Password:XXXXX").

    How do I force pycurl to read .curlrc?

  9. Hi

    Nice tut :)

    For some reason HEADER doesn’t seem to work. I’ve got the below, but I still get the entire page returned in the buf. Any ideas?

    #!/usr/bin/python

    import pycurl
    import cStringIO

    buf = cStringIO.StringIO()

    c = pycurl.Curl()

    c.setopt(c.HEADER, 1)
    c.setopt(c.FOLLOWLOCATION, 1)
    c.setopt(c.URL, 'http://www.google.com')
    c.setopt(c.PROXY, 'http://192.168.1.64:8080')
    c.setopt(c.WRITEFUNCTION, buf.write)

    c.perform()

    print buf.getvalue()
    buf.close()

  10. Hi,

    Make sure you’re using the NOBODY cURL option when you want to retrieve the headers only. So adding:

    c.setopt(c.NOBODY, 1)

    should do the trick.

    HTH.

  11. Thanks objectified,

    That worked a treat! Now I’m having an issue with SSL sites. I know there is a --insecure option for curl, but I can’t seem to find any way to use this in pycurl. Any ideas?

    Much appreciated

  12. Hi,

    In order to ignore SSL verification, you can turn off the SSL_VERIFYPEER option. Like this:

    c.setopt(c.SSL_VERIFYPEER, 0)

    I wouldn’t recommend it as it breaks security somewhat, but there are situations in which you might need this.

    HTH.

  13. @Tony: that’s in the tutorial – read the part where the write function of a StringIO buffer is passed while setting the WRITEFUNCTION pycurl option.
