Europython 2009 handout material

The following is the handout material for my talk at Europython 2009. It was generated from an rst file, which is also available on the Europython Wiki.

Improving client-side HTTP

Author: Ivo Timmermans
Date: June 2009
Info: http://www.treparel.com/
Description: Handout for the EuroPython 2009 talk
Organization: Treparel Information Solutions B.V.

Abstract

The current functionality of the Python standard library in supporting the client side of HTTP communication is rather rudimentary (httplib, http.client in Python3 and urllib). Furthermore, we are not aware that any external libraries exist that support our use cases. Via an introductory talk and further discussion, we hope to be able to define a roadmap for the future of HTTP clients written in Python. First, we will define our requirements, then we will explain the shortcomings of both the internal and external solutions, and finally, we hope to create a lively discussion about the future of HTTP clients.

1   Introduction

1.1   Client-side HTTP in Python

There are thousands of web frameworks that are written in Python: Django, Mochikit, etc. There are many web application servers: SkunkWeb, CherryPy, etc. 28% of the talks in this conference are about the server side in some way or other, so this is probably a forte of Python. But if the server side is so well represented, how come the client side is so badly represented?

Of course, if all you need to do is retrieve a document, there are plenty of solutions: urllib, http.client, PycURL, etc. None of them support more complex use cases very well.

1.2   Outline of the presentation

  1. First, we will tell a little about our application. Knowing our requirements is important to understand why the current solutions are inadequate.
  2. Then, we will discuss the current state of the Python standard libraries and some external libraries. We will not cover every possible implementation there is, because there are quite a lot of them. The discussion will involve some code fragments.
  3. Then, we will quickly outline next steps that could be taken, and future plans from our side.
  4. Finally, this session ends in an interactive discussion, where there is room for questions and wild ideas. If there is a need for it, discussion will continue in the open space.

By the way, if, at some point during the presentation, you know that some package or module exists somewhere that does what we need, please shout it out!

2   About Treparel

I work for Treparel Information Solutions B.V. [9] in the Netherlands. This is a small company that creates high-performance data mining software.

For those that don’t know, data mining is the process of discovering patterns in data you already have. For example, our application works on Oracle and other databases (thanks to SQLAlchemy), but the data in those databases is not gathered by us but by some other, external, process. We’re talking about mining huge amounts of data; our customers are mostly big multinational companies.

Being a high-performance software package brings out some challenging and interesting architectural decisions, but discussing that is far beyond the scope of this presentation. The gist of it is that the server is usually some heavyweight server (blades) running on Linux, with Windows clients.

2.1   Client

We have written a Python client that talks to a Python server via HTTP. However, HTTP in a corporate environment means that you have to deal with proxies, authentication (Basic, Kerberos, NTLM), SSL or not, IP-whitelisting or not, etc.

Usually, we sell the software as a complete, standalone product, sometimes as a SaaS (Software as a Service) or PaaS (Platform as a Service) solution. The standalone product has to integrate with the corporate environment (think: file/backup policies, database connections, user authentication and authorization).

The client application (running on the user’s Windows desktop) connects to the server. The "service" solution must connect to the internet: proxies with authorization.

The client application is the visualization and data management tool that users see. This is programmed in Python, using (amongst others): py2exe for distribution, PyQt for the user interface, pyOpenGL for the visualizations, some custom C++ libraries for data management on top of Qt.

2.2   Server

Mostly Python, some key parts in C++ using Boost-Python.

Some parts, e.g. the calculation and text processing pipelines, have been written in C++. The wrapping of this code is done using Boost-Python.

The server uses a custom HTTP/1.1 server because we wanted a small, extensible server that can at the same time manage processes. The solution involves some POSIX message queue magic.

2.3   Protocol

2.3.1   (Ab)using HTTP

The communication is mostly XML-RPC over HTTP, although at some point we may switch to protobuf (Google), Thrift (Facebook), or some such. This is undecided as of now; in the meantime we will stick with XML-RPC and pickle.

Apart from doing XML-RPC over HTTP (using POST requests), we have additional methods using e.g. GET for file transfer from the server to the client, and PUT for file transfer from the client to the server. We use OPTIONS to figure out what capabilities the server has.

The semantics of an XML-RPC call will not vary much when changing the serialization from XML-RPC to e.g. protobuf: probably only the Content-Type header has to be changed.
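
As a minimal sketch of this point, using only the standard library: the same call is serialized either as XML-RPC or as a pickle, and the only parts of the HTTP request that change are the body and the Content-Type header. The function name encode_call, the method name list_projects and the application/x-pickle media type are assumptions made for illustration, not part of our actual protocol.

```python
import pickle
import xmlrpc.client


def encode_call(method, params, serialization="xmlrpc"):
    """Return (headers, body) for a POST request carrying one remote call."""
    if serialization == "xmlrpc":
        text = xmlrpc.client.dumps(tuple(params), methodname=method)
        body = text.encode("utf-8")
        content_type = "text/xml"
    elif serialization == "pickle":
        # Hypothetical media type; only unpickle data from trusted peers.
        body = pickle.dumps((method, tuple(params)))
        content_type = "application/x-pickle"
    else:
        raise ValueError("unknown serialization: %r" % serialization)
    headers = {"Content-Type": content_type, "Content-Length": str(len(body))}
    return headers, body
```

The calling code stays the same for both serializations; only the serialization argument differs.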

2.3.2   Access patterns

Access patterns define what kind of solution can be used. As we will see, each of these access patterns requires different things from a client library.

The application uses several types of calls to the server. This is normal and can be expected: each type supports some use case.

  • Lots of calls can be made using the application:
    • User interface components: showing a list of projects, objects
    • Data transfer: upload data sets to work on, download result data
    • Data management: removing uploaded files
    • Job scheduling: "I want to execute this complex query, and I will retrieve the results later"

The result is that we have a protocol that is mostly transaction-based: return a transaction id immediately, and have a method to check whether the transaction has ended. The result of this call also includes the result data of the method call.
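
The transaction pattern can be sketched as follows. FakeServer is an in-memory stand-in invented for this example; in the real application, start and poll would be remote calls, and their names and return values here are assumptions.

```python
import itertools
import time


class FakeServer:
    """In-memory stand-in; a job finishes after `polls_needed` polls."""

    def __init__(self, polls_needed=3):
        self._ids = itertools.count(1)
        self._jobs = {}
        self._polls_needed = polls_needed

    def start(self, query):
        """Begin a job and return its transaction id immediately."""
        tid = next(self._ids)
        self._jobs[tid] = [self._polls_needed, query]
        return tid

    def poll(self, tid):
        """Return (finished, result); result is None until the job has ended."""
        job = self._jobs[tid]
        job[0] -= 1
        if job[0] > 0:
            return False, None
        return True, "result of %s" % job[1]


def wait_for_result(server, tid, interval=0.01, timeout=5.0):
    """Poll until the transaction has ended, then return its result data."""
    deadline = time.monotonic() + timeout
    while True:
        finished, result = server.poll(tid)
        if finished:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError("transaction %d did not finish" % tid)
        time.sleep(interval)
```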

Some methods, however, return a data id instead of a direct value. This is the case with methods that return a lot of data, for example: visualization data, that includes vertices, annotation data, coloring information, region specifications, instructions on e.g. how to zoom in and whether more detail is available. The latter is necessary to be able to supply details if the user requests them: the visualizations are multi-resolution.

The result data can be large, ranging from 1kB to 2GB. Of course not every result is 2 gigabytes, but this is approximately the range of sizes we’re talking about. Even if the data size is only 500MB, you still don’t want to have all of it in memory if you can prevent it. Our situation is further complicated because we transmit it pickled, meaning that we need to have the full state of the state machine (and the machine memoizes pretty much everything).

In the future, we may want to support even larger data set sizes. These cannot be transmitted right now, because of the Windows 2GB address space restriction (and sanity). We will break the response into smaller chunks, and transmit these chunks.
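
One possible way to plan such chunks is sketched below. It assumes that a result can be addressed by byte offset and length, which is an assumption of this example and not a description of the current protocol; chunk_ranges is a hypothetical helper.

```python
def chunk_ranges(total_size, chunk_size):
    """Yield (offset, length) pairs that together cover total_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, total_size, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield offset, min(chunk_size, total_size - offset)
```

Each pair could then be requested and processed separately, so that no more than chunk_size bytes need to be in memory at once.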

3   Our requirements

First off, the mandatory document to read here, is RFC 2616, which defines HTTP version 1.1. This is the basis for our communication between the client and server. The standard describes a lot of functionality that we won’t use, but it did serve as the basis for the implementation of SpitFire.

3.1   Use of the API

Understand body in file-like objects
Instead of requiring the client to retain the entire request body in memory, allow it to pass a file-like object (i.e. with a suitable read method) as the body of the request.
Use file-like objects for server response
The converse is also true: when reading enormous amounts of data in the response from the server, it’s never wise to keep that in memory.
Authentication schemes
It must be possible to implement other authentication schemes besides Basic (think: Digest, Kerberos, NTLM). Whether or not these implementations are part of the library doesn’t matter. (See RFC 2617 for Basic and Digest mechanisms.)
Dynamic authentication
We do not know beforehand what the authentication will be; for e.g. Kerberos or Basic authentication, quite different values need to be passed, but only after we get our first response with a WWW-Authenticate header.
Server SSL certificate validation
When the server presents a security certificate, the application should be able to ask the user to verify and validate it.
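
The dynamic authentication requirement can be illustrated with a small sketch: the scheme is chosen only after the 401 response arrives, based on its WWW-Authenticate headers. The helper names and the handlers mapping are hypothetical. The parsing is deliberately simplified and assumes one challenge per header value. "Negotiate" is the scheme name commonly used for Kerberos over HTTP.

```python
import base64


def offered_schemes(www_authenticate_values):
    """Return lower-cased scheme names from WWW-Authenticate header values.

    Simplification: assumes one challenge per header value; RFC 2617 also
    allows several comma-separated challenges in a single header.
    """
    schemes = []
    for value in www_authenticate_values:
        parts = value.strip().split(None, 1)
        if parts:
            schemes.append(parts[0].lower())
    return schemes


def choose_scheme(www_authenticate_values, handlers, preference):
    """Pick the most preferred scheme that is both offered and implemented."""
    offered = offered_schemes(www_authenticate_values)
    for scheme in preference:
        if scheme in offered and scheme in handlers:
            return scheme
    return None


def basic_credentials(user, password):
    """Build the Authorization header value for the Basic scheme."""
    raw = ("%s:%s" % (user, password)).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

Other schemes would be added by registering more entries in the handlers mapping, without touching the selection logic.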

3.2   HTTP protocol quirks

Keepalive
The client must maintain an open socket until it is explicitly closed by either the calling code or the server. This should be done by default when using HTTP 1.1; otherwise, heed the contents of the Connection response header.
HTTP/0.9 support
Seriously, how many non-trivial HTTP/0.9 servers are in use at this time?
Other commands than GET
The client must be able to do other HTTP commands, such as PUT or OPTIONS. The library must understand the functional difference between these request methods, and never blindly assume that a redirect changes the request method. This may be so with POST, but definitely not with PUT. Another example is that OPTIONS may or may not have a body in both the request and the response.
Use and understand chunked encoding
HTTP 1.1 clients are required to be able to receive chunked encoded messages from the server. We must be able to send chunked encoded messages to the server as well.
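
Sending a chunked request body from a file-like object could look like the sketch below. It follows the chunk format of RFC 2616 (size in hexadecimal, CRLF, data, CRLF, terminated by a zero-sized chunk) and sends no trailer headers; encode_chunked is a hypothetical helper name.

```python
import io


def encode_chunked(fileobj, block_size=8192):
    """Yield the bytes of an HTTP/1.1 chunked body read from a file-like object."""
    while True:
        block = fileobj.read(block_size)
        if not block:
            break
        # Each chunk: size in hexadecimal, CRLF, data, CRLF.
        yield b"%x\r\n" % len(block) + block + b"\r\n"
    # A zero-sized chunk ends the body; no trailer headers are sent here.
    yield b"0\r\n\r\n"
```

Because this is a generator, at most block_size bytes of the body are held in memory at any time.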

3.3   Understanding the server

Redirect handling
The server sends redirect status codes in response to various situations; sometimes the request body needs to be resent to the new location (e.g. in case of PUT).
Heed 100 Continue status
We explicitly ask the server for a 100 Continue response using the Expect: 100-continue header in our request. The client library must signal the application when we are ready to send the request body.
SSL
The client must be able to talk to the server over SSL or TLS.
Client SSL certificate authentication
The client must be able to authenticate itself to the server using an SSL certificate.
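
The 100 Continue exchange can be sketched on a bare socket as follows. The request head is sent first, and the body follows only after the interim response has arrived. All function names are hypothetical, fake_server is a toy peer for demonstration only, and a real client would also need a timeout so that it does not wait forever for a server that never sends 100 Continue.

```python
import socket
import threading


def read_status(reader):
    """Read one response head from a binary file object; return the status code."""
    status_line = reader.readline().decode("latin-1")
    while reader.readline() not in (b"\r\n", b"\n", b""):
        pass  # skip the headers of this response
    return int(status_line.split()[1])


def put_with_expect(sock, path, body):
    """Send a PUT with Expect: 100-continue; send the body only when invited."""
    head = (
        "PUT %s HTTP/1.1\r\n"
        "Host: example.invalid\r\n"
        "Content-Length: %d\r\n"
        "Expect: 100-continue\r\n\r\n" % (path, len(body))
    ).encode("latin-1")
    sock.sendall(head)
    reader = sock.makefile("rb")
    status = read_status(reader)
    if status == 100:
        # This is the point where the application would be signalled.
        sock.sendall(body)
        status = read_status(reader)
    return status


def fake_server(sock, received):
    """Toy peer: answers 100 Continue, reads the body, then replies 201."""
    reader = sock.makefile("rb")
    length = 0
    reader.readline()  # request line
    while True:
        line = reader.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("latin-1").partition(":")
        if name.strip().lower() == "content-length":
            length = int(value)
    sock.sendall(b"HTTP/1.1 100 Continue\r\n\r\n")
    received.append(reader.read(length))
    sock.sendall(b"HTTP/1.1 201 Created\r\nContent-Length: 0\r\n\r\n")
```

To try it out, connect the two functions with socket.socketpair() and run fake_server in a thread.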

3.4   Managing server connections

Communication via a proxy
The library must be able to transparently talk to an HTTP proxy. This proxy may require authentication, SSL certificate validation, etc., as with a plain request directly to the origin server, and the library must be able to handle it all. The main problem is acquiring proxy configuration for all the platforms that Python supports.
Communication over unix domain sockets
For unit testing purposes, we use unix domain sockets (AF_UNIX) to communicate to an embedded test server. It would be nice if the library supported this somehow.
Multiple simultaneous open connections to the server
Multiple connections will be used, and preferably the library should offer some control over which connection is used for a request (for example, by placing requests on a specific instance, and coupling instances to a socket).
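
For the unix domain socket case, one workable approach with http.client is to override connect(), as sketched below. The class name UnixHTTPConnection is made up for this example, and socket.AF_UNIX is generally only available on POSIX platforms.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to an AF_UNIX socket path instead of TCP."""

    def __init__(self, socket_path, timeout=10.0):
        # The host name is only used for the Host header here.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        """Open the unix domain socket; called automatically on first use."""
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock
```

Since each instance is tied to exactly one socket, this also gives the calling code control over which connection carries which request.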

3.5   Library usage

Cross-platform
Of course, the code must work on all platforms (win32, linux x86, linux x86-64, linux itanium, solaris, hp-ux). This should not be much of an issue, but watch out for request size limits on 32-bit platforms.
Pythonic
We’re writing Python code, and code using the library should preferably "look and feel" like python. What the library is made of doesn’t matter.

4   Other implementations

4.1   http.client (formerly httplib)

The library requires you to explicitly pass the request method, so in theory it supports them all. However, the logic appears to have native support only for GET and POST.

http.client always sends the entire request in one go, completely ignoring 100 Continue statuses in the response. To support 100 Continue, 68 lines in the HTTPResponse class need to be duplicated, as shown in figure 1. The image is for the HTTPResponse class found in httplib (Python 2.5). However, this class changed very little in py3k (checked against Python 3.1 alpha), so the situation hasn’t changed at all.

Figure 1: The code of httplib.HTTPResponse that needs to be duplicated to support 100 Continue; the area in red is the code that’s actually different.

4.2   urllib (formerly urllib2)

urllib is basically a chain of handlers for response status codes, built on top of http.client. This means that most of the limitations of http.client are shared.

The way to get urllib to execute a request other than GET or POST is to make get_method of the Request instance return the method you want. For example:

request.get_method = lambda: 'PUT'

This kind of works; however, we use a redirect (301) from a general resource to a specific resource in response to the PUT (this is explicitly described as a possibility in section 9.6 of RFC 2616). This is not handled gracefully by urllib, which always discards the body and changes the request to GET in response to a redirect response. This is a valid assumption if you’re working with the POST-redirect-GET pattern.

The proper way to handle this for PUT however, is to create a redirect handler, and insert that in the handlers chain below the original redirect handler. Re-implementing this feature requires us to duplicate some of the code in HTTPRedirectHandler.
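
A minimal sketch of such a handler, written against Python 3's urllib.request, is given below. The class name PutRedirectHandler and the choice of status codes are assumptions of this example. It relies on the method argument of Request (available since Python 3.3) and assumes the body is a bytes object, because a file-like body would have to be rewound before it could be sent again.

```python
import urllib.request


class PutRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that re-sends a PUT, body included, to the new location."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        if req.get_method() == "PUT" and code in (301, 302, 307):
            # Assumes req.data is bytes; see the note above about file-like bodies.
            return urllib.request.Request(
                newurl,
                data=req.data,
                headers=req.headers,
                origin_req_host=req.origin_req_host,
                unverifiable=True,
                method="PUT",
            )
        # Everything else keeps the standard behaviour.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


# Passing an instance of a subclass replaces the default redirect handler.
opener = urllib.request.build_opener(PutRedirectHandler())
```

Requests opened through opener.open() would then keep the PUT method and body across these redirects, while GET and POST redirects behave as before.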

4.3   httplib2

httplib2 is a project to improve some of the shortcomings of httplib. The website lists project goals, and the most important (to us) is that it supports Keep-Alive. The rest of the library seems to be focused more on caching strategies rather than advanced HTTP usage, but it does acknowledge that there may be other use cases.

However, it uses the socket module’s builtin SSL support. It only provides some authentication mechanisms and no extension is possible, so no Kerberos authentication at this time. It has no direct support for 100 Continue.

4.4   bzrlib’s _urllib2_wrappers.py

Bazaar has quite similar goals to us: transfer possibly large amounts of data in several requests over the same socket connection, supporting outlandish authentication and network setups. They have written a series of connect handlers on top of urllib2 to manage this. For example, they have implemented proper redirect handling.

However, one of the developers has expressed the desire to re-implement this in a different library, should it become available; mostly because the existing code is a mess.

4.5   PycURL

Even though the API is not very pythonic, this library is also usable for other languages.

Its documentation is complete, but rather unstructured because everything is done through setopt and getinfo. But it’s nice in that it allows you to pass custom read and write methods for the body. In general, it offers very detailed control over the messages, but you need to understand HTTP.


4.6   Side-by-side comparison

Library            http.client  PycURL  urllib
Keepalive          x            v       x [7]
Commands           x [4]        v [1]   y
Chunked            x [5]        v       x [7]
100-continue       x            v       x [7]
File-objects       x            v       x [7]
Redirect           x [6]        v       v
Authentication     x            v       y
Dynamic auth       x            v [3]   z
SSL                z            u       z [7]
Verify cert        z            u       u
Client cert        x            u       x
Proxy              x            u       v [8]
Unix sockets       x            x       x
Multi-connections  x            v       x
Cross-platform     v            v       v
Easy               yes          yes     yes
Control            no           yes     some
Pythonic           a bit        no      yes

Key:

v: The feature is available and works as desired.
x: The feature is not available.
y: The feature is not available, but it can easily be enabled.
z: The feature is not available, but it can be enabled with some external code and monkey-patching, but without involvement in the internals of the library.
?: Unknown if the feature is supported.
u: Untested, but support is claimed.

Notes:

[1] PycURL does allow you to send a custom request method, but it only knows about GET, HEAD, POST and PUT [2]. Other custom request methods can be used, but in that case, the more fine-grained control of the cURL options must be used.
[2] PUT can only be used to upload a file, and its behavior is fixed.
[3] PycURL will set the http status code to 401, and set the flags of getinfo(HTTPAUTH_AVAIL) to the list of supported mechanisms.
[4] Custom commands are possible, but in that case, the request must be entirely built up by hand.
[5] Chunked encoding is supported for the server response, but not for the sent request.
[6] Redirects are followed, but the request body is always discarded, even with PUT.
[7] Ultimately, urllib uses http.client to do the requests and to read the response. Therefore, urllib shares many of the limitations of http.client.
[8] urllib has several hooks for proxy authentication, but allowing them to work fully takes some effort.

5   Past efforts

PEP 268 is about authorization and WebDAV. It has been rejected:

This PEP has been rejected. It has failed to generate sufficient community support in the six years since its proposal.

Let’s not have that happen here! The issue we have with the existing http.client is that it’s difficult to re-use if more functionality is required, not necessarily that it doesn’t support our use cases.

6   Discussion

  • Any suggestions for other existing client-side HTTP libraries?
  • Do other programming languages suffer from lacking libraries? (PHP does, at least…)
  • Are there any libraries for other languages we can build on?
  • Do we need to design and build a new library?
  • Is it better to start from scratch, or to reuse or extend http.client? httplib2? urllib?

7   References and footnotes

  • RFC 2616 – Hypertext Transfer Protocol — HTTP/1.1
  • RFC 2617 – HTTP Authentication: Basic and Digest Access Authentication
  • Bazaar version control system homepage
[9] The Treparel website contains a little more information about the targeted users and the product.