RFC 3986 considered harmful

October 26, 2008

Way back in the mists of ancient time when the web was young–around 1996–web browsers needed a mechanism for submitting data to servers. This mechanism was designed as a way to allow searches and similar operations.

Hence, the standards for URIs and URLs included a mechanism called query parameters. Consider the following example URL:

http://www.example.com/article/23?date=2008-10-20&author=mathew

The string after the question mark is the query, made up of a sequence of pairs of the form key=value, separated by ampersands. As RFC 1738 said:

An HTTP URL takes the form:

http://<host>:<port>/<path>?<searchpart>

where <host> and <port> are as described in Section 3.1. If :<port> is omitted, the port defaults to 80. No user name or password is allowed. <path> is an HTTP selector, and <searchpart> is a query string. The <path> is optional, as is the <searchpart> and its preceding "?". If neither <path> nor <searchpart> is present, the "/" may also be omitted.

The interpretation of the query was left to the HTTP specifications. RFC 2396, which updated RFC 1738, says:

The query component is a string of information to be interpreted by the resource.

This is important: the query component is not part of the address of the resource being requested. Rather, it is something to be passed to the resource identified by the rest of the URL. This is made explicit in many web APIs; for example, Apache puts the query in a separate variable from the path. Our example URL becomes:

PATH_TRANSLATED="/article/23"
QUERY_STRING="date=2008-10-20&author=mathew"
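The same separation shows up in most URL-handling libraries. As a minimal sketch, assuming Python's standard urllib.parse module (my choice for illustration, not something the RFCs or Apache prescribe), the example URL splits into a path and a query that are handled independently:

```python
from urllib.parse import urlsplit, parse_qs

# The example URL used throughout this article.
url = "http://www.example.com/article/23?date=2008-10-20&author=mathew"

parts = urlsplit(url)

# The path and the query come back as separate fields.
path = parts.path
query = parts.query

# The query can then be decoded into key/value pairs.
# parse_qs maps each key to a list, since a key may repeat.
params = parse_qs(query)

print(path)
print(query)
print(params)
```

Note that the library, like Apache, treats the query as data handed over alongside the path, not as part of it.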

Unfortunately, a lot of people who should have known better–including the developers of IBM Lotus Domino–decided to use the query syntax to encode things other than queries; they made query parameters part of the address of resources. For example, opening a form on a Domino server might require a URL like this:

http://www.example.com/db.nsf/Form?OpenForm&ParentUNID=6bc72a92613fd6bf852563de001f1a25

This misuse of query parameters caused problems. The most noticeable to Domino developers was that search engines would skip any link containing a question mark in the URL, as that was assumed to be a dynamic query. So Domino was quickly patched to allow exclamation marks to be used instead of question marks. However, for historical compatibility, Domino still generates inappropriate query URLs by default.

Perhaps because of this misuse of query parameters, the most recent revision of the URI spec has changed their definition. RFC 3986 says:

The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI’s scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.

In the common case of an HTTP GET request, the difference isn’t very important, search engines excepted. However, when you consider HTTP POST, the meaning of the request is changed. The HTTP specification RFC 2616 says (section 9.5):

The POST method is used to request that the origin server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line.

Consider our example URL again:

http://www.example.com/article/23?date=2008-10-20&author=mathew

As per the standards prior to RFC 3986, the resource is http://www.example.com/article/23.

According to RFC 3986, the resource is http://www.example.com/article/23?date=2008-10-20&author=mathew

So this seemingly trivial change in RFC 3986 has fundamentally changed the meaning of HTTP POST requests to any URL that includes a query parameter.
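To make the two readings concrete, here is a rough Python sketch. The function names resource_pre_3986 and resource_rfc3986 are hypothetical, invented for this illustration; they only encode the two interpretations described above and are not taken from any specification or library.

```python
from urllib.parse import urlsplit, urlunsplit

def resource_pre_3986(url):
    """Older reading: the query is data for the resource, so drop it."""
    s = urlsplit(url)
    return urlunsplit((s.scheme, s.netloc, s.path, "", ""))

def resource_rfc3986(url):
    """RFC 3986 reading: the query helps identify the resource, so keep it.
    Only the fragment is dropped."""
    s = urlsplit(url)
    return urlunsplit((s.scheme, s.netloc, s.path, s.query, ""))

url = "http://www.example.com/article/23?date=2008-10-20&author=mathew"

old_target = resource_pre_3986(url)
new_target = resource_rfc3986(url)

print(old_target)
print(new_target)
```

Under the first function a POST to the example URL targets the article; under the second it targets a distinct resource whose name happens to contain a date and an author.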

What’s more, the definition in RFC 3986 is ridiculous in the context of URL-encoded HTTP POST. If the query is part of the address of the resource you’re posting to, what are you posting? Where is the data? There is none. So POST with data in the URL no longer makes sense if you believe the new URI RFC.

Unfortunately, some people are now demanding that one should be able to HTTP POST with data of type application/x-www-form-urlencoded–that is, data in the body of the request–to a URL with a query in it. That, according to the new RFC, does make sense. It means incorporating one set of query data in the address of the resource, and posting a completely different set of data to the resource in the body of the HTTP request. I can see why one might want to do that, but the HTTP protocol wasn’t designed that way.
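As a sketch of what such a request hands to the server, consider the Python fragment below. The request target is the example URL from this article; the body string and the two handling policies are assumptions made up for illustration, not behavior mandated by any RFC.

```python
from urllib.parse import urlsplit, parse_qs

# Request line target, as it would appear in: POST <target> HTTP/1.1
request_target = "/article/23?date=2008-10-20&author=mathew"

# Hypothetical application/x-www-form-urlencoded request body.
body = "author=alice&comment=hello"

# Two independent sets of key/value data arrive with one request.
url_params = parse_qs(urlsplit(request_target).query)
body_params = parse_qs(body)

# Policy A (assumed): ignore the query entirely and use only the body.
body_only = dict(body_params)

# Policy B (assumed): merge both sets, letting body values win on conflict.
merged = {**url_params, **body_params}

print(body_only)
print(merged)
```

The two policies disagree about whether a date was supplied at all, and policy B silently discards the author named in the URL, which is the kind of ambiguity that arises when both channels carry data.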

Furthermore, the definition of resource addresses in the new RFC implicitly demands that if

POST http://www.example.com/article/23?date=2008-10-20&author=mathew

is a valid request, then

GET http://www.example.com/article/23?date=2008-10-20&author=mathew

must also be a valid request, a request to get the named resource that you just posted. This is not the case for most existing software.

There’s one more problem, and it’s a killer. The HTTP spec (RFC 2616) explicitly says (section 3.2.2):

The "http" scheme is used to locate network resources via the HTTP protocol. This section defines the scheme-specific syntax and semantics for http URLs.

http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

If the port is empty or not given, port 80 is assumed. The semantics are that the identified resource is located at the server listening for TCP connections on that port of that host, and the Request-URI for the resource is abs_path (section 5.1.2).

So the HTTP specification explicitly states that the query is not part of the address of the resource; and RFC 3986 explicitly states that it is.

In short: the latest HTTP RFC and the latest URI RFC are fundamentally incompatible, and the latest URI RFC is incompatible with the behavior of most existing software.

What this means is that if you do anything involving HTTP POST requests with body data sent to URIs that include query parameters, you can expect a world of pain. Some software will ignore the query parameters, some won’t. So don’t do it.

My view is that RFC 3986 is in error, and that its definition of query parameters should be ignored in favor of the definition in every prior standard, which is also the definition in common use.
