
Random thought

November 11, 2008

What is the shortest regular expression that matches only itself, including the / delimiters?

Filed under: Programming | Comments (1)

Java SSL/HTTPS via JSSE: Write once, run everywhere?

November 4, 2008

A common Java problem is to connect to an authenticated Web Service via HTTPS. Doing so while preserving portability can be tricky.

There are a lot of helpful tutorials out on the web that say to do something like this:

Security.addProvider(new com.sun.net.ssl.internal.ssl.Provider());
System.setProperty("java.protocol.handler.pkgs","com.sun.net.ssl.internal.www.protocol");
final MyAuthenticator auth = new MyAuthenticator(username, password);
Authenticator.setDefault(auth);
try {
  final URL url = new URL(httpsurl);
  try {
    final HttpURLConnection urlc = (HttpURLConnection) url.openConnection();
    try {
      // Rest of code

However, if you try to deploy the code on a system using IBM’s JVM, you’ll get a rude surprise:

Exception in thread "main" java.lang.NoClassDefFoundError: com.sun.net.ssl.internal.ssl.Provider
        at java.lang.J9VMInternals.verifyImpl(Native Method)
        at java.lang.J9VMInternals.verify(J9VMInternals.java:72)
        at java.lang.J9VMInternals.initialize(J9VMInternals.java:134)
        [...]

In fact, the addProvider and setProperty stuff is completely unnecessary. You can simply do:

final MyAuthenticator auth = new MyAuthenticator(username, password);
Authenticator.setDefault(auth);
try {
  final URL url = new URL(httpsurl);
  try {
    final HttpsURLConnection urlc = (HttpsURLConnection) url.openConnection();
    try {
      // Rest of code

The key detail is the type of the connection object: HttpsURLConnection, not HttpURLConnection. The class to use is javax.net.ssl.HttpsURLConnection, rather than the com.sun.* class Eclipse may offer you. Your code will then run on IBM JVMs as well as the Sun JVM and OpenJDK.
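For illustration, here is a minimal self-contained sketch of the portable approach. The class names PortableHttps and BasicAuthenticator are made up for this example (BasicAuthenticator is a hypothetical stand-in for MyAuthenticator, whose source isn't shown here), and the URL and credentials are placeholders. Only java.* and javax.* classes are used, so nothing depends on a particular vendor's JVM.

```java
import java.io.IOException;
import java.net.Authenticator;
import java.net.PasswordAuthentication;
import java.net.URL;
import java.net.URLConnection;
import javax.net.ssl.HttpsURLConnection;

final class PortableHttps {

    /** Hypothetical stand-in for MyAuthenticator: hands back fixed credentials. */
    static final class BasicAuthenticator extends Authenticator {
        private final String username;
        private final char[] password;

        BasicAuthenticator(String username, String password) {
            this.username = username;
            this.password = password.toCharArray();
        }

        @Override
        protected PasswordAuthentication getPasswordAuthentication() {
            return new PasswordAuthentication(username, password.clone());
        }
    }

    /** Creates (but does not yet connect) an HTTPS connection using only standard classes. */
    static HttpsURLConnection open(String httpsUrl, String username, String password)
            throws IOException {
        Authenticator.setDefault(new BasicAuthenticator(username, password));
        // openConnection() only builds the connection object; no network I/O happens here.
        URLConnection conn = new URL(httpsUrl).openConnection();
        if (!(conn instanceof HttpsURLConnection)) {
            throw new IOException("Not an https URL: " + httpsUrl);
        }
        return (HttpsURLConnection) conn;
    }

    public static void main(String[] args) throws IOException {
        HttpsURLConnection conn = open("https://www.example.com/", "user", "secret");
        // The concrete class is vendor-specific; the code only relies on the javax.* type.
        System.out.println(conn.getClass().getName());
        conn.disconnect();
    }
}
```

The instanceof check is a design choice for the sketch: it turns a mistyped http:// URL into a clear IOException instead of a ClassCastException.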

Of course, if you’re getting the exception from someone else’s code, you might need to get them to fix it. Because of the prevalence of incorrect examples, it’s quite common to find the error in frameworks and libraries.

Filed under: Java | Comments (0)

RHEL mystery

November 3, 2008

# rpm -qa | sort | head
062845dae2b345a49371764d56c35c11-1.0-1
0badaf030552acad2c866faa9a9b7041-1.0-1
0ec92291e1c987d498b8127186f2de9b-1.0-1
10ff063ad03a4993d1be5545c5685c8f-1.0-1
1101e7ff1501ff09f8f99f96fac79f7a-7.00-1
115f65330febf7f2d8958be28a4889a1-1.0-1
13b480741c84e31718fb0a4c68c4a33f-1.0-1
15c7eaa92e96476ed2412ac5141fecdd-7.00-1
15e441e2305aae6f02da4be7da455c3b-1.0-1
193d1ac50dcc41117bfa5d3e09a0d94e-1.0-1
 

What are these mysterious nameless packages? There seem to be about 70 of them.

Filed under: Linux | Comments (0)

RFC 3986 considered harmful

October 26, 2008

Way back in the mists of ancient time when the web was young–around 1996–web browsers needed a mechanism for submitting data to servers. This mechanism was designed as a way to allow searches and similar operations.

Hence, the standards for URIs and URLs included a mechanism called query parameters. Consider the following example URL:

http://www.example.com/article/23?date=2008-10-20&author=mathew

The string after the question mark is the query, made up of a sequence of values of the form key=value, separated by ampersands. As RFC 1738 said:

An HTTP URL takes the form:

http://<host>:<port>/<path>?<searchpart>

where <host> and <port> are as described in Section 3.1. If :<port> is omitted, the port defaults to 80. No user name or password is allowed. <path> is an HTTP selector, and <searchpart> is a query string. The <path> is optional, as is the <searchpart> and its preceding "?". If neither <path> nor <searchpart> is present, the "/" may also be omitted.

The interpretation of the query was left to the HTTP specifications. RFC 2396, which updated RFC 1738, says:

The query component is a string of information to be interpreted by the resource.

This is important: the query component is not part of the address of the resource being requested. Rather, it is something to be passed to the resource identified by the rest of the URL. This is made explicit in many web APIs; for example, Apache puts the query in a separate variable from the path. Our example URL becomes:

PATH_TRANSLATED="/article/23"
QUERY_STRING="date=2008-10-20&author=mathew"
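The same separation shows up in Java's standard library. As a small sketch (the class name UriParts and the helper queryParameters are invented for this example, and URLDecoder.decode with a Charset argument assumes Java 10 or later), java.net.URI hands back the path and the query as separate components, and the query can then be split into key=value pairs:

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class UriParts {

    /** Splits a query of the form key=value&key=value into an ordered map. */
    static Map<String, String> queryParameters(URI uri) {
        Map<String, String> params = new LinkedHashMap<>();
        String rawQuery = uri.getRawQuery();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            // Split on the first '=' only, so values may themselves contain '='.
            String[] kv = pair.split("=", 2);
            String key = URLDecoder.decode(kv[0], StandardCharsets.UTF_8);
            String value = kv.length > 1
                    ? URLDecoder.decode(kv[1], StandardCharsets.UTF_8)
                    : "";
            params.put(key, value);
        }
        return params;
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://www.example.com/article/23?date=2008-10-20&author=mathew");
        System.out.println(uri.getPath());        // /article/23
        System.out.println(uri.getQuery());       // date=2008-10-20&author=mathew
        System.out.println(queryParameters(uri)); // {date=2008-10-20, author=mathew}
    }
}
```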

Unfortunately, a lot of people who should have known better–including the developers of IBM Lotus Domino–decided to use the query syntax to encode things other than queries; they made query parameters part of the address of resources. For example, opening a form on a Domino server might require a URL like this:

http://www.example.com/db.nsf/Form?OpenForm&ParentUNID=6bc72a92613fd6bf852563de001f1a25

This misuse of query parameters caused problems. The most noticeable to Domino developers was that search engines would skip any link containing a question mark in the URL, as that was assumed to be a dynamic query. So Domino was quickly patched to allow exclamation marks to be used instead of question marks. However, for historical compatibility, Domino still generates inappropriate query URLs by default.

Perhaps because of this misuse of query parameters, the most recent revision of the URI spec has changed their definition. RFC 3986 says:

The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI’s scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.

In the common case of an HTTP GET request, the difference isn’t very important, search engines excepted. However, when you consider HTTP POST, the meaning of the request is changed. The HTTP specification RFC 2616 says (section 9.5):

The POST method is used to request that the origin server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line.

Consider our example URL again:

http://www.example.com/article/23?date=2008-10-20&author=mathew

As per the standards prior to RFC 3986, the resource is http://www.example.com/article/23.

According to RFC 3986, the resource is http://www.example.com/article/23?date=2008-10-20&author=mathew

So this seemingly trivial change in RFC 3986 has fundamentally changed the meaning of HTTP POST requests to any URL that includes a query parameter.

What’s more, the definition in RFC 3986 is ridiculous in the context of URL-encoded HTTP POST. If the query is part of the address of the resource you’re posting to, what are you posting? Where is the data? There is none. So POST with data in the URL no longer makes sense if you believe the new URI RFC.

Unfortunately, some people are now demanding that one should be able to HTTP POST with data of type application/x-www-form-urlencoded–that is, data in the body of the request–to a URL with a query in it. That, according to the new RFC, does make sense. It means incorporating one set of query data in the address of the resource, and posting a completely different set of data to the resource in the body of the HTTP request. I can see why one might want to do that, but the HTTP protocol wasn’t designed that way.

Furthermore, the definition of resource addresses in the new RFC implicitly demands that if

POST http://www.example.com/article/23?date=2008-10-20&author=mathew

is a valid request, then

GET http://www.example.com/article/23?date=2008-10-20&author=mathew

must also be a valid request, a request to get the named resource that you just posted. This is not the case for most existing software.

There’s one more problem, and it’s a killer. The HTTP spec (RFC 2616) explicitly says (section 3.2.2):

The "http" scheme is used to locate network resources via the HTTP protocol. This section defines the scheme-specific syntax and semantics for http URLs.

http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

If the port is empty or not given, port 80 is assumed. The semantics are that the identified resource is located at the server listening for TCP connections on that port of that host, and the Request-URI for the resource is abs_path (section 5.1.2).

So the HTTP specification explicitly states that the query is not part of the address of the resource; and RFC 3986 explicitly states that it is.

In short: the latest HTTP RFC and the latest URI RFC are fundamentally incompatible, and the latest URI RFC is incompatible with the behavior of most existing software.

What this means is that if you do anything involving HTTP POST requests with body data sent to URIs that include query parameters, you can expect a world of pain. Some software will ignore the query parameters, some won’t. So don’t do it.
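If you control the client, one way to follow that advice is to move any query parameters into the form body before posting, so the request line carries only the path. The following is a rough sketch under that assumption; the class and method names are hypothetical, it drops any fragment, and whether your server treats body parameters the same way as query parameters is something you would need to check.

```java
import java.net.URI;

final class QueryFreePost {

    /** Returns the URI with its query (and fragment) removed. */
    static URI withoutQuery(URI uri) {
        // Use the raw components so existing percent-encoding is left untouched.
        return URI.create(uri.getScheme() + "://" + uri.getRawAuthority() + uri.getRawPath());
    }

    /** Prepends the URI's query to an already urlencoded form body. */
    static String mergedFormBody(URI uri, String encodedBody) {
        String rawQuery = uri.getRawQuery();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return encodedBody == null ? "" : encodedBody;
        }
        if (encodedBody == null || encodedBody.isEmpty()) {
            return rawQuery;
        }
        return rawQuery + "&" + encodedBody;
    }

    public static void main(String[] args) {
        URI original = URI.create("http://www.example.com/article/23?date=2008-10-20&author=mathew");
        System.out.println(withoutQuery(original));                   // http://www.example.com/article/23
        System.out.println(mergedFormBody(original, "title=Hello")); // date=2008-10-20&author=mathew&title=Hello
    }
}
```

The POST then goes to the query-free URI with the merged string as its application/x-www-form-urlencoded body, which every reading of the specifications agrees on.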

My view is that RFC 3986 is in error, and its definition of query parameters should be ignored in favor of the definition in every prior standard, which is also the definition in common use.

Filed under: Programming, Standards | Comments (0)

Form design: You’re doing it wrong

October 2, 2008

One of the most valuable qualities for a software developer is an ability to recognize and admit when they have been doing something wrong.

Yesterday I discovered an excellent summary of HCI research regarding online form design, and discovered that I’ve been designing forms incorrectly for years.

In summary, the best way to lay out a form is with labels above the fields, left-aligned. The labels should be in smaller regular text, not bold.

It also goes without saying that you should be using CSS for your form layout, not tables.

Filed under: Design | Comments (0)

The disastrous Facebook redesign

September 19, 2008

Facebook seem dead set on an adversarial relationship with their users.

Last week someone worked out how to get the old Facebook design back, by adding the Facebook Developer application and then bookmarking a URL that turned on some kind of preview mode.

The day after I found out about it, Facebook deleted the group that contained the information, and yanked the Facebook Developer application from all the people who had added it.

What fascinates me is how something as awful as the new design made it through the development process. I can only assume that they didn’t do any focus groups or user testing. From my perspective as a web developer, let’s go through a few of the things that are wrong with the new design of the main page (the one you get to when you click the word Facebook top left). Obviously, it’ll help if you open your version of the page in another window.

(more…)

Filed under: Design | Comments (0)

Thoughts on version control

September 18, 2008

I’ve used a number of version control systems, including CVS, PVCS, Subversion, and Arch. While people are scathing about CVS, I have to say that to my mind, the worst of the four is Arch. I tried using it because it had the right back-end storage, and discovered belatedly that its user interface was ghastly, with commands whose usage I found impossible to remember. Judging from its mailing list archives, Arch is dead, so I guess everyone else had the same experience I did.

Again considering those four, SVN was probably the closest to comfortable for me, and I still use it with projects like Ruby and its libraries. But for my own use, I’ve moved to something newer that I like better: Bazaar.

Bazaar was an attempt to do Arch again, but with a usable command set and the horrible file naming conventions removed. The developers clearly thought about the average user’s key requirements more carefully than the SVN developers.

For example, the first major release of SVN was designed to use versioned WebDAV for its file storage. Hands up everyone who has a versioning WebDAV server set up? Yeah, me neither. So they also gave it a dedicated server backend–but they made it store all your data in a virtual filesystem layered on top of Berkeley DB files. Those tended to become corrupt and need recovery. But hey, once they made SVN work on ordinary filesystems, it was pretty good.

No such stupidity with bzr. All you need is an FTP or SFTP server. If you want to publish changes but not allow people to commit, you can just point an ordinary HTTP server at your bzr repository and people can use the http URL to check out the code. This means if you have cheap $5 web hosting, you can run a bzr repository.

The commands are simple too. Suppose you’ve started a Java project in Eclipse and want to add it to bzr and publish it on a server somewhere.

$ cd ~/eclipse/workspace/MyProject
$ bzr init
$ bzr add src/**
added src
added src/com
added src/com/example
added src/com/example/myproject
added src/com/example/myproject/MyProject.java
added src/com/example/myproject/MyProjectTest.java
$ bzr push --create-prefix sftp://server.example.com/srv/bzr/projects/myproject/trunk
Created new branch.

Done. Now you can work on the code for a while, bzr commit each time you have it in a good state, and when you’re ready to publish the new revision just bzr push and it’ll use the same URL as last time.

If you later decide you’re too lazy to remember to bzr push and prefer the SVN/CVS way of working where there’s a central repository, then do:

$ bzr bind sftp://server.example.com/srv/bzr/projects/myproject/trunk

Now whenever you commit, your local copy will automatically be pushed to the server. Starting to work with someone else’s existing repository is easy too:

$ bzr get http://repo.example.com/projects/something/trunk

When you’ve made a bunch of changes you want to send to the owner for consideration, bzr send -o patches will bundle up all the necessary info into a file you can just e-mail–you don’t need a place to publish your branch. Or if you prefer the ‘auto-push’ model and are given commit access to the remote repository, you can do your initial checkout with

$ bzr checkout sftp://repo.example.com/projects/something/trunk

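and then just bzr commit your changes; because a checkout is bound to the remote branch, each commit is sent to the server automatically, with no separate bzr push needed.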

Of course, you can change your mind later and bind and unbind as you wish, or as your permissions change.

It’s also worth noting that by default, a bound branch (checkout) has all the necessary info to let you keep working if you find yourself unexpectedly without a network connection. Again, the designers of bzr obviously thought a lot about the way people work in the real world.

Meanwhile, it seems like the new hotness in version control is Git. Several open source projects I use have switched to it. I have to say, I don’t understand why.

There’s some discussion of git limbo from a Gnome developer that I think deserves reading before being tempted to use git. The idea of being able to do a partial commit easily is obviously very powerful, but it seems to me like leaving a loaded gun lying around.

I’ve worked with people who were in the habit of doing partial commits, and they were also in the habit of making the main trunk unbuildable. They’d fairly regularly commit a set of files that didn’t match any boundary of the version dependencies. This is particularly prone to happen during refactoring; it’s easy to change a library API, and forget to include one of the files that contains a call to the library when you’re checking in.

Besides, why do a partial commit anyway, when you could just turn what you think would be a good partial commit into a new branch, check that it builds and passes the unit tests, and then merge it? Isn’t that the whole point of having a VCS with fast lightweight branching?

Sure, git is fast. But bzr is faster than git 1.0, which was deemed fast enough. Meanwhile, "someone will write a GUI" is a lousy excuse for git’s horrible command line UI. The merge command should be ‘merge’, for example, not a variant of the ‘pull’ command. To me, git’s commands have Arch smell, and that’s not a good smell.

Filed under: Programming | Comments (1)

The fractious leap second debate

September 16, 2008

You might not have heard about it, but there’s a debate going on which threatens to redefine time as we measure it. I’m something of a time nerd; all the computers in our house are synchronized to atomic clocks, as are several of our regular clocks, my wristwatch, and my phone. The debate going on concerns leap seconds. To understand the importance of it, it’s necessary to understand what a leap second is.

Recording time is made difficult for us by the fact that we live on a large rotating object with high mass, in orbit around a star. We like our time measurements to correspond to the apparent observed motion of the star in our sky; in short, we like day to be light, and night to be dark. We also like to set our calendar based on the earth’s orbit around the sun, so that winter is always cold and summer is always hot.

Inconveniently, the earth does not make an exact number of rotations per year. Hence every now and again it’s necessary to have a leap year, inserting an extra day into the calendar to bring it back into sync with the earth’s orbit, so that the months don’t gradually drift against the cycle of hot and cold weather.

The problem of wanting day to be light is solved by having time zones, with different parts of the world choosing a different offset in hours so that noon is roughly when the sun is overhead.

Historically, the offsets were measured from GMT, time as measured at the Greenwich Observatory in England, calculated from the position of the sun. However, the development of atomic clocks of increasing accuracy, and telescopes of increasing power, made scientists aware of problems with this simple scheme.

(more…)

Filed under: Java | Comments (0)

Pride comes before a fall

September 8, 2008

2006-10-27:

As part of its strategy to win more trading business and new customers, the London Stock Exchange needed a scalable, reliable, high-performance stock exchange ticker plant to replace its earlier system. [...] Using the Microsoft® .NET Framework in Windows Server® 2003 and the Microsoft SQL Server™ 2000 database, the new Infolect® system has been built to achieve unprecedented levels of performance, availability, and business agility.

Benefit: One-hundred-percent reliable on high-volume trading days

Or as Microsoft headlined it:

London Stock Exchange: Achieving Record Reliability Using Windows over Linux

Contrast with 2008-09-08:

The London Stock Exchange suffered its worst systems failure in eight years on Monday, forcing the world’s third largest share market to suspend trading for about seven hours and infuriating its users. [...]

The Johannesburg Stock Exchange, which uses the LSE’s trading platform TradElect, also suspended trading.

Meanwhile, the New York Stock Exchange uses AIX and Linux.

I wonder how long it will take Microsoft to take down the banner ad.

Filed under: Microsoft | Comments (0)

Java, JDBC and “memory leaks”

September 5, 2008

Every time Java is discussed on Slashdot, someone says that the overheads of automatic memory management aren’t worth it because Java still has memory leaks.

After further discussion, it generally turns out that they’re not talking about memory leaks; rather, they are talking about failure to free up resources in a timely manner–resource hogging. It’s a subtle distinction. In a memory leak, the system loses track of the memory, so it never gets freed during the life of the program. In the case of Java resource hogging, the Java system is still keeping track of the resources, and will eventually free them–it just doesn’t do it soon enough.

A common situation where resource hogging occurs is JDBC, querying a SQL database from a Java application or servlet environment. The problem is, JDBC query code is surprisingly tricky to get completely correct. It’s easy to write code where an exception causes active JDBC objects to be left unclosed, leading to the application being unreliable, overloading the database server, or using more memory than it needs.
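To make the failure mode concrete, here is a minimal sketch of a query method that closes its JDBC objects even when an exception is thrown. It assumes Java 7 or later for try-with-resources, which is not necessarily the technique the rest of this post builds up to; the class name, method name, and the idea of reading the first column as a string are all made up for the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class JdbcQuery {

    /**
     * Runs a query and returns the first column of every row as a string.
     * The caller owns the connection and is responsible for closing it.
     */
    static List<String> firstColumn(Connection conn, String sql) throws SQLException {
        List<String> values = new ArrayList<>();
        // Resources are closed in reverse order (ResultSet, then statement),
        // whether the block exits normally or through an exception.
        try (PreparedStatement stmt = conn.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                values.add(rs.getString(1));
            }
        }
        return values;
    }
}
```

A caller would typically obtain the connection in its own try-with-resources block, for example from a DataSource, so that the connection is closed last and only after the statement and result set are gone.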

MySQL and PostgreSQL are extremely liberal in what they are prepared to accept. For example, you can generally close a connection and rely on the database to implicitly close everything else, including abandoning any uncommitted transactions. This is not the case with IBM DB2, which will actually refuse to let you close a connection unless you have cleared out everything properly. So it’s not just a resource usage issue–you can also suddenly find yourself having to do a ton of debugging when your data load increases and you need to swap out your development database engine for something more scalable.

So, it pays to get your JDBC code right the first time. To illustrate the painful construction of some hopefully correct JDBC query code, I’m going to discuss the process of writing a simple example program in Eclipse.

(more…)

Filed under: Java | Comments (0)