September 25, 2009
I often have to shunt around Lotus Domino databases, as well as all kinds of log files and bundles of XML data. I’ve got a cable modem connection to my home office, but still, uploads can take a while. So data compression is still important to me, and the newer LZMA algorithm can make a big difference.
Here’s an example using a random database template:
Example file compression
| bytes | file |
| 9437184 | dct.ntf |
| 2618157 | dct.ntf.gz |
| 2256744 | dct.ntf.bz2 |
| 1806715 | dct.ntf.lz |
| 1784916 | dct.ntf.7z |
| 1775175 | dct.ntf.lzma |
Switching from bzip2 to LZMA is a bigger improvement than the switch from gzip to bzip2 was. However, notice that there are three LZMA-compressed files. The first was created using lzip; the second, using 7-zip; the third, using lzma-utils, now renamed xz or xz-utils.
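To check that claim against the numbers, here is a minimal Python sketch that works out the relative savings from the sizes in the table. The helper name `shrink_percent` and the variable names are mine, chosen for illustration; only the byte counts come from the table.

```python
# Sizes in bytes, copied from the table above.
sizes = {
    "dct.ntf": 9437184,
    "dct.ntf.gz": 2618157,
    "dct.ntf.bz2": 2256744,
    "dct.ntf.lz": 1806715,
    "dct.ntf.7z": 1784916,
    "dct.ntf.lzma": 1775175,
}


def shrink_percent(before: int, after: int) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return 100.0 * (before - after) / before


# Improvement of each step, relative to the previous compressed file.
gzip_to_bzip2 = shrink_percent(sizes["dct.ntf.gz"], sizes["dct.ntf.bz2"])
bzip2_to_lzip = shrink_percent(sizes["dct.ntf.bz2"], sizes["dct.ntf.lz"])

# Spread between the largest and smallest LZMA outputs,
# expressed as a share of the uncompressed file.
lzma_spread = 100.0 * (sizes["dct.ntf.lz"] - sizes["dct.ntf.lzma"]) / sizes["dct.ntf"]

print(f"gzip -> bzip2: {gzip_to_bzip2:.1f}% smaller")   # 13.8% smaller
print(f"bzip2 -> lzip: {bzip2_to_lzip:.1f}% smaller")   # 19.9% smaller
print(f"LZMA spread:   {lzma_spread:.2f}% of original")  # 0.33% of original
```

Even the weakest of the three LZMA files is a larger step down from bzip2 than bzip2 was from gzip.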
I don’t like that there are competing LZMA archive formats, but since that’s the world I’m in, I decided I had to make a choice which one to use. The obvious answer is to go with the one with the best compression, but in my view a spread of about 0.3% of the original file size isn’t enough of a difference to make that the sole criterion. So I’ve tried my usual approach to evaluating open source software: I’ve looked at the user interface and documentation.
7-zip fails immediately because it doesn’t behave like a Unix program. It produces 10 lines of output when successfully compressing a single file, and there doesn’t seem to be any way to get it to shut up. So, it’s a two-horse race.
The xz-utils site has no documentation I can find. Searching the Ubuntu documentation site locates the xz man page, however.
In contrast, lzip’s web site has a link to go straight to the user manual and tutorial. (Yes, there’s also a standard Unix man page.)
Comparing the two, I see that xz has many more options. It has all kinds of tweaks to specify how much memory it uses, tweak various internal details of the LZMA algorithm, and filter the data. None of these options are adequately explained. To quote Ted Nelson quoting Roger Gregory, “An option means the programmer didn’t have a clear idea of what the module was supposed to do.” Or as Steve Krug puts it, “Don’t make me think.”
In contrast, lzip’s user interface is much simpler, and closer to the Unix philosophy of “do one thing, and do it well”. The only two tweaks to the LZMA algorithm lzip provides are adequately explained if you know the basics of how compression algorithms tend to work, and there’s a table showing how they correspond to the compression levels -0 to -9. The only borderline gratuitous option is to split the compressed file into chunks, and that’s at least a useful one. It also gets the SI units right.
So, lzip wins by a landslide on UI and documentation.
You might be thinking I’m being superficial here; surely documentation alone isn’t a good way to evaluate software? So this time, I took a look at the source code.
Lzip’s source is around 5,680 lines of code (excluding comments), supplied with an autoconfigure script and test suite. It compiles to an 85,753-byte binary.
XZ Utils’ source is around 31,183 lines of code (excluding comments). It compiles to a 516,779-byte binary.
In XZ’s favor, its source code is much better commented (30% vs 10% comment-to-code ratio). Then again, at 6× the size, it had better be. So on balance, I think lzip still wins.
So once again, my “look at the documentation” heuristic worked. The question is, are there any good exceptions to prove the rule? That is, examples of excellent code that has terrible documentation?
Update: The GNU project decided to go with xz rather than lzip, of course, and implemented xz support rather than lzip support in GNU tar.
Filed under: Design, Linux, Programming
August 31, 2009
WIRED magazine has finally noticed that the long term trend for technology is cheap, simple and ubiquitous.
Back in the 1990s everyone was excited about “cyberspace”. We were going to build a whole new world in virtual reality, with virtual banks, virtual shopping malls, and virtual libraries. We would drive around in virtual cars and be represented by 3D avatars that looked just like us. Even the web would be replaced with cyberspace–remember VRML?
I always thought that was a stupid idea. We already have a world that’s far higher resolution and more interactive than can possibly be experienced via a screen, or even via special goggles and exotic input devices. Rather than have massively powerful computers try to simulate a virtual world, what made more sense was for lots of small and cheap computers to become ubiquitous in the real world we already have.
Rather than a virtual library, I want a real library where the books all have RFID, and an augmented reality application can just guide me to the book I want. Rather than a virtual mall, let me search the real mall from my phone.
There’s certainly a place for virtual worlds and cutting-edge hardware; broadly speaking, that place is the video game industry, at least as far as the average person is concerned.
Ask a photographer what the best camera is, and he’ll probably tell you: it’s the camera you actually have with you. A cheap camera in your pocket is better than a $3000 SLR at home. Similarly, you’ll get more done at the coffee shop with the $300 laptop that you carry in your shoulder bag than the sleek 17″ behemoth you leave at home.
Filed under: Business
July 20, 2009
I keep all my music on a server in the corner of my office, so it’s accessible from any machine in the house. I recently rebuilt the OS install on the machine (Linux, of course), and reformatted the hard drives to get rid of ReiserFS and switch to ext3.
After putting back all the data to a single temporary partition, I took a look at my inode usage. df -i reported that the main data partition had 101875 inodes used out of 9281536, and df --si reported 135G used.
135,000,000,000 / 101875 = 1325153, so I’m using 1 inode per 1.3MB of disk space. And it’s worth noting that I also keep e-mail in the same partition, using Maildir format, so I have a ton of small files there too.
Why is this interesting? Because by default, Linux mkfs.ext3 creates an inode for every 16KiB of disk, at least on Ubuntu. So in spite of all those tiny mail messages, I had roughly 80x as many inodes as I needed, each eating up 128 bytes of space. All those unused inodes were chewing up over a gigabyte of disk.
By using mkfs.ext3 with the -i parameter, I could pick a more reasonable inode ratio for the long-term home of my data. It’s best to err on the side of caution, as you can’t add more inodes later, so I decided to go with mkfs.ext3 -i 262144, or about five times as many inodes as I observed that I was using. In addition, since the data partition isn’t my root partition, I used -m 0 to skip reserving any space for root’s exclusive use; if the partition fills up, the system won’t fall over.
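The arithmetic behind those choices can be reproduced with a short Python sketch. The variable names are illustrative; the figures are the ones quoted above from df, plus the 16KiB default and 128-byte inode size mentioned in the text.

```python
# Figures reported by `df -i` and `df --si` in the post.
inodes_used = 101_875
inodes_total = 9_281_536
bytes_used = 135_000_000_000   # 135G, in SI units

inode_size = 128               # bytes per inode, as stated in the post
default_ratio = 16 * 1024      # default: one inode per 16 KiB of disk
chosen_ratio = 262_144         # the value passed to mkfs.ext3 -i

# Average amount of data stored per inode actually in use.
bytes_per_used_inode = bytes_used // inodes_used

# Disk space taken by the inode table at the default ratio.
inode_table_bytes = inodes_total * inode_size

# How many inodes each ratio provides per inode actually needed.
oversupply_default = bytes_per_used_inode / default_ratio
headroom_chosen = bytes_per_used_inode / chosen_ratio

print(bytes_per_used_inode)          # 1325153
print(inode_table_bytes)             # 1188036608
print(round(oversupply_default, 1))  # 80.9
print(round(headroom_chosen, 1))     # 5.1
```

So the default ratio supplies roughly 80 inodes for every one in use, while -i 262144 still leaves about a fivefold margin.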
Filed under: Linux, System administration
July 13, 2009
When Wolfram Alpha was made public, like everyone else I went to the site to see what it could do. The demonstration queries produced pages which looked really useful. I could instantly imagine how I might find the site invaluable.
Then I hit a wall. I thought of things I would like to see displayed by Wolfram Alpha, and for each, I wondered: how would I persuade it to show such a thing?
Let’s go through a real example.
I’m interested in the state of the economy, and how bad things have actually gotten. A fairly basic task is to find a graph of the US unemployment rate, from some recent year to the present. So I go to Wolfram Alpha and type us unemployment rate 1990-2009. The result: (insufficient data available).
The naïve user would probably take the error message at face value, and think “Wolfram Alpha doesn’t have enough unemployment data to show me the last 20 years?” and conclude that the tool is useless. Since I’m a computer scientist, however, I know that it’s just an extremely badly worded error message, and I press on.
The page shows my query with boxes around all the words except 2009, so I try again with a single year: us unemployment rate 2009. This time, Wolfram Alpha says Assuming “us unemployment rate 2009” is international data | Use as economic data instead.
Given that it says it understood what I meant by “US” and “unemployment rate”, it’s pretty dumb that it decided that was not economic data, but that link looks helpful. I click it, and get the graph I’m looking for–but only for 2009, of course. (A quick second query proves that yes, it does have data going back to 1990, and was lying when it said “insufficient data available”.)
So clearly Wolfram Alpha can do what I want, but it needs to be in some kind of special “economic data” mode. Entering my original query in the box on the economic data page doesn’t interpret my query in “economic data” mode, so how do I select that?
I try adding the words to my query: us unemployment rate 1990-2009 economic data. No deal; Wolfram Alpha now says it “isn’t sure what to do with” my input. (Another bad error message: it knows exactly what to do with my input, i.e. parse it; the real problem is that it doesn’t know how.)
Moving the “economic data” to the front doesn’t help either. Turns out, the clue is in the results page I got for a single year. There’s a box containing the words “civilian unemployment rate”. That’s the magic phrase to use instead of “unemployment” in order to put Wolfram Alpha into economic statistics mode, at least for this one query. So finally, I type us civilian unemployment rate 1990-2009 and get what I was looking for. Or rather, something close to what I was looking for:

[Image: screen grab from Wolfram Alpha results page]
It looks like a graph of US unemployment rate. However, it starts at 1950 rather than the year I requested. Also, what is up with that vertical axis? Looks like Wolfram Alpha is giving me either source material for the next Edward Tufte book or a submission for The Daily WTF. Another nagging doubt: does “civilian unemployment rate” mean it’s excluding unemployed former government or military workers?
This is how it goes every time I think of a problem for which Wolfram Alpha might be a solution: the “intelligent” search interface on the front acts as a frustrating wall between me and the actual tools. Except, of course, for the trivial problems like “£36 in $”, which Google already handles quite adequately.
I decided to compare my Wolfram Alpha experience with the Google experience. I typed graph us unemployment rate and hit Enter, and the first link in the Google results was a small graph next to a link to Google’s public data site. I follow the link, and to my astonishment, it’s exactly the graph I want. It has a meaningful vertical axis. It cites the data source. It defines exactly what it means by “unemployment rate”. It even randomly chose the same range of years I picked for my example. What’s more, it has useful tools for adjusting the graph to show specific states, and hovering the mouse over the line reads off the value at that point.
I swear, I didn’t even look at Google before picking my example. I didn’t even know Google had a public data graphing site. And yes, granted, Google offers no way to adjust the date range, but still: 0 out of 10 to Wolfram, 10 out of 10 to Google.
So my initial superficial impression of Wolfram Alpha was “Hmm, nice idea, pain to try and use, doesn’t really work, maybe it’ll be worth using some day.” I filed the site away in my bookmarks in case I started hearing rumors it was useful, and went on with my work.
Sadly, I now realize that I was being overly generous with my assessment. Not only does Wolfram Alpha not work, the current design is a pretty stupid idea to start with. As others have now pointed out:
The task of “guess the application I want to use” is actually not even in the domain of artificial intelligence. AI is normally defined by the human standard. To work properly as a control interface, Wolfram’s guessing algorithm actually requires divine intelligence. It is not sufficient for it to just think. It must actually read the user’s mind. God can do this, but software can’t.
Google succeeds because for complex information searches, it directs you to a sub-site with a special-purpose interface. Wolfram Alpha tries to make the text entry box be the only interface; even when it understood my query, it didn’t take me to a graphing area or an employment statistics area, so chances are for any related query I’d have had to go through the whole frustrating query experimentation process again.
English language input barely works for Interactive Fiction, where there are at most a few hundred objects you might be referring to and maybe a hundred things you might want to do with them. Even assuming state-of-the-art AI, it’s madness to think that an English language interface to something whose problem domain is the whole of mathematics could ever be usable.
The alternative? Another quote:
… the human skull contains an organ called a “brain,” which has spent several million years learning to use tools. Therefore, if you are building a control interface, ie a tool, the prudent way to proceed is to (a) assume your users will need to learn to use your tool, (b) make it as easy as possible to learn the tool, and (c) make the tool as effective as possible once it is learned.
There is a caveat to this: people may be good at learning to use tools, but there’s a sizeable population who do not want to learn anything, particularly not how to use a tool, and certainly not when it might be the right tool for the job. You’ll see these people everywhere in the business world. They’re the ones using Microsoft Word, but not using styles to format their text. They send you file attachments which turn out to be an Excel spreadsheet consisting of a list of names of people arranged in one column. I’ve not yet seen anyone come up with a viable UI approach for people like that, unless you count the cluebat as a user interface.
So my thought for the day is: when you have a complex system, make your user interface be as dumb as possible.
Filed under: Design
July 10, 2009
This week saw the 40th anniversary of IBM CICS, the Customer Information Control System. It had 7 releases before being re-engineered as CICS TS (Transaction Server); CICS TS has just seen the release of version 4.1. I’m going to take a guess that they don’t use agile software development methods in the CICS department at Hursley. Yet for all its age, CICS keeps up with the times: it now supports Atom feeds, Java, and Web Services–REST as well as SOAP.
Of course, that’s not why 90% of Fortune 500 companies use it. It’s one of those rather dull products that runs invisibly, year after year, processing millions of transactions reliably and actually getting the answers right. Key parts of its behavior are formally specified in Z notation. It’s the back-end software that processes your ATM transactions and airline reservations. Yet in spite of that, it seems to have a fan following–there’s an I ♥ CICS group on Facebook. If you’re wondering what you’re missing, the University of Maryland has some screenshots of their CICS system. Of course, these days you can do your CICS development using Eclipse and give your applications a web interface.
Another really old IBM product that’s still in service is IMS, a database so old that it predates the relational model, let alone SQL. Built to track all the parts required to build the Saturn V rocket for the Apollo moon missions, it’s still being used today, 41 years later. It’s the product you use if your database is 60 terabytes in size. After CICS has processed your ATM transactions, chances are at some point the movement of the money will be noted in the Federal Reserve’s massive IMS database.
Like CICS, IMS has kept up with the times, with support added for web technologies. Ironically, the fact that XML is hierarchical makes IMS a better match for XML processing than the more modern relational databases–you can translate your entire document directly into IMS fields, preserving the hierarchical element structure directly, and then perform XPath queries via IMS’s JDBC interface.
I don’t use CICS or IMS myself. I only occasionally use a 5250 emulator, mostly to perform System i administration tasks; I haven’t logged in to z/OS in over a year. The mainframe I do use runs Linux, and I like it that way–Unix is quite old enough for me. Yet there’s something oddly fascinating about these mainframe products and their continued existence; it’s almost like discovering there are trilobites still living in the back of your filing cabinet.
Filed under: Culture
May 21, 2009
A flaw in the SSH protocol is starting to get more widespread attention.
It appears that a workaround is available: disabling CBC ciphers in favor of CTR. To do so, edit /etc/ssh/sshd_config and add the following:
Ciphers arcfour128,arcfour256,arcfour,aes128-ctr,aes192-ctr,aes256-ctr
That’s the default list of SSH ciphers, minus the CBC ones.
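As a rough illustration of how such a line can be derived, the Python sketch below drops every CBC entry from a comma-separated cipher list. The list in `EXAMPLE_DEFAULTS` is an assumption for demonstration only; the real defaults depend on the installed OpenSSH version, so check the sshd_config man page on your own system. The function name `drop_cbc` is also hypothetical.

```python
# Illustrative cipher list only; the actual defaults vary by OpenSSH version.
EXAMPLE_DEFAULTS = (
    "aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,"
    "aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour"
)


def drop_cbc(cipher_list: str) -> str:
    """Return the comma-separated list with all CBC-mode ciphers removed."""
    names = [name.strip() for name in cipher_list.split(",") if name.strip()]
    kept = [name for name in names if "-cbc" not in name]
    return ",".join(kept)


# Print a line in the form expected by sshd_config.
print("Ciphers " + drop_cbc(EXAMPLE_DEFAULTS))
```

The order of the surviving entries is preserved, which matters because it determines which cipher is preferred.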
Filed under: Linux, System administration
May 8, 2009
apt-get install openjdk-6-jdk icedtea6-plugin
update-java-alternatives -s java-6-openjdk
For some inexplicable reason, Eclipse for Java Developers doesn’t include JDBC.
Eclipse J2EE edition doesn’t work with OpenJDK.
Oh well.
Filed under: Java, Linux, System administration
April 9, 2009
A common technique for getting XML data out of IBM Lotus Domino is to build an agent which outputs the DXL encoding of a document and call it via HTTP. The code typically looks like this:
Print "Content-type: text/xml"
Dim session As New NotesSession
Dim doc As NotesDocument
[...obtain your data somehow in the variable doc...]
Dim exporter As NotesDXLExporter
Set exporter = session.CreateDXLExporter
exporter.OutputDOCTYPE = False
Dim stream As NotesStream
Set stream = session.CreateStream
Call exporter.SetInput(doc)
Call exporter.SetOutput(stream)
Call exporter.Process
Print stream.ReadText()
However, there’s a subtle error in the above code. The kind of error that can make everything look fine in testing, then cause your integration work to fall over in production.
(more…)
Filed under: Domino
April 6, 2009
I do a lot of integration work, and often it involves REST Web Services and other forms of XML data transfer. Generally I want to provide dates and times in RFC3339 format, or some variant of ISO8601; for example, 1996-12-19 16:39:57 -0800.
When writing a Web Service in LotusScript, it’s not too hard; I built myself a function which converts a NotesDateTime or Variant of type 7 into UTC and then formats it appropriately, and I use that all over the place. Similarly, in Java it’s easy enough to convert everything to UTC and avoid needing to work out the correct numeric time zone offset.
Then one day I had to put the value of @Modified into a view column, for export. Since view columns can only use formula language, I was faced with trying to produce the appropriate format using only @Formula and column properties. To make things worse, DST absolutely had to be handled correctly. So did time zones that aren’t on an hour boundary, such as -0330. (We don’t currently have any servers in Newfoundland, but I don’t want to get phone calls if we ever do set up a site there.)
For a while, I struggled with @Zone, @Text, and a mess of string manipulation–but it didn’t seem to behave properly. There’s something odd about the output of @Zone that makes it behave differently from an identical pure string when converted to a number.
Date/time columns have various ways to customize their output. However, there’s no way to say “always show dates and times in UTC”; that would be far too useful. There’s an “always show dates and times in local time zone”, but that changes twice a year because of the DST madness, and frankly I don’t trust it to change on the right dates. I could set the server to be in UTC–and in fact I often do–but I don’t want my application to break if deployed to a server with a different locale.
Then I suddenly remembered that Notes 6 gained an @GetCurrentTimeZone formula which returns the current time zone where the code is running–either the client PC’s time zone, or the server’s time zone, whichever context it is in. The formula language also has @TimeZoneToText, which will convert a time zone value to a string like (GMT-06:00). Suddenly the answer was obvious: it’s trivial, so long as you use two columns (or two fields, in the case of a form).
The first column/field just displays the date/time value, formatted with custom preferences to look like RFC3339 minus the time zone part. The second field is then @Right(@TimeZoneToText(@GetCurrentTimeZone; "S"); "GMT");
The result is a clean ISO-formatted date/time value. Most importantly, because both fields/columns use the local time zone but display that zone numerically, the output indicates the correct instant in time even if the system changes to/from DST on the wrong day.
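The same principle, local time plus an explicit numeric offset, can be shown outside Domino. The Python sketch below is only an analogy, not formula language; the function name `format_with_offset` and the two fixed-offset zones are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def format_with_offset(local: datetime) -> str:
    """Format an aware datetime as 'YYYY-MM-DD HH:MM:SS +/-HHMM'."""
    if local.tzinfo is None:
        raise ValueError("need a timezone-aware datetime")
    return local.strftime("%Y-%m-%d %H:%M:%S %z")


# Fixed offsets chosen for the example, including one off the hour boundary.
pacific = timezone(timedelta(hours=-8))
newfoundland = timezone(timedelta(hours=-3, minutes=-30))

stamp = datetime(1996, 12, 19, 16, 39, 57, tzinfo=pacific)

print(format_with_offset(stamp))                           # 1996-12-19 16:39:57 -0800
print(format_with_offset(stamp.astimezone(newfoundland)))  # 1996-12-19 21:09:57 -0330
```

Both lines name the same instant, because the offset travels with the local time; that is the property the two-column trick relies on.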
True, this isn’t rocket science. However, it’s one of those interesting situations where I was unable to see the obvious solution for a long time, because I was in a mindset: I was looking for a single formula to produce all the output at once. If such a thing exists, I bet it’s very large.
Filed under: Domino
March 25, 2009
A company called OnLive have announced a new product. The idea is simple to understand: instead of buying a gaming PC or expensive console, they’ll render all the graphics for you on a server farm, send the video to a lightweight decoder box or software client via the Internet, and have it send back your controller input.
Basically, you’ll be playing games on a remote system via the net. Think of it as video games over VNC. The benefit is that you can play leading edge games like Crysis on dirt cheap hardware–supposedly.
A lot of people are skeptical about this supposed product, pointing out that US Internet infrastructure is pretty low bandwidth compared to other countries. Yet bandwidth isn’t the big problem; latency is the problem. (As Stuart Cheshire put it, “It’s the latency, stupid”.)
My typical ping times to a major Internet site are under 100ms–e.g. 25ms to hit Yahoo’s edge–but I’m one of the lucky ones with a good Internet connection. It’s not uncommon to have 250ms ping times. But let’s assume up to 100ms one way, and think about how this will work.
The OnLive hardware cloud will render the video. State of the art for low latency video encoding and decoding is 70ms or so for an encode/decode cycle. So the image gets to you after 100 + 70 = 170ms. You respond by moving the controller. Assuming no delay at all in reading the controller, your response gets back to the server after 270ms.
So basically, using today’s technology you’re talking over a quarter second of lag between something happening in the game and your being able to respond to it. And that’s a pretty optimistic case; note that the state-of-the-art HD video system I linked to had 500ms total latency in their demo across the public Internet.
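The lag budget above is simple enough to put in a few lines of Python, which makes it easy to plug in other assumptions. The function name `response_lag_ms` and its parameters are hypothetical; the default figures are the ones used in the estimate above.

```python
def response_lag_ms(one_way_ms: float, codec_ms: float, controller_ms: float = 0.0) -> float:
    """Time from a game event on the server to the player's reaction arriving back.

    The frame crosses the network once and passes through the video
    encode/decode cycle; the controller input then crosses the network again.
    """
    frame_arrives = one_way_ms + codec_ms
    return frame_arrives + controller_ms + one_way_ms


# Figures from the text: 100ms one way, 70ms encode/decode, no controller delay.
frame_delay = 100 + 70
total_lag = response_lag_ms(one_way_ms=100, codec_ms=70)

print(frame_delay)  # 170
print(total_lag)    # 270
```

Note that the network leg is counted twice, so halving the one-way time helps more than halving the codec time.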
Online video games use all kinds of tricks in order to stay responsive in spite of network lag. Even so, ping times can vary from minute to minute; the Internet offers no guarantees. When the net slows down, you tend to see bursty updates to other players’ positions, and fast gameplay gets frustrating. Some people have even taken advantage of this behavior, building “lag switches” which fool their copy of the game into thinking the network is lagging, allowing them to shoot you while you’re motionless (to them). They then turn off the lag, at which stage the game reconciles the conflicting opinions of the different players’ systems, often by killing you unfairly.
I note from the announcement that OnLive’s demos so far have shown a game being played across 5 miles of network. Their actual data centers will be up to 1,000 miles away from you, they say. I think I’ll wait until they try the thousand mile round trip before I believe it’s a real product.
OnLive say that their hardware encodes the video with only 1ms of lag. I guess that’s possible, if you throw enough expensive hardware at the problem–which brings me to the second problem: economics.
OnLive are floating the idea that it’ll be $50 a year for the service.
Now of course, they can share hardware and network costs between multiple users. However, their ability to do that is going to be severely limited by the requirement for low latency. They’ll only be able to time share my hardware node with other people in the same geographical area as me, and chances are most of us are going to be wanting to play video games at about the same times of day. I doubt they’ll be able to get better than about 4 or 5 customers per node on average.
They’ll also need much more expensive hardware than a typical web host. Gaming PC hardware is pretty high end, and low latency real time HD video encoders don’t come cheap. They’ll need a much more expensive network connection too–low latency will require the kind of multiple peering backbone connections that Yahoo and Google use, not the piece of copper your average web host uses.
Support costs will be higher too. They’ll need to hand-hold people through fixing latency problems, possibly intervening with their ISPs.
So they need much more expensive hardware, much more expensive network connections, will have higher support costs, and they’ll be able to support fewer users per hardware node. Yet they think they’ll be able to cover all that for $50 a year? The price of cheap, shoddy web hosting from GoDaddy? I don’t see any way that price is workable.
In summary: I could be wrong, but I think this OnLive thing is going to be vaporware.
Of course, you could do something like OnLive if you restricted yourself to games where latency isn’t an issue and the graphics aren’t tough to compress quickly. But then, you can play games like that on a $250 Wii console, or even a $99 PlayStation 2. So where’s the market?
Final thought: Even multiplayer online games like World of Warcraft ship with DVDs full of content to cache and render on the client. If this “video games via streaming video” idea was really a good one, wouldn’t people be doing it already?
Update 2009-03-30: EuroGamer seems to agree.
Filed under: Business