June 28, 2011
When you have a DB2 database which multiple people can update, sooner or later you are likely to end up with a deadlock. You’ll issue a simple query to update some data, and wait… and wait… and wait. I hit this problem this morning. It took a while to work out how to diagnose the problem and fix it.
In the DB2 Control Center object view, drill down through the systems, instances and databases until you find the database you’re interested in. Right click and select “Applications”. It’ll take a while, but you should get a list of all the applications running–or deadlocked.
If you click to sort the list by authorization ID, you should find it easy enough to locate your deadlocked command. It’ll be listed with status “Lock Waiting”. Click to select it, then click the “Show Lock Chains” button.
Now you should see a graphical flow chart showing your task as a box, linked to a box representing whichever task holds the lock that is preventing your task from finishing. You can then right-click the task that’s causing yours to deadlock, and choose “Force” from the pop-up menu.
You need admin privileges to do this, of course. There’s a warning dialog explaining why it’s a bad idea. But sometimes you need to do things that are bad ideas. In my case, it turned out that a colleague had 77 open sessions with locks on various tables.
Further interrogation of the guilty party revealed that they were using a graphical query tool to browse the data, rather than writing SQL. Refining an existing query made the tool helpfully lock down the output of the query, so that the additional filter clauses could be edited interactively in a window without any unexpected changes occurring to confuse things. Of course, the tool didn’t unlock the table until the entire browsing session was killed. So yet another example of the dangers of using pointy-clicky interfaces instead of actually knowing how to program.
Filed under: System administration
June 7, 2011
Recent versions of DB2 have support for Unicode, if your databases are flagged as Unicode-enabled. This is a good thing, so you may have done it without thinking too much about the consequences. After all, i18n is good, right?
Unfortunately there are some major snags to be aware of. In particular, in Unicode-enabled databases the length limits in VARCHAR(n) are no longer character length limits. Instead, they are storage length limits. So your 18 character string may or may not fit in a VARCHAR(20) column, depending on the representation DB2 uses to store it.
This became a problem for me. One of the pieces of Java code I maintain is a data pump which takes end user input (read: messy, frequently invalid, may contain all kinds of strange characters and control sequences), cleans it up, arranges it in a relational structure, and puts it in a database. I started to get errors caused by people making liberal use of registered trademark characters in document titles. It was time to fix the code to ensure that data wouldn’t overflow the space DB2 had allocated for it.
One option is to use NVARCHAR in your database DDL instead of VARCHAR. DB2 inexplicably calls this VARGRAPHIC instead. The downside is that the internal representation is then always UTF-16, so all your text columns take up twice as much space. Depending on how often you use non-ASCII text, this may not be a tradeoff worth making. I thought there had to be a better way, a smarter way to clean up data.
So, how do we find out the representation DB2 uses for non-ASCII data, to know how much space it will take up in a VARCHAR? According to the documentation for DB2 v9 for Linux, there are codepage values to represent the various Unicode encodings. So I checked my database DDL using db2look, and sure enough it said “-- Database Codepage: 1208“. According to the documentation, that’s UTF-8.
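To make the storage-length point concrete, here is a minimal sketch that checks whether a string fits a byte-limited column once encoded as UTF-8. The class and method names (VarcharFit, utf8Length, fitsInVarchar) and the sample title are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class VarcharFit {
    /** Number of bytes the string occupies once encoded as UTF-8 (codepage 1208). */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the UTF-8 form of s fits a VARCHAR(n) column whose limit is counted in bytes. */
    static boolean fitsInVarchar(String s, int n) {
        return utf8Length(s) <= n;
    }

    public static void main(String[] args) {
        // Hypothetical document title: 18 characters, three of them registered signs.
        String title = "Gizmo\u00AE Pro\u00AE Guide\u00AE";
        System.out.println(title.length());           // UTF-16 code units
        System.out.println(utf8Length(title));        // bytes in UTF-8
        System.out.println(fitsInVarchar(title, 20)); // does it fit VARCHAR(20)?
    }
}
```

Each registered sign takes two bytes in UTF-8, so the 18-character title needs 21 bytes and does not fit a VARCHAR(20).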
Now comes the next problem: Java’s String representation is UTF-16 (or UCS-2 if you’re using a really old JVM). So in order to ensure that a string is short enough to fit into a column, you can’t just truncate it using String.substring() or check its length using String.length(), because both count UTF-16 code units rather than UTF-8 bytes. Instead, you need to write a function that will truncate a UTF-16 string, returning a UTF-16 string, but ensuring that the UTF-8 representation of the returned string is no longer than a specified value.
The obvious method is to use String.getBytes("UTF-8"), truncate the returned byte array, then convert back to a String. Obvious and wrong, because UTF-8 uses a variable number of bytes per character. If your truncation point happens to be halfway through a multi-byte character, you’ll end up with an invalid UTF-8 string. Converting that back to UTF-16 gives you either an exception or a U+FFFD replacement character where the broken character was, depending on how you decode.
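The failure of the byte-chopping approach is easy to reproduce. The sketch below (the name naiveTruncate is hypothetical) uses the String constructor that takes a Charset, which substitutes U+FFFD for malformed input instead of throwing; a CharsetDecoder configured with CodingErrorAction.REPORT would raise an exception instead.

```java
import java.nio.charset.StandardCharsets;

final class NaiveUtf8Truncate {
    /** Naive approach: encode, chop the byte array, decode. Shown only to illustrate the failure. */
    static String naiveTruncate(String input, int maxBytes) {
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(maxBytes, utf8.length);
        // This constructor never throws on malformed input; it substitutes U+FFFD.
        return new String(utf8, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "a\u00AE";             // 'a' is 1 byte, the registered sign is 2 bytes (C2 AE)
        String cut = naiveTruncate(s, 2); // keeps 'a' plus the lone lead byte C2
        System.out.println(cut.equals("a\uFFFD"));
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The result is doubly wrong: the registered sign has turned into a replacement character, and since U+FFFD itself needs three bytes, the "truncated" string re-encodes to four bytes and no longer fits the two-byte limit.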
So I found myself writing a method to perform UTF-8 truncation of Java String objects. I tried it out on my data, and it seemed to work fine. However, Unicode manipulation is one of those fiddly areas of software where I thought “Seems to work fine” was definitely not good enough.
I wrote some unit tests. Or rather, I wrote a single unit test method that would generate tens of thousands of random Unicode UTF-32 characters, collect them into random Java UTF-16 strings, perform the truncation, and then compare the result of the original string encoded to UTF-8 with the truncated string encoded to UTF-8. Sure enough, the code failed to perform as required. Sometimes it didn’t truncate properly, sometimes it truncated too much.
Unfortunately, random Unicode strings are quite hard to examine usefully as anything other than a hex dump, and that’s tedious. So I took a look at the list of defined Unicode blocks and decided to find one that had large code points and graphical representations in the fonts I use in Eclipse. I also wanted symbols that I would recognize if they got mangled. My Han Chinese is nonexistent, so I settled on alchemical symbols, 0x1F700 to 0x1F77F.
I wrote a new test method to generate strings of alchemical symbols, and looked at how my truncation code performed. I soon shook out a number of bugs. The tricky one involved surrogate pairs.
Just as UTF-8 can require multiple 8-bit bytes to encode a Unicode character, so UTF-16 can require multiple 16-bit words to encode a Unicode character. If you encounter a value from 0xD800 to 0xDBFF, it’s a high surrogate. You need to combine it with the next word, which will be from 0xDC00 to 0xDFFF, to form the complete character. Confusingly, the next word is the low surrogate even though its code point is higher, because it holds the low bits of the final character.
This surrogate decoding process is a nuisance, but luckily the end result is always in the range 0x10000 to 0x10FFFF. That means that when it’s re-encoded into UTF-8, in a process made even more unpleasant by Java’s lack of an unsigned byte data type, it will always require 4 bytes for representation. So I could skip all the encoding and decoding. A surrogate pair always meant the UTF-8 width was 4 bytes; I just had to make sure I didn’t break the pair. I made my code count the high surrogate as width 4 and the low surrogate as width 0 so it would always get appended.
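The surrogate arithmetic can be checked against the JDK's own helpers. This is a small sketch using the first alchemical symbol, U+1F700; the class name SurrogateDemo and the helper utf8Width are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateDemo {
    /** UTF-8 width of a single Unicode code point, by range. */
    static int utf8Width(int codePoint) {
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3;
        return 4; // 0x10000 to 0x10FFFF: always a surrogate pair in UTF-16
    }

    public static void main(String[] args) {
        int codePoint = 0x1F700;
        char[] units = Character.toChars(codePoint); // two UTF-16 code units
        System.out.printf("high=%04X low=%04X%n", (int) units[0], (int) units[1]);
        System.out.println(Character.isHighSurrogate(units[0]));
        System.out.println(Character.isLowSurrogate(units[1]));
        // Recombine the pair and confirm the UTF-8 form is four bytes long.
        System.out.printf("%X%n", Character.toCodePoint(units[0], units[1]));
        System.out.println(new String(units).getBytes(StandardCharsets.UTF_8).length);
        System.out.println(utf8Width(codePoint));
    }
}
```

U+1F700 splits into the pair D83D DF00, and the pair as a whole encodes to four UTF-8 bytes, which is why counting 4 for the high surrogate and 0 for the low one gives the right total.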
Characters over 0x10FFFF can’t be represented in UTF-8 or UTF-16. All the other ranges are straightforward. Strictly speaking, one should check that high and low surrogates only occur in pairs and in the correct order, but since I was getting all my UTF-16 strings from Java I decided to trust the JVM to deal with that. I went back and re-ran my code against the full ugly random Unicode string test, and it passed. Here’s the final code (also on GitHub):
public static String utf8truncate(String input, int length) {
    StringBuilder result = new StringBuilder(length);
    int resultlen = 0; // UTF-8 bytes used so far
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        int charlen = 0; // UTF-8 width attributed to this UTF-16 code unit
        if (c <= 0x7f) {
            charlen = 1;
        } else if (c <= 0x7ff) {
            charlen = 2;
        } else if (c <= 0xd7ff) {
            charlen = 3;
        } else if (c <= 0xdbff) {
            charlen = 4; // high surrogate: count the whole pair here
        } else if (c <= 0xdfff) {
            charlen = 0; // low surrogate: already counted, so always appended
        } else if (c <= 0xffff) {
            charlen = 3;
        }
        if (resultlen + charlen > length) {
            break;
        }
        result.append(c);
        resultlen += charlen;
    }
    return result.toString();
}
Of course, there may still be bugs lurking, and the code comes with no warranty.
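The random-string test described earlier can be sketched roughly as follows. This is not the original test: the class name, the seed handling and the exact properties checked are my own choices, and the truncator included here (truncateByCodePoint) is a stand-in that walks whole code points, there only so the harness has something to run against. Any method with the same shape, including the one listed above, can be passed in.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.function.BiFunction;

final class Utf8TruncateProperties {
    /** Stand-in truncator that walks whole code points; not the method listed above. */
    static String truncateByCodePoint(String input, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            int width = cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
            if (bytes + width > maxBytes) break;
            bytes += width;
            i += Character.charCount(cp);
        }
        return input.substring(0, i);
    }

    /** Builds a string of random code points, skipping the surrogate range. */
    static String randomString(Random rnd, int codePoints) {
        StringBuilder sb = new StringBuilder();
        int made = 0;
        while (made < codePoints) {
            int cp = rnd.nextInt(0x110000);
            if (cp >= 0xD800 && cp <= 0xDFFF) continue;
            sb.appendCodePoint(cp);
            made++;
        }
        return sb.toString();
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Returns null if every round passes, otherwise a description of the first failure. */
    static String check(BiFunction<String, Integer, String> truncate, long seed, int rounds) {
        Random rnd = new Random(seed);
        for (int r = 0; r < rounds; r++) {
            String input = randomString(rnd, rnd.nextInt(20));
            int limit = rnd.nextInt(40);
            String out = truncate.apply(input, limit);
            if (!input.startsWith(out)) return "result is not a prefix of the input";
            if (utf8Length(out) > limit) return "result is longer than the limit";
            if (!out.isEmpty() && Character.isHighSurrogate(out.charAt(out.length() - 1))) {
                return "result ends in half a surrogate pair";
            }
            if (out.length() < input.length()) {
                // The next whole character must not have fitted, or too much was cut.
                int next = input.codePointAt(out.length());
                String oneMore = input.substring(0, out.length() + Character.charCount(next));
                if (utf8Length(oneMore) <= limit) return "truncated too much";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String failure = check(Utf8TruncateProperties::truncateByCodePoint, 42L, 10000);
        System.out.println(failure == null ? "all rounds passed" : failure);
    }
}
```

Using a fixed seed keeps failures reproducible, which matters when the failing input is an unreadable string of random code points.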
Filed under: Java
April 28, 2011
Scenario:
You are using WordPress with OpenID, using the openid plugin.
Symptom:
You get an error page looking something like
Catchable fatal error: Object of class WP_Error could not be converted to string in /home/meta/public_html/lpar/wp-includes/formatting.php on line 2822
when users try to log in.
Diagnosis:
You can patch the appropriate code in formatting.php so that it actually reports the error:
function wp_strip_all_tags($string, $remove_breaks = false) {
    $string = preg_replace( '@<(script|style)[^>]*?>.*?</\\1>@si', '', $string );
    $string = strip_tags($string);
    if ( $remove_breaks )
        $string = preg_replace('/[\r\n\t ]+/', ' ', $string);
    return trim($string);
}
You then get a useful error in error_log:
[28-Apr-2011 08:54:18] [OpenID] User was created fine, but wp_login() for the new user failed. This is probably a bug.
No, really?
Anyhow, it turns out that the problem is caused by previous OpenID plugins or user registrations resulting in user entries with no e-mail address. For whatever reason, WordPress now seems to require one, and user creation and login falls over if there are any users without one. I discovered this by reading a topic at the WordPress support forums.
The obvious fix is to delete all the users who have no e-mail address, using the standard WordPress interface. However, what if you have users with no e-mail address whose comments you want to keep? Well, my suggestion in that case would be to insert a dummy e-mail address in the appropriate records:
UPDATE wp_users SET user_email = 'suckweasel@example.com' WHERE user_email = ''
It’s probably possible to patch the appropriate WordPress code to not fall over on missing e-mail addresses, but since there could be any number of other bits of WordPress code that are similarly lacking in resilience, it’s probably better to clean up the user database.
Filed under: Uncategorized
April 27, 2011
I just became aware of an interesting JavaScript ‘feature’. The code
y = x % 1;
is equivalent, for non-negative x, to
y = x - Math.floor(x);
because ECMA-262 says:
… the floating-point remainder r from a dividend n and a divisor d is defined by the mathematical relation r = n − (d * q) where q is an integer that is negative only if n/d is negative and positive only if n/d is positive, and whose magnitude is as large as possible without exceeding the magnitude of the true mathematical quotient of n and d.
So if d is 1 then q is n/1 rounded toward zero, and r = n - q. For non-negative n that is n - Math.floor(n). For negative n it is n - Math.ceil(n), so the result keeps the sign of n: -1.5 % 1 gives -0.5, not 0.5.
I will not be using this feature in my code.
Filed under: JavaScript
April 5, 2011
Recently someone on Stack Exchange asked Why are there no alternatives to TeX, or, why is TeX still used?
Here are some reasons.
It works.
It sounds trite, but TeX has a robustness and reliability that other software lacks. Recently, there was a discussion of a bug in LuaTeX triggered when a document hits 3,987 pages. I can’t imagine creating a document with over 1,000 complex pages using something like Microsoft Word or OpenOffice Writer.
It’s predictable.
If you’ve ever wondered why your word processor is formatting something a particular way, well, sometimes there’s really no way to answer the question. Microsoft’s OOXML spec resorted to defining many aspects of formatting by saying that they had to be the same as particular versions of Word; even Microsoft couldn’t specify what that behavior actually was. Pro typesetting software like Adobe InDesign isn’t much better when it comes to maintaining compatibility between versions.
With TeX, there are fewer ghosts in the machine. Even if the details underneath are pretty scary, at least you can always see exactly what macros you are applying to a piece of text and what parameters you have tweaked. Nothing is hidden, and there’s more of a feeling of being in control.
It keeps working.
I’ve used a lot of word processors over the years. Wordwise, View, LocoScript, 1st Word Plus, MacWrite, WordStar, WordPerfect, AppleWorks, DisplayWrite, Word Pro, ClarisWorks… They’re all dead and gone now. Good luck reading any files you still have stored in their special file formats. In fact, there have been discussions on Macintouch recently about the problems of converting a large number of legacy ClarisWorks documents into something readable, and that product was only declared end-of-life in 2007. Yet if I still had a copy of my dissertation, I could typeset and print it today, because I wrote it in TeX.
ConTeXt is the newest and most rapidly changing TeX macro package. Even so, switching from a 2 year old version to today’s bleeding-edge beta is mostly a non-event.
It runs everywhere.
Using TeX, I can produce finished documents on Mac, Windows, Linux, or any Unix-like OS. I don’t need a high-end CPU, I don’t even need a high resolution display.
You can edit it using whatever tool you prefer.
TeX files are just UTF-8 text files. Edit them with TextWrangler, vim, Emacs, jEdit, pretty much any text editor you want. It helps to have syntax highlighting, but it’s not essential.
Edit on your laptop, on someone else’s laptop, on a tablet, on your phone. Anywhere there’s a text editor.
Put your documents in version control. They’re just text, so you can diff them. They’re small and compress really well, so you can keep every version forever and not run out of space. Store them on a remote server and edit via SSH. Transfer them in a fraction of a second across a modern network connection.
It’s cheap.
I use the word ‘cheap’ rather than ‘free’ because really, I’m not at all averse to paying for software. However, my budget is limited.
There are some solid proprietary page layout systems that can deal robustly with large documents. Adobe FrameMaker, for example. MSRP for that is $999. If you want something a bit cheaper, there’s InDesign, for $699. Both are a bit pricey for my liking.
Or there’s TeX, which is where FrameMaker got its typesetting algorithms from anyway. Cost: $0. I think it’s something of a bargain.
It gets out of the way.
Recently, a lot of writers have been discovering the joy of ultra-minimalistic tools which get out of the way and let you focus on the actual writing. There are programs like WriteRoom, DarkRoom, jDarkRoom, Byword, OmmWriter, and so on, which attempt to remove distractions and give you just text on an otherwise empty screen. Then there are tools like Markdown, reStructuredText and wiki syntax, markup languages which allow you to edit plain text with minimal annotations and convert it into something with pretty formatting later on.
All of these tools are rediscovering something that was lost when we moved away from word processors like WordPerfect, which used plain text with embedded formatting commands, and into the WYSIWYG world of the 90s with its endless buttons, sliders, ribbons and rulers.
Being a markup language, TeX has that minimalist essence. It’s just text.
It’s extensible, on the fly.
Sometimes I’m writing some documentation, and I suddenly realize I’ll have to repeatedly refer to some kind of thing–a set of menu entries, say–and format them in a special way throughout the text.
In a normal word processor, I stop writing. I decide whether I need a paragraph style or a text style. I decide what existing style it needs to be based on. I click around to create my new style and give it a name. I adjust its formatting to be distinctive so I can see what I’m editing. I map it to a key. And then I try to remember what I was about to write before my multi-minute jaunt into wordprocessorland.
In TeX, I make up an imaginary macro off the top of my head–{\Menu File – Quit}–and keep on writing. After I’m done writing whatever was in my head, then I worry about extraneous details such as how it’ll look on the page and whether it needs to be based on the look of anything else. Even if I decide I need a more elaborate calling pattern–say, if the macro needs some arguments–it doesn’t matter, because the plain text nature of my made-up-on-the-spot macro makes it easy to find and replace.
Obviously, actually applying the formatting is harder than with a word processor. But once I’ve done it once, I can re-use that magic incantation endlessly.
It handles vector art properly.
One place a surprising number of word processors fall down is incorporating images into the text. Somehow it seems to be acceptable to offer fonts that are infinitely scalable vectors, and then fail to offer any reliable way to insert and adjust a vector diagram. Want to bring your process diagram into OpenOffice? Sorry, can’t read the SVG, why don’t you just draw it again with the horrible OpenOffice Draw?
In business, it’s not at all unusual to receive documents that have diagrams as horrible resized bitmaps, or tables as embedded objects that look like crap, for pretty much this reason.
With TeX, I can use any vector graphics tool I like, and there’s almost always a way to get PDF out–either directly, or via a print-to-PDF driver. Once I have PDF, it’s painless to place it onto the page as vectors. So like everything else, it scales correctly for the display. The only things in my documents that end up as ugly bitmaps are the screen captures, and that’s the way it should be.
There are more reasons if you’re a mathematician or scientist. I’m just listing reasons why a person who wants to write ordinary everyday business and personal documents might choose to use TeX to do so.
Nevertheless, I would absolutely use something more modern that met the same basic requirements. If you know of a word processor that runs on any modern computer, can deal with 5,000 page documents, imports vector art in standard formats and handles it in vector form, writes to tagged PDF with working hyperlinks, is guaranteed to be around and supported for the next 40 years, will let me edit without having to learn a new user interface or have anything but my text on the screen, has output the quality of FrameMaker or InDesign, is arbitrarily extensible, and costs under $100, do please let me know.
Filed under: Business, Standards
April 1, 2011
Yesterday I ran into a familiar SSL problem. I learned that a Sun engineer named Andreas Sterbenz had written a handy utility to solve the problem, and posted it on his Sun blog.
I looked to see what else he had posted. The last entry mentioned that he had jumped ship to Google, and pointed at his new Google blog. Go look at it, it’s pretty typical of a Google employee blog.
Not every new employee takes the hint. Sometimes they get fired as a result.
It’s known as the Google Vortex or Google Black Hole, and it affects products as well as people. Things pass through the event horizon in Mountain View, and you never hear from them again.
There’s something deeply ironic about it. Google is, after all, a company that just got slapped by the FTC and fined in France for not protecting privacy. If it’s your information, well, information wants to be free, right? But if it’s information about what the Google cafeteria is like or what the working hours are at Google, ironclad secrecy applies. You can’t even tell people that you’ve signed an NDA saying you can’t tell them anything.
Microsoft has a reputation for paranoia, but you see far more open commentary from Microsoft developers than you do from Google employees. Can you imagine a Google developer writing publicly about the internal limits of Google algorithms? Microsoft lets developers write that kind of stuff. Apple is notoriously secretive too, but get on the developer mailing lists and you’ll find helpful Apple employees actually answering questions. The only other Google-like situation I can think of was the person I knew who got a job at GCHQ, the UK’s equivalent of the NSA. Obviously that’s all I ever learned about that. But speaking of the NSA, I wonder if they have cheese and wine parties with their Google friends and laugh about who has the more restrictive employment contract?
Obviously IBM is rather different from Google as well. I actually started this work blog because IBM encouraged me to do so. Big Blue has some guidelines for public blogging, of course; if you want, you can go and read them, because they’re public too. How weird is that, eh? Obviously IBM got on the cluetrain. Google has not yet done so.
I can’t help thinking about the USA vs the Soviet Union during the Cold War. It’s widely viewed that the culture of secrecy in the USSR held back their scientific progress. So far Google has managed to entice enough smart people to move behind the iron curtain that they’ve kept ahead–but then, the Soviet Union managed to entice a few defectors in the early days too. Ted Nelson’s “Computer Lib” contained this anonymous quote:
“IBM is run by and for people who really believe in authority. IBM is, to my way of thinking, the way the Soviet Union would be if the Soviet Union worked.”
If there’s ever a third revised edition of Computer Lib, I can’t wait to see what it says about Google.
(Oh, and Ted, if you ever happen to see this: I’d love to help you put together a complete web edition of Computer Lib. I know you’re not keen on the web because of its limitations, but Computer Lib and Dream Machines are important historical documents that ought to be preserved and made available to the public, and even a read-only hypertext copy would be a good start. People could at least easily link to it and comment on it then, and spread the ideas.)
Filed under: Business, Culture
March 29, 2011
When you have IBM Lotus Domino in your organization, sooner or later you come up with a requirement to move data between Domino and some other system–often a relational database. There are many ways to do this, and not much guidance is offered as to which to pursue, so here’s a summary of my own experience.
LEI
LEI is Lotus Enterprise Integrator. It’s basically a general purpose solution for pumping data between Domino and other systems–typically relational databases, but you can also use it to pump data between non-replicating Domino servers.
With LEI, you specify data pumping tasks in a database, schedule how often they should run, and the LEI server add-in moves the data for you. If it’ll do what you need to do, it’s very convenient. However, it does require a special LEI task to be installed on the server and run at server startup.
My experience is that LEI works well until you need to change data structures during the pump operation–say, to summarize information from response documents into the main document each time it’s copied, or to turn multi-valued fields in Notes into multiple records in a destination relational table. At that point, the limitations of its scripted activities become apparent, and you tend to need a more specific tailored solution. That said, I haven’t tried LEI 8 yet, and it may be more capable now.
ODBC
The first option anyone with a Microsoft background will think of is ODBC. It requires the NotesSQL driver, which is only supported on Windows–no unixODBC option. That’s a non-starter for me, as I don’t want to have to keep a Windows system up and running just to provide back-end database connectivity.
If that’s not enough to dissuade you, consider that you’ll also need to write some Windows software to perform the ODBC operations, so you’ll need some Microsoft dev tools, and you’ll be writing platform-specific code for ODBC access from C, or possibly C# if you’re lucky.
Oh, and reportedly the NotesSQL driver only supports strings up to 256 characters in length, which would also have made it completely useless for all my projects so far.
CORBA
The next option, working roughly in order of descending age, is CORBA. Once upon a time, CORBA was going to be the universal glue to take any client program, written in any language on any platform, and connect it to any server on any platform.
The nice thing about CORBA is that the same code works whether you’re running client and server on the same system, or running them connected across the Internet.
One of the reasons CORBA didn’t catch on is that most documentation makes it sound incredibly complicated. It really isn’t. You download the jar files, place them in your Java run-time’s path, then write Java code more or less as if you were writing a Domino agent in Java and accessing the database directly. CORBA abstracts away all the details of database drivers and network protocols.
The downside of CORBA is that the server needs to be reconfigured to support it, and the appropriate ports need to be opened for the DIIOP server task.
Although CORBA doesn’t require design changes to the database, you need to bear in mind the general performance issues for agent-based solutions–see below.
Java (notes.jar)
This is basically the same as using CORBA, but without the remote network access. You write a Java agent, and run it on the server with notes.jar in the JRE runtime path. You no longer need to configure the server for DIIOP access, but that’s because you now have to run your client code on the server. Same agent performance issues apply.
SOAP
Originally, SOAP stood for “Simple Object Access Protocol”. The expansion was quietly dropped when version 1.2 of the spec was released, leaving SOAP as just a name. I hear that SOAP is truly horrible if you have to worry about the details, but fortunately Domino abstracts most of that away.
I’ve used SOAP, but I don’t want to use it again. To find out why, you can head over to my posting from 2009 about SOAP and Domino. Basically, every time you make any change to the Web Service API, even one that should have no impact on existing remote method calls, you have to regenerate and recompile the code for every single client that uses the Web Service, and redeploy the changed client code.
If that doesn’t put you off, SOAP isn’t that bad of a solution. However, another minor limitation is that it doesn’t support the full range of Domino data types, so you end up having to pack data (including multi-valued fields) into Strings and then unpack them again. Not ideal.
JDBC
The standard Java way to access databases is JDBC. Theoretically you can access Domino data this way. However, it’s done using a JDBC wrapper over ODBC. This means you get all the disadvantages of ODBC, with the added fragility of a wrapper layer.
HTTP and XML
If you only need to get data out, and the database is web-enabled, there are Domino URLs to extract view contents in XML format. Basically, take the URL of any view, and replace the ?OpenView with ?ReadViewEntries. You should get an XML version of the contents of the view. Check the Domino help for more information.
The main problem with this approach is that you can’t get at rich text information. Also, the normal restrictions on number of view entries returned still apply, so you can end up having to page through the data via multiple HTTP requests.
This technique doesn’t strictly need design changes to the database if you happen to be lucky enough to find a web-accessible view that suits your needs. Typically that won’t be the case, and you’ll have to add a view.
HTTP and JSON
You can add Outputformat=JSON to ReadViewEntries URLs and get view data in JSON. Very handy for AJAX. Same issues as for HTTP and XML above, though.
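As a rough illustration of the URL rewriting for both the XML and JSON variants, here is a small helper. The class and method names and the host in the usage note are made up, and the Start and Count paging arguments are written as I recall them from the Domino URL command documentation, so treat them as an assumption and check the Domino help before relying on them.

```java
final class ReadViewEntriesUrl {
    /**
     * Rewrites a Domino view URL (for example one ending in ?OpenView) into a
     * ?ReadViewEntries URL. Start and Count page through the entries (assumed
     * argument names); json appends the OutputFormat argument.
     */
    static String toReadViewEntries(String viewUrl, int start, int count, boolean json) {
        int q = viewUrl.indexOf('?');
        // Drop whatever command the URL carried and keep only the view path.
        String base = q >= 0 ? viewUrl.substring(0, q) : viewUrl;
        StringBuilder sb = new StringBuilder(base).append("?ReadViewEntries");
        sb.append("&Start=").append(start);
        sb.append("&Count=").append(count);
        if (json) {
            sb.append("&OutputFormat=JSON");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical server, database and view names.
        String view = "http://domino.example.com/docs.nsf/ByDate?OpenView";
        System.out.println(toReadViewEntries(view, 1, 100, false));
        System.out.println(toReadViewEntries(view, 101, 100, true));
    }
}
```

A client would fetch each URL in turn, advancing Start by Count until a page comes back with fewer entries than requested.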
Files in a directory
Domino agents can dump data into files in a directory. You can then have some other task on the server that does something with those files. You basically have an unconstrained choice of file format, so long as you’re willing to write LotusScript or Java code to generate the format you want.
The main downside is that you need access to write random files into a directory on the server. This means the agent has to be given unrestricted access. Also, the usual agent performance caveats apply (see below).
DB2
Domino 8 and up can use DB2 as a data store, instead of Domino’s own NSF storage format.
This is pretty much the holy grail as far as integration goes–you can use a Type IV JDBC connector to connect straight to DB2, do SQL queries at high speed, and so on. However, it requires that DB2 be installed on the server, that you be given access to DB2 JDBC services, and that Domino be configured to use DB2 as data store. I’ve never been in this lucky position yet, so I can’t say too much about how well things work.
C/C++
Lotus offers a C API. You can do pretty much anything with it that can be done in a Domino agent. The big downside, of course, is that you have to do it in C.
There’s also a C++ API, but it doesn’t seem to be supported for Domino 8.5 and up.
OLE
It’s possible to use OLE on Windows to access Notes/Domino data. This allows you to use pretty much any scripting language that runs on Windows, including Ruby, Python, Perl, and so on. The downside, as with ODBC, is that you are irrevocably wedding yourself to Windows.
Mathew’s hybrid HTTP and XML solution
So what do I do? I use a modification of the “HTTP and XML” approach.
Step 1 is to obtain a list of all the UNIDs of documents modified since the data pump last ran. To do this, the data pump calls a web agent, supplying a date and time in ISO 8601 format. The agent opens a view which lists all relevant documents by last modification, formatted as an ISO 8601 date/time string. The agent uses NotesView.GetViewEntry(datetime, False) to find the first value which comes after the supplied date/time value in alphabetic sort order. Since ISO 8601 format sorts alphabetically into date/time order, this is also the first entry with a modification date/time larger than the supplied date/time value. The agent then iterates to the end of the view, printing the UNIDs of every document, which I put in the second column of the view so that the NotesDocument object does not have to be accessed.
I could do much the same using the ReadViewEntries URL, but if I did that I would need to make the client page through view entries N at a time, backwards, until hitting a cutoff date. That’s harder work.
Step 2 is that the client calls a second web agent, passing it the UNIDs one at a time in the URL. The agent uses NotesDatabase.GetDocumentByUNID to fetch each document and dump it out in XML format via NotesDXLExporter. Since Domino transparently handles persistent HTTP connections, this is very easy to code.
My approach needs two agents and a view. However, it doesn’t require any special access or server configuration changes. You can write your client in any language you like, and run it on any platform you like. You can also change your database design without breaking it. Obviously if you change field names or contents in a way the destination system doesn’t understand then you’ll break things, but nothing else should cause a problem.
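The sorting property that Step 1 relies on can be sketched outside Domino. The code below is an illustration in plain Java, not the agent itself: the rows stand in for the view's two columns (timestamp, then UNID), the UNIDs are placeholders, and the names are made up. It assumes every timestamp uses the same fixed-width ISO 8601 layout in the same time zone (UTC here), which is what makes string order match chronological order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ModifiedSince {
    /**
     * rows holds {timestamp, unid} pairs sorted ascending by timestamp, like the
     * view. Finds the first row later than cutoff by plain string comparison,
     * then returns every UNID from there to the end.
     */
    static List<String> unidsModifiedAfter(List<String[]> rows, String cutoff) {
        int lo = 0;
        int hi = rows.size();
        while (lo < hi) { // binary search for the first timestamp > cutoff
            int mid = (lo + hi) >>> 1;
            if (rows.get(mid)[0].compareTo(cutoff) > 0) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        List<String> result = new ArrayList<String>();
        for (int i = lo; i < rows.size(); i++) {
            result.add(rows.get(i)[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"2011-03-28T23:59:59Z", "UNID-1"},
            new String[] {"2011-03-29T08:00:00Z", "UNID-2"},
            new String[] {"2011-03-29T10:30:00Z", "UNID-3"});
        System.out.println(unidsModifiedAfter(rows, "2011-03-29T08:00:00Z"));
    }
}
```

Because the comparison is purely lexical, mixing time zone offsets or dropping leading zeros would break the ordering, so the view column formula has to produce one consistent format.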
Agent performance
There are caveats about Domino agent performance that apply to all of the data I/O solutions involving LotusScript or Java code.
Any access to a NotesDocument object causes the entire document to be loaded from the database. This is slow, so you want to avoid doing it unless you know you need to read that document–for example, to copy the data to some other system.
So for performance, you should try to make sure that you can work out which documents you need to load, by accessing only NotesViewEntry and NotesViewEntryCollection objects. Typically you’ll want to build a view which contains all the documents you want to access, in some suitable ordering.
If you’re planning to mirror all data into another system, for example, you’ll want a view ordered by last modification date. You’ll load a NotesViewEntryCollection from it, and iterate through the NotesViewEntry objects, looking at the ColumnValue array to read the modification date/time. If it’s more recent than the copy of the document in the other system, then you’ll access NotesViewEntry.Document to actually load the document from the database.
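The shape of that loop can be modelled without the Domino classes. In the sketch below, Entry is a hypothetical stand-in for a view entry: its unid and modified fields play the role of column values, and the Supplier stands in for the expensive NotesViewEntry.Document access. None of these names come from the Domino API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

final class LazyMirror {
    /** Hypothetical stand-in for a view entry: cheap column values plus a costly loader. */
    static final class Entry {
        final String unid;
        final String modified;               // ISO 8601, same format and zone throughout
        final Supplier<String> loadDocument; // stands in for loading the full document

        Entry(String unid, String modified, Supplier<String> loadDocument) {
            this.unid = unid;
            this.modified = modified;
            this.loadDocument = loadDocument;
        }
    }

    /**
     * Loads only the documents that are missing from the mirror or newer than the
     * mirrored copy. mirrored maps UNID to the timestamp of the copy held elsewhere.
     */
    static List<String> loadStale(List<Entry> view, Map<String, String> mirrored) {
        List<String> loaded = new ArrayList<String>();
        for (Entry e : view) {
            String copyTime = mirrored.get(e.unid);
            if (copyTime == null || e.modified.compareTo(copyTime) > 0) {
                loaded.add(e.loadDocument.get()); // the only place a document is touched
            }
        }
        return loaded;
    }
}
```

The point of the design is that the comparison uses only data already in the view index; the loader runs once per document that actually needs copying.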
So, what if you can’t build a view, perhaps because you don’t own the database or don’t have design access?
One option is to use NotesDocumentCollection to iterate. However, this will basically mean pulling a copy of every document every time, so you’ll find that as the database gets bigger, you rapidly run into problems.
Another option is to use a search to narrow down the NotesDocumentCollection so that it only includes documents you probably need to access. This involves using either NotesDatabase.Search or NotesDatabase.FTSearch to build your document collection. Of the two, Search is probably the better bet, especially if you don’t know whether the database will be full text indexed. It’ll be slow, but it probably won’t be as slow as reading every document.
So in summary:
Summary
| Method | Requires server config changes | Requires database design changes | Platforms | Limitations |
| --- | --- | --- | --- | --- |
| LEI | Yes | No | Any Domino platform | Special LEI task must run on server. Limited data remapping flexibility. |
| ODBC | No | No | Windows | Max 256 characters per field. |
| CORBA | Yes | No | Any Java platform | None. |
| Java (notes.jar) | No | No | Any Domino platform | Client code must run on server. |
| SOAP | No | Yes | Any Java platform | Many Domino types must be transported as Strings. |
| JDBC | No | No | Windows | “Same as for ODBC, as it’s just a wrapper.” |
| HTTP and XML | No | Yes (typically) | Any | No rich text field access. |
| HTTP and JSON | No | Yes (typically) | Any | No rich text field access. |
| Files in a directory | Yes | Yes | Any Domino platform | Client code must run on server. |
| DB2 | Yes | No | Any Domino platform | None. |
| C/C++ toolkit | No | No | Any Domino platform | None. |
| OLE | No | No | Windows | None. |
| MHHXS | No | Yes | Any | None. |
Special case easy options
There are a few special cases where getting data out of Domino is incredibly easy.
- E-mail access: Use IMAP.
- One-time export of data from a view: File menu in the Notes client, Export. Select tabular text or CSV.
- One-time export of everything: File / Export again, and choose Structured Text. You’ll get something a bit like an RFC822 mbox.
Any options I’ve forgotten? Let me know.
Filed under: Domino, Java, Linux, LotusScript, Programming
March 23, 2011
Back in 1978, when the TeX project began, there were no scalable fonts. If your printer supported 10 point and 12 point text, those were the two sizes you could use in your documents. Even when the Macintosh came along, you still had a fixed set of text sizes, unless you were rich enough to have a PostScript laser printer.
TeX, on the other hand, had METAFONT–a system for defining vector fonts, which could then be rasterized at any size, for any printer that could handle a bitmap image. So for years, TeX was the only way a lot of people could create documents using a wide range of typeface sizes and styles.
Then around 1990, Adobe Type Manager (ATM) went on the market. If you installed it on your Macintosh or Microsoft Windows system, you could use PostScript fonts on your desktop at any size, and ATM would handle the rasterization for your screen–and for your printer, if it wasn’t a PostScript printer. Suddenly you could have any font size you wanted, anywhere.
Fast forward 20 years, and vector fonts are ubiquitous; even my phone has vector fonts. Type 1, TrueType, OpenType… but not METAFONT. Like Plain TeX, METAFONT was way too complicated for most graphic designers. So nowadays, like everyone else, I want to be able to use the dozens of attractive vector typefaces installed on my computer, instead of the anemic selection of METAFONT fonts.
Then there was TeX’s way of getting output to the printer. First you ran your source TeX file through TeX itself; that spat out a DVI file with just the typeset text. Then you took the DVI file, and ran it through a converter which would insert the diagrams from their files, embed a set of fonts generated by METAFONT, and output a PostScript file. Then you either sent your PostScript file to your PostScript printer, if you were lucky enough to have one; or else you ran the PostScript through another converter (such as Ghostscript) to turn it into your printer’s proprietary printer dump format (such as ESC/P or PCL), and copied that to your printer. It was all tedious but necessary in 1990. Today, it looks insane.
Then there’s the whole character encoding issue. Back in 1990, the only thing that could reliably be translated between different computer systems was plain ASCII text, so TeX had macros for curly quotes, accented characters, typographic dashes, dingbats, non-Roman alphabets, and so on. Nowadays, every major OS supports Unicode, so you can put Cyrillic or Greek text in your document on the Mac and be reasonably confident it’ll still be there if you open the file in Linux.
So TeX has been adapted to the modern world. pdfTeX got rid of the DVI stage, and went straight from TeX source files to PDF files; it later added support for microtypography. XeTeX got rid of METAFONT fonts, and allowed direct access to OS-installed vector fonts; it also added Unicode support. LuaTeX extended pdfTeX by adding Lua as an extra scripting language, allowing more complicated functionality to be supported than was feasible with TeX macros, and also added OS font support. LuaTeX has now been adopted as the successor to pdfTeX.
Since I wanted the most up-to-date Unicode and font support, I set about trying ConTeXt on Linux and Mac, using both LuaTeX and XeTeX. I created a document using a custom set of fonts–one OpenType, one Type 1, and one TrueType.
XeTeX was far easier to set up. Whereas LuaTeX requires that you set up environment variables and run a script to scan your fonts, XeTeX just calls the OS font routines. However, LuaTeX gets points because it’s the only supported engine for the latest ConTeXt release, Mark IV.
On the minus side, XeTeX on Linux is a bit buggier than on the Mac, as it’s a recent port; I had trouble with some Type 1 fonts. LuaTeX is not without bugs either: the ConTeXt mailing list recently discussed a bug triggered by a 5,000 page document, which caused LuaTeX to crash on page 3,987. That bug is fixed in a recent beta. Meanwhile, if anyone wants to try assembling a 4,000 page document in Microsoft Office, I’d be interested to know if it’s possible.
Since my documents are well under 3,987 pages, I’ve been happy with LuaTeX so far. So I had picked my TeX platform: ConTeXt running on LuaTeX. Now I had to sort out my macros and other necessary tools.
Filed under: TeX
March 15, 2011
When I posted that I was going back to using TeX, I mentioned that TeX had changed a lot in 20 years, but didn’t really go into too many details. Time to remedy that.
TeX is two layers of software. Underneath is the core of TeX, written in a variant of Pascal. These days it gets translated to C before being compiled to a binary. On top of the TeX core, you have a set of macros which provide all the handy \commands you use to typeset documents.
There are a number of different macro packages. Knuth’s own is known as “Plain TeX”; it’s what I used back when I last wrote TeX documents. It’s extremely flexible, and I managed to make it format my dissertation in a way that was so un-TeX-like that the examiners asked what I used. Unfortunately, Plain TeX is rather ugly to use. For example, here’s the code of a Plain TeX macro for placing two pieces of text side by side:
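% \xsplit{width1}{gap}{width2}{text1}{text2}: set two texts side by side
\def\xsplit#1#2#3#4#5{{
  \setbox1=\vbox{\hsize= #1 #4}
  \setbox2=\vbox{\hsize= #3 #5}
  % Pad the shorter box with \vfill so both boxes end up the same height
  \ifdim\ht2>\ht1
    \setbox1=\vbox to \ht2{\hsize= #1 #4 \vfill}
  \else
    \ifdim\ht1>\ht2
      \setbox2=\vbox to \ht1{\hsize= #3 #5 \vfill}
    \fi
  \fi
  \hbox{\box1\hskip#2\box2}}}
% \split{width1}{width2}{text1}{text2}: the gap takes whatever width remains
\def\split#1#2#3#4{
  \dimen1=\hsize
  \advance\dimen1 by -#1
  \advance\dimen1 by -#2
  \xsplit{#1}{\dimen1}{#2}{#3}{#4}}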
I remember spending most of a summer vacation poring over a copy of Knuth’s The TeXbook, slowly assembling my own set of macros. I based my page layout on that of Apple’s Macintosh Human Interface Guidelines. I had found the book particularly readable, and reasoned that Apple probably knew what it was doing as far as designing page layouts for technical documentation. Bending TeX to my will wasn’t much fun.
It’s a problem lots of people had. So to get away from all that, Leslie Lamport wrote an alternate set of macros called LaTeX. He provided standard templates for letters, technical articles, reports, books, and overhead projector slides. (Kids: Ask an old person what an overhead projector was.) LaTeX also had macros for bibliographies, tabular data, simple diagrams, indexes, and pretty much everything else needed for academic documents. Its book–LaTeX: A Document Preparation System–was about half the thickness of Knuth’s book, more focused on end users, and came with a handy quick reference card.
LaTeX spread through academia faster than Far Side cartoons. It was particularly popular with mathematicians, physicists, chemists, computer scientists, and anyone else who needed to be able to typeset mathematical equations. (It was also about the only way to typeset Tibetan in 1990, which led to my helping out some humanities students.) LaTeX is probably the most popular TeX macro package. There’s even a full graphical editor for LaTeX, so you can avoid the markup language entirely. If you have a mental image of what a TeX document looks like, chances are it’s the look of a standard LaTeX template.
I had toyed with returning to TeX a year or so ago, and picked up a copy of Lamport’s book in order to try out LaTeX. There’s a problem with LaTeX, though. If you don’t care about page design, it makes it really easy to put together a document that looks exactly as specified by its templates; but if you want to design your own page layout from scratch, you quickly enter a world of pain. That’s probably why almost all LaTeX documents look alike.
So I did some more research, and decided to try a newcomer to the TeX package market: ConTeXt. Its development started in 1990; it attempts to make TeX behave in ways familiar to people used to modern DTP packages. It’s more flexible than LaTeX, yet easier to use than Plain TeX. Here’s how you put two paragraphs side by side in ConTeXt:
\defineparagraphs[sidebyside][n=2]
\setupparagraphs[sidebyside][1][width=.45\textwidth]
\setupparagraphs[sidebyside][2][width=.45\textwidth]
\startsidebyside
First paragraph goes here.
\sidebyside
Second paragraph goes here.
\stopsidebyside
It’s still a bit long-winded, but that’s because it’s completely general. The first chunk is the setup; you can adjust the number of columns, define each column’s width differently, give different columns different text styles, and so on, and give each setup its own name and pair of \macros to apply to your paragraphs, as in the second chunk of text.
So, I had chosen a macro package; but there were more decisions to make…
Filed under: TeX
March 10, 2011
Last week I had a bad experience with several pieces of office software.
It started with a simple enough task: I had some existing documentation, and I needed to extend the “How to perform common tasks” section. There were two sub-headings to add, each of which needed a few bulleted paragraphs of instructions.
I fired up LibreOffice, opened the document, and started typing–but something was wrong. When I clicked to turn my instructions into a bulleted list, the indentation was wrong. It didn’t match the similar bulleted lists above or below in the document.
I assumed I had mixed up the styles somehow, perhaps applying bullets to the wrong paragraph styles, so I checked the style of the correct paragraphs, then applied that style to the new ones. Still wrong.
Perhaps my correct paragraphs had somehow been modified? I selected both the correct and incorrect ones, and applied the style they were supposed to be in. They stayed determinedly different.
I tried fiddling with the rulers manually, then re-applying the styles. The rulers reset to the tabbing defined in the styles, but the text continued to be indented incorrectly. I even tried creating a whole new style, in case the style definition had somehow become corrupt. Still the newly typed text would not match indents with the existing paragraphs.
Finally, I pondered the possibility that LibreOffice had a pretty severe formatting bug. I opened the document in IBM Lotus Symphony, which was forked from OpenOffice 1.x and (of course) reads and writes standard ODF documents. Hurrah! My paragraphs were all indented correctly!
I thought I was done. I made a few last edits, went to the top of the document, and refreshed the table of contents.
Oh dear.
Now my table of contents had garbage in it: entries with a page number and no text, pointing at pages that had no headings on them (including the front cover).
I cussed, saved the document, and quit Symphony. I opened the document in LibreOffice again, and refreshed the table of contents–now it looked right once more. I nervously scrolled down, and was somewhat surprised to discover that my paragraphs were all indented correctly. I hurriedly saved out a PDF before anything else went wrong.
So it was that I was forcibly reminded of how much I hate office suites, and WYSIWYG word processing in general. That well-known commercial office suite from Redmond is no better than the ODF gang; a few quick searches will unearth countless tales of horror involving corrupt files, misnumbered pages, font problems, and so on.
Meanwhile, I had another piece of documentation to start writing, and the thought of doing so in LibreOffice now filled me with a mixture of rage and dread.
I’m not a Luddite; I quite like GUI software when it works. On the Mac, Apple’s iWork suite does a good job, at least with the kind of short document I find myself writing. However, I don’t have a Mac at work, so that wasn’t an option. So I returned to a piece of software I hadn’t used in 20 years: TeX.
Madness? Perhaps. I appreciate that TeX is perhaps not a tool for everyone; though it’s really no worse than word processors of the 80s, which ordinary people nevertheless learned to use. I suspect that most people just don’t have the patience any more. They don’t want to sit down and learn something, which is why we have Microsoft Word, and why hardly any office documents even use the style system properly. Which, I suppose, is how a major formatting bug gets into released code without being noticed.
It turns out that quite a lot has happened in the TeX world in the last couple of decades. Obviously computers have gotten a lot faster; I remember watching my 8MHz Atari ST grinding away at my dissertation. As I recall, it took about a second per page to typeset my text to DVI; I would then wait about 2 seconds per page to flip through and inspect the pages in the DVI viewer.
These days the console output scrolls past too quickly to read. TeX dumps its output directly to PDF, which loads instantly into an integrated PDF viewer, which of course supports antialiased text. Plus, my screen has four times the resolution. So even before considering software features, it’s a very different experience these days.
So what of the software? Modern TeX supports OpenType fonts, and accepts diagrams in PDF or bitmap format. It handles Unicode, so you don’t need to use escapes for accented letters and other special characters. It can place hypertext links in your PDF documents, and use color and other effects. And yet, TeX documents written in 1980 are not merely readable today; they can still be typeset into beautiful documents.
So from my point of view TeX has a lot of advantages:
- The source files are plain text, so I can stick them in bzr for version control.
- I can edit my documents with vim, meaning I can easily do things like searching and replacing control sequences and styles.
- The output is of higher quality than any office suite. CTAN has a sample of text typeset by Microsoft Office, next to the same text typeset by TeX. While the differences are subtle, the overall effect is that the well-typeset text is simply less fatiguing to read.
- Web page captures look a lot better in TeX than in a word processor. I can render a web page to PDF using wkhtmltopdf, crop out the bit I want, and pull it into my document–and because all the text stays as vectors, it’s searchable and looks good at any resolution. Much better than using a bitmap screen capture, and good luck pulling SVG or PDF into Office.
- Similarly, I can draw diagrams in Inkscape, save as PDF, and pull straight into my document. Much better than trying to draw anything with the tools provided in an office suite.
- I can be confident that what I write today will be readable, editable and printable for years to come, by anyone willing to put in the time to learn how to install and run TeX and edit a text file in a markup language that’s really no worse than HTML.
- TeX stays out of the way. It’s like writing HTML or wiki pages; you don’t need to mess with menus and mice to start a new section, you just type \section{My new section} and carry on.
- I can write my own macros. For example, I quickly built a macro for web site references. I provide the URL and title; the macro handles formatting the title to show that it’s a link, making it clickable, pointing it at the URL, adding a footnote, and placing the URL in the footnote in monospaced font for people who have printed the document.
- It scales. While office suites can start to crap out at a hundred pages, TeX can easily deal with thousand page reference manuals.
- It’s bug-free. Core TeX hasn’t had a significant bug found in it in years, and the length of the complete known bug list for the TeX Live DVD speaks for itself.
- It’s free software, it runs anywhere, and on today’s computers it’s fast.
If necessary, I can always turn my TeX into HTML, RTF, ODF, or whatever. But for now, I’m going to try writing new documents in TeX.
Want to see what TeX can do these days? Try some pages from a German chess book typeset with it, or take a look at sample spreads from humanities books typeset with TeX.
Filed under: TeX