Deferring Index costs for table to table copies in PostgreSQL

When bulk copying data to a table, it is much faster if the destination table is index and constraint free, because it is cheaper to build an index once than maintain it over many inserts. For postgres, the pg_restore and SQL COPY commands can do this, but they both require that data be copied from the filesystem rather than directly from another table.

For table to table copying (and transformations) the situation isn’t as straight-forward. Recently I was working on a problem where we needed to perform some poor-man’s ETL, copying and transforming data between tables in different schemas. Since some of the destination tables were heavily indexed(including a full text index) the task took quite a while. In talking with a colleague about the problem, we came up with the idea of dropping the indexes and constraints prior to the data load, and restoring them afterwards.

First stop: how to get the DDL for indices on a table in postgres? Poking around the postgres catalogs, I managed to find a function pg_get_indexdef that would return the DDL for an index. Combining that with a query I found in a forum somewhere and altered, I came up with this query to get the names and DDL of all the indices on a table. (this one excludes the primary key index)

With that and the query to do the same for constraints its straightforward to build a helper function that will get the DDL for all indices and constraints, drop them, yield to evaluate a block and then restore the indices and constraints. The method is below:

Use of the function would look like the snippet below. This solution would also allow for arbitrarily complex transformations in Ruby as well as pure SQL.

For my task loading and transforming data into about 20 tables, doing this reduced the execution time by two-thirds. Of course, your mileage may vary depending how heavily indexed your destination tables are.

Here’s the whole module:

Creating thumbnails of PDFs with attachment_fu

We needed to create some thumbnails from uploading PDF files for a new site feature – We’re using attachment_fu which doesn’t support that (yet?), but we’re using RMagick as our processor and it understands PDF files.

I came up with the hack below (warning, first draft, only briefly tested) which works without having to modify the attachment_fu plugin itself. One day I’ll loop back and figure out a cleaner way to do this and see which of attachment_fu’s other image processors can even support pdfs.

There are three methods to override to make a go of this:

  1. self.image? : consider pdf files as an image so thumbnail process will happen
  2. thumbnail_name_for : change the extension of the saved thumbnail filename to png
  3. resize_image: override to change format via block passed to to_blob

Apologies for the crappy source formatting, I have to install a plugin to do that well one of these days


###Hacks to allow creation of png thumbnails for pdf uploads - depends on RMagic being the configured processor
## likely very fragile

def self.image?(content_type)
(content_types +  ['application/pdf']).include?(content_type)
end

alias_method :original_thumbnail_name_for, :thumbnail_name_for
def thumbnail_name_for(thumbnail=nil)
return original_thumbnail_name_for(thumbnail) unless (content_type == 'application/pdf' && !thumbnail.blank?)
basename = filename.gsub /.w+$/ do |s|
ext = s; ''
end
"#{basename}_#{thumbnail}.png"
end
#copied from rmagick_processor with change in last few lines
def resize_image(img, size)
size = size.first if size.is_a?(Array) && size.length == 1 && !size.first.is_a?(Fixnum)
if size.is_a?(Fixnum) || (size.is_a?(Array) && size.first.is_a?(Fixnum))
size = [size, size] if size.is_a?(Fixnum)
img.thumbnail!(*size)
else
img.change_geometry(size.to_s) { |cols, rows, image| image.resize!(cols<1 ? 1 : cols, rows<1 ? 1 : rows) }
end
img.strip! unless attachment_options[:keep_profile]
if content_type == 'application/pdf' # here force the output format to PNG if its a pdf
self.temp_path = write_to_temp_file(img.to_blob {self.format = 'PNG'})
else
self.temp_path = write_to_temp_file(img.to_blob)
end
end

Livepipe UI controls for JS

I’ve been pretty pleased so far with the popup window control from LivePipe. It plays nice with Prototype and is easy to style with regular CSS. We had considered using Prototype Window but I was put off that all their default styles looks like operating system windows and restyling their windows required a table and 9 images.

I’d recommend anyone looking for a popup window solution at least consider Livepipe. There are downsides however, chiefly that the project is pretty immature – technically I suppose this is an alpha release since Beta One is being worked on, so the community remains small. While there are some folks already submitting patches, progress on merging the patches is alarmingly slow, as one can see from their lighthouse page.

If you’re doing RESTful stuff in Rails however, you will need the contents of ticket #10 which modifies the popup window to accept an option to use different HTTP verbs.

Scriptaculous docs

Every now and then my iPhone has this issue where it can’t tell time properly. I wake it up, and it shows me a time several hours ago, then as if waking from a drunken stupor, slowly tries to catch up to reality, moving the clock forward by a small, random number of minutes. During these episodes the whole UI is sluggish, and it apparently doesn’t even accept phone calls. When “phone” is 5/6 of your name one would think at least that would work all of the time!

Check out this screenshot from the missed call sheet. It recorded 3 missed calls that arrived over the course of an afternoon all with the exact same arrival time, 9:40 AM. The phone never rang.

That was with v2.01, so I sure hope this is fixed in the future.
For all the complaining I often do about the poor documentation of the scriptaculous project, I finally did something to help that today, creating (very thin) documentation for their new (if released in January is new) Effect.Tween function here on their github wiki.

I was creating a method to scroll the viewport so that the contents of an AJAX-loaded div would be fully visible on the screen – the (still undocumented) Effect.ScrollTo doesn’t quite do it because it doesn’t consider the height of the element it scrolls to, but in doing so I stumbled over Tween in the code. Once the math to figure out how much scrolling is needed, its easy to use Effect.Tween to smoothly scroll the window by repeatedly calling window.scrollBy();

This certainly isn’t rocket science, but here’s an outline of how to do it (this code only deals with downward vertical scrolling):

var elementHeight = element.getHeight();
var screenHeight = document.viewport.getHeight();
var elementScreenPos =element.viewportOffset()[1];
var amountToScroll = elementHeight - (screenHeight - elementScreenPos);
if (amountToScroll > 0){
var scrollPos = document.viewport.getScrollOffsets().top;
new Effect.Tween(null,scrollPos,scrollPos+amountToScroll,{},function(n) { window.scrollTo(0,n);});
}
}

The short circuit article

About three and a half years ago I co-wrote an article named “Increase stability and responsiveness by short-circuiting code” for IBM’s developer works site, and for some reason in the past few days it has repeatedly asked for attention (“hi this is 2004, your article is on the line, and its woefully dated”). First, the one page abstract we submitted fell out of a book on my bookshelf, then I was asked about it at at least one interview. Sadly, that was one of the top results for my name in google for a while and people still find it.
I figure its about time to revisit, and disavow, the implementation in the article, if that isn’t already obvious to anyone.

The idea was to provide a way to time-box operations that could take an unknown amount of time. In this way for example, a web page that must be displayed faster than a certain time can be guaranteed to run in that time, if it can do without the results of operations that take too long to execute.
One obvious flaw is that the code creates LOTS of new threads for a short period of time. It should have used a thread pool to reduce that churn.
The best reason not to use that code is that Java 1.5 introduced a whole set of Concurrency utilities. ExecutorService and Future. There are lots of examples about, so you can check them out.

The high level view is that you package your functionality in a Runnable or Callable (depending if you need to return a result), submit it to an instance of ExecutorService to run. It will return a Future object which can be queried to get the result. One can call get on the Future class, which will return right away if the task is done execuiting, or block until the sooner of a specified timeout or the task completing. Even better, one can submit multiple tasks at once with invokeAll(..) and that will return when all tasks are complete or the timeout has expired.

Interviews on Trivia

I’ve started interviewing again now that I should be finished with my Master’s degree in a month or so. I’m reminded again of the wide range of interview styles people use. My least favorite is the trivia test. This seems to happen more often with Java-related job interviews than Ruby-related ones.

I may never understand why any employer would value memorizing the Java API over being able to reference the docs and know where to find things.

I had such an interview just last week. Here are some of those questions

  • How do you execute a PL-SQL stored procedure from JDBC?
  • How do you import classes into the classpath of a JSP page? (apparently ‘no one in their right mind does that anymore’ isn’t a good answer to this one)

Who memorizes that stuff?

My favorite question of all was this : what are the two conditions under which a finally block is not called. I got one of them, (System.exit()) but the interviewer wouldn’t even tell me the other one (“You won’t learn that way”). I googled it later to find the answer not well defined. One of the ways I saw mentioned was the thread “dying” but Thread.stop() is severely deprecated so that shouldn’t ever happen. The other answer I saw floating around doesn’t really fit – when a exception is thrown from the finally block it doesn’t complete, but the finally block is still called.

I was talking about this with Frank and he came up with another way: infinite loop in the try block. I then thought of calling PowerSystem.getMainPower().setPosition(OFF).

Now I can’t wait to get that question again!

Ruby operator precedence (the ors and ands of it)

I found out (by introducing a bug into the application I’ve been working on) that “or” and “||” do not have equal precedence in Ruby.

More importantly, the assignment operator “=” has higher precedence than “or” so that means that while the expression


>> foo = nil || 2
=> 2
>> foo
=> 2

results in foo being assigned the value 2 as you might expect, the following expression leaves foo assigned the value nil.


>> foo = nil or 2
=> 2
>> foo
=> nil

This is well covered ground online (see this post) but I was surprised that this oddity didn’t warrant an explicit mention in the operator precedence section of the Pickaxe book.

Late to the cloud computing party

I saw on nytimes.com today “I.B.M. to Push ‘Cloud Computing, ’ Using Data From Afar“. The idea is to make it easier to process large volumes of data using centralized computing capacity in the “cloud”. Despite having several hundred phds working on this, they’re offering the exact same software (Hadoop) and vaunted “google programming model” as you can already use, (at 10 cents per server-hour) on Amazon’s Elastic Compute Cloud.

When you’re out innovated and out maneuvered in bringing a (open source) technology to market by a book company, its no wonder your stock price is basically stagnant.

I recently had the opportunity to use Amazon EC2 to process ~35 gigs of log files for someone and it was quite easy, costing me about $15 in data transfer and CPU time fees, and I bet it would have been cheaper/faster if I’d taken the time to use something like hadoop.

Boston Ruby User’s Group meeting

I attended my first Boston Ruby User’s group meeting earlier tonight. I wasn’t sure what to expect exactly, but I was surprised how many people attended (in the neighborhood of a hundred I would guess).
Both of the speakers were quite interesting.

  • David Black gave an interesting talk on the way Ruby implements inheritance with a particular emphasis on giving objects that “spring from” the same class different behaviors without defining additional classes.
    Learned a lot from this exercise in meta programming because I’ve really only dabbled in Ruby so far.
  • Zed Shaw had a really energetic, engaging and entertaining presentation touching on his http server, Mongrel, its competitors, evildoers and anti-social behavior on the internet and how he aims to address that with his Utu project

The sessions were video taped so they’ll apparently be up on Google video sometime soon. You don’t really have to know or care about ruby to enjoy and learn from Zed’s talk.

One of the great things about living somewhere like the Boston area is that people I’ve heard of before show up at things like this – attendees of the meeting tonight included Martin Fowler and John Resig (wrote JQuery), along with many other folks much smarter than me.