Sunday, June 17, 2012

Using StarCluster for some heavy computing

Update: 2012-10-22

I recently put the scripts I used in a little GitHub repo called starcluster-image-sharpener. It's nothing big, but it'll get you started if needed. Go get them! ;)


First a little context:

I've photographed a ton recently, and on one occasion I screwed things up a bit and ended up with a lot of slightly out-of-focus images. I tinkered with the Canon Digital Photo Professional settings to find a good level of sharpening after noise reduction (a lot was shot at high ISO; for me and my 50D, ISO 1250 is a lot already) but wasn't happy with the result. I found reasonable settings for noise reduction, which makes the image a bit softer, but the sharpening wouldn't even compensate for that, let alone actually sharpen.

Since we're talking about roughly 1000 photos, hand-processing them is not an option. I've already identified some photos that need to be processed by hand but those are just a few.

So I turned to ImageMagick, my go-to tool for all image processing that needs batching or automation. It took me a few hours of experimenting with the convert settings to find a combination I liked: sharpened, but very subtly, and most importantly with none of the terrible rough edges that come from oversharpening.

The settings are:

user@local:/tmp$ convert -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -quality 100 infile.jpg outfile.jpg

Yes, you read that right, this is not a typo. The great thing about this command is that the result is really nice. The bad thing is that it's slow: -gaussian-blur is considerably slower than -blur (but it blurs the edges less than the areas, so it's better than -blur for my purposes here), and -sharpen is usually used with a radius of 0 (0x??), whereas 10x?? yields a much better result but takes a lot more time.
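As a rough sketch (not the script from the repo), the ten identical blur/sharpen passes could be generated in a loop instead of typed out by hand. The function names sharpen_args and sharpen_image are made up for illustration, and the option order simply mirrors the command above.

```shell
# Print the "-gaussian-blur 1x1 -sharpen 10x1" pair repeated N times (default 10).
# Note: the helper variables are plain globals to keep this POSIX sh friendly.
sharpen_args() {
    passes="${1:-10}"
    args=""
    i=0
    while [ "$i" -lt "$passes" ]; do
        args="$args -gaussian-blur 1x1 -sharpen 10x1"
        i=$((i + 1))
    done
    # Drop the leading space before printing.
    printf '%s\n' "${args# }"
}

# Sharpen one file with the same options and option order as the command above.
sharpen_image() {
    # $(sharpen_args 10) is left unquoted on purpose so the shell splits it
    # into separate options for convert.
    convert $(sharpen_args 10) -quality 100 "$1" "$2"
}
```

With this in place, `sharpen_image infile.jpg outfile.jpg` should run the same ten passes, and changing the number of passes becomes a one-character edit.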

So the above command takes close to 4 minutes to complete for a 12MB JPEG file on a quad-core 2.4GHz machine. ImageMagick already uses all cores, so there's not much room for improvement there. With roughly 1000 photos, is that going to be 4000 minutes (67 hours, almost 3 full days) then?

No, I won't accept that! Let's get out the big guns! I love big toys and I've basically waited for an opportunity like this to arrive. An opportunity to use the big instances on EC2, that is.

StarCluster to the rescue

StarCluster is a cluster computing framework built on top of Amazon EC2 and Sun Grid Engine, SGE for short (since the acquisition of Sun by Oracle it's of course called Oracle Grid Engine).

Why a cluster/grid, you ask? Simple: I want to run jobs in parallel and in a controlled fashion. SGE does just that. I can submit a gazillion jobs to the grid and SGE takes care that each node is busy but not overloaded.

This StarCluster thing is really easy to use. Setting up a cluster takes no time and almost no knowledge of clusters. You just configure your EC2 credentials and cluster characteristics (AMI, number of nodes, etc.), enter one command and lean back while StarCluster

  • launches the EC2 instances with the AMI you selected
  • adds a cluster user sgeadmin
  • sets up the hosts file with friendly hostnames
  • sets up passwordless login to all nodes via ssh public keys
  • configures the master and nodes into an actual cluster
  • and probably much more I didn't even notice
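To give an idea of how little configuration that is, here's a minimal sketch of a StarCluster config, written out by a small shell function. All values are placeholders or assumptions: the template name sharpener, the key name sharpener-key, the cluster tag sharpener-run and the AMI ID are invented for illustration, and this is not my actual config.

```shell
# Write a minimal StarCluster config to the given path.
# Every credential and the AMI ID below are placeholders that must be replaced.
write_starcluster_config() {
    cat > "$1" <<'EOF'
[global]
DEFAULT_TEMPLATE = sharpener

[aws info]
AWS_ACCESS_KEY_ID = YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY = YOUR_SECRET_ACCESS_KEY
AWS_USER_ID = YOUR_AWS_USER_ID

[key sharpener-key]
KEY_LOCATION = ~/.ssh/sharpener-key.rsa

[cluster sharpener]
KEYNAME = sharpener-key
CLUSTER_SIZE = 1
CLUSTER_USER = sgeadmin
NODE_IMAGE_ID = ami-REPLACE_ME
NODE_INSTANCE_TYPE = cc2.8xlarge
EOF
}

# Typical lifecycle once the config is in place (not run here):
#   starcluster start sharpener-run       # launch and configure the cluster
#   starcluster sshmaster sharpener-run   # log in to the master node
#   starcluster addnode sharpener-run     # add another node later
#   starcluster terminate sharpener-run   # shut everything down
```

StarCluster normally reads its config from ~/.starcluster/config, so `write_starcluster_config ~/.starcluster/config` would be the usual target once the placeholders are filled in.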

I ended up with a setup of just one EC2 cc2.8xlarge instance, plus a second one that I started and stopped as needed when I thought the inbox would get too big.

The Setup

So here's a little rundown of the setup. It is not exactly what I have running at the moment but what I'd do next time, because I've learned a few lessons. Anyway, here goes:

A StarCluster plugin installs ImageMagick on every fresh node because it isn't installed by default. (This time I installed it manually while the node was already operational, so a bunch of jobs failed and had to be resubmitted.)

My local notebook here at home orchestrates the whole thing. It has 2 shell scripts running: the first uploads the images into an inbox folder on the master; the second downloads each processed image from the outbox on the master once it hasn't been touched for 2 minutes (meaning processing has finished). The inbox and outbox are located in /home/sgeadmin; this home directory is shared across all nodes.

On the master node, a script iterates over all files in the inbox that haven't been touched for 2 minutes (meaning the sftp upload has finished), moves them to the pending folder and submits a job to the grid with qsub for each of them.
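Roughly, the master-side loop could look like the sketch below. It is an illustration under assumptions, not the script from the repo: the function names, the SUBMIT override and the job script name process_image.sh are all hypothetical, and it relies on find supporting -maxdepth and -mmin (GNU and BSD find do).

```shell
# Print regular files directly inside DIR that were last modified
# more than MINUTES minutes ago (default 2), i.e. uploads that have settled.
stable_files() {
    find "$1" -maxdepth 1 -type f -mmin "+${2:-2}"
}

# Move each settled upload from inbox to pending and submit one grid job for it.
# SUBMIT defaults to qsub; process_image.sh is a hypothetical job script name.
dispatch_inbox() {
    base="${1:-/home/sgeadmin}"
    stable_files "$base/inbox" 2 | while IFS= read -r file; do
        name=$(basename "$file")
        # Skip the file if the move fails, so no job is submitted for it.
        mv "$file" "$base/pending/$name" || continue
        ${SUBMIT:-qsub} -cwd "$base/process_image.sh" \
            "$base/pending/$name" "$base/outbox/$name"
    done
}
```

Run from cron or a simple `while sleep 60` loop, `dispatch_inbox /home/sgeadmin` would keep feeding the queue. Setting SUBMIT=echo prints the qsub calls instead of submitting, which is handy for a dry run.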

The script executing the job, called by the grid scheduler or whoever triggers execution, then processes the image in the pending folder and writes the output to the outbox, where the download script will pick it up. Finally, it removes the image from the pending folder.
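A minimal sketch of such a job step, assuming the processing command takes the input and output file as its last two arguments (as the convert call above does). The function name process_image is hypothetical. Keeping the input on failure is my own suggestion here, so that a failed job can simply be resubmitted.

```shell
# Usage: process_image INFILE OUTFILE COMMAND [ARGS...]
# Runs COMMAND ARGS... INFILE OUTFILE and removes INFILE only on success.
process_image() {
    infile="$1"
    outfile="$2"
    shift 2
    if "$@" "$infile" "$outfile"; then
        # Output is in the outbox now, so the pending copy can go.
        rm -f "$infile"
    else
        echo "processing failed, keeping $infile for resubmission" >&2
        return 1
    fi
}
```

For example, `process_image pending/a.jpg outbox/a.jpg convert -sharpen 10x1 -quality 100` would run convert with those options on the pending file and clean up afterwards.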

Initially, I had some problems with the cluster manager not putting the full load on the cluster node(s). This was because SGE checks the load average, and if a node goes above a certain threshold, no new jobs are sent to that node. This setting can easily be changed with qconf. After all, I'm paying for this machine by the hour, so please let this thing run full steam ahead!

Since I started this thing a few hours ago, I've had a constant stream of uploading, processing and simultaneous downloading of the results. Thanks to a handful of small scripts I've automated everything and can lean back while the entire thing is processed an estimated 4 times faster.

Incidentally, my upload rate and the rate of processing on the grid are roughly balanced, with the upload being slightly faster, so I can just add another node to the cluster when the inbox grows. When doing this, until I have the cluster setup plugin, I have to ensure the new node doesn't accept any jobs before ImageMagick is installed, which can be done via qconf.

So all in all a successful experiment, and I've learned a lot today. And it's fun to play with heavy machinery like this, even when it's located somewhere far away in Virginia ;)

Update 2012-06-18: The numbers are in: AWS charged me for 12 hours of total cc2.8xlarge instance time, which cost about $28.80. Well worth it in my opinion for an almost 6x speedup compared to running this stuff locally. Next time I'll use the resources better: some of the time was 'unproductive' and went into setting up and testing all the scripts (for example, I had to learn that calling qsub via ssh isn't as easy as it may sound, because you need to use the full path and manually set up some environment variables), and some performance was lost due to me tinkering with a second node and an only half-filled job queue.
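For the record, here's the back-of-the-envelope arithmetic behind those numbers as a tiny sketch. It only uses figures from this post (roughly 1000 photos, about 4 minutes each locally, 12 billed hours, $28.80); the hourly rate is simply derived from the bill, and the function name run_summary is made up.

```shell
# Print the local estimate, the resulting speedup and the derived hourly rate.
run_summary() {
    awk 'BEGIN {
        local_hours = 1000 * 4 / 60          # about 1000 photos at about 4 minutes each
        printf "local estimate: %.1f h\n", local_hours
        printf "speedup: %.1fx\n", local_hours / 12   # 12 billed instance hours
        printf "hourly rate: $%.2f\n", 28.80 / 12     # derived from the total bill
    }'
}
```

Calling `run_summary` prints a local estimate of 66.7 h, a speedup of 5.6x and an hourly rate of $2.40, which is where the "almost 6x" comes from.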

Just some random notes, nothing to see here

The queue all.q is just the normal default queue where everything is sent. You can create your own queues, with soft and hard limits and whatnot... StarCluster offers a lot of features!

Set the max load average threshold high enough to ensure the load average won't slow down job submission to the nodes.

qconf -mattr queue load_thresholds np_load_avg=100 all.q

A CPU equals 1 slot, but since ImageMagick is multithreaded I've decided to go from 32 to 8 slots and set the environment variable MAGICK_THREAD_LIMIT to 6. So one dual 8-core instance runs 8 convert processes in parallel, each with 6 threads, which leads to a slight oversubscription of the CPUs' resources but no thrashing (I hope).

qconf -rattr queue slots 8 all.q@master
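To put a number on "slight oversubscription", here's a small sketch that divides the requested threads by the available CPUs. The helper name oversubscription is hypothetical; the inputs (8 slots, 6 threads each, 32 CPUs) are the ones from the note above.

```shell
# Usage: oversubscription SLOTS THREADS_PER_JOB CPUS
# Prints requested threads per CPU, e.g. 8 slots x 6 threads on 32 CPUs.
oversubscription() {
    awk -v s="$1" -v t="$2" -v c="$3" 'BEGIN { printf "%.2f\n", (s * t) / c }'
}

# In the job script, the thread cap would be set before calling convert:
#   export MAGICK_THREAD_LIMIT=6
```

`oversubscription 8 6 32` prints 1.50, so 48 threads compete for 32 CPUs when every slot is busy.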

Set the slots of a new node to 0 so it won't accept jobs until I've installed ImageMagick. After the installation, I set it to 8, same as the master.

qconf -rattr queue slots 0 all.q@node001
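Put together, bringing up a new node could be sketched like this. It assumes an Ubuntu-based AMI (hence apt-get) and passwordless root ssh from the master to the node; both are assumptions on my part, as is the function name prepare_node. It also doesn't close the gap between the node joining and its slots being set to 0.

```shell
# Keep a freshly added node idle until ImageMagick is installed, then open it for jobs.
# Usage: prepare_node NODE [SLOTS]   (SLOTS defaults to 8, same as the master)
# Set RUN=echo to print the commands instead of executing them.
prepare_node() {
    node="$1"
    slots="${2:-8}"
    ${RUN:-} qconf -rattr queue slots 0 "all.q@$node" &&
    ${RUN:-} ssh "root@$node" "apt-get install -y imagemagick" &&
    ${RUN:-} qconf -rattr queue slots "$slots" "all.q@$node"
}
```

A dry run with `RUN=echo prepare_node node001` just lists the three commands, which is a cheap way to check them before touching the queue.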

If you start another node and the nodes just won't work at full load (but there's enough work available), it's not the load average; it's that the new node picked up jobs before you could set its slots to zero! Fix that, resubmit the jobs and all is good!
