Sunday, June 17, 2012

Using StarCluster for some heavy computing

Update: 2012-10-22

I recently put the scripts I used into a little GitHub repo called starcluster-image-sharpener. It's nothing big, but it'll get you started if needed. Go get them! ;)

Introduction

First a little context:

I've photographed a ton recently, and on one occasion I screwed things up a bit and ended up with a lot of slightly out-of-focus images. I tinkered with the Canon Digital Photo Professional settings to find a good level of sharpening after noise reduction (a lot was shot at high ISO, and for me and my 50D, ISO 1250 is already a lot) but wasn't happy with the result. I found reasonable settings for noise reduction, which makes the image a bit softer, but the sharpening wouldn't do anything to compensate, let alone actually sharpen.

Since we're talking about roughly 1000 photos, processing them by hand is not an option. I've already identified some photos that need to be processed by hand, but those are just a few.

So I turned to ImageMagick, my go-to tool for all image processing that needs batching/automation. It took me a few hours of experimenting with the convert settings to find a combination I liked: sharpened but very subtle, and most importantly none of the terrible rough edges you get from oversharpening.

The settings are:

user@local:/tmp$ convert infile.jpg \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -gaussian-blur 1x1 -sharpen 10x1 \
    -quality 100 outfile.jpg

YES, this is not a typo. The great thing about this command is that the result is really nice. The bad things: -gaussian-blur is considerably slower than -blur (but it blurs the edges less than the flat areas, so it's better than -blur for my purposes here), and -sharpen is usually used with a 0x?? geometry; a 10x?? radius yields a much better result but also takes a lot more time.

So the above command takes close to 4 minutes to complete for a 12MB JPEG on a quad-core 2.4GHz machine. ImageMagick already uses all cores, so there's not much room for improvement there. Is it going to be 4000 minutes (67 hours, almost 3 full days) then?

No, I won't accept that! Let's get out the big guns! I love big toys and I've basically waited for an opportunity like this to arrive. An opportunity to use the big instances on EC2, that is.

StarCluster to the rescue

StarCluster is a cluster computing framework built on top of Amazon EC2 and Sun's Grid Engine (since the acquisition of Sun by Oracle it's of course called Oracle Grid Engine).

Why a cluster/grid, you ask? Simple: I want to run jobs in parallel and in a controlled fashion, and SGE does just that. I can submit a gazillion jobs to the grid and SGE makes sure each node is busy but not overloaded.

This StarCluster thing is really easy to use. Setting up a cluster takes no time and almost no knowledge of clusters: you just configure your EC2 credentials and cluster characteristics (AMI, number of nodes, etc.), enter one command and lean back while StarCluster

  • launches the EC2 instances with the AMI you selected
  • adds a cluster user sgeadmin
  • sets up the hosts file with friendly hostnames
  • sets up passwordless login to all nodes via ssh public keys
  • configures the master and nodes into an actual cluster
  • and probably much more I didn't even notice
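
The one command plus the minimal config it reads look roughly like this; the cluster name, key name and AMI below are placeholders, not my actual values:

```ini
# ~/.starcluster/config (minimal sketch; all values are placeholders)
[aws info]
AWS_ACCESS_KEY_ID = <your-key-id>
AWS_SECRET_ACCESS_KEY = <your-secret>
AWS_USER_ID = <your-account-id>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster imagecluster]
KEYNAME = mykey
CLUSTER_SIZE = 1
NODE_IMAGE_ID = ami-xxxxxxxx
NODE_INSTANCE_TYPE = cc2.8xlarge
```

Then `starcluster start imagecluster` brings the whole thing up, and `starcluster addnode imagecluster` grows it later.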

I ended up with a setup of just one EC2 cc2.8xlarge instance, plus a second one that I started and stopped as needed when I thought the inbox would get too big.

The Setup

So here's a little rundown of the setup. It's not exactly what I have running at the moment, but what I'd do next time, because I've learned a few lessons along the way. Anyway, here goes:

A StarCluster plugin installs ImageMagick on every fresh node because it isn't installed by default. (I installed it manually while the node was already operational, so a bunch of jobs failed and had to be resubmitted.)
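
That boils down to a plugin entry in the StarCluster config. A sketch using the pkginstaller plugin that ships with StarCluster (check the exact class path against your version's docs):

```ini
[plugin imagemagick]
SETUP_CLASS = starcluster.plugins.pkginstaller.PackageInstaller
PACKAGES = imagemagick

[cluster imagecluster]
# ... rest of the cluster section ...
PLUGINS = imagemagick
```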

My local notebook here at home orchestrates the whole thing. It runs two shell scripts: the first uploads the images into an inbox folder on the master; the second downloads a processed image from the outbox on the master once it hasn't been touched for 2 minutes (meaning processing has finished). The inbox and outbox live in /home/sgeadmin, and this home directory is shared across all nodes.
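
The upload half can be sketched like this; the hostname and folder names are illustrative stand-ins, and the copy command defaults to scp but can be swapped out:

```shell
#!/bin/sh
# Sketch of the local upload loop (hostname and folders are placeholders).
upload_all() {   # $1 = source dir, $2 = archive dir, $3 = destination
  for f in "$1"/*.jpg; do
    [ -e "$f" ] || continue                    # no matches -> skip literal glob
    ${COPY:-scp -q} "$f" "$3" && mv "$f" "$2/" # archive only on success
  done
}

# Real use (COPY defaults to scp):
#   upload_all raw uploaded sgeadmin@master:/home/sgeadmin/inbox/
```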

On the master node, a script iterates over all files in the inbox that haven't been touched for 2 minutes (meaning the sftp upload has finished), moves them to a pending folder and submits a job to the grid with qsub.
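
The inbox scan boils down to find's -mmin test plus qsub. A sketch, where the job-script name sharpen.sh and the submit command are illustrative:

```shell
#!/bin/sh
# Sketch of the master-side submit loop (folder names match the text).
submit_ready() {   # $1 = inbox, $2 = pending, $3... = submit command
  inbox=$1; pending=$2; shift 2
  # -mmin +2: only files untouched for 2 minutes, i.e. the upload finished
  find "$inbox" -name '*.jpg' -mmin +2 | while read -r f; do
    mv "$f" "$pending/" &&
    "$@" "$pending/$(basename "$f")"
  done
}

# On the master, something like:
#   submit_ready /home/sgeadmin/inbox /home/sgeadmin/pending \
#     qsub -b y /home/sgeadmin/sharpen.sh
```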

The script executing the job, called by the grid scheduler, then processes the image in the pending folder, writes the output to the outbox and removes the image from the pending folder. The download script picks the result up from the outbox.
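
The job script itself can be as small as this sketch; the filter command is passed in so the sharpening pipeline stays in one place, and all paths are illustrative:

```shell
#!/bin/sh
# Sketch of the per-job script the scheduler runs.
sharpen_job() {   # $1 = file in pending, $2 = outbox dir, $3... = filter cmd
  in=$1
  out=$2/$(basename "$1")
  shift 2
  # write the result to the outbox, then clear the pending copy
  "$@" "$in" "$out" && rm "$in"
}

# On the cluster, something like:
#   MAGICK_THREAD_LIMIT=6 sharpen_job /home/sgeadmin/pending/img.jpg \
#       /home/sgeadmin/outbox \
#       convert -gaussian-blur 1x1 -sharpen 10x1 -quality 100
```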

Initially, I had some problems with the cluster manager not putting the full load on the cluster node(s). This was because SGE checks the load average, and if a node goes above a certain threshold, no new jobs are sent to it. This setting can easily be changed with qconf. After all, I'm paying for this machine by the hour, so please let this thing run full steam ahead!

So since I started this thing a few hours ago, I've had a constant stream of uploads, processing and simultaneous downloads of the results. Thanks to a handful of small scripts I've automated everything and can lean back while the entire set is processed an estimated 4 times faster.

Incidentally, my upload rate and the processing rate on the grid are roughly balanced, with the upload being slightly faster, so I can just add another node to the cluster. When doing this, until I have the cluster setup plugin, I have to ensure the new node doesn't accept any jobs yet, which can be done via qconf.

So all in all a successful experiment, and I've learned a lot today. And it's fun to play with heavy machinery like this, even when it's located somewhere far away in Virginia ;)

Update 2012-06-18: The numbers are in: AWS charged me for 12 hours of total cc2.8xlarge instance time, which comes to about $28.80. Well worth it in my opinion for an almost 6x speedup compared to running this stuff locally. Next time I'll use the resources better: some of the time was 'unproductive', spent setting up and testing all the scripts (I had to learn that calling qsub via ssh isn't as easy as it sounds because you need to use the full path and manually set up some environment variables, to name just one example), and some performance was lost to me tinkering with a second node and an only half-filled job queue.
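
For the qsub-over-ssh pitfall, the working invocation looked roughly like this; /opt/sge6 is the usual SGE location on StarCluster AMIs (adjust for your install) and sharpen.sh is an illustrative name:

```shell
# qsub isn't on the non-interactive PATH and needs SGE's environment,
# so source settings.sh and call qsub by its full path.
submit_remote() {   # $1 = image filename in the pending folder
  ssh sgeadmin@master \
    ". /opt/sge6/default/common/settings.sh && \
     /opt/sge6/bin/lx24-amd64/qsub -b y /home/sgeadmin/sharpen.sh $1"
}
```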

Just some random notes, nothing to see here

The queue all.q is just the normal default queue where everything is sent. You can create your own queues, with soft and hard limits and whatnot... StarCluster offers a lot of features!

Set the max load average threshold to ensure the load factor won't slow down job submission to the nodes.

qconf -mattr queue load_thresholds np_load_avg=100 all.q

A CPU equals one slot, but since ImageMagick does multithreading I decided to go from 32 to 8 slots and set the environment variable MAGICK_THREAD_LIMIT to 6. So one dual-8-core instance runs 8 convert processes in parallel, each with 6 threads, which leads to a slight oversubscription of the CPU's resources but no thrashing (I hope).

qconf -rattr queue slots 8 all.q@master

Configure the slots of a new node to 0 so it won't accept jobs until I've installed ImageMagick. After the installation, I set it to 8, same as the master.

qconf -rattr queue slots 0 all.q@node001

If you start another node and it just won't work at full load (even though there's enough work available), it's not the load average: the new node has processed jobs before you could set its slots to zero! Fix that, resubmit those jobs and all is good!

Friday, June 8, 2012

Local postfix as relay to Amazon SES

Introduction

Alright, this is a quick guide for the impatient but otherwise experienced Linux admin/hacker/hobbyist. Some past postfix experience might be advantageous for general understanding and troubleshooting.

Why would I want a local postfix relaying to another SMTP server anyway? Simple: when my application code needs to send an e-mail, there is an SMTP server ready to accept it right away. It then takes care of everything else, like re-delivery, dealing with being grey-listed and many other things. Also, if connectivity to the SES SMTP happens to be interrupted, it's no big deal, because here too the local postfix will handle re-sending for me. Nice, huh?

The documentation for setting up postfix as an SMTP relay to Amazon SES is correct but seems incomplete, and I had to hunt down a few bits of extra information. So below is the complete config that I use on my EC2 instances and my development notebook.

Configuration (EC2)

So let's make this quick, here are the configs:

main.cf

This is just condensed and stripped to the bare minimum. We only accept connections from localhost, which ensures we don't relay e-mail for any third party.

##
## default config (condensed, no comments)
##
queue_directory   = /var/spool/postfix
command_directory = /usr/sbin
daemon_directory  = /usr/libexec/postfix
mail_owner        = postfix
inet_interfaces   = localhost
mydestination     = $myhostname, localhost.$mydomain, localhost
unknown_local_recipient_reject_code = 550
alias_maps        = hash:/etc/aliases
alias_database    = hash:/etc/aliases
debug_peer_level  = 2
sendmail_path     = /usr/sbin/sendmail.postfix
newaliases_path   = /usr/bin/newaliases.postfix
mailq_path        = /usr/bin/mailq.postfix
setgid_group      = postdrop
html_directory    = no
manpage_directory = /usr/share/man
sample_directory  = /usr/share/doc/postfix-2.3.3/samples
readme_directory  = /usr/share/doc/postfix-2.3.3/README_FILES

##
## some extras
##
inet_interfaces    = loopback-only
masquerade_domains = $mydomain

##
## use amazon ses via smtp with starttls
##
relayhost                      = email-smtp.us-east-1.amazonaws.com:25
smtp_sasl_auth_enable          = yes
smtp_sasl_security_options     = noanonymous
smtp_sasl_tls_security_options = noanonymous
smtp_sasl_password_maps        = hash:/etc/postfix/sasl_passwd
smtp_use_tls                   = yes
smtp_tls_security_level        = encrypt
smtp_tls_note_starttls_offer   = yes
smtp_tls_CAfile                = /etc/ssl/certs/ca-bundle.crt

master.cf

The only change to the default config is to comment out the fallback lines just below the smtp service (I've also appended -v to the smtp line for verbose logging, see Troubleshooting below); the rest of master.cf is unchanged:

smtp      unix  -       -       n       -       -       smtp -v
# When relaying mail as backup MX, disable fallback_relay to avoid MX loops
relay     unix  -       -       n       -       -       smtp
#       -o fallback_relay=
#       -o smtp_helo_timeout=5 -o smtp_connect_timeout=5

Configuration (Development)

When applying the configuration to my development machine, I had to copy the ca-bundle.crt file myself. So on my development machine the last line of main.cf looks like this:

smtp_tls_CAfile                = /etc/ssl/certs/amazon-aws-ca-bundle.crt

Also, the daemon_directory path is different on my notebook. You can figure out the correct path by checking where your package manager installs the postfix binaries; that's your daemon_directory. (Or just look at the distro's config before overwriting it.)

daemon_directory = /usr/lib/postfix

And that's all that's different between my EC2 and development hosts.

Amazon SES Credentials

Amazon requires sending systems to authenticate against their SMTP endpoint, so you first need to create an SES SMTP user and password in the AWS SES Console. Amazon generates these for you, and you have exactly one opportunity to see/download them. Put them in the file /etc/postfix/sasl_passwd like this:

email-smtp.us-east-1.amazonaws.com:25 <user>:<password>
ses-smtp-prod-335357831.us-east-1.elb.amazonaws.com:25 <user>:<password>

Then run "postmap hash:/etc/postfix/sasl_passwd" and remove the file.

Verify your sender address

Amazon SES only allows you to send from a verified address. So head over to the AWS Console, add a sender address, wait for the verification e-mail and confirm the sender address.

Note: If you're in sandbox mode, the recipient must also be a verified address, otherwise SES will bounce the e-mail! For test/dev this is usually good enough; if you need to send e-mails anywhere, it's time to request production access in the AWS Console.

Sending some test e-mails

This was my biggest pitfall when setting up SES. sendmail must be invoked in a special way so it behaves nicely and the mail actually gets sent. The first line is the shell command; the rest is typed into the console.

root@remote:/tmp$ sendmail -t -f verifiedsender@example.com
To: recipient@example.com
Subject: Testing postfix relay via Amazon SES
Hey there, this is a test E-Mail!!
.

Note the line with a single dot and nothing else on it. This tells sendmail the body of the e-mail has ended and the e-mail will be sent.

And that's it already. If you know what you're doing, it'll take you 10 minutes tops to set this up and get it running; if you're doing it for the first time, like I did, it's more like 2-3 hours with troubleshooting. Anyway, I hope this saves some folks some time so they can concentrate on more important matters!

Troubleshooting

If you want some debugging information from postfix, append -v to the smtp line in master.cf and restart postfix. The postfix log is /var/log/maillog on most systems. Read it carefully; it'll tell you everything you need to find most, if not all, problems.

Monday, June 4, 2012

Accessing Request Parameters in a Grails Domain Class

It's not exactly elegant to work with request parameters in a domain class, but it was necessary. I have a bunch of domain classes with "asMap" methods where they render themselves into a map and cascade to other domain objects as needed. In the controller, the resulting map is handed to the renderer and we get a nice JSON response.

So now I've changed some fields and in order to stay backwards compatible, I created a new apiKey (a parameter needed for all calls to my app) that distinguishes old and new clients.

And now, without further ado, the code:

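A hedged sketch of the idea, assuming Grails 2.x and the usual Spring RequestContextHolder trick; the class, field and apiKey names are all illustrative, not the originals:

```groovy
import org.springframework.web.context.request.RequestContextHolder

class Photo {
    String title   // new field name
    String name    // legacy field old clients still expect

    // apiKeys that should keep getting the legacy format (illustrative)
    static final LEGACY_KEYS = ['v1-client'] as Set

    Map asMap() {
        // Reach into the current web request; this only works while
        // a request is actually being processed.
        def params = RequestContextHolder.currentRequestAttributes().params
        if (params.apiKey in LEGACY_KEYS) {
            return [name: name]   // old clients: old field name
        }
        [title: title]            // new clients: new format
    }
}
```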

And that's that. Once all clients work with the new format/apiKey the offending code above will be removed, promised!