Thursday, July 01, 2010

how to upload in bulk to Amazon S3

Recently I had to upload a folder containing many files (over 60,000) to Amazon's S3 storage service.


Amazon doesn't have a tool for this, and their APIs seem to support transferring only one file at a time. I was looking for a solution that wouldn't tie up my computer for a long duration while doing the transfer piecemeal. So I thought: why not make a tarball of my files and transfer it in one fell swoop via ftp to a Linux box running as an instance of Amazon's Elastic Compute Cloud (EC2)? The plan was then to run some scripts to unpack the thing and transfer the files fast between the EC2 and S3 servers, over Amazon's internal network.
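
For reference, the tarball itself can be made locally along these lines (the folder and archive names here are only placeholders, adjust them to your setup):
tar czf files.tar.gz -C /path/to/my-folder .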


Assuming you already have a subscription to Amazon's services and know your way around EC2, here are the steps to take (you may need to change the things in pink, such as instance IDs, hostnames and bucket names, to match your needs):


1. look for a suitable AMI to base your instance on:
ec2-describe-images -o self -o amazon | grep getting-started


2. pick and run an instance
ec2-run-instances ami-3c47a355 -k gsg-keypair


2.5 wait until your instance is running; check on its status with
ec2-describe-instances 


3. log in to the instance
ssh -i id_rsa-gsg-keypair root@ec2-184-73-124-245.compute-1.amazonaws.com


4. on the instance, look for and install an ftp server
yum list vsftpd
yum install vsftpd.i386

5. configure the ftp server (uncomment "#anon_upload_enable=YES") in
/etc/vsftpd/vsftpd.conf
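If you'd rather not open an editor, the same edit can be done with sed (this assumes the stock config file, where write_enable=YES is already turned on):
sed -i 's/^#anon_upload_enable=YES/anon_upload_enable=YES/' /etc/vsftpd/vsftpd.conf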


6. start the ftp server
/etc/init.d/vsftpd start



7. prepare the destination for anonymous ftp (with total disregard for the security implications)

mkdir /var/ftp/pub/upload
chmod 777 /var/ftp/pub/upload



8. from your computer ftp the tarball to the server (ftp anonymous@ec2-184-73-124-245.compute-1.amazonaws.com/pub/upload)
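If you'd rather script that step than drive an ftp client by hand, curl can do an anonymous ftp upload (same placeholder archive name as before):
curl -T files.tar.gz ftp://ec2-184-73-124-245.compute-1.amazonaws.com/pub/upload/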

9. after the ftp transfer completes, go back to the instance and unpack the tarball
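Assuming the archive kept the placeholder name from above, that boils down to:
cd /var/ftp/pub/upload
tar xzf files.tar.gz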

10. download an awesome script from http://timkay.com/aws/
cd
curl timkay.com/aws/aws -o aws

11. configure the tool with your Amazon credentials by creating/editing a .awssecret file in your home directory: put your Access Key ID on the first line and your Secret Access Key on the second line
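The file ends up being just two lines; the values below are Amazon's documentation placeholders, not real keys:
AKIAIOSFODNN7EXAMPLE
wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY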

12. save the .awssecret file and set its permissions
chmod 600 .awssecret

13. install the tool
perl aws --install

14. give it a try: make a destination bucket on S3
s3mkdir tartordesign

15. go where your files are
cd /var/ftp/pub/upload

16. prepare an awk script to do the upload and, optionally, set the visibility of your files to public (I needed mine to be available on the web). Create a do.awk file (with vi, for instance) with the following content:

{
# announce the file being processed
printf "putting %s\n", $1
# upload the file into the bucket
cmd = "aws put tartordesign/ " $1
system(cmd)
# make the freshly uploaded object publicly readable
cmd = "aws put tartordesign/" $1 "?acl --set-acl=public-read"
system(cmd)
}

17. run the script that does the job
ls | grep .jpg | awk -f do.awk  -
note: sometimes the piping construct above may fail because "ls | grep .jpg |" produces truncated names. A more robust approach is this:
ls | grep .jpg > list
awk -f do.awk list

18. go about your business until it's finished. You may choose to send the output of that script to a file and later check on the progress by tailing that file; that way you can end the ssh connection established at step 3, so you're not tied to anything, and only reconnect periodically to check the progress.
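One way to detach the job from your ssh session so it keeps running after you log out (the log file name is just a suggestion):
nohup awk -f do.awk list > upload.log 2>&1 &
and when you reconnect later:
tail -f upload.log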

19. once the script is done, terminate your running EC2 instance, the one you started at step 2
ec2-terminate-instances i-2b57bf41

That's it, folks.

2 comments:

zelishe said...

Hi, we are facing the same problem now, but our amount of files is ~2M. Can you tell, how fast that upload was, do you have any time calculations? BTW, thanks for very useful information!

Unknown said...

@zelishe, unfortunately I haven't done any measurements of how fast the method is. For me it was a one-time job, and since I never had to run the script again I didn't get the chance to collect any stats. In any case, it finished in a reasonable time: I believe I left the EC2 instance running overnight and in the morning it was all done.