Preventing rsync from doubling, or even tripling, your S3 fees.

Using rsync to upload files to Amazon S3 over s3fs?  You might be paying double, or even triple, the S3 fees.

I was observing the file upload progress on the transcoder server this morning, curious how it was moving along, and I noticed something: the currently uploading file had an odd name.

My file, CAT5TV-265-Writing-Without-Distractions-With-Free-Software-HD.m4v, was being uploaded as .CAT5TV-265-Writing-Without-Distractions-With-Free-Software-HD.m4v.f100q3.

I use rsync to upload the files to the S3 folder over S3FS on Debian, because it offers good bandwidth control.  I can restrict how much of our upstream bandwidth is dedicated to the upload and prevent it from slowing down our other services.

Knowing how rsync works, I recognized that as a temporary filename: rsync uploads under a random name and renames the file to its real name the instant the upload is complete.
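As a rough illustration of that naming pattern, here is a small Python sketch that maps a temporary name back to the final one. The pattern (a leading dot plus a dot and six alphanumeric characters at the end) is an assumption inferred from the one filename shown above, and the function name is hypothetical. It is a heuristic only, so any ordinary dotfile that happens to end in a six-character extension would match too.

```python
import re

# Assumed shape of an rsync temporary name: ".<final name>.<6 alphanumerics>".
# Inferred from the single example in this post, not from rsync's source.
TEMP_NAME = re.compile(r"^\.(?P<final>.+)\.[A-Za-z0-9]{6}$")

def final_name_of(temp_name):
    """Return the name the temp file would be renamed to, or None if no match."""
    match = TEMP_NAME.match(temp_name)
    return match.group("final") if match else None

temp = ".CAT5TV-265-Writing-Without-Distractions-With-Free-Software-HD.m4v.f100q3"
print(final_name_of(temp))
```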

In a normal disk-to-disk operation, or when rsyncing over something such as SSH, that’s fine: a “mv this that” is a simple rename operation that uses almost no resources and certainly doesn’t cost anything. So why did my antennae go up this morning? Because I also know how S3FS works.

A rename operation over S3FS means the file is first downloaded to a file in /tmp, renamed, and then re-uploaded.  So what rsync is effectively doing is:

  1. Uploading the file to S3 with a random filename, with bandwidth restrictions.
  2. Downloading the file to /tmp with no bandwidth restrictions.
  3. Renaming the /tmp file.
  4. Re-uploading the file to S3 with no bandwidth restrictions.
  5. Deleting the temp files.
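To put numbers on the steps above, here is a minimal Python sketch that adds up the data moved for one file. It assumes the sequence is exactly as listed (one upload, one download, one re-upload). The function name and the 700 MiB file size are made up for illustration.

```python
def rename_upload_traffic(file_bytes):
    """Bytes moved when a temp-named upload is then renamed over S3FS."""
    upload_temp = file_bytes    # step 1: throttled upload under the temporary name
    download_tmp = file_bytes   # step 2: unthrottled download to /tmp
    reupload = file_bytes       # step 4: unthrottled re-upload under the final name
    # Steps 3 and 5 (local rename, cleanup) move no file data over the network.
    return {
        "uploaded": upload_temp + reupload,
        "downloaded": download_tmp,
        "total": upload_temp + download_tmp + reupload,
    }

# Hypothetical 700 MiB episode file.
size = 700 * 1024 ** 2
traffic = rename_upload_traffic(size)
print(traffic["total"] // size)  # 3
```

In other words, every file crosses the wire three times, and two of those trips ignore the bandwidth limit.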

Fortunately, this is 2013 and not 2002.  The developers of rsync realized at some point that direct uploading may be desired in some cases.  I don’t think they had S3FS in mind, but it certainly fits the bill.

The option is --inplace.

Here is what the manpage says about --inplace:

This option changes how rsync transfers a file when its data needs to be updated: instead of the default method of creating a new copy of the file and moving it into place when it is complete, rsync instead writes the update data directly to the destination file.

It’s that simple!  Adding --inplace to your rsync command will cut your Amazon S3 transfer fees by as much as 2/3 for future rsync transactions!
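For anyone scripting their uploads, here is a minimal Python sketch of what the full command might look like. The paths, the 500 limit and the function name are hypothetical, and -a and -v are simply my choice of common flags. The parts that matter are --inplace and rsync's --bwlimit option, which provides the bandwidth control (roughly kilobytes per second).

```python
import shlex

def build_rsync_command(source, dest, bwlimit_kbps=None):
    """Build an rsync argument list that writes directly to the destination file."""
    cmd = ["rsync", "-av", "--inplace"]  # --inplace: no temp file, so no rename
    if bwlimit_kbps is not None:
        cmd.append(f"--bwlimit={bwlimit_kbps}")  # cap the upload rate
    cmd += [source, dest]
    return cmd

# Hypothetical paths: a local folder of transcoded files and an S3FS mount point.
cmd = build_rsync_command("/srv/transcoded/", "/mnt/s3/episodes/", bwlimit_kbps=500)
print(shlex.join(cmd))
# rsync -av --inplace --bwlimit=500 /srv/transcoded/ /mnt/s3/episodes/
```

Passing the list to subprocess.run(cmd, check=True) would run the transfer. Printing it first is a cheap way to check the flags before anything is uploaded.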

I’m glad I caught this before the transcoders transferred all 314 episodes of Category5 Technology TV to S3.  I just saved us a boatload of cash.

Happy coding!

– Robbie

2 Comments
B
9 years ago

Thanks for the info. That was helpful and I’ll definitely try to cut the cost and resources.

Ames
9 years ago

I did some experimenting with --inplace, and it does help. Recently I’ve started using another technique, which you may be interested in. I use the btrfs file system to create snapshots, and then btrfs will “send” the differences between snapshots.

To make this work, I developed a utility I call buttersink to synchronize these diffs with S3. This is at least theoretically a lot more efficient than using rsync for backups.

Buttersink will synchronize a set of read-only snapshots in a btrfs filesystem to an Amazon S3 bucket, and vice-versa. It intelligently picks parent snapshots to “diff” from and send, so that a minimal amount of data needs to be sent over the wire and stored in the backend.

I’ve put a fair amount of effort into making buttersink efficient and reliable enough for backups. Currently implemented features include:

- Transfers between a local btrfs filesystem and an Amazon S3 bucket
- Automatically synchronizes a set of snapshots, or a single snapshot, transferring only needed differences
- Intelligent selection of full and incremental transfers to minimize costs of transfer and storage, and to minimize risks from corruption of a difference
- Smart heuristics based on S3 file sizes, btrfs quota information, and btrfs-tools internal snapshot parent identification (“ruuid”)
- Resumable, checksummed multi-part uploads to S3 as a back-end
- Robust handling of btrfs send and receive errors
- Configurable verbosity and logging
- Conveniently lists snapshots and sizes in either btrfs or S3
- Detects and (optionally) deletes failed partial transfers

If there’s interest, I’ll extend it to sync to a remote btrfs filesystem as well.

The utility is on PyPI as “buttersink”, and the GitHub page is here:
https://github.com/AmesCornish/buttersink.

Thanks in advance for any feedback!