The anatomy of a quick bash script (bulk rename Kinesis Firehose files in S3)
Also known as: how to move hundreds of files in S3 that you accidentally put in the wrong place because you misconfigured a Kinesis Firehose.
I have a Kinesis Firehose streaming change capture data into an S3 bucket so I can query it using Athena. This has been working great for months, but I just realised the default configuration of Firehose does not put the files in a structure that Athena can partition on, so my Athena queries are scanning the entire dataset instead of partitioning on date.
Kinesis Firehose by default delivers files in a structure like `2020/07/20/16/Filename-2020-07-20-16-00-00-hash`. But Athena wants files in a structure compatible with Hive’s partition format, which uses a variable-based convention: `year=2020/month=07/day=20/hour=16/Filename-2020-07-20-16-00-00-hash`.
In my case, I’ve decided having the hour in the path makes my partitions too small (not enough data per hour), so I’m going to use a structure of just the date as a single variable: `dt=2020-07-20/Filename-2020-07-20-16-00-00-hash`.
Now I just need to move hundreds of files from the old structure into the new structure!
The commands
aws s3 ls --recursive s3://bucket-name/ \
| awk '{ print $4 }' \
| sed -E "p;s/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/[0-9]{2}/dt=\1-\2-\3/" \
| sed -E "s/^/s3:\/\/bucket-name\//" \
| xargs -n2 aws s3 mv
Running this script will move all files from `s3://bucket-name/year/month/day/hour/filename` to `s3://bucket-name/dt=year-month-day/filename`.
Note: this is not particularly elegant: you could likely combine several of these lines, but I’ve favoured readability over efficiency. It’s also not particularly fast, so probably don’t use it for millions of files.
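If you want to sanity-check what it’s going to do before anything actually moves, one easy option is to put an echo in front of the final mv so each command is printed rather than executed:
aws s3 ls --recursive s3://bucket-name/ \
| awk '{ print $4 }' \
| sed -E "p;s/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/[0-9]{2}/dt=\1-\2-\3/" \
| sed -E "s/^/s3:\/\/bucket-name\//" \
| xargs -n2 echo aws s3 mv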
The breakdown
Start by listing all the files in the bucket:
aws s3 ls --recursive s3://bucket-name
This produces an output like:
2020-07-16 15:12:29 10016 2020/07/16/05/Snapshot-2020-07-16-05-07-26
2020-07-17 09:21:12 10838 2020/07/16/23/Snapshot-2020-07-16-23-16-10
2020-07-17 09:27:52 5790 2020/07/16/23/Snapshot-2020-07-16-23-22-50
2020-07-17 09:55:43 5409 2020/07/16/23/Snapshot-2020-07-16-23-50-41
I’m truncating the filenames in this post just to make it a bit more readable; imagine a hash at the end of each one too.
Next, extract just the filename from the output, which is the 4th column:
awk '{ print $4 }'
I’m using `awk` here because `cut` doesn’t deal well with space-delimited columns, and `aws s3 ls` doesn’t let you control the output format. If the output were tab-delimited, `cut` would’ve worked too.
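To see why, here’s a quick illustration (with made-up padding) of the two tools against a line of aligned output:
# cut splits on every single space, so the padding between columns
# produces empty fields and field 4 isn't the filename:
echo '2020-07-17 09:27:52     5790 2020/07/16/23/Snapshot' | cut -d' ' -f4
# (prints an empty line)

# awk collapses runs of whitespace by default, so $4 is reliable:
echo '2020-07-17 09:27:52     5790 2020/07/16/23/Snapshot' | awk '{ print $4 }'
# 2020/07/16/23/Snapshot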
This produces an output of just the filenames:
2020/07/16/05/Snapshot-2020-07-16-05-07-26
2020/07/16/23/Snapshot-2020-07-16-23-16-10
2020/07/16/23/Snapshot-2020-07-16-23-22-50
2020/07/16/23/Snapshot-2020-07-16-23-50-41
Next is the fun part. We need to take that filename and transform it into our new format whilst also preserving the original filename, so both can be passed to a move command together. `sed` works well for this:
sed -E "p;s/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/[0-9]{2}/dt=\1-\2-\3/"
It’s a pretty dense command, but most of that is a regular expression. To break it down:
- `-E` enables extended regular expressions, which gives us the `()` group-match support.
- `p;` you might not see this part very often: it prints the input as well as the replacement, so for one line in you get two lines out: the original, and whatever the replacement produced. This is key for being able to build the move command easily later, as it requires a source and a destination.
- `s/` starts the regular expression.
- `([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/[0-9]{2}` matches the `year/month/day/hour` part of the path, capturing the year, month, and day as groups so they can be reused later.
- `/dt=\1-\2-\3` specifies the replacement; we replace the whole match (`year/month/day/hour`) with a pattern of `dt=year-month-day`, where `\1`, `\2`, and `\3` are references to the capture groups in the regular expression.
Optimisation: this could probably have been done in our previous `awk` command, which has regular expression support, but it was looking pretty ugly and I’d rather waste a few CPU cycles than bend my brain too much.
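For the curious, a rough sketch of what folding it into that first `awk` might look like (assuming every key has the fixed year/month/day/hour/filename shape that Firehose produces):
# Replaces both the awk and the first sed: $4 is the key from the
# ls output; split it on "/" to get the date parts and the filename.
awk '{ split($4, p, "/"); print $4; printf "dt=%s-%s-%s/%s\n", p[1], p[2], p[3], p[5] }'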
The output now looks like this:
2020/07/16/05/Snapshot-2020-07-16-05-07-26
dt=2020-07-16/Snapshot-2020-07-16-05-07-26
2020/07/16/23/Snapshot-2020-07-16-23-16-10
dt=2020-07-16/Snapshot-2020-07-16-23-16-10
2020/07/16/23/Snapshot-2020-07-16-23-22-50
dt=2020-07-16/Snapshot-2020-07-16-23-22-50
2020/07/16/23/Snapshot-2020-07-16-23-50-41
dt=2020-07-16/Snapshot-2020-07-16-23-50-41
You’ll notice there are twice as many lines now, due to the `p;` setting of `sed`. For each original line there’s now that line plus the reformatted line.
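You can see this duplicate-then-transform behaviour in isolation with a throwaway example:
echo "hello" | sed "p;s/hello/world/"
# hello
# world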
Next up is to prefix each line with the bucket name. The move command expects all filenames to be fully-qualified with their bucket:
sed -E "s/^/s3:\/\/bucket-name\//"
Optimisation: with some smarter regular expressions, this could’ve been folded into the previous `sed` statement.
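One way that fold might look (a sketch, not something I’ve battle-tested): sed accepts multiple commands separated by semicolons, and switching the delimiter from / to | avoids all the slash-escaping:
# Prefix first, print the original, then transform the date part.
sed -E "s|^|s3://bucket-name/|; p; s|([0-9]{4})/([0-9]{2})/([0-9]{2})/[0-9]{2}|dt=\1-\2-\3|"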
The output now looks like this:
s3://bucket-name/2020/07/16/05/Snapshot-2020-07-16-05-07-26
s3://bucket-name/dt=2020-07-16/Snapshot-2020-07-16-05-07-26
s3://bucket-name/2020/07/16/23/Snapshot-2020-07-16-23-16-10
s3://bucket-name/dt=2020-07-16/Snapshot-2020-07-16-23-16-10
s3://bucket-name/2020/07/16/23/Snapshot-2020-07-16-23-22-50
s3://bucket-name/dt=2020-07-16/Snapshot-2020-07-16-23-22-50
s3://bucket-name/2020/07/16/23/Snapshot-2020-07-16-23-50-41
s3://bucket-name/dt=2020-07-16/Snapshot-2020-07-16-23-50-41
The same as before, but now prefixed with the bucket name.
One final step to go: the actual move:
xargs -n2 aws s3 mv
We use `xargs` (an absolute cornerstone of shell scripting) to execute the `aws s3 mv` command and finish the job. Let’s break this one down too:
- `xargs` reads from stdin and executes a command with the stdin lines as arguments. So `echo hi | xargs cowsay` is equivalent to `cowsay hi`. In our case, we take the filenames we’ve been processing and pass them to `aws s3 mv` as arguments.
- `-n2` tells xargs to take two lines instead of just one, and pass them both to our command. Because we have one line with the original and a second line with the transformed, `xargs` calls the move command like so: `aws s3 mv original-line transformed-line`, which is exactly what’s needed to move a file from one place to another.
- `aws s3 mv`, as already mentioned, is the command `xargs` is going to call.
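The pairing behaviour of `-n2` is easy to demo on its own:
printf 'a\nb\nc\nd\n' | xargs -n2 echo mv
# mv a b
# mv c d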
Running that command will take a list of original filenames, transform them into pairs of original and transformed filenames, prepend the filenames with the bucket, and then pass the pairs to `aws s3 mv`.
And that’s it, done! Files moved.