The anatomy of a quick bash script (bulk rename Kinesis Firehose files in S3)
Also known as: how to move hundreds of files in S3 that you accidentally put in the wrong place because you misconfigured a Kinesis Firehose.
I have a Kinesis Firehose streaming change capture data into an S3 bucket so I can query it using Athena. This has been working great for months, but I just realised the default configuration of Firehose does not put the files in a structure that Athena can partition on, so my Athena queries are scanning the entire dataset instead of partitioning on date.
Kinesis Firehose by default delivers files in a structure like `2020/07/20/16/Filename-2020-07-20-16-00-00-hash`. But Athena wants files in a structure compatible with Hive’s partition format, which uses a variable-based convention: `year=2020/month=07/day=20/hour=16/Filename-2020-07-20-16-00-00-hash`.
In my case, I’ve decided having the hour in the path makes my partitions too small (not enough data per hour), so I’m going to use a structure of just the date as a single variable: `dt=2020-07-20/Filename-2020-07-20-16-00-00-hash`.
Now I just need to move hundreds of files from the old structure into the new structure!
The commands
aws s3 ls --recursive s3://bucket-name/ \
| awk '{ print $4 }' \
| sed -E "p;s/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/[0-9]{2}/dt=\1-\2-\3/" \
| sed -E "s/^/s3:\/\/bucket-name\//" \
| xargs -n2 aws s3 mv
Running this script will move all files from `s3://bucket-name/year/month/day/hour/filename` to `s3://bucket-name/dt=year-month-day/filename`.
Note: this is not particularly elegant: you could likely combine several of these lines, but I’ve favoured readability over efficiency. It’s also not particularly fast, so probably don’t use it for millions of files.
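If you want to sanity-check what it’s going to do before anything actually moves, one easy option is to put an echo in front of the final mv so each command is printed rather than executed:
aws s3 ls --recursive s3://bucket-name/ \
| awk '{ print $4 }' \
| sed -E "p;s/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/[0-9]{2}/dt=\1-\2-\3/" \
| sed -E "s/^/s3:\/\/bucket-name\//" \
| xargs -n2 echo aws s3 mv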
The breakdown
Start by listing all the files in the bucket:
aws s3 ls --recursive s3://bucket-name
This produces an output like:
2020-07-16 15:12:29 10016 2020/07/16/05/Snapshot-2020-07-16-05-07-26
2020-07-17 09:21:12 10838 2020/07/16/23/Snapshot-2020-07-16-23-16-10
2020-07-17 09:27:52 5790 2020/07/16/23/Snapshot-2020-07-16-23-22-50
2020-07-17 09:55:43 5409 2020/07/16/23/Snapshot-2020-07-16-23-50-41
I’m truncating the filenames in this post just to make it a bit more readable; imagine a hash at the end of each one too.
Next, extract just the filename from the output, which is the 4th column:
awk '{ print $4 }'
I’m using `awk` here because `cut` doesn’t deal well with space-delimited columns, and `aws s3 ls` doesn’t let you control the output format. If the output were tab-delimited, `cut` would’ve worked too.
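To see why, here’s a quick illustration (with made-up padding) of the two tools against a line of aligned output:
# cut splits on every single space, so the padding between columns
# produces empty fields and field 4 isn't the filename:
echo '2020-07-17 09:27:52     5790 2020/07/16/23/Snapshot' | cut -d' ' -f4
# (prints an empty line)

# awk collapses runs of whitespace by default, so $4 is reliable:
echo '2020-07-17 09:27:52     5790 2020/07/16/23/Snapshot' | awk '{ print $4 }'
# 2020/07/16/23/Snapshot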
This produces an output of just the filenames:
2020/07/16/05/Snapshot-2020-07-16-05-07-26
2020/07/16/23/Snapshot-2020-07-16-23-16-10
2020/07/16/23/Snapshot-2020-07-16-23-22-50
2020/07/16/23/Snapshot-2020-07-16-23-50-41
Next is the fun part. We need to take that filename and transform it into our new format whilst also preserving the original filename, so both can be passed to a move command together. `sed` works well for this:
sed -E "p;s/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/[0-9]{2}/dt=\1-\2-\3/"
It’s a pretty dense command, but most of that is a regular expression. To break it down:
- `-E` enables extended regular expressions, which gives us the `()` group-match support.
- `p;` you might not see this part very often: it prints the input as well as the replacement, so for one line in you get two lines out: the original, and whatever the replacement produced. This is key for being able to build the move command easily later, as it requires a source and a destination.
- `s/` starts the regular expression.
- `([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/[0-9]{2}` matches the `year/month/day/hour` part of the path, capturing the year, month, and day as groups so they can be reused later.
- `/dt=\1-\2-\3` specifies the replacement; we replace the whole match (`year/month/day/hour`) with a pattern of `dt=year-month-day`, where `\1`, `\2`, and `\3` are references to the capture groups in the regular expression.
Optimisation: this could probably have been done in our previous `awk` command, which has regular expression support, but it was looking pretty ugly and I’d rather waste a few CPU cycles than bend my brain too much.
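For the curious, a rough sketch of what folding it into that first `awk` might look like (assuming every key has the fixed year/month/day/hour/filename shape that Firehose produces):
# Replaces both the awk and the first sed: $4 is the key from the
# ls output; split it on "/" to get the date parts and the filename.
awk '{ split($4, p, "/"); print $4; printf "dt=%s-%s-%s/%s\n", p[1], p[2], p[3], p[5] }'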
The output now looks like this:
2020/07/16/05/Snapshot-2020-07-16-05-07-26
dt=2020-07-16/Snapshot-2020-07-16-05-07-26
2020/07/16/23/Snapshot-2020-07-16-23-16-10
dt=2020-07-16/Snapshot-2020-07-16-23-16-10
2020/07/16/23/Snapshot-2020-07-16-23-22-50
dt=2020-07-16/Snapshot-2020-07-16-23-22-50
2020/07/16/23/Snapshot-2020-07-16-23-50-41
dt=2020-07-16/Snapshot-2020-07-16-23-50-41
You’ll notice there are twice as many lines now, due to the `p;` setting of `sed`. For each original line there’s now that line plus the reformatted line.
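You can see this duplicate-then-transform behaviour in isolation with a throwaway example:
echo "hello" | sed "p;s/hello/world/"
# hello
# world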
Next up is to prefix each line with the bucket name. The move command expects all filenames to be fully-qualified with their bucket:
sed -E "s/^/s3:\/\/bucket-name\//"
Optimisation: with some smarter regular expressions, this could’ve been folded into the previous `sed` statement.
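One way that fold might look (a sketch, not something I’ve battle-tested): sed accepts multiple commands separated by semicolons, and switching the delimiter from / to | avoids all the slash-escaping:
# Prefix first, print the original, then transform the date part.
sed -E "s|^|s3://bucket-name/|; p; s|([0-9]{4})/([0-9]{2})/([0-9]{2})/[0-9]{2}|dt=\1-\2-\3|"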
The output now looks like this:
s3://bucket-name/2020/07/16/05/Snapshot-2020-07-16-05-07-26
s3://bucket-name/dt=2020-07-16/Snapshot-2020-07-16-05-07-26
s3://bucket-name/2020/07/16/23/Snapshot-2020-07-16-23-16-10
s3://bucket-name/dt=2020-07-16/Snapshot-2020-07-16-23-16-10
s3://bucket-name/2020/07/16/23/Snapshot-2020-07-16-23-22-50
s3://bucket-name/dt=2020-07-16/Snapshot-2020-07-16-23-22-50
s3://bucket-name/2020/07/16/23/Snapshot-2020-07-16-23-50-41
s3://bucket-name/dt=2020-07-16/Snapshot-2020-07-16-23-50-41
The same as before, but now prefixed with the bucket name.
One final step to go: the actual move:
xargs -n2 aws s3 mv
We use `xargs` (an absolute cornerstone of shell scripting) to execute the `aws s3 mv` command and finish the job. Let’s break this one down too:
- `xargs` reads from stdin and executes a command with the stdin lines as arguments. So `echo hi | xargs cowsay` is equivalent to `cowsay hi`. In our case, we take the filenames we’ve been processing and pass them to `aws s3 mv` as arguments.
- `-n2` tells xargs to take two lines instead of just one, and pass them both to our command. Because we have one line with the original and a second line with the transformed, `xargs` calls the move command like so: `aws s3 mv original-line transformed-line`, which is exactly what’s needed to move a file from one place to another.
- `aws s3 mv`, as already mentioned, is the command `xargs` is going to call.
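The pairing behaviour of `-n2` is easy to demo on its own:
printf 'a\nb\nc\nd\n' | xargs -n2 echo mv
# mv a b
# mv c d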
Running that command will take a list of original filenames, transform them into pairs of original and transformed filenames, prepend the filenames with the bucket, and then pass the pairs to `aws s3 mv`.
And that’s it, done! Files moved.