Removing Cloudfront URLs with Regex

Recently helped a customer remove Cloudfront URLs as part of a migration process from AWS to Kinsta. Kinsta’s CDN is powered by Cloudflare and is built-in. There is no need for URLs to reference a CDN-specific location. So restoring all URL references back to their original location was required before shutting down the old AWS Cloudfront account.

Changing internal URL references is typically a one-liner with WP-CLI’ search-replace command. However, this one took some clever logic to get right due to the fact that it was a large multisite network and had extra formatting that needed to be handled in the URL itself. Here is a walkthrough of the steps I took.

Step #1 – Replace Cloudfront URLs per site

The problem is that URLs formatted like random-token.cloudfront.net were scattered throughout the multisite network. These references need to be replaced with each site’s domain which is unique per subsite. To do this properly I wrote the following replace-cloudfront-references.sh bash script to handle.

#!/bin/bash

cd ~/public
path_to_wp=$( pwd )
cloudfront_url="random-token.cloudfront.net"

for url in = $(wp site list --field=url --path="$path_to_wp" --deleted=0 --archived=0 --skip-plugins --skip-themes)
do
        if [[ $url == "http"* ]]; then
                current_domain=${url/http:\/\//}                            # removes https://
                current_domain=${current_domain/https:\/\//}                # removes http://
                current_domain=$( echo $current_domain | awk '{$1=$1};1' )  # Trims whitespace
                current_domain=${current_domain/\//}                        # Removes remaining /
                echo "Replacing $cloudfront_url with $current_domain"
                wp search-replace "$cloudfront_url" "$current_domain" --log --report-changed-only --skip-tables="wp_sitemeta" --url="$url" --path="$path_to_wp"
        fi
done

Running replace-cloudfront-references.sh loops through each site on the multisite network and runs a WP-CLI search and replace unique per each site. I did need to skip the global table wp_sitemeta so that bad references in there didn’t get changed through each loop.

Step 2 – Cleanup extra token in URLs

The Cloudfront URL format included extra data towards the end of the file name. For example, https://random-token.cloudfront.net/wp-content/uploads/sites/70/2020/08/53294582/main.jpg worked with Cloudfront however Cloudfront added the section right before the file name for security reasons. That extra section needs removed in order for the URLs to work now they are pointed back to their respective domains.

This can be handled with a regex search and replace like this. The --dry-run and --log flags are added for safety to verify the regex search is correct by previewing the replacements to be made.

wp search-replace "wp-content\/uploads\/(sites\/\d+\/\d{4}\/\d{2}\/)\d{8}\/" "wp-content/uploads/\1\2" --regex --dry-run --log --all-tables --report-changed-only

Running a regex search and replacement is a very slow and intensive process. Rather than targeting the entire database my final query only ran on tables I knew there to be Cloudfront URLs. Something closer to the following.

wp search-replace "wp-content\/uploads\/(sites\/\d+\/\d{4}\/\d{2}\/)\d{8}\/" "wp-content/uploads/\1\2" wp_*_options wp_*_postmeta wp_*_posts wp_*_hustle_modules_meta wp_*_layerslider --regex --dry-run --log --all-tables --report-changed-only

When satisfied with the search and replacements to be made, last step is to remove --dry-run flag and run the final command.

Step #3 – Additional cleanup

While running the find and replace with the --log I noticed a few other places that needed to be changed. For example, I saw extra tokens removed for URLs referring to custom-bucket.s3.amazonaws.com. So I plugged that into the first script cloudfront_url=custom-bucket.s3.amazonaws.com and re-ran.