
WordPress Staging Website on the Same Domain

The Staging Dilemma

Here at Follow My Vote, we rely on WordPress to host our website. We are an online company, so this website is our public face, and we strive to keep it informative and attractive to guests. But that isn’t always easy. To keep a WordPress website secure and reliable, both WordPress itself and all of its plugins must be kept up to date, but updates sometimes break the website, impairing its visual appeal or even causing downtime, leaving us scrambling to get all of the pieces working together again with the new version. Furthermore, sometimes we want to update the content of the site. If it’s a small change, it’s (usually) easy and safe to push it into production immediately, but sometimes we want to make big changes and see them in action ourselves before taking them live for the world to see.

In other words, when talking about making changes to the website, whether those be updates, new content, or whatever else… we want to try it before we buy it. We want to see it and try it out like it’s for real, but the public still sees our stable and reliable site while we tidy and polish the next version behind the scenes. And of course, it needs to be easy to set this up without asking our content creators to jump through complex technical hoops.

What we really want is a way to create a staging copy of our website that we can make changes to and test new content or configurations on, while the production website remains all that is visible to the public eye. Well, as it turned out, that wasn’t quite as easy as we expected…

What We Tried

One may suppose that the easiest solution to a problem is the best, so the most obvious thing to try was the features built into WordPress for making drafts of posts and pages. And indeed, these features are great for new content, and they’re reasonable for updating old content, but we find them best suited for quick and basic changes. They can be a bit clunky when doing certain kinds of edits or design changes, and it always feels a bit too easy for a stray click or keypress to publish a draft before it’s ready. For a lot of content work, we find ourselves wanting a bit more, and furthermore, these features leave us completely in the cold for testing updates before we commit to using them on our production website.

So we needed a stronger solution than we found built into WordPress itself. The next thing we tried was to simply copy the WordPress installation and host it on a staging domain, meaning we put the staging site on staging.followmyvote.com rather than regular followmyvote.com and made our updates there.

This was a quick and dirty solution that required minimal sophistication to get up and running, but the problems it created were endless. First of all, this is not particularly discreet. Anyone could type in staging.followmyvote.com and see what we’re up to. There are ways to put it behind a password, but it still feels a bit unprofessional to have this internal resource discoverable to the public, even if it is difficult to guess the password to see inside. Secondly, WordPress doesn’t take kindly to being hoisted up from one domain and plopped down on another. The website needs configuration changes to make it OK with that, and while those are relatively easy to do, it complicates the process, making it a less attractive solution.

We then realized that even if you make WordPress itself work when hosted on a new domain, the pages don’t. Lots of pages had hardcoded links to followmyvote.com rather than relative links based on the current domain, and it was difficult to train ourselves not to copy and paste links without removing the domain from the beginning of them. This made working on the staging site error-prone, and often resulted in us clicking a link that took us back to the public site without our realizing it, which could then lead to making changes there unintentionally! Furthermore, when we eventually did deploy the staging site into production, we realized to our chagrin that we had accidentally copied quite a number of links to the staging site into it, which was a rather embarrassingly public mistake.

All around, then, the approach of using a separate staging domain was painful, error-prone, and ridiculous. It was fraught with difficulties and risks, making it easier to do it wrong than to get it right. The failures of this approach made it far worse than if we had not attempted to use a staging site at all. I considered the possibility that with automation and filtering, we might be able to rescue the strategy, but quickly decided that this would be risky and complicated and would doubtlessly result in still more mistakes.

Nevertheless, I remained convinced that the staging site idea had merit, and for a few years, I contemplated what I would want from such a solution and how it could be done.

A Solution Begins to Take Shape

The first step to solving a problem is always figuring out what one would want in a solution. In the case of a staging solution for our website, I wanted the following features:

Same Domain First and foremost, the staging site and the production site must share the same domain (and URL). No more staging.followmyvote.com nonsense, no more reconfiguring WordPress for different domains, and no more changing links back and forth. Both sites must be accessed by browsing to followmyvote.com.

Not Just WordPress The solution must be built outside of and around WordPress, not using a plugin or any other feature of WordPress. Building such a solution into WordPress would be inherently more complicated than building it outside, and it wouldn’t provide the kind of isolation we would need to test different versions of plugins, themes, or WordPress itself without jeopardizing the production site. Moreover, I want a solution that works for other sites and site frameworks; not merely WordPress.

Professionally Discreet The existence of the staging site must not be publicly obvious. Our public web presence should be buttoned up and shouldn’t have technical details exposed. It should not be feasible for any passerby or bot to access the staging site, even if they happened to know it existed. On the other hand, it should be relatively convenient for us to share access to it with others, without expecting them to possess unusual knowledge or dedication.

Zero Downtime Copying the production site into a new staging site, and subsequently deploying a stable staging site into production, should both incur zero downtime. The website should remain up and serving users the entire time with the only public indication that something happened being a change in content.

Straightforward Automation While it is true that a good staging solution will be reasonably convenient to operate manually, it should also be straightforward to implement full automation in the future so that staging sites can one day be configured, spun up, deployed, or retired with a simple back-office control interface.

The Path Comes into Focus

Executive Summary: This section is a technical discussion of how I arrived at the solution I chose. Those who aren’t interested in the finer details may simply note that I found a solution that satisfies all of my above requirements… with the caveat that, while the path to automation looks to be straightforward, this work remains to be done.

A public demonstration of the staging site solution described herein is available at https://stagingsite.followmy.vote/


Following this section is the complete technical guide on how to set this solution up on a private server. It should be readily understandable to any systems administrator who is experienced with, or is otherwise willing to learn, Docker on Linux.

With these requirements in mind, I put on my sysadmin hat and considered our existing deployment infrastructure. Follow My Vote runs a public WordPress site as well as several internal cloud services to facilitate our business operations. All of these exist as containerized microservices in Docker utilizing Traefik as a TLS-terminating reverse proxy sitting in front of these services. This seems to be a typical deployment strategy, and it has worked well for us for years with minimal turbulence after the initial learning curve.

When we tried the staging domain, we just deployed a copy of WordPress with Traefik serving it on the staging domain rather than the primary domain. While easy to implement, this strategy failed due to the difference in domain. In theory, however, the same back-end architecture could be used successfully if instead of using the domain to route traffic between the production and staging back-end services, I used some other discriminant.

My first idea was to use client IP to select between serving the production and staging sites; however, I rejected this approach because it was clumsy and complicated. On the one hand, an IP address isn’t a good identifier of a client anymore due to NAT, and on the other, it makes the Traefik configuration annoying due to frequent updates to an excessive number of rules.

Then I considered that if the browser indicated that it was looking for the staging site in its request, this would also be adequate for Traefik to pick out and decide how to route the request. This was the approach I finally settled on: the client inserts a custom header into its request and Traefik determines whether the header is correct and, if so, routes the request to the staging back-end.

There was one wrinkle in the solution. WordPress sites regularly make requests to themselves, and since WordPress just does a DNS lookup for the domain it’s configured to run on and sends the request there, obviously without our fancy custom header, the staging service’s requests get routed to the production service rather than back to staging! Working around this required a bit of ingenuity, and I’ll cover it in the technical guide below.

This solution was highly successful for us. It’s easy and reliable to implement on the server side, and opting in or out of the staging site is completely client-side. A browser plugin can handle the injection of the header, and plugins exist for major browsers to configure a custom header and to enable or disable the header with a click, making it convenient to switch back and forth between production and staging. Staging environments can be kept private by making the value of the header into a password. If a staging environment is desired to go live, simply remove the custom header requirement from it and give it a higher priority than the production server, and Traefik will switch new traffic to it with zero downtime. The custom header strategy also looks like it will be easy to automate, although for the time being, this is still future work.


OK, now it’s time for the fun part.

How To Run It on Your Server

We show the setup of this staging solution in three phases. First is the foundation phase, where we set up a trivial Traefik reverse proxy and a placeholder webpage hosted by a fallback server, used when nothing else is working. Initially, no other servers will be defined, thus nothing else will be working, and therefore the placeholder will be shown. The second phase adds our production website on top of the placeholder, and the third phase adds a staging site on top of that. We will explore the configs used in each of these phases here, but readers can also see the full config directory hierarchy in a git repository, where each phase is represented by a commit. The git repository is at https://gitlab.followmy.vote/nathanielhourt/staging-site

The Foundation

We will be using Docker Compose to deploy our server infrastructure. Installing Docker and Docker Compose is left as an exercise for the reader; there is an abundance of tutorials easily discoverable online for virtually all imaginable servers and environments. The reader must also choose a domain on which to host their website and set up DNS for it. If it is not convenient to set up public DNS to follow this tutorial, an entry in a client’s /etc/hosts file will suffice, but in that case the reader must procure their own TLS certificate and manage it manually. Configurations for both manually managed certs and automatic Let’s Encrypt certs (for deployments on a public domain) are shown below.

OK, with Docker and Compose ready to go, and either a public domain or a cert for our private domain in hand, let’s write our service configuration. Begin with a new directory to contain the services for this website, and within it, a subdirectory for the foundation of our deployment, which will hold configs for Traefik and the placeholder site. In this tutorial, we’ll call these directories mywebsite and foundation. In the mywebsite directory, let’s make a file of environment variables we can share with the various services in our deployment, called env. The only variable we need for this tutorial is the domain on which we’re hosting our website:

mywebsite/env
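A minimal sketch of this file; the variable name WEBSITE_DOMAIN is our choice here, and whatever name you pick must match the references in the compose files below:

    # The domain our website is hosted on
    WEBSITE_DOMAIN=example.com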

In our foundation directory, we’ll add a compose.yaml file. Docker Compose looks for environment variables in a .env file in the same directory as the compose.yaml file, so make a symlink at mywebsite/foundation/.env pointing to the shared environment file, ../env.

Now let’s explore the contents of the compose config section-by-section, and remember that the full config of this phase is visible on the repo.

We begin by defining a Docker network (which is to say, a VLAN) to interconnect our production services.

mywebsite/foundation/compose.yaml
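A sketch of the networks block, matching the description below (the authoritative version lives in the repo linked above):

    networks:
      prod:
        name: production-vlan
        ipam:
          config:
            - subnet: 172.100.0.0/24
              # Dynamic IPs come only from the second half of the subnet;
              # the first half is reserved for static assignments
              ip_range: 172.100.0.128/25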

This creates a network, which we can refer to in this file as prod, but which will be known as production-vlan outside this file. That network will occupy the 172.100.0.0/24 subnet, but the first half of the subnet will be reserved for static IPs and Docker will only assign IPs dynamically out of the second half of the subnet.

Next, let’s make a service that will be Traefik, using the official public Traefik image for it, tracking the 3.1 series:

mywebsite/foundation/compose.yaml (cont’d)
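A sketch of the service definition as described in the next few paragraphs; the static IP 172.100.0.2 is an arbitrary choice from the reserved half of the subnet:

    services:
      traefik:
        image: traefik:v3.1
        # Restart the service if it stops for any reason
        restart: always
        networks:
          prod:
            ipv4_address: 172.100.0.2
        ports:
          - "80:80"
          - "443:443"
        healthcheck:
          test: ["CMD", "traefik", "healthcheck", "--ping"]
          interval: 30s
          timeout: 5s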

We assign our Traefik service the fitting (if not very original) name traefik and put it on our prod network with a static IP. Note that we’re only dealing with IPv4 in this tutorial, but Docker will invisibly forward inbound IPv6 traffic to IPv4 service networks if desired. Of course, readers may define IPv6 addresses for their services at their discretion.

We also bind the host’s ports 80 and 443 to the same ports on the service. Note that the ports section governs public ports belonging to the host system — these are not private ports on the VLAN we created! We are giving the Traefik service access to bind the host’s ports 80 and 443, and this is the only service to which we will grant direct access to public ports. All other services will have private VLAN networking only, and Traefik will forward traffic to them over said VLANs as appropriate.

We direct Docker to restart the Traefik service if it stops for any reason, and we define a healthcheck command that Docker will run regularly to verify that Traefik is healthy. Note that by default, Docker will take no corrective action if it becomes unhealthy, but it makes for pretty status reports.

Finally, let’s attach some volumes to our Traefik service. Docker containers, by default, have no persistent storage, and all the data they store is subject to being reset to the original image’s contents any time the service restarts or is rebuilt (which happens during normal maintenance). Docker volumes allow us to persist paths within the container, and also give us a way to place specific files and directories into the container, or to share certain files or directories from the host with the container.

mywebsite/foundation/compose.yaml (cont’d)
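Continuing the traefik service definition, something along these lines (the config mount target assumes Traefik’s default config location):

        volumes:
          # The host's Docker socket, so Traefik can watch services come and go
          - /var/run/docker.sock:/var/run/docker.sock
          # Persistent storage for TLS certificates
          - ./volumes/certs:/etc/certs
          # Traefik's static config, mounted to its default location
          - ./configs/static.yml:/etc/traefik/traefik.yml:ro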

Here, we have done all three. We share the host’s Docker socket with the container, allowing Traefik to access Docker’s API; we make a persistent directory wherein Traefik can store (or access) TLS certificates; and we mount Traefik’s config file into the container, a file we will have created by the time we start this service and Docker goes looking for it.

Note that when we give Traefik access to Docker’s API, we give up all the security benefits of containerization, at least in Traefik’s case. Having access to the Docker API bestows the full power of the host’s root account. Those concerned about this can look into options like rootless Docker or podman and can still follow this tutorial, but details on how to set this up are out of scope here, and the benefits are questionable: either way, a successful attack on Traefik will compromise everything Docker has access to, and whether this equates to the host’s root or not, it’s probably still everything of value on your server. It would be nice if there were a way to expose a restricted Docker API to Traefik instead of the full gamut, but I am not aware of a way to do this today.

Pressing onward, we define a second service in this compose file, which is our placeholder website. We name the service greeter and assign it a bare nginx image. Attach it to our prod network with no config, and Docker will assign it a dynamic IP automatically. We’ll also mount in the website to host, although for this tutorial this is merely a single config file.

mywebsite/foundation/compose.yaml (cont’d)
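A sketch of the greeter service, using the list syntax for labels (the rule and priority labels are discussed next; WEBSITE_DOMAIN is the variable name assumed earlier):

      greeter:
        image: nginx
        networks:
          - prod
        volumes:
          - ./configs/greeter.cfg:/etc/nginx/conf.d/default.conf:ro
        labels:
          - "traefik.enable=true"
          - "traefik.http.routers.greeter.rule=Host(`${WEBSITE_DOMAIN}`)"
          - "traefik.http.routers.greeter.priority=1"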

At the end, we include a section of labels. These are simple key-value pairs that can be attached to most constructs in Docker, and while Docker itself doesn’t use or care about them, they provide information that systems integrated with Docker (such as Traefik) can use to determine how to handle things running within Docker. These labels are how we tell Traefik how to host our service, and Traefik monitors Docker for services starting and stopping and consults the labels on those services to automatically update its routing according to the services available at the moment.

On our placeholder service, we put a label that tells Traefik that we want it to proxy for this service, another label telling Traefik what traffic this service is intended to host (in our case, traffic addressed to our domain), and finally, a label setting a very low priority for this service, so that of the requests addressed to our domain, Traefik will only forward those to the placeholder if nothing else is available to take them. Initially, since no other service is running, this will show us that our servers are working properly. In the next phase, when we start our real website, this placeholder will no longer be served unless the real website’s server crashes. This is because the real website’s server has a higher priority, granting it precedence over the placeholder server any time both are available.

OK, that’s our entire services config. Now let’s make the other things our servers need. Create a directory at mywebsite/foundation/volumes with a certs directory within it. Next, also create a directory mywebsite/foundation/configs; within it, we’ll make two config files, one for Traefik, and one for our placeholder nginx website.

The nginx config is trivial:

mywebsite/foundation/configs/greeter.cfg
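Something like the following will do (the message text is arbitrary):

    server {
        listen 80;
        default_type text/plain;

        location / {
            return 200 "Placeholder page: the real website is not running.\n";
        }
    }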

Note that we listen only on port 80. No TLS is used within the VLANs; Traefik terminates the TLS and proxies raw HTTP traffic to the services. This way the services need not deal with certs or TLS options at all, simplifying their configuration immensely without exposing unsecured traffic to the internet. All of the TLS and certificate concerns can be handled centrally at Traefik.

Next, the Traefik config. I’ve included inline comments to elucidate, but we’ll discuss in more detail below:

mywebsite/foundation/configs/static.yml
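What follows is a sketch of such a config, reconstructed per the discussion below; the authoritative version is in the repo:

    entryPoints:
      web:
        address: ":80"
        http:
          # Don't serve anything over plain HTTP; 301-redirect it all to HTTPS
          redirections:
            entryPoint:
              to: websecure
              scheme: https
              permanent: true
      websecure:
        address: ":443"
        # Route services here unless they request another entrypoint explicitly
        asDefault: true
        http:
          # Empty block: this entrypoint carries TLS traffic, with default options
          tls: {}
          # For Let's Encrypt, replace the empty block above with:
          # tls:
          #   certResolver: letsencrypt

    # Uncomment to have Traefik fetch certs from Let's Encrypt automatically:
    # certificatesResolvers:
    #   letsencrypt:
    #     acme:
    #       email: you@example.com          # your contact email
    #       storage: /etc/certs/acme.json   # lives in our certs volume
    #       tlsChallenge: {}

    providers:
      # Watch Docker for services to expose; only proxy those which
      # are labeled traefik.enable=true
      docker:
        exposedByDefault: false
      # Dynamic config file hooking in manually managed certs
      # (omit if using Let's Encrypt)
      file:
        filename: /etc/certs/manual.yml
        watch: true

    # Enables the ping endpoint used by our compose healthcheck
    ping: {}

    log:
      level: INFO  # set to DEBUG to troubleshoot ACME negotiations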

Traefik operates with two classes of configuration. The first is the static config, which is what we’re setting here, and it controls the constant concerns that are the same over an entire execution cycle of Traefik. The second configuration class is the dynamic configuration, which controls the services Traefik is expected to expose and the rules controlling how those services are accessed. The dynamic configuration changes over the course of Traefik’s execution, and Traefik constantly adjusts its behavior to align with this configuration: as new services show up or old ones shut down, Traefik updates its routing and adjusts how requests are directed accordingly. The dynamic config comes from “providers” which Traefik monitors for up-to-the-moment status updates and changes, and the providers Traefik should monitor are defined in the static config.

OK, with that context established, let’s look over the static config above. First, we define entrypoints. These are the ports Traefik should listen on for incoming traffic. We configure two: port 80, for HTTP traffic; and port 443, for HTTPS. We name the HTTP entrypoint web and the HTTPS, websecure. We configure Traefik not to connect HTTP traffic to services, but rather to respond with a 301 redirect to HTTPS. This is, of course, a matter of preference, but it’s a common enough pattern that it’s useful to show it here. We configure the HTTPS entrypoint, websecure, to be a default entrypoint, to simplify the per-service configuration, and we set TLS options so as to inform Traefik that this entrypoint deals with TLS traffic, as it will not infer this from the use of port 443 alone.

In the next sections, the config differs between those wishing to use a public domain with Let’s Encrypt certs and those configuring their certs manually (as those using a private domain must do). Certificates are a dynamic concern (they change at runtime), so they aren’t set in the static config, but where to get them is. If configuring the cert manually, just put an empty tls config block to indicate that TLS is used but the defaults are accepted. We’ll then set up a dynamic config provider that loads in the certs. Alternatively, if using Let’s Encrypt, uncomment the certResolver config so Traefik knows to get certs automatically, as well as the config lines defining the certificate resolver to be used. In this case, the dynamic configuration provider for manually managed certs can be omitted.

Moving ahead to the providers configuration, we configure Traefik to monitor Docker for dynamic configurations, and I prefer to select explicitly which services should be exposed with the traefik.enable=true label, but again, this is a matter of taste. Then, for those managing certs manually, we add a dynamic provider from a file, where we will hook in those certs.

Those using Let’s Encrypt can omit this, but for those who aren’t, here are some typical contents of that dynamic config file. Be sure also to put your private key and certificate files alongside it, so that the paths in the dynamic config point to them in the container’s filesystem (remember that our mywebsite/foundation/volumes/certs directory shows up at /etc/certs inside the container!).

mywebsite/foundation/volumes/certs/manual.yml
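Something like this, with hypothetical file names (substitute your own cert and key):

    tls:
      certificates:
        - certFile: /etc/certs/example.com.crt
          keyFile: /etc/certs/example.com.key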

Returning our attention to the static config, hopefully the remainder of the configuration is self-explanatory. I do want to note, however, that the ping section is only needed if using the healthcheck on the Traefik service in compose.yaml. If no healthcheck is required, the ping section can be omitted as well. To opt into anonymous usage statistics collection, add a global section with a setting sendAnonymousUsage: true.

With that, our foundation is complete! Go ahead and start the servers by setting your working directory to foundation and running docker compose up. Note that this starts the servers in the foreground with logging to the console; to daemonize it all, pass an additional flag -d and monitor the status with docker compose ps and docker compose logs. You should now be able to open your website in a web browser, though if using Let’s Encrypt, it may take a few extra moments for the cert to be provisioned. If automatic certs don’t seem to be working, turn on the debug logging in the static config and monitor Traefik’s output for details on the ACME negotiations. The certs that Traefik has acquired go into acme.json in the certs volume — that file can also be monitored to determine whether certs are being issued successfully, without enabling debug logging.
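For reference, the commands would look like this:

    cd mywebsite/foundation
    docker compose up -d              # start the services in the background
    docker compose ps                 # check service status and health
    docker compose logs -f traefik    # follow Traefik's log output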

Take a few moments to bask in the glory of your new foundational website infrastructure… and then move on to Phase II, the production website.

Pushing to Prod

Now for the simplest phase of our deployment: creating the production website. For our demo, this is a trivial single-page website, almost identical to the fallback site. We just create another bare nginx service and populate it with a slightly different config. In a real deployment, you might set up a WordPress here, or any number of other apps; from the perspective of the staging system, as well as Docker and Traefik, it’s all fundamentally the same.

As an aside, if you are setting up an app with dependencies, and those dependencies are in their own containers, I recommend putting these dependencies on a separate VLAN without Traefik and connecting the front end to both the prod VLAN and its support VLAN. This creates a better separation of concerns. If your front end is on multiple VLANs, you need to add another label to it so Traefik knows which VLAN to connect to it on, i.e. traefik.docker.network=production-vlan (remember to use the absolute name, not the local alias).

But back to our demo, let’s begin by creating a new directory mywebsite/production, and in that directory creating our .env symlink back to ../env. Next, create our compose file with our service definition:

mywebsite/production/compose.yaml
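A sketch of the full file, reusing the WEBSITE_DOMAIN variable assumed earlier and naming the router prod, as discussed below:

    networks:
      prod:
        name: production-vlan
        external: true

    services:
      website:
        image: nginx
        networks:
          - prod
        volumes:
          - ./configs/website.cfg:/etc/nginx/conf.d/default.conf:ro
        labels:
          traefik.enable: "true"
          traefik.http.routers.prod.rule: Host(`${WEBSITE_DOMAIN}`)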

We don’t need to define the network this time, as it was already defined in the foundation configs. This time we just give it a local name, mention its absolute name, and set external to true so Docker knows to find it already created rather than creating it anew.

Then we create our service, essentially the same as before, but this time we don’t set an artificially low priority. I used a different syntax for the labels this time… but both formats do the same thing.

Finally, we create our nginx config:

mywebsite/production/configs/website.cfg
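Again, a trivial single page, differing from the placeholder only in its message:

    server {
        listen 80;
        default_type text/plain;

        location / {
            return 200 "Hello from the production website!\n";
        }
    }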

To launch it, we do the same thing as before. Make sure Traefik is still running and, from the new production directory, do a docker compose up -d to start the website in the background. Now go to your domain in your browser, and you should see the new production page rather than the placeholder.

And with that, we’re ready to make our staging site!

Staging the Next Thing

Now, to launch our staging site, we could actually just copy our production site, tweak the Traefik rule, and be done… except that if your website or app is anything more than a trivial static webpage, it might at some point want to connect to itself as a client (for example, I know for a fact that WordPress does this to trigger background tasks). If we are too naive about how we set this up, the staging service won’t connect to itself, but will instead connect to the production service. This is because the staging service doesn’t know to set the magic header needed to reach itself through Traefik.

This posed a bit of a conundrum for me when I was designing this solution. I couldn’t just convince the app (I was working with WordPress) to connect to localhost (its own container) rather than using DNS to find Traefik because while it was configured to serve over port 80, it was also configured to know that its URL uses the https:// scheme, so it used that and thus couldn’t connect to itself directly — it had to go through Traefik for TLS termination. “OK,” I thought, “but I control the IP of the container, so I can just tell Traefik to route the request-to-self based on source IP, right?” Well, no, because the way Docker routes outgoing traffic, all containers’ outgoing traffic to a public IP address gets NATed together and shares a single source IP. So I have to force my app to connect to itself through Traefik, but also within Docker without going to the public IP associated with my domain on public DNS. Here is how I did that.

We create a new VLAN for the staging site, giving it a different subnet from the production VLAN. We then put Traefik on both VLANs, giving it an alias on the staging VLAN which is our domain. This means that when a container in the staging VLAN does a DNS lookup on our domain, Docker sees that this is an alias for a container on the VLAN and gives that container’s VLAN-private IP as the DNS result, rather than the result of a public DNS query! This causes the staging container to connect to our domain by connecting to Traefik directly over the staging VLAN, thus eliminating NAT and showing its real source IP to Traefik. Traefik is then configured to route any and all traffic destined for our domain coming from the staging VLAN to the staging frontend, even if it doesn’t have the magic header that says “route me to the staging service”!

So first things first, we edit our foundation compose config to add the new network and to give Traefik an alias on it. In the networks: section, we add:

mywebsite/foundation/compose.yaml (insert)
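A sketch of the addition, assuming a subnet adjacent to the production one:

      staging:
        name: staging-vlan
        ipam:
          config:
            - subnet: 172.100.1.0/24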

Then in the services.traefik.networks: section, we add:
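Assuming the WEBSITE_DOMAIN variable from our shared env file, the addition looks like:

          staging:
            aliases:
              - ${WEBSITE_DOMAIN}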

Remember that you can see the entire config diff on the git repository.

Next, we just copy our entire production directory, calling the new copy staging, update all the references to production things to refer to staging things instead, and update the Traefik routing rule for the staging service like so:
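A sketch of the updated labels; the header name X-Staging-Access and its value are placeholders of our own choosing (pick your own secret), and the ClientIP subnet matches the staging VLAN defined above:

        labels:
          traefik.enable: "true"
          traefik.http.routers.staging.rule: "Host(`${WEBSITE_DOMAIN}`) && (Header(`X-Staging-Access`, `my-secret-value`) || ClientIP(`172.100.1.0/24`))"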

Note that we changed the label key as well as the value; the prod or staging in the label key are names that we define arbitrarily, and they must be globally unique. This is true in many of Traefik’s service labels, so copy and paste carefully!

Examining our rule, it now applies to traffic destined for our domain AND either bearing the magic header, or coming from our staging VLAN subnet. Also, by default a Traefik routing rule’s priority is equal to the number of characters in the rule (a cheap but remarkably effective heuristic for preferring more specific rules over less specific ones), meaning that while our staging traffic will match both the production rule and the staging rule (and the fallback rule too!), the staging traffic will be routed to the staging service by virtue of that service having the longest rule.

Make the appropriate changes to the staging site by updating its config, and we’re ready to run the staging site live! With yet another docker compose up -d we can make it so.

To access the staging site, it is necessary to set a custom header on our requests to our server. An easy way to do this is with the Modify Header Value extension, available for popular browsers. Go ahead and give it a whirl! (Users of Chromium-based browsers may need to use ModHeader or some other option instead, as Chromium’s Modify Header Value doesn’t seem to work on all requests that match the filter.)

Our staging site is now live and working, discreetly, for only those we intend. We can now make our desired changes to it and test them out as if they were in prod, on the normal domain and with the normal config. From an application perspective, the staging site is identical to the production site, except that only we can see it, and only when we want to, while the rest of the world sees the production site without ever knowing another option exists.

In fact, the staging site is so similar to the production site that when we decide it’s ready, we can simply make it the production site — at least temporarily.

Zero-Downtime Updates

To deploy the new staging changes into production, we could shut down production, copy over the new changes, and re-deploy, and with the help of our fallback server we could even display a friendly “Scheduled maintenance” page in the meantime… but wouldn’t it be nicer if we could just switch over with no downtime at all? Well as it happens, there are a number of ways to do this. One of them would be to simply relaunch the staging site with a routing rule that makes it preferable to the production site for all traffic. With a health check on it, Traefik will watch it until it reports as healthy after its startup routine, and then switch new traffic over to it because it has a preferable rule. Once traffic switches over, we could shut down the production site, update it to the new code, start it up again, and once it’s healthy, shut down the staging site so traffic goes back to the service labeled as production. Voila! An upgrade with no downtime at all!
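As a sketch of that first mechanism, using the names assumed earlier, the relaunched staging service might carry labels like these (the explicit priority trumps the character-count heuristic described above):

        labels:
          traefik.enable: "true"
          traefik.http.routers.staging.rule: Host(`${WEBSITE_DOMAIN}`)
          # Outranks any rule whose priority comes from its character count
          traefik.http.routers.staging.priority: "10000"

This is combined with a compose healthcheck on the service, since, as noted above, Traefik watches a container’s health and will only switch traffic to it once it reports healthy.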

Of course, this isn’t the only way to accomplish the same result, and all such mechanisms have their advantages and disadvantages. Depending on the exact details of your app, this approach may not work, and possibly no zero-downtime approach will work at all. Nevertheless, this approach should work for many applications, and even if the zero-downtime bit doesn’t work for your particular app, the overall strategy should still be effective and compelling.

I wish you luck, and I hope you’ve enjoyed this tutorial. If you got this far, or if you have any questions or trouble, drop me a line on Matrix: @I:nathaniel.land. I’ll see you around! =)

Author

Nathaniel Hourt