Documenting our migration to Docker—challenges and lessons learned
A couple of weeks back, we announced our latest product release and the new any-app capabilities of the Artifakt platform, powered by container technology and Docker.
For the last couple of years, Artifakt has been focusing on PHP stacks. But PHP is not the only language for web applications out there. With the new Docker integration, we have big plans ahead!
Rebasing our PaaS on the de facto standard for application packaging is great news for developer teams of all shapes and sizes.
You will find many perks in this release, and it has never been a better time to come onboard a container-friendly PaaS: infrastructure as code, change traceability, runtime performance, customer value. Not to mention that Docker is best friends with the core DevOps principles of agility, antifragility, and time-to-market.
But all journeys, no matter the delightful outcome, often come with struggles. In this article, I want to dive into the challenges we faced during our Docker migration and the lessons we learned.
If you are looking to migrate to containers, have a read. I’m sure you’ll find at least one point to relate to, and maybe the way we approached certain issues and challenges will help you avoid potential mistakes.
Problem: there is no free lunch in software engineering
PaaS solutions are very convenient. I’m sure anyone migrating their app from on-prem or IaaS will agree.
Part of this convenience level is the use of Virtual Machines (VM). Nice piece of software, very solid abstraction, secure and mature for decades now. No one can do without them nowadays, so they are quite necessary. Nonetheless, this is just not enough.
The dark side of VMs is that they tend to be raised like personal pets. The need for daily maintenance, regular support, and occasional life support calls for serious fixed costs, as well as human assistance.
Now, for the bitter reality: VMs are not adapted to cloud infrastructure anymore. Current conditions are harsh on them, strict regulations may apply, and while the tech landscape evolves constantly, VMs are still built on the same principles as 20 years ago.
Amplify: where things can go wrong
You should always keep in mind all scenarios that can go wrong and prepare. Let’s have a look at just a couple of things that could go wrong in a container migration.
Bad consequence #1: not fast enough
We have heard it before: speed is the mother of all battles. Not only in strategy, but from ideation down to execution as well. Change needs to happen fast to keep up with the fast evolution of the ecosystem. As Jack Welch, former CEO of GE, said:
It is not the big eating the small, but rather the fast eating the slow. A good idea with slow execution means the death penalty. Bad ideas with good execution mean you can still pivot and experiment and close down on a winning market fit.
Bad consequence #2: lack of confidence
Raise your hand if you have never heard “But… It works on my laptop!” in a software project—I doubt I’ll see many hands raised. Sure, we have infrastructure as code, continuous integration, and all the modern machinery. Don’t get me wrong, there are local optimizations created inside silos, and they are useful. The real deal though is to break silos and take the team to a maturity level high enough, so that “they build it, they run it”.
We don’t want the PaaS to simply turn “Ops problem now” into “support problem now”. Remember the “disaster girl” meme? That’s a silo syndrome gone wrong! And BlackOps is even more dangerous: you don’t want the dev team to run containers themselves and poke holes in your firewall.
What triggered the need for change and our migration to Docker?
Containers are the answer to all the challenges mentioned above—a mature, even boring, standard of software packaging. But it was not always like that.
Back in the 20th century, globalization was made possible by one really concrete material change: a universal box to move cars, food, etc. This alone enabled “inter-modal transportation” as we know it today.
Since 2014, software containers have rapidly matured into a de facto standard under the flag of Docker Inc., with a simple yet working promise:
“Whatever is the underlying executing system, Docker can run it with the exact same code, byte by byte.”
Containers have been running under the hood for many months now at Artifakt—even for stateful stuff, dare we say. We now run different configuration changes in a matter of seconds, instead of 10 to 30 minutes of VM provisioning.
With our next major console version, Artifakt moves to expose containers as units of deployment.
Transformation and testimony: how we made Magento 2 even shinier
You can imagine the groundbreaking effect the Docker migration had on our daily work routine. Orchestrating VMs required strong coupling with our cloud provider on somewhat proprietary technologies. For AWS, it was CloudFormation and OpsWorks. We spent many hours ingesting layers of cloud complexity so that our beloved customers would not have to.
As a result, the magnificent rewrite leaves us with open source and open formats: Dockerfile, docker-compose.yaml, and the stable Docker API.
How about running the exact same Magento 2 stack on the laptop and taking it to production? This is now possible in Artifakt.
Magento 2 is one of the nine runtimes we have officially supported since our Stack v5 release earlier in July. In many ways, this release concentrated all the challenges in one place:
- crontab management
- tests on containers
- deployment processes
- a local stack identical to production
Let’s take a look at how we overcame these challenges and where this leaves us.
Docker migration part I: the good
The journey begins with the classic Docker goodness. We were already sold on the benefits of containerization in PaaS environments. Some aspects are really a no-brainer to implement. Maturity is high enough, the ecosystem is thriving, and cloud providers have already been paving the way for a couple of years now.
We cannot go through all the quick wins and many would be straight-up boring in 2021, so let me focus on the most enlightening.
Quick win #1: workflow engine with Argo
Argo Workflows is a wonderful tool to implement reactive GitOps pipelines. That’s a mouthful, so let’s break it down first. GitOps gives teams the ability to conduct and execute changes from Git, and not only code changes. We are talking about infrastructure, networking, storage, etc. Every piece of the machinery ever created, upgraded, or decommissioned can be linked back to a Git commit.
Feeling excited? We know we are! Our workflows are nicely tucked inside Argo and run our base operations like deployment jobs for many different stacks and languages.
Everyone inside our organization has access to past workflows and their logs, making troubleshooting a breeze. Instead of logs being fragmented across many layers of complexity, we now benefit from a shared place where the Customer Support and Engineering teams can help customers measure and optimize their processes.
In the future, we are also looking forward to trying Argo CD and the many opportunities it offers for a PaaS product like Artifakt.
Quick win #2: linting and testing in the container world
Turns out there are many ways to go wrong with Docker images. Just as automated tests are a good practice in code, the same principle applies to Dockerfiles. The build itself already ensures the basic correctness of a Dockerfile: every command must return 0, or else the build fails.
What we need here is to add two more dimensions: validity and content. Validity ensures our Dockerfiles are written in the right style, while content checks cover semantics.
Could we do it all with only native Dockerfile instructions? Almost, but it would result in bloated images with many non-production layers, which in turn calls for multi-stage builds, and so on.
In other words, a successful build is not enough anymore: the image must also carry the correct software, in the right amount and number.
So, we chose to follow the principle of separation of concerns and draw a line between testing, validation, and semantic checks.
For simple syntax linting, we went with hadolint, a specific linter for Dockerfiles. It is powerful enough yet gentle to onboard, integrates nicely with CI/CD, and is actively maintained. Of course, it lives in a Docker image itself! Let’s take a peek at basic options:
hadolint - Dockerfile Linter written in Haskell

Usage: hadolint [-v|--version] [--no-fail] [--no-color] [-c|--config FILENAME]
                [-V|--verbose] [-f|--format ARG] [DOCKERFILE...]
                [--error RULECODE] [--warning RULECODE] [--info RULECODE]
                [--style RULECODE] [--ignore RULECODE]
                [--trusted-registry REGISTRY (e.g. docker.io)]
                [--require-label LABELSCHEMA (e.g. maintainer:text)]
                [--strict-labels] [-t|--failure-threshold THRESHOLD]
                [--file-path-in-report FILEPATHINREPORT]
  Lint Dockerfile for errors and best practices
Right from the default help screen, we are welcomed with a bunch of nice options and examples: check required options, output report in a different format, declare a private trusted registry, etc.
To introduce hadolint in continuous integration, we recommend these three steps:
- Start with --no-fail and safely evaluate the results without breaking the current flow.
- After playing with options and finding the right balance, make the necessary fixes to tested Dockerfiles.
- Remove the --no-fail option and enable hadolint as a mandatory step.
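The three steps above can be sketched as a small CI helper. This is a minimal sketch and not our actual pipeline: the function name lint_dockerfile and the HADOLINT_ENFORCE variable are hypothetical, while the image tag and the --no-fail flag are the ones used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical CI helper: lint one Dockerfile with hadolint.
# Report-only by default (step 1), enforcing once HADOLINT_ENFORCE=1 (step 3).

HADOLINT_IMAGE="hadolint/hadolint:v2.6.0"

lint_dockerfile() {
  local dockerfile="$1"

  # The Dockerfile is streamed on stdin, so nothing has to be mounted.
  if [ "${HADOLINT_ENFORCE:-0}" = "1" ]; then
    # Enforcing mode: findings make the step (and the pipeline) fail.
    docker run --rm -i "$HADOLINT_IMAGE" hadolint - < "$dockerfile"
  else
    # Report-only mode: findings are printed but the exit code stays 0.
    docker run --rm -i "$HADOLINT_IMAGE" hadolint --no-fail - < "$dockerfile"
  fi
}
```

Switching from step 1 to step 3 then only means exporting HADOLINT_ENFORCE=1 in the CI configuration, without touching the job itself.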
To demonstrate how nice hadolint is, let’s take a look at this simple command from our Docker base images:
$ docker run --rm -i hadolint/hadolint:v2.6.0 hadolint - < ./akeneo/5-apache/Dockerfile
Hadolint then proceeds to give you a list of recommendations, each with a level of severity (from style to error) and a reference to the exact line. How neat is this? For instance:
-:16 DL4006 warning: Set the SHELL option -o pipefail before RUN with a pipe in it. If you are using /bin/sh in an alpine image or if your shell is symlinked to busybox then consider explicitly setting your SHELL to /bin/ash, or disable this check
-:17 DL3008 warning: Pin versions in apt get install. Instead of `apt-get install <package>` use `apt-get install <package>=<version>`
-:17 DL3059 info: Multiple consecutive `RUN` instructions. Consider consolidation.
All rules, 70 and counting, are referenced and explained in the official repo wiki. They range from very basic checks (“do not use apt-get upgrade”) to highly specialized use cases (“yarn cache clean missing after yarn install was run”).
hadolint options are even more powerful. Take a look at this command:
docker run --rm -i hadolint/hadolint:v2.6.0 hadolint --require-label author:text --ignore=DL4006 --failure-threshold=warning - < ./Dockerfile
In this snippet, we tell hadolint to scan our Dockerfile, requiring a mandatory author label, skipping rule DL4006, and, finally, failing only if the feedback contains warnings or errors, thus ignoring info and style findings.
One last bit for developers: hadolint also supports inline ignore annotations inside a Dockerfile (comments of the form # hadolint ignore=DL3008), making it easier to use the same command on a group of Dockerfiles.
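Since exceptions can live inside each Dockerfile, one and the same command can be applied to a whole tree of images. Here is a minimal sketch of that idea. The function name lint_all and the directory layout are assumptions, and the threshold is simply the one from the previous example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the same hadolint command on every Dockerfile found
# under a directory, and report how many of them failed.

lint_all() {
  local root="${1:-.}"
  local failed=0
  local list dockerfile

  # Collect the Dockerfiles first (one path per line).
  list="$(mktemp)"
  find "$root" -type f -name Dockerfile > "$list"

  while IFS= read -r dockerfile; do
    echo "Linting ${dockerfile}"
    # Per-file exceptions are expected to be inline annotations in the
    # Dockerfile itself, so the command line stays identical for all images.
    if ! docker run --rm -i hadolint/hadolint:v2.6.0 \
        hadolint --failure-threshold=warning - < "$dockerfile"; then
      failed=$((failed + 1))
    fi
  done < "$list"
  rm -f "$list"

  echo "${failed} Dockerfile(s) failed linting"
  [ "$failed" -eq 0 ]
}
```

Calling lint_all on the root of a base-images repository returns a non-zero status as soon as one Dockerfile fails, which is usually what a CI job needs.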
All in all, this gives us a nice start in testing. Later on, when we feel ready to apply a higher level of good practices, we can lower the failure threshold to a stricter level.
Having Docker images that are at the same time syntactically correct and following good practices is great. Now, we also needed to check for validity in content. After all, a PaaS product is expected to have a set of internal rules that are applied.
For this, we use another nifty tool, Google’s own container-structure-test. It’s part of the GoogleContainerTools namespace on GitHub, which also hosts some famous projects you may already know and use: Skaffold, distroless, Jib, and Kaniko.
Container-structure-test is an open source project that reads a test suite and checks an existing Docker image against it. It has one command: test (what did you expect?). Note that this differs from hadolint, where a plain text Dockerfile was enough. Container-structure-test needs a binary artifact: the Docker image, or an export as a .tar file. As with hadolint’s configuration, tests are written in plain YAML, which seems to be the second most useful skill in the cloud ecosystem (just behind Git, right?). Here is an interesting sample of declarative tests in YAML:
schemaVersion: '2.0.0'
metadataTest:
  labels:
    - key: 'vendor'
      value: 'Artifakt'
    - key: 'author'
      value: "^\\w+([-+.']\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*$"
      isRegex: true
  volumes:
  entrypoint: ["/usr/local/bin/docker-entrypoint.sh"]
fileExistenceTests:
  - name: 'bash'
    path: '/bin/bash'
    shouldExist: true
    permissions: '-rwxr-xr-x'
    uid: 0
    gid: 0
    isExecutableBy: 'any'
commandTests:
  - name: "debian based server"
    command: "which"
    args: [ "apt-get"]
    expectedOutput: ['/usr/bin/apt-get']
In this block, we tested an image for metadata, files, and command results, respectively defined by the three main keys metadataTest, fileExistenceTests, and commandTests.
But why does container-structure-test expect a pre-existing build step? Because, under the hood, it uses docker exec commands on live containers to run each test. The cherry on top is that tests are also guaranteed to be isolated because one container runs only one test.
This behavior is easily visible when you run tests with the --save option to keep the containers. Let’s run the following test on one of our official images:
container-structure-test test --image registry.artifakt.io/sylius:1.10-apache --save --config ./global.yaml
And let’s check saved containers with docker ps -a:
caae8bd3159c 678b262bc2d6 "NOOP_COMMAND_DO_NOT…" 2 minutes ago Created practical_einstein
3b63b4f5fcd1 registry.artifakt.io/sylius:1.10-apache "NOOP_COMMAND_DO_NOT…" 2 minutes ago Created admiring_solomon
44497e1d88cf d10ef6baba46 "which apt-get" 2 minutes ago Exited (0) 2 minutes ago bold_tesla
6e87be191af3 registry.artifakt.io/sylius:1.10-apache "NOOP_COMMAND_DO_NOT…" 2 minutes ago Created angry_bouman
This is really useful when you want to inspect saved containers and how they were impacted, either with docker logs or docker diff. Using this last command, we can easily write a test to ensure our container is stateless enough or does not leak unexpected files, or any other assessment on a live container.
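As an illustration of the docker diff idea, here is a minimal sketch of a statelessness check. It relies on docker diff printing one change per line (A, C, or D followed by the path). The function names and the allow-list of writable paths are hypothetical and would need tuning for a real stack.

```shell
#!/usr/bin/env bash
# Hypothetical check: fail when a container changed files outside the paths
# we allow to be written at runtime (the allow-list below is an assumption).

ALLOWED_PATHS='^/(tmp|run|var/log)(/|$)'

unexpected_changes() {
  local container="$1"
  # docker diff prints "A /path", "C /path" or "D /path", one per line.
  # Parent directories of a changed file are reported as changed too, so a
  # path that is the parent of another reported path is skipped.
  docker diff "$container" | awk -v allowed="$ALLOWED_PATHS" '
    { path[NR] = substr($0, 3) }
    END {
      for (i = 1; i <= NR; i++) {
        is_parent = 0
        for (j = 1; j <= NR; j++)
          if (i != j && index(path[j], path[i] "/") == 1) is_parent = 1
        if (!is_parent && path[i] !~ allowed) print path[i]
      }
    }'
}

assert_stateless() {
  local container="$1"
  local leaks
  leaks="$(unexpected_changes "$container")"
  if [ -n "$leaks" ]; then
    echo "Unexpected filesystem changes in ${container}:" >&2
    echo "$leaks" >&2
    return 1
  fi
}
```

Run against a container kept with --save, assert_stateless bold_tesla would list every path written outside the allow-list and return a non-zero status.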
As we already mentioned in the first example, container-structure-test assesses three classes of checks out of the box: image metadata (labels, volumes, entrypoints, etc.), file presence, and arbitrary command results.
On top of testing for presence, we also write tests to check for what we don’t want in the final Docker image. Think of development packages, compilers, tooling that could lie around and are definitely not welcome in production. One way to prevent this would be like this:
fileExistenceTests:
  - name: 'forbid gcc'
    path: '/usr/bin/gcc'
    shouldExist: false
Finally, another useful option is to use the setup keyword in conjunction with test commands. Some tests only make sense after the Docker entrypoint has been running. We use the setup key for this. See the sample below:
- name: "check mounted folder private"
  setup: [["/usr/local/bin/docker-entrypoint.sh"]]
  command: "ls"
  args: [ "-la", "/opt/drupal/web/sites/default/private"]
  expectedOutput: [ 'lrwxrwxrwx 1 www-data www-data .+ /opt/drupal/web/sites/default/private -> /data/web/sites/default/private' ]
There are many more features in the official docs and advanced cases like testing daemons (yikes!), so I encourage you to dive into it.
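Putting both tools together, the separation between linting, building, and content checks can be sketched as a three-stage gate. This is only an illustration: the function name check_image and its arguments are hypothetical, and the commands are the ones shown earlier in this article.

```shell
#!/usr/bin/env bash
# Hypothetical three-stage gate: lint the Dockerfile, build the image, then
# check the built image. Each stage only runs if the previous one succeeded.

check_image() {
  local context="$1"   # build context containing the Dockerfile
  local image="$2"     # tag given to the freshly built image
  local tests="$3"     # container-structure-test suite (YAML)

  # 1. Style and good practices: only needs the Dockerfile text.
  docker run --rm -i hadolint/hadolint:v2.6.0 hadolint - \
    < "${context}/Dockerfile" || return 1

  # 2. Correctness: every instruction must succeed for the build to pass.
  docker build -t "$image" "$context" || return 1

  # 3. Content: needs the built image, hence the ordering.
  container-structure-test test --image "$image" --config "$tests"
}
```

For example, check_image ./akeneo/5-apache akeneo:test ./global.yaml would stop at the first failing stage and return its status.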
Docker migration part II: the bad (and fun!)
Many challenges were not expected along the road of containerization, so we must share a few of them, as we gained valuable insights.
Challenge #1: crontab integration
Our most complete runtimes, Magento 2 and Akeneo, are heavy users of cron jobs: indexing, caching, image resizing, import/export, you name it.
How does Docker handle asynchronous intermittent processes? Not well, actually. It is well known in Docker 101 that thou shalt not run cron in the same container as the main process.
So what are the valid alternatives? We considered the following:
- Swarm cronjob
- cron job containers
- Docker exec bridge
First, Docker just upgraded their Swarm orchestration layer to run cron jobs, just like Kubernetes. That could work, but Swarm was not on our initial roadmap.
Secondly, we could run additional containers for each cron job, using a cron daemon at the node level. This method has its pros and cons. Being constrained by time and schedule, we had to move faster.
Lastly, we could keep the crontab at the node level, and run the commands inside the live container using docker exec. This could work because we still run one application container per server, so that makes sense for now.
We went for the last option, and the result was simple, elegant, and respectful of the Linux spirit we love. Here are the three simple steps to inject cron jobs into a live container:
- Create a docker exec wrapper, where really 2 lines are enough to target the application container:
#!/bin/bash
# cron calls its SHELL as: <shell> -c "<job>", so $2 holds the job command line
sudo docker exec -t <cID> sh -c "$2"
Save the wrapper somewhere in the crontab’s userspace (we named it dockercron.sh).
- Add/Replace the default crontab SHELL env. var with your script:
### Crontab Managed By Artifakt ###
SHELL=/home/ec2-user/dockercron.sh
*/5 * * * * uptime > /tmp/uptime.txt
- Run docker events and check crontab is calling the app container:
2021-07-28T13:55:02.338665047Z container exec_create: sh -c uptime > /tmp/uptime.txt a360afb6e (artifakt.io/image=616787838396.dkr.ecr.eu-west-1.amazonaws.com/sylius, execID=8b14a4e96eb5175c74b689260e1b692af34f22260e9572fbd2bb2c2b663a3c1e, image=616787838396.dkr.ecr.eu-west-1.amazonaws.com/sylius:1.10, vendor=Artifakt)
2021-07-28T13:55:02.339417333Z container exec_start: sh -c uptime > /tmp/uptime.txt a360afb6e (artifakt.io/image=616787838396.dkr.ecr.eu-west-1.amazonaws.com/sylius, execID=8b14a4e96eb5175c74b689260e1b692af34f22260e9572fbd2bb2c2b663a3c1e, image=616787838396.dkr.ecr.eu-west-1.amazonaws.com/sylius:1.10, vendor=Artifakt)
2021-07-28T13:55:02.822814763Z container exec_die a360afb6e (artifakt.io/image=616787838396.dkr.ecr.eu-west-1.amazonaws.com/sylius, execID=8b14a4e96eb5175c74b689260e1b692af34f22260e9572fbd2bb2c2b663a3c1e, exitCode=0, image=616787838396.dkr.ecr.eu-west-1.amazonaws.com/sylius:1.10, vendor=Artifakt)
This approach maintains the exact same execution environment while keeping the resource usage inside one container, ensuring overall stability.
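The wrapper needs to know which container to target. One possible refinement, shown here as a sketch and not as what runs on our nodes, is to look the container up by label at call time. The vendor=Artifakt label is the one visible in the docker events output, while the function names are hypothetical. Depending on the cron user’s permissions, the docker calls may need a sudo prefix.

```shell
#!/usr/bin/env bash
# Hypothetical variant of the cron wrapper. cron calls its SHELL as
#   <shell> -c "<job command line>"
# so $1 is "-c" and $2 is the job to run.

# Find the running application container by label instead of a fixed ID.
find_app_container() {
  docker ps --quiet --filter "label=vendor=Artifakt" | head -n 1
}

run_in_container() {
  local job="$2"
  local container
  container="$(find_app_container)"

  if [ -z "$container" ]; then
    echo "dockercron: no running application container found" >&2
    return 1
  fi

  # No TTY is requested here, since cron does not provide one.
  docker exec "$container" sh -c "$job"
}

# When installed as the crontab SHELL, the script would end with:
#   run_in_container "$@"
```

With this variant, a redeployment that replaces the container does not require rewriting the wrapper, as long as there is still one labeled application container per server.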
Challenge #2: cancel OpsWorks
Well, removing a legacy layer that worked for years is close to impossible. So, to be perfectly honest, we did not totally succeed. However, we tried our best and managed to make this layer as boring and irrelevant as possible.
The shift from OpsWorks to Docker required moving platform rules out of cookbooks and into docker-compose YAML and bash scripts. The result is a single branch for all our deployments.
Fast forward to dozens of commits and after many trials and errors, all OpsWorks does now is to install the Docker Engine and a handful of container dependencies.
Everything else is operated through the Docker API, triggered by plain old OpsWorks jobs that are more predictable than ever.
That counts as a great milestone in simplification and technical debt flattening!
Challenge #3: run HTTPS locally
As we approached release time, our CEO looked at our demos and said, “Hey it would be nice if developers could run their apps locally with HTTPS”. As developers ourselves, everyone on the team immediately understood the opportunity.
The closer the workstation is to production, the fewer bugs we ship. This is another way to express factor 10 of the 12-factor app: “dev/prod parity”.
Here is how we did it. Only two additional components (containers!) were required: Nginx-proxy and Cert companion. We patched the original docker-compose.yaml with them, with no code modifications other than this:
version: '3'
services:
  proxy:
    image: jwilder/nginx-proxy
    container_name: base-wordpress-proxy
    restart: always
    ports:
      - "8000:80"
      - "8443:443"
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock:ro
      - ./certs:/etc/nginx/certs
  proxy-companion:
    image: nginx-proxy-companion:latest
    restart: always
    environment:
      - "NGINX_PROXY_CONTAINER=base-wordpress-proxy"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./certs:/etc/nginx/certs
Finally, it is important that our application container has two more environment variables for this setup to work, so we added them like this:
app:
  image: base-wordpress:5-apache
  volumes:
    - ".:/var/www/html"
  environment:
    VIRTUAL_HOST: "localhost"
    SELF_SIGNED_HOST: "localhost"
On the first run, the nginx companion looks for a ca.cert and generates it for you on the fly if it does not exist. We then have to tell browsers to trust this ca.cert like any other CA. Sadly, this still requires a manual setup.
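Before touching any browser settings, the generated CA can be checked from the command line. The sketch below makes several assumptions: the function name is hypothetical, the CA file is assumed to land in the ./certs volume under the name ca.cert, and the URL uses port 8443 as published in the compose file above.

```shell
#!/usr/bin/env bash
# Hypothetical smoke test: call the local stack over HTTPS while trusting only
# the CA generated by the companion container.

check_local_https() {
  local ca_file="${1:-./certs/ca.cert}"          # assumed location of the CA
  local url="${2:-https://localhost:8443/}"      # 8443 is mapped to 443 above

  if [ ! -f "$ca_file" ]; then
    echo "CA file not found: ${ca_file} (has the stack been started once?)" >&2
    return 1
  fi

  # --cacert makes curl verify the server against this CA file only, so a
  # successful call shows that the certificate chains to the generated CA.
  curl --silent --show-error --fail --output /dev/null \
    --cacert "$ca_file" "$url"
}
```

A zero exit status from check_local_https means the proxy serves a certificate signed by the local CA. What remains is making the browser trust that same file.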
We tried, looked around, and fiddled with Let’s Encrypt—no solutions were working out of the box. You cannot use Let’s Encrypt as a CA to provide localhost certs.
Finally, some extra steps were required, usually messing with local root certificates. Here is the shortest path we found for developers to install the local certificate authority once and use it on all local development stacks. Note that the following steps work only on Google Chrome (they go through Keychain Access, so they assume macOS).
- Open Chrome settings and search for “certificates”.
- Open the Manage certificates menu, which opens Keychain Access.
- In the System keychain, open the Certificates tab and drop in the ca.cert from Nginx-Proxy Companion.
- Double-click the Nginx-Proxy Companion certificate and set it to Always Trust.
Docker migration part III: the road ahead
We still have many ways to make a good platform even better. Here are the next steps we are considering.
Data is hard, and we heavily rely on AWS persistent data. This already works wonders for resiliency and scalability. Our customers need their data safe and close. Developers, on the other hand, favor convenience and ease of use.
As speed is a competitive advantage, we are making great strides toward faster deployments. Docker images can get big, and building arbitrary code gets really slow really fast (pun intended). Some of the best practices we want to implement in the future include:
- proxies for dependency managers like Composer, npm, or Maven
- Docker layer cache instead of pristine build environments
- shared data volumes from build to build
- build and push the image closer to production, instead of moving through regions
These steps can dramatically decrease delays and I highly recommend you give them a try.
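As an example, reusing the Docker layer cache can be sketched with standard docker build options. This is a minimal sketch under assumptions, not our build system: the function name and the image reference passed to it are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical build step that reuses layers from the previously pushed image
# instead of starting from a pristine environment.

build_with_cache() {
  local image="$1"         # placeholder, e.g. registry.example.com/app:latest
  local context="${2:-.}"

  # A missing previous image is fine: the first build simply runs uncached.
  docker pull "$image" || echo "No previous image, building from scratch" >&2

  # BUILDKIT_INLINE_CACHE=1 embeds cache metadata in the image, so that
  # BuildKit-based builds can consume it through --cache-from later on.
  docker build \
    --cache-from "$image" \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    --tag "$image" \
    "$context" || return 1

  docker push "$image"
}
```

The gain depends on how the Dockerfile is ordered: instructions that rarely change (system packages, dependency installation) should come before the application code so that their layers stay reusable.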
Advanced use cases in Docker builds
Some advanced Docker images need a live database to complete the build stage. This sounds weird at first but we’ve seen this quite often. We are considering a few leads like:
- spinning a dummy database server just for the build
- using SQLite as a volume, when compatible—no servers are needed
Here you go, containerization is not boring!
So there you have it, our humble experience of migrating to Docker. If you are currently migrating or looking to migrate to containers, I hope you found some useful tips in this article.
Do you have questions or suggestions on how to make developers’ lives easier? We’d love to hear about it! Reach out to us on Twitter or submit your ideas here.
And if you’re already interested in what Artifakt has to offer, why not book a demo with us?