Problem:
We have a single-instance application that runs in a Docker container attached to a swarm network. We want a bash script that deploys a new version of the application. The deployment should be fault-tolerant: if anything goes wrong, we should not lose the previous version of the application.
Solution:
You can try using the docker service command to solve the problem, but here I am going to show how to DIY. The simplified steps are as follows (details and caveats come after):
1. Build the image that will be used to instantiate the new container.
2. Disconnect the old container from the swarm network. Then stop it.
3. Instantiate the new container, giving it a temporary name. We assume the container gets attached to the swarm network as part of instantiation.
4. Rename the old container to something temporary.
5. Disconnect the new container from the network.
6. Rename the new container to the old container's name.
7. Re-connect the new container to the swarm network.
8. Delete the old container.
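The build step and deploy-container.sh are not shown in this post. As a rough sketch only, steps 1 and 3 might look like the following. The function names, the IMAGE and NETWORK variables, and the published port are assumptions made for illustration (443 is borrowed from the error message quoted in the main script). Also note that a standalone container can only join a swarm overlay network that was created with --attachable.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of steps 1 and 3. IMAGE, NETWORK and port 443 are
# assumptions, not values taken from the original deploy-container.sh.

# Step 1: build the image from the Dockerfile in the current directory.
build_image() {
    docker build -t "$IMAGE" .
}

# Step 3: start a detached container under the given (temporary) name,
# attached to the network at creation time and publishing port 443.
deploy_container() {
    docker run -d --name "$1" --network "$NETWORK" -p 443:443 "$IMAGE"
}
```

With IMAGE and NETWORK exported, build_image followed by deploy_container "$NEW_CONTAINER_NAME" would cover the two steps.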
Caveat #1
Steps #2 and #5 are needed if you are using a swarm network, because without them renaming the new container to the old container's name (Step #6) fails with this error:
Could not add service state for endpoint XXX to cluster on rename:
cannot create entry in table endpoint_table with network id tto0055xkicxz0397dln5h06y
and key f31072bc004794a4ec5943e4549cd89ad5647047c336fc2c59171c7a4aaef596, already exists
This error does not happen on a bridge network. See the docker network disconnect man page, where it says: **The container must be running to disconnect it from the network.** As with everything Docker, this is a bit counter-intuitive, as you normally turn an appliance off before disconnecting it from power.
This also means that we need to check that the old container is running in Step 2, and start it if it's not running.
Utility Functions
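# "-f name=" matches substrings, so grep the NAMES column (the last one)
# for an exact, whole-line match.
function container_exists {
    # -a lists stopped containers as well as running ones
    if [ "$(docker container ls -a -f "name=$1" | awk '{print $NF}' | grep -Fx -- "$1")" ]; then
        return 0
    else
        return 1
    fi
}

function container_is_running {
    # without -a, only running containers are listed
    if [ "$(docker container ls -f "name=$1" | awk '{print $NF}' | grep -Fx -- "$1")" ]; then
        return 0
    else
        return 1
    fi
}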
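As an alternative to filtering the listing, the same two checks can be sketched with docker container inspect, which looks a container up directly and exits non-zero when nothing matches. The function names below are hypothetical and this is not the version used in the rest of the post. One caveat: inspect also resolves container IDs and ID prefixes, so it is slightly more permissive than an exact name match.

```shell
#!/usr/bin/env bash
# Sketch: ask docker about one container instead of grepping a listing.
# exists_by_inspect and running_by_inspect are hypothetical names.

# Succeeds if a container with this name exists, running or stopped.
exists_by_inspect() {
    docker container inspect "$1" >/dev/null 2>&1
}

# Succeeds only if the container exists and its State.Running field is true.
running_by_inspect() {
    [ "$(docker container inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}
```

Both functions return an exit status only, so they can be used in if conditions the same way as container_exists and container_is_running.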
Rollback
We need a rollback function in case anything goes wrong. The first thing to figure out is: how do you catch exceptions in Bash? Like everything Bash, there is no good way. The trap builtin with the ERR signal is the closest we have.
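To see how trap behaves with ERR in isolation, here is a minimal sketch with no Docker involved. The names on_error and demo are made up for this example, and the function body runs in a subshell so that the trap and the exit do not leak into the calling shell.

```shell
#!/usr/bin/env bash
# Hypothetical handler: report the failure and exit with a non-zero status.
on_error() {
    echo "step failed, rolling back"
    exit 1
}

# The parentheses make the function body a subshell, which keeps the
# trap and the exit contained.
demo() (
    trap 'on_error' ERR
    echo "step 1"
    false            # any command with a non-zero status fires the ERR trap
    echo "step 2"    # never reached
)
```

Calling demo on its own prints "step 1" and then the rollback message, and returns status 1. Two things to keep in mind: the ERR trap does not fire for commands that are tested by if, while or until, or that appear before the final && or || in a list, and shell functions do not inherit the trap unless set -E is in effect.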
function rollback {
    echo "FAILED to provision new container. Rolling back deployment"
    if container_exists "$NEW_CONTAINER_NAME"; then
        # dump the logs of the failed container before removing it
        docker logs "$NEW_CONTAINER_NAME"
        if container_is_running "$NEW_CONTAINER_NAME"; then
            docker stop "$NEW_CONTAINER_NAME"
        fi
        docker rm "$NEW_CONTAINER_NAME"
    fi
    if container_exists "$TMP_CONTAINER_NAME" && ! container_is_running "$TMP_CONTAINER_NAME"; then
        # this means we were able to rename the original container, but
        # renaming the newly provisioned container failed
        echo "rename $TMP_CONTAINER_NAME to $OLD_CONTAINER_NAME"
        docker rename "$TMP_CONTAINER_NAME" "$OLD_CONTAINER_NAME"
    fi
    # resume original container
    if container_exists "$OLD_CONTAINER_NAME" && ! container_is_running "$OLD_CONTAINER_NAME"; then
        echo "starting original container"
        docker start "$OLD_CONTAINER_NAME"
        echo "connecting it to the network"
        docker network connect "$NETWORK" "$OLD_CONTAINER_NAME"
    fi
    # return with non-zero exit code to indicate deployment failure
    exit 1
}
Main Script
if container_exists "$OLD_CONTAINER_NAME"; then
    # trap will catch unhandled exceptions
    # https://stackoverflow.com/a/35800451/147530
    trap 'rollback' ERR
    if ! container_is_running "$OLD_CONTAINER_NAME"; then
        # this code path can be entered when you are developing locally and you stopped the container
        echo "$OLD_CONTAINER_NAME exists but is not running. starting $OLD_CONTAINER_NAME"
        docker start "$OLD_CONTAINER_NAME"
    fi
    # When using a Docker Swarm we have to explicitly disconnect the container
    # from the network.
    # Note that if the network is left with no attached containers after the
    # disconnect, it will disappear momentarily and you will get a
    # "failed to get network during CreateEndpoint" error when you try to
    # attach any container to the network.
    # see https://github.com/moby/moby/pull/41011
    # can't believe Docker has so many bugs in it
    echo "disconnecting $OLD_CONTAINER_NAME from $NETWORK"
    docker network disconnect "$NETWORK" "$OLD_CONTAINER_NAME"
    # stop the container. we have to stop the old container and cannot avoid a short
    # downtime. If we try to provision the new container while the old one is still
    # running, we get this error, presumably when it tries to publish its port:
    # Error response from daemon: driver failed programming external connectivity on endpoint Bind for 0.0.0.0:443 failed: port is already allocated
    echo "stopping $OLD_CONTAINER_NAME"
    docker stop "$OLD_CONTAINER_NAME"
    echo "starting new container"
    CONTAINER_NAME="$NEW_CONTAINER_NAME" ./deploy-container.sh
    # if temp container has been successfully provisioned
    if container_is_running "$NEW_CONTAINER_NAME"; then
        echo "swapping old container with new"
        echo "renaming $OLD_CONTAINER_NAME to $TMP_CONTAINER_NAME"
        docker rename "$OLD_CONTAINER_NAME" "$TMP_CONTAINER_NAME"
        # see https://github.com/moby/moby/issues/42351
        # for why we are disconnecting. Without it the rename will fail on an overlay network.
        # It's a bug in Docker and the disconnect is a workaround.
        echo "disconnecting $NEW_CONTAINER_NAME from $NETWORK"
        docker network disconnect "$NETWORK" "$NEW_CONTAINER_NAME"
        echo "renaming $NEW_CONTAINER_NAME to $OLD_CONTAINER_NAME"
        docker rename "$NEW_CONTAINER_NAME" "$OLD_CONTAINER_NAME"
        # the new container now carries the old name
        echo "reconnecting $OLD_CONTAINER_NAME to $NETWORK"
        docker network connect "$NETWORK" "$OLD_CONTAINER_NAME"
        echo "removing $TMP_CONTAINER_NAME"
        docker rm "$TMP_CONTAINER_NAME"
    else
        rollback
    fi
else
    # it looks like we are deploying for the first time
    CONTAINER_NAME="$OLD_CONTAINER_NAME" ./deploy-container.sh
fi
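The script above assumes that OLD_CONTAINER_NAME, NEW_CONTAINER_NAME, TMP_CONTAINER_NAME and NETWORK are already set. One possible preamble is sketched below. The helper name set_deploy_names and the "-new" / "-old" suffixes are a hypothetical naming scheme, not something taken from the original script.

```shell
#!/usr/bin/env bash
# Hypothetical preamble: derive the three container names from one base name.
# The ${N:?message} form aborts with the message if an argument is missing.
set_deploy_names() {
    local base="${1:?usage: set_deploy_names <container-name> <network>}"
    NETWORK="${2:?usage: set_deploy_names <container-name> <network>}"
    OLD_CONTAINER_NAME="$base"        # the name the live container keeps
    NEW_CONTAINER_NAME="${base}-new"  # temporary name for the new container
    TMP_CONTAINER_NAME="${base}-old"  # parking name for the old container
}
```

For example, set_deploy_names web appnet would make the script swap "web-new" into place as "web" on the network "appnet".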