This is probably the longest I have spent debugging something (4 days), so it's worth the effort to write about it.
The problem: We built a WordPress site but got a CloudFlare 522 error (connection timed out) when trying to connect to it. Our logs were completely empty.
The Setup: Our setup consisted of a VM in a private network with no public IP. This VM ran 3 docker containers connected together in a user-defined bridge network. The first container was an nginx server that forwarded requests for PHP resources to the second container, which ran WordPress (PHP-FPM). The third container ran MySQL 5.7. Nginx server configuration was borrowed from here. If the VM has no public IP, how does one access it? The VM was connected to a layer-4 load balancer in a DMZ with a public IP. Further, the load balancer would only accept traffic from the CloudFlare CDN. The purpose of CloudFlare was to filter out malicious traffic and protect the servers.
CloudFlare -> Load Balancer -> NGINX -> WordPress -> MySQL
How we debugged: We tried all the usual things. We tested that the host was forwarding http (port 80) and https (port 443) traffic to the nginx container:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ec8600d9c4b4 nginx:1.17 "nginx -g 'daemon of…" 24 hours ago Up 24 hours 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp nginx
feb535607eb2 wordpress:php7.4-fpm-alpine "docker-entrypoint.s…" 25 hours ago Up 25 hours 9000/tcp wordpress
48b2dea3706b mysql:5.7 "docker-entrypoint.s…" 25 hours ago Up 25 hours 0.0.0.0:3306->3306/tcp, 33060/tcp mysql
We tested that the wordpress container was listening on port 9000:
$ docker exec wordpress netstat -tpln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.11:40662 0.0.0.0:* LISTEN -
tcp 0 0 :::9000 :::* LISTEN 1/php-fpm.conf)
We tested that the nginx container could connect to the wordpress container:
$ docker exec -it nginx /bin/bash
root@ec8600d9c4b4:/# apt-get install netcat
root@ec8600d9c4b4:/# nc -zv wordpress 9000
DNS fwd/rev mismatch: wordpress != wordpress.wordpress_net
wordpress [172.18.0.3] 9000 (?) open
We edited our wp-config.php to enable debug logs in WordPress:
// Enable WP_DEBUG mode
define( 'WP_DEBUG', true );
// Enable Debug logging to the /wp-content/debug.log file
define( 'WP_DEBUG_LOG', true );
// Disable display of errors and warnings
define( 'WP_DEBUG_DISPLAY', false );
@ini_set( 'display_errors', 0 );
// Use dev versions of core JS and CSS files (only needed if you are modifying these core files)
define( 'SCRIPT_DEBUG', true );
We then tailed the following logs:
$ docker logs -f nginx
$ docker logs -f wordpress
$ docker exec wordpress tail -f /var/www/html/wp-content/debug.log
Logs were empty.
The next thing we tried was to run a bare-bones nginx server without the WordPress setup:
$ docker run -d -p 80:80 nginx:alpine
Now the error vanished! This led us to believe that CloudFlare was working properly. So the problem had to be with wordpress. But then why was there no error in the logs? Without anything in the logs to give a clue, we were stuck for a long time. Then it dawned on us to access the VM directly using its private IP from the VPN and lo and behold the server responded and wordpress loaded up! So it couldn’t be a problem with wordpress!
When 2+2 does not equal 4: If CloudFlare is working properly as well as WordPress, then we are led to a logical contradiction and the 522 cannot be explained.
When we were out of moves, we contacted CloudFlare. They told us that all requests to the site were timing out, which led them to believe something was blocking requests from CloudFlare’s IPs:
Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]
Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]
Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]
Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]
Source IP: Y.Y.Y.Y
nc: connect to X.X.X.X port 443 (tcp) timed out: Operation now in progress
[exit code 1]
But we knew there was nothing blocking CloudFlare as the requests did succeed when we ran a bare-bones NGINX server.
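The nc output above shows a timeout rather than a refused connection, and that difference is a useful clue: a refusal means something answered and rejected the connection, while a timeout means nothing answered at all. Here is a minimal sketch of the same check in Python, assuming Python 3 is available on the machine you test from. The function name `classify_tcp_connect` is made up for illustration, and because our load balancer only accepted traffic from CloudFlare, running this from elsewhere would not reproduce CloudFlare's view.

```python
import errno
import socket


def classify_tcp_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and report how it ended.

    Returns "open", "refused", "timeout", or "error:<errno name>".
    """
    try:
        # create_connection performs the TCP handshake, like `nc -z`.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the deadline (what CloudFlare saw).
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. DNS failure or unreachable network.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
```

Calling `classify_tcp_connect("X.X.X.X", 443)` from a host that is allowed through would return one of the strings above.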
Finally, the light struck us on the 4th day. The load balancer in Azure uses health probes to know if a machine is healthy. It construes any non-200 response as an indication of an unhealthy server. This is all documented here:
An HTTP / HTTPS probe fails when:
Probe endpoint returns an HTTP response code other than 200 (for example, 403, 404, or 500). This will mark down the health probe immediately.
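But it was the first time I was using a load balancer up close and personal and I didn’t know this. The endpoint the probe was requesting was /, to which nginx was responding with a 302 redirect, so the load balancer marked the VM as unhealthy and stopped sending traffic to it. This also explains why the bare-bones nginx container worked: its default page answers / with a 200. As soon as I pointed the health probe at a custom endpoint for which I configured NGINX to return 200, the error vanished!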
# This special section is for the load balancer.
# The load balancer in Azure needs to know if a machine is healthy.
# Its health probe is configured to request the /health-probe endpoint.
# If the load balancer gets a non-200 response it will mark the machine as unhealthy
# and not send requests to it.
location /health-probe {
    return 200 OK;
}
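In hindsight, we could have caught this by requesting the probe path ourselves without following redirects, since a browser quietly follows the 302 and hides the problem. Below is a rough sketch of such a check, assuming Python 3 and plain HTTP on the VM's private address. The function name `probe_is_healthy` is hypothetical, and the real Azure probe has details (intervals, thresholds) that this does not try to reproduce.

```python
import http.client


def probe_is_healthy(host: str, port: int, path: str = "/", timeout: float = 5.0) -> bool:
    """Mimic an HTTP health probe: one GET, healthy only if the status is exactly 200."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        # http.client never follows redirects, so a 302 stays a 302 here.
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply all count as unhealthy.
        return False
    finally:
        conn.close()
```

In our setup, calling this with path "/" would have returned False because of the 302, while "/health-probe" returns True once the location block above is in place.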
It was a typical example of a problem that needs out-of-the-box thinking to fix. For the longest time I kept thinking the problem might be in the docker network, as I have struggled with it in the past. What puzzled me the most was that CloudFlare worked and so did WordPress (when we hit the VM using its private IP), yet we were still getting the 522. It left us with no clue.