It’s been a while since I’ve written about work, but this is one of those cases where someone might be scratching their heads for days as to what could be causing their issue.

TL;DR, check every little thing your replica server does, down to the smallest detail. Even if it’s been doing the same thing for years.

As a long standing employee at my place of work, I often end up with the task of resolving issues with the, shall we say, more legacy parts of the platform. So when I was made aware one of the backup MySQL database replication servers for one of the oldest parts of the platform was misbehaving, I was assigned to tracking down the problem. The issue was that the monitoring system had reported that the replication server was NULL seconds behind the master. This often confuses newbies, often asking “how can it being no seconds behind the master be a critical alert?”. To put it simply, NULL is not equal to zero. NULL, in this case, means it’s no longer processing data coming from the master server so it’s an infinite number of seconds behind the master database or, to put it simply, the replication has stopped.

So, expecting the worst, I hopped onto the offending server and ran SHOW SLAVE STATUS\G expecting it to show the error which caused the last query to fail. But instead I was greeted with the following:

*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: ***.***.***.***
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binary.000504
          Read_Master_Log_Pos: 569676966
               Relay_Log_File: mysqld-relay-bin.000077
                Relay_Log_Pos: 41680265
        Relay_Master_Log_File: binary.000504
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                   Last_Errno: 0
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 469954883
              Relay_Log_Space: 246260313
              Until_Condition: None
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
               Last_SQL_Errno: 0
             Master_Server_Id: 103
1 row in set (0.00 sec)

This was quite odd, it appeared that the replication had just stopped. So I tried the START SLAVE; query and, unsurprisingly, the replication kicked back into life and I didn’t think too much of it. I checked the status again before I went home for the day and the slave was still running fine and had fully caught up with the master. Next morning, I checked the backup server where the backups were shipped to and happily saw that the nightly backup had yielded a file, which indeed was larger than the one that preceded it.

For a sanity check, I decided to check the replication status, only to find it had stopped again and, just like last time, without error. A quick search on the internet didn’t help matters much, with one of the only sensible suggestions to try running mysqlcheck on the affected databases. I fired up a screen session and ran a check which came back clean. So I ruled out the database being an issue.

So it was back to the drawing board. I then decided to check the monitoring software to find exactly what time did the server report the slave had stopped. In this case, it had reported that it had stopped at exactly 04:00 UTC. Suddenly a lightbulb clicked on, as it occurred to me that that’s the exact time the nightly backup kicks off.

Now, because this machine is a dedicated slave, and we’d want to be able to build other replicas that sync from the same master, we’d need to know the exact query the replica was up to at the time the mysqldump application was launched. To achieve this, the --dump-slave=2 argument can be passed to mysqldump. The caveat, however, is when doing so, MySQL stops the replication with the MySQL documentation clearly stating:

This option causes mysqldump to stop the slave SQL thread before the dump and restart it again after.

At this point, we know why the replication is stopping at 04:00 UTC, but not why it’s failing to start after the dump has completed. For this, I had to revisit the shell script I had written over 3 years ago which handles the launch and storage of the backup which had dutifully been running nightly since the server was built. I had the foresight back when I wrote it to dump the time and status of each part of the backup to a log file to help diagnose future issues. So the first place I looked was the log file. In there I could see years worth of messages stating the dump had started, finished, and copied successfully. Even the recent jobs all appeared to have completed successfully.

So it was back to the Internet to see if I could find why mysqldump would fail to start a replica after a slave dump. A few people on various forums and message boards suggested that mysqldump might have actually failed at some point. It was at that point my attention turned back to the backup script itself where I spotted two glaring errors:

  1. The script wasn’t handling non zero exit codes from mysqldump
  2. The script wasn’t writing any output from mysqldump to the log file

The latter was most likely overlooked as mysqldump outputs the dumped database to stdout which needs to be piped to a file so keeping errors out of the dumped file, they’re isolated to the stderr output. What I hadn’t done here was pipe anything that was posted to stderr to the log file. This was fixed by adding 2>> ${LOGFILE} (where ${LOGFILE} is a variable pointing to the actual log file) into the mysqldump command.

After making these modifications to the script, I did a test run where I discovered two things: 1) mysqldump had indeed failed. 2) That not all errors from mysqldump result in a non zero exit code.

mysqldump: Couldn't execute 'show table status like 'vw\_list\_*****'': SELECT command denied to user ''@'10.%' for column '****_********_**' in table 'tbl_*****' (1143)

Here we see that a view was failing to export as a null user was attempting to run a query. The fix for this was to look at how that view had been created, and recreate the missing user on the replica server. Another test run came back with the two lines I’d been hoping for when looking at the slave status:

             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

So, in the end, I put my updated script into service to help track down potential errors again in the future. I’m hoping this tail aids some other poor sysadmin diagnosing a similar issue.