Scalability Engineers

Readable Secondary Workload Damages Primary Replica Performance

This happened a couple of months after we brought the production database servers online for Apple Pay. The topology was an availability group with four replicas: two in synchronous-commit mode within the same data center, and two in asynchronous-commit mode in a geographically distant data center.

One evening, the CPU load on the primary server started to climb and the application team reported considerable slowness. The preliminary investigation showed that one query, which read one of the “queue” tables, was taking much longer than usual and generating a high number of logical reads, which in turn drove up CPU utilization.

What confounded us was that the table held only a handful of rows, yet the query was generating hundreds of thousands of logical reads. The table was supposed to be small, and the offending query was only looking for a TOP 1 record.

Our team kept investigating and eventually pinned down the cause.

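A few queries were running continuously on the readable secondary, and they kept deferring ghost record and version store cleanup on the primary node. Rows deleted from the queue table therefore stayed on their pages as ghost records, so the TOP 1 query had to read past all of them before it reached a live row.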
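To make the effect concrete, here is a toy model in Python of a queue table whose deleted rows linger as ghost records. It is only a sketch and not how SQL Server is implemented: the class `ToyQueueTable`, the function `simulate`, the figure of 100 rows per page, and the 100,000 processed messages are all assumptions chosen for illustration. It counts how many pages a head-to-tail scan touches before it finds the first live row.

```python
from math import ceil


class ToyQueueTable:
    """Hypothetical model of a queue table: rows are added at the tail and
    deleted from the head. Deleted rows become ghost records that stay on
    their pages until cleanup removes them."""

    def __init__(self, cleanup_enabled, rows_per_page=100):
        self.cleanup_enabled = cleanup_enabled
        self.rows_per_page = rows_per_page  # assumed page capacity
        self.ghost_rows = 0  # deleted rows still physically present at the head
        self.live_rows = 0   # rows a query can actually return

    def enqueue(self):
        self.live_rows += 1

    def dequeue(self):
        if self.live_rows == 0:
            raise IndexError("queue is empty")
        self.live_rows -= 1
        self.ghost_rows += 1  # the row is only marked as deleted
        if self.cleanup_enabled:
            self.ghost_rows = 0  # cleanup physically removes ghost records

    def top1_page_reads(self):
        """Pages a head-to-tail scan touches before it can stop."""
        if self.live_rows == 0:
            # Nothing to return: the scan walks every remaining page.
            return max(1, ceil(self.ghost_rows / self.rows_per_page))
        # The first live row sits right after all the ghost records.
        return self.ghost_rows // self.rows_per_page + 1


def simulate(cleanup_enabled, messages=100_000, backlog=5):
    """Process `messages` queue items, leave a handful of live rows behind,
    then report the page reads needed for a TOP 1 style lookup."""
    table = ToyQueueTable(cleanup_enabled)
    for _ in range(messages):
        table.enqueue()
        table.dequeue()
    for _ in range(backlog):
        table.enqueue()
    return table.top1_page_reads()


if __name__ == "__main__":
    print("cleanup running :", simulate(True), "page read(s)")
    print("cleanup deferred:", simulate(False), "page read(s)")
```

In this model both tables end up with the same five live rows, but the one with deferred cleanup needs 1,001 page reads instead of 1 to find the first of them. The real numbers depend on row size and page layout, but the shape of the problem is the same one we saw: a tiny table by row count, and a very expensive scan.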

We immediately disabled the readable secondary and rebuilt the clustered index on the queue table, and performance returned to normal. We reported the issue to Microsoft, who suggested keeping the readable secondary off.