A valued client of ours in the financial technology industry recently experienced a critical issue: recurring API timeouts at consistent intervals. The client's infrastructure comprised SQL Server 2016 SP3 on the database tier, ASP.NET/C# as the application development framework, and Dbbid as the platform.

During the peak holiday season, the client's APIs began timing out every 25 minutes, disrupting business operations. Our team was immediately engaged to troubleshoot the problem, isolate the root cause, and implement a solution. Read on to see how we handled the situation so that both the client and their customers could enjoy their Christmas.

To begin troubleshooting and isolating the problem, my team and I examined the situation from multiple angles. We analysed the SQL Server wait types during the timeout windows and found the "poison" wait type THREADPOOL dominating that timeframe. During the same windows, the team was also able to track high page-latch waits, and we found insert contention on a single table. This table holds the TDS Auth notification requests and responses and currently has more than 2 billion rows.
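For reference, below is a minimal sketch of the kind of diagnostic queries involved in this step. The DMVs shown are standard SQL Server; the exact scripts we ran during the incident differed.

```sql
-- Cumulative wait statistics since the last restart/clear:
-- THREADPOOL (worker-thread starvation) and PAGELATCH waits
-- ranking high here pointed us toward the contention.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       max_wait_time_ms
FROM   sys.dm_os_wait_stats
WHERE  wait_type IN (N'THREADPOOL', N'PAGELATCH_EX', N'PAGELATCH_SH')
ORDER BY wait_time_ms DESC;

-- Live view during a timeout window: which tasks are stuck on
-- page latches, and on which page. resource_description shows
-- db_id:file_id:page_id, useful for spotting a single hot page
-- (classic last-page insert contention).
SELECT session_id,
       wait_type,
       wait_duration_ms,
       resource_description
FROM   sys.dm_os_waiting_tasks
WHERE  wait_type LIKE N'PAGELATCH%'
ORDER BY wait_duration_ms DESC;
```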

With the initial troubleshooting complete, we reviewed the findings and narrowed down to two options that looked feasible:

Option 1
Archive the data from this table, keeping only the last 2 days of data. We tried this first to see whether it resolved the issue, but it did not: we still faced the same timeouts.
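A hedged sketch of such a batched purge is shown below. The table and column names (dbo.TdsAuthNotification, CreatedAt) are hypothetical placeholders, not the client's actual schema; deleting in small batches limits transaction log growth and blocking on a heavily inserted table.

```sql
-- Hypothetical purge: keep only the last 2 days of rows.
-- Table and column names are placeholders, not the client's schema.
DECLARE @cutoff datetime2 = DATEADD(DAY, -2, SYSUTCDATETIME());
DECLARE @rows   int = 1;

WHILE @rows > 0
BEGIN
    -- Small batches keep each transaction short, so the log
    -- stays manageable and concurrent inserts are not blocked long.
    DELETE TOP (5000)
    FROM   dbo.TdsAuthNotification
    WHERE  CreatedAt < @cutoff;

    SET @rows = @@ROWCOUNT;
END;
```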

Option 2
With the first option ruled out, the team dug deeper. We collected Perfmon counters on the database server, and that is where we found spikes in disk queue length on the N and F drives, where the database log files resided.

The spikes matched the timeout timings: the graphs we obtained showed a very visible spike pattern coinciding with the API timeouts. Whenever the disk queue length climbed, I/O on those drives froze or stalled, and the application team saw timeouts (every 25 minutes).
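To corroborate the Perfmon picture from the SQL Server side, checks along these lines can be used. This is a general sketch rather than the exact scripts from the incident: sys.dm_io_virtual_file_stats reports cumulative I/O stall time per database file, and xp_readerrorlog can search the error log for "I/O is frozen" messages (written, for example, when a VSS snapshot backup freezes I/O).

```sql
-- Cumulative I/O stall time per database file: files on the affected
-- drives should show disproportionate stalls during the spike windows.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.io_stall_read_ms,
       vfs.io_stall_write_ms,
       vfs.num_of_reads,
       vfs.num_of_writes
FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN   sys.master_files AS mf
       ON mf.database_id = vfs.database_id
      AND mf.file_id     = vfs.file_id
ORDER BY vfs.io_stall_write_ms DESC;

-- Search the current SQL Server error log for frozen-I/O messages
-- and compare their timestamps against the disk queue length spikes.
EXEC sys.xp_readerrorlog 0, 1, N'I/O is frozen';
```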

Through this thorough investigation, the team established a clear understanding of the underlying issue and carried out two remediation steps:

Step 1. We worked with the Windows admin team and found that one physical drive (PD), Disk 23 in Enclosure 0 on Connector 0 of the RAID controller in Slot 2, was not functioning correctly.
Step 2. We then excluded the affected drives from the antivirus and Windows Defender scans. That took care of the issue: the disk queue length spikes and the API timeouts did not recur.

In the end, it was a merry Christmas for all of us.