Problem Statement:


Multiple incidents of API timeouts at regular intervals.

Environment details:


Application: ASP.NET/C#
Platform: Dbbid
Database: SQL Server 2016 SP3

Issue description:


I was about to plan my Friday evening, with Christmas coming, when I suddenly got a call from our application team. They were getting API timeouts every 25 minutes, which was causing business impact. I had no option but to jump in and resolve the issue.
So, if you want to enjoy your Christmas without any trouble, follow this article and save yourself some time.

Troubleshooting:


The team looked at this issue from multiple angles and analyzed the SQL Server wait types during the timeout windows. We saw the ‘poison’ wait type THREADPOOL during this timeframe, along with high page latch waits. The team also found insert contention on one table. This table holds the TDS Auth notification requests and responses and has more than 2 billion rows as of now.
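One way to see which wait types dominate a timeout window is to snapshot the cumulative wait times (for example, `wait_type` and `wait_time_ms` from `sys.dm_os_wait_stats`) just before and just after the window and compare them. The sketch below is only an illustration of that comparison, not the exact method we used. The function name `top_wait_deltas` and the dictionary-based input format are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def top_wait_deltas(
    before: Dict[str, int],
    after: Dict[str, int],
    top_n: int = 3,
) -> List[Tuple[str, int]]:
    """Rank wait types by how much wait time they gained between two snapshots.

    `before` and `after` map a wait type name to its cumulative wait time in
    milliseconds (hypothetical input format, e.g. exported from
    sys.dm_os_wait_stats before and after a timeout window).
    """
    deltas = {}
    for wait_type, after_ms in after.items():
        # A wait type missing from the first snapshot is treated as starting at 0.
        delta = after_ms - before.get(wait_type, 0)
        if delta > 0:
            deltas[wait_type] = delta
    # Largest increase first.
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

If THREADPOOL or the PAGELATCH waits show the largest increase during the window, that matches what we observed here.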

Option 1


We archived the data from this table and kept only the last 2 days of data. This did not help, and we still faced the same timeout issue.

Option 2


We collected Perfmon counters on the DB server and found spikes in disk queue length on the N and F drives, where the database log file resided. These spikes lined up with the timeout timings.
I/O froze or hung on those drives whenever the disk queue length went high, and that is when the application team saw timeouts (every 25 minutes).
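To check that the disk queue spikes and the API timeouts really line up, rather than eyeballing two graphs, the two sets of timestamps can be compared directly. This is a minimal sketch under assumptions: the Perfmon samples are already parsed into `(timestamp, queue_length)` pairs, and the threshold and tolerance are values you pick for your own system. The function names are hypothetical and are not part of any tool.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def find_spikes(
    samples: List[Tuple[datetime, float]],
    threshold: float,
) -> List[datetime]:
    """Return the timestamps at which disk queue length exceeded the threshold."""
    return [ts for ts, queue_length in samples if queue_length > threshold]


def timeouts_near_spikes(
    timeouts: List[datetime],
    spikes: List[datetime],
    tolerance: timedelta,
) -> List[datetime]:
    """Return the timeouts that occurred within `tolerance` of any spike."""
    matched = []
    for timeout_ts in timeouts:
        if any(abs(timeout_ts - spike_ts) <= tolerance for spike_ts in spikes):
            matched.append(timeout_ts)
    return matched
```

If most of the timeouts reported by the application team fall within a minute or so of a spike, the disk is a strong suspect. If they do not, the spikes may be unrelated.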

Fix 1:


We worked with the Windows admin team and found that one physical drive (PD), Disk 23 in Enclosure 0 on Connector 0 of the RAID controller in Slot 2, was not functioning correctly.

Fix 2:


We excluded these drives from the antivirus and Defender scans. After that, the issue was fixed, and we no longer saw any spikes or API timeouts.