Tuesday, January 17, 2023

Azure @ Enterprise - Connection pooling behavior when JWT is used to authenticate into Azure SQL Server

Connection pooling in .Net is a hidden feature. We as consumer developers, cannot directly feel how the pooling happens behind the scenes. .Net is allowing us to clear the connection pool. We create a new SqlConnection and Close it after use. When we create a connection it uses an internal pooling mechanism and a connection object is associated with the object that we create and the closed connections go back to the pool.

This post is not to educate about simple connection pooling instead it talks about an issue encountered in production.

Issue

There is a .Net 6 Web API hosted in Azure App Service. It mainly accesses an Azure SQL database and storage blobs. When the load slightly increased, it started showing the below error.

System.Data.SqlClient.SqlException (0x80131904): Resource ID : 2. The session limit for the database is 600 and has been reached. See 'https://docs.microsoft.com/azure/azure-sql/database/resource-limits-logical-server' for assistance.

at System.Data.ProviderBase.DbConnectionPool.CheckPoolBlockingPeriod(Exception e)
 at System.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
  at System.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
  at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()

Troubleshooting

Let us start the production war to bring it back.

Q&A about connection pooling

Before starting specific troubleshooting let's refresh the main points about connection pooling. Experts may skip this section

Are there many connection pools?

Yes, there are chances of having many connection pools. For each connection string, it creates and maintains a separate connection pool.

How many connections are in a pool?

By default 100 per pool. It is configurable through the connection string or via code

When all the connections are in use and a connection is requested from the pool what will happen?

It will throw an exception after the connection timeout. Looks like 

"Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. 
This may have occurred because all pooled connections were in use and max pool size was reached."

Does the SQL Server know the consumers are keeping connections in the pool?

Yes. To simply verify just run exec sp_who2 in the SQL server to check the connections from the client machine. Most of the connections will be in the Sleeping state.

How to check the pooling status of our application?

Take a memory dump of the application and analyze the crude way using WinDbg or use the DebugDiag tool.
Otherwise, monitor the Performance counters related to ADO.Net.

Specific steps to troubleshoot the issue

Let us look at some specifics about this issue

What are the Azure SQL specs?

Standalone database, tier - S0

What is exec sp_who returns from the database?

A lot of connections in the sleeping state.

Are there 600+ requests coming to the App Service per second?

No. There are only 10 requests per second.

Is there auto-scaling of app service instances?

No. There is only 1 app service instance. Questions such as "who in the world runs App Service with only one instance in production" are banned.

Any other application is making connections to the same database?

Yes. But that Azure Function is now disabled. 

Is the connection string or code have any customization of connection pool size?

No customization. Just working with the default 100

What is the authentication mechanism for Azure SQL?

It is the service principal / Azure app registration. Uses certificate to obtain JWT.

How many connection pools are there?

Only god knows until we take a memory dump and analyze it.

Will there be one connection pool per JWT considering JWT is part of the connection string?

Oh yeah...There are chances if the JWT is not cached. If each SqlConnection is created with a separate JWT, it will create many pools with one connection. If that happens, we will not get the timeout error from the connection pool, but the Azure SQL will run out of connections. Resulting in the same exception as mentioned above.

Is there any Microsoft official documentation for the above answer?

Not really. But a GitHub Issue answering the question.

For further clarification look at the source code of SqlConnectionPoolKey.CalculateHashCode()

Disclaimer

The actual root cause cannot be disclosed on social media such as this blog due to security, confidentiality, etc...

No comments: