My team spent the past 48 hours in our own little branch of hell: the one you get sent to when your application crashes for no apparent reason, without any warning, and every new piece of information you uncover while hunting for the root cause conflicts with the previous symptoms. If you can get past the feeling that your application has been personally cursed by a vicious voodoo shaman and stay focused on the solution, I promise you will find a non-magical explanation for your problem. In our case, it was an environmental change.
This Monday, a little after midnight, we lost connectivity to one of our production Azure SQL servers in the North Central region.
The exceptions we were getting:
EntityException: An exception has been raised that is likely due to a transient failure. If you are connecting to a SQL Azure database consider using SqlAzureExecutionStrategy.
Inner Exception: The underlying provider failed on Open.
EntityException: The underlying provider failed on Open.
Inner Exception: A connection was successfully established with the server, but then an error occurred during the login process. (provider: SSL Provider, error: 0 – An existing connection was forcibly closed by the remote host.)
- Our SQL connection was going up and down intermittently.
- Our code base was not changed.
- There were no announced changes to the Azure SQL environment.
- There were no reports of connectivity issues to Azure SQL servers in other applications in the same region.
- Deploying the same code base with a database on a server in the South Central region helped for a few hours, after which we started getting the same exceptions there as well.
- After another few hours we lost connectivity to our SQL Azure server from our local machines too. Attempting to connect using SQL Server Management Studio produced the following error:
- “A connection was successfully established with the server, but then an error occurred during the login process. (provider: SSL Provider, error: 0 – An existing connection was forcibly closed by the remote host.)”
After a security hole was identified in the SSL 3.0 protocol (the POODLE vulnerability), Microsoft started disabling SSL 3.0 in SQL Azure across the country. The change was not officially announced and rolled out gradually.
Since the application in question was PCI compliant, we had intended SSL 3.0 to be the only cryptographic protocol enabled on our machines, and so we were disabling all other cryptographic protocols on our VMs with a startup task in our Cloud Service.
As a result, once SSL 3.0 was disabled on the server side, no cryptographic protocol remained enabled on our Cloud Service VMs, and communication with our Azure SQL server from these VMs was lost completely, while the server remained reachable from other locations.
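For context, the startup task was doing something along these lines. This is a hypothetical sketch rather than our actual script, but it follows the standard Windows SCHANNEL registry layout that such tasks rely on (the exact protocol list is illustrative):

```bat
:: startup.cmd - hypothetical sketch of a protocol-disabling startup task.
:: Key paths follow the standard SCHANNEL registry layout.
set BASE=HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols

:: Disable every protocol except SSL 3.0, for both outbound (Client)
:: and inbound (Server) connections on this VM.
for %%P in ("SSL 2.0" "TLS 1.0" "TLS 1.1" "TLS 1.2") do (
    reg add "%BASE%\%%~P\Client" /v Enabled /t REG_DWORD /d 0 /f
    reg add "%BASE%\%%~P\Server" /v Enabled /t REG_DWORD /d 0 /f
)
exit /b 0
```

With SSL 3.0 as the only protocol left enabled on the client side, the server-side removal of SSL 3.0 left the two ends with no protocol in common.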
What about the connectivity to the server from our local machines? It was lost because the same startup task runs when the application is started in the Azure Emulator in our local debug environment!
After reviewing the PCI compliance guidelines, we concluded that we could safely enable TLS 1.0 Client on our Cloud Service VMs.
That way, our Cloud Service VMs could communicate with our Azure SQL server over TLS 1.0, while the application remained PCI compliant and secure with respect to external internet connections to the server.
Applying the solution
- Modify the startup task so that it no longer disables TLS 1.0 Client
- Re-deploy the application to the Cloud Service
- Reboot the Web/Worker role VMs
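Concretely, the first step amounts to having the startup task set the TLS 1.0 Client registry values back to their enabled state. Again, this is a hedged sketch using the standard SCHANNEL key layout rather than our exact script:

```bat
:: Re-enable TLS 1.0 for client (outbound) connections only, so the VM
:: can reach Azure SQL over TLS 1.0 while other protocols stay disabled.
set BASE=HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols
reg add "%BASE%\TLS 1.0\Client" /v Enabled /t REG_DWORD /d 1 /f
reg add "%BASE%\TLS 1.0\Client" /v DisabledByDefault /t REG_DWORD /d 0 /f
```

Note that only the `Client` subkey is touched: the VMs need TLS 1.0 for outbound connections to Azure SQL, while inbound connections are unaffected.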
Voila, the application is up and running again! Hopefully we will have better luck with unannounced environmental changes next time!