Tuesday, June 26, 2018

Azure @ Enterprise - Finding the usage of vNet and Subnet IPs

There is a common myth that networking is easy in a cloud environment. At least for Azure, it is not true. If we are in an enterprise and want to implement security at the networking level, we have to deal with vNets, subnets and their associated rule mechanisms such as NSGs and much more. If it is a small deployment, there is little confusion about the vNet, the subnets inside it, how many IPs are used and free, etc. Even re-balancing subnets is easy.

But that may not be the situation in an enterprise where many systems or departments share one subscription or the same networking infrastructure. Things often go out of control and end up in a situation where there are no more IPs or subnets for new applications.

The first challenge is to identify the usage of the current vNets and the subnets inside them. We can get the details from the Azure portal, but it is difficult to consolidate them into one view to take action.

Below is a simple script to list the subnets inside a particular vNet, how many IPs are possible in each and how many are used.

# List each subnet in the vNet along with its address space and how many IP configurations are in use.
Get-AzureRmVirtualNetwork -Name <Name of vNet> -ResourceGroupName <Name of vNet's RG> `
| Get-AzureRmVirtualNetworkSubnetConfig `
| Sort-Object -Property Name `
| Select-Object -Property Name, `
                          AddressPrefix, `
                          @{Name='Available IPs';Expression={[Math]::Pow(2,32-$_.AddressPrefix.Split('/')[1])}}, `
                          @{Name='Used IPs';Expression={$_.IpConfigurations.Count}}

Please note that the address range is shown in CIDR notation rather than as an explicit start and end IP, and 'Available IPs' is the raw size of that address space. If we are familiar with networking, we can easily work out how many IPs are really usable after excluding the IPs Azure reserves in each subnet, what the start and end IPs are, etc. There are a lot of hosted tools available to interpret CIDR notation.
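As a quick illustration, the usable count can be derived from the prefix length by subtracting the five addresses Azure reserves in every subnet. The helper below is just a sketch for this post, not part of any module.

# Rough helper: usable IPs in an Azure subnet for a given CIDR prefix.
# Azure reserves 5 addresses per subnet (network, broadcast and 3 platform addresses).
function Get-UsableAzureSubnetIPs {
    param([string]$AddressPrefix)   # e.g. '10.0.1.0/24'

    $prefixLength = [int]$AddressPrefix.Split('/')[1]
    return [Math]::Pow(2, 32 - $prefixLength) - 5
}

Get-UsableAzureSubnetIPs -AddressPrefix '10.0.1.0/24'   # 251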

Once we know there are issues such as fragmented vNets, we have to think about solutions. We can easily suggest changing the policy to allot a vNet per department, system or application. There are trade-offs in both approaches. If we allocate a big vNet to a department and it does not have enough applications, most of those IPs will sit unused. Also, every department then needs experts to manage its networks. The decision has to be made case by case.

Happy Networking...

Tuesday, June 19, 2018

Azure @ Enterprise - HDInsight name availability check in portal

Whether it is good or not, Azure HDInsight needs a name that is unique across Azure to create a new instance. This leads to the requirement of a name availability check. There are multiple mechanisms, such as calling an API or checking the URL https://<proposed hdi name>.azurehdinsight.net for existence.
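One quick way to do that URL check from PowerShell is a DNS lookup; a rough sketch is below (the cluster name is hypothetical, and a successful lookup only proves the name is already taken):

# If <proposed name>.azurehdinsight.net resolves, the name is already taken.
$proposedName = 'myhdicluster'
try {
    [System.Net.Dns]::GetHostEntry("$proposedName.azurehdinsight.net") | Out-Null
    Write-Host "Name '$proposedName' appears to be taken."
}
catch {
    Write-Host "Name '$proposedName' looks available (no DNS entry found)."
}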

Let's see how the Azure portal handles this check.

Azure Portal using CheckNameAvailability API for HDInsight cluster

The best friend here is the browser's F12 developer tools to check how a web app like the Azure portal works.
It is clear that the portal is trying to check the name using the API partially documented below.

https://docs.microsoft.com/en-us/rest/api/cdn/checknameavailabilitywithsubscription/checknameavailabilitywithsubscription

But the above article talks only about checking the name using the Microsoft.Cdn provider; check name availability for HDInsight seems undocumented. Below is the URL format the portal uses for it.

POST https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.HDInsight/checkNameAvailability
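A rough sketch of calling it with Invoke-RestMethod is below. The api-version and the request body are assumptions modelled on the documented checkNameAvailability contract of other providers, since the HDInsight variant is undocumented; the real values can be read from the portal's F12 network trace.

$subscriptionId = '<subscription id>'
$token          = '<ARM bearer token>'   # e.g. copied from the portal's F12 network trace

$uri  = "https://management.azure.com/subscriptions/$subscriptionId/providers/Microsoft.HDInsight/checkNameAvailability?api-version=<api version>"
$body = @{ name = 'myhdicluster'; type = 'clusters' } | ConvertTo-Json   # body format is a guess

Invoke-RestMethod -Method Post -Uri $uri -Body $body `
                  -Headers @{ Authorization = "Bearer $token" } `
                  -ContentType 'application/json'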

Why is the API endpoint in West US while the resource is to be created in East US?

This is the magic of the portal. It seems the portal is hosted in West US and thus uses the API endpoints there. When the screenshot was taken, the portal was opened from New Jersey, USA without any VPN, and the ISP appears to be in East US itself. So there is no chance that Azure redirected the client to West US by recognizing the location.

Why is there an HTTP 504 in the screenshot?

This story seems related to an outage that happened last Wednesday (13 Jun 2018) in the South Central US Azure region. During that time, if we tried to create an HDInsight cluster, the portal showed that the name was not available.

Under the hood, the portal tries to reach the West US endpoint to check the name availability, and it errors out with a gateway timeout, possibly because that endpoint is internally unable to contact the South Central US region. When the timeout happens, the portal assumes the name is not available and displays that message. What a way of thinking!

Ideally it should have understood the HTTP 504 and acted accordingly. As per Azure, or any cloud design philosophy, failures are expected. Why not accept that the failure occurred in the portal itself?

As mentioned, the issue seems related to the outage in South Central US, but there is no proof of causation; only the time ranges match. Screenshot of the outage details below.

It was fun debugging this while sitting next to a Microsoft ADM. The issue has been communicated to Microsoft via him. Hopefully they will fix it soon.

Happy debugging...

Tuesday, June 12, 2018

Azure @ Enterprise - AppInsight -> Server -> Exception rate is not correlating with exceptions telemetry

There can be surprising situations where the AppInsight->Servers page shows a high rate of exceptions, but when we query the exceptions telemetry it does not match the exception rate. It is a tough situation for whoever supports the application, especially when stakeholders are interested in the AppInsight reports.

If the production support team is not able to find out why the difference occurs, it will come to development. So what is going on here?

Why is EXCEPTION RATE not the rate of exceptions telemetry?

The AppInsight->Servers page displays the "EXCEPTION RATE" graph from a performance counter, whereas the 'exceptions' telemetry contains only the exceptions that are reported to AppInsight from the application, for example via TrackException() in the SDK. If the application or framework is eating exceptions, those exceptions still affect the performance counter and the graph but never show up in the exceptions telemetry. We can verify this by clicking the edit button of "EXCEPTION RATE" to view the data element; it is a performance counter. See the image below.
We can cross-check by logging in to the server remotely and watching the counter in the perfmon.exe utility. This behavior is documented by Microsoft here and here.
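The same counter can also be sampled from a PowerShell session on the server; a small sketch is below, using the standard .NET CLR exceptions counter (the _Global_ instance aggregates all processes):

# Sample the .NET CLR exception rate every 5 seconds for one minute.
Get-Counter -Counter '\.NET CLR Exceptions(_Global_)\# of Exceps Thrown / sec' `
            -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples | Select-Object InstanceName, CookedValue }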

What are the sources of eaten exceptions?

The million dollar question is whether we should worry about this high exception rate. The answer is yes if we care about the software we made. If the app is just working, is not mission critical and has enough budget to run, we may leave it alone. But that is not the case with most applications, especially those hosted in Azure, where every fault translates to money. If those exceptions are slowing down requests, the overall throughput gets reduced, which leads to more machines serving fewer users.

OK, we are going to fix it. How do we find the sources of the exceptions? As we saw in the above section, perfmon.exe is a good start to find which processes are throwing exceptions. Once we know the source, the fix depends on the individual finding. Below are some examples:
  • If we are using the netNamedPipe binding to communicate between internal services, there is a high chance that many exceptions are eaten by that binding even though the functionality keeps working.
  • If the system is consuming other Azure services and retry is implemented, there can be exceptions; the retry may have saved the business scenario, but the performance counter is still affected by each exception.
  • If there are any background workers running in 'On Error Resume Next' style, the exceptions encountered inside them may be eaten while still affecting the exceptions performance counter.
It is better to report the exception using TrackException even though we are handling it, as in the sketch below. Overall it is a tedious process, but there are chances of finding treasures that help us remove one server while supporting the same number of users.
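A minimal sketch of that pattern from PowerShell, assuming the Microsoft.ApplicationInsights assembly is available locally; the DLL path, instrumentation key and service URL are placeholders:

# Load the Application Insights SDK (path is a placeholder for wherever the package is restored).
Add-Type -Path 'C:\packages\Microsoft.ApplicationInsights\lib\net45\Microsoft.ApplicationInsights.dll'

$client = New-Object Microsoft.ApplicationInsights.TelemetryClient
$client.InstrumentationKey = '<instrumentation key>'

try {
    # Some operation that may fail but is handled here.
    Invoke-RestMethod -Uri 'https://internal-service.example/api/health' | Out-Null
}
catch {
    # We handle the failure, but still report it so the exceptions telemetry matches reality.
    $client.TrackException($_.Exception)
}
$client.Flush()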

Happy debugging.