Tuesday, June 19, 2018

Azure @ Enterprise - HDInsight name availability check in portal

Whether it is good or not, Azure HDInsight requires a name that is unique across Azure before a new instance can be created. This leads to the requirement of a name availability check. There are multiple mechanisms, such as calling an API or checking whether the URL https://<proposed hdi name>.azurehdinsight.net already exists.
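
As a quick illustration of the second mechanism, here is a minimal PowerShell sketch that probes whether the cluster URL already resolves. The cluster name is a placeholder, and a DNS failure is simply treated as "name probably available".

$proposedName = "<proposed hdi name>"
try {
    # If the host name resolves, a cluster with this name already exists somewhere in Azure
    [System.Net.Dns]::GetHostEntry("$proposedName.azurehdinsight.net") | Out-Null
    "Name '$proposedName' appears to be taken."
}
catch {
    "Name '$proposedName' appears to be available."
}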

Let's see how the Azure portal handles this check.

Azure Portal using CheckNameAvailability API for HDInsight cluster

The best friend here is the F12 browser developer tools, which show how a web app like the Azure Portal works.
From the network traffic it is clear that the portal checks the name using an API that is only partially documented below.

https://docs.microsoft.com/en-us/rest/api/cdn/checknameavailabilitywithsubscription/checknameavailabilitywithsubscription

But the above article only covers checking names against the Microsoft.Cdn provider. Name availability checking for HDInsight seems to be undocumented. Below is the URL format the portal uses for it.

POST https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.HDInsight/checkNameAvailability
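
For reference, a rough sketch of calling that endpoint from PowerShell. Since the API is undocumented, the request body below (the name plus a "clusters" type) and the api-version are assumptions modeled on similar checkNameAvailability APIs, and $token is assumed to be an ARM bearer token obtained separately.

$subscriptionId = "<subscription id>"
$token = "<ARM bearer token>"   # assumed to be acquired already from an existing login
$uri = "https://management.azure.com/subscriptions/$subscriptionId/providers/Microsoft.HDInsight/checkNameAvailability?api-version=2015-03-01-preview"
$body = @{ name = "<proposed hdi name>"; type = "clusters" } | ConvertTo-Json   # body shape is a guess
Invoke-RestMethod -Method Post -Uri $uri `
    -Headers @{ Authorization = "Bearer $token" } `
    -ContentType "application/json" `
    -Body $body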

Why does the API endpoint point to West US while the resource is to be created in East US?

This is the magic of the portal. It seems the portal is hosted in West US and therefore uses the API endpoints there. When the screenshot was taken, the portal was opened from New Jersey, USA without any VPN, and the ISP appeared to be in the East US itself. So there was no reason for Azure to redirect the client to West US based on its location.

Why is there an HTTP 504 in the screenshot?

This seems related to an outage that happened last Wednesday (13 Jun 2018) in the South Central US Azure region. During that time, trying to create an HDInsight cluster showed the name as not available.

Under the hood the portal tries to reach the West US endpoint to check name availability, and that call errors out with a gateway timeout, presumably because the endpoint is internally unable to contact the South Central US region. When the timeout happens, the portal assumes the name is not available and displays that message. What a conclusion!

Ideally the portal should have understood the HTTP 504 and acted accordingly. As per Azure, or any cloud design philosophy, failures are expected. Why not admit in the portal itself that a failure occurred?

As mentioned, the issue only seems related to the South Central US outage; there is no proof of causation, only that the time ranges match. A screenshot of the outage details is below.

It was fun to debug this sitting next to a Microsoft ADM. The issue has been communicated to Microsoft via him. Hopefully they will fix it soon.

Happy debugging...

Tuesday, June 12, 2018

Azure @ Enterprise - AppInsight -> Servers -> Exception rate does not correlate with exceptions telemetry

There can be surprising situations where the AppInsight -> Servers page shows a high rate of exceptions, but when we query the exceptions telemetry it does not match the exception rate. It is a tough situation for the teams supporting the application, especially when stakeholders are interested in AppInsight reports.

If the production support teams are not able to find out why the difference occurs, it will come to development. So what is going on here?

Why is EXCEPTION RATE not the rate of exceptions telemetry?

The AppInsight -> Servers page draws the "EXCEPTION RATE" graph from a performance counter, whereas the 'exceptions' telemetry contains only the exceptions explicitly reported to AppInsight by the application, for example by calling TrackException() in the SDK. If the application or framework swallows exceptions, those exceptions still affect the performance counter and the graph, but they never show up in the exceptions telemetry. We can verify this by clicking the edit button of "EXCEPTION RATE" to view the data element: it will be a performance counter. See the image below.
We can cross-check by logging in to the server remotely and watching the counter in the perfmon.exe utility. This behavior is documented by Microsoft here and here.
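
Instead of opening perfmon.exe interactively, the same counter can be sampled from a PowerShell session on the server. A minimal sketch, assuming the worker process instance is w3wp and that the graph is driven by the ".NET CLR Exceptions" counter (adjust the instance name for your process):

# Sample the CLR exception counter a few times; this is the counter behind the EXCEPTION RATE graph
Get-Counter -Counter "\.NET CLR Exceptions(w3wp)\# of Exceps Thrown / sec" -SampleInterval 5 -MaxSamples 6 |
    Select-Object -ExpandProperty CounterSamples |
    Select-Object InstanceName, CookedValue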

What are the sources of the eaten exceptions?

The million dollar question is whether we should worry about this high exception rate. The answer is yes, if we care about the software we made. If the app is just working, is not mission critical and has enough budget to run, we may leave it alone. But that is not the case for most applications, especially those hosted in Azure, where every fault translates to money. If those exceptions are slowing down requests, the overall throughput gets reduced, which leads to more machines serving fewer users.

OK, we are going to fix it. How do we find the sources of the exceptions? As we saw in the section above, perfmon.exe is a good start for finding which processes are throwing exceptions. Once we know the source, the fix depends on the individual finding. Below are some examples.
  • If we are using the netNamedPipe binding to communicate between internal services, there is a high chance that many exceptions are eaten by that binding even though the functionality keeps working.
  • If the system consumes other Azure services and retry is implemented, there could be exceptions where the retry solved the business scenario, yet the performance counter is still affected by each exception.
  • If there are background workers running in an 'On Error Resume Next' style, the exceptions encountered inside them may be eaten while still affecting the exceptions performance counter.
It is better to report the exception using TrackException even though we are handling it. Overall it is a tedious process, but there is a chance of finding treasures that help us remove one server while supporting the same number of users.

Happy debugging.

Tuesday, May 29, 2018

ASP.Net bundling & minification via CDN Caching

It is normal for ASP.Net applications to use bundles to reduce HTTP calls. It all works until a CDN comes into the landscape. A CDN (Content Delivery Network) delivers content from servers distributed worldwide, which we can envision as serving from the edge of the network. That way many requests get served from CDN servers without ever reaching our web server. That is the nice theory about CDNs. So what is the issue between ASP.Net bundling and a CDN?

A CDN normally caches static assets. Out of the box, it caches well-known static assets such as html, css and js files, and image file formats too if we configure them. If we bundle static assets using the ASP.Net bundling mechanism, the bundle gets a different URL than plain js files, so the CDN does not recognize it as a static asset. Hence no caching.

This applies when the CDN is placed in between as a pass-through. Akamai provides such a service where we don't need to change our code when introducing the CDN; its servers cache content as they fetch it from the backend web server. Normally, when we introduce a CDN, the domain has to change so that content is served from the CDN. That has an added benefit too, since it increases the browser's parallel request limit compared to a single origin. But if we want to integrate a CDN via a separate domain, we can pretty much forget about ASP.Net bundling.

Here let's see how to make pass-through CDNs work with ASP.Net bundling, since they don't introduce a new CDN URL.

How can we make the CDN a friend of ASP.Net bundles?

Approach 1 - Don't use bundling

Sounds easy, but if the ASP.Net application is designed as a product that is hosted by individual customers, not every customer may have CDN capability, so bundling is still needed. For hosted applications, i.e. where there is only one instance of the application in the world, yes, we can try this. But without HTTP/2 it would still cause performance degradation, as earlier HTTP versions open separate connections for each file / resource request. If the application uses Angular or any other SPA framework, the number of files may be large.

Approach 2 - Have CDN cache the ASP.Net bundle URL

When we bundle, ASP.Net provides a new URL instead of the URLs of the individual files inside the bundle. From the browser we will see only one bundle URL request instead of the individual resources. That URL has a predefined format. If the CDN supports specifying the URL formats to cache via a custom RegEx or similar, this works without much effort.

The RegEx for recognizing the ASP.Net bundle URL format goes as follows.

/\?v=[a-zA-Z0-9_-]{44}$/

This works only if there is no other API or dynamic resource URL ending with ?v={44 character value}.
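
A quick way to sanity check the pattern against candidate URLs is a couple of PowerShell -match tests. The bundle URL below is a made-up example; the 44 character value is just a repeated placeholder.

$bundleRegex = '\?v=[a-zA-Z0-9_-]{44}$'
$sampleBundleUrl = "/bundles/scripts?v=$('a' * 44)"   # hypothetical ASP.Net bundle URL
$sampleApiUrl = "/api/items?v=2"                      # a dynamic URL that should NOT match
$sampleBundleUrl -match $bundleRegex   # True
$sampleApiUrl -match $bundleRegex      # False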

Approach 3 - Embrace SPA with JS modules & forget server side rendering 

Instead of server-side rendering done by ASP.Net, embrace the industry trend of SPA (Single Page Application) with a JavaScript module system such as Webpack. Webpack can produce static bundles that can be referenced in the application as normal files.

Recommended

If there is enough budget and time, embrace SPA; else use the RegEx and stay with ASP.Net bundling and the CDN.

I don't have any association with Akamai except using it in projects. This post is not intended to promote any of their products.

Tuesday, May 15, 2018

Azure @ Enterprise - Finding how many nodes are really created for one HDInsight cluster

When we create an Azure HDInsight cluster (HDICluster), it internally creates virtual machines. The Azure portal's cluster creation blade asks for details about the Head and Worker nodes. We cannot set the number of head nodes, only the worker nodes. All good till now.

But @ enterprise, if the HDInsight cluster needs to be in a vNet, there could be issues with the lack of IP addresses available in the subnet. It gets worse if the creation needs to happen dynamically in a multi-tenant application. It is very difficult to calculate the IP address requirements of an HDICluster if we don't know how many VMs get created as part of one HDInsight cluster, regardless of the number of worker nodes.

Is that information not available publicly? Yes it is, and below are links to it.
https://blogs.msdn.microsoft.com/azuredatalake/2017/03/10/nodes-in-hdinsight/
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-port-settings-for-services

The above links tell us that for Spark it creates Head nodes, ZooKeeper nodes and Gateway nodes. But how do we validate how many machines are created, or verify the facts ourselves? The portal never tells how many machines were created when we navigate to an already created HDICluster resource blade. The PowerShell object of the HDICluster instance doesn't have direct information about the internal machines either. So what is the alternative?

PowerShell to retrieve nodes

Again PowerShell and some string comparisons come to the rescue. Below goes the script.

$hdiClusterName = "<name of cluster without domain>"

"Assumption 1 - The vNet and subnet of all nodes are same."
"Assumption 2 - The vNet, Public IPAddresses & NIC are in same resource group"
"Assumption 3 - There will be a gateway nodes fof HDICluster and public ip address for gateway is in format publicIpgateway-<internal id>"
"Assumption 4 - A unique internal id is used to name the nodes,NICs, public addresses etc...This script heavily depend on that internal id based naming convention"

"--------------------------------------------------------"

$resource =(Get-AzureRmResource -ResourceId (Get-AzureRmHDInsightCluster -clustername $hdiClusterName).Id)

$hdiClustersVNetResourceGroupName = (Get-AzureRmResource -ResourceId $resource.Properties.computeProfile.roles[0].virtualNetworkProfile.id).ResourceGroupName

"ResourceGroup of vNet assiciated with HDI cluster- $hdiClustersVNetResourceGroupName"

$publicAddress = (Get-AzureRmPublicIpAddress -ResourceGroupName $hdiClustersVNetResourceGroupName) | Where-Object {$_.DnsSettings.DomainNameLabel -eq $hdiClusterName}

$publicIpgatewayName = $publicAddress.Name

$hdiClusterInternalId = $publicIpgatewayName.Split('-')[1]

"Internal Id of HDI used to create nodes - $hdiClusterInternalId"

"Below are the NICs used by $hdiClusterName HDI Cluster. Each NIC corresponds to one node."

$nics = Get-AzureRmNetworkInterface -ResourceGroupName $hdiClustersVNetResourceGroupName
$nics = $nics | Where-Object {$_.Name -like "*$hdiClusterInternalId"}
$nics | Select-Object -Property Name
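
Since each NIC corresponds to one node, a one-line follow-up to the script above gives the total node count, which is also the number of private IP addresses the cluster consumes inside the subnet.

"Total nodes (NICs) for $hdiClusterName : $($nics.Count)"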

As we can see, the script relies on the naming convention of the NICs. If Microsoft changes it, the script will fail.

From the list we can see it creates 2 Head nodes, 3 ZooKeeper nodes and 2 Gateway nodes along with a minimum of 1 Worker node. So a minimum of 8 IP addresses will be consumed by one HDInsight cluster. At the time of writing this post, the ZooKeeper and Gateway nodes seem to be free; the charge is only for the Head and Worker node(s).

Ambari Portal

Another way is via the Ambari portal. If we navigate to the URL below, we can see the head nodes and ZooKeeper nodes, but not the gateway nodes.

https://<cluster name>.azurehdinsight.net/#/main/hosts
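
The same host list can also be pulled from Ambari's REST API instead of the UI. A minimal sketch, assuming a non domain joined cluster, that the Ambari cluster name matches the HDInsight cluster name, and the standard Ambari API path:

$clusterName = "<cluster name>"
$cred = Get-Credential -UserName "admin" -Message "HDInsight cluster login"
# Standard Ambari REST endpoint for hosts; gateway nodes will still not appear here
Invoke-RestMethod -Credential $cred -Uri "https://$clusterName.azurehdinsight.net/api/v1/clusters/$clusterName/hosts"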

Happy scripting...

Tuesday, May 8, 2018

Azure @ Enterprise - Checking connectivity from AppServiceEnvironment to HDInsight

The background here is an enterprise Azure environment where most of the things are in a vNet with their own subnets. When the Spark HDInsight clusters are in a separate subnet from the application, there will be no connectivity by default if we need to submit jobs via Livy or anything like that (this again depends on the enterprise policy). We have to open the routes from the application subnet to the HDInsight subnet. How the routes are opened depends on how the infrastructure is laid out. If there are no firewalls or proxies between the application and the HDInsight clusters, simple NSG rules would be sufficient.

Suppose there are 2 teams involved, one infrastructure and the other development or QA. How can development or QA verify that there is connectivity?

If the application is hosted on virtual machines, we can just log in and open the Ambari UI, or even run network troubleshooting commands. But what to do if the applications are hosted as AppService WebApps? If the applications are not client facing and need to be secured from neighbors, they may be inside their own AppServiceEnvironments. Basically, no user interface is available.

The solution is simple. Go back to the command line and somehow check HTTP connectivity to the HDICluster. Below is one PowerShell command which we can execute from the command line interface exposed by Kudu.

curl -Credential "user name" -Method "GET" -Uri "https://<cluster name>.azurehdinsight.net/livy/batches"

How to reach the Kudu console of an AppService instance is detailed in the links below.
https://blogs.msdn.microsoft.com/benjaminperkins/2017/11/08/how-to-access-kudu-scm-for-an-azure-app-service-environment-ase/
https://blogs.msdn.microsoft.com/benjaminperkins/2014/03/24/using-kudu-with-windows-azure-web-sites/

The command was tested against non domain joined HDInsight clusters. When we enter the above command it will ask for the password interactively.
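
If an interactive prompt is not convenient in the Kudu console, the credential can be built inline. A minimal sketch with placeholder values (avoid leaving real passwords in the command history):

$securePassword = ConvertTo-SecureString "<password>" -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential("<user name>", $securePassword)
# Same Livy endpoint check as above, but without the interactive password prompt
Invoke-WebRequest -Credential $cred -Method GET -Uri "https://<cluster name>.azurehdinsight.net/livy/batches" -UseBasicParsing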

This is just a manual command to test the connectivity. If the scenario is multi-tenant and we want to ensure connectivity from the application itself, use WebClient or similar methods.

Tuesday, May 1, 2018

Azure @ Enterprise - Finding Subnet of HDICluster via PowerShell

Enterprises love putting resources into a virtual network (vNet), thinking that it brings a certain high level of free security via isolation. HDInsight clusters can also be put into a vNet. What do we do if we have an HDICluster which was added to a subnet earlier and we don't know the subnet name now?

Silly question, isn't it? Just go to the portal and see the cluster properties. That is what the MSDN article below says.
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-administer-use-portal-linux#list-and-show-clusters

Unfortunately our corporate subscription only shows the vNet name in the properties, not the subnet. So how do we get the subnet name?

Finding Subnet of HDICluster

PowerShell helps us here. Below goes the script.

$hdiClusterName = "POC001Joy"
$hdiCluster = (Get-AzureRmHDInsightCluster -clustername $hdiClusterName)
$resourceProperties =(Get-AzureRmResource -ResourceId (Get-AzureRmHDInsightCluster -clustername $hdiClusterName).Id).properties
$resourceProperties.computeProfile.roles[0].virtualNetworkProfile.subnet

It simply gets the HDInsight resource object and navigates to the required property. The object model is a little confusing, but it is there.

The above PowerShell script can be entered directly into the Azure Portal or run after logging in to Azure from a PowerShell window / ISE.

An issue has been filed against the MSDN documentation article regarding this missing subnet name.

Enjoy Scripting...

Tuesday, April 24, 2018

Azure @ Enterprise - Automation for scaling down HDInsight after busy hours

Introduction

Every software system involves many manual jobs to maintain it: monitoring health, periodically maintaining indexes, updating operating systems, taking backups, etc. Azure tries to automate those tasks as much as possible using a service called Azure Automation. This post is about using Azure Automation to automatically scale down HDI clusters during non-peak hours.

Ideally the cluster could be deleted after use. But the problem is that it takes a lot of time to create, so a job submitted after hours would have to wait a long time. Scaling down is therefore the better option. As always, it is not a silver bullet; choose accordingly.

Azure Automation basics

There are many tutorials available to get started with Azure Automation. Below is the high level information about it.
  • Automation allows us to write Runbooks using PowerShell or Python workflows. The Runbook is the real logic of what to do.
  • Runbooks can even be graphical, i.e. drag and drop coding.
  • A Runbook can be exposed via Webhooks, or scheduled, to trigger execution. Each execution is called a Job and we get the JobId to track progress.
  • Webhooks cleanly separate the credentials and connections from the code, so those secrets can be managed by the deployment team.
  • It has an app store model to distribute Runbooks, called the Runbook gallery. We can upload our Runbooks too.
Some links on how to do things in Automation

https://docs.microsoft.com/en-us/azure/automation/automation-first-runbook-textual
https://vincentlauzon.com/2015/11/01/azure-runbook-a-complete-simple-example/
https://docs.microsoft.com/en-us/azure/automation/automation-runbook-gallery
https://www.c-sharpcorner.com/article/create-runbooks-to-automate-tasks-in-azure-automation/

Gallery

Coming back to the context of this post: we have to automatically scale down the HDICluster during off-peak hours. Doesn't a Runbook already exist in the gallery to do the same?

Yes there is one, but it doesn't seem to work if we have multiple subscriptions. Hence this post.

I contacted the author of the Runbook and reported the same. Hopefully there will be an updated version soon. Below is the link.

Below is one link about the Runbook gallery to get started.

Runbook for scaling down HDICluster

Directly writing the code, as the approach is straightforward.

    $SubscriptionName = "<Subscription Name>"
    $ResourceGroupName = "<Resource Group where the HDICluster resides not the automation account>" 
    $ClusterName = "<name of the cluster. No need of azurehdinsight.net>"
    [Int] $Nodes = 1
    
    $connectionName = "AzureRunAsConnection"
    Write-Verbose "Starting Scaling cluster $ClusterName to $Nodes nodes..."
    try
    {
        # Get the connection "AzureRunAsConnection "
        $servicePrincipalConnection=Get-AutomationConnection -Name $connectionName         

        Write-Output "Logging in to Azure..."
        Add-AzureRmAccount `
            -ServicePrincipal `
            -TenantId $servicePrincipalConnection.TenantId `
            -ApplicationId $servicePrincipalConnection.ApplicationId `
            -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint 
    }
    catch {
        if (!$servicePrincipalConnection)
        {
            $ErrorMessage = "Connection $connectionName not found."
            throw $ErrorMessage
        } else{
            Write-Error -Message $_.Exception
            throw $_.Exception
        }
    }
    Select-AzureRMSubscription -SubscriptionName $SubscriptionName
    Write-Output "Scaling cluster $ClusterName to $Nodes nodes..."
    Set-AzureRmHDInsightClusterSize `
        -ResourceGroupName $ResourceGroupName `
        -ClusterName $ClusterName `
        -TargetInstanceCount $Nodes

Please note that the parameter values are hard-coded here to keep the code short. Ideally the values should come in as Runbook parameters, as sketched below.
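
A minimal sketch of how the same Runbook could take these values as parameters instead of hard-coded variables. The names simply mirror the variables used above.

param(
    [Parameter(Mandatory = $true)]  [string] $SubscriptionName,
    [Parameter(Mandatory = $true)]  [string] $ResourceGroupName,
    [Parameter(Mandatory = $true)]  [string] $ClusterName,
    [Parameter(Mandatory = $false)] [int]    $Nodes = 1
)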

Prerequisites

Azure AD App

"AzureRunAsConnection" is the name of connection at the automation account level which is going to be the identity of the Runbook code. 
The above code is tested with an Azure AD Application with certificate based authentication. The certificate associated with Azure AD App has to be uploaded to Automation account. Credentials are not at the Runbook level. Meaning multiple Runbooks in the Automation account can share same credentials.

Importing modules

The Runbook needs 2 additional modules which are not present in Automation by default.
  • AzureRM.profile
  • AzureRM.HDInsight
Below goes one link on how to add modules.


The interesting thing here is that AzureRM.HDInsight depends on AzureRM.profile, so we need to add AzureRM.profile first. Though the message says it is added, the import is an async operation. If we try to add AzureRM.HDInsight before the first addition has fully completed, we get the same missing dependency error.

Scheduling the Runbook

Scheduling is as simple as linking the Runbook to a Schedule and associating the parameter values.
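
For reference, a rough sketch of wiring that up with the AzureRM.Automation cmdlets instead of the portal blades. The resource group, account, Runbook and schedule names below are placeholders.

# Create a daily schedule that fires after business hours
New-AzureRmAutomationSchedule -ResourceGroupName "<automation rg>" -AutomationAccountName "<automation account>" `
    -Name "ScaleDownHDINightly" -StartTime (Get-Date).Date.AddDays(1).AddHours(20) -DayInterval 1

# Link the schedule to the Runbook and pass the parameter values
Register-AzureRmAutomationScheduledRunbook -ResourceGroupName "<automation rg>" -AutomationAccountName "<automation account>" `
    -RunbookName "<runbook name>" -ScheduleName "ScaleDownHDINightly" `
    -Parameters @{ SubscriptionName = "<Subscription Name>"; ResourceGroupName = "<HDI rg>"; ClusterName = "<cluster name>"; Nodes = 1 }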

Azure Automation @ Enterprise?

Enterprises can drastically reduce the cost of maintaining systems if they start using Automation. Automation can be done even without Azure Automation; PowerShell was there earlier as well. But the advantage of Azure Automation is the blend of scalable infrastructure and an automation language. Automation authors don't need to worry about where the code is going to run; just write the code and give it to Azure to execute.
Automation can even be used as part of a multi-tenant application to isolate security boundaries. One such mechanism is via Webhooks. High privilege tasks such as creating an HDICluster can be limited to an Azure AD app, and Automation can run using that identity. Only the applications that know the Webhook's secret URL can invoke it and get the job done. The application doesn't need to know infrastructure details such as the virtual network name, subnets etc.; all of those can stay local to the Automation account.
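
To make the Webhook idea concrete, here is a rough sketch of how a tenant application (or a test script) could trigger the Runbook. The webhook URL is the secret produced when the webhook is created, the body is a hypothetical payload the Runbook would read from its WebhookData parameter, and the response is expected to carry the JobIds of the started job.

$webhookUri = "<secret webhook URL from the Automation account>"
$body = @{ ClusterName = "<cluster name>"; Nodes = 1 } | ConvertTo-Json   # payload parsed inside the Runbook from WebhookData
$response = Invoke-RestMethod -Method Post -Uri $webhookUri -Body $body -ContentType "application/json"
"Started Automation job(s): $($response.JobIds)"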

One of my favorite slogans in software engineering is 'Solve our problems before we solve others'. Azure Automation really helps in that context.