Tuesday, June 12, 2018

Azure @ Enterprise - AppInsight -> Servers -> Exception rate not correlating with exceptions telemetry

There can be surprising situations where the AppInsight -> Servers page shows a high rate of exceptions, but when we query the exceptions telemetry the numbers do not match the exception rate. It is a tough situation for the team supporting the application, especially when stakeholders are interested in AppInsight reports.

If the production support team cannot find out why the difference occurs, the question comes to development. So what is going on here?

Why is EXCEPTION RATE not the rate of exceptions telemetry?

The AppInsight -> Servers page displays the "EXCEPTION RATE" graph from a performance counter, whereas the 'exceptions' telemetry contains only the exceptions that are explicitly reported to AppInsight by the application, for example via the SDK's TrackException(). If the application or framework is eating exceptions, those still affect the performance counter and the graph, but they never appear in the exceptions telemetry. We can verify this by clicking the edit button of the "EXCEPTION RATE" chart to view its data element; it is a performance counter. See the image below.
We can cross-check by logging in to the server remotely and watching the counter in the perfmon.exe utility. This behavior is documented by Microsoft here and here.
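If we want to watch the same counter programmatically rather than through perfmon.exe, a minimal sketch like the one below can be used. It assumes the standard ".NET CLR Exceptions" category with the "# of Exceps Thrown / sec" counter, and uses "w3wp" as the process instance name; adjust the instance name to the process being monitored.

using System;
using System.Diagnostics;
using System.Threading;

class ExceptionRateCheck
{
    static void Main()
    {
        // Same counter that feeds the EXCEPTION RATE chart; the instance name is the process to watch.
        using (var counter = new PerformanceCounter(".NET CLR Exceptions", "# of Exceps Thrown / sec", "w3wp"))
        {
            counter.NextValue(); // the first sample is always 0; prime the counter
            for (int i = 0; i < 10; i++)
            {
                Thread.Sleep(1000);
                Console.WriteLine($"Exceptions/sec: {counter.NextValue()}");
            }
        }
    }
}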

What are the sources of eaten exceptions?

The million-dollar question is whether we should worry about this high exception rate. The answer is yes, if we care about the software we build. If the app is just working, is not mission critical, and has enough budget to keep running, we may leave it alone. But that is not the case for most applications, especially those hosted in Azure, where every fault translates to money. If those exceptions are slowing down requests, overall throughput drops, which means more machines to serve fewer users.

OK, we are going to fix it. How do we find the sources of the exceptions? As we saw in the section above, perfmon.exe is a good start for finding which processes are throwing exceptions. Once we know the source, the fix depends on the individual finding. Below are some examples:
  • If we are using the netNamedPipe binding to communicate between internal services, there is a high chance that many exceptions are eaten by that binding even though the functionality keeps working.
  • If the system consumes other Azure services and retry is implemented, exceptions can occur that the retry resolves from a business point of view, but the performance counter is still affected by each exception.
  • If there are background workers written in an 'On Error Resume Next' style, the exceptions encountered inside them may be eaten while still affecting the exceptions performance counter.
It is better to report the exception using TrackException even when we handle it, as sketched below. Overall it is a tedious process, but there is a good chance of finding treasures that help us remove a server while supporting the same number of users.
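Here is a minimal sketch of reporting a handled exception with the Application Insights SDK; the class, method and caught exception type are purely illustrative, and the SDK is assumed to be already configured with an instrumentation key.

using System;
using Microsoft.ApplicationInsights;

public class OrderProcessor
{
    private static readonly TelemetryClient telemetry = new TelemetryClient();

    public void Process()
    {
        try
        {
            // ... business logic that may throw ...
        }
        catch (TimeoutException ex)
        {
            // The exception is handled (retried, ignored, etc.), but it is still reported
            // so that the exceptions telemetry lines up with the EXCEPTION RATE counter.
            telemetry.TrackException(ex);
        }
    }
}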

Happy debugging.

Tuesday, May 29, 2018

ASP.Net bundling & minification via CDN Caching

It is normal for ASP.Net applications to use bundles to reduce HTTP calls. It all works until a CDN comes into the landscape. A CDN (Content Delivery Network) delivers content from servers distributed worldwide, which we can envision as serving from the edge of the network. That way many requests are served by the CDN without ever reaching our web server. So much for the theory of CDNs. What is the issue between ASP.Net bundling and a CDN?

A CDN normally caches static assets. Out of the box, it caches well-known static asset types such as HTML, CSS and JS files, and image formats too if we configure them. If we bundle static assets using the ASP.Net bundling mechanism, the bundle gets a URL that does not look like a plain JS file, so the CDN does not recognize it as a static asset. Hence no caching.

This applies when the CDN is placed in between as a pass-through. Akamai provides such a service where we do not need to change our code when introducing the CDN; its servers cache content as they fetch it from the backend web server. Normally, when we introduce a CDN, the domain has to change so that content is served from the CDN. That has an added benefit, since it increases the browser's parallel request limit per origin. If we integrate a CDN via a separate domain, we can pretty much forget about ASP.Net bundling.

Here, let us see how to make pass-through CDNs work with ASP.Net bundling, since they do not introduce a new CDN URL.

How can we make the CDN a friend of ASP.Net bundles?

Approach 1 - Don't use bundling

Sounds easy, but if the ASP.Net application is designed as a product hosted by individual customers, not every customer may have CDN capability, so bundling is still needed. For hosted applications, i.e. where there is only one instance of the application in the world, yes, we can try this. But without HTTP/2 it would still cause performance degradation, since earlier HTTP versions open separate connections for each file / resource request. If the application uses Angular or any other SPA framework, the number of files can be large.

Approach 2 - Have CDN cache the ASP.Net bundle URL

When we bundle, ASP.Net provides a single new URL instead of all the files inside the bundle. From the browser we see only one bundle URL request instead of the individual resources. That URL has a predefined format. If the CDN supports a custom RegEx-style way to specify which URL formats to cache, this works without much effort.
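For reference, a typical registration with System.Web.Optimization looks like the sketch below; the bundle name and file paths are illustrative. With optimizations enabled, the Scripts.Render helper emits the single versioned bundle URL (e.g. /bundles/app?v=<44-character value>) instead of the individual script tags.

using System.Web.Optimization;

public class BundleConfig
{
    public static void RegisterBundles(BundleCollection bundles)
    {
        // One bundle URL (~/bundles/app) replaces all of the individual script requests.
        bundles.Add(new ScriptBundle("~/bundles/app").Include(
            "~/Scripts/jquery-{version}.js",
            "~/Scripts/app/module1.js",
            "~/Scripts/app/module2.js"));

        BundleTable.EnableOptimizations = true;
    }
}

// In the Razor layout: @Scripts.Render("~/bundles/app")
// renders something like <script src="/bundles/app?v=<44-character value>"></script>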

The RegEx for recognizing the ASP.Net bundle URL format goes as follows:

/\?v=[a-zA-Z0-9_-]{44}$/

This works only if there is no other API or dynamic resource URL ending with ?v={44-character value}.
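A quick way to sanity check the pattern is shown below; the sample URL and its 44-character value are made up for illustration.

using System;
using System.Text.RegularExpressions;

class BundleUrlCheck
{
    static void Main()
    {
        // ASP.Net bundle URLs end with ?v= followed by a 44 character value.
        var bundleUrlPattern = new Regex(@"\?v=[a-zA-Z0-9_-]{44}$");

        string sampleUrl = "/bundles/app?v=r0sLDicvvYspssQMuVuXOB6OBNRD8Q4zs5cu0wuRXmI1";
        Console.WriteLine(bundleUrlPattern.IsMatch(sampleUrl)); // True
    }
}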

Approach 3 - Embrace SPA with JS modules & forget server side rendering 

Instead of the server-side rendering done by ASP.Net, embrace the industry trend of SPA (Single Page Application) with a JavaScript module bundler such as Webpack. Webpack can produce static bundles that can be referenced in the application like any normal file.

Recommended

If there is enough budget and time, embrace SPA; otherwise use the RegEx and stay with ASP.Net bundling and the CDN.

I do not have any association with Akamai except for having used it in projects. This post is not intended to promote any of their products.

Tuesday, May 15, 2018

Azure @ Enterprise - Finding how many nodes are really created for one HDInsight cluster

When we create an Azure HDInsight cluster, it internally creates virtual machines. The Azure portal's cluster creation blade asks for details about the head and worker nodes; we cannot set the number of head nodes, only worker nodes. All good till now.

But @ enterprise, if the HDInsight cluster needs to be in a vNet, there can be issues with the number of IP addresses available in the subnet. It gets worse if the creation needs to happen dynamically in a multi-tenant application. It is very difficult to calculate the IP address requirements of an HDICluster if we do not know how many VMs get created internally as part of one HDInsight cluster, regardless of the number of worker nodes.

Is that not available publicly? It is, and below are links to it.
https://blogs.msdn.microsoft.com/azuredatalake/2017/03/10/nodes-in-hdinsight/
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-port-settings-for-services

The above says that for Spark it creates head nodes, ZooKeeper nodes and gateway nodes. But how do we validate how many machines are created, or verify the facts ourselves? The portal never tells how many machines get created when we navigate to the blade of an already created HDICluster resource. The PowerShell object of the HDICluster instance does not have direct information about the internal machines either. So what is the alternative?

PowerShell to retrieve nodes

Again, PowerShell and some string comparisons to the rescue. Below goes the script.

$hdiClusterName = "<name of cluster without domain>"

"Assumption 1 - The vNet and subnet of all nodes are same."
"Assumption 2 - The vNet, Public IPAddresses & NIC are in same resource group"
"Assumption 3 - There will be a gateway nodes fof HDICluster and public ip address for gateway is in format publicIpgateway-<internal id>"
"Assumption 4 - A unique internal id is used to name the nodes,NICs, public addresses etc...This script heavily depend on that internal id based naming convention"

"--------------------------------------------------------"

$resource =(Get-AzureRmResource -ResourceId (Get-AzureRmHDInsightCluster -clustername $hdiClusterName).Id)

$hdiClustersVNetResourceGroupName = (Get-AzureRmResource -ResourceId $resource.Properties.computeProfile.roles[0].virtualNetworkProfile.id).ResourceGroupName

"ResourceGroup of vNet assiciated with HDI cluster- $hdiClustersVNetResourceGroupName"

$publicAddress = (Get-AzureRmPublicIpAddress -ResourceGroupName $hdiClustersVNetResourceGroupName) | Where-Object {$_.DnsSettings.DomainNameLabel -eq $hdiClusterName}

$publicIpgatewayName = $publicAddress.Name

$hdiClusterInternalId = $publicIpgatewayName.Split('-')[1]

"Internal Id of HDI used to create nodes - $hdiClusterInternalId"

"Below are the NICs used by $hdiClusterName HDI Cluster. Each NIC corresponds to one node."

$nics = Get-AzureRmNetworkInterface -ResourceGroupName $hdiClustersVNetResourceGroupName
$nics = $nics | Where-Object {$_.Name -like "*$hdiClusterInternalId"}
$nics | Select-Object -Property Name

As we can see, the script relies on the naming convention of the NICs. If Microsoft changes it, the script will fail.

From the list we can see it creates 2 head nodes, 3 ZooKeeper nodes and 2 gateway nodes, along with a minimum of 1 worker node. So a minimum of 8 IP addresses will be consumed for one HDInsight cluster. At the time of writing, the ZooKeeper and gateway nodes seem to be free; the charge is only for the head and worker node(s).
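For subnet sizing, a rough back-of-the-envelope helper based on the observation above (7 fixed nodes per cluster: 2 head + 3 ZooKeeper + 2 gateway, plus the worker nodes) could look like the sketch below. It does not account for any addresses Azure itself reserves in the subnet.

class HdInsightIpEstimator
{
    // 2 head + 3 ZooKeeper + 2 gateway nodes were observed per cluster above.
    const int FixedNodesPerCluster = 7;

    static int EstimateIpCount(int clusterCount, int workerNodesPerCluster)
    {
        return clusterCount * (FixedNodesPerCluster + workerNodesPerCluster);
    }

    static void Main()
    {
        // Example: 3 clusters with 4 worker nodes each -> 3 * (7 + 4) = 33 IP addresses.
        System.Console.WriteLine(EstimateIpCount(3, 4));
    }
}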

Ambari Portal

Another way is via the Ambari portal. If we navigate to the URL below, we can see the head nodes and ZooKeeper nodes, but not the gateway nodes.

https://<cluster name>.azurehdinsight.net/#/main/hosts

Happy scripting...

Tuesday, May 8, 2018

Azure @ Enterprise - Checking connectivity from AppServiceEnvironment to HDInsight

The background here is an enterprise Azure environment where most things are in a vNet with their own subnets. When the Spark HDInsight clusters are in a subnet separate from the application's, there will be no connectivity by default if we need to submit jobs via Livy or anything similar (this again depends on the enterprise policy). We have to open routes from the application subnet to the HDInsight subnet. How the routes are opened depends on how the infrastructure is laid out; if there are no firewalls or proxies between the application and the HDInsight clusters, simple NSG rules would be sufficient.

Suppose there are two teams involved, infrastructure on one side and development or QA on the other. How can development or QA verify that there is connectivity?

If the application is hosted on virtual machines, we can just log in and open the Ambari UI, or even run network troubleshooting commands. But what do we do if the applications are hosted as AppService WebApps? If the applications are not client facing and need to be secured from neighbors, they may be inside their own AppServiceEnvironments. Basically, no user interface is available.

The solution is simple: back to command-line mode, and somehow check HTTP connectivity to the HDICluster. Below is one PowerShell command (curl here is the built-in alias for Invoke-WebRequest) which we can execute from the command-line interface exposed by Kudu.

curl -Credential "user name" -Method "GET" -Uri "https://<cluster name>.azurehdinsight.net/livy/batches"

How to reach the Kudu console of an AppService instance is detailed in the links below.
https://blogs.msdn.microsoft.com/benjaminperkins/2017/11/08/how-to-access-kudu-scm-for-an-azure-app-service-environment-ase/
https://blogs.msdn.microsoft.com/benjaminperkins/2014/03/24/using-kudu-with-windows-azure-web-sites/

The command was tested against non-domain-joined HDInsight clusters. When we enter the above command, it will ask for the password interactively.

This is just a manual command to test connectivity. If the scenario is multi-tenant and we want to ensure connectivity from the application itself, use WebClient, HttpClient or similar methods, as sketched below.
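Below is a minimal sketch of such a programmatic check using HttpClient with basic authentication, matching the non-domain-joined cluster scenario above; the cluster name and credentials are placeholders.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class LivyConnectivityCheck
{
    static async Task Main()
    {
        string clusterName = "<cluster name>";
        string userName = "<user name>";
        string password = "<password>";

        using (var client = new HttpClient())
        {
            // Livy on a non domain joined cluster accepts HTTP basic authentication.
            var token = Convert.ToBase64String(Encoding.ASCII.GetBytes($"{userName}:{password}"));
            client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", token);

            var response = await client.GetAsync($"https://{clusterName}.azurehdinsight.net/livy/batches");
            Console.WriteLine($"{(int)response.StatusCode} {response.ReasonPhrase}");
            // Any HTTP response (even 401/403) proves network connectivity;
            // a timeout or name resolution failure indicates the route is not open.
        }
    }
}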

Tuesday, May 1, 2018

Azure @ Enterprise - Finding Subnet of HDICluster via PowerShell

Enterprises love putting resources into a virtual network (vNet), thinking it brings a certain high level of free security via isolation. HDInsight clusters can also be put into a vNet. What do we do if we have an HDICluster that was added to a subnet earlier and we do not know the subnet name now?

Silly question, isn't it? Just go to the portal and look at the cluster properties. That is what the MSDN article below says.
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-administer-use-portal-linux#list-and-show-clusters

Unfortunately, our corporate subscription is showing only the vNet name in the properties, not the subnet. So how do we get the subnet name?

Finding Subnet of HDICluster

PowerShell helps us here. Below goes the script.

$hdiClusterName = "POC001Joy"
$hdiCluster = Get-AzureRmHDInsightCluster -ClusterName $hdiClusterName
$resourceProperties = (Get-AzureRmResource -ResourceId $hdiCluster.Id).Properties
$resourceProperties.computeProfile.roles[0].virtualNetworkProfile.subnet

It simply gets the HDInsight resource object and navigates to the required property. The object model is a little confusing, but it is there.

The above PowerShell script can be entered directly in the Azure portal or run after logging in to Azure from a PowerShell window / ISE.

An issue has been filed against the MSDN documentation article regarding the missing subnet name.

Enjoy Scripting...

Tuesday, April 24, 2018

Azure @ Enterprise - Automation for scaling down HDInsight after busy hours

Introduction

Every software system involves many manual jobs to maintain it: monitoring health, periodically maintaining indexes, updating operating systems, taking backups, etc. Azure tries to automate those tasks as much as possible using a service called Azure Automation. This post is about using Azure Automation to automatically scale down HDI clusters during non-peak hours.

Ideally, the cluster could be deleted after use, but the problem is that it takes a long time to create. If a job is submitted after hours, it would have to wait a long time, so scaling down is the better option. As always, it is not a silver bullet; choose accordingly.

Azure Automation basics

There are many tutorials available to get started with Azure Automation. Below is the high-level information about it.
  • Automation allows us to write Runbooks using PowerShell or Python workflows. The Runbook is the real logic of what to do.
  • Runbooks can even be graphical, i.e. drag-and-drop coding.
  • A Runbook can be exposed via Webhooks or scheduled to trigger execution. Each execution is called a Job, and we get a JobId to track its progress.
  • Webhooks clearly separate the credentials and connections from the code, so those secrets can be managed by the deployment team.
  • It has an app-store model for distributing Runbooks, called the Runbook gallery. We can upload our own Runbooks too.
Some links on how to do things in Automation:

https://docs.microsoft.com/en-us/azure/automation/automation-first-runbook-textual
https://vincentlauzon.com/2015/11/01/azure-runbook-a-complete-simple-example/
https://docs.microsoft.com/en-us/azure/automation/automation-runbook-gallery
https://www.c-sharpcorner.com/article/create-runbooks-to-automate-tasks-in-azure-automation/

Gallery

Coming back to the context of this post: we have to automatically scale down the HDICluster during off-peak hours. Doesn't a Runbook already exist in the gallery to do the same?

Yes, there is one, but it does not seem to work if we have multiple subscriptions. Hence this post.

I contacted the author of the Runbook and reported the issue. Hopefully there will be an updated version soon. Below is the link.

Below is one link about the Runbook gallery to get started.

Runbook for scaling down HDICluster

Writing the code directly, as the approach is straightforward:

    $SubscriptionName = "<Subscription Name>"
    $ResourceGroupName = "<Resource Group where the HDICluster resides not the automation account>" 
    $ClusterName = "<name of the cluster. No need of azurehdinsight.net>"
    [Int] $Nodes = 1
    
    $connectionName = "AzureRunAsConnection"
    Write-Verbose "Starting Scaling cluster $ClusterName to $Nodes nodes..."
    try
    {
        # Get the connection "AzureRunAsConnection "
        $servicePrincipalConnection=Get-AutomationConnection -Name $connectionName         

        Write-Output "Logging in to Azure..."
        Add-AzureRmAccount `
            -ServicePrincipal `
            -TenantId $servicePrincipalConnection.TenantId `
            -ApplicationId $servicePrincipalConnection.ApplicationId `
            -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint 
    }
    catch {
        if (!$servicePrincipalConnection)
        {
            $ErrorMessage = "Connection $connectionName not found."
            throw $ErrorMessage
        } else{
            Write-Error -Message $_.Exception
            throw $_.Exception
        }
    }
    Select-AzureRMSubscription -SubscriptionName $SubscriptionName
    Write-Output "Scaling cluster $ClusterName to $Nodes nodes..."
    Set-AzureRmHDInsightClusterSize `
        -ResourceGroupName $ResourceGroupName `
        -ClusterName $ClusterName `
        -TargetInstanceCount $Nodes

Please note that Runbook parameters are avoided here to reduce code size. Ideally, the values should come in as parameters.

Prerequisites

Azure AD App

"AzureRunAsConnection" is the name of connection at the automation account level which is going to be the identity of the Runbook code. 
The above code is tested with an Azure AD Application with certificate based authentication. The certificate associated with Azure AD App has to be uploaded to Automation account. Credentials are not at the Runbook level. Meaning multiple Runbooks in the Automation account can share same credentials.

Importing modules

The Runbook needs 2 additional modules which are not already present in Automation:
  • AzureRM.profile
  • AzureRM.HDInsight
Below goes one link on how to add modules.


The interesting thing here is that AzureRM.HDInsight depends on AzureRM.profile, so we need to add AzureRM.profile first. Though the message says it has been added, the addition is an async operation. If we try to add AzureRM.HDInsight before the addition is fully complete, we get the same dependency-missing error.

Scheduling the Runbook

Scheduling is as simple as connecting the Runbook with a Schedule and associating the parameter values.

Azure Automation @ Enterprise?

Enterprises can drastically reduce the cost of maintaining systems if they start using automation. Automation can be done even without Azure Automation; PowerShell was there earlier. But the advantage of Azure Automation is the blend of scalable infrastructure and an automation language. The automation authors do not need to worry about where the code is going to run: just write it and hand it to Azure to execute.
Automation can even be used as part of a multi-tenant application to isolate security boundaries. One such mechanism is via Webhooks. High-privilege tasks such as creating an HDICluster can be limited to an Azure AD app, and Automation can run using that identity. Only the applications that know the secret Webhook URL can invoke it and get the job done. The application does not need to know infrastructure details such as the virtual network name, subnets, etc.; all of those can be local to the Automation account.

One of my favorite slogans in software engineering is 'Solve our problems before we solve others'. Azure Automation really helps in that context.

Tuesday, April 17, 2018

Azure @ Enterprise - Connecting to Management API using Azure AD App certificate

Introduction

When we develop large multi-tenant applications, we often face requirements to dynamically provision infrastructure resources, something we never need to do for simple applications. For example, assume enterprise security practice or a business requirement demands isolating each tenant's data into a separate database. Then we have to create databases on the fly when a tenant joins the application. This is drastically different from working on a simple application where one database stores everything and the most we have to deal with is indexing and partitioning at the database level and load balancing at the front end.

When we create or provision resources from the application, there are many security-related questions to be answered. What if someone hacks into the application and deletes the databases? How do we handle the noisy-neighbor problem? The list is long.

To handle security, the on-premise solution is to have separate service accounts that have permission to create databases, and to isolate that provisioning service from the web services exposed to client machines, exposing it only internally, either through authorization or via the netNamedPipe binding in WCF.

Managing resources in Azure

Cloud computing is expected to solve the infrastructure provisioning issues, and Azure does that well. When enterprise meets Azure, all the security-related things mentioned above get reevaluated. In the cloud this becomes important: otherwise an attack or poor code may create a large number of expensive resources, which directly affects the finances, or resources may be deleted, which brings the entire system down. In on-premise systems there is very limited or no way for an attack to delete a virtual machine; in the cloud that is not the case.

How do we secure a component that does infrastructure provisioning? This problem can be solved in Azure in many ways. We can have a service that is secured using Azure AD and exposed only inside the enterprise's own virtual network (vNet) in Azure. But then the question becomes how to secure the Azure AD access. Azure AD supports different types of authentication, and enterprises like MFA and certificate-based auth. The latest in the series is Managed Service Identity.

MFA (Multi-Factor Authentication) helps secure something exposed to users, who can look at the security token and enter it in a web page or device. But for service-to-service communication, or a scheduled job or queued operation talking to a service, MFA is not suitable. Certificates help there.

Securing in the world of Microservices - Automation RunBook?

In a large enterprise, there can be many multi-tenant applications that need infrastructure provisioning, and the enterprise may have only one Azure subscription for all of them. In such a scenario, handing the certificates that have the privilege to create Azure resources to all those apps is not feasible, and those apps cannot be run with that level of privilege.

One solution in Azure is to use Azure Automation. An Automation Runbook can run with high privilege and create Azure resources, and it can be exposed to the applications via Webhooks. Applications can invoke the Webhook with some kind of application identity or developer key in a request header. Once the Runbook starts, it can check the application key and act only if allowed. Please note that Webhooks do not have a security mechanism built in: the URL contains a secret token, and whoever knows the URL can invoke it. The Runbook can check the header and validate, as sketched below.
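Here is a minimal sketch of invoking such a Webhook from an application. The header name x-application-key, the request body and the key value are illustrative assumptions checked by our own Runbook code, not anything built into Azure Automation; the Webhook response body carries the started job id(s), which the application can keep for status checks.

using System;
using System.Net.Http;
using System.Threading.Tasks;

class RunbookInvoker
{
    static async Task Main()
    {
        string webhookUrl = "<secret Webhook URL from Azure Automation>";

        using (var client = new HttpClient())
        {
            var request = new HttpRequestMessage(HttpMethod.Post, webhookUrl)
            {
                Content = new StringContent("{ \"tenantId\": \"tenant-001\" }")
            };
            // Hypothetical application key header validated inside the Runbook.
            request.Headers.Add("x-application-key", "<application key>");

            var response = await client.SendAsync(request);
            string body = await response.Content.ReadAsStringAsync();
            Console.WriteLine(body); // e.g. {"JobIds":["<job guid>"]} - keep the JobId to poll status
        }
    }
}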

Writing Runbooks is easy, and there are a lot of tutorials available on how to get it right.

But one problem remains: the Webhook returns a JobId. How do the applications check the status of the Job?

Callback?

We end up with certificates again if we need to use the Azure Management API. But status reporting becomes easy if the Automation Runbook accepts a callback URL and invokes it on Job completion.

A Webhook accepting another Webhook to call on completion may make things complicated, but it is a good solution that avoids polling.

Unfortunately, if we end up polling, below are the code snippets that can be used to get the Automation Job status using the .Net SDK. There are many code snippets available on the internet, but it is very difficult to find working code that uses a certificate to authenticate to the Azure Management API.

Since the authentication APIs accept plain strings and the parameter names are confusing, it gets complicated easily.

Code snippets

Below is the entry point, which accepts the necessary inputs to locate an Azure Automation Job.

private static async Task<JobGetResponse> GetJobResponse(string subscriptionGuid, string resourceGroupName, string AutomationAccount, string JobId)
{
    AutomationManagementClient client = await AutomationManagementClientFactory.Get(subscriptionGuid);
    return client.Jobs.Get(resourceGroupName, AutomationAccount, Guid.Parse(JobId));
}

The returned JobGetResponse has a Job property which exposes most of the properties of the Job.
In order to get the code working, we need a valid AutomationManagementClient. How to feed the proper string values into the flow is the trickiest part.

internal class AutomationManagementClientFactory
{
    internal static async Task<AutomationManagementClient> Get(string subscriptionGuid)
    {
        string token = await TokenFactory.GetAccessToken("https://management.core.windows.net/");
        TokenCloudCredentials tcc = new TokenCloudCredentials(subscriptionGuid,token);
        return new AutomationManagementClient(tcc);
    }
}

This depends on the TokenFactory. But before going there, note the catch: the hard-coded URL, which points to the Azure management endpoint. Let us see the TokenFactory class.

internal class TokenFactory
{
    /// <summary>
    /// Get Access Token
    /// </summary>
    /// <param name="resource"></param>
    /// <returns></returns>
    internal static async Task<string> GetAccessToken(string resource)
    {
        var context = new AuthenticationContext($"https://login.windows.net/{Configurations.TenantId}", TokenCache.DefaultShared);
        var assertionCert = GetClientAssertionCertificate(Configurations.AzureADApplicationId);
        var result = await context.AcquireTokenAsync(resource,assertionCert );
        return result.AccessToken;
    }
    internal static IClientAssertionCertificate GetClientAssertionCertificate(string clientId)
    {
        string certIssuerName = Configurations.GetConfigValueByKey("CertificateIssuerName");
        X509Certificate2 clientAssertionCertPfx = CertificateHelper.FindCertificateByIssuerName(certIssuerName);
        return new ClientAssertionCertificate(clientId, clientAssertionCertPfx);
    }
}

The responsibility of this class is to get an authentication token for a resource, and the resource here is the Azure Management endpoint. The authentication context uses the Azure AD tenant Guid to get the token. The TenantId is not the Azure AD application id.

It uses a certificate which is found using the issuer name. The criteria for finding the certificate can be anything, but the rule is that it must be the same certificate that is used for the Azure AD application. The Azure AD application's id is obtained from the config; it has to be the Application Id, not the object id of the Azure AD app.

The signature may confuse us: the client assertion certificate uses the application id, but the parameter is named clientId to keep it generic.

The last thing is the certificate helper. As mentioned above, how we get the cert is not relevant as long as it is the right certificate. Adding the code for that as well.

public static class CertificateHelper
{
    /// <summary>
    /// Find Certificate By Issuer name
    /// </summary>
    /// <param name="findValue"></param>
    /// <returns></returns>
    public static X509Certificate2 FindCertificateByIssuerName(string findValue)
    {
        using (X509Store store = new X509Store(StoreName.My, StoreLocation.CurrentUser))
        {
            store.Open(OpenFlags.ReadOnly);
            X509Certificate2Collection col = store.Certificates.Find(X509FindType.FindByIssuerName,
                findValue, false); // Don't validate certs, since the test root isn't installed.

            return col.Count == 0 ? throw new CryptographicException("Certificate not found") : col[0];

        }
    }
}


Prerequisites / Environment setup

  • An Azure AD application which has permission to the resource group where the Automation account resides.
  • The above Azure AD application should accept the certificate and hand back a token, so install the certificate into the proper store. In this case the code searches the current user's personal store; if the code runs from an IIS web application using service accounts, the store can be different.

Why is the code snippet important?

Once we get the snippet working and look at the code, it feels simple. But when we hit a situation where it does not work, it is hard to understand anything, such as what the client id is, what the resource id is, etc.

The hard-coded strings are applicable to the public Azure cloud. When the code runs in Azure Government or other sovereign clouds, the values will differ.

Exceptions

Below are some exceptions which may occur during development.

Access token from wrong audience

The exception below may occur if the token obtained from TokenFactory is not associated with the right resource.

"The access token has been obtained from wrong audience or resource ’https://management.core.windows.net'. It should exactly match (including forward slash) with one of the allowed audiences ‘https://management.core.windows.net/’,’https://management.azure.com/’"

Enjoy...