Tuesday, April 24, 2018

Azure @ Enterprise - Automation for scaling down HDInsight after busy hours

Introduction

Every software system involves many manual jobs to maintain it: monitoring health, periodically maintaining indexes, updating operating systems, taking backups and so on. Azure tries to automate those tasks as much as possible using a service called Azure Automation. This post is about using Azure Automation to automatically scale down HDI clusters during non-peak hours.

Ideally, the cluster could simply be deleted after use. The problem is that creating a cluster takes a lot of time, so a job submitted after hours would wait a long while for a fresh cluster. Scaling down is therefore the better option. As always, it is not a silver bullet; choose accordingly.

Azure Automation basics

There are many tutorials available on getting started with Azure Automation. Below is the high-level information about Azure Automation.
  • Automation allows us to write Runbooks using PowerShell, PowerShell Workflow and Python. The Runbook holds the real logic of what to do.
  • Runbooks can even be graphical, i.e., drag-and-drop coding.
  • A Runbook can be exposed via Webhooks or scheduled to trigger execution. Each execution is called a Job, and we get a JobId to track progress (see the sketch after this list).
  • Webhooks clearly separate the credentials and connections from the code, so those secrets can be managed by the deployment team.
  • It has an app-store model to distribute Runbooks, called the Runbook gallery. We can upload our own Runbooks too.
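For illustration, a minimal sketch of invoking a Runbook Webhook from PowerShell. The URL is a placeholder for the secret URL generated while creating the Webhook, and the response shape assumes the standard Automation Webhook contract, which returns the ids of the started Jobs:

    # Placeholder: the secret URL generated while creating the Webhook
    $webhookUrl = "https://s2events.azure-automation.net/webhooks?token=<secret token>"

    # Optional payload; it reaches the Runbook via the WebhookData parameter
    $body = @{ ClusterName = "myhdicluster" } | ConvertTo-Json

    # The Webhook responds with the ids of the started Jobs
    $response = Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body
    $response.JobIds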
Some links on how to do things in Automation

https://docs.microsoft.com/en-us/azure/automation/automation-first-runbook-textual
https://vincentlauzon.com/2015/11/01/azure-runbook-a-complete-simple-example/
https://docs.microsoft.com/en-us/azure/automation/automation-runbook-gallery
https://www.c-sharpcorner.com/article/create-runbooks-to-automate-tasks-in-azure-automation/

Gallery

Coming back to the context of this post: we have to automatically scale down the HDICluster during off-peak hours. Doesn't a Runbook already exist in the gallery to do the same?

Yes, there is one, but it doesn't seem to work if we have multiple Subscriptions. Hence this post.

I contacted the author of the Runbook and reported the issue; hopefully an updated version will come soon.

The Runbook gallery documentation linked above is a good place to get started.

Runbook for scaling down HDICluster

Directly writing the code, as the approach is straightforward:

    $SubscriptionName = "<Subscription Name>"
    $ResourceGroupName = "<Resource Group where the HDICluster resides not the automation account>" 
    $ClusterName = "<name of the cluster. No need of azurehdinsight.net>"
    [Int] $Nodes = 1
    
    $connectionName = "AzureRunAsConnection"
    Write-Verbose "Starting Scaling cluster $ClusterName to $Nodes nodes..."
    try
    {
        # Get the connection "AzureRunAsConnection"
        $servicePrincipalConnection = Get-AutomationConnection -Name $connectionName

        Write-Output "Logging in to Azure..."
        Add-AzureRmAccount `
            -ServicePrincipal `
            -TenantId $servicePrincipalConnection.TenantId `
            -ApplicationId $servicePrincipalConnection.ApplicationId `
            -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint 
    }
    catch {
        if (!$servicePrincipalConnection)
        {
            $ErrorMessage = "Connection $connectionName not found."
            throw $ErrorMessage
        } else{
            Write-Error -Message $_.Exception
            throw $_.Exception
        }
    }
    Select-AzureRMSubscription -SubscriptionName $SubscriptionName
    Write-Output "Scaling cluster $ClusterName to $Nodes nodes..."
    Set-AzureRmHDInsightClusterSize `
        -ResourceGroupName $ResourceGroupName `
        -ClusterName $ClusterName `
        -TargetInstanceCount $Nodes

Please note that parameters are avoided to reduce code size. Ideally the values should come in as Runbook parameters, as sketched below.
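As a minimal sketch, the hard-coded values above could become a parameter block at the top of the Runbook (the names follow the variables used above):

    param
    (
        [Parameter(Mandatory = $true)]
        [string] $SubscriptionName,

        [Parameter(Mandatory = $true)]
        [string] $ResourceGroupName,

        [Parameter(Mandatory = $true)]
        [string] $ClusterName,

        [int] $Nodes = 1
    )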

Prerequisites

Azure AD App

"AzureRunAsConnection" is the name of connection at the automation account level which is going to be the identity of the Runbook code. 
The above code is tested with an Azure AD Application with certificate based authentication. The certificate associated with Azure AD App has to be uploaded to Automation account. Credentials are not at the Runbook level. Meaning multiple Runbooks in the Automation account can share same credentials.

Importing modules

The Runbook needs 2 additional modules which are not already present in Automation.
  • AzureRM.profile
  • AzureRM.HDInsight
Modules can be added from the Automation account's Modules blade in the portal; the import can also be scripted, as sketched below.


The interesting thing here is that AzureRM.HDInsight depends on AzureRM.profile, so we need to add AzureRM.profile first. Though the message says the module is added, the import is an asynchronous operation. If we try to add AzureRM.HDInsight before the first import is fully complete, we get the same dependency-missing error.
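Below is a rough sketch of scripting the import with the AzureRM.Automation cmdlets, assuming the modules come from the PowerShell Gallery; the resource group and account names are placeholders:

    $rg = "<automation resource group>"
    $account = "<automation account name>"

    # Import the dependency first
    New-AzureRmAutomationModule -ResourceGroupName $rg -AutomationAccountName $account `
        -Name "AzureRM.profile" `
        -ContentLink "https://www.powershellgallery.com/api/v2/package/AzureRM.profile"

    # The import is asynchronous; wait for it to finish before adding AzureRM.HDInsight
    do
    {
        Start-Sleep -Seconds 15
        $module = Get-AzureRmAutomationModule -ResourceGroupName $rg `
            -AutomationAccountName $account -Name "AzureRM.profile"
    } while ($module.ProvisioningState -notin @("Succeeded", "Failed"))

    New-AzureRmAutomationModule -ResourceGroupName $rg -AutomationAccountName $account `
        -Name "AzureRM.HDInsight" `
        -ContentLink "https://www.powershellgallery.com/api/v2/package/AzureRM.HDInsight"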

Scheduling the Runbook

Scheduling is as simple as connecting the Runbook to a Schedule and associating the parameter values. A sketch of doing the same from PowerShell follows.
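This is a rough sketch using the AzureRM.Automation cmdlets, reusing the $rg and $account placeholders from the module sketch above; the schedule and Runbook names are illustrative, and the start time must lie in the future:

    # A daily schedule that fires at 7 PM
    New-AzureRmAutomationSchedule -ResourceGroupName $rg -AutomationAccountName $account `
        -Name "ScaleDownAfterHours" -StartTime "19:00" -DayInterval 1

    # Link the Runbook to the schedule along with its parameter values
    Register-AzureRmAutomationScheduledRunbook -ResourceGroupName $rg -AutomationAccountName $account `
        -RunbookName "Scale-HDICluster" -ScheduleName "ScaleDownAfterHours" `
        -Parameters @{ SubscriptionName = "<subscription>"; ResourceGroupName = "<cluster rg>"; ClusterName = "<cluster>"; Nodes = 1 }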

Azure Automation @ Enterprise?

Enterprises can drastically reduce the cost of maintaining systems if they start using Automation. Automation can be done even without Azure Automation; PowerShell was there earlier. But the advantage of Azure Automation is the blend of scalable infrastructure and an automation language. Automation authors don't need to worry about where the code is going to run: just write the code and give it to Azure to execute.
Automation can even be used as part of a multi-tenant application to isolate security boundaries. One such mechanism is via Webhooks. High-privilege tasks such as creating an HDICluster can be limited to an Azure AD App, and Automation can run using that identity. Only the applications which know the secret Webhook URL can invoke it and get the job done. The application doesn't need to know infrastructure details such as the virtual network name, subnets and so on; all those can stay local to the Automation account. A sketch of validating a caller-supplied key inside such a Runbook follows.
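A rough sketch of that validation; the header name and the encrypted Automation variable holding the expected key are illustrative assumptions:

    # Runbooks started via a Webhook receive a single WebhookData parameter
    param([object] $WebhookData)

    if ($WebhookData -eq $null)
    {
        throw "This Runbook can only be started from a Webhook."
    }

    # Illustrative header carrying the caller's application key
    $appKey = $WebhookData.RequestHeader.'x-application-key'

    # Illustrative encrypted Automation variable holding the allowed key
    $expectedKey = Get-AutomationVariable -Name 'AllowedApplicationKey'
    if ($appKey -ne $expectedKey)
    {
        throw "Caller is not allowed to invoke this Runbook."
    }

    $payload = ConvertFrom-Json -InputObject $WebhookData.RequestBody
    # ...do the high-privilege work using $payload...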

One of my favorite slogans in software engineering is 'Solve our problems before we solve others'. Azure Automation really helps in that context.

Tuesday, April 17, 2018

Azure @ Enterprise - Connecting to Management API using Azure AD App certificate

Introduction

When we develop large multi-tenant applications, we often face requirements to dynamically provision infrastructure resources, which we never need to do for simple applications. For example, assume the enterprise security practice or a business requirement demands isolating each tenant's data into a separate database. Then we have to create databases on the fly when a tenant joins the application. This is drastically different from working on a simple application where one database stores everything, and the most we have to deal with is indexing and partitioning at the database level and load balancing at the front end.

When we create or provision resources from the application, there are many security-related questions to be answered. What if someone hacks into the application and deletes the databases? How do we handle the noisy-neighbor problem? The list is large.

To handle security on-premise, the solution is to have separate service accounts which have permission to create databases, and to isolate that service from the web services exposed to client machines, exposing it only internally, either through authorization or via the netNamedPipe binding in WCF.

Managing resources in Azure

Cloud computing is expected to solve all the infrastructure-provisioning issues, and Azure does that well. When enterprise meets Azure, all the security-related things mentioned above get reevaluated. In the cloud this becomes important, because an attack or poor code may create a large number of highly priced resources, which directly affects the financials. Or resources can be deleted, which brings the entire system down. In on-premise systems there is very limited or no way an attack can delete a virtual machine; in the cloud there is.

How to secure a component which does infrastructure provisioning? This problem can be solved in Azure in many ways. We can have a service which is secured using Azure AD and exposed only inside the enterprise's own virtual network (vNet) in Azure. But then the question becomes: how do we secure the Azure AD side? Azure AD supports different types of authentication, and enterprises like MFA and certificate-based auth. The latest in the series is Managed Service Identity.

MFA (Multi-Factor Authentication) helps secure something exposed to users, who can look at the security token and enter it in a web page or device. But for service-to-service communication, or a scheduled job or queued operation talking to a service, MFA is not suitable. Certificates help there.

Securing in the world of Microservices - Automation RunBook?

In a large enterprise, there could be many applications which are multi-tenant and need infrastructure provisioning, while the enterprise may have only one Azure subscription for all of them. In such a scenario, giving certificates which have the privilege to create Azure resources to all those apps is not feasible; those apps cannot be run with that level of privilege.

One solution in Azure is to use Azure Automation. An Automation Runbook can run with high privilege and create Azure resources, and it can be exposed via Webhooks to the applications. Applications can invoke the Webhook with some kind of application identity or developer key in a request header. Once the Runbook starts, it can check for the application key and act only if the caller is allowed. Please note that Webhooks don't have a security mechanism built in: the URL contains a secret token, and whoever knows the URL can invoke it. The Runbook can check the header and validate, as sketched in the previous post.

Writing Runbooks is easy, and there are lots of tutorials available on how to get it right.

But there is a problem remaining: the Webhook returns a JobId. How do the applications check the status of the Job?

Callback?

We end up with certificates again if we need to use the Azure Management API. Status reporting would be easy if the Automation Runbook accepted a callback URL and invoked it on Job completion.

A Webhook accepting another Webhook to call on completion may make things complicated, but it is a good solution that avoids polling.

Unfortunately, if we end up polling, below are the code snippets which can be used to get the Automation Job status using the .Net SDK. There are many code snippets available on the internet, but it is very difficult to find working code which uses a certificate to authenticate to the Azure Management API.

Since the authentication APIs accept plain strings and the names are confusing, it gets complicated easily.

Code snippets

Below is the entry point which accepts the necessary inputs to locate an Azure Automation Job:

private static async Task<JobGetResponse> GetJobResponse(string subscriptionGuid, string resourceGroupName, string automationAccount, string jobId)
{
    AutomationManagementClient client = await AutomationManagementClientFactory.Get(subscriptionGuid);
    return client.Jobs.Get(resourceGroupName, automationAccount, Guid.Parse(jobId));
}

The returned JobGetResponse has a Job property which exposes most of the properties of the Job.
In order to get the code working, we need a valid AutomationManagementClient. How to properly feed the string values into the flow is the trickiest part.

internal class AutomationManagementClientFactory
{
    internal static async Task<AutomationManagementClient> Get(string subscriptionGuid)
    {
        string token = await TokenFactory.GetAccessToken("https://management.core.windows.net/");
        TokenCloudCredentials tcc = new TokenCloudCredentials(subscriptionGuid,token);
        return new AutomationManagementClient(tcc);
    }
}

This depends on the TokenFactory. But before going there, note the catch: the hard-coded URL, which is the management endpoint. Let's see the TokenFactory class.

internal class TokenFactory
{
    /// <summary>
    /// Get Access Token
    /// </summary>
    /// <param name="resource"></param>
    /// <returns></returns>
    internal static async Task<string> GetAccessToken(string resource)
    {
        var context = new AuthenticationContext($"https://login.windows.net/{Configurations.TenantId}", TokenCache.DefaultShared);
        var assertionCert = GetClientAssertionCertificate(Configurations.AzureADApplicationId);
        var result = await context.AcquireTokenAsync(resource, assertionCert);
        return result.AccessToken;
    }
    internal static IClientAssertionCertificate GetClientAssertionCertificate(string clientId)
    {
        string certIssuerName = Configurations.GetConfigValueByKey("CertificateIssuerName");
        X509Certificate2 clientAssertionCertPfx = CertificateHelper.FindCertificateByIssuerName(certIssuerName);
        return new ClientAssertionCertificate(clientId, clientAssertionCertPfx);
    }
}

The responsibility of this class is to get an authentication token for a resource; the resource here is the Azure Management endpoint. The authentication context uses the Azure AD tenant GUID to get the token. Note that the TenantId is not the Azure AD Application id.

It uses a certificate which is found by its issuer name. The criteria used to find the certificate can be anything, but the rule is that it must be the same certificate used for the Azure AD Application. The Azure AD Application's id is obtained from config; it has to be the Application Id, not the object id of the Azure AD App.

The signature may confuse us: the client assertion certificate uses the application id, but the parameter is named client id to keep it generic.

The last thing is the certificate helper. As mentioned above, how we get the cert is not relevant as long as it is the right certificate. Adding the code for that as well.

public static class CertificateHelper
{
    /// <summary>
    /// Find Certificate By Issuer name
    /// </summary>
    /// <param name="findValue"></param>
    /// <returns></returns>
    public static X509Certificate2 FindCertificateByIssuerName(string findValue)
    {
        using (X509Store store = new X509Store(StoreName.My, StoreLocation.CurrentUser))
        {
            store.Open(OpenFlags.ReadOnly);
            X509Certificate2Collection col = store.Certificates.Find(X509FindType.FindByIssuerName,
                findValue, false); // Don't validate certs, since the test root isn't installed.

            return col.Count == 0 ? throw new CryptographicException("Certificate not found") : col[0];
        }
    }
}


Prerequisites / Environment setup

  • An Azure AD Application which has permission to the resource group where the Automation account resides (see the sketch after this list).
  • The above Azure AD application set up to accept a certificate for issuing tokens. Install the certificate into the proper store; in this case the code searches the Current User's personal store. If this code runs from an IIS web application using service accounts, the store can be different.
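For completeness, a hedged sketch of granting that permission with the AzureRM cmdlets; the role and names are illustrative, and a narrower role than Contributor may suffice for merely reading job status:

# Grant the AD App's service principal access to the Automation account's resource group
New-AzureRmRoleAssignment `
    -ServicePrincipalName "<application id of the Azure AD App>" `
    -RoleDefinitionName "Contributor" `
    -ResourceGroupName "<resource group of the automation account>"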

Why the code snippet is important?

When we get the snippet working and look at the code, it feels simple. But when we hit a situation where it doesn't work, we cannot figure out anything, such as what the client id is, what the resource id is, and so on.

The hard-coded strings are applicable to the public Azure cloud. When the code runs in Azure Government or other sovereign clouds, the values will differ.

Exceptions

Below are some exceptions which may occur during development.

Access token from wrong audience

The below exception may occur if the token obtained from TokenFactory is not associated with the right resource.

"The access token has been obtained from wrong audience or resource ’https://management.core.windows.net'. It should exactly match (including forward slash) with one of the allowed audiences ‘https://management.core.windows.net/’,’https://management.azure.com/’"

Enjoy...

Tuesday, April 10, 2018

Azure @ Enterprise - Tuning the HDIClusters programmatically

HDInsight Cluster

HDInsight, shortly referred to as HDI, is the Microsoft wrapper around Hadoop and other open-source data-analytics technologies such as Spark. It is based on the Hortonworks platform. It can be installed on-premise and is available in Azure as well, in the form of a platform service.

In Azure, the advantage is that scaling can be done easily, though it takes around 15 minutes. We can create a cluster for a specific workload and delete it once the work is done. This helps save a lot of money, as the cluster is costly while running.

HDInsight @ Enterprise

At enterprise scale the workloads differ, and different application teams may want to use an HDICluster for their various workloads. Either every application writes its own code to create HDIClusters using the Azure management APIs, or there is a common service which serves all the applications. When we have a common service and different application workloads have different cluster demands, we need to adjust the cluster properties.

Setting the cluster properties is really complex, since the properties are spread across different levels. There are properties at the cluster level such as the number of worker nodes, at the node-manager level, at the Livy job-submission level, worker JVM properties and so on. Getting these properties under control is a big challenge.

Sometimes we may want to reuse clusters instead of deleting them, to save cluster-creation time. At the time of writing this post, it takes around 15-20 minutes to get a new cluster created. If the common service can hand existing clusters to subsequent consumers, it saves a good amount of time.

Manually

Manually, we can easily adjust the properties from the Azure portal and the Ambari views of the specific cluster. Some links are given below.

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-manage-ambari
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-changing-configs-via-ambari
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-resource-manager
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-administer-use-portal-linux

After setting some properties the cluster needs a restart. The portal shows whether a restart is required based on which property was changed.

API

It is easy to adjust properties at the cluster level using the Azure APIs. But when it comes to properties inside the cluster, such as the Node Manager heap size, we have to rely on the Ambari API. Below are some links on doing the same.

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-manage-ambari-rest-api#example-update-configuration
https://community.hortonworks.com/content/supportkb/49134/how-to-stop-start-a-ambari-service-component-using.html

Using these APIs is one of the toughest things in the API world. We have to get the current settings, apply our changes and send them back with a new tag, similar to a source-control workflow: get latest, make the change and commit the change set. A rough sketch follows.
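Below is a rough PowerShell sketch of that cycle against the Ambari REST API, following the pattern in the Microsoft document linked above; the cluster name, config type and property are placeholders, and the payload shape should be verified against that document:

# Cluster name and the cluster login (admin) credential
$cluster = "<cluster name>"
$cred = Get-Credential
$baseUri = "https://$cluster.azurehdinsight.net/api/v1/clusters/$cluster"

# Get latest: find the current tag of the config type to change
$desired = Invoke-RestMethod -Uri "${baseUri}?fields=Clusters/desired_configs" -Credential $cred
$tag = $desired.Clusters.desired_configs.'livy2-conf'.tag

# Read the properties stored under that tag
$current = Invoke-RestMethod -Uri "$baseUri/configurations?type=livy2-conf&tag=$tag" -Credential $cred
$properties = $current.items[0].properties

# Do the change and commit it back under a new tag
$properties.'livy.server.session.timeout' = '8h'   # illustrative property
$payload = @{ Clusters = @{ desired_config = @(@{
        type = 'livy2-conf'
        tag = "version$([DateTimeOffset]::Now.ToUnixTimeSeconds())"
        properties = $properties
    }) } } | ConvertTo-Json -Depth 10

Invoke-RestMethod -Uri $baseUri -Method Put -Credential $cred `
    -Headers @{ 'X-Requested-By' = 'ambari' } -ContentType 'application/json' -Body $payload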

If the jobs are submitted using Livy, there is an option to send some parameters at the job level; examples of those parameters are the executor cores. A sketch follows the link below.

https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-troubleshoot-spark#how-do-i-configure-a-spark-application-by-using-livy-on-clusters
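For illustration, a minimal sketch of submitting a Livy batch with such job-level settings; the jar path and class name are placeholders:

# Submit a Spark batch via Livy with job-level resource settings
$cluster = "<cluster name>"
$cred = Get-Credential   # cluster login (admin) account

$job = @{
    file           = "wasbs:///example/jars/<your-job>.jar"   # placeholder jar
    className      = "com.example.SparkJob"                   # placeholder class
    executorCores  = 4
    executorMemory = "4g"
} | ConvertTo-Json

Invoke-RestMethod -Uri "https://$cluster.azurehdinsight.net/livy/batches" -Method Post `
    -Credential $cred -Headers @{ 'X-Requested-By' = 'admin' } `
    -ContentType 'application/json' -Body $job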

Handle restarts

As mentioned earlier, some properties require the cluster services to restart, and the UI shows a warning when a restart is needed. What do we do when using the API? The best answer is to restart the service after setting the properties, regardless of whether a restart is needed or not :)

https://community.hortonworks.com/questions/50823/ambari-rest-api-to-restart-all-services.html
https://community.hortonworks.com/questions/123749/api-restart-the-services-that-restart-is-required.html
https://stackoverflow.com/questions/45739985/api-restart-the-services-that-restart-is-required

Since the usage is pretty much straightforward, only the minimal sketches above are included. If anyone is facing issues with these APIs, please comment on this post.

Tuesday, April 3, 2018

IoT - NodeMCU

Introduction

Nobody nowadays needs an introduction to IoT. To me it is more of a buzzword for all the devices or things getting connected to the internet. Whether the term 'IoT' exists or not, devices will get connected to the internet as long as they can talk over WiFi, phone networks or any other connection mechanism.

Server side

IoT and cloud seem to be getting coupled together through specialized IoT offerings from cloud providers. As long as we can host an HTTP endpoint on the internet with the capability to handle the load, and the device has connectivity to it, we are good for most use cases; we really don't need to follow the buzzwords of cloud providers. But so many marketing sessions on IoT have happened, and are happening, that just demo the server side, with nothing at all about the device side.

Device / Thing

A device is any Turing machine with internet connectivity and the required sensors and actuators. The decision making is done by the Turing machine, which is nothing but a computer CPU. Generally people use small boards which have these capabilities; Arduino and Raspberry Pi are famous in this area. If we use an old PC, laptop or mobile connected to the required sensors and actuators, it is the same as the Thing. It doesn't necessarily have to be a small device.

But the small device has an advantage over PCs, laptops and mobiles: the display and keyboard are optional for the so-called Thing. Basically the boards are modular. If we need a moisture sensor we can plug one in; otherwise we just don't use one. That reduces the cost drastically when we use the boards. The board ensures there is computing power from the processor, and I/O ports to communicate with the world.

NodeMCU

This is one of the boards similar to Arduino and Raspberry Pi: more powerful than Arduino but below Raspberry Pi. The boards are available from 2.5 USD and up. NodeMCU is currently in version 3, which indicates there were two versions earlier whose reported issues have presumably been fixed. So it is a good time to buy and try one now.

Why NodeMCU

The simple answer is that it is a real out-of-the-box WiFi-enabled Thing; other boards may need additional purchases to connect to the internet. Second, it has better specs than an Arduino in the same price range.

This post is mainly about my experience with one variant of the NodeMCU V3 board named LoLin. Below is the AliExpress link:
https://www.aliexpress.com/item/new-Wireless-module-CH340-NodeMcu-V3-Lua-WIFI-Internet-of-Things-development-board-based-ESP8266/32556303666.html

Driver

We have to install the proper USB driver in order to connect from the computer to the board and push the code; google for the CH340G driver. Fortunately Win10 didn't have that issue for me, though I am not sure whether I had installed the driver as part of an earlier experiment.

Pin layout

If we google for NodeMCU quick-start tutorials, we get steps which take us towards blinking the LED on the board. When I got the board I tried the same, but the code in those tutorials never blinked the LED.

At first I was under the impression that they had shipped a faulty one, but the rating of the listing made me rethink: how could that listing get 108 five-star ratings if they ship faulty products? So I started deep googling. 'Deep googling' seems to be the new term for when normal googling doesn't help; that deserves a separate post.

It landed me in another world of incompatibilities among the NodeMCU boards. Below is one post discussing the LED built into the board and the pin number used to drive it. The interesting thing is that we need to output 0 to turn the LED on and vice versa, i.e., the on-board LED is active low.

https://arduino.stackexchange.com/questions/38477/does-the-node-mcu-v3-lolin-not-have-a-builtin-led

Connecting to WiFi

Another tutorial covers connecting to WiFi; this is yet to be tested with my board:
http://henrysbench.capnfatz.com/henrys-bench/arduino-projects-tips-and-more/arduino-esp8266-lolin-nodemcu-getting-started/

I am planning to use the board to automatically water our curry-leaf plant from the aquarium by sensing the moisture in the soil. Hopefully more updates will come soon.

References

https://frightanic.com/iot/comparison-of-esp8266-nodemcu-development-boards/