Tuesday, July 17, 2018

Starting Scala Spark - Streaming via TCP network socket in Windows

This is a simple tutorial about a basic Scala Spark word count via streaming. There are many samples achieving the same thing. Then why this post?

One reason is that almost all the posts talk about how it works on Linux machines; very rarely do we find posts where the authors are using Windows machines. Another reason is to simplify the code so it is understandable to a Scala beginner, for example by using proper parameter names instead of reduceByKey(_ + _), which is very difficult to understand at first. Let's get started.
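Before the Spark code, the semantics of reduceByKey(_ + _) can be illustrated outside Spark. Below is a plain Python sketch (not Spark code, purely an illustration of "group by key and fold the values with +"):

```python
from collections import defaultdict

# (word, 1) pairs, as produced by the map step of a word count
pairs = [("spark", 1), ("streaming", 1), ("spark", 1)]

# reduceByKey((count1, count2) => count1 + count2): for each key,
# combine all of its values using the given function, here plain addition
counts = defaultdict(int)
for word, count in pairs:
    counts[word] += count  # the '+' is exactly what `_ + _` denotes

print(dict(counts))  # {'spark': 2, 'streaming': 1}
```

The underscores in `_ + _` are just Scala shorthand for an anonymous two-argument function, the same role the `+=` fold plays above.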


The code is straightforward, as given below.

package org.apache.spark.examples.streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Counts words in the input stream.
object SparkStreaming_HdfsWordCount {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR) // Suppress log noise (which is useful sometimes)
    var master = "local[2]" // Mutable so -master can override it; helps to run the same code locally and in cluster mode
    args.sliding(2, 2).toList.collect {
      case Array("-master", arg: String) => master = arg
    }
    val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster(master)
    // Set up streaming with a 10-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 8081)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => {
      // println(word) // Debug purpose: can see each individual word
      (word, 1) // 1 means one occurrence of the word
    }).reduceByKey((count1, count2) => count1 + count2) // Sum the occurrences per word

    wordCounts.print() // Display the output

    ssc.start()
    ssc.awaitTermination()
  }
}


Some changes were made to the standard samples during troubleshooting. If we create a sample Spark project in the IntelliJ IDE, we get a streaming program which uses a file location as the stream source. It can be a little tedious to get the file system working on Windows machines, with prefixes such as file:// or file:/// and the confusion over backslashes versus forward slashes, etc. So it is better to use the network as the source.

If the environment is set up correctly, running the above program will try to connect to localhost:8081 and process the network stream.


The above Scala code makes a connection to port 8081 on localhost. But where is the server? The server here means the producer of events into the stream. Most posts talk about a utility called nc which can be used to push content into the stream.

But that is a Linux-specific utility named netcat (nc). Though there are some equivalents available for Windows, it may be a little difficult to get things working unless we download binaries from known or unknown sites. For simplicity, let's have the producer in PowerShell.

$port = 8081
$endpoint = New-Object System.Net.IPEndPoint ([System.Net.IPAddress]::Any, $port)
$listener = New-Object System.Net.Sockets.TcpListener $endpoint
$listener.Server.ReceiveTimeout = 5000
$listener.Start()

try {
    Write-Host "Listening on port $port, press CTRL+C to cancel"

    While ($true) {
        if (!$listener.Pending()) {
            Start-Sleep -Seconds 1
            continue
        }
        $client = $listener.AcceptTcpClient()

        $id = New-Guid
        # Newline-terminated so Spark's socketTextStream can split the stream into lines
        $data = [Text.Encoding]::ASCII.GetBytes("joymon $id`n")
        $client.GetStream().Write($data, 0, $data.Length)
        $client.Close()
        "Sent message - joymon $id - $(Get-Date)"
    }
}
catch {
    Write-Error $_
}
finally {
    $listener.Stop()
    Write-Host "Listener Closed Safely"
}

This is again straightforward. Inside a loop, it waits for an incoming connection. Once a connection is established, it sends messages suffixed with a GUID. The GUID makes sure all the messages are reaching Spark streaming. The PowerShell code is mainly composed from these link1 & link2; thanks to the authors. Just copy-paste the code into PowerShell ISE and run.
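For readers not on Windows, the same producer idea can be sketched in Python as well. The function names and message format below are mine, mirroring the PowerShell script; note the trailing newline, which socketTextStream needs to split the stream into lines:

```python
import socket
import uuid

def make_message() -> str:
    # GUID suffix makes every message unique and easy to spot in Spark's output
    return f"joymon {uuid.uuid4()}\n"  # newline-terminated so Spark can split lines

def serve(port: int, messages: int) -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()  # blocks until the Spark job connects
    for _ in range(messages):
        conn.sendall(make_message().encode("ascii"))
    conn.close()
    srv.close()
```

Calling serve(8081, 100) and then starting the Spark job should show 100 GUID-tagged words flowing through the word count.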

Real-world scenarios can be far more complex than this word count, but this gives us an idea of how Spark Streaming works.

Tuesday, July 10, 2018

Azure @ Enterprise - How does the AppInsight SDK hook into the WebClient and HttpClient classes?

This is a continuation of the previous post about the .Net SDK for Azure AppInsight and end-to-end cross-process correlation. One of the open items in that post is how the AppInsight SDK tracks http dependency calls. Without us doing anything, how can AppInsight subscribe or hook into the http calls going out of the process?

Capture the outgoing http traffic from a .Net process in C#

Capturing or intercepting the http traffic of a .Net process, by a component inside that same process, is independent of the Azure AppInsight SDK. The SDK is just one use case of capturing outgoing http traffic; there could be many others, such as security auditing, request modification, etc.

Anyway, let's start from the AppInsight source code and dig down to the place where the hook happens. The source code of the AppInsight SDK is available in the GitHub repository below.


Navigating Source code of AppInsight SDK

In order to find out how the AppInsight SDK for .Net Server (i.e., legacy .Net, as .Net Core is the latest) hooks into the http traffic, the entry point can be the ApplicationInsights.config file. If we search for the word 'Dependency', we find 'DependencyTrackingTelemetryModule'. Yes, this resembles the HttpModule, which is powerful enough to do anything in the ASP.Net web world. The first thing to try is commenting out that section and checking whether the dependencies are still getting logged, to make sure this is the right thing to look at.

Anyway, let's move ahead. Once we identify the source code of the DependencyTrackingTelemetryModule class, we can see the code snippet below inside it.

#if NET45
// Net40 does not support framework event source

private HttpDesktopDiagnosticSourceListener httpDesktopDiagnosticSourceListener;
private FrameworkHttpEventListener httpEventListener;
private FrameworkSqlEventListener sqlEventListener;



This tells us that the AppInsight SDK behaves differently on different .Net frameworks. It also tells us that there are two http listening mechanisms. Let's start with the HttpDesktopDiagnosticSourceListener class and come back to the other class if required. Once we locate that class, we can see its constructor, shown below.

        internal HttpDesktopDiagnosticSourceListener(DesktopDiagnosticSourceHttpProcessing httpProcessing, ApplicationInsightsUrlFilter applicationInsightsUrlFilter)
        {
            this.httpDesktopProcessing = httpProcessing;
            this.subscribeHelper = new HttpDesktopDiagnosticSourceSubscriber(this, applicationInsightsUrlFilter);
            this.requestFetcherRequestEvent = new PropertyFetcher("Request");
            this.requestFetcherResponseEvent = new PropertyFetcher("Request");
            this.responseFetcher = new PropertyFetcher("Response");

            this.requestFetcherResponseExEvent = new PropertyFetcher("Request");
            this.responseExStatusFetcher = new PropertyFetcher("StatusCode");
            this.responseExHeadersFetcher = new PropertyFetcher("Headers");
        }


Now focus on the subscribeHelper, which is initialized with an HttpDesktopDiagnosticSourceSubscriber instance. What is in there?

        internal HttpDesktopDiagnosticSourceSubscriber(
            HttpDesktopDiagnosticSourceListener parent,
            ApplicationInsightsUrlFilter applicationInsightsUrlFilter)
        {
            this.parent = parent;
            this.applicationInsightsUrlFilter = applicationInsightsUrlFilter;
            this.allListenersSubscription = DiagnosticListener.AllListeners.Subscribe(this);
        }

It is as simple as subscribing to DiagnosticListener.AllListeners.

Redirecting the analysis to DiagnosticListener

Now we have a big clue. Let's understand what DiagnosticListener is and what it can listen to. Can it listen to outgoing http traffic?

The answer is yes, on the .Net versions that support it; in older .Net versions we cannot use this mechanism to capture http traffic. Sample code for hooking into http traffic can be built the same way the SDK does it: implement IObserver<DiagnosticListener> and subscribe it to DiagnosticListener.AllListeners.


More details

If anybody is interested in following the topic, below are some links.

AppInsight SDK changed their model to capture


The above analysis was done by reading the code and recreating a sample from the code logic. It might not work with other source repos, where the names of classes or functions may not resemble what they do internally. It is easy to debug the source to find out what is happening.

Tuesday, July 3, 2018

Azure @ Enterprise - .Net AppInsight version and dependency correlation

AppInsight Correlation

When we develop for Azure, normally there will be a lot of services interacting together to complete one meaningful business function. Business functions such as loading a web page or a queued operation may have a lot of dependencies which are expected to work together to succeed. Traditional logging frameworks tend to log only up to the boundary of a process, so when troubleshooting we may not be able to correlate with what is happening in the next process. We can do that by adding some custom code, but AppInsight brings it out of the box. Out of the box means the internal data structure supports it and most of the SDKs support it too. More details on end-to-end correlation can be found in my previous post.

Dependency correlation

When we enable AppInsight, we want all the events happening inside to be correlated together. One major item is dependency telemetry. If we use an AppInsight SDK below version 2.4, outgoing http requests made via HttpClient do not get correlated. At the same time, if we had used WebClient, the call gets correlated as a dependency.


The problem is that when HttpClient calls are not logged as dependencies, our troubleshooting may go wrong or take more time. One possible reason why the AppInsight SDK is not able to hook into the HttpClient class is that AppInsight below version 2.4 does not know about HttpClient. HttpClient is relatively new in .Net's list of APIs used to make http calls.

It would be interesting to identify how the AppInsight SDK subscribes to the WebClient and HttpClient classes so that it gets notified when there are outbound requests. Something for a new post.


The better solution is to upgrade the AppInsight SDK to the latest version and use HttpClient, which is recommended for modern async programming. If the AppInsight SDK change is difficult, change the HttpClient usage to WebClient.

Happy Coding...

Tuesday, June 26, 2018

Azure @ Enterprise - Finding the usage of vNet and Subnet IPs

There is a common myth that networking is easy in a Cloud environment. At least for Azure, it is not true. If we are in an Enterprise and want to implement security at the networking level, we have to deal with vNets, subnets, and their associated rule mechanisms such as NSGs, and much more. If it is a small deployment, there is less confusion about the vNet, the subnets inside it, how many IPs are used and free, etc. Even re-balancing subnets is easy.

But that may not be the situation in an Enterprise, where many systems or departments share one subscription or the same networking infrastructure. Things may go out of control and end up in a situation where there are no more IPs or subnets for new applications.

The first challenge is to identify the usage of the current vNets and the subnets inside them. We can get the details from the Azure portal, but it is difficult if we want to consolidate them into one view to take action.

Below is a simple script to list the subnets inside a particular vNet, how many IPs are possible, and how many are used.

Get-AzureRmVirtualNetwork -Name <Name of vNet> -ResourceGroupName <Name of vNet's RG> `
| Get-AzureRmVirtualNetworkSubnetConfig `
| Sort-Object -Property Name `
| Select-Object -Property Name, `
                          AddressPrefix, `
                          @{Name='Available IPs';Expression={[Math]::Pow(2, 32 - $_.AddressPrefix.Split('/')[1])}}, `
                          @{Name='Used IPs';Expression={$_.IpConfigurations.Count}}

Please note that the AddressPrefix is shown in CIDR notation, while the 'Available IPs' column converts it to an integer count of all possible IPs in the block. If we are familiar with networking, we can easily work out how many IPs are actually available after excluding the IPs Azure reserves, what the start IP and end IP are, etc. There are also plenty of hosted tools available to interpret CIDR notation.
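As a quick illustration of the arithmetic behind the 'Available IPs' column, and of the adjustment for the five addresses Azure reserves in every subnet (network, broadcast, gateway, and two DNS addresses), here is a small Python sketch (the function name is mine):

```python
import ipaddress

def usable_azure_ips(cidr: str) -> int:
    # 2^(32 - prefix length) addresses in the block, minus the 5 Azure reserves
    return ipaddress.ip_network(cidr).num_addresses - 5

print(usable_azure_ips("10.0.1.0/24"))  # 256 - 5 = 251
print(usable_azure_ips("10.0.2.0/27"))  # 32 - 5 = 27
```

This matches the [Math]::Pow(2, 32 - prefix) term in the script, just with the Azure reservation subtracted.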

Once we know there are issues, such as fragmented vNets, we have to think about solutions. We could suggest changing the policy to allot a vNet per department, system, or application. There are trade-offs in both approaches. If we allocate a big vNet to a department and it does not have enough applications, those addresses go unused. Also, every department then needs experts to manage its network. The decision has to be made case by case.

Happy Networking...

Tuesday, June 19, 2018

Azure @ Enterprise - HDInsight name availability check in portal

Whether it is good or not, Azure HDInsight needs a name that is unique across Azure to create a new instance. This leads to the requirement of a name availability check. There are multiple mechanisms, such as calling an API or checking the URL https://<proposed hdi name>.azurehdinsight.net for existence.
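The URL-existence mechanism can be approximated with a DNS lookup. Below is a rough Python sketch; the function names are mine, and a DNS hit is only a heuristic, not the authoritative answer from the availability API:

```python
import socket

def candidate_host(name: str) -> str:
    # The public endpoint a cluster named `name` would get
    return f"{name}.azurehdinsight.net"

def hdinsight_name_taken(name: str) -> bool:
    # If the host resolves in DNS, some cluster already owns the name
    try:
        socket.gethostbyname(candidate_host(name))
        return True
    except socket.gaierror:
        return False
```

A False here only means the name did not resolve at lookup time; the API check remains the reliable route.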

Let's see how the Azure portal handles this check.

Azure Portal using CheckNameAvailability API for HDInsight cluster

The best friend here is the F12 browser tools, to check how a web app like the Azure Portal works. It is clear that it is trying to check the name using the API partially documented below.


But the above article only talks about checking the name using the Microsoft.Cdn provider. Checking name availability for HDInsight seems undocumented. Below is the URL format the portal uses for doing so.

POST https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.HDInsight/checkNameAvailability

Why is the API endpoint in West US while the resource is to be created in East US?

This is the magic of the portal. It seems the portal is hosted in West US and thus uses the API endpoints there. When the screenshot was taken, the portal was opened from New Jersey, USA, without any VPN, and the ISP seems to be from East US itself. So there is no chance Azure redirected the client to West US by recognizing its location.

Why is there an Http 504 in the screenshot?

This story seems related to an outage that happened last Wednesday (13 Jun 2018) in the South Central US Azure region. During that time, if we tried to create an HDInsight cluster, it would show that the name was not available.

Under the hood, the portal tries to reach the West US endpoint to check the name availability, and it errors out with a gateway timeout, possibly because internally it is not able to contact the South Central region. When the timeout happens, the portal concludes the name is not available and displays the message. What a conclusion!

Ideally it should have understood the http 504 and acted accordingly. As per Azure, or any cloud design philosophy, failures are expected. Why not accept that the failure occurred in the portal itself?

As mentioned, the issue seems related to the outage in South Central US, but there is no proof of causation; only the time ranges match. A screenshot of the outage details is below.

It was fun to debug this while sitting near a Microsoft ADM. The same has been communicated to Microsoft via him. Hopefully they will fix it soon.

Happy debugging...

Tuesday, June 12, 2018

Azure @ Enterprise - AppInsight -> Servers -> Exception rate is not correlating with the exceptions telemetry

There could be surprising situations where the AppInsight -> Servers page shows a high rate of exceptions, but when we query the exceptions telemetry, it does not match that exception rate. It is a tough situation for those supporting the application, especially when stakeholders are interested in AppInsight reports.

If the production support teams are not capable of finding out why the difference occurs, it will come to development. So what is going on here?

Why is EXCEPTION RATE not the rate of the exceptions telemetry?

The AppInsight -> Servers page displays the "EXCEPTION RATE" graph from a performance counter, whereas the 'exceptions' telemetry only has the exceptions that are reported to AppInsight from applications, such as by using the SDK's TrackException(). If the application or framework is eating exceptions, those exceptions will affect the performance counter and the graph, but they never show up in the exceptions telemetry. We can verify this by clicking the edit button of "EXCEPTION RATE" to view its data element; it will be a performance counter. See the image below. We can cross-check by logging into the server remotely and watching the counter in the perfmon.exe utility. This behavior is documented by Microsoft here and here.

What are the sources of eaten exceptions?

The million dollar question is whether we should worry about this high exception rate. The answer is yes, if we care about the software we made. If the app is just working, is not mission critical, and has enough budget to run, we may leave it. But that is not the case with most applications, especially when hosted in Azure, where every fault translates to money. If those exceptions are slowing down requests, the overall throughput is reduced, which leads to more machines serving fewer users.

OK, we are going to fix it. How do we find the sources of the exceptions? As we have seen in the section above, perfmon.exe is a good start for finding which processes are throwing exceptions. Once we know the source, the fix depends on the individual finding. Below are some examples.
  • If we are using netNamedPipe binding to communicate between internal services, there is a high chance that many exceptions get eaten by that binding while the functionality keeps working.
  • If the system consumes other Azure services and retry is implemented, there could be exceptions; the retry might have saved the business scenario, but the performance counter is still affected by each exception.
  • If there are any background workers running in an 'On Error Resume Next' style, the exceptions encountered inside them may be eaten while still affecting the exceptions performance counter.
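The retry case above is easy to reproduce in any language. The Python sketch below (hypothetical, just to illustrate the idea) shows a call that ultimately succeeds while still having thrown, and swallowed, one exception; in .Net, that swallowed exception would still tick the exceptions performance counter:

```python
def flaky_dependency(attempt: int) -> str:
    # Simulate a dependency that fails once and then recovers
    if attempt == 0:
        raise ConnectionError("transient failure")
    return "ok"

thrown = 0   # what an 'exceptions thrown' style counter would observe
result = None
for attempt in range(2):
    try:
        result = flaky_dependency(attempt)
        break
    except ConnectionError:
        thrown += 1  # swallowed by the retry loop; the caller never sees it

print(result, thrown)  # the business scenario succeeded, yet one exception was thrown
```

The caller only ever sees "ok", which is exactly why the telemetry and the counter diverge.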
It is better to report the exception using TrackException even when we are handling it. Overall it is a tedious process, but there are chances of finding treasures that help us remove one server while supporting the same number of users.

Happy debugging.

Tuesday, May 29, 2018

ASP.Net bundling & minification via CDN Caching

It is normal for ASP.Net applications to use bundles to reduce http calls. It all works until a CDN comes into the landscape. A CDN (Content Delivery Network) uses a special approach to deliver content from worldwide distributed servers, which we can envision as serving from the edge of the network. That way, many requests get served by the CDN servers without ever coming to our web server. That is the nice theory about CDNs. So what is the issue between ASP.Net bundling and CDNs?

CDNs normally cache static assets. Out of the box, they cache well-known static assets such as html, css, and js files, and image file formats too, if we configure them. If we have bundled static assets using the ASP.Net bundling mechanism, which gives a different URL than plain .js files, those will not be seen by the CDN as static assets. Hence no caching.

This is applicable if the CDN is placed in between as a pass-through. Akamai provides such a service, where we don't need to change our code when we introduce the CDN; the CDN servers cache content as they get it from the backend web server. Normally, when we introduce a CDN, the domain has to change to serve from the CDN. That has an added benefit too, since it increases the parallel request limit compared to hitting the same origin. But if we want to integrate a CDN via a separate domain, better forget about ASP.Net bundling.

Here, let's see how to make pass-through CDNs work with ASP.Net bundling, since they don't introduce a new CDN URL.

How can we make the CDN a friend of ASP.Net bundles?

Approach 1 - Don't use bundling

Sounds easy, but if the ASP.Net application is designed as a product hosted by individual customers, not every consumer may have CDN capability, so bundling is still needed. For hosted applications, i.e., when there is only one instance of the application in the world, yes, we can try this. But still, without http/2 it would cause performance degradation, as earlier http versions open separate connections for each file or resource request. If the application uses Angular or any other SPA framework, the number of files may be large.

Approach 2 - Have CDN cache the ASP.Net bundle URL

When we bundle, ASP.Net provides a new URL in place of all the files inside the bundle; from the browser we see only one bundle URL request instead of the individual resources. That URL has a predefined format. If the CDN supports specifying the URL formats to cache via a custom RegEx, this works without much effort.

The RegEx for recognizing ASP.Net bundle URL format goes as follows.


This works only if there are no other API or dynamic resource URLs ending with ?v={44 letter value}.
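The original regex is not reproduced here, but as an illustration only, a pattern along these lines (assuming the token after ?v= is a 44-character URL-safe value, per the note above) can be tried in Python:

```python
import re

# Hypothetical pattern: any path ending in ?v= plus a 44-character URL-safe token
BUNDLE_URL = re.compile(r"\?v=[A-Za-z0-9_-]{44}$")

token = "A" * 44  # stand-in for the real hash ASP.Net generates
assert BUNDLE_URL.search(f"/bundles/scripts?v={token}") is not None
assert BUNDLE_URL.search("/api/data?v=123") is None
```

The exact character class and anchoring would need to be validated against the CDN's RegEx dialect and the application's real bundle URLs.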

Approach 3 - Embrace SPA with JS modules & forget server side rendering 

Instead of the server-side rendering done by ASP.Net, embrace the industry trend of SPA (Single Page Application) with a JavaScript module system such as Webpack. Webpack can produce static bundles which can be referred to in the application as normal files.


If there is enough budget and time, embrace SPA; else use the RegEx and stay with ASP.Net bundling and the CDN.

I don't have any association with Akamai other than using them in projects. This post is not intended to promote any of their products.