Tuesday, October 31, 2017

Exposing Parquet file to SQL 2016 as well as Hadoop (Java/Scala)

This is just an architecture post explaining how a Parquet file can be exposed to a SQL Server 2016 database via PolyBase while other applications access the same file directly. The other applications can be anything, such as data analytics code running in a Hadoop cluster.

This kind of integration is mainly needed when we already have a transactional database such as SQL Server and need to analyze its data. We can either schedule data movement using ETL technologies, or use PolyBase to move data from a normal internal table to an external PolyBase table backed by a Parquet file. If the solution is in Azure, the Parquet file can live somewhere in storage. Once the data is in Parquet format, the analytics algorithms can read it directly. Parquet is mentioned here because of its familiarity in the analytics community.

Such an architecture is shown below.
Since the architecture may change over time, a LucidChart diagram is embedded. Please comment if it is not working. Thanks to LucidChart for their freemium model.

Implementation details such as code snippets are better shared in a separate post.

Tuesday, October 24, 2017

Caution JavaScript Ahead - setTimeout() or setInterval() doesn't make JavaScript multi-threaded

Anyone born into JavaScript knows what it means that JavaScript is single-threaded. Developers coming from other languages also know that JavaScript is single-threaded. But some APIs in JavaScript lead them to think that it is multi-threaded. Below is such a situation.

setTimeout(() => {
  console.log('Timeout hit and time is ' + new Date());
}, 1000);
console.log('setTimeout at ' + new Date());

There could be any number of reasons to execute code after a specified time: an initialization delay, rendering time, session timeout handling, etc. But is this going to solve the problem by executing the code exactly after 1 second (1000 ms)?

Any real application will have more code than these 2 lines. Suppose someone else wrote the code below, which executes right after we register the handler via setTimeout().

setTimeout(() => {
  console.log('Timeout hit and time is ' + new Date());
}, 1000);
console.log('setTimeout at ' + new Date());
problemMaker();

function problemMaker() {
  // Any synchronous AJAX call, or any code that runs for a long time.
  var url = 'https://httpbin.org/delay/5';
  var request = new XMLHttpRequest();
  request.open('GET', url, false);
  request.send();
  document.writeln(request.responseText.length); 
}

Does this ensure that the callback gets executed after 1 second? Native JavaScript developers can immediately identify the issue. Others may think it will work. Let's see a test run result in the console.

setTimeout at Tue Oct 24 2017 19:46:14 GMT-0400 (Eastern Daylight Time)
Timeout hit and time is Tue Oct 24 2017 19:46:29 GMT-0400 (Eastern Daylight Time)

Yes, JavaScript is single-threaded, whatever we do with the setTimeout or setInterval functions. The callback can only run once the currently executing code has finished, so the delay we pass is a minimum, not a guarantee. Better not to trust them on when exactly they are going to execute. Code written like this may work on a development machine and fail in higher environments such as business testing, staging or production, which makes it a highly inconsistent issue. Let's avoid saying "It works on my machine".
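
The delay can also be measured without a network call. Below is a minimal sketch, assuming a busy-wait loop as a stand-in for the synchronous XMLHttpRequest; the names blockFor and measureTimerDelay are made up for this illustration and are not part of any API.

```javascript
// Blocks the single JavaScript thread for roughly `ms` milliseconds.
// Stand-in (assumption) for the synchronous AJAX call in problemMaker().
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy-wait on purpose
  }
}

// Resolves with the number of milliseconds that actually passed between
// registering the timer and the callback running.
function measureTimerDelay(requestedMs, blockMs) {
  return new Promise((resolve) => {
    const registeredAt = Date.now();
    setTimeout(() => resolve(Date.now() - registeredAt), requestedMs);
    blockFor(blockMs); // the callback cannot run until this returns
  });
}

measureTimerDelay(100, 500).then((actualMs) => {
  console.log('Requested 100 ms, callback ran after ' + actualMs + ' ms');
});
```

In this sketch the reported time should never be less than the blocking time, however small the requested delay is.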

Sample code located at https://plnkr.co/edit/uexi2U

Tuesday, October 17, 2017

Running multiple instances of AzCopy.exe command

AzCopy.exe is really an amazing tool for data transfer. But if we run multiple instances of AzCopy at the same time, we may get the error below.

AzCopy Command - AzCopy /Source:c:\temp\source /Dest:https://<storage account>.blob.core.windows.net/test /DestSAS:"<SAS>" /pattern:"" /s

An error occurred while reading the restart journal from "C:\Users\<user name>\AppData\Local\Microsoft\Azure\AzCopy". Detailed error: The process cannot access the file 'C:\Users\<user name>\AppData\Local\Microsoft\Azure\AzCopy\AzCopyCheckpoint.jnl' because it is being used by another process.

The error is pretty clear. AzCopy keeps a journal file for its resume functionality. If we don't specify the journal file location in the command, it uses the default location, and when a second AzCopy instance starts it cannot read the journal file because the first instance is still using it.

The fix is to specify a location for the .jnl file using the /z: switch. The AzCopy command goes as follows:
AzCopy /Source:c:\temp\source /Dest:https://<storage account>.blob.core.windows.net/test /DestSAS:"<SAS>" /pattern:"" /s /z:<unique folder for azcopy command>

If we run AzCopy from a command window, the error is easy to spot. But if AzCopy is invoked in parallel from applications (PowerShell or .Net), it is difficult to notice, because we might have suppressed the confirmation prompts using /y. AzCopy has a /v: switch which redirects verbose logs to a file. That helps with troubleshooting.
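
When AzCopy is launched in parallel from an application, each invocation can be given its own journal folder and log file. Below is a minimal sketch in JavaScript (Node.js) that only builds the argument list and does not run AzCopy. The function name buildAzCopyArgs, the jobId parameter and the folder layout under the temp directory are assumptions for illustration; the switches are the ones used in this post.

```javascript
const os = require('os');
const path = require('path');

// Hypothetical helper: builds AzCopy arguments so that every parallel
// invocation gets its own journal folder (/Z:) and verbose log file (/V:).
function buildAzCopyArgs(source, dest, destSas, jobId) {
  // Assumed layout: one working folder per job under the temp directory.
  const workDir = path.join(os.tmpdir(), 'azcopy-jobs', String(jobId));
  return [
    '/Source:' + source,
    '/Dest:' + dest,
    '/DestSAS:' + destSas,
    '/S',
    '/Y',
    '/Z:' + path.join(workDir, 'journal'),
    '/V:' + path.join(workDir, 'azcopy.log'),
  ];
}

// Two jobs that could run side by side without sharing a journal file.
const jobs = ['job-1', 'job-2'].map((id) =>
  buildAzCopyArgs('c:\\temp\\source',
    'https://<storage account>.blob.core.windows.net/test', '<SAS>', id));
jobs.forEach((args) => console.log('AzCopy ' + args.join(' ')));
```

Each argument array could then be handed to something like child_process.spawn('AzCopy', args); the important part is only that no two concurrent jobs share the same /Z: folder.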

Tuesday, October 3, 2017

Using .Net default parameter values is trouble, especially for APIs

Though features exist, not all of them are meant to be used. JavaScript at least has a book named "JavaScript: The Good Parts", but other languages don't have one. Let's see one scenario in C# .Net.

Long long ago there was an API exposed to clients.

        class MyAPI
        {
            public void APIMethod(string s, int i = 10)
            {
                Console.WriteLine($"Inside APIMethod with i = {i}");
            }
        }

        // Client code calling the API
        internal void Test()
        {
            MyAPI apiClient = new MyAPI();
            apiClient.APIMethod("hi");
        }

Clients were happy using it. Later, someone joined the API team and overloaded the method as follows, thinking that keeping the same name would help clients discover the API.

        class MyAPI
        {
            public void APIMethod(string s, int i = 10)
            {
                Console.WriteLine($"Inside APIMethod with i = {i}");
            }
            public void APIMethod(string s)
            {
                Console.WriteLine($"Inside APIMethod");
            }
        }

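Clients happily adopted the new version of the API. But soon they started to feel that their calls were not working as expected. They escalated the issue, and the developers spent hours and days before they finally figured out what went wrong: once recompiled against the new version, a call such as APIMethod("hi") binds to the new single-parameter overload, because C# overload resolution prefers a method that needs no default values filled in. The original method with i = 10 was silently no longer being called. They corrected the code as follows.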

        class MyAPI
        {
            public void APIMethod(string s, int i = 10)
            {
                Console.WriteLine($"Inside APIMethod with i = {i}");
            }
            public void NewAPIMethod(string s)
            {
                Console.WriteLine($"Inside NewAPIMethod");
            }
        }

Moral of the story

Do not use a feature only because it is available.