Tuesday, October 29, 2013

Hadoop HDInsight Installation issues

I am closely watching movements in the big data for around 2 years. I tried to learn more about processing huge data set when I was writing a series about artificial intelligence. As any other developer, at one point I started feeling that we need more power to process big data sets which cannot be given by one machine. Even if the machine has the power, it cannot hold that much large data set in it's hard drive. It leads to thoughts of distributed computing. The problem is that, it is known from the college days but no idea how to implement or how to see real distributed processing in action?

The research often ends up in 'Google search indexing and how they process huge volume of internet data'. Since their techniques are secret, the only one way is Hadoop which is again a technique unveiled by them.

I tried installing original Apache Hadoop but not able to successfully execute a Map Reduce task in it. As I am working in a delivery focused company, the time available for the research activities is very less. Months later, I saw the Microsoft implementation of Hadoop in Azure. There was again a bottle neck. Even though it was free, it needs a valid credit card number. Due to some self enforced rules, I decided not to use credit cards. Time passed away, Microsoft released HDInsight to windows servers and desktops. I became happy. But there was a simple limitation. It requires Win7 64bit machine to install HDInsight. I was using 32bit Vista at that time. To get Win7 64 only for me, in the off shore team was a big task. After getting approvals after approvals, finally I got my machine upgraded to Win7 64bit. Immediately I installed HDInsight. I checked all the Apache services, dashboard, HDFS & Hadoop command line.

Again the busy project release time came. One project after another. The installation remained there. Recently things settled. The projects I am handling are in clam state and its the time for me to explore.

Hadoop installation issues

After this long time, I tried to open the Hadoop dashboard web site. It simply says "Internal Server Error". I checked the app pool and the corresponding HadoopAppPool is not running. When I tried to run it, I got an error in the event viewer related to authentication of .\Hadoop user which is running the app pool. I thought that the password might be expired. There is no chance for change in the password. Sometimes if we set the credentials again it may work. But what is the password of local Hadoop user?

I looked into the services.msc for all Apache services and could see all are stopped. I tried to start but failed due to authentication issues. All the services are by default installed to run using .\hadoop service account which is created when we install hadoop. Finding the password of local Hadoop user became essential now. Below is the error which I could see in event viewer.


Windows could not start the Apache Hadoop hiveserver2 service on Local Computer.

Error 1069: The service did not start due to a logon failure.

Finding the password of local Hadoop user

The search for getting the password of hadoop user ended in a file located at C:\Hadoop\singlenodecreds.xml. It shows an encrypted value instead of actual password. So how to decrypt it?


$xml = Import-Clixml -Path "C:\Hadoop\singlenodecreds.xml"
 
$psCred = New-Object System.Management.Automation.PSCredential $import.Username, ($xml.Password | ConvertTo-SecureString)
 
Write-Host $psCred.GetNetworkCredential().Password

The above PowerShell will do it. If it can be retrieved in this way, why Microsoft encrypted it in the file? Only HDInsight team knows? May be when its installed in multiple machines, it makes sense.

Getting the password solved the problem? Absolutely no. I was able to set the credentials but got the same error

Reinstalling Hadoop

Now its the time to reinstall Hadoop. Before that I just googled for similar issues and most of them resolved the issue by either formatting the machine or installing it in a fresh virtual machine. I uninstalled and installed hadoop using Microsoft Web Platform Installer. This time all the services are running. But I cannot see the Hadoop dashboard and WebAPI applications in IIS. 

Some people says that, if the hadoop dashboard is not coming we can install it by executing the below PowerShell script.


cd C:\HadoopFeaturePackSetup\HadoopFeaturePackSetupTools
.\winpkg.ps1 ..\Packages\HadoopWebApi-winpkg.zip install -CredentialFilePath c:\Hadoop\Singenodecreds.xml
.\winpkg.ps1 ..\Packages\HadoopDashboard-winpkg.zip install -CredentialFilePath c:\Hadoop\Singenodecreds.xml

In my case I was not able to find both HadoopWebApi-winpkg.zip & HadoopDashboard-winpkg.zip in C:\HadoopFeaturePackSetup\HadoopFeaturePackSetupTools folder. Is my company firewall blocking these files from getting downloaded? Seems this is the time to get a deep dive into Web Platform Installer.

The internals of web platform installer needs a full post.So let me summarize. I could see that the web platform installer is requesting http://www.microsoft.com/web/webpi/4.5/toolsproductlist.xml to get the details about HDInsight installer. The Id of HDInsight in the xml file is 'HDINSIGHT-PREVIEW'. When I opened the xml file I could see that the size of installer is 30.300 MB, but when I manually download the file from http://go.microsoft.com/fwlink/?LinkId=269735 (Its also mentioned in xml)its around 3.800MB .May be the downloaded file MSSingleNodeInstaller.msi is corrupted or there is some firewall issue. But if firewall is playing in between I cannot touch the file itself.. This is something problematic and is not under my control. So I posted in MSDN and Stackoverflow. Then started refreshing :)

Do I really need working Hadoop dashboard to work with Hadoop

Now answers for couple of hours. May be most of the Hadoop experts are in the eastern side of the earth and as I am working from India, I need to wait till they get up. But time is money. I thought of taking a chance. Why can't try executing a Hadoop Job?
The reasons behind this is, 
  1. Microsoft created HDinsight on top of Apache Hadoop.
  2. Apache Hadoop is working. At least all the services are running.
  3. Dashboard is created in ASP.Net which may be just a portal for showing the results. I can see the results by using command line.There are more developers out there who are using Hadoop without ASP.Net dashboard.
I started coding my hadoop job using Visual studio and executed it.Surprisingly, I didn't get any errors which says 'I occurred due to lack of Hadoop dashboard or HadoopWeb API". But obviously there were errors which I can explain in next post.

What were the issues

The root issue is 'I didn't read the HDInsight release notes'
  • The password of Hadoop user will expire and the release note has mentioned the same with a script to correct it.
  • The dashboard is removed from latest installation which is done via web platform installer.This also mentioned in the release notes.If we need to get details of result and status of map reduce jobs we need to use the PowerShell commands.
  • The size of the MSSingleNodeInstaller.msi file is correct in 4.6 version of WebPI http://www.microsoft.com/web/webpi/4.6/toolsproductlist.xml
More links to installation

Monday, October 21, 2013

ASP.Net MVC 4 Error handling in real time applications

Whenever we google for ASP.Net MVC exception handling techniques, we get a bunch of links but most of those explains about different techniques and finally ask us to decide one method based on our requirement. Sometimes we will end up in a more confused state. This article is to help a real time ASP.Net MVC developer to decide his logging mechanism.

Aim of error handling

Below are the things we should aim when we develop a error handling framework.

Never expose the exception details

We should never return the details of exception such as call stack to the end user. If we expose, it is considered as a security hole for hackers to understand the working of our system.

Log all the exception details

No system in this world is perfect. We need to accept the fact that, there may be exceptions. How we are dealing with those exceptions is important. To fix bugs and improve our system, it is required to log all the exceptions occurred in the system. The developers can therefore analyse the logs and fix.

Types of errors

Below are the different types of error conditions, we can expect in our ASP.Net MVC application.

Exceptions in ASP.Net MVC pages

This is the most common type of exceptions we need to handle. There may be exceptions when we execute the controller code which needs to be handled properly.

How to handle MVC page exceptions

There are 2 methods to handle this. Either we need to handle the exception at MVC framework level using the HandleError attribute on the controller classes or handle the exception at application level by global.asax :Application_Error(). 

I would suggest going for the error handling at the application level. It will capture all the error including routing issues. How that is useful in case of wrong URLs can be seen in next section.

One challenge in handling the exceptions at application level is that we cannot show the MVC Error page directly as we are out of MVC framework. For that we need to follow the below technique in the Application_Error method.

        protected void Application_Error(object sender, EventArgs e)
        {
            var httpContext = ((MvcApplication)sender).Context;
            new MVCApplicationExceptionHandler().Handle(httpContext,Server);
        }

    public class MVCApplicationExceptionHandler
    {
        internal  void Handle(HttpContext httpContext, HttpServerUtility server)
        {
            var currentRouteData = RouteTable.Routes.GetRouteData(new HttpContextWrapper(httpContext));
            string currentControllerName = GetCurrentControllerName(currentRouteData);
            string currentActionName = GetCurrentActionName(currentRouteData);
 
            var ex = server.GetLastError();
            new ExceptionLogger().Log(ex);
            RedirectToErrorPage(httpContext, ex,currentControllerName,currentActionName);
        }
        private void RedirectToErrorPage(HttpContext httpContext,Exception ex, string currentControllerName, string currentActionName)
        {
            //Clearing httpContext
            httpContext.ClearError();
            httpContext.Response.Clear();
            httpContext.Response.StatusCode = ex is HttpException ? ((HttpException)ex).GetHttpCode() : 500;
            httpContext.Response.TrySkipIisCustomErrors = true;
 
            //Setting values for new route
            string errorAction = GetErrorActionNameFrom(ex);
            var routeData = new RouteData();
            routeData.Values["controller"] = "Error";
            routeData.Values["action"] = errorAction;
            
            //Fire the ErrorController
            var controller = new ErrorController();
            controller.ViewData.Model = new HandleErrorInfo(ex, currentControllerName, currentActionName);
            ((IController)controller).Execute(new RequestContext(new HttpContextWrapper(httpContext), routeData));
        }
        private string GetErrorActionNameFrom(Exception ex)
        {
            var action = "Index";
            var httpEx = ex as HttpException;
            if (httpEx !=null)
            {
                switch (httpEx.GetHttpCode())
                {
                    case 404:
                        action = "NotFound";
                        break;
                    default:
                        action = "Index";
                        break;
                }
            }
            return action;
        }
        private string GetCurrentActionName(RouteData currentRouteData)
        {
            string actionName = string.Empty;
            if (currentRouteData != null &&
                currentRouteData.Values["action"] != null && !String.IsNullOrEmpty(currentRouteData.Values["action"].ToString()))
                {
                    actionName = currentRouteData.Values["action"].ToString();
                }
            return actionName;
        }
        private string GetCurrentControllerName(RouteData currentRouteData)
        {
            string controllerName = string.Empty;
            if (currentRouteData != null &&
                currentRouteData.Values["controller"] != null && !String.IsNullOrEmpty(currentRouteData.Values["controller"].ToString()))
                {
                    controllerName = currentRouteData.Values["controller"].ToString();
                }
            return controllerName;
        }
    }

The ErrorController and it's View can be simple. ErrorController.Index() action which will be invoked when an exception occurs inside the controller.

Requests to wrong ASP.Net MVC page URLs which cannot be routed

Sometimes there will be requests to wrong URLs. We need to respond with proper message instead of showing the white screen or ASP.Net generated error page.

eg: www.mycompany.com/Emplyee/1 - The spelling mistake in the word 'employee' should be handled by our application.

How to handle wrong MVC URL requests

The above solution will work for wrong URLs as well. In this case the NotFound action will be invoked.See the attached sample for more details.

Exceptions in AJAX calls

AJAX calls may also raise exception in the controller. Those needs to be handled and the required information needs to be passed to the client side to inform the user about his request.

How to handle AJAX exceptions

Instead of error page we should return a JSON response which tells that there is an error happened. Lets see one example below

<script type="text/javascript">
    function divide() {
        var n1 = $("#n1").val();
        var n2 = $("#n2").val();
        //Lets not bother about the data which is available in n1 and n2. Assume that those are numbers
        $.ajax({
            type: "GET",
            cache: false,
            url: "../Calculator/Divide",
            data: { "n1": n1, "n2": n2 },    // multiple data sent using ajax
            success: function (html) {
                if (html.IsSuccess) {
                    $("#res").val(html.Result);
                }
                else {
                    alert(html.Result);
                }
            },
            error: function (a, b, c) {
                alert("Some unexpected error happened");
            }
        });
    }
</script>

    public class CalculatorController : Controller
    {
        public ActionResult Divide(int n1,int n2)
        {
            try
            {
                return new JsonResult() {JsonRequestBehavior = JsonRequestBehavior.AllowGet,Data = new  { IsSuccess = true, Result = n1 / n2 } };
            }
            catch (DivideByZeroException ex )
            {
                return new JsonResult() {JsonRequestBehavior = JsonRequestBehavior.AllowGet, Data = new { IsSuccess = false, Result = "You cannot divide by Zero" } };
            }
        }
    }

I don't think it needs any explanation. Even in case of exception its returning the JSON response. But before processing at the client side, it should check for IsSuccess property. If any other exception occurs the control goes to application level which we can see in next section.

The catch blocks should not be exception eating blocks. Use the catch blocks which we are sure that we can handle.

AJAX Request to wrong URLs which cannot be routed

In case of an AJAX request to wrong url either, we should return JSON response with error details or return "Not found status". Here we are redirecting to  ErrorController.APINotFound() action from there we are returning JSON result. The response http code will still be error. Only advantage here is that the clients will get details of the error or what went wrong in a secure manner. This is the better method which I could see instead of throwing http errors alone.

How to handle wrong AJAX requests

For this we had to modify one of the method in earlier code which is used to retrieve the Error action name. Now we need to look into the request header for the origin. If its originated from AJAX the value will be "XMLHttpRequest".

        private string GetErrorActionNameFrom(Exception ex, HttpContext context)
        {
            var action = "Index";
            if (!string.IsNullOrWhiteSpace( context.Request.Headers["X-Requested-With"]) &&
                context.Request.Headers["X-Requested-With"].Equals("XMLHttpRequest"))
            {
                action = "APINotFound";
            }
            else
            {
                var httpEx = ex as HttpException;
                if (httpEx != null)
                {
                    switch (httpEx.GetHttpCode())
                    {
                        case 404:
                            action = "NotFound";
                            break;
                        default:
                            action = "Index";
                            break;
                    }
                }
            }
            return action;
        }


        public ActionResult APINotFound()
        {
            return new JsonResult() { JsonRequestBehavior = JsonRequestBehavior.AllowGet, 
                Data = new { IsSuccess=false,Result="Not able to process your API request due to wrong url or unexpected errors"} };
        }

    function divide() {
        var n1 = $("#n1").val();
        var n2 = $("#n2").val();
        //Lets not bother about the data which is available in n1 and n2. Assume that those are numbers
        $.ajax({
            type: "GET",
            cache: false,
            url: "../Calculator/ivide",
            data: { "n1": n1, "n2": n2 },    // multiple data sent using ajax
            success: function (html) {
                if (html.IsSuccess) {
                    $("#res").val(html.Result);
                }
                else {
                    alert(html.Result);
                }
            },
            error: function (a, b, c) {
                alert(JSON.parse(a.responseText).Result);
            }
        });
    }
</script>

Sample can be download from sky drive. Now we can say out ASP.Net MVC application has an exception handling framework.

Tuesday, October 15, 2013

Reading empty MS Excel cells using OpenXML sdk

When we normally read the excel cells using open XML it will omit the empty cells. This is because the open xml is not keeping the information about the empty cells in the xml. We can verify this by doing a simple exercise mentioned below

  1. Open MS Excel 2007 or higher version and create a new excel document
  2. Add some text in the cell B2
  3. Save the file. Make sure you are saving in .xlsx format
  4. Once its saved change the .xlsx extension to .zip
  5. Extract the .zip file contents
  6. You can see the xml file where its recording only the value for cell B2. A1,A2,B1 are not at all mentioned in the xml.
Lets come back to the topic which is reading the excel file using Open XML sdk. The normal procedure is to iterate through each excel Row and do second iteration using Row.Elements<Cell>() .In the above mentioned file if we do the iteration the cell count will be 1 when we iterate second row. But if our requirement is to read and convert to DataSet or to create object model from excel, this behaviour may cause issues as the blank cells of excel are not considered by Open XML. I faced the same situation and below is the one which I felt more suitable to address the problem.

        public DataTable ReadFirstCell()
        {
            DataTable dataTable = new DataTable();
            string value = string.Empty;
            try
            {
                using (SpreadsheetDocument myWorkbook = SpreadsheetDocument.Open(_filePath, true))
                {
                    var sheet = myWorkbook.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().First();
                    var worksheetPart = (WorksheetPart)myWorkbook.WorkbookPart.GetPartById(sheet.Id.Value);
                    IEnumerable<Row> rows = worksheetPart.Worksheet.GetFirstChild<SheetData>().Elements<Row>();
                    int RowIndex = 0;
 
                    foreach (var row in rows)
                    {
                        //Create the data table header row ie columns using first excel row.
                        if (RowIndex == 0)
                        {
                            CreateColumnsFromHeaderRow(row, dataTable);
                            RowIndex++;
                        }
                        else
                        {
                            //From second row of excel onwards, add data rows to data table.
                            IEnumerable<Cell> cells = GetCellsFromRowIncludingEmptyCells(row);
                            DataRow newDataRow = dataTable.NewRow();
                            int columnCount = 0;
                            foreach (Cell currentCell in cells)
                            {
                                value = GetCellValue(currentCell);
                                //There are empty headers which are not added to data table columns. So avoid those.
                                if (columnCount < dataTable.Columns.Count)
                                {
                                    newDataRow[columnCount++] = value;
                                }
                            }
                            dataTable.Rows.Add(newDataRow);
                        }
                    }
                }
            }

Below are the helper methods used
        private static IEnumerable<Cell> GetCellsFromRowIncludingEmptyCells(Row row)
        {
            int currentCount = 0;
            // row is a class level variable representing the current
            foreach (DocumentFormat.OpenXml.Spreadsheet.Cell cell in
                row.Descendants<DocumentFormat.OpenXml.Spreadsheet.Cell>())
            {
                string columnName = GetColumnName(cell.CellReference);
                int currentColumnIndex = ConvertColumnNameToNumber(columnName);
                //Return null for empty cells
                for (; currentCount < currentColumnIndex; currentCount++)
                {
                    yield return null;
                }
                yield return cell;
                currentCount++;
            }
        }

        public static string GetColumnName(string cellReference)
        {
            // Match the column name portion of the cell name.
            Regex regex = new Regex("[A-Za-z]+");
            Match match = regex.Match(cellReference);
 
            return match.Value;
        }

        public static int ConvertColumnNameToNumber(string columnName)
        {
            Regex alpha = new Regex("^[A-Z]+$");
            if (!alpha.IsMatch(columnName)) throw new ArgumentException();
 
            char[] colLetters = columnName.ToCharArray();
            Array.Reverse(colLetters);
 
            int convertedValue = 0;
            for (int i = 0; i < colLetters.Length; i++)
            {
                char letter = colLetters[i];
                int current = i == 0 ? letter - 65 : letter - 64; // ASCII 'A' = 65
                convertedValue += current * (int)Math.Pow(26, i);
            }
 
            return convertedValue;
        }

The only method remaining is GetCellValue(). It can be found in one of my previous post about OpenXML sdk.

Tuesday, October 8, 2013

Reading Excel cell with number formatting using OpenXML sdk

Background

In Excel, we can assign cell styles such as number format, date-time format, etc... If we have an excel document with that formatting, we normally would like to see the formatting when we read the same using a program. But if you are using OpenXML SDK to read excel file, it cannot be achieved directly. We need to read the style index of Excel cell object and find the corresponding CellStyle then format the CellValue based on the format code. Let's see how to achieve that using C# and .Net

Reading excel cell value with formatting

I assume that you have a basic understanding of OpenXML SDK programming. The below steps explains how to read the "A1" cell of an excel document with formatting using open xml sdk.

Step 1 - Obtain the object of Excel cell

First, create a OpenXMLExcelReader class. Accept the excel file name as a parameter via the constructor and keep in a member variable. Then start with the method ReadFirstCell()


    public class OpenXMLExcelReader
    {
        string _filePath;
        public OpenXMLExcelReader(string filePath)
        {
            _filePath = filePath;
        }
        public string ReadFirstCell()
        {
                return ReadCell( "A", 1);
        }
    }


Now, where is the ReadCell method which needs to be a generic implementation which will read any cell value by accepting cell reference? It can be defined as below


        public string ReadCell(string columnName, uint row)
        {
            string value = string.Empty;
            using (SpreadsheetDocument doc = SpreadsheetDocument.Open(_filePath, true))
            {
                var sheet = doc.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().First();
                var worksheetPart = (WorksheetPart)doc.WorkbookPart.GetPartById(sheet.Id.Value);
                Cell cell = GetCell(worksheetPart.Worksheet, columnName, row);
                if (cell != null)
                {
                    value = GetCellValue(cell);
                }
            }
            return value;
        }


GetCell() method is returning the cell object nicely. Let's see how the GetCell is implemented. You can see 2 methods which do their responsibilities neatly.


        private static Cell GetCell(Worksheet worksheet, string columnName, uint rowIndex)
        {
            Cell returnCell = null;
            Row row = GetRowFromWorksheet(worksheet, rowIndex);
            if (row != null)
            {
                returnCell= row.Elements<Cell>().Where(c => string.Compare
                       (c.CellReference.Value, columnName +
                       rowIndex, true) == 0).FirstOrDefault();
            }
            return returnCell;
        }
        private static Row GetRowFromWorksheet(Worksheet worksheet, uint rowIndex)
        {
            return worksheet.GetFirstChild<SheetData>().
              Elements<Row>().Where(r => r.RowIndex == rowIndex).FirstOrDefault();
        }

Step 2 - Reading value from excel Cell object using CellStyle

This step is a little more complex. Sometimes we can read InnerText directly. Sometimes we need to get value from a Shared string table. Sometimes using Run elements and if formatting is applied, we need to retrieve the format and apply to the text which is read from cell. Let's see how we can organize it.

        private static string GetCellValue(Cell excelCell)
        {
            string value;
            if (excelCell == null)
                return null;
            if (excelCell.DataType == null)
            {
                return GetCellValueWithoutConsideringDataType(excelCell);
            }
            value = excelCell.InnerText;
            //If none of the below cases are executed, return the innerText            
            switch (excelCell.DataType.Value)
            {
                case CellValues.SharedString:
                    value = GetCellValueByReadingFromSharedString(excelCell);
                    break;
                case CellValues.Boolean:
                    switch (value)
                    {
                        case "0": value = "false"break;
                        default: value = "true"break;
                    }
                    break;
            }
            return value;
        }


In most of the cases, if the cell formatting is applied as number the DataType will be null and it will get into the method GetCellValueWithoutConsideringDataType(). Below is the implementation.


        private static string GetCellValueWithoutConsideringDataType(Cell excelCell)
        {
            CellFormat cellFormat = GetCellFormat(excelCell);
            if (cellFormat != null)
            {
                return GetFormatedValue(excelCell, cellFormat);
            }
            else
            {
                return excelCell.InnerText;
            }
        }

If there is a cell format associated with the cell, process accordingly, else return the innerText. Below is the code to obtain CellFormat object from the cell.

        private static CellFormat GetCellFormat(Cell cell)
        {
            WorkbookPart workbookPart = GetWorkbookPartFromCell(cell);
            int styleIndex = (int)cell.StyleIndex.Value;
            CellFormat cellFormat = (CellFormat)workbookPart.WorkbookStylesPart.Stylesheet.CellFormats.ElementAt(styleIndex);
            return cellFormat;
        }
 
        private static WorkbookPart GetWorkbookPartFromCell(Cell cell)
        {
            Worksheet workSheet = cell.Ancestors<Worksheet>().FirstOrDefault();
            SpreadsheetDocument doc = workSheet.WorksheetPart.OpenXmlPackage as SpreadsheetDocument;
            return doc.WorkbookPart;
        }

Now we have cell value and cell format. If the cell format has any number format, lets apply that.

        private static string GetFormatedValue(Cell cell, CellFormat cellformat)
        {
            string value;
            if (cellformat.NumberFormatId != 0)
            {
                string format = GetWorkbookPartFromCell(cell).WorkbookStylesPart.Stylesheet.NumberingFormats.Elements<NumberingFormat>()
                    .Where(i => i.NumberFormatId.Value == cellformat.NumberFormatId.Value)
                    .First().FormatCode;
                double number = double.Parse(cell.InnerText);
                value = number.ToString(format);
            }
            else
            {
                value = cell.InnerText;
            }
            return value;
        }


This handles only Number formatting.If you want to deal with other formats you need to extend this. Reading from the shared string table can be found in my previous post about reading excel cells with formatting.