Monday, November 4, 2013

Why my HDInsight Hadoop job failed

This post assumes that the reader has read my previous post about HDInsight Hadoop installation issues. My aim was to run a simple Hadoop MapReduce task on a Windows 7 64-bit machine using HDInsight.

When I ran my first Hadoop application from Visual Studio, I got the output below in the console window, with a status of 'Job failed'.


Output folder exists.. deleting.
File dependencies to include with job:
[Auto-detected] <drive>:\Joy\Code\DotNet\HDInsight\HadoopTest\bin\Debug\HadoopTest.vshost.exe
[Auto-detected] <drive>:\Joy\Code\DotNet\HDInsight\HadoopTest\bin\Debug\HadoopTest.exe
[Auto-detected] <drive>:\Joy\Code\DotNet\HDInsight\HadoopTest\bin\Debug\Microsoft.Hadoop.MapReduce.dll
[Auto-detected] <drive>:\Joy\Code\DotNet\HDInsight\HadoopTest\bin\Debug\Microsoft.Hadoop.WebClient.dll
[Auto-detected] <drive>:\Joy\Code\DotNet\HDInsight\HadoopTest\bin\Debug\Newtonsoft.Json.dll
packageJobJar: [] [/<drive>:/Hadoop/hadoop-1.1.0-SNAPSHOT/lib/hadoop-streaming.jar] <drive>:\Users\<username>\AppData\Local\Temp\streamjob4051103269437643633.jar tmpDir=null
13/10/21 22:19:07 WARN security.ShellBasedUnixGroupsMapping: got exception trying to get groups for user <logged in user's name without domain\>
org.apache.hadoop.util.Shell$ExitCodeException: GetLocalGroupsForUser error (2221): The user name could not be found.

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:454)
        at org.apache.hadoop.util.Shell.run(Shell.java:369)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:573)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:659)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:642)
        at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUserGroups(ShellBasedUnixGroupsMapping.java:65)
        at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:41)
        at org.apache.hadoop.security.Groups.getGroups(Groups.java:79)
        at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1041)
        at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:107)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:872)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:866)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:866)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:840)
        at org.apache.hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java:917)
        at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:122)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
13/10/21 22:19:07 WARN security.UserGroupInformation: No groups available for user joyg
java.io.IOException: No groups found for user <user name>
        at org.apache.hadoop.security.Groups.getGroups(Groups.java:81)
        at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1041)
        at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:107)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:872)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:866)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:866)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:840)
        at org.apache.hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java:917)
        at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:122)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
13/10/21 22:19:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/10/21 22:19:08 WARN snappy.LoadSnappy: Snappy native library not loaded
13/10/21 22:19:08 INFO mapred.FileInputFormat: Total input paths to process : 4
13/10/21 22:19:08 INFO streaming.StreamJob: getLocalDirs(): [c:\hadoop\HDFS\mapred\local]
13/10/21 22:19:08 INFO streaming.StreamJob: Running job: job_201310212213_0001
13/10/21 22:19:08 INFO streaming.StreamJob: To kill this job, run:
13/10/21 22:19:08 INFO streaming.StreamJob: C:\Hadoop\hadoop-1.1.0-SNAPSHOT/bin/hadoop job  -Dmapred.job.tracker=localhost:50300 -kill job_201310212213_0001
13/10/21 22:19:08 INFO streaming.StreamJob: Tracking URL: http://127.0.0.1:50030/jobdetails.jsp?jobid=job_201310212213_0001
13/10/21 22:19:10 INFO streaming.StreamJob:  map 0%  reduce 0%
13/10/21 22:19:42 INFO streaming.StreamJob:  map 100%  reduce 100%
13/10/21 22:19:42 INFO streaming.StreamJob: To kill this job, run:
13/10/21 22:19:42 INFO streaming.StreamJob: C:\Hadoop\hadoop-1.1.0-SNAPSHOT/bin/hadoop job  -Dmapred.job.tracker=localhost:50300 -kill job_201310212213_0001
13/10/21 22:19:42 INFO streaming.StreamJob: Tracking URL: http://127.0.0.1:50030/jobdetails.jsp?jobid=job_201310212213_0001
13/10/21 22:19:42 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201310212213_0001_m_000000
13/10/21 22:19:42 INFO streaming.StreamJob: killJob...
Streaming Command Failed!


Did you get anything from this log? There were basically two problems:
  1. WARN - Not able to get groups
  2. ERROR - Job not successful

Error: Job not successful

Let's focus on the error first, as always. The starting point of this investigation should be the logs, and where can we find them? Here comes the job tracker. Notice the tracking URL mentioned in the output: http://127.0.0.1:50030/jobdetails.jsp?jobid=job_201310212213_0001
Browse to this page.

We can easily find a table like the one below on that page:

Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      4           0         0         0          4        2 / 0
reduce   100.00%      1           0         0         0          1        0 / 0

This table gives the details of the map and reduce tasks executed. In the last column (Failed/Killed Task Attempts) we can see a link. Navigate to that link to get more information.

Now we are entering a more detailed log page. There is a table describing attempts, task errors, etc. Our target is the last column, titled 'Logs'. Select the 'Last 4KB' link to navigate there.

This takes us to a page with a URL of the following kind:
http://<FQDN>:50060/tasklog?attemptid=attempt_201310212213_0001_m_000000_0&start=-4097

Welcome to .NET! Here we can see logs that tell us more about our .NET application; earlier we were seeing only Java call stacks. An example log is pasted below.


Unhandled Exception: Microsoft.Hadoop.MapReduce.StreamingException: The user type could not be loaded. DLL=HadoopTest.exe, Type=HadoopTest.SentenceMapper ---> System.BadImageFormatException: Could not load file or assembly 'file:///c:\hadoop\HDFS\mapred\local\taskTracker\joylocal\jobcache\job_201310290904_0001\attempt_201310290904_0001_m_000000_0\work\HadoopTest.exe' or one of its dependencies. An attempt was made to load a program with an incorrect format.
   at System.Reflection.RuntimeAssembly._nLoad(AssemblyName fileName, String codeBase, Evidence assemblySecurity, RuntimeAssembly locationHint, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean throwOnFileNotFound, Boolean forIntrospection, Boolean suppressSecurityChecks)
   at System.Reflection.RuntimeAssembly.InternalLoadAssemblyName(AssemblyName assemblyRef, Evidence assemblySecurity, RuntimeAssembly reqAssembly, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean throwOnFileNotFound, Boolean forIntrospection, Boolean suppressSecurityChecks)
   at System.Reflection.RuntimeAssembly.InternalLoadAssemblyName(AssemblyName assemblyRef, Evidence assemblySecurity, RuntimeAssembly reqAssembly, StackCrawlMark& stackMark, Boolean throwOnFileNotFound, Boolean forIntrospection, Boolean suppressSecurityChecks)
   at System.Reflection.RuntimeAssembly.InternalLoadFrom(String assemblyFile, Evidence securityEvidence, Byte[] hashValue, AssemblyHashAlgorithm hashAlgorithm, Boolean forIntrospection, Boolean suppressSecurityChecks, StackCrawlMark& stackMark)
   at System.Reflection.Assembly.LoadFrom(String assemblyFile)
   at Microsoft.Hadoop.MapReduce.MapperMain.Main()
   --- End of inner exception stack trace ---
   at Microsoft.Hadoop.MapReduce.MapperMain.Main()
   at Microsoft.HdInsight.MapDriver.Program.Main(String[] args)
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code -532462766
   at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)

This tells us that when the Apache Hadoop framework, which is written in Java, tried to invoke our MapReduce tasks, which are written in C#, there was an exception in our .NET code related to the executable's image format.

The easiest way to solve this is to google the error message. As usual, the first Google result pointed to stackoverflow.com, and it solved my problem. Below are the possible causes of the problem:

  1. The program is not running as admin.
  2. The executables are missing (check the paths in the log files).
  3. The .NET Framework version the executable requires is not working.
  4. The build target is x86 instead of x64.

Lesson 1

The .NET assembly we create for Hadoop MapReduce tasks should target the 64-bit environment.
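In a C# project this usually means changing the platform target in the project properties (Build -> Platform target -> x64), or directly in the .csproj file. The fragment below is only an illustration; the exact property group and its conditions will vary with your project:

```xml
<!-- Illustrative .csproj fragment (not from the original project):
     forces the assembly to target x64 so the 64-bit Hadoop task
     runner can load it without a BadImageFormatException. -->
<PropertyGroup>
  <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
```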

WARN - got exception trying to get groups for user ****

This took a lot of time. As the type WARN says, it was not causing any issues for the execution of the Hadoop MapReduce jobs, but I still felt it was something interesting to solve. The first aim was to understand the warning. Basically it says that Hadoop is not able to retrieve the Windows groups of the user who is running the program. As I had logged in to this system with my company login (company domain\username), the log shows my user name without the domain name prefix.

Attempt 1 - Editing dfs.web.ui

I tried googling the error message. The links said that I need to edit hdfs-site.xml, which is present at <install drive>:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\hdfs-site.xml.

I modified the dfs.web.ui section in that file, but no luck. Actually, that section itself was not there, so I added it. This file is something to be studied later.

Attempt 2 - Understanding the Hadoop source code

Now we are out of options. We need to go back to the old days, when there was no internet to google things. I actually programmed two of my academic projects without googling, as the internet was not common at that time. This will be really hard to digest for people who started coding with Google at hand.

If we are true programmers and confident in our algorithmic skills, understanding source code will not be a problem, even if we are seeing the language for the first time. This case is even simpler: Hadoop is open source, written in Java, and anybody can see the source code.

As we can see in the stack trace, there is a Shell class which executes the command that retrieves the groups of the user. What is the command, and where does it come from? The only way to find out is to browse the source code of the org.apache.hadoop.util.Shell class.

Here is the small tricky part, or the point where people who don't know programming principles need to spend more time. The method name is runCommand(), which doesn't tell us anything about users or groups. The source code is written by great programmers who follow the SOLID principles, where each class and method has a single responsibility. The command used to get the user's groups comes from somewhere outside this method; more precisely, from one of the earlier methods in the call stack.

When we manually inspect the call stack, we can see that the logic for retrieving the user's groups is most likely present in the ShellBasedUnixGroupsMapping class, in its getGroups() method. If we dig further down the code path, we can see that it calls the Shell.getGroupsForUserCommand() method to get the command. Let's see the code of the getGroupsForUserCommand() method:

public static String[] getGroupsForUserCommand(final String user) {
   //'groups username' command return is non-consistent across different unixes
   return (WINDOWS)? new String[] { WINUTILS, "groups", "-F", "\"" + user + "\""}
                   : new String [] {"bash", "-c", "id -Gn " + user};
}
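To make the branch easier to see, here is a small standalone rendition of the same logic. This is an illustration, not the actual Hadoop class: the windows flag and the "winutils" literal are stand-ins for Hadoop's Shell.WINDOWS and WINUTILS constants.

```java
// Simplified sketch of Shell.getGroupsForUserCommand():
// on Windows the groups lookup is delegated to winutils.exe,
// on Unix it falls back to `id -Gn <user>` via bash.
public class GroupsCommandSketch {

    static String[] getGroupsForUserCommand(String user, boolean windows) {
        return windows
            ? new String[] { "winutils", "groups", "-F", "\"" + user + "\"" }
            : new String[] { "bash", "-c", "id -Gn " + user };
    }

    public static void main(String[] args) {
        // Windows branch: winutils groups -F "joyg"
        System.out.println(String.join(" ",
                getGroupsForUserCommand("joyg", true)));
        // Unix branch: bash -c id -Gn joyg
        System.out.println(String.join(" ",
                getGroupsForUserCommand("joyg", false)));
    }
}
```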

WINUTILS.exe command

We can assume that the WINDOWS flag will be true, as we are running on a Windows machine. So the call will return a command of the following form:

winutils groups -F <user name as passed via the parameter>

In our case, the chance that the user name is in the short-name format is 100%, because the log says so. Can we execute the command ourselves the same way the system does? To do that, we need the path to the WINUTILS.exe file.

What is the path to the winutils command file? Simply search the Hadoop folder. We can find it at <install drive>:\Hadoop\hadoop-1.1.0-SNAPSHOT\bin.

Now we can try executing the command:
<drive>:\Hadoop\hadoop-1.1.0-SNAPSHOT\bin>winutils groups -F <short username> (This failed)

<drive>:\Hadoop\hadoop-1.1.0-SNAPSHOT\bin>winutils groups -F <DOMAIN>\username (Success)

From this we might conclude that the command will never work and nobody should be able to execute MapReduce tasks. But people claim they are able to run Hadoop tasks using HDInsight. Are there any more tests remaining?

Let's try executing the command with the hadoop user name. .\hadoop is a local user which is created when we install Hadoop.

<drive>:\Hadoop\hadoop-1.1.0-SNAPSHOT\bin>winutils groups -F hadoop (Success)

Solution - Execute the HDInsight Hadoop task in the context of local user

This shows us the light. When we try to run the Hadoop application in the context of a local user, it may work. I immediately created a local user on the machine and ran the application. It worked!

So where is it trimming the user name? Or, which method decides that the user's groups need to be retrieved based on the short user name? Some more source code analysis revealed it. It's the UserGroupInformation.getGroupNames() method:

public synchronized String[] getGroupNames() {
   ensureInitialized();
   try {
     List<String> result = groups.getGroups(getShortUserName());
     return result.toArray(new String[result.size()]);
   } catch (IOException ie) {
     LOG.warn("No groups available for user " + getShortUserName());
     return new String[0];
   }
}
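As the catch block above shows, the warning in our log is printed straight from this method after the lookup with the short name fails. A minimal standalone sketch of the short-name effect is given below; this is an illustration only, not Hadoop's actual implementation, which derives the short name through its authentication rules.

```java
// Illustration: the short user name drops the domain prefix, so a
// domain account "DOMAIN\joyg" is looked up as plain "joyg" -- the
// form that winutils rejects, while a local user matches directly.
public class ShortNameSketch {

    static String shortUserName(String fullName) {
        int sep = fullName.indexOf('\\');        // DOMAIN\user on Windows
        return (sep < 0) ? fullName : fullName.substring(sep + 1);
    }

    public static void main(String[] args) {
        System.out.println(shortUserName("CONTOSO\\joyg")); // prints "joyg"
        System.out.println(shortUserName("hadoop"));        // prints "hadoop"
    }
}
```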

Now things are clear. I am still looking for a way to confirm this with the Hadoop developers.
