Tuesday, August 22, 2017

Starting Scala Spark - Setting up local development environment

When someone comes to me and says 'this can or cannot be done using .NET, SQL or the browser', I know whether it is really possible. Recently someone came to me and said 'we cannot connect to a SQL Server database from Scala and load data frames if Scala is running inside an Azure HDInsight Spark cluster'. To be frank, I googled it immediately, but there was no direct answer for that scenario. On the internet everyone talks about loading data from Azure Blob storage into Spark data frames and doing parallel data processing. But if Spark is an in-memory distributed execution technology, why can't it read from a SQL Server database, load a data frame and do the processing? Scala is just another JVM language, Spark is written in Scala so it too runs on the JVM, and the JVM has connectivity to SQL Server through JDBC. So theoretically yes, but I didn't have the confidence to say practically yes.
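
For what it is worth, the scenario itself can be sketched in a few lines against the Spark 2.x Scala API: Spark reads a SQL Server table into a data frame over plain JDBC. The snippet below is only a sketch of that idea; the server name, database, table and credentials are placeholders, and the Microsoft SQL Server JDBC driver jar has to be supplied separately (for example via the --jars option of spark-shell or spark-submit).

import org.apache.spark.sql.SparkSession

// Minimal sketch: load a SQL Server table into a Spark DataFrame over JDBC.
// All connection details below are placeholders, not a real server.
val spark = SparkSession.builder()
  .appName("sqlserver-to-dataframe")
  .getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://myserver.example.com:1433;databaseName=MyDb") // placeholder server/database
  .option("dbtable", "dbo.Customers")                                            // placeholder table
  .option("user", "myuser")                                                      // placeholder credentials
  .option("password", "mypassword")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")              // driver jar must be on the classpath
  .load()

df.printSchema()
df.show(10)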

It leads to the dilemma of 'should software architects code', or should they depend on what the development team says? Since I am on the 'yes, they should code' side, I decided to start exploring Spark so that next time I can handle Spark the same way I handle .NET and browser-side technologies.

The topics I need to cover are Scala, Spark and Azure HDInsight.

This post is aimed at someone who wants to learn the Spark data processing technology at a high level so that it can be integrated with other applications, not at someone who wants to go deep and become an expert Spark programmer.

Installation

The first step is to set up the development environment. Learning Spark or Scala in an Azure HDInsight environment is not a cost-effective decision. There are already good tutorials explaining the development environment setup; one such tutorial, which talks about setting up a Spark environment on a Windows machine, is given below.

https://msdn.microsoft.com/en-us/magazine/mt595756.aspx

This is a slightly old article, but it worked for most of the steps. Below are some points we may need to revisit for the latest versions.

Spark version

The instructions are a little old regarding the version numbers, though the concept remains the same. One point is which Scala version is suitable for which Spark version. The idea is to first select the Spark version and then select the matching Scala version. At the time of writing this article the latest Spark version is 2.2.0, which is built for Scala 2.11 (the latest standalone Scala release is 2.12.3, but Spark 2.2.0 does not support Scala 2.12 yet; the spark-shell output later in this post confirms it bundles Scala 2.11.8). The supported Scala version is mentioned on the Spark downloads page.
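
A quick way to confirm which versions are actually installed, assuming scala is on the PATH and Spark is extracted to the folder used later in this post, is to run the following from a command prompt:

C:\>scala -version
C:\spark_2_2_0\bin>spark-submit --version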

WinUtils.exe

Another dependency is winutils.exe, which has to be placed in the Hadoop bin folder. It can be found with a quick web search for 'winutils.exe hadoop'.

The usual location is C:\hadoop\bin\winutils.exe, with the HADOOP_HOME environment variable pointing to the C:\hadoop folder (and %HADOOP_HOME%\bin added to PATH).
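
For reference, the variables can be set for the current command prompt session as shown below (a sketch assuming the C:\hadoop folder above; use the System environment variable dialog or setx to make them permanent):

set HADOOP_HOME=C:\hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin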

After following the steps mentioned in the tutorial, we may still run into several issues. One such issue is given below.

Error

This may happen when we run the Spark shell after installation. The problem is that the Spark context variable 'sc' is not initialized. If it is not, we cannot do any Spark activities, though we can still execute plain Scala code. The error message is as follows (the font size is reduced to view more in less space).

C:\spark_2_2_0\bin>spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/22 17:33:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/22 17:33:29 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark_2_2_0/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark_2_2_0/bin/../jars/datanucleus-rdbms-3.2.9.jar."
17/08/22 17:33:29 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark_2_2_0/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark_2_2_0/bin/../jars/datanucleus-api-jdo-3.2.6.jar."
17/08/22 17:33:29 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark_2_2_0/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark_2_2_0/bin/../jars/datanucleus-core-3.2.10.jar."
17/08/22 17:33:35 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/08/22 17:33:35 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1053)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:129)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:938)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
  ... 47 elided
Caused by: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
  at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050)
  ... 61 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
  at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:191)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
  at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
  at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  ... 70 more
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
  at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
  at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
  ... 84 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val sf = sc.textFile("C:\\Temp\\joygazure.publishsettings")
<console>:17: error: not found: value sc
       val sf = sc.textFile("C:\\Temp\\joygazure.publishsettings")

Fix

If we analyze the log carefully, we can see there is a permission issue on a tmp folder related to Hive. Did we install Hive? Not explicitly. So what is the relation between initializing the Spark context and Hive folder permissions? At first they might not seem related, but they are: spark-shell creates its SparkSession with Hive support, and the Hive session it starts needs a writable scratch directory at /tmp/hive, which on a Windows machine maps to C:\tmp\hive. Hive checks the POSIX-style permissions on that folder, and winutils.exe is the tool that can set them:


c:\hadoop\bin\winutils.exe chmod 777 c:\tmp\hive
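
After running the command, the permission change can be verified with winutils itself and the shell restarted; if all is well, 'sc' is initialized and the earlier sc.textFile call works (paths as per the setup above):

C:\>c:\hadoop\bin\winutils.exe ls c:\tmp\hive
C:\spark_2_2_0\bin>spark-shell

scala> sc.version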
