Tuesday, October 17, 2017

Running multiple instances of AzCopy.exe command

AzCopy.exe is an amazing tool for data transfer, but if we run multiple instances of AzCopy at the same time we may get the error below.

AzCopy Command - AzCopy /Source:c:\temp\source /Dest:https://<storage account>.blob.core.windows.net/test /DestSAS:"<SAS>" /pattern:"" /s

An error occurred while reading the restart journal from "C:\Users\<user name>\AppData\Local\Microsoft\Azure\AzCopy". Detailed error: The process cannot access the file 'C:\Users\<user name>\AppData\Local\Microsoft\Azure\AzCopy\AzCopyCheckpoint.jnl' because it is being used by another process.

The error is pretty clear. AzCopy keeps a journal file for its resume functionality. If we don't specify the journal file location in the command, it uses the default location, and when a second AzCopy instance starts it cannot access that journal file.

The fix is to give each instance its own location for the .jnl file. The AzCopy command goes as follows:
AzCopy /Source:c:\temp\source /Dest:https://<storage account>.blob.core.windows.net/test /DestSAS:"<SAS>" /pattern:"" /s /z:<unique folder for azcopy command>

If we are running AzCopy from a command window the error is easy to spot. But if AzCopy is invoked from applications (PowerShell or .Net) in parallel, it is harder to notice because we might have suppressed all messages using /Y. AzCopy has a /V: switch which redirects the logs to a file; that will help with troubleshooting.
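
For example, below is a minimal C# sketch of launching several AzCopy instances in parallel, each with its own journal folder via /Z and its own log file via /V. The AzCopy.exe install path, source folders, destination URL and SAS are placeholders, not values from this post.

using System;
using System.Diagnostics;
using System.IO;

class ParallelAzCopy
{
    static void Main()
    {
        RunCopy(@"c:\temp\source1", "https://<storage account>.blob.core.windows.net/test", "<SAS>", 1);
        RunCopy(@"c:\temp\source2", "https://<storage account>.blob.core.windows.net/test", "<SAS>", 2);
    }

    // Each call starts one AzCopy.exe with a dedicated journal (/Z) and log (/V) folder,
    // so the default AzCopyCheckpoint.jnl is never shared between processes.
    static void RunCopy(string sourceFolder, string destUrl, string sas, int instanceId)
    {
        string workFolder = Path.Combine(Path.GetTempPath(), "azcopy-" + instanceId);
        Directory.CreateDirectory(workFolder);

        var psi = new ProcessStartInfo
        {
            // Adjust to the actual AzCopy install path on the machine.
            FileName = @"C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\AzCopy.exe",
            Arguments = $"/Source:{sourceFolder} /Dest:{destUrl} /DestSAS:\"{sas}\" /S /Y " +
                        $"/Z:\"{workFolder}\" /V:\"{Path.Combine(workFolder, "azcopy.log")}\"",
            UseShellExecute = false
        };
        Process.Start(psi);
    }
}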

Tuesday, October 3, 2017

Using .Net default values is trouble, especially for APIs

Languages have features that are not always meant to be used. JavaScript at least has a book named "JavaScript: The Good Parts"; other languages don't have one. Let's see one scenario in C# .Net.

Long long ago there was an API exposed to clients.

        class MyAPI
        {
            public void APIMethod(string s, int i = 10)
            {
                Console.WriteLine($"Inside APIMethod with i = {i}");
            }
        }

        // Client-side code
        internal void Test()
        {
            MyAPI apiClient = new MyAPI();
            apiClient.APIMethod("hi");
        }

Clients were happy using it. Later someone new joined the API team and overloaded the method as follows, thinking that keeping the same name would help clients discover the API.

        class MyAPI
        {
            public void APIMethod(string s, int i = 10)
            {
                Console.WriteLine($"Inside APIMethod with i = {i}");
            }
            public void APIMethod(string s)
            {
                Console.WriteLine($"Inside APIMethod");
            }
        }

Clients happily adopted the new version of the API, but soon they started feeling that their calls were not working as expected. They escalated the issue, the developers spent hours and days on it, and finally figured out what went wrong: when both overloads are applicable, the C# compiler prefers the one that needs no default value filled in, so the existing call APIMethod("hi"), once recompiled against the new version, silently bound to the new overload instead of the original method. They corrected the code as follows.

        class MyAPI
        {
            public void APIMethod(string s, int i = 10)
            {
                Console.WriteLine($"Inside APIMethod with i = {i}");
            }
            public void NewAPIMethod(string s)
            {
                Console.WriteLine($"Inside NewAPIMethod");
            }
        }

Moral of the story

Do not use a feature only because it is available.

Tuesday, September 26, 2017

What is wasb(s) protocol for accessing Azure blob storage

WASB Protocol

I didn't get a chance to study the wasb(s) protocol before I had to use it for HDInsight-related tasks, so the initial period was trial and error. That led me to decide to write a detailed post on the wasb protocol, since it helps us especially when we are doing Hadoop in the Azure world.

But before writing, I thought of googling for similar work; why should there be one more post on the internet with the same content? The results were interesting. Below are links explaining what the wasbs protocol for accessing Azure blob storage is.

https://blogs.msdn.microsoft.com/cindygross/2015/02/03/why-wasb-makes-hadoop-on-azure-so-very-cool/
https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/

But these did not answer all the questions I had, so I decided to add those here.

Can .Net access the wasbs:// url

.Net doesn't really need to access Azure blob storage via the wasbs protocol. It can access it via the storage SDK using the corresponding object model, or via the https protocol.
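
As a minimal sketch, assuming the classic WindowsAzure.Storage NuGet package (the account name, key, container and blob names are placeholders), the same data that Hadoop addresses via wasbs:// can be read like this:

using System;
using Microsoft.WindowsAzure.Storage;        // classic Azure storage SDK
using Microsoft.WindowsAzure.Storage.Blob;

class BlobReadSample
{
    static void Main()
    {
        // The same data Hadoop would address as wasbs://<container>@<account>.blob.core.windows.net/...
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");
        CloudBlobClient client = account.CreateCloudBlobClient();
        CloudBlobContainer container = client.GetContainerReference("<container>");
        CloudBlockBlob blob = container.GetBlockBlobReference("<blob name>");

        // Under the hood the SDK talks plain https to https://<account>.blob.core.windows.net/<container>/<blob name>
        Console.WriteLine(blob.Uri);
        blob.DownloadToFile(@"C:\temp\downloaded.blob", System.IO.FileMode.Create);
    }
}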

Tuesday, September 19, 2017

SQL 2016 - Create external polybase table on parquet file stored in Azure blob storage

This is a walk-through on creating an external PolyBase table in SQL Server 2016 that stores its data in Azure blob storage using the parquet file format.

Prerequisite

The prerequisite is basic knowledge of SQL Server and Microsoft Azure.

Use cases

One of the best use cases is moving data from SQL Server transactional systems to Azure blob storage. In the data analytics world we rarely see transactional relational databases, so we have to move data between the two. For example, if we have an internal (normal) transactional table, we can create an external PolyBase table with the same schema and do an INSERT..SELECT to move data from SQL Server to unstructured stores such as Azure blob or Hadoop.

Azure storage account and blob setup

This is as straightforward as creating an Azure storage account and blob container for any other purpose. Note down the storage account key and the container name; they will be used in SQL Server to set up the connection.

Installing SQL 2016 with Polybase support

PolyBase support came with SQL Server 2016. This SQL Server 2016 feature requires a Java runtime! Yes, PolyBase appears to be a wrapper over Java libraries, and those need a JVM to run. The minimum version required to run PolyBase at the time of writing this post is JRE 7 Update 51.

Once we have JRE 1.7 installed, we can install SQL Server 2016 with PolyBase support. PolyBase has to be selected manually during the installation, as it is not selected by default.

SQL Server level Settings

There are some settings to be done at the SQL Server level.

Enable Polybase

The below query will enable polybase export.

-- Enable INSERT into external table  
sp_configure 'allow polybase export', 1;  
reconfigure

Create password

This is the master key

-- Create a master key.
-- Only necessary if one does not already exist.
-- Required to encrypt the credential secret in the next step.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Password>';

Database level settings

This is the step where the credential for the Azure blob storage account is set up in the SQL Server database.

-- Create a database scoped credential
-- IDENTITY: Provide any string, it is not used for authentication to Azure storage.
-- SECRET: Provide your Azure storage account key.
go
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = '<storage account key>'
; 
This doesn't specify where the data resides; it just gives SQL Server access to the entire storage account.

Create data source

Now we can use the stored credential to map to the Azure blob container path. This just tells SQL Server where in the storage account the data goes. The wasbs:// protocol is used to point to the blob storage path.

-- Create an external data source
-- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure blob storage.
-- LOCATION: Provide Azure storage account name and blob container name.
-- CREDENTIAL: Provide the credential created in the previous step.

-- Example with a sample storage account; container name: polytest, storage account name: parquetstore
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://polytest@parquetstore.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
); 

Create file format

One more setup step required before creating the table is the creation of the file format. This tells SQL Server which format to use to store the table data.

-- Create an external file format
-- FORMAT_TYPE: Type of file format in Azure storage (supported: DELIMITEDTEXT, RCFILE, ORC, PARQUET).
-- FORMAT_OPTIONS: Specify field terminator, string delimiter, date format etc. for delimited text files.
-- Specify DATA_COMPRESSION method if data is compressed.

CREATE EXTERNAL FILE FORMAT [ParquetFileFormatSnappy]  
WITH (  
    FORMAT_TYPE = PARQUET,  
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'  
);

The data compression codec used here is Snappy; there are others as well.

Create external table

This is the last step. It creates the external table with the file format, storage location and data source.

-- CREATE AN EXTERNAL TABLE WITH PARQUET FILE.   
--A: Create the external table
-- Specify column names and data types. This needs to match the data in the sample file.
-- LOCATION: Specify path to file or directory that contains the data (relative to the blob container).
-- To point to all files under the blob container, use LOCATION='.'
CREATE EXTERNAL TABLE [dbo].[SampleTable_Parquet_External] (  
        [customerName] [nvarchar](100) NULL,
        [customerNumber] [varchar](70) NULL
)  
WITH (  
        LOCATION='/customerdata.parquet',  
        DATA_SOURCE = AzureStorage,  
        FILE_FORMAT = [ParquetFileFormatSnappy],  
        REJECT_TYPE = VALUE,  
        REJECT_VALUE = 0  
);

CRUD operations on external table

As stated for the PolyBase technology, it doesn't support DML operations such as UPDATE and DELETE statements; we can only INSERT.

--Insert Data to the external table
INSERT INTO [dbo].[SampleTable_Parquet_External]
select * from dbo.[SampleTable_Internal]

Things to remember about polybase

The API seems granular: we can create the file format, data source and so on separately and reuse them when creating tables.

As stated earlier, it is mainly useful in big data analytics. But when we tried moving more than 300,000 records from SQL Server to blob storage, we could see it just breaks. We are still troubleshooting with Microsoft; hopefully it is something we did wrong.

Another issue we faced was with data types. The PolyBase engine doesn't know how to handle and move every data type. There are settings (the REJECT_TYPE and REJECT_VALUE options above) to say how many mismatches it should tolerate before it fails.

More limitations can be found here.

Tuesday, September 5, 2017

Starting Scala Spark - Read write to parquet file

Introduction

This is a post to index information related to the parquet file format and how Spark can use it. Since there are already many tutorials covering the various operations, this post mainly consolidates the links.

What is parquet file?

It is a columnar storage format that can contain the schema along with the data and supports various encodings, compressions, etc.
The file format specification is from Apache. There is good support in the Java world, and .Net seems to be catching up with Parquet. Since it is mainly for the data analysis world, it is not recommended for use in transactional systems.

Read write to local

From Spark we can read and write parquet files using the methods given in the link below.
https://community.hortonworks.com/articles/21303/write-read-parquet-file-in-spark.html

Read write to parquet present in Azure blob storage

Below is a tutorial which explains how a local Spark cluster can be used to access Azure blob storage.

https://blogs.msdn.microsoft.com/arsen/2016/07/13/accessing-azure-storage-blobs-from-spark-1-6-that-is-running-locally/

It explains writing text files. If we use the parquet write function instead, Spark will write the data in parquet format to Azure blob storage.

Version compatibility

To get connectivity to Azure from Spark, Spark has to know about the Azure libraries: hadoop-azure-v#.v#.v#.jar and azure-storage-v#.v#.v#.jar. The above link explains using Spark 1.6 with the libraries hadoop-azure-2.7.0.jar and azure-storage-2.0.0.jar respectively. Both libraries can be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.0/. How to link these libraries to Spark is clearly explained in the link.

At the time of writing this post the latest Spark version is 2.2.0. So what should the Hadoop version be? Should it be the latest Hadoop version, 2.8.1? There is a tendency to use Hadoop 2.8.1, but if we use it we may get a NoSuchMethodError exception. The stack trace is below, with the root callers stripped.

java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
  at org.apache.hadoop.fs.azure.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:45)
  at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.getAccountKeyFromConfiguration(AzureNativeFileSystemStore.java:852)
  at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:932)
  at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
  at

As a true developer, the way to troubleshoot this is to look at the source code of hadoop-azure and the related packages, see what changed between versions, and use the proper one. Otherwise, do trial and error until one combination works.

The below combination seems to work:
Spark 2.2.0
http://apache.claz.org/hadoop/common/hadoop-2.7.0/

Dealing with parquet files of different encoding

Writing parquet files with specified codec

The DataFrame write API accepts the compression codec as an option, for example:

eg: df.write.format("parquet").option("compression", "gzip").save(<path>)

Reading parquet with specified codec

While reading parquet using sqlContext, we can use setConf() to specify the codec.

eg: sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

More details can be found in official Spark documentation.

More links

https://gist.github.com/aseigneurin/59844ac82da93eb9c7623931d3412783
https://www.youtube.com/watch?v=_0Wpwj_gvzg

Tuesday, August 29, 2017

Azure Application Insights - End to end correlation from Angular to internal WCF services

Azure AppInsights is a PaaS offering that can easily be configured for our application so we can leverage its capabilities, and it really works without any changes to our application. But does it help us more than other tools? Maybe yes, even with the default setup. We will really love it if we can get end-to-end correlation to understand what is happening in our system; production debugging becomes a breeze with correlation.

This post talks about how we can correlate activities happening in different tiers of our system and see them in one place. Let's take a scenario where an Angular application interfaces with users, calls REST endpoints to talk to front-end services, and the front-end services communicate with other internal services. We can correlate operations originating on the client side with the related operations happening in the internal services.

Prerequisite

  • Basic knowledge about AngularJS 1.x
  • Knowledge about WCF services.
  • Knowledge about setting up AppInsights service

Sample setup

It is advised to download the sample from the location below before reading further. Below is the structure of the sample.

The AngularJS application does just one thing: finding the area of a circle given a radius. To find the area, it calls a REST web service.
It sends one event during startup and another while calculating the area, i.e. when calling the FrontEndWCFService. These are custom AppInsights events.

FrontEndWCFService is a simple WCF service hosted using webHttpBinding to expose REST endpoints. It calculates the area using the radius parameter. It doesn't know the value of pi; it depends on InternalService to get the value of pi. During the calculation it sends some custom events/operations to AppInsights.

InternalService is hosted using net.pipe binding. It has one method which returns the value of Pi. It also sends some custom events to AppInsights.

The above 3 components are hosted inside IIS and enabled with AppInsights for the default system events.
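
As a rough illustration (not the exact code from the sample; it assumes the Microsoft.ApplicationInsights NuGet package and a configured instrumentation key), a WCF service operation can emit a custom correlated event like this:

using Microsoft.ApplicationInsights;

public class AreaService // hypothetical skeleton, standing in for the sample's FrontEndWCFService
{
    private static readonly TelemetryClient Telemetry = new TelemetryClient();

    public double GetArea(double radius, string operationId)
    {
        // Stamp the operation id received from the caller so this event shows up
        // under the same operation_id as the client-side events.
        Telemetry.Context.Operation.Id = operationId;
        Telemetry.TrackEvent("Custom event from FrontEndWCFService before calling InternalService");

        double pi = 3.14159; // the real sample asks InternalService for pi
        return pi * radius * radius;
    }
}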

Running the sample

  • Make sure the 3 web applications are hosted in IIS.
    • The web application has to have the net.pipe binding enabled to host InternalService.
    • Try to browse the service Urls.
  • Give the URL of InternalService in the web.config of FrontEndWCFService so that it can be called.
  • Obtain an AppInsights instrumentation key by either creating a new instance in Azure or copying it from an existing AppInsights instance. Put this key into the web applications. It is easiest to use Visual Studio search and replace; the placeholder to replace with the key is {Your Key}. There are 3 files that take the instrumentation key:
    • /AngularClient/Views/Shared/_Layout.cshtml
    • /FrontEndWCFService/ApplicationInsights.config
    • /InternalService/ApplicationInsights.config
  • Run the AngularClient application in browser.
    • During the application load, it will log "Custom init event from JS"
    • When we give the radius in the text box and click on button it will log "Custom event from JS before service call"

Viewing the AppInsights

It may take some time for the entries to be reflected in AppInsights in the Azure portal. It is better to go to the Analytics portal of the Application Insights instance and view the results there. In the Analytics portal, the details can be seen as follows.

The entries highlighted in the name column are the custom events generated by the different tiers in the system. Along with the custom events, the system will generate events if AppInsights collectors are attached to the processes. That can be done by connecting interceptors in WCF or HttpModules in web request pipelines.

Understanding correlation results

The operation_id column is the id that really correlates the operations. The id has to be unique, and the parent_id can be any id used previously; together they can form a tree of related events.

Don't get confused by the term 'events' used here. AppInsights can accept events, operations, traces, etc.; for easy reference the term 'event' is used.

Q&A

How does a web request from Angular get correlated with the servers?

When we send a web request from Angular or any other client, if we send the HTTP header "Request-Id" it will be treated as the operation_id on the server side, provided we have the AppInsights request interceptors in place.
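
A minimal sketch of the same idea from any .NET client (the endpoint URL is a placeholder): send a Request-Id header with the outgoing call and let the server-side AppInsights interceptors pick it up.

using System;
using System.Net.Http;
using System.Threading.Tasks;

class CorrelatedCall
{
    static async Task Main()
    {
        string operationId = Guid.NewGuid().ToString();

        using (var http = new HttpClient())
        {
            var request = new HttpRequestMessage(HttpMethod.Get,
                "http://localhost/FrontEndWCFService/area?radius=5"); // placeholder endpoint
            // The server-side interceptors read this header and use it for correlation.
            request.Headers.Add("Request-Id", operationId);

            HttpResponseMessage response = await http.SendAsync(request);
            Console.WriteLine($"{operationId} -> {response.StatusCode}");
        }
    }
}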

Why does the sample have 2 methods in Angular to write custom AppInsights events?

When the application is getting initialized, the appInsights object may not have its context ready. If so, we have to push our calls into its queue. Once the context is initialized, we can use it to set the operation_id; at that point the queue will be null.

Should I write any code for intercepting incoming web/WCF requests to read correlation data?

No, we just need to use the already available NuGet packages and set up our web.config appropriately.

Should I write any code for intercepting the outgoing web/WCF requests to write correlation data?

No, we just need to use the already available NuGet packages and set up our web.config appropriately.

References

https://docs.microsoft.com/en-us/azure/application-insights/application-insights-correlation
https://github.com/Azure/diagnostics-correlation
https://github.com/Microsoft/ApplicationInsights-JS
https://github.com/aspnet/Microsoft.AspNet.TelemetryCorrelation
https://mitra.computa.asia/article/msdn-modifying-and-filtering-telemetry-appinsights-javascript-sdk-telemetry-initializer

Tuesday, August 22, 2017

Starting Scala Spark - Setting up local development environment

When someone comes to me and says 'this can or cannot be done using .Net, SQL or the browser', I know whether it is really possible or not. Recently someone came to me and said 'we cannot connect to a SQL Server database from Scala and load data frames, if Scala is running inside an Azure HDInsight Spark cluster'. To be frank, I did some googling immediately, but there was no direct answer for that scenario. On the internet, everyone talks about loading data from Azure blob storage into Spark data frames and doing parallel data processing. But if Spark is an in-memory distributed execution technology, why can't it read from a SQL Server database, load a data frame and do the processing? Scala is just another JVM language, Spark is written in Scala so it too runs on the JVM, and the JVM has connectivity to SQL Server. So theoretically yes, but I didn't have the confidence to say practically yes.

It leads to the dilemma of 'should software architects code', or should they depend on what the development team says? Since I am on the 'yes, they should code' side, I decided to start exploring Spark so that next time I can handle Spark the same way I handle .Net and browser-side technologies.

The topics I should cover are Scala, Spark, Azure HDInsight.

This post is aimed at someone who wants to learn the Spark data processing technology at a high level so that it can be integrated with other applications, not at someone who wants to go in depth and become an expert Spark data processing programmer.

Installation

The first step is to set up the development environment. Learning Spark or Scala in an Azure HDInsight environment is not a cost-effective decision. There are already good tutorials explaining the development environment setup; one such tutorial, which talks about setting up a Spark environment on Windows machines, is given below.

https://msdn.microsoft.com/en-us/magazine/mt595756.aspx

This is a slightly old article, but it worked for most of the steps. Below are some points we may need to revisit for the latest versions.

Spark version

The instructions are a little old regarding the version numbers, though the concept remains the same. One point is the Scala version that is suitable for Spark: the idea is to first select the Spark version and then select the matching Scala version. At the time of writing this article, the Spark version is 2.2.0 and the corresponding Scala version is 2.12.3 (Spark works from Scala 2.11 onwards). The supported Scala versions are mentioned on the Spark downloads page.

WinUtils.exe

Another dependency is winutils.exe, which has to be in the hadoop folder. It can be found by googling; one such link is given below.

The location of winutils can be C:\hadoop\bin\winutils.exe, where the PATH environment variable points to the C:\hadoop folder.

After following the steps mentioned in the tutorial, we may run into several different issues. One such issue is given below.

Error

This may happen when we run the Spark shell after installation. The problem is that the Spark context variable 'sc' is not initialized. If it is not, we cannot do any Spark activities, though we can still run plain Scala code. The error message is as follows.

C:\spark_2_2_0\bin>spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/22 17:33:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/22 17:33:29 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark_2_2_0/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark_2_2_0/bin/../jars/datanucleus-rdbms-3.2.9.jar."
17/08/22 17:33:29 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark_2_2_0/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark_2_2_0/bin/../jars/datanucleus-api-jdo-3.2.6.jar."
17/08/22 17:33:29 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark_2_2_0/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark_2_2_0/bin/../jars/datanucleus-core-3.2.10.jar."
17/08/22 17:33:35 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/08/22 17:33:35 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1053)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
  at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:129)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:126)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:938)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:938)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
  ... 47 elided
Caused by: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
  at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
  at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050)
  ... 61 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
  at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:191)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
  at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
  at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
  ... 70 more
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx------
  at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
  at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
  ... 84 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val sf = sc.textFile("C:\\Temp\\joygazure.publishsettings")
<console>:17: error: not found: value sc
       val sf = sc.textFile("C:\\Temp\\joygazure.publishsettings")

Fix

If we analyze the log carefully, we can see there is a permission issue on the tmp folder related to Hive. Did we install Hive? Not explicitly. What is the relation between initializing the Spark context and Hive folder permissions? At first they might not seem related, but ultimately they are. Granting write permission on the folder using winutils fixes the issue:


c:\hadoop\bin\winutils.exe chmod 777 c:\tmp\hive