Tuesday, September 5, 2017

Starting Scala Spark - Reading and writing Parquet files

Introduction

This is a post to index information related to the Parquet file format and how Spark can use it. Since there are already many tutorials covering the various operations, this post mainly consolidates the links.

What is a Parquet file?

It is a columnar storage format that can contain the schema along with the data, and it supports various encodings, compression codecs, etc.
The file format specification is from Apache. There is good support in the Java world, and .Net seems to be catching up with Parquet. Since it is mainly for the data analysis world, it is not recommended for use in transactional systems.

Reading and writing Parquet locally

From Spark we can read and write Parquet files using the methods given in the link below.
https://community.hortonworks.com/articles/21303/write-read-parquet-file-in-spark.html
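For quick reference, a minimal self-contained sketch of the same idea (Spark 2.x API; the path and sample data below are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetLocalDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Write a small DataFrame out as a local Parquet file
val df = Seq((1, "one"), (2, "two")).toDF("id", "name")
df.write.mode("overwrite").parquet("/tmp/sample.parquet")

// Read it back and verify
spark.read.parquet("/tmp/sample.parquet").show()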

Reading and writing Parquet in Azure blob storage

Below is a tutorial which explains how a local Spark cluster can be used to access Azure blob storage.

https://blogs.msdn.microsoft.com/arsen/2016/07/13/accessing-azure-storage-blobs-from-spark-1-6-that-is-running-locally/

It explains writing text files. If we just use the parquet method instead, Spark will write the data in Parquet format to Azure blob storage.
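As a sketch of that idea (the account, container and key are placeholders, and the configuration key pattern below is the one the hadoop-azure connector expects):

// Give the hadoop-azure connector the storage account key (placeholder values)
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.<account>.blob.core.windows.net",
  "<storage-account-key>")

// The same parquet call, pointed at a wasb:// path, writes Parquet into the blob container
df.write.parquet("wasb://<container>@<account>.blob.core.windows.net/sample.parquet")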

Version compatibility

To get connectivity to Azure from Spark, Spark has to know about the Azure libraries: hadoop-azure-v#.v#.v#.jar and azure-storage-v#.v#.v#.jar. The link above explains this for Spark 1.6, using hadoop-azure-2.7.0.jar and azure-storage-2.0.0.jar respectively. Both libraries can be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.0/. How to link these libraries into Spark is clearly explained in that post.

At the time of writing this post the latest Spark version is 2.2.0. So what should the Hadoop version be? Should it be the latest Hadoop version, 2.8.1? There will be a tendency to use Hadoop 2.8.1, but if we use it we may get a NoSuchMethodError exception. The stack trace below is stripped of the root callers.

java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
        at org.apache.hadoop.fs.azure.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:45)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.getAccountKeyFromConfiguration(AzureNativeFileSystemStore.java:852)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:932)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
        at ...

As a true developer, the way to troubleshoot this is to look at the source code of hadoop-azure and the related packages, see what changed between versions, and pick the proper one. Otherwise, do trial and error until one combination works.

The combination below seems to work:

Spark 2.2.0
Hadoop 2.7.0 libraries: http://apache.claz.org/hadoop/common/hadoop-2.7.0/
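One way to link the jars is to pass them when launching the shell; a minimal sketch, assuming both jars were downloaded to the working directory:

// Launch Spark with both jars on the classpath:
//   spark-shell --jars hadoop-azure-2.7.0.jar,azure-storage-2.0.0.jar
// Inside the shell, the Azure filesystem class should then resolve:
Class.forName("org.apache.hadoop.fs.azure.NativeAzureFileSystem")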

Dealing with Parquet files with different compression codecs

Writing Parquet files with a specified codec

The DataFrameWriter API accepts the compression codec as an option, e.g.:

df.write.format("parquet").option("compression", "gzip").save(<path>)

Setting the codec via sqlContext

When working with Parquet through sqlContext, the codec can also be set globally using setConf(). Note that readers detect the codec from the Parquet file metadata automatically, so this setting controls the codec Spark uses when it writes Parquet.

e.g.: sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
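Putting both together, a minimal sketch (the path and the DataFrame df are illustrative; sqlContext is the instance spark-shell exposes):

// Set the default codec Spark SQL uses when writing Parquet
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

// This write now produces gzip-compressed Parquet files
df.write.parquet("/tmp/sample_gzip.parquet")

// Reading needs no codec hint: Parquet records the codec in the file metadata
sqlContext.read.parquet("/tmp/sample_gzip.parquet").show()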

More details can be found in the official Spark documentation.

More links

https://gist.github.com/aseigneurin/59844ac82da93eb9c7623931d3412783
https://www.youtube.com/watch?v=_0Wpwj_gvzg
