Hot questions for using Azure in Apache Spark


Question:

I am having an issue reading data from Azure blobs via Spark Streaming.

JavaDStream<String> lines = ssc.textFileStream("hdfs://ip:8020/directory");

Code like the above works for HDFS, but it is unable to read files from an Azure blob:

https://blobstorage.blob.core.windows.net/containerid/folder1/

The above is the path shown in the Azure UI, but it doesn't work. Am I missing something, and how can I access it?

I know Event Hubs are the ideal choice for streaming data, but my current situation requires using storage rather than queues.


Answer:

In order to read data from blob storage, two things need to be done. First, you need to tell Spark which native file system to use in the underlying Hadoop configuration. This also means the hadoop-azure JAR must be available on your classpath (note that additional JARs from the Hadoop family may be required at runtime):

JavaSparkContext ct = new JavaSparkContext();
Configuration config = ct.hadoopConfiguration();
config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.azure.account.key.youraccount.blob.core.windows.net", "yourkey");

Now, refer to the file using the wasb:// prefix (the [s] denotes an optional secure connection):

ssc.textFileStream("wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>");

It goes without saying that you'll need proper permissions set up from the location making the query to blob storage.
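
Putting the pieces together, here is a minimal end-to-end sketch in Scala (the snippets above are Java; the account, container, key, and batch interval below are placeholders, not values from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BlobStreamingSketch
{
  def main(args: Array[String]): Unit =
  {
    val conf = new SparkConf().setAppName("BlobStreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Register the native Azure file system and the storage account key
    // (placeholder account name and key).
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    hadoopConf.set("fs.azure.account.key.youraccount.blob.core.windows.net", "yourkey")

    // Watch the container path; files that land there after the stream starts
    // show up as new batches of lines.
    val lines = ssc.textFileStream("wasbs://yourcontainer@youraccount.blob.core.windows.net/folder1/")
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Keep in mind that textFileStream only picks up files created in the monitored directory after the stream has started.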

Question:

I have written an Azure Databricks Scala notebook (based on a JAR library), and I run it using a Databricks job once every hour.

In the code, I use the Application Insights Java SDK for log tracing, and initialize a GUID that marks the "RunId". I do this in a Scala object constructor:

import com.microsoft.applicationinsights.{TelemetryClient, TelemetryConfiguration}

object AppInsightsTracer
{
  TelemetryConfiguration.getActive().setInstrumentationKey("...")
  val tracer = new TelemetryClient()
  val properties = new java.util.HashMap[String, String]()
  properties.put("RunId", java.util.UUID.randomUUID.toString)

  def trackEvent(name: String)
  {
    tracer.trackEvent(name, properties, null)
  }
}

The notebook itself simply calls the code in the JAR:

import com.mypackage._
Flow.go()

I expect to have a different "RunId" every hour. The weird behavior I am seeing is that I get exactly the same "RunId" in the logs for all runs, as if the Scala object constructor code runs exactly once and is reused between notebook runs...

Do Spark/Databricks notebooks retain context between runs? If so, how can this be avoided?


Answer:

A notebook (Databricks or Jupyter) spawns a Spark session (think of it as a long-lived process) and keeps it alive until it either dies or you restart it explicitly. A Scala object is a singleton, so it is initialized once per JVM, and the same instance is reused for every execution of the notebook, whether separate cells or separate scheduled runs, as long as they share that session.
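
One way to avoid this, sketched against the snippet from the question, is to stop generating the "RunId" in the object body and regenerate it at the start of every run instead. newRun below is a hypothetical helper, not part of the Application Insights SDK:

import com.microsoft.applicationinsights.{TelemetryClient, TelemetryConfiguration}

object AppInsightsTracer
{
  TelemetryConfiguration.getActive().setInstrumentationKey("...")
  val tracer = new TelemetryClient()

  // Rebuilt per run rather than once per JVM, so a long-lived cluster
  // no longer pins a single RunId.
  private var properties = new java.util.HashMap[String, String]()

  def newRun(): Unit =
  {
    properties = new java.util.HashMap[String, String]()
    properties.put("RunId", java.util.UUID.randomUUID.toString)
  }

  def trackEvent(name: String)
  {
    tracer.trackEvent(name, properties, null)
  }
}

The notebook (or the entry point inside Flow.go()) then calls the helper first, so each scheduled run gets a fresh GUID:

import com.mypackage._
AppInsightsTracer.newRun() // fresh RunId for this run
Flow.go()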

Question:

I am trying to use Spark SQL to query a CSV file placed in Data Lake Store. When I query, I get "java.lang.ClassNotFoundException: Class com.microsoft.azure.datalake.store.AdlFileSystem not found".

How can I use Spark SQL to query a file placed in Data Lake Store? Please help me with a sample.

Example csv:

Id     Name     Designation
1      aaa      bbb
2      ccc      ddd
3      eee      fff

Thanks in advance, Sowandharya


Answer:

At present, HDInsight Spark clusters are not available with Azure Data Lake Store. Once that support is in place, this will work seamlessly. In the meantime, you can use Azure Data Lake Analytics to do the same job on ADLS with U-SQL queries. For reference, please visit: https://azure.microsoft.com/en-us/documentation/articles/data-lake-analytics-get-started-portal/ We are working on the support, and it is currently targeted for some time before summer 2016. Hope this helps.

Thanks, Sourabh.
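
For reference, once a Spark cluster can reach the store (i.e. the Data Lake connector and its dependencies are on the classpath and credentials are configured), a Spark SQL query over the sample CSV could look roughly like the sketch below. The adl:// path, the spark session handle, and the built-in CSV reader (Spark 2.x) are assumptions for illustration only:

// Placeholder account and path; assumes the Data Lake connector JARs are available.
val df = spark.read
  .option("header", "true")      // first line carries Id, Name, Designation
  .option("inferSchema", "true")
  .csv("adl://youraccount.azuredatalakestore.net/folder1/sample.csv")

df.createOrReplaceTempView("sample")
spark.sql("SELECT Id, Name FROM sample WHERE Designation = 'bbb'").show()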