Hot questions on using Azure HDInsight

Question:

I am running a sequence of Spark jobs on an HDI cluster. I only need the HDI cluster to be available while my tests run. So I want to do the following using Java:

  • Create HDI cluster
  • Run my tests
  • Delete HDI cluster

How can I do the above from a Java program in Azure?


Answer:

Currently, Azure does not provide a Java SDK for HDInsight clusters.

However, you can use the REST API to create and delete an HDInsight cluster.

Create

PUT https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.HDInsight/clusters/{clusterName}?api-version={api-version}

Delete

DELETE https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.HDInsight/clusters/{clusterName}?api-version={api-version}

In Java, you can write a program that calls these APIs.
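As a minimal sketch, the management URL can be built and called with the JDK's built-in HTTP client. The subscription ID, resource group, cluster name, api-version value, and bearer token below are placeholders, and obtaining an Azure AD access token is not shown here:

```java
// Sketch (not an official SDK): calling the HDInsight management REST API
// from Java. All identifiers below are placeholders you must supply.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HdiClusterClient {

    // Build the management URL for a cluster (the same URL is used
    // for PUT to create and DELETE to remove the cluster).
    static String clusterUrl(String subscriptionId, String resourceGroup,
                             String clusterName, String apiVersion) {
        return "https://management.azure.com/subscriptions/" + subscriptionId
                + "/resourceGroups/" + resourceGroup
                + "/providers/Microsoft.HDInsight/clusters/" + clusterName
                + "?api-version=" + apiVersion;
    }

    // Issue a DELETE against the management endpoint. 'bearerToken' must be
    // an Azure AD access token obtained separately (not shown here).
    static int deleteCluster(String url, String bearerToken) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Bearer " + bearerToken)
                .DELETE()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) {
        // Print the URL only; no request is sent without a real token.
        String url = clusterUrl("my-sub-id", "my-rg", "my-cluster",
                "2015-03-01-preview");
        System.out.println(url);
        // deleteCluster(url, "<access-token>");  // uncomment with a real token
    }
}
```

The create call additionally needs a JSON request body describing the cluster; its schema is defined by the HDInsight REST API documentation.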

Question:

I am trying to run a set of steps in an Oozie workflow. One of the steps involves running a Java program that reads its arguments from a job.properties.template file. How do I schedule this on an Azure HDInsight cluster (I already have a cluster running)?

Also, is there any way to get onto the head node of the HDInsight cluster, the way we SSH into the master node of an EMR cluster? I read about RDP (Remote Desktop Protocol) somewhere. It would be helpful if someone could give a few more pointers on this.


Answer:

To execute a Java program on HDInsight via remote desktop, try the following.

  1. Add your JAR to the lib folder along with your properties and XML files, then upload them to your blob storage.

Example :

Workflow.xml

<workflow-app name="WorkflowJavaMainAction" xmlns="uri:oozie:workflow:0.2">
    <start to="javaMainAction"/>
    <action name="javaMainAction">
        <java>
            <job-tracker>jobtrackerhost:9010</job-tracker>
            <name-node>wasb://xxx@yyy.blob.core.windows.net</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>packagename.classname</main-class>
        </java>
        <ok to="end"/>
        <error to="killJobAction"/>
    </action>
    <kill name="killJobAction">
        <message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
    </kill>
    <end name="end"/>
</workflow-app>

Coordinator.xml:

<coordinator-app end="${endTime}" frequency="${frequency}" name="sample_update" start="${startTime}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.2">
    <controls>
        <timeout>5</timeout>
        <concurrency>1</concurrency>
    </controls>
    <action>
        <workflow>
            <app-path>wasb://xxx@yyy.blob.core.windows.net/user/hdp/ooziejava/workflow.xml</app-path>
        </workflow>
    </action>
</coordinator-app>

job.properties

oozie.use.system.libpath=true
oozie.coord.application.path=wasb://xxx@yyy.blob.core.windows.net/user/hdp/ooziejava/coordinator.xml
startTime=2014-11-16T07:30Z
endTime=2014-11-23T04:50Z
frequency=15
timezone=GMT+0530
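Since the question mentions a Java program that reads its arguments from job.properties.template, here is a minimal sketch of loading such a file from the Java main action with java.util.Properties. The file name follows the question; the assumption (common for Oozie Java actions) is that files shipped alongside the workflow are localized into the action's working directory at runtime:

```java
// Sketch: loading Oozie job arguments from a properties file inside a
// Java main action. Adjust the file name/path to match your workflow.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class JobConfigReader {

    // Parse key=value pairs from any input stream into a Properties object.
    static Properties load(InputStream in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    public static void main(String[] args) throws IOException {
        File f = new File("job.properties.template");
        if (!f.exists()) {
            System.out.println("job.properties.template not found; nothing to read");
            return;
        }
        try (InputStream in = new FileInputStream(f)) {
            Properties props = load(in);
            System.out.println("frequency = " + props.getProperty("frequency"));
        }
    }
}
```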

Question:

I ran this code in PowerShell, following the steps and commands for PowerShell in this tutorial. I just changed the name from WordCount to Matrix. All the steps work fine, but I get this error after running the Azure PowerShell script:

Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist

Answer:

Based on my understanding, I think you want to perform matrix multiplication in Azure HDInsight, and your code ran successfully in the HDInsight Emulator but failed on HDInsight in Azure.

On Azure HDInsight, file paths use the blob container as the root and are written as relative paths without host information when you are remoted into the cluster, such as wasb:///examples/data/....

So you can remote into the HDInsight cluster (SSH for Linux, cmd for Windows) and follow the steps below.

  1. Copy your MapReduce JAR file and data file onto the HDInsight cluster. For example, with Hadoop on Linux you can run scp <your-file> <ssh-username>@<hdcluster-name>-ssh.azurehdinsight.net:/home/<hdcluster-username>/.
  2. Make a directory in the HDInsight filesystem: hadoop fs -mkdir wasb:///<dir-name>/.
  3. Copy your MapReduce JAR file into it: hadoop fs -cp <your jar file> wasb:///<dir-name>/jars/, like the default examples on HDInsight.

Or you can refer to https://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/ to upload files into HDInsight instead of the three steps above.

  4. Copy your data file into the input directory: hadoop fs -cp <your data file> wasb:///<dir-name>/data/input/, like the default examples on HDInsight.
  5. Run your code: hadoop jar wasb:///<dir-name>/jars/<your jar file name>.jar <your class name> wasb:///<dir-name>/data/input/<your data file> wasb:///<dir-name>/data/output.
  6. Wait for the job to complete, then run hadoop fs -cat wasb:///<dir-name>/data/output/* to show the result.

If the HDInsight cluster was created on Linux, you can refer to https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-mapreduce-ssh/ and find the SSH login information in the new Azure portal.

If the HDInsight cluster was created on Windows, you can refer to https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-mapreduce-remote-desktop/ and find the Remote Desktop information in the same place, using Remote Desktop instead of SSH.

You can also view the result of your job in the new Azure portal.