Apache Spark jobs

Use Apache Spark jobs to define, schedule, monitor, and control the execution of Apache Spark processes.

Prerequisites

Apache Spark job definition

A description of the job properties and their valid values is detailed in the context-sensitive help in the Dynamic Workload Console, which you open by clicking the question mark (?) icon in the top-right corner of the properties pane.

For more information about creating jobs using the various supported product interfaces, see Defining a job.

The following table lists the required and optional attributes for Apache Spark jobs:
Table 1. Required and optional attributes for the definition of an Apache Spark job

Connection attributes

Url
    The Apache Spark server URL. It must have the following format: http://<SPARK_SERVER>:8080/json (dashboard address).
    If not specified in the job definition, it must be supplied in the plug-in properties file.

REST Url
    The Apache Spark server URL used to run REST API calls. It must have the following format: http://<SPARK_SERVER>:6066, where 6066 is the default port for REST API calls.
    If not specified in the job definition, it must be supplied in the plug-in properties file.

Resource Name
    The full path to the .jar, .py, or .R file that contains the application code.

Resource Type
    The type of resource specified in the Resource Name field.

Main Class
    The entry point for your application, for example org.apache.spark.examples.SparkPi.

Arguments
    The arguments passed to the main method of your main class, if any. If more than one argument is present, use commas to separate them.

Application Name
    The name of the application.

JAR
    The full path to a bundled jar that includes your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs path or a file path that is present on all nodes.

Deploy Mode
    The deploy mode of the Apache Spark driver program:
    Cluster
        To deploy your driver inside the cluster.
    Client
        To deploy your driver locally, as an external client.

Spark Master
    The master URL for the cluster, for example spark://23.195.26.187:7077.

Driver Cores
    The number of cores to use for the driver process (cluster mode only).

Driver Memory
    The amount of memory, in gigabytes, to use for the driver process.

Executor Cores
    The number of cores to use on each executor. It is ignored when Apache Spark runs in standalone mode: in this case it takes the value of Driver Cores, because the executor is launched within the driver JVM process.

Executor Memory
    The amount of memory, in gigabytes, to use for each executor process. It is ignored when Apache Spark runs in standalone mode: in this case it takes the value of Driver Memory, because the executor is launched within the driver JVM process.

Variable List
    The list of variables, with their related values, that you want to specify. Click the plus (+) sign to add one or more variables to the variable list. Click the minus (-) sign to remove one or more variables from the variable list. You can search for a variable in the list by specifying the variable name in the filter box.

Note: Required and optional attributes cannot contain double quotation marks.
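
For illustration only, the following hypothetical values show how these attributes might be filled in to run the SparkPi example application that ships with the Apache Spark distribution; the host name, ports, file path, and resource type shown here are assumptions and must be adapted to your environment:

 Url              http://sparkmaster.example.com:8080/json
 REST Url         http://sparkmaster.example.com:6066
 Resource Name    /opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar
 Resource Type    jar
 Main Class       org.apache.spark.examples.SparkPi
 Arguments        100
 Application Name SparkPi
 Deploy Mode      Cluster
 Spark Master     spark://sparkmaster.example.com:7077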

Scheduling and stopping a job in HCL Workload Automation

You schedule HCL Workload Automation Apache Spark jobs by defining them in job streams. Add the job to a job stream with all the necessary scheduling arguments and submit the job stream.

You can submit jobs by using the Dynamic Workload Console, Application Lab, or the conman command line. See Scheduling and submitting jobs and job streams for information about how to schedule and submit jobs and job streams by using the various interfaces.

After submission, when the job is running and is reported in EXEC status in HCL Workload Automation, you can stop it, if necessary, by using the kill command. This action also stops the program execution on the Apache Spark server.
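
For example, assuming a dynamic agent workstation named AGENT1, a job stream named SPARK_JS, and an Apache Spark job named SPARKJOB (all hypothetical names), you might submit the job stream and, if needed, stop the running job from the conman command line as follows:

 conman "sbs AGENT1#SPARK_JS"
 conman "kill AGENT1#SPARK_JS.SPARKJOB"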

Monitoring a job

If the HCL Workload Automation agent stops when you submit the Apache Spark job, or while the job is running, the job restarts automatically as soon as the agent restarts.

For information about how to monitor jobs using the different product interfaces available, see Monitoring HCL Workload Automation jobs.

ApacheSparkJobExecutor.properties

The properties file is automatically generated either when you perform a "Test Connection" from the Dynamic Workload Console in the job definition panels, or when you submit the job to run for the first time. After the file has been created, you can customize it. This is especially useful when you need to schedule several jobs of the same type: you can specify the values in the properties file and avoid providing information, such as credentials, for each job. You can override the values in the properties file by defining different values at job definition time.

The properties file, named ApacheSparkJobExecutor.properties, is located in the following path:
On Windows operating systems
TWS_INST_DIR\TWS\JavaExt\cfg
On UNIX operating systems
TWA_DATA_DIR/TWS/JavaExt/cfg
The file contains the following properties:

url=http://<SPARK_SERVER>:8080/json
sparkurl=http://<SPARK_SERVER>:6066
drivercores=1
drivermemory=1
executorcores=1
executormemory=1
timeout=36000
The url and sparkurl properties must be specified either in this file or when creating the Apache Spark job definition in the Dynamic Workload Console. For more information, see the Dynamic Workload Console online help.

The timeout property represents the time, in seconds, that HCL Workload Automation waits for a reply from the Apache Spark server. When the timeout expires with no reply, the job terminates in ABEND status. The timeout property can be specified only in the properties file.
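
For example, a customized ApacheSparkJobExecutor.properties file for a hypothetical Apache Spark server named sparkmaster.example.com might look like the following; the host name and resource values are assumptions:

url=http://sparkmaster.example.com:8080/json
sparkurl=http://sparkmaster.example.com:6066
drivercores=2
drivermemory=2
executorcores=2
executormemory=4
timeout=7200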

For a description of each property, see the corresponding job attribute description in Required and optional attributes for the definition of an Apache Spark job.

Job properties

While the job is running, you can track its status and analyze its properties. In particular, in the Extra Information section, if the job contains variables, you can verify the value passed to each variable from the remote system. Some job streams use the variable passing feature: for example, the value of a variable specified in job 1 of job stream A is required by job 2 of the same job stream in order to run.

For information about how to display the job properties from the various supported interfaces, see Analyzing the job log. For example, from the conman command line, you can see the job properties by running:
conman sj <job_name>;props
where <job_name> is the Apache Spark job name.

The properties are listed in the Extra Information section of the output command.

For information about passing job properties, see Passing job properties from one job to another in the same job stream instance.
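
As a sketch, assuming that job JOB1 exposes a property named APP_ID in its Extra Information section (both names are hypothetical), and using the ${job:JOB_NAME.PROPERTY} syntax described in that section, job JOB2 of the same job stream instance could receive the value in one of its fields, for example in the arguments element of its JSDL definition:

 <jsdlapachespark:arguments>${job:JOB1.APP_ID}</jsdlapachespark:arguments>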

The following example shows an Apache Spark job definition created by using the composer command line:

<?xml version="1.0" encoding="UTF-8"?>
<jsdl:jobDefinition xmlns:jsdl="http://www.ibm.com/xmlns/prod/scheduling/1.0/jsdl"
xmlns:jsdlapachespark="http://www.ibm.com/xmlns/prod/scheduling/1.0/jsdlapachespark" name="APACHESPARK">
 <jsdl:application name="apachespark">
 <jsdlapachespark:apachespark>
 <jsdlapachespark:ApacheSparkParameters>
 <jsdlapachespark:Connection>
 <jsdlapachespark:connectionInfo>
 <jsdlapachespark:url>{url}</jsdlapachespark:url>
 <jsdlapachespark:sparkurl>{sparkurl}</jsdlapachespark:sparkurl>
 </jsdlapachespark:connectionInfo>
 </jsdlapachespark:Connection>
 <jsdlapachespark:Action>
 <jsdlapachespark:ResourceProperties>
 <jsdlapachespark:resourcename>{resourcename}</jsdlapachespark:resourcename>
 <jsdlapachespark:resourcetype>{resourcetype}</jsdlapachespark:resourcetype>
 <jsdlapachespark:mainclass>{mainclass}</jsdlapachespark:mainclass>
 <jsdlapachespark:arguments>{arguments}</jsdlapachespark:arguments>
 </jsdlapachespark:ResourceProperties>
 <jsdlapachespark:SparkProperties>
 <jsdlapachespark:appname>{appname}</jsdlapachespark:appname>
 <jsdlapachespark:jars>{jars}</jsdlapachespark:jars>
 <jsdlapachespark:deploymode>{deploymode}</jsdlapachespark:deploymode>
 <jsdlapachespark:sparkmaster>{sparkmaster}</jsdlapachespark:sparkmaster>
 <jsdlapachespark:drivercores>{drivercores}</jsdlapachespark:drivercores>
 <jsdlapachespark:drivermemory>{drivermemory}</jsdlapachespark:drivermemory>
 <jsdlapachespark:executorcores>{executorcores}</jsdlapachespark:executorcores>
 <jsdlapachespark:executormemory>{executormemory}</jsdlapachespark:executormemory>
 </jsdlapachespark:SparkProperties>
 <jsdlapachespark:EnvVariables>
 <jsdlapachespark:variablelistValues pairsList="">
 </jsdlapachespark:variablelistValues>
 </jsdlapachespark:EnvVariables>
 </jsdlapachespark:Action>
 </jsdlapachespark:ApacheSparkParameters>
 </jsdlapachespark:apachespark>
</jsdl:application>
</jsdl:jobDefinition>
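
As a sketch, assuming that the JSDL shown above is embedded in the TASK section of a composer job statement for a dynamic agent workstation and saved in a file named sparkjob_def.txt (a hypothetical file name), you could add the definition to the database with the composer command line:

 composer add sparkjob_def.txt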

Job log content

For information about how to display the job log from the various supported interfaces, see Analyzing the job log.

For example, you can see the job log content by running conman sj <job_name>;stdlist, where <job_name> is the Apache Spark job name.

See also

From the Dynamic Workload Console you can perform the same task as described in Creating job definitions.

For more information about how to create and edit scheduling objects, see Designing your Workload.