spark_utils module
This module contains utility functions and classes for working with Apache Spark.
Classes: - SparkConnection: Represents a connection to a Spark cluster.
Functions: - initializeSpark: Initializes a Spark connection and configures the Spark session based on the connection and configuration details. - read_via_spark: Reads data using Spark based on the specified connection format and credentials.
- class SparkConnection(connection_string, spark_configuration, hadoop_configuration=None, jar=None)
Bases: object
A class representing a connection to a Spark cluster.
- Parameters:
connection_string (str) – The connection string for the Spark cluster.
spark_configuration (dict) – Dictionary containing Spark configuration details.
hadoop_configuration (dict, optional) – Dictionary containing Hadoop configuration details.
jar (optional) – The JAR file to use, if any (defaults to None).
- initializeSpark()
Initializes a Spark connection and configures the Spark session.
- read_via_spark()
Reads data using Spark based on the specified connection format and credentials.
- write_via_spark(dataframe, conn_string, table, driver, mode='append', format='jdbc')
Writes data using Spark.
- __dispose__()
Disposes the Spark session and engine.
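The two configuration dictionaries can be sketched as follows; the concrete property keys, credentials, and JDBC URL below are illustrative examples, not values required by spark_utils:

```python
# Illustrative inputs for SparkConnection; the property keys are standard
# Spark/Hadoop settings, but the values and the JDBC URL are hypothetical.
spark_configuration = {
    "spark.app.name": "spark_utils_example",
    "spark.executor.memory": "2g",
    "spark.sql.shuffle.partitions": "64",
}
hadoop_configuration = {
    "fs.s3a.access.key": "EXAMPLE_KEY",     # hypothetical credential
    "fs.s3a.secret.key": "EXAMPLE_SECRET",  # hypothetical credential
}
connection_string = "jdbc:postgresql://db.example.com:5432/analytics"

# The class would then be constructed roughly like this:
# conn = SparkConnection(connection_string, spark_configuration,
#                        hadoop_configuration=hadoop_configuration)
```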
- initializeSpark()
Initializes a Spark connection and configures the Spark session based on the connection and configuration details.
- Returns:
The initialized SparkSession object.
- Return type:
pyspark.sql.SparkSession
- Raises:
Exception – If an error occurs during Spark initialization.
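The implementation of initializeSpark is not shown here, but one common way a session absorbs both dictionaries can be sketched: Spark accepts Hadoop settings through the spark.hadoop.* key prefix, so a helper like the following (hypothetical, not part of spark_utils) could merge them into a single mapping before applying each pair via SparkSession.builder.config:

```python
def merge_spark_conf(spark_configuration, hadoop_configuration=None):
    """Fold Hadoop settings into one Spark conf mapping using the
    standard spark.hadoop.* key prefix (hypothetical helper)."""
    conf = dict(spark_configuration)
    for key, value in (hadoop_configuration or {}).items():
        conf["spark.hadoop." + key] = value
    return conf

conf = merge_spark_conf(
    {"spark.executor.memory": "2g"},
    {"fs.s3a.access.key": "EXAMPLE"},
)
# Each (key, value) pair would then be applied with
# SparkSession.builder.config(key, value) before getOrCreate().
```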
- read_via_spark(spark_connection_details, source_format='jdbc', batch_size=10000)
Reads data from a Spark DataFrame using limit/offset pagination, yielding batches of data until all rows are read. Stops if all rows in a batch are null.
- Parameters:
spark_connection_details (dict) – The connection details.
source_format (str) – The source format (e.g., “jdbc”).
batch_size (int) – The number of rows per batch.
- Yields:
pyspark.sql.DataFrame – A Spark DataFrame containing a batch of rows.
- Raises:
Exception – If an error occurs during the data reading process.
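The pagination loop described above can be sketched in plain Python; fetch_batch is a hypothetical stand-in for the Spark query (e.g. a limit/offset JDBC read collected into rows), so the example shows only the batching and stop logic, not the actual Spark I/O:

```python
def read_in_batches(fetch_batch, batch_size=10000):
    """Yield batches of rows via limit/offset pagination, mirroring the
    loop read_via_spark describes. fetch_batch(limit, offset) stands in
    for a Spark limit/offset query. Stops when a batch is empty or
    every row in it is null (None)."""
    offset = 0
    while True:
        batch = fetch_batch(batch_size, offset)
        if not batch or all(row is None for row in batch):
            break
        yield batch
        if len(batch) < batch_size:
            break  # short batch means the source is exhausted
        offset += batch_size

rows = list(range(25))  # stand-in data source
batches = list(read_in_batches(lambda limit, off: rows[off:off + limit],
                               batch_size=10))
# yields three batches, of sizes 10, 10, and 5
```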
- write_via_spark(dataframe, conn_string, table, driver, mode='append', format='jdbc')
The write_via_spark method writes data using Spark based on the specified connection format and credentials. It takes the DataFrame stored in the sparkDataframe attribute of the SparkConnection object and writes it to the target data destination.
The method requires a valid connection format and matching credentials to establish the connection. It uses the DataFrame API's write method, specifying the connection format, options, and save mode.
If any error occurs during the data writing process, an exception is raised.
This method is typically called after establishing a Spark connection, configuring the connection format and credentials, and preparing the data in the sparkDataframe attribute.
- Raises:
Exception – If an error occurs during the data writing process.
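With the default format of "jdbc", the options a write needs can be assembled from the method's parameters; the helper below is a hypothetical sketch (the URL and table are examples), using the standard Spark JDBC option keys url, dbtable, and driver:

```python
def build_jdbc_write_options(conn_string, table, driver):
    """Assemble the option mapping for a JDBC write (hypothetical
    helper). With a real DataFrame this would feed
    dataframe.write.format("jdbc").options(**opts).mode(mode).save()."""
    return {"url": conn_string, "dbtable": table, "driver": driver}

opts = build_jdbc_write_options(
    "jdbc:postgresql://db.example.com:5432/analytics",  # hypothetical URL
    "public.events",
    "org.postgresql.Driver",
)
```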