Prepare to use Apache Spark

Apache Spark is a distributed data processing framework that enables large-scale data analytics by coordinating work across multiple processing nodes in a cluster, known in Microsoft Fabric as a Spark pool. Put more simply, Spark uses a "divide and conquer" approach to processing large volumes of data quickly by distributing the work across multiple computers. The process of distributing tasks and collating results is handled for you by Spark.
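To make that divide-and-conquer model concrete, here is a minimal sketch (assuming a notebook with an active Spark session available as spark, as in a Fabric notebook): the driver splits the numbers into partitions, worker tasks sum each partition, and Spark collates the partial results.

```python
# Distribute a local range of numbers across the cluster as an RDD.
rdd = spark.sparkContext.parallelize(range(1_000_000))

# sum() runs as parallel tasks (one per partition) on the workers;
# Spark combines the partial sums into a single result on the driver.
print(rdd.sum())  # 499999500000
```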

Spark can run code written in a wide range of languages, including Java, Scala (a language that runs on the Java virtual machine), R (through SparkR), Spark SQL, and Python (through PySpark, Spark's Python API). In practice, most data engineering and analytics workloads are accomplished using a combination of PySpark and Spark SQL.
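To show how the two typically interoperate, the following sketch (assuming a notebook where a spark session is already available; the sales data is hypothetical) performs the same aggregation with the PySpark DataFrame API and with Spark SQL:

```python
from pyspark.sql import functions as F

# Hypothetical sample data; in practice this would come from a lakehouse
# table or file.
df = spark.createDataFrame(
    [("Widget", 10), ("Widget", 5), ("Gadget", 7)],
    ["Product", "Quantity"],
)

# PySpark DataFrame API: group and sum.
df.groupBy("Product").agg(F.sum("Quantity").alias("TotalQuantity")).show()

# Spark SQL: register a temporary view and run the equivalent query.
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT Product, SUM(Quantity) AS TotalQuantity FROM sales GROUP BY Product"
).show()
```

Because both run on the same engine, the two styles can be freely mixed within a single notebook.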

Spark pools

A Spark pool consists of compute nodes that distribute data processing tasks. The general architecture is shown in the following diagram. 


Diagram of a Spark pool.

As shown in the diagram, a Spark pool contains two kinds of node:

  1. The head node in a Spark pool coordinates distributed processes through a driver program.
  2. The pool includes multiple worker nodes on which executor processes perform the actual data processing tasks.

The Spark pool uses this distributed compute architecture to access and process data in a compatible data store, such as a data lakehouse based in OneLake.
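The sketch below makes the division of work visible (assuming the notebook has a default lakehouse attached containing a table named sales; the table name is illustrative). Spark splits the table into partitions, and each partition is processed by an executor task on a worker node.

```python
# Read a lakehouse table (illustrative name) into a distributed DataFrame.
df = spark.read.table("sales")

# The data is divided into partitions; each partition is processed by an
# executor task running on one of the pool's worker nodes.
print(f"Partitions: {df.rdd.getNumPartitions()}")

# An action such as count() makes the driver schedule tasks across the
# executors and collate their partial results into a single answer.
print(f"Rows: {df.count()}")
```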

