# Spark
plpipes provides an integration layer for Spark/Databricks via the
`plpipes.spark` package, as well as a set of database backends and
drivers for using the Spark DB engine as a regular database.
Spark support should still be considered experimental.
## SparkSession
SparkSession setup is performed using a set of drivers, which handle different environments.
Currently, the supported drivers are as follows:
- `embedded`: initializes a Spark environment without any external infrastructure. Once the Python program ends, the Spark system also shuts down.
- `databricks`: connects to a remote Databricks cluster using `databricks-connect`.
Every driver accepts a different set of configuration options which are described below.
## Usage
A Spark session can be accessed as follows (a minimal sketch; the `spark_session` accessor name is an assumption, as the exact entry point exposed by `plpipes.spark` may differ):
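```python
from plpipes.spark import spark_session  # hypothetical accessor name

# Retrieve the project's SparkSession; any required initialization
# happens automatically on first access.
spark = spark_session()
spark.sql("select 1 as one").show()
```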
Any action needed to initialize the session or any other subsystem is performed automatically.
## Configuration
The Spark layer is configured through the `spark` entry. A `driver`
setting is used to pick which driver to use; any other accepted
configuration is driver-dependent.
Example (a minimal sketch, assuming the standard plpipes YAML configuration layout):
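```yaml
spark:
  driver: embedded
```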
The entries supported by the different drivers are as follows:
### embedded driver
- `app_name`: application name (defaults to `work`).
- `home`: Spark working directory (defaults to `work/spark`).
- `extra`: subtree containing extra configuration options which are passed verbatim to the `SparkSession.builder.config` method. Note that the options accepted by that method usually start with `spark.`. See the Spark documentation for details.
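For instance, an embedded configuration passing extra options to Spark could look like the following sketch (the option values are illustrative placeholders):

```yaml
spark:
  driver: embedded
  app_name: my-pipeline
  home: work/spark
  extra:
    # Passed verbatim to SparkSession.builder.config.
    spark.driver.memory: 2g
    spark.sql.shuffle.partitions: 8
```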
### databricks driver
- `profile`: profile name as defined in the file `~/.databrickscfg` (defaults to `DEFAULT`).
- `extra`: subtree containing extra configuration options which are passed verbatim to the `SparkSession.builder.config` method. Note that the options accepted by that method usually start with `spark.`. See the Spark documentation for details.
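For instance (a sketch; the profile name is a placeholder):

```yaml
spark:
  driver: databricks
  profile: my-profile
```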
## Database integration
The Spark database engine can be accessed through the `plpipes.database`
package like any other database.
See Databases for the details.
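As an illustration, a query against the Spark engine could look like the sketch below; the database instance name `spark` and the table name are assumptions that depend on the project configuration:

```python
from plpipes.database import query

# Run the SQL statement against the Spark-backed database instance
# (instance name assumed here to be "spark") and collect the result.
df = query("select * from my_table", db="spark")
print(df.head())
```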
## Databricks integration
Access to Spark clusters inside Databricks is provided through the
`databricks` driver. It expects to find a `databricks-connect`
profile already configured in the file `~/.databrickscfg`.
That file can be created automatically using the Databricks CLI, which accepts several authentication mechanisms.
The following command can be used to initialize a profile (one possible invocation, using a recent Databricks CLI; the exact subcommand depends on the CLI version and the chosen authentication mechanism):
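```sh
# <workspace-url> and the profile name are placeholders.
databricks auth login --host <workspace-url> --profile my-profile
```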
The workspace-url can be taken from the host part of the URL shown in
the web browser when inside the Databricks environment. It usually
looks like `https://adb-xxxxxxxxxxx.azuredatabricks.net`.
Also, the `cluster_id` must be added to the profile by hand. It
appears in the URL when inspecting the cluster inside the Databricks
environment.
Sample `~/.databrickscfg` file (with placeholder values; the exact authentication fields depend on the mechanism used):
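```ini
[DEFAULT]
host       = https://adb-xxxxxxxxxxx.azuredatabricks.net
token      = <personal-access-token>
cluster_id = <cluster-id>
```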