Configuring your environment
I have been working on a PySpark/Delta project for some time, and have grown cozy and lazy with the ease of running code in the Databricks environment. Their clusters and UI hide quite a bit of complexity around dependencies and configuration, leaving more time to get down to business.
Alas, my tenure with the company came to an end, and with it my access to these easy-to-use environments for running PySpark.
I wanted to recreate some of that environment so I could keep tinkering with Spark and Delta, but I was quickly reminded that dealing with Java dependencies has at least two "right" ways, each requiring some Faustian dealmaking. I am not ready for that yet.
So a bit of trial and error and Google searching brought me to the following Medium article, which has a recipe that works.
In this case I am installing Java, Scala and Spark at the OS level. I am interested in running on bare metal, as running Spark inside a container seems wrong.
I am using Poetry to create a virtual environment where PySpark can run and where I can manage Python dependencies. Here is the pyproject.toml:
[tool.poetry]
name = "delta1"
version = "0.1.0"
description = ""
authors = ["Marco Falcioni"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
pyspark = "^3.3.1"
loguru = "^0.6.0"
jupyter = "^1.0.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.2.1"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
A few environment variables make the pyspark launcher start a Jupyter notebook as the driver, and make AWS credentials available to the session:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export AWS_ACCESS_KEY_ID=XXXXX
export AWS_SECRET_ACCESS_KEY=YYYY
And then launch pyspark with the Delta Lake package and configuration:
poetry run pyspark --packages io.delta:delta-core_2.12:2.2.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
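If you would rather set up Delta from inside a script or notebook instead of on the pyspark command line, the same settings can be passed to the SparkSession builder. A minimal sketch, assuming the delta-core jar is fetched via spark.jars.packages; the app name and table path are just placeholders:

# Build a Spark session with the same Delta settings used on the CLI above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta1")  # hypothetical app name
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Quick smoke test: write a tiny Delta table and read it back.
df = spark.range(5)
df.write.format("delta").mode("overwrite").save("/tmp/delta-table")  # example path
spark.read.format("delta").load("/tmp/delta-table").show()

Either way, the important part is that the Delta extension and catalog are configured before the session is created; adding them afterwards has no effect.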