Configuring your environment
I have been working on a PySpark/Delta project for some time, and have grown cozy and lazy with the ease of running code in the Databricks environment. Their clusters and UI hide quite a bit of complexity around dependencies and configuration, leaving more time to get down to business.
Alas, my tenure with the company came to an end, and with it my access to these easy-to-use environments for running PySpark.
I wanted to recreate some of that environment so I could keep tinkering with Spark and Delta, but I was quickly reminded that dealing with Java dependencies has at least two "right" ways, each requiring some Faustian dealmaking. I am not ready for that yet.
So a bit of trial and error and Google searching brought me to the following Medium article, which has a recipe that works.
In this case I am installing Java, Scala and Spark at the OS level. I am interested in running on bare metal, as running Spark inside a container seems wrong.
I am using Poetry to create a virtual environment where PySpark can run and where I can manage Python dependencies. Here is the pyproject.toml:
[tool.poetry]
name = "delta1"
version = "0.1.0"
description = ""
authors = ["Marco Falcioni"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
pyspark = "^3.3.1"
loguru = "^0.6.0"
jupyter = "^1.0.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.2.1"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
A few environment variables make the pyspark launcher start a Jupyter notebook as the driver, and make AWS credentials available to the session:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export AWS_ACCESS_KEY_ID=XXXXX
export AWS_SECRET_ACCESS_KEY=YYYY
And then launch pyspark with the Delta Lake package and configuration:
poetry run pyspark --packages io.delta:delta-core_2.12:2.2.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
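If you would rather set up Delta from inside a script or notebook instead of on the pyspark command line, the same settings can be passed to the SparkSession builder. A minimal sketch, assuming the delta-core jar is fetched via spark.jars.packages; the app name and table path are just placeholders:

# Build a Spark session with the same Delta settings used on the CLI above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta1")  # hypothetical app name
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Quick smoke test: write a tiny Delta table and read it back.
df = spark.range(5)
df.write.format("delta").mode("overwrite").save("/tmp/delta-table")  # example path
spark.read.format("delta").load("/tmp/delta-table").show()

Either way, the important part is that the Delta extension and catalog are configured before the session is created; adding them afterwards has no effect.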