Sunday, January 22, 2023

PySpark, AWS and Delta Lake

Configuring your environment

I have been working on a PySpark/Delta project for some time, and have grown cozy and lazy with the ease of running code in the Databricks environment. Using their clusters and UI hides quite a bit of complexity around dependencies and configuration, leaving more time to get down to business.

Alas, my tenure with the company came to an end, and with it my access to these easy-to-use environments for running PySpark.

I wanted to recreate some of that environment so I could keep tinkering with Spark and Delta, but I was quickly reminded that dealing with Java dependencies has at least two right ways, each requiring some Faustian dealmaking. I am not ready yet.

So a bit of trial and error and Google brought me to the following Medium article, which has a recipe that works.

In this case I am installing Java, Scala and Spark at the OS level. I am interested in running on bare metal, as running Spark inside a container seems wrong.

I am using Poetry to create a virtual environment where PySpark can run, and where I can manage Python dependencies. This is my pyproject.toml:

[tool.poetry]
name = "delta1"
version = "0.1.0"
description = ""
authors = ["Marco Falcioni"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
pyspark = "^3.3.1"
loguru = "^0.6.0"
jupyter = "^1.0.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.2.1"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

To run a Jupyter notebook and access AWS, export these four environment variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export AWS_ACCESS_KEY_ID=XXXXX
export AWS_SECRET_ACCESS_KEY=YYYY 

And then:

poetry run pyspark --packages io.delta:delta-core_2.12:2.2.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Configuring Delta Lake

TBD: here I'll speak to how to configure Delta on S3.
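In the meantime, here is the shape I expect that configuration to take. This is an untested sketch: the hadoop-aws version is an assumption and must match the Hadoop build bundled with your Spark, and the bucket name is a placeholder. Credentials come from the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variables exported above, which the default s3a credentials provider chain picks up.

```python
from pyspark.sql import SparkSession

# Untested configuration sketch: hadoop-aws 3.3.2 is an assumed version
# that must match the Hadoop your Spark distribution was built against.
spark = (
    SparkSession.builder.appName("delta-s3")
    .config(
        "spark.jars.packages",
        "io.delta:delta-core_2.12:2.2.0,org.apache.hadoop:hadoop-aws:3.3.2",
    )
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# df.write.format("delta").save("s3a://my-bucket/my-table")  # placeholder bucket
```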


Saturday, February 20, 2016

Umberto Eco: How to calmly prepare for death. Quiet instruction to the eventual practitioner.

by Umberto Eco

I am not saying anything original, but one of the biggest problems of a human being is how to face death. It seems that the problem is hard for non-believers (how to face the Nothingness that awaits?), but statistics say that the question also troubles many believers, who resolutely hold that there is life after death, yet still find life before death so enjoyable in itself that they consider its end unpleasant; they strive to reunite with the angels, but as late as possible.

Evidently, I am speaking of the problem of what being-toward-death means, or even just of recognizing that all humans are mortal. It seems easy as long as it concerns Socrates, but it becomes troublesome when it concerns us. And the most difficult moment will be the one in which we realize that for an instant we are still here, and the instant afterwards we will be no more.

Recently a thoughtful practitioner (some Crito) asked me: Teacher, how can we best approach death? I replied that the only way to prepare for death is to convince oneself that everybody else is an idiot*.

I clarified to the surprised Crito. See, I said, how can you approach death, even as a believer, if you think that while you die strapping youngsters of either sex are dancing in clubs and enjoying themselves tremendously, brilliant scientists are unveiling the last mysteries of the cosmos, honest politicians are building a better society, media outlets strive to report just the relevant news, and responsible entrepreneurs take care that their products don't harm the environment, applying themselves to restoring a nature of clear streams, verdant woods, and skies free of ozone holes, strewn with soft clouds full of sweet rain?

The thought that while all these marvelous things are happening you must depart is unbearable. Try instead to think that, just when you realize you are leaving this world, you have the unshakable certainty that the world (five billion human beings) is full of idiots: idiots dancing in clubs, idiot scientists convinced they have solved the mysteries of the cosmos, idiot politicians who propose solutions to all our troubles, idiots who fill pages and pages with insults and marginal gossip, idiot and suicidal manufacturers who are destroying the planet. Would you not be happy, then, to depart this world of idiots?

Crito then asked me: Teacher, but when should I start thinking this way? I answered that one should not start too soon, because somebody who at twenty or thirty thinks that everybody else is an idiot is an idiot himself and will never achieve wisdom. We should start by thinking that everybody else is better than us, then evolve bit by bit: have the first doubts around forty, begin the revision between fifty and sixty, and reach certainty while marching towards one hundred, ready to call it done just as soon as our number is up.

To convince oneself that all the others around us (five billion) are idiots is a subtle and shrewd art, not available to this or that Cebes with an earring (or nose ring). It requires study and effort. You can't rush it. You need to achieve it just in time to die in serenity. But the day before, we must still think that somebody whom we admire and love is not yet a complete idiot. Wisdom lies in recognizing at the right time, and not sooner, that they, too, are idiots. Just then it is fine to die.

Therefore the great art is to study universal thought bit by bit, to scan pop culture, to monitor day by day the media, the statements of self-confident artists, the free-wheeling declamations of politicians, the philosophical statements of apocalyptic critics, the aphorisms of enigmatic heroes; to study the theories, proposals, calls to action, images and phantasms. Only then will you have the overwhelming realization that everybody is an idiot. And then you will be ready to face death.

You will need to hold out until the end to reach this incontrovertible realization. You will continue thinking that somebody is still saying sensible things, that that book is better than others, that that leader really wants the common good. It is natural, it is humane, it is innate in our species to refuse the notion that everybody else is an utter idiot, otherwise why would it be worth living? But when in the end you do know, you will comprehend why it is worth it, even splendid, to die.

Crito then told me: Teacher, I don't want to make rash decisions, but I think you are an idiot.  See, I said, you are on the right path.

First published in Espresso, June 12th 1997


* coglione: lit. testicle

Monday, February 13, 2012

Installing pyramid on MacOS X

Prerequisites:

1) A modern Python 2.x (x > 4)
2) easy_install
3) virtualenv

Virtualenv creates a python sandbox that has no dependencies on the system python library. When used this way, the sandbox cannot be broken by system changes.
You can have multiple sandboxes to manage applications that have different dependencies (versions included).

Let's first make sure that we have a sane system python:

bash> which python
/Library/Frameworks/Python.framework/Versions/2.7/bin/python

bash> python --version
Python 2.7.2

bash> which easy_install
/usr/bin/easy_install

I like to keep all my sandboxes together, in a "sandboxen" dir:

bash> mkdir sandboxen
bash> cd sandboxen
bash> virtualenv pyramid
New python executable in pyramid/bin/python
Installing setuptools............done.
Installing pip...............done.

At this point you need to activate the virtual environment. This is done by sourcing the "activate" script in ~/sandboxen/pyramid/bin. Do not make this script executable! You need to run it as

bash> source ~/sandboxen/pyramid/bin/activate
(pyramid) bash>

The prompt will include the sandbox name to remind you of the special behavior. Use the "deactivate" command to go back to the normal shell. Activate works by prepending the sandbox's bin directory to your PATH.

Pyramid Install

At this point installing pyramid from pypi should be a matter of running

easy_install pyramid==1.2

(at the time of writing 1.3 is still in alpha, so we'll stick with the released version). This will download pyramid and all of its dependencies, and install them into the sandbox. The standard dependencies include a few zope packages, Mako, Chameleon, Paster and a few more.

From this point on we will rely on paster commands to create an application and run it locally in dev mode. If you are familiar with Rails, paster is the rake equivalent.

A logical next step is the excellent "Pyramid for Humans" tutorial.

Friday, January 13, 2012

Nosetests and memory consumption

At work my main project is a Pylons website for internal consumption. I do most of the coding, requirements gathering, bug fixing, database design - you name it, I do it.
Last night I was about to commit a set of changes; I always run my tests before committing. There are lots of tests, and it takes 10 minutes or so for them all to run.
Of course after 9 minutes or so nosetests ends with a failure, and it's 1 AM.

These are the errors:

MemoryError
Logged from file base.py, line 1388
Traceback (most recent call last):
  File "c:\Python27\Lib\logging\__init__.py", line 859, in emit
    stream.write(fs % msg)
  File "c:\Python27\Lib\StringIO.py", line 221, in write
    self.buflist.append(s)


So it's memory related. I develop on Windows 7 (yes), so I start Task Manager, and in fact the nosetests process fails after reaching 1.8 GB or so.

There are two interesting facts: first, nosetests quickly goes up to 1.3 GB within the first 10 seconds of running, before any of my tests are run. I think this is nose collecting test metadata - it seems like a lot of memory, but that is not the problem. The second fact is that as the tests run, the process gobbles up more and more RAM.

The problem is that by default nose captures the stdout stream to memory: if you have verbose logging (say SQLAlchemy's echo=True), the captured logs can get very big. Basically I found the threshold at which (at least on Windows) nose runs out of memory.
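A toy illustration of the mechanism (this is not nose's actual code, just the general idea of an in-memory capture buffer growing with chatty logging):

```python
from io import StringIO

# Simulate a capture buffer accumulating 100k echoed SQL statements,
# the way captured stdout piles up over a long test run.
captured = StringIO()
for i in range(100_000):
    captured.write("SELECT * FROM some_table WHERE id = %d\n" % i)

size = len(captured.getvalue())
# A single noisy module already costs a few megabytes; hundreds of tests
# with echo enabled multiply this until the process runs out of memory.
print(size)
```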

The solution is to run with the -s option and pipe stdout to a file. If failure logs are needed, run with --stop so that the bottom of the log has the relevant info. Still, there seems to be nothing out there on nose memory consumption. I would really like to know if anybody else has run into this.

So, do this:

nosetests -s --stop > stdout.log 2> error.log