Imports in Python

Python is a great language with a beautiful syntax, lots of useful libraries, and a great community.

Unfortunately, one of the things I like least is the way imports work.

In almost every large Python project I work on, I want to centralise a set of shared services that all submodules can use. An example of this is a module with a (static) function that initialises connections to databases and fetches the credentials by itself. This is unfortunately always a pain, as the module has to live in a folder shared by all the other modules.

I have thus come up with the easiest way to deal with the issue of Python imports, and that is appending the module paths to the system path.
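As a minimal sketch of the idea (the shared directory name here is purely illustrative):

```python
import os
import sys

# Build an absolute path to a hypothetical shared/ directory one level
# above the current working directory, then make it importable by
# appending it to Python's module search path.
shared_dir = os.path.abspath(os.path.join(os.getcwd(), "..", "shared"))
sys.path.append(shared_dir)

# Any module inside shared/ can now be imported directly, e.g.:
# import db_connections
```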

Let’s imagine the following scenario in our project structure:

We have a project named demo with a few Python (.py) files, which are officially called modules. Our main code is in one of the modules, and from it we want to import the others.

[Screenshot: folder structure of the demo project]

You can find the exact folder structure and the whole code for the demo in this GitHub repository.

Specific scenarios of possible imports are:

Please note that in all of these examples we run the script with a simple command, and that we run it inside the package1 directory. The directory from which a script is executed is very important in determining how module imports behave.

1. Importing from a module in the same directory

This is the easiest import to do:

import module_to_be_imported_same

2. Importing from a module in a sub-directory

This one is quite easy and native as well:

from package3 import module_to_be_imported_sub

3. Importing from a module in a parent directory

import sys
sys.path.append('..')  # append the parent directory to the module search path
import module_to_be_imported_parent

4. Importing from a module at the same level but in a different directory

import sys
sys.path.append('../package2')  # path to the sibling directory (package2 is illustrative)
import module_to_be_imported_cross


I really dislike that Python is not very intuitive (for someone with a Java background) when you want to share a single module across various other modules.

I come across this very often when maintaining a codebase of data pipelines (Python scripts, i.e. modules). Often logic is not shared between modules and is instead rewritten in every single place, which is very bad practice.

As a reminder, you can see the exact code and the whole setup in this repository.

If you have a better approach, I'd love to hear about it, especially for Python 3+ environments.

Posted in Data Engineering, Python

Tips for easier development on AWS

I wanted to share a couple of tips for easier development on AWS.

1. Use a local version of AWS for development and testing

Try something like localstack to stand up a local AWS environment. It runs AWS-API-compliant mock services on your local machine.

This way you can, e.g., create a Kinesis stream, put data into it, process it, and put the result into a local S3 bucket without spending any resources on AWS. Furthermore, this code can be turned into production code by just replacing the client instantiations.

Localstack can be easily installed with pip:

pip install localstack

Then you can run it with:

localstack start

These are the services that will spin up:

[Screenshot: list of mock AWS services started by localstack]

In the section below we will use boto3 to connect to a Kinesis stream run by localstack.

2. Use pyboto3 with Python in PyCharm for auto-completion

The boto3 library is very popular for development on AWS since it adapts quickly to AWS API changes. The issue is that boto3 generates its client methods dynamically at runtime from service definitions rather than implementing them as regular Python methods, so IDEs cannot introspect them.

If you like to use Python for development on AWS, you probably also want an IDE like PyCharm for its many features, including auto-completion. Unfortunately, auto-completion does not work for boto3.

So if you want to get auto-completion in PyCharm with boto3, I highly recommend using the pyboto3 library.

It’s extremely easy to install:

pip install pyboto3

Now, to use it with PyCharm, you should add a docstring type hint on the line below the boto3 client definition, like in this example with Kinesis:

[Screenshot: pyboto3 type hint next to a Kinesis client definition in PyCharm]


Posted in Big Data, Data Engineering

Apache Spark Presentation

I’ve published online the presentation on Apache Spark that I made for an introductory lecture to graduate students at Maastricht University.

If interested, please take a look at the presentation here.

Posted in Big Data, Data Engineering, Data Systems

My articles for Sonra Intelligence

Apache Airflow

Using Apache Airflow to build reusable ETL on AWS Redshift

Apache Kafka + Spark Streaming + Redshift

Streaming Tweets to Snowflake Data Warehouse with Spark Structured Streaming and Kafka

Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing

Snowflake Data Warehouse + Advanced SQL + Cloud data warehousing

Loading Data into Snowflake Data Warehouse and performance of joins

Caching in Snowflake Data Warehouse

The top 10+1 things we love about Snowflake

Learn Window Functions on Snowflake. Become a cloud data warehouse superhero

SpaceX Performance for Snowflake with Clustering Keys

Create your own custom aggregate (UDAF) and window functions in Snowflake

Posted in Big Data, Data Engineering, Data Systems, Data Warehousing

Loading Data into Snowflake Data Warehouse and performance of joins

I wrote a detailed article showing how to load 6GB of data into Snowflake using the PUT and COPY INTO commands.

Then I evaluated the performance of joins and how caching and instance size affects them.

You can find the full blog post here.

A small preview:

[Screenshot preview of the article]

Posted in Data Engineering, Data Systems, Data Warehousing

Caching in Snowflake Data Warehouse

I wrote a technical article covering how Snowflake uses caching on several layers (virtual warehouses caching data and caching of result sets).

In the article I also explain how this works and what the benefits of caching are.

You can read the full blog post here.

A small preview:

[Screenshot preview of the article]


Posted in Data Engineering, Data Systems, Data Warehousing

My favorite features of Snowflake Data Warehouse

I wrote a blog post describing my 10 favorite features of Snowflake.

You can find the full blog post here.

A small preview:

[Screenshot preview of the article]

Posted in Data Systems, Data Warehousing