Tips for easier development on AWS

I wanted to share a couple of tips for easier development on AWS.

1. Use a local version of AWS for development and testing

Try something like localstack to stand up a local AWS environment. It runs AWS-API-compliant mock services on your local machine.

This way you can, for example, create a Kinesis stream, put data into it, process it, and write the results to a local S3 bucket without spending anything on AWS. Furthermore, this code can be turned into production code simply by replacing the client instantiations, as shown below.
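Here is a minimal sketch of what that swap looks like with boto3. The endpoint port (4568 was localstack's default Kinesis port at the time of writing), region, and stream name are assumptions for illustration:

import boto3

# Point boto3 at localstack instead of AWS. Dropping endpoint_url
# (and using real credentials) turns this into production code.
kinesis = boto3.client(
    'kinesis',
    endpoint_url='http://localhost:4568',  # localstack's Kinesis endpoint (assumed default)
    region_name='us-east-1',
    aws_access_key_id='dummy',             # localstack accepts dummy credentials
    aws_secret_access_key='dummy',
)

kinesis.create_stream(StreamName='test-stream', ShardCount=1)
# On real AWS you would wait for the stream to become ACTIVE first.
kinesis.put_record(StreamName='test-stream', Data=b'hello', PartitionKey='1')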

Localstack can be easily installed with pip:

pip install localstack

Then you can run it with:

localstack start

These are the services that will spin up:

[Screenshot: localstack console output listing the mock services and the local ports they listen on]

In the section below we will use boto3 to connect to a Kinesis stream run by localstack.

2. Use pyboto3 with Python in PyCharm for auto-completion

The boto3 library is very popular for development on AWS since it adapts quickly to AWS API changes. The catch is that boto3 doesn't actually implement the methods specified in the API in its source code: clients are generated dynamically at runtime from service definition files, so there is nothing for static analysis to inspect.
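A quick way to see this from an interpreter (the region name is arbitrary):

import boto3

# The Kinesis client class is generated at runtime from a JSON service
# definition, so its methods are invisible to static analysis.
kinesis = boto3.client('kinesis', region_name='us-east-1')
print(type(kinesis))                   # <class 'botocore.client.Kinesis'>
print(hasattr(kinesis, 'put_record'))  # True, but only at runtime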

If you like to use Python for development on AWS, you probably also want to use an IDE like PyCharm for its many features, including auto-completion. Unfortunately, auto-completion does not work for boto3 out of the box.

So if you want to get auto-completion in PyCharm with boto3, I highly recommend using the pyboto3 library.

It’s extremely easy to install:

pip install pyboto3

Now, to use it in PyCharm, add a type-hint docstring on the line below your boto3 client definition, like in this example with Kinesis:

[Screenshot: boto3 Kinesis client definition annotated with a pyboto3 type hint]
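In code, the pattern looks roughly like this (a sketch based on pyboto3's documented docstring type hint, assuming pyboto3 ships a kinesis stub and reusing the localstack endpoint from above):

import boto3

kinesis = boto3.client('kinesis', endpoint_url='http://localhost:4568',
                       region_name='us-east-1')
""" :type : pyboto3.kinesis """

# With the type hint above, PyCharm resolves the client against pyboto3's
# stub module and auto-completes methods such as put_record.
kinesis.put_record(StreamName='test-stream', Data=b'hello', PartitionKey='1')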

 

Posted in Big Data, Data Engineering

Apache Spark Presentation

I’ve published online the presentation on Apache Spark that I made for an introductory lecture to graduate students at Maastricht University.

If interested, please take a look at the presentation here.

Posted in Big Data, Data Engineering, Data Systems

My articles for Sonra Intelligence

Apache Airflow

  • Using Apache Airflow to build reusable ETL on AWS Redshift

Apache Kafka + Spark Streaming + Redshift

  • Streaming Tweets to Snowflake Data Warehouse with Spark Structured Streaming and Kafka
  • Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing

Snowflake Data Warehouse + Advanced SQL + Cloud data warehousing

  • Loading Data into Snowflake Data Warehouse and performance of joins
  • Caching in Snowflake Data Warehouse
  • The top 10+1 things we love about Snowflake
  • Learn Window Functions on Snowflake. Become a cloud data warehouse superhero
  • SpaceX Performance for Snowflake with Clustering Keys
  • Create your own custom aggregate (UDAF) and window functions in Snowflake

Posted in Big Data, Data Engineering, Data Systems, Data Warehousing

Loading Data into Snowflake Data Warehouse and performance of joins

I wrote a detailed article showing how to load 6GB of data into Snowflake using the PUT and COPY INTO commands.
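As a rough sketch of those two commands, here run through the Snowflake Python connector (the connection parameters, file path, and table name are placeholders I've made up):

import snowflake.connector

# Placeholder credentials -- fill in your own account details.
conn = snowflake.connector.connect(
    user='<user>', password='<password>', account='<account>',
    warehouse='<warehouse>', database='<database>', schema='<schema>',
)
cur = conn.cursor()

# PUT uploads (and by default compresses) local files to the table's stage.
cur.execute("PUT file:///tmp/data/*.csv @%my_table")

# COPY INTO bulk-loads the staged files into the table.
cur.execute(
    "COPY INTO my_table FROM @%my_table "
    "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
)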

Then I evaluated the performance of joins and how caching and instance size affect them.

You can find the full blog post here.

Posted in Data Engineering, Data Systems, Data Warehousing

Caching in Snowflake Data Warehouse

I wrote a technical article covering how Snowflake uses caching at several layers (virtual warehouses caching table data and caching of result sets).

In the article I also explain how this works and what the benefits of caching are.

You can read the full blog post here.

Posted in Data Engineering, Data Systems, Data Warehousing

My favorite features of Snowflake Data Warehouse

I wrote a blog post describing my 10 favorite features of Snowflake.

You can find the full blog post here.

Posted in Data Systems, Data Warehousing

Using Spark Structured Streaming to upsert Kafka messages into a database

I wrote a detailed and technical blog post demonstrating an integration of Spark Structured Streaming with Apache Kafka and Snowflake.

An overview of the content:

  • querying Twitter API for realtime tweets
  • setting up a Kafka server
  • producing messages with Kafka
  • consuming and parsing Kafka messages with Spark Structured Streaming (see the sketch after this list)
  • explanation of the streaming model of Spark Structured Streaming
  • upserting latest data to Snowflake
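To give a flavor of the consuming step, here is a minimal sketch of reading a Kafka topic with Spark Structured Streaming (the broker address and topic name are assumptions, and the spark-sql-kafka package must be on the classpath):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-tweets").getOrCreate()

# Subscribe to the Kafka topic as an unbounded streaming DataFrame.
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tweets")
          .load())

# Kafka delivers keys and values as binary; cast the value to a string
# before parsing the tweet JSON downstream.
parsed = tweets.select(col("value").cast("string").alias("tweet_json"))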

You can find the full blog post here.

Posted in Big Data, Data Engineering, Python