Apache Spark Presentation

I’ve published online the presentation on Apache Spark that I prepared for an introductory lecture to graduate students at Maastricht University.

If interested, please take a look at the presentation here.

Posted in Big Data, Data Engineering, Data Systems

My articles for Sonra Intelligence

Apache Airflow

Using Apache Airflow to build reusable ETL on AWS Redshift

Apache Kafka + Spark Streaming + Redshift

Streaming Tweets to Snowflake Data Warehouse with Spark Structured Streaming and Kafka

Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing

Snowflake Data Warehouse + Advanced SQL + Cloud data warehousing

Loading Data into Snowflake Data Warehouse and performance of joins

Caching in Snowflake Data Warehouse

The top 10+1 things we love about Snowflake

Learn Window Functions on Snowflake. Become a cloud data warehouse superhero

SpaceX Performance for Snowflake with Clustering Keys

Create your own custom aggregate (UDAF) and window functions in Snowflake

Posted in Big Data, Data Engineering, Data Warehousing, Data Systems

Loading Data into Snowflake Data Warehouse and performance of joins

I wrote a detailed article showing how to load 6GB of data into Snowflake using the PUT and COPY INTO commands.
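As a rough illustration of the two-step load, here is a minimal sketch that only builds the PUT and COPY INTO statements as strings. The file path, the table name `orders`, the use of the table stage (`@%orders`) and the CSV file format options are assumptions made for this example, not the setup used in the article.

```python
from typing import List


def build_load_statements(local_file: str, table: str) -> List[str]:
    """Build the two statements for loading one local CSV file into a table.

    Assumption: the file is uploaded to the table's own stage (@%table).
    """
    stage = f"@%{table}"
    # Step 1: PUT uploads the local file to the stage.
    put_stmt = f"PUT file://{local_file} {stage}"
    # Step 2: COPY INTO loads the staged file into the table.
    copy_stmt = (
        f"COPY INTO {table} FROM {stage} "
        "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)"
    )
    return [put_stmt, copy_stmt]


# Hypothetical file and table names, for illustration only.
statements = build_load_statements("/tmp/orders.csv", "orders")
for stmt in statements:
    print(stmt)
```

The statements would then be run through a Snowflake client such as SnowSQL; the sketch stops at building the strings so that it runs without a Snowflake account.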

Then I evaluated the performance of joins and how caching and instance size affect them.

You can find the full blog post here.


Posted in Data Engineering, Data Systems, Data Warehousing

Caching in Snowflake Data Warehouse

I wrote a technical article covering how Snowflake uses caching on several layers (data cached by virtual warehouses and cached result sets).

In the article I also explain how this works and what the benefits of caching are.

You can read the full blog post here.


Posted in Data Engineering, Data Systems, Data Warehousing

My favorite features of Snowflake Data Warehouse

I wrote a blog post describing my 10 favorite features of Snowflake.

You can find the full blog post here.


Posted in Data Systems, Data Warehousing

Using Spark Structured Streaming to upsert Kafka messages into a database

I wrote a detailed, technical blog post demonstrating how to integrate Spark Structured Streaming with Apache Kafka and Snowflake.

An overview of the content is:

  • querying the Twitter API for real-time tweets
  • setting up a Kafka server
  • producing messages with Kafka
  • consuming and parsing Kafka messages with Spark Structured Streaming
  • explanation of the streaming model of Spark Structured Streaming
  • upserting latest data to Snowflake
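To make the last step more concrete, here is a minimal sketch of upsert semantics in plain Python: each micro-batch either inserts a new key or overwrites an existing one when the incoming record is at least as recent. The field names `id`, `ts` and `text` are hypothetical, and the in-memory dict only stands in for the target table. It is not the Snowflake code from the post.

```python
from typing import Dict, Iterable


def upsert_latest(target: Dict[str, dict], batch: Iterable[dict]) -> Dict[str, dict]:
    """Insert new rows and overwrite existing rows only with newer data."""
    for row in batch:
        current = target.get(row["id"])
        # Insert when the key is new, update when the incoming row is not older.
        if current is None or row["ts"] >= current["ts"]:
            target[row["id"]] = row
    return target


table: Dict[str, dict] = {}
# First micro-batch: both keys are new, so both rows are inserted.
upsert_latest(table, [
    {"id": "a", "ts": 1, "text": "first"},
    {"id": "b", "ts": 1, "text": "other"},
])
# Second micro-batch: "a" gets a newer version, while the stale "b" is ignored.
upsert_latest(table, [
    {"id": "a", "ts": 2, "text": "edited"},
    {"id": "b", "ts": 0, "text": "stale"},
])
print(table)
```

In a database the same idea is usually expressed with a MERGE statement keyed on the record id.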

You can find the full blog post here.


Posted in Big Data, Data Engineering, Python

Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing

I wrote a blog post demonstrating advanced Spark Structured Streaming topics.

An overview of the content is:

  • setting up a Kafka server
  • producing messages with Kafka
  • consuming tweets with Spark Structured Streaming
  • watermarking messages
  • parsing JSON data
  • performing aggregation queries on the stream of data
  • analyzing execution plans of queries
  • upserting data to Snowflake
  • checkpointing a structured stream
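The interplay of watermarking and aggregation can be sketched in plain Python. This is a simplified model and not Spark's exact semantics: here the watermark is recomputed for every event as the largest event time seen so far minus an allowed lateness, whereas Spark updates it between micro-batches. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def windowed_counts(event_times, window_size, allowed_lateness):
    """Count events per tumbling window, dropping events behind the watermark."""
    counts = defaultdict(int)
    max_event_time = float("-inf")
    dropped = 0
    for event_time in event_times:
        # Watermark: how far behind the newest event we still accept data.
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            dropped += 1  # too late, its window is considered closed
            continue
        # Assign the event to the tumbling window that contains it.
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


# Event times arrive out of order; the final event (11) is behind the watermark.
counts, dropped = windowed_counts(
    [1, 4, 12, 3, 25, 11], window_size=10, allowed_lateness=10
)
print(counts, dropped)
```

The late event with time 3 is still counted because it arrives within the allowed lateness, while the event with time 11 arrives after the watermark has moved to 15 and is dropped.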

You can find the full blog post here.


Posted in Big Data, Data Engineering, Python