Loading Data into Snowflake Data Warehouse and performance of joins

I wrote a detailed article showing how to load 6GB of data into Snowflake using the PUT and COPY INTO commands.
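As a rough illustration of the two commands named above, the following sketch builds a PUT and a COPY INTO statement as plain strings. The file path, the table name `trips`, and the CSV options are hypothetical and not taken from the article. Running the statements would need a Snowflake client session (for example SnowSQL), which is not shown here.

```python
def put_statement(local_path: str, table: str) -> str:
    """Build a PUT that uploads a local file to the table's internal stage."""
    # "@%<table>" addresses the stage that belongs to the table itself.
    return f"PUT file://{local_path} @%{table} AUTO_COMPRESS = TRUE"


def copy_statement(table: str) -> str:
    """Build a COPY INTO that loads the staged CSV files into the table."""
    # The CSV options below are assumptions made for this example only.
    return (
        f"COPY INTO {table} FROM @%{table} "
        "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)"
    )


if __name__ == "__main__":
    print(put_statement("/tmp/trips.csv", "trips"))
    print(copy_statement("trips"))
```

The split mirrors the two-step flow: PUT moves the file into a stage, and COPY INTO reads from that stage into the table.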

Then I evaluated the performance of joins and how caching and instance size affect them.

You can find the full blog post here.


Posted in Data Engineering, Data Systems, Data Warehousing

Caching in Snowflake Data Warehouse

I wrote a technical article covering how Snowflake uses caching at several layers: data cached in virtual warehouses and cached result sets.

In the article I also explain how this works and what the benefits of caching are.

You can read the full blog post here.


Posted in Data Engineering, Data Warehousing, Data Systems

My favorite features of Snowflake Data Warehouse

I wrote a blog post describing my 10 favorite features of Snowflake.

You can find the full blog post here.


Posted in Data Systems, Data Warehousing

Using Spark Structured Streaming to upsert Kafka messages into a database

I wrote a detailed, technical blog post demonstrating how to integrate Spark Structured Streaming with Apache Kafka and Snowflake.

An overview of the content is:

  • querying the Twitter API for real-time tweets
  • setting up a Kafka server
  • producing messages with Kafka
  • consuming and parsing Kafka messages with Spark Structured Streaming
  • explanation of the streaming model of Spark Structured Streaming
  • upserting latest data to Snowflake
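
The last step in the list, upserting the latest data, can be sketched without any of the services involved. The snippet below is plain Python, not the Spark or Snowflake code from the post, and the record fields `id`, `ts`, and `text` are hypothetical. It keeps one row per key and overwrites a row only when a record with the same or a newer timestamp arrives, which is roughly what a MERGE keyed on `id` with a timestamp condition would do.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def upsert(table: Dict[str, Record], batch: Iterable[Record]) -> Dict[str, Record]:
    """Insert new keys and update existing keys when the record is not older."""
    for rec in batch:
        current = table.get(rec["id"])
        # Insert when the key is unseen, update when the record is at least as new.
        if current is None or rec["ts"] >= current["ts"]:
            table[rec["id"]] = rec
    return table


if __name__ == "__main__":
    table: Dict[str, Record] = {}
    upsert(table, [{"id": "a", "ts": 1, "text": "first"}])
    upsert(table, [{"id": "a", "ts": 2, "text": "edited"}])
    print(table["a"]["text"])
```

Stale records are ignored rather than rejected, so replaying an old micro-batch leaves the table unchanged.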

You can find the full blog post here.


Posted in Big Data, Data Engineering, Python

Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing

I wrote a blog post demonstrating advanced Spark Structured Streaming topics.

An overview of the content is:

  • setting up a Kafka server
  • producing messages with Kafka
  • consuming tweets with Spark Structured Streaming
  • watermarking messages
  • parsing JSON data
  • performing aggregation queries on the stream of data
  • analyzing execution plans of queries
  • upserting data to Snowflake
  • checkpointing a structured stream
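
To make the watermarking and aggregation items above more concrete, here is a simplified plain-Python model of a windowed count with a watermark. It is not Spark code and not the code from the post: the window length, the allowed delay, and the class name `WindowedCounter` are assumptions, and the real engine tracks considerably more state. The idea it tries to capture is that the watermark trails the largest event time seen so far, and events whose window has already closed behind the watermark are dropped.

```python
from collections import defaultdict
from typing import Dict, Iterable

WINDOW = 10  # tumbling window length in seconds (hypothetical)
DELAY = 5    # how far the watermark trails the newest event (hypothetical)


class WindowedCounter:
    """Counts events per tumbling window and drops events that arrive too late."""

    def __init__(self) -> None:
        self.counts: Dict[int, int] = defaultdict(int)  # window start -> count
        self.max_event_time = 0
        self.watermark = 0
        self.dropped = 0

    def process_batch(self, event_times: Iterable[int]) -> None:
        for t in event_times:
            window_start = t - t % WINDOW
            # A window is closed once its end is at or behind the watermark.
            if window_start + WINDOW <= self.watermark:
                self.dropped += 1
                continue
            self.counts[window_start] += 1
            self.max_event_time = max(self.max_event_time, t)
        # In this model the watermark only advances between batches.
        self.watermark = max(self.watermark, self.max_event_time - DELAY)


if __name__ == "__main__":
    counter = WindowedCounter()
    for batch in ([1, 4, 12], [3, 25], [8, 19, 21]):
        counter.process_batch(batch)
    print(dict(counter.counts), counter.dropped)
```

Advancing the watermark only after a batch means a late event in the same batch as a much newer one is still counted, which is the trade-off this sketch makes for simplicity.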

You can find the full blog post here.


Posted in Big Data, Data Engineering, Python

Clustering keys in Snowflake

I’ve written a blog post covering in depth the new Clustering Keys feature of Snowflake.

In the post I explain the internals of how clustering keys work in Snowflake's shared-disk architecture and compare them to Redshift.

I also compared and analyzed the execution plans of queries that used clustering with those that didn’t.
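
A toy model can show why clustering changes an execution plan. The sketch below is plain Python and not taken from the post; the partition size and the sample values are hypothetical. It stores only a (min, max) pair per partition, as a stand-in for partition metadata, and counts how many partitions a range filter would have to scan when the data is unsorted versus sorted on the filter column.

```python
from typing import List, Tuple

Meta = List[Tuple[int, int]]


def partition(values: List[int], size: int) -> Meta:
    """Split values into consecutive chunks and keep (min, max) per chunk."""
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def partitions_scanned(meta: Meta, lo: int, hi: int) -> int:
    """Count partitions whose value range overlaps the filter range [lo, hi]."""
    return sum(1 for mn, mx in meta if mx >= lo and mn <= hi)


if __name__ == "__main__":
    unsorted_vals = [7, 1, 9, 3, 8, 2, 6, 4, 5, 0, 11, 10]
    clustered_vals = sorted(unsorted_vals)
    print(partitions_scanned(partition(unsorted_vals, 3), 4, 5))
    print(partitions_scanned(partition(clustered_vals, 3), 4, 5))
```

With the sample data, every partition overlaps the filter range when the values are unsorted, while only one does once they are sorted, so the rest can be skipped from metadata alone.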

You can find the full blog post here.


Posted in Big Data, Data Engineering

Writing UDAFs on Snowflake

I’ve written a blog post explaining in depth how to create User Defined Aggregate Functions (UDAFs) using JavaScript on Snowflake.

You can find the blog post here.


Posted in Big Data, Data Engineering