Category Archives: Data Engineering
Loading Data into Snowflake Data Warehouse and performance of joins
I wrote a detailed article showing how to load 6GB of data into Snowflake using the PUT and COPY INTO commands. Then I evaluated the performance of joins and how caching and instance size affect them. You find the full … Continue reading
Caching in Snowflake Data Warehouse
I wrote a technical article covering how Snowflake uses caching on several layers (data caching in virtual warehouses and caching of result sets). In the article I also explain how this works and what the benefits of caching are. You can read … Continue reading
Using Spark Structured Streaming to upsert Kafka messages into a database
I wrote a detailed and technical blog post demonstrating an integration of Spark Structured Streaming with Apache Kafka messages and Snowflake. An overview of the content is: querying Twitter API for realtime tweets; setting up a Kafka server; producing messages … Continue reading
Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing
I wrote a blog post demonstrating advanced Spark Structured Streaming topics. An overview of the content is: setting up a Kafka server; producing messages with Kafka; consuming tweets with Spark Structured Streaming; watermarking messages; parsing JSON data; performing aggregation queries … Continue reading
Clustering keys Snowflake
I’ve written a blog post covering in depth the new Clustering Keys feature of Snowflake. In the post I explained the internals of how clustering keys work in the shared-disk architecture of Snowflake and compared it to Redshift. I also … Continue reading
Writing UDAFs on Snowflake
I’ve written a blog post explaining in depth how to create User Defined Aggregate Functions (UDAFs) using JavaScript on Snowflake. You can find the blog post here. A small preview:
Apache Airflow for data pipelines and ETL management
I have been looking for good workflow management software and found Apache Airflow to be superior to other solutions. I’ve taken some time to write a pretty detailed blog post on using Airflow for development of ETL pipelines. Airflow is … Continue reading