Category Archives: Data Engineering
I wanted to share a couple of tips for easier development on AWS. 1. Use a local version of AWS for development and testing: try something like LocalStack to stand up a local AWS environment. This will run AWS API-compliant … Continue reading
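As a rough illustration of the first tip, here is a minimal sketch of how a script might switch between real AWS and a local stand-in. It assumes LocalStack is running on its default edge port (4566) and that boto3 is the client library; the helper name `client_kwargs` and the dummy credentials are made up for this example and are not taken from the post.

```python
# Assumption: LocalStack is listening locally on its default edge port.
LOCAL_ENDPOINT = "http://localhost:4566"


def client_kwargs(use_local: bool) -> dict:
    """Build keyword arguments for boto3.client().

    Hypothetical helper: returns overrides that point the client at the
    local endpoint, or an empty dict to fall back to real AWS defaults.
    """
    if not use_local:
        return {}
    return {
        "endpoint_url": LOCAL_ENDPOINT,
        "region_name": "us-east-1",
        # Placeholder credentials for local use only, never real keys.
        "aws_access_key_id": "test",
        "aws_secret_access_key": "test",
    }


# Usage (requires the third-party boto3 package and a running LocalStack):
#   import boto3
#   s3 = boto3.client("s3", **client_kwargs(use_local=True))
#   s3.create_bucket(Bucket="demo-bucket")
```

Keeping the endpoint override in one place means the same code path can run against the local environment in tests and against AWS in production.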
I’ve published the presentation on Apache Spark that I prepared for an introductory lecture for graduate students at Maastricht University. If you are interested, please take a look at the presentation here.
Apache Airflow: Using Apache Airflow to build reusable ETL on AWS Redshift. Apache Kafka + Spark Streaming + Redshift: Streaming Tweets to Snowflake Data Warehouse with Spark Structured Streaming and Kafka; Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing. Snowflake … Continue reading
I wrote a detailed article showing how to load 6GB of data into Snowflake using the PUT and COPY INTO commands. I then evaluated the performance of joins and how caching and instance size affect it. You can find the full … Continue reading
I wrote a technical article covering how Snowflake uses caching on several layers (data caching in virtual warehouses and caching of result sets). In the article I also explain how this works and what the benefits of caching are. You can read … Continue reading
I wrote a detailed technical blog post demonstrating an integration of Spark Structured Streaming with Apache Kafka messages and Snowflake. An overview of the content: querying the Twitter API for realtime tweets, setting up a Kafka server, producing messages … Continue reading
I wrote a blog post demonstrating advanced Spark Structured Streaming topics. An overview of the content: setting up a Kafka server, producing messages with Kafka, consuming tweets with Spark Structured Streaming, watermarking messages, parsing JSON data, performing aggregation queries … Continue reading