Category Archives: Python

Using Spark Structured Streaming to upsert Kafka messages into a database

I wrote a detailed technical blog post demonstrating how to integrate Spark Structured Streaming with Apache Kafka and Snowflake. An overview of the content: querying the Twitter API for real-time tweets, setting up a Kafka server, producing messages …

Posted in Big Data, Data Engineering, Python

Advanced Spark Structured Streaming – Aggregations, Joins, Checkpointing

I wrote a blog post demonstrating advanced Spark Structured Streaming topics. An overview of the content: setting up a Kafka server, producing messages with Kafka, consuming tweets with Spark Structured Streaming, watermarking messages, parsing JSON data, performing aggregation queries …

Posted in Big Data, Data Engineering, Python

Apache Airflow for data pipelines and ETL management

I have been looking for good workflow management software and found Apache Airflow to be superior to the other solutions I tried. I've taken some time to write a detailed blog post on using Airflow to develop ETL pipelines. Airflow is …

Posted in Big Data, Data Engineering, Python

Advanced data analysis for cBioPortal

As part of my application for the cBioPortal Google Summer of Code position, I made this Jupyter notebook, in which I demonstrated how different studies (the RDBMS equivalent of a table) could be clustered together into types of studies based on the attributes …

Posted in Data Engineering, Data science, Python