Blog migrated

The blog has been migrated to a new GitHub Pages website

Posted in Uncategorized

My favourite papers on columnar databases

The Design and Implementation of Modern Column-Oriented Database Systems

  • Probably the best paper to read

Massively Parallel Databases and MapReduce Systems

  • A very interesting comparison of MPP databases (a bit outdated)

Posted in Data Systems

Modern data engineering stack

Data storage

  • Data lake: Object store + storing data in a columnar format
    • examples:
      • S3 + Athena (Presto), sketched briefly after this list
    • PROs:
      • cheaper than a data warehouse
      • easier to store very complicated data (eg. arrays in a column) than in most data warehouses
    • CONs
      • worse performance than a data warehouse
      • can become very messy without a good data catalog and understanding of schemas
    • I would always choose a data warehousing solution over Parquet files in Object storage, though Apache Iceberg is a very promising solution in this arena…
  • Data warehouse
    • example:
      • Snowflake
      • Google BigQuery
      • Amazon Redshift
      • ClickHouse (a great solution, but not ANSI SQL compliant and with disappointing window function support in 2023!)
    • PROs
      • very fast and efficient
    • CONs
      • more expensive storage
      • less flexible schemas
    • Note
      • prefer data warehouses where the storage is fundamentally decoupled from the compute (eg. use Snowflake and not Redshift)
  • Things to avoid:
    • using a traditional RDBMS (SQL Server, Oracle, PostgreSQL) for storing data on which only analytics will be done
      • the traditional databases do not scale well (though if you have smaller data I would still entertain them) and are just not optimised for this use case no matter how many optimisations they now have (parallel execution, materialized views, column store indexes, roll-ups using upserts, B+ tree indexes…)
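To make the data lake option above concrete, here is a minimal sketch of the S3 + Athena pattern in Python using the awswrangler (AWS SDK for pandas) package; the bucket, database and table names are made up for illustration:

import awswrangler as wr

# Read a partitioned Parquet dataset straight out of S3 into a pandas DataFrame.
orders = wr.s3.read_parquet(path="s3://my-data-lake/raw/orders/", dataset=True)

# Or push the work down to Athena (Presto), assuming the location is already
# registered as a table in the Glue catalog.
daily_counts = wr.athena.read_sql_query(
    sql="SELECT ingest_date, count(*) AS n FROM orders GROUP BY ingest_date",
    database="analytics_lake",
)
print(orders.shape, daily_counts.head())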

Data ingestion

  • ELT vs ETL
    • Always prefer the ELT approach when loading into a data warehouse
      • with an ELT approach you are more flexible, the loading process is faster, and data warehouses are better at doing (batch) transformations than ETL tools
  • Depending on the data source
    • Is the data source from outside of your organisation and somewhat “popular” (e.g. Stripe, Adyen, blaze…)?
      • Easy solution: use companies that have built these connectors, e.g. Fivetran, Stitch… though the final decision depends on your budget
    • On-premise, custom data source
      • write your own connectors, probably simplest in Python (a minimal sketch follows this list)
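A minimal sketch of such a hand-written connector, in the ELT spirit (extract and load only, transformations happen later in the warehouse); the API endpoint, bucket and key layout are made up for illustration:

import datetime
import json

import boto3
import requests

def extract(api_url: str, day: str) -> list[dict]:
    """Pull one day of records from the source API, following pagination."""
    records, page = [], 1
    while True:
        resp = requests.get(api_url, params={"date": day, "page": page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return records
        records.extend(batch)
        page += 1

def load_raw(records: list[dict], bucket: str, day: str) -> None:
    """Land the raw records as newline-delimited JSON, partitioned by day."""
    body = "\n".join(json.dumps(r) for r in records)
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"raw/orders/ingest_date={day}/data.json",  # hypothetical layout
        Body=body.encode("utf-8"),
    )

if __name__ == "__main__":
    day = datetime.date.today().isoformat()
    load_raw(extract("https://internal-api.example.com/orders", day), "my-data-lake", day)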

Tasks scheduling

  • Ideally you could use an automated data movement platform (e.g. Fivetran, Stitch, Meltano…) for most of the data ingestion tasks
  • For other custom tasks I like to use Airflow, ideally in a cloud setting (so that Airflow is managed by the cloud provider) and using the Kubernetes executor (which allows it to scale very easily)
  • For managing and executing SQL views and transformations use dbt. It’s a phenomenal tool. A minimal DAG sketch follows this list.
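A minimal Airflow DAG sketch tying these pieces together: a Python ingestion task followed by a dbt run. The task names and the dbt project path are made up, and a DAG like this runs unchanged under the Kubernetes executor:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def ingest_orders(**context):
    # call your custom connector (or trigger a Fivetran/Meltano sync) here
    ...

with DAG(
    dag_id="daily_orders",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_orders", python_callable=ingest_orders)
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/analytics",  # hypothetical path
    )
    ingest >> transform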
Posted in Data Engineering, Data Systems, Data Warehousing

Important skills for data engineers

Data engineering specific:

  • SQL (advanced SQL, database internals, query tuning, administration…)
  • Good coding (Python + a JVM language)
  • Data pipeline design
  • Data modeling techniques 
  • Architectural projections
  • Business domain knowledge
  • Communication skills

Secondary:

  • Software engineering (clean code, OOP, logging, monitoring…)
  • DevOps (Unix, infrastructure management, deployment…)
  • Data science/Business intelligence
  • Visualisation (Front-end code or just GUIs)

Good to know

  • Cloud / K8s (but really depends on the environment and the size of DevOps team)
Posted in Data Engineering, Data Systems, Data Warehousing

Data pipeline design anti-patterns

  • Bunch of scripts
  • Single run-everything script
  • Hacky “homemade” dependency control
  • Hard-coded values eg. data sources, queries… 
  • ETL jobs that create duplicate data if run twice or create holes in the data if not run at all
  • No data quality checks or data ingestion monitoring
Posted in Big Data, Data Engineering, Data Systems

Data extraction and transformation design patterns

  • You will need to choose between full and incremental extracts
    • Always prefer full extracts and partitioned snapshots if the data is small enough
    • If this is not possible, then design an incremental approach
  • Extract data in an idempotent manner
    • Eg. extract data for a specific timeframe from the data source and make sure that in your data store you have the same data in the same timeframe (eg. same size, distinct values in columns…)
    • Eg. perform incremental extracts using row IDs, so if you have a set of rows in the data source, do a set difference to determine which rows should be ingested
      • Otherwise maybe incrementally ingest using the created/modified timestamps, but only if they are reliable. After such an extraction, window functions are necessary for data deduplication (choosing the latest record)
  • Use a staging table for the output of an ETL job (a sketch follows this list)
    • Eg. if you recompute an output table, first create the temporary output table, then check its values are good before replacing the actual output table
  • Use partitions for per-partition data extraction
    • Eg. an ETL gets data every 1 hour or 1 day from the data source and puts it into the desired output
  • Build a dashboard to visualise input data 
  • Reproducibility through immutable datasets
    • Snapshot your extracts and dimensions
    • Have a clear data lineage
  • Favor ELT over ETL
    • Put work into the database engine
    • Have a specific ingestion staging area to put data into
  • Favor SQL over anything else for transformation
  • Add data quality tests and measures during every ingest
    • If the data quality of some source system is below a threshold, throw that data away
  • Create data ingestion SLAs
    • A script that checks the latest timestamp for some data 
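A minimal sketch combining several of the patterns above (idempotent per-partition load, a staging table and a quality check before the swap), assuming a PostgreSQL-compatible warehouse; the table and column names are made up:

import psycopg2

def load_partition(conn, day: str) -> None:
    with conn.cursor() as cur:
        # 1. Rebuild the day's data in a staging table.
        cur.execute("DROP TABLE IF EXISTS stg_orders_daily;")
        cur.execute(
            """
            CREATE TABLE stg_orders_daily AS
            SELECT order_id, customer_id, amount, %s::date AS day
            FROM raw_orders
            WHERE created_at >= %s::date AND created_at < %s::date + 1;
            """,
            (day, day, day),
        )
        # 2. Data quality check before touching the target table.
        cur.execute("SELECT count(*) FROM stg_orders_daily;")
        if cur.fetchone()[0] == 0:
            raise ValueError(f"no rows staged for {day}, aborting load")
        # 3. Idempotent swap: remove the partition's old rows, insert the fresh copy.
        cur.execute("DELETE FROM orders_daily WHERE day = %s::date;", (day,))
        cur.execute("INSERT INTO orders_daily SELECT * FROM stg_orders_daily;")
    conn.commit()

with psycopg2.connect("dbname=analytics") as conn:
    load_partition(conn, "2023-01-15")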
Posted in Data Engineering, Data Systems, Data Warehousing

General guidelines for design of batch jobs

  • Break the process down into components
    • The UNIX way of doing things
    • Eg. the first step is ingestion, the second takes the ingested data and filters it…
  • Idempotent and deterministic ETL tasks
    • Have no side effects
    • Use immutable storage
    • Usually target a single partition
    • Never UPDATE, INSERT, DELETE (mutations)
    • Limit the number of source partitions (no wide-scans)
    • Transform your data, don’t overwrite
    • All jobs must be idempotent, meaning that no matter how many times you run them, the state after N runs is the same as after N + M runs
  • Use configuration files to define your ETL (a sketch follows this list)
    • Do not code to a specific data source, query, file format, location…
    • Code the dependencies of the job in an external file
      • eg. inputs, outputs, output file format…
  • Engineer to anticipate failures
    • Use retries (potentially exponential)
      • Expect external connections to fail 
      • Expect data not to be there
  • The workflow tool you use needs to have:
    • Basic dependency management
    • Clarity on status
    • Scheduling (eg. similar to crontab)
    • Easy access to logs
    • Parametrized retries
    • Notifications (errors, success)
    • Retries of jobs
  • Make sure your code base handles:
    • Testing – test the code, especially have integration tests
    • Logging – log everything (and ideally ship the logs somewhere central)
    • Packaging
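A minimal sketch of a config-driven task with exponential-backoff retries, in the spirit of the guidelines above; the config keys and source URL are made up:

import json
import time

import requests

def with_retries(fn, attempts: int = 5, base_delay: float = 1.0):
    """Retry fn with exponential backoff; external connections are expected to fail."""
    for attempt in range(attempts):
        try:
            return fn()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

def run_job(config_path: str) -> None:
    # All dependencies of the job (inputs, outputs, formats) come from a config
    # file, not from values hard-coded in the job itself.
    with open(config_path) as f:
        cfg = json.load(f)
    data = with_retries(lambda: requests.get(cfg["source_url"], timeout=30).json())
    with open(cfg["output_path"], "w") as out:
        json.dump(data, out)

run_job("jobs/orders_ingest.json")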
Posted in Data Engineering, Data Systems, Data Warehousing

Kubernetes auto-scaling on relative resource usage

I am working on auto-scaling in Kubernetes and was seeing something weird in my system. Only now did I realise that auto-scaling based on relative metrics (like CPU usage) really has to be done against requests instead of limits: the requested resources are guaranteed, but the headroom between the requests and the limit is not, and depends on the Kubernetes scheduler, whether there are overcommits, and so on (described well in this article).
So if the CPU usage for the auto-scaler is calculated as:
 sum(cpu_usage_per_container/cpu_limits_per_container)
then if you don’t have enough resources on each node to satisfy all the limits, scaling up the number of instances really only scales up the limits, without necessarily giving the application a boost, due to potential overcommits and the scheduling/prioritisation complexities this brings.
In the end I changed the definition of the deployment resource I want to scale so that limits = requests, but the most important bit is to calculate CPU usage based on the requested resources and not the limits.
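With that change the ratio the auto-scaler effectively works with becomes:
 sum(cpu_usage_per_container/cpu_requests_per_container)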
containers:
  - name: webapp
    image: ...
    imagePullPolicy: ...
    resources:
      # ensure limits are the same as requests !!
      # the scheduler doesn't guarantee that the limit resources will be honored
      limits:
        memory: "200Mi"
        cpu: "200m"
      requests:
        memory: "200Mi"
        cpu: "200m"
Posted in Data Engineering, Data Systems, Uncategorized

My view on responsibilities of a modern data engineer

In my opinion one of the main responsibilities of a data engineer is to prepare the infrastructure and clean data for efficient use by various stakeholders (analysts, data scientists…).

Core responsibilities of an analytics focused data engineer are:

  • Setting up and managing the data warehouse (including ensuring HA, backups…)
    • requires a good knowledge of databases and their internals
    • most of the textbooks are outdated as they mainly cover traditional RDBMS systems, but there is a great course by Andy Pavlo from CMU
  • Setting up and managing the data infrastructure (messaging systems, object storage, compute engines…)
    • requires knowledge of cloud vendors’ tools and their internals (eg. how is Kinesis different from Kafka?), infrastructure provisioning (Terraform, Ansible) and potentially tools like Kubernetes for custom applications
  • Designing a high-level architecture of the data warehouse and writing guidelines for managing the contents
    • requires a lot of prior experience and business specific knowledge
    • a good source is the book The Data Warehouse Toolkit by Ralph Kimball though some concepts might be outdated (handling of slowly changing dimensions)
  • Handling performance issues in the data access patterns of end users 
  • Ingesting data into the data warehouse
    • whilst a data engineer should be very good at this, the real goal is to empower the end-users to do this themselves by using (or writing) frameworks that allow declarative definitions of data pipelines
    • a data engineer should know when to use tools like Fivetran, Alooma… (which should be preferred for modern SaaS data sources) to ingest data and when to roll up one’s sleeves
    • Maxime Beauchemin’s blog posts on functional data engineering are a good source of information on this
  • Transforming the data in the data warehouse
    • whilst a data engineer should be very good at this, the real goal is to empower the end-users to do this themselves using tools like dbt
    • a great book on SQL is T-SQL Querying by Itzik Ben-Gan
  • Setting up and managing production workloads and the related infrastructure
    • examples:
      • setting up Apache Superset along with all the security, permission and other concerns
      • setting up an internal python package repository
      • creating build pipelines (eg. in Jenkins)
      • managing the production job scheduler (eg. Airflow)

Data engineering tasks sometimes overlap with the usual back-end development (and Dev-ops) tasks such as:

  • Back-end development
    • building back-end APIs to serve data (eg. using Flask or Spring Boot)
      • usually this won’t be for a typical “operational” front-end system
    • managing streaming data (eg. consuming Kafka messages in real-time and acting on them)
    • building a caching layer (Redis, Plasma cache…)
    • building a search layer (Elasticsearch, Apache Solr…)
  • Dev-ops
    • managing infrastructure, builds..

A data engineer should also be able to do basic analytics on their own:

  • Analytics and Business intelligence
    • writing queries to answer complicated business questions
    • composing accurate reports
  • Data science
    • non-critical predictive workloads (ie. running a simple classifier or a regression task)
  • Front-end development (eg. R Shiny, Dash, D3.js, React.js…) or creating dashboards using a tool (Tableau…)
Posted in Uncategorized

Introduction to PostgreSQL High availability with pg_auto_failover

Introduction

I am an avid fan of the PostgreSQL database for many reasons.

What has been lacking in PostgreSQL is a simple, “built-in”, high-availability setup like Cassandra, MongoDB, Kafka… have.

I think the pg_auto_failover project open-sourced by the Citus Data team (now part of Microsoft) is a huge step in that direction.

Setting up streaming replication in PostgreSQL is pretty easy. I recommend this video and the related tutorial for setting it up on PG 10 and 11. PG 12 has a minor change that makes it even simpler.

Unfortunately the native streaming replication in PostgreSQL does not know when to fail over from the primary node (leader) to the secondary node (follower). There are many solutions to this problem, though the most recommended one is Patroni. Patroni is unfortunately specialised for dockerized environments and is fairly complex to set up compared to pg_auto_failover.

The way pg_auto_failover decides when to fail over from the primary to the secondary node is via an additional process called the monitor, which watches your HA setup (the formation) and performs a failover if it detects that one of your nodes is down. Technically you could implement this step yourself, eg. with a Python script that checks the heartbeat of the PostgreSQL processes in the HA setup every second; a rough sketch of that idea follows.
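A rough sketch of that do-it-yourself idea (this is not how pg_auto_failover itself works internally; the DSN and the promotion step are placeholders):

import time

import psycopg2

PRIMARY_DSN = "host=localhost port=5432 dbname=postgres"  # placeholder
FAILURES_BEFORE_FAILOVER = 5

def primary_is_alive() -> bool:
    try:
        conn = psycopg2.connect(PRIMARY_DSN, connect_timeout=1)
    except psycopg2.OperationalError:
        return False
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1;")
            return cur.fetchone() == (1,)
    finally:
        conn.close()

def promote_secondary() -> None:
    # e.g. run `pg_ctl promote -D /usr/local/var/secondary` via subprocess,
    # plus whatever fencing of the old primary your setup requires
    ...

failures = 0
while True:
    failures = 0 if primary_is_alive() else failures + 1
    if failures >= FAILURES_BEFORE_FAILOVER:
        promote_secondary()
        break
    time.sleep(1)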

So if you want a simple solution and don’t have Docker/Kubernetes, I’d recommend looking into pg_auto_failover.

Below are the steps to manually set up pg_auto_failover (on my Mac machine). The setup would be fairly similar on most Linux boxes. Although databases are stateful, you’d probably want to wrap this up in a script or something like an Ansible playbook.

 

Local pg_auto_failover setup 

1. Install PostgreSQL to get the binaries (you might need the postgresql development headers for pg_auto_failover to be able to run make)

brew install postgresql

2. Install pg_auto_failover using make

git clone https://github.com/citusdata/pg_auto_failover.git
cd pg_auto_failover
make

To see where the postgresql binaries are:

pg_config --bindir

Install the pg_auto_failover binaries into the directory reported by pg_config --bindir:

sudo make install

Add the binaries to the PATH (including the auto-failover related ones):

export PATH="/usr/local/Cellar/postgresql/11.3/bin/:$PATH"

3. Create the monitor (ie. a PostgreSQL instance and a daemon agent)

pg_autoctl create monitor --pgdata /usr/local/var/postgres_monitor --nodename localhost --pgport 5444

Output:

21:59:01 INFO  Initialising a PostgreSQL cluster at "/usr/local/var/postgres_monitor"
21:59:01 INFO /usr/local/bin/pg_ctl --pgdata /usr/local/var/postgres_monitor --options "-p 5444" --options "-h *" --wait start
21:59:01 INFO Granting connection privileges on ::1/128
21:59:01 INFO Your pg_auto_failover monitor instance is now ready on port 5444.
21:59:01 INFO pg_auto_failover monitor is ready at postgres://autoctl_node@localhost:5444/pg_auto_failover
21:59:01 INFO Monitor has been succesfully initialized.

4. Create primary node (specifying the monitor)

pg_autoctl create postgres --pgdata /usr/local/var/primary --monitor postgres://autoctl_node@localhost:5444/pg_auto_failover --pgport 5432 --pgctl /usr/local/Cellar/postgresql/11.3/bin/pg_ctl

Output – you can notice the DB also gets started automatically:

22:00:55 WARN  Failed to resolve hostname from address "192.168.0.17": nodename nor servname provided, or not known
22:00:55 INFO Using local IP address "192.168.0.17" as the --nodename.
22:00:55 INFO Found pg_ctl for PostgreSQL 11.3 at /usr/local/bin/pg_ctl
22:00:55 INFO Registered node 192.168.0.17:5432 with id 1 in formation "default", group 0.
22:00:55 INFO Writing keeper init state file at "/Users/dbg/.local/share/pg_autoctl/usr/local/var/primary/pg_autoctl.init"
22:00:55 INFO Successfully registered as "single" to the monitor.
22:00:56 INFO Initialising a PostgreSQL cluster at "/usr/local/var/primary"
22:00:56 INFO Postgres is not running, starting postgres
22:00:56 INFO /usr/local/bin/pg_ctl --pgdata /usr/local/var/primary --options "-p 5432" --options "-h *" --wait start
22:00:56 INFO CREATE DATABASE postgres;
22:00:56 INFO The database "postgres" already exists, skipping.
22:00:56 INFO FSM transition from "init" to "single": Start as a single node
22:00:56 INFO Initialising postgres as a primary
22:00:56 INFO Transition complete: current state is now "single"
22:00:56 INFO Keeper has been succesfully initialized.
pg_autoctl run --pgdata /usr/local/var/primary
22:03:57 INFO Managing PostgreSQL installation at "/usr/local/var/primary"
22:03:57 INFO pg_autoctl service is starting
22:03:57 INFO Calling node_active for node default/1/0 with current state: single, PostgreSQL is running, sync_state is "", WAL delta is -1.
22:04:02 INFO Calling node_active for node default/1/0 with current state: single, PostgreSQL is running, sync_state is "", WAL delta is -1.

5. Create and start the secondary node (also specifying the monitor)

pg_autoctl create postgres --pgdata /usr/local/var/secondary --monitor postgres://autoctl_node@localhost:5444/pg_auto_failover --pgport 5433 --pgctl /usr/local/Cellar/postgresql/11.3/bin/pg_ctl

pg_autoctl run --pgdata /usr/local/var/secondary/

6. The connection string clients use to connect to Postgres is the following:

pg_autoctl show uri --formation default --pgdata /usr/local/var/primary

Output:

postgres://192.168.0.17:5433,192.168.0.17:5432/postgres?target_session_attrs=read-write

The PostgreSQL client will automatically fail over from 192.168.0.17:5433 to 192.168.0.17:5432 in case one of the DBs is down.
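A small sketch of how a client uses this connection string from Python: libpq (and therefore psycopg2) tries the listed hosts in order and, with target_session_attrs=read-write, only settles on the node that currently accepts writes.

import psycopg2

conn = psycopg2.connect(
    "postgres://192.168.0.17:5433,192.168.0.17:5432/postgres"
    "?target_session_attrs=read-write"
)
with conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery();")  # False on the current primary
    print(cur.fetchone())
conn.close()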

Testing failover

Now with the primary and secondary running, we will test some failover scenarios.

Testing failover 1 — killing the secondary

Let’s see what PostgreSQL processes are running locally:

[~] => ps -ef | grep postgresql

Output:

  501   642     1   0  9:59pm ??         0:00.96 /usr/local/Cellar/postgresql/11.3/bin/postgres -D /usr/local/var/postgres_monitor -p 5444 -h *  
  501  7529     1   0 11:56pm ??         0:00.05 /usr/local/Cellar/postgresql/11.3/bin/postgres -D /usr/local/var/primary -p 5432 -h *
  501  7680     1   0 11:57pm ??         0:00.04 /usr/local/Cellar/postgresql/11.3/bin/postgres -D /usr/local/var/secondary -p 5433 -h *
  501  7739  1123   0 11:57pm ttys003    0:00.00 grep postgresql

We can see the three components: the monitor, the primary and the secondary. Let’s kill the secondary:

kill -9 7680

Secondary node logs after executing kill -9:

23:58:10 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
23:58:15 ERROR Failed to signal pid 7680, read from Postgres pid file.
23:58:15 INFO Is PostgreSQL at "/usr/local/var/secondary" up and running?
23:58:15 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is not running, sync_state is "", WAL delta is -1.
23:58:15 INFO Postgres is not running, starting postgres
23:58:15 INFO /usr/local/Cellar/postgresql/11.3/bin/pg_ctl --pgdata /usr/local/var/secondary --options "-p 5433" --options "-h *" --wait start
23:58:15 ERROR Failed to signal pid 7680, read from Postgres pid file.
23:58:15 INFO Is PostgreSQL at "/usr/local/var/secondary" up and running?
23:58:15 ERROR Failed to get Postgres pid, see above for details
23:58:15 ERROR Given --pgport 5433 doesn't match PostgreSQL port 0 from "/usr/local/var/secondary/postmaster.pid"
23:58:15 FATAL Failed to discover PostgreSQL setup, please fix previous errors.
23:58:15 ERROR Failed to restart PostgreSQL, see PostgreSQL logs for instance at "/usr/local/var/secondary".
23:58:15 WARN pg_autoctl failed to ensure current state "secondary": PostgreSQL is not running
23:58:20 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 589336.
23:58:20 INFO FSM transition from "secondary" to "catchingup": Failed to report back to the monitor, not eligible for promotion
23:58:20 INFO Transition complete: current state is now "catchingup"
23:58:20 INFO Calling node_active for node default/2/0 with current state: catchingup, PostgreSQL is running, sync_state is "", WAL delta is -1.
23:58:20 INFO FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again
23:58:20 INFO Transition complete: current state is now "secondary"
23:58:20 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 589336.
23:58:25 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 589336.
23:58:30 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 589336.
23:58:35 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 589336.

We notice that:

  1. The secondary node was nicely receiving data from the primary (no WAL lag)
  2. We then killed it and pg_autoctl saw that it was down
  3. pg_autoctl brought PostgreSQL back by restarting the server
  4. The FSM (finite state machine) state went to “catchingup” and later back to “secondary” once the node had caught up with the streaming replication from the primary.

Simultaneous logs on the primary after we killed the secondary node:

23:58:05 INFO  Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
23:58:10 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
23:58:15 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
23:58:15 INFO FSM transition from "primary" to "wait_primary": Secondary became unhealthy
23:58:15 INFO Disabling synchronous replication
23:58:15 INFO Transition complete: current state is now "wait_primary"
23:58:15 INFO Calling node_active for node default/1/0 with current state: wait_primary, PostgreSQL is running, sync_state is "async", WAL delta is 0.
23:58:20 INFO Calling node_active for node default/1/0 with current state: wait_primary, PostgreSQL is running, sync_state is "async", WAL delta is 0.
23:58:20 INFO FSM transition from "wait_primary" to "primary": A healthy secondary appeared
23:58:20 INFO Enabling synchronous replication
23:58:20 INFO Transition complete: current state is now "primary"
23:58:20 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
23:58:25 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
23:58:30 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
23:58:35 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
23:58:40 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
23:58:45 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
The log output here shows us why pg_auto_failover is a great failover solution.

It saw that the secondary was down, and instead of continuing synchronous streaming replication it switched to asynchronous. This means commits can still be done on the primary node; otherwise the primary would stop accepting writes (since they couldn’t be synced, ie. committed at the same time on the secondary).
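If you want to watch this behaviour yourself, the sync_state column of pg_stat_replication on the primary flips between "sync" and "async" as pg_auto_failover toggles synchronous replication; a quick check from Python against the local primary above:

import psycopg2

conn = psycopg2.connect("host=localhost port=5432 dbname=postgres")  # the primary
with conn.cursor() as cur:
    cur.execute("SELECT application_name, state, sync_state FROM pg_stat_replication;")
    for row in cur.fetchall():
        print(row)
conn.close()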

Testing failover 2 — killing the primary


[~] => ps -ef | grep postgresql

Output:

501   642     1   0  9:59pm ??         0:01.00 /usr/local/Cellar/postgresql/11.3/bin/postgres -D /usr/local/var/postgres_monitor -p 5444 -h *
501  7529     1   0 11:56pm ??         0:00.08 /usr/local/Cellar/postgresql/11.3/bin/postgres -D /usr/local/var/primary -p 5432 -h *
501  7773     1   0 11:58pm ??         0:00.05 /usr/local/Cellar/postgresql/11.3/bin/postgres -D /usr/local/var/secondary -p 5433 -h *
501  8042  1123   0 12:01am ttys003    0:00.00 grep postgresql
[~] => kill -9 7529

Simultaneous logs on the secondary after we killed the primary node:

00:01:06 INFO  Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 589336.
00:01:11 INFO  Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 589336.
00:01:16 ERROR PostgreSQL cannot reach the primary server: the system view pg_stat_wal_receiver has no rows.
00:01:16 INFO  Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is -1.
00:01:21 INFO  Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
00:01:21 INFO  FSM transition from "secondary" to "prepare_promotion": Stop traffic to primary, wait for it to finish draining.
00:01:21 INFO  Transition complete: current state is now "prepare_promotion"
00:01:21 INFO  Calling node_active for node default/2/0 with current state: prepare_promotion, PostgreSQL is running, sync_state is "", WAL delta is -1.
00:01:21 INFO  FSM transition from "prepare_promotion" to "stop_replication": Prevent against split-brain situations.
00:01:21 INFO  Prevent writes to the promoted standby while the primary is not demoted yet, by making the service incompatible with target_session_attrs = read-write
00:01:21 INFO Setting default_transaction_read_only to on
00:01:21 INFO Promoting postgres
00:01:21 INFO Other node in the HA group is 192.168.0.17:5432
00:01:21 INFO Create replication slot "pgautofailover_standby"
00:01:21 INFO Disabling synchronous replication
00:01:21 INFO Transition complete: current state is now "stop_replication"
00:01:21 INFO Calling node_active for node default/2/0 with current state: stop_replication, PostgreSQL is running, sync_state is "", WAL delta is -1.
00:01:26 INFO Calling node_active for node default/2/0 with current state: stop_replication, PostgreSQL is running, sync_state is "", WAL delta is -1.
00:01:26 INFO FSM transition from "stop_replication" to "wait_primary": Confirmed promotion with the monitor
00:01:26 INFO Setting default_transaction_read_only to off
00:01:26 INFO Transition complete: current state is now "wait_primary"
00:01:26 INFO Calling node_active for node default/2/0 with current state: wait_primary, PostgreSQL is running, sync_state is "", WAL delta is -1.
00:01:31 INFO Calling node_active for node default/2/0 with current state: wait_primary, PostgreSQL is running, sync_state is "async", WAL delta is 0.
00:01:31 INFO FSM transition from "wait_primary" to "primary": A healthy secondary appeared
00:01:31 INFO  Enabling synchronous replication
00:01:31 INFO  Transition complete: current state is now "primary"
00:01:31 INFO  Calling node_active for node default/2/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.

Logs on the primary after we killed the (primary) PostgreSQL process:

00:01:01 INFO  Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
00:01:06 INFO  Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
00:01:11 INFO  Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "sync", WAL delta is 0.
00:01:16 ERROR Failed to signal pid 7529, read from Postgres pid file.
00:01:16 INFO  Is PostgreSQL at "/usr/local/var/primary" up and running?
00:01:16 INFO  Calling node_active for node default/1/0 with current state: primary, PostgreSQL is not running, sync_state is "", WAL delta is -1.
00:01:16 INFO  Postgres is not running, starting postgres
00:01:16 INFO   /usr/local/bin/pg_ctl --pgdata /usr/local/var/primary --options "-p 5432" --options "-h *" --wait start
00:01:17 WARN  PostgreSQL was not running, restarted with pid 8064
00:01:22 ERROR PostgreSQL primary server has lost track of its standby: pg_stat_replication reports no client using the slot "pgautofailover_standby".
00:01:22 INFO  Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is "", WAL delta is -1.
00:01:22 INFO  FSM transition from "primary" to "demote_timeout": A failover occurred, no longer primary
00:01:22 INFO  Transition complete: current state is now "demote_timeout"
00:01:22 INFO  Calling node_active for node default/1/0 with current state: demote_timeout, PostgreSQL is not running, sync_state is "", WAL delta is -1.
00:01:27 INFO  Calling node_active for node default/1/0 with current state: demote_timeout, PostgreSQL is not running, sync_state is "", WAL delta is -1.
00:01:27 INFO  FSM transition from "demote_timeout" to "demoted": Demote timeout expired
00:01:27 INFO  pg_ctl: no server running
00:01:27 INFO  pg_ctl stop failed, but PostgreSQL is not running anyway
00:01:27 INFO  Transition complete: current state is now "demoted"
00:01:27 INFO  Calling node_active for node default/1/0 with current state: demoted, PostgreSQL is not running, sync_state is "", WAL delta is -1.
00:01:27 INFO  FSM transition from "demoted" to "catchingup": A new primary is available. First, try to rewind. If that fails, do a pg_basebackup.
00:01:27 INFO  The primary node returned by the monitor is 192.168.0.17:5433
00:01:27 INFO  Rewinding PostgreSQL to follow new primary 192.168.0.17:5433
00:01:27 INFO  pg_ctl: no server running
00:01:27 INFO  pg_ctl stop failed, but PostgreSQL is not running anyway
00:01:27 INFO  Running /usr/local/bin/pg_rewind --target-pgdata "/usr/local/var/primary" --source-server " host='192.168.0.17' port=5433 user='pgautofailover_replicator' dbname='postgres'" --progress ...
00:01:28 INFO  connected to server
servers diverged at WAL location 0/3092C58 on timeline 5
rewinding from last common checkpoint at 0/308FE18 on timeline 5
reading source file list
reading target file list
reading WAL in target
need to copy 115 MB (total source directory size is 135 MB)
118766/118766 kB (100%) copied
creating backup label and updating control file
syncing target data directory
Done!
00:01:28 INFO  Writing recovery configuration to "/usr/local/var/primary/recovery.conf"
00:01:28 INFO  Postgres is not running, starting postgres
00:01:28 INFO   /usr/local/bin/pg_ctl --pgdata /usr/local/var/primary --options "-p 5432" --options "-h *" --wait start
00:01:28 INFO  Transition complete: current state is now "catchingup"
00:01:28 INFO  Calling node_active for node default/1/0 with current state: catchingup, PostgreSQL is running, sync_state is "", WAL delta is -1.
00:01:28 INFO  FSM transition from "catchingup" to "secondary": Convinced the monitor that I'm up and running, and eligible for promotion again
00:01:28 INFO  Transition complete: current state is now "secondary"
00:01:28 INFO  Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
00:01:33 INFO  Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.
00:01:38 INFO  Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is "", WAL delta is 0.

Events:

  • the primary is down, so writes and replication are stopped
  • the secondary then gets promoted to a wait_primary (not a full primary since it doesn’t have a secondary)
  • in the meantime the ex-primary comes back to life; it catches up with the new primary (the ex-secondary) using pg_rewind, and what is especially interesting is that it would fall back to pg_basebackup if it weren’t able to rewind
  • the ex-primary then becomes a secondary
  • the ex-secondary becomes a primary with synchronous replication
Posted in Data Engineering, Data Systems, Data Warehousing