
AWS RDS Postgres to Alibaba replication

In tech, you frequently get handed the nonsense adventure of the day... and today's adventure is to set up replication between an AWS RDS Postgres database and Alibaba Cloud.

So, the task at hand:

  • Source Database: Postgres 12.5 running on AWS RDS, with about 300 GB of data
  • Destination: Postgres 12.4 running on Alicloud

After some digging around, I found Chinese-language documentation with instructions for local-to-Alicloud migration. This is probably as good as it's going to get. Thank God for Google Translate.

So I decided to use their Data Transmission Service... which seems to log in as a superuser and use Postgres logical replication slots. Its site, too, only has instructions in Chinese, which I don't read, but luckily Google Translate holds up pretty well.

Soon enough, the issues arrived.

Route 1 - Use the Data Transmission Service

Attempt 1: Test Connection succeeds but Pre-Check fails

  • Source Test Connection successful
  • Destination Test Connection successful

But then the Pre-Check fails!

Time to upgrade our Alibaba support plan. Two hours later... it turns out that since Source and Destination are in different zones, both zones' IP addresses have to be whitelisted at the source.

Attempt 2: Single table schema migration fails

More investigation... Turns out sequences are not migrated when you migrate a table schema, so presumably they have to be created on the destination beforehand.
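Presumably something like this on the destination before kicking off the migration - a rough sketch, assuming a serial-style id column (table and sequence names are illustrative):

-- create the sequence the table's default expects
CREATE SEQUENCE public.loan_app_id_seq;
ALTER TABLE public.loan_app
    ALTER COLUMN id SET DEFAULT nextval('public.loan_app_id_seq');
-- after the data copy, move the sequence past the highest migrated id
SELECT setval('public.loan_app_id_seq', (SELECT max(id) FROM public.loan_app));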

Attempt 3: Single table migration where no sequence is present

Success!!! Yes! Great... but we actually need ongoing synchronization, not just a one-off migration. More investigation leads to the next question.

Attempt 4: Is synchronization possible with Alibaba's Data Transmission Service?

That would be a no. You can sync to an analytical database but not to a transactional one. So that's the end of that adventure.

Route 2 - Set up our own ECS with storage

Can we sync from RDS to a Postgres running on a normal Alibaba ECS instance? That is, a self-managed database on the Alibaba side.

Should be able to. Time to try this out next...

Attempt 1: Postgres publish/subscribe

  • AWS has Elastic Block Store... the Alicloud equivalent seems to be Block Storage/Cloud Disk - basically a way to retain your database's hard disk if the machine reboots or crashes
  • So, create an ECS instance first
  • Mess around with whitelisting IP addresses
  • Format the data disk for the Block Store
  • Boot up and move the Postgres data directory to the Block Store
  • Finally, kick off the migration from AWS using pub/sub (see the sketch after this list)
  • And success on migrating one table! Great!
  • But no updates on the table... why?
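For reference, the native pub/sub wiring looks roughly like this - the endpoint, credentials, and table name are placeholders, and on RDS logical replication also has to be enabled via the rds.logical_replication parameter in the parameter group:

-- on the AWS RDS source (publisher)
CREATE PUBLICATION ali_pub FOR TABLE public.loan_app;

-- on the Alicloud ECS destination (subscriber)
CREATE SUBSCRIPTION ali_sub
    CONNECTION 'host=mydb.xxxx.rds.amazonaws.com port=5432 dbname=databasename user=replicator password=secret'
    PUBLICATION ali_pub;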

Attempt 2: pglogical

After some exploration: Postgres' publish/subscribe machinery takes some management... and the pglogical extension handles a lot of that for you. The documentation, however, is not great.
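Roughly, the pglogical wiring looks like this - DSNs and node names are illustrative, and ali_subscription is the subscription name that shows up in the logs below:

-- on the provider (AWS RDS), after CREATE EXTENSION pglogical
SELECT pglogical.create_node(
    node_name := 'aws_provider',
    dsn := 'host=mydb.xxxx.rds.amazonaws.com port=5432 dbname=databasename'
);
SELECT pglogical.replication_set_add_table('default', 'public.loan_app');

-- on the subscriber (Alicloud ECS), after CREATE EXTENSION pglogical
SELECT pglogical.create_node(
    node_name := 'ali_subscriber',
    dsn := 'host=localhost port=5432 dbname=databasename'
);
SELECT pglogical.create_subscription(
    subscription_name := 'ali_subscription',
    provider_dsn := 'host=mydb.xxxx.rds.amazonaws.com port=5432 dbname=databasename'
);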

Some notes:

  • Schema and tables have to be created on the subscriber before migration begins
  • Keep the Postgres logs open in a terminal while creating the subscription:
sudo tailf logfile.log
  • Generated columns cannot be replicated (I failed three times!!!)
  • First, they cannot exist on the published table, meaning the table's columns have to be listed out explicitly
  • Next, they cannot be created as generated columns on the subscriber side either: this produced really nasty segfaults on the subscriber database, leaving it in a crash loop that resumed every time the database was restarted:
2021-05-26 01:04:50.387 CST [22240] LOG:  database system is ready to accept connections
2021-05-26 01:04:50.388 CST [9056] LOG:  starting pglogical supervisor
2021-05-26 01:04:50.394 CST [9058] LOG:  manager worker [9058] at slot 0 generation 1 detaching cleanly
2021-05-26 01:04:50.399 CST [9059] LOG:  manager worker [9059] at slot 0 generation 2 detaching cleanly
2021-05-26 01:04:50.404 CST [9060] LOG:  starting pglogical database manager for database databasename
2021-05-26 01:04:50.407 CST [9061] LOG:  starting apply for subscription ali_subscription
2021-05-26 01:04:50.592 CST [22240] LOG:  background worker "pglogical apply 16384:2677935694" (PID 9061) was terminated by signal 11: Segmentation fault
2021-05-26 01:04:50.592 CST [22240] LOG:  terminating any other active server processes
2021-05-26 01:04:50.593 CST [9054] WARNING:  terminating connection because of crash of another server process
2021-05-26 01:04:50.593 CST [9054] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2021-05-26 01:04:50.593 CST [9054] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2021-05-26 01:04:50.595 CST [22240] LOG:  all server processes terminated; reinitializing
2021-05-26 01:04:50.605 CST [9062] LOG:  database system was interrupted; last known up at 2021-05-26 01:04:50 CST
2021-05-26 01:04:50.669 CST [9062] LOG:  recovered replication state of node 1 to B42/62E72770
2021-05-26 01:04:50.669 CST [9062] LOG:  database system was not properly shut down; automatic recovery in progress
2021-05-26 01:04:50.671 CST [9062] LOG:  invalid record length at 5/51C98F50: wanted 24, got 0
2021-05-26 01:04:50.671 CST [9062] LOG:  redo is not required
  • Finally, I tried using an insert trigger to populate the generated field, but the trigger didn't fire because pglogical uses COPY:
-- Backfill loan_id after each insert, since the generated column could not be replicated
CREATE OR REPLACE FUNCTION public.set_loan_id_on_loan_app() RETURNS trigger
    LANGUAGE plpgsql
    AS $$
BEGIN
    UPDATE public.loan_app SET loan_id=(NEW.userid::TEXT || '#' || NEW.id::TEXT) WHERE id=NEW.id;
    RETURN NEW;
END;
$$;

CREATE TRIGGER set_loan_id_on_loan_app AFTER INSERT ON public.loan_app FOR EACH ROW EXECUTE FUNCTION public.set_loan_id_on_loan_app();
  • Ended up setting up pg_cron to populate the field every minute (see the sketch just below) - this is also pretty finicky about network access, as the job gets created in the postgres database but needs access to the subscriber database. This helped.
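A rough sketch of that pg_cron job, mirroring the trigger above - the schedule and UPDATE are illustrative:

-- runs every minute; pg_cron jobs live in the database named by cron.database_name
SELECT cron.schedule(
    '* * * * *',
    $$UPDATE public.loan_app
      SET loan_id = userid::TEXT || '#' || id::TEXT
      WHERE loan_id IS NULL$$
);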

And finally the full sync process can begin... This just reinforces my belief that, at a startup, all data migration has to be troubleshot by hand... I also deeply regret using generated columns now. Another note to self: never use a new Postgres feature until it's been stable for a couple of years...

  • Delete foreign keys in the subscriber schema - the replica will go out of sync at some point, and resynchronization requires TRUNCATE, which is not possible while foreign keys reference the tables.
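One way to find them on the subscriber, then drop each one by name (the constraint name in the ALTER is hypothetical):

-- list all foreign keys in the public schema
SELECT conrelid::regclass AS table_name, conname AS constraint_name
FROM pg_constraint
WHERE contype = 'f' AND connamespace = 'public'::regnamespace;

-- drop each one
ALTER TABLE public.loan_app DROP CONSTRAINT loan_app_userid_fkey;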

The pglogical documentation also leaves much to be desired... I mean, users have had to reverse-engineer the sync_status codes!!!

#define SYNC_STATUS_NONE		'\0'	/* No sync. */
#define SYNC_STATUS_INIT		'i'		/* Ask for sync. */
#define SYNC_STATUS_STRUCTURE	's'     /* Sync structure */
#define SYNC_STATUS_DATA		'd'		/* Data sync. */
#define SYNC_STATUS_CONSTAINTS	'c'		/* Constraint sync (post-data structure). */
#define SYNC_STATUS_SYNCWAIT	'w'		/* Table sync is waiting to get OK from main thread. */
#define SYNC_STATUS_CATCHUP		'u'		/* Catching up. */
#define SYNC_STATUS_SYNCDONE	'y'		/* Synchronization finished (at lsn). */
#define SYNC_STATUS_READY		'r'		/* Done. */
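In practice you end up querying the status by hand on the subscriber and matching the single-character codes against the source above:

SELECT sync_nspname, sync_relname, sync_status
FROM pglogical.local_sync_status;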

Update 2021-06-03:

Lots and lots of problems with the pglogical library. Sync almost always gets interrupted, and then we're forced to truncate and resync the database. After some investigation, it seems Postgres' native logical replication, available since v10, is more stable; pglogical dates from the v9.x era, before the native feature existed.

So back we go to native logical replication, and it is pretty smooth...

Lots of problems with sync dropping off due to wal_receiver_timeout and wal_sender_timeout; fixed by increasing the timeouts (see below). More issues with database CPU on the sender... fixed by moving to a bigger AWS RDS instance type with more vCPUs.
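The timeout bump, roughly - the value here is illustrative, and on RDS the sender side has to be changed through the DB parameter group rather than ALTER SYSTEM:

-- on the subscriber (ECS): raise the receiver timeout and reload
ALTER SYSTEM SET wal_receiver_timeout = '5min';
SELECT pg_reload_conf();

-- on AWS RDS: set wal_sender_timeout in the DB parameter group instead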