In tech, you frequently get a nonsense adventure of the day which doesn't really make sense... and today's adventure is to setup replication between an AWS RDS Postgres database and the Alibaba cloud.
So the task at hand
- Source Database: Postgres 12.5 running on AWS RDS, with about 300 Gb of data
- Destination: Postgres 12.4 running on Alicloud
After some digging around, find Chinese language documentation with instructions for local to Alicloud migration. This is probably as good as it's going to get. Thank God for Google Translate.
So I decided to use their Data Transmission Service.. which seems to login as superuser and use the Postgres logical replication slots.
The site only has instructions in Chinese, which I don't read, but luckily Google Translate seems pretty good.
Soon enough got the issues.
Route 1 - Use the Data Transmission Service
Attempt 1: Test Connection success and Pre Check Failure
- Source Test Connection successful
- Destination Test Connection successful
But then Pre-Check fail!
Time to upgrade our Alibaba support. 2 hours later... Turns out since Source and Destination are in different zones, both zones IP addresses have to be whitelisted at source.
Attempt 2: Single table schema migration fails
More investigation.. Turns out Sequences are not migrated when you migrate a table schema.. so they have to be created before I guess?
Attempt 3: Single table migration where no sequence is present
Success!!! Yes! Great.. but then we kind of need synchronization.. not just simple migration. More investigation is necessary to stumble upon
Attempt 4: Is synchronization possible with Alibaba's Data Transmission Service
That would be a no. You can sync to an analytical database but not a transaction db. So that's the end of that adventure
Route 2 - Setup our own ECS with storage
Can we sync from RDS to an Alibaba Postgres running on a normal ECS instance? Self managed database on the Alibaba side.
Should be able to. Time to try this out next...
Attempt 1: Postgres publish/subscribe
- AWS has Elastic Block Store... the Alicloud version seems to be Block Storage/Cloud Disk - basically a way to retain your database harddisk in the machine if it reboots or crashes
- So create an ECS instance first,
- Mess around with whitelisting IP addresses
- Format the data disk for the Block Store
- Boot up and move the postgres data directory to the Block Store
- Finally start off the migration from AWS using pub/sub
- And success on migration on one table! Great!
- But no updates on the table... why?
Attempt 2: pglogical
After some exploration, Postgres' publish subscribe functions take some management... and the pglogical extension manages a lot of that for you. The documentation however is not great.
- Schema and tables have to be created on the subscriber before migration begins
- keep postgres logs open in a terminal while creating the migration
sudo tailf logfile.log
- Generated columns cannot get replicated (I failed three times!!!)
- First they cannot exist on published table, meaning the table columns have to be added specifically
- Next they cannot be created as generated on the subscriber side, this generated really nasty segfault errors on the subscriber database such that the database was kept in an infinite loop which got restarted everytime the database was restarted.
2021-05-26 01:04:50.387 CST  LOG: database system is ready to accept connections 2021-05-26 01:04:50.388 CST  LOG: starting pglogical supervisor 2021-05-26 01:04:50.394 CST  LOG: manager worker  at slot 0 generation 1 detaching cleanly 2021-05-26 01:04:50.399 CST  LOG: manager worker  at slot 0 generation 2 detaching cleanly 2021-05-26 01:04:50.404 CST  LOG: starting pglogical database manager for database databasename 2021-05-26 01:04:50.407 CST  LOG: starting apply for subscription ali_subscription 2021-05-26 01:04:50.592 CST  LOG: background worker "pglogical apply 16384:2677935694" (PID 9061) was terminated by signal 11: Segmentation fault 2021-05-26 01:04:50.592 CST  LOG: terminating any other active server processes 2021-05-26 01:04:50.593 CST  WARNING: terminating connection because of crash of another server process 2021-05-26 01:04:50.593 CST  DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2021-05-26 01:04:50.593 CST  HINT: In a moment you should be able to reconnect to the database and repeat your command. 2021-05-26 01:04:50.595 CST  LOG: all server processes terminated; reinitializing 2021-05-26 01:04:50.605 CST  LOG: database system was interrupted; last known up at 2021-05-26 01:04:50 CST 2021-05-26 01:04:50.669 CST  LOG: recovered replication state of node 1 to B42/62E72770 2021-05-26 01:04:50.669 CST  LOG: database system was not properly shut down; automatic recovery in progress 2021-05-26 01:04:50.671 CST  LOG: invalid record length at 5/51C98F50: wanted 24, got 0 2021-05-26 01:04:50.671 CST  LOG: redo is not required
- Finally, I tried using a trigger on insert to populate the generated field, but the trigger didn't activate because pglogical uses COPY
CREATE OR REPLACE FUNCTION public.set_loan_id_on_loan_app() RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN UPDATE public.loan_app SET loan_id=(NEW.userid::TEXT || '#' || NEW.id::TEXT) WHERE id=NEW.id; RETURN NEW; END; $$; CREATE TRIGGER set_loan_id_on_loan_app AFTER INSERT ON public.loan_app FOR EACH ROW EXECUTE FUNCTION public.set_loan_id_on_loan_app();
- Ended up setting up pg_cron to populate the field every minute - this is also pretty finicky on network access as it gets created in the postgres database and needs access to the subscriber database. This helped.
And finally the full sync process can begin... This just reinforces my belief that all data migration has to be troubleshot manually in a startup... I also deeply regret using Generated columns now. Another note never to use a new Postgres feature until its been stable for a couple of years...
- Delete foreign keys in the subscriber schema - the replica will go out of sync at some point and resynchronization requires Truncate, which is not possible with foreign keys.
pglogical documentation also leaves much to be desired... I mean users have had to backwards engineer sync_status codes!!!
#define SYNC_STATUS_NONE '\0' /* No sync. */ #define SYNC_STATUS_INIT 'i' /* Ask for sync. */ #define SYNC_STATUS_STRUCTURE 's' /* Sync structure */ #define SYNC_STATUS_DATA 'd' /* Data sync. */ #define SYNC_STATUS_CONSTAINTS 'c' /* Constraint sync (post-data structure). */ #define SYNC_STATUS_SYNCWAIT 'w' /* Table sync is waiting to get OK from main thread. */ #define SYNC_STATUS_CATCHUP 'u' /* Catching up. */ #define SYNC_STATUS_SYNCDONE 'y' /* Synchronization finished (at lsn). */ #define SYNC_STATUS_READY 'r' /* Done. */
Lots and lots of problems with the pglogical library. Sync almost always gets interrupted and then forced to truncate and resync the database. After some investigation it seems Postgres native logical replication since v10 is more stable. pglogical was a v9 and prior library.
So back we go to logical replication, and it is pretty smooth..
Lots of problems with sync dropping off due to wal_receiver and wal_sender timeouts, fix that by increasing the timeouts. More issues with database CPU on sender... fix that by increasing the AWS RDS instance type to add vCPUs.