Client_addr on pg_catalog.pg_stat_replication wrong ip address - istio enabled #1629

cristi-vlad · 2021-09-27T15:15:04Z

Please, answer some short questions which should help us to understand your problem / question better?

Which image of the operator are you using? e.g. registry.opensource.zalan.do/acid/postgres-operator:v1.6.3
Where do you run it - cloud or metal? Kubernetes PKS
Are you running Postgres Operator in production? yes
Type of issue? Question / Bug

While trying to do an in-cluster upgrade from PGVERSION 12 to PGVERSION 13, I discovered that the members' IPs are not correctly reported in pg_catalog.pg_stat_replication.

While running python3 /scripts/inplace_upgrade.py 3 (three-node cluster), I get the following error message:

2021-09-27 14:58:37,457 inplace_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-09-27 14:58:37,500 inplace_upgrade WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2021-09-27 14:58:37,504 inplace_upgrade INFO: establishing a new patroni connection to the postgres cluster
2021-09-27 14:58:37,561 inplace_upgrade ERROR: Member hco-pg-1-1 is not streaming from the primary

After debugging, I discovered that in pg_catalog.pg_stat_replication, client_addr is 127.0.0.6 for both nodes that are replicating data from the master:

postgres=# SELECT * from pg_catalog.pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn |    write_lag    |    flush_lag    |   replay_lag    | sync_priority | sync_state |          reply_time
-----+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------------+-----------------+-----------------+---------------+------------+-------------------------------
 894 |    16637 | standby | hco-pg-1-2       | 127.0.0.6   |                 |       40175 | 2021-09-27 13:40:26.305049+00 |              | streaming | 9/4E027518 | 9/4E027518 | 9/4E027518 | 9/4E027518 | 00:00:00.002132 | 00:00:00.002812 | 00:00:00.002913 |             0 | async      | 2021-09-27 15:09:14.095679+00
 886 |    16637 | standby | hco-pg-1-1       | 127.0.0.6   |                 |       36155 | 2021-09-27 13:40:05.528001+00 |              | streaming | 9/4E027518 | 9/4E027518 | 9/4E027518 | 9/4E027518 | 00:00:00.001441 | 00:00:00.002128 | 00:00:00.002146 |             0 | async      | 2021-09-27 15:09:14.09543+00
(2 rows)

Cluster looks like this:

`root@hco-pg-1-0:/home/postgres# patronictl list
+ Cluster: hco-pg-1 (6995167694694490192) ----+----+-----------+
|   Member   |    Host    |  Role   |  State  | TL | Lag in MB |
+------------+------------+---------+---------+----+-----------+
| hco-pg-1-0 | 11.32.16.6 | Leader  | running | 11 |           |
| hco-pg-1-1 | 11.32.16.7 | Replica | running | 11 |         0 |
| hco-pg-1-2 | 11.32.16.9 | Replica | running | 11 |         0 |
+------------+------------+---------+---------+----+-----------+
`

After debugging /scripts/inplace_upgrade.py, I found that in the code below, ip = member.conn_kwargs().get('host') retrieves the correct replica IP. However, the lookup lag = streaming.get(ip) then returns None, because that IP does not match anything in pg_catalog.pg_stat_replication, where client_addr is 127.0.0.6 for both nodes.

    def ensure_replicas_state(self, cluster):
        """
        This method checks the status of all replicas and also tries to open connections
        to all of them and puts into the `self.replica_connections` dict for a future usage.
        """
        self.replica_connections = {}
        streaming = {a: l for a, l in self.postgresql.query(
            ("SELECT client_addr, pg_catalog.pg_{0}_{1}_diff(pg_catalog.pg_current_{0}_{1}(),"
             " COALESCE(replay_{1}, '0/0'))::bigint FROM pg_catalog.pg_stat_replication")
            .format(self.postgresql.wal_name, self.postgresql.lsn_name))}
        print("Streaming: ", streaming)  # debug output added while investigating

        def ensure_replica_state(member):
            ip = member.conn_kwargs().get('host')
            lag = streaming.get(ip)
            if lag is None:
                return logger.error('Member %s is not streaming from the primary', member.name)
            if lag > 16*1024*1024:
                return logger.error('Replication lag %s on member %s is too high', lag, member.name)
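To make the failure mode concrete, here is a small standalone sketch that mimics the lookup with the addresses from this report. The lag values are illustrative placeholders, and `not_streaming` is a hypothetical helper, not part of spilo. Because both rows share the same client_addr, the dict ends up with a single key, and neither pod IP is found in it.

```python
# Rows as they would come back from pg_stat_replication behind the proxy:
# (client_addr, lag_in_bytes). The lag values are illustrative placeholders.
rows = [("127.0.0.6", 0), ("127.0.0.6", 0)]

# Same dict-comprehension shape as in ensure_replicas_state().
streaming = {addr: lag for addr, lag in rows}

# Member name -> pod IP, as reported by `patronictl list`.
members = {"hco-pg-1-1": "11.32.16.7", "hco-pg-1-2": "11.32.16.9"}


def not_streaming(streaming, members):
    """Return the members whose IP has no entry in the streaming map."""
    return [name for name, ip in members.items() if streaming.get(ip) is None]


print(streaming)                          # both rows collapse onto one key
print(not_streaming(streaming, members))  # both replicas are reported
```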

My question would be if this is because we are using istio injection (envoy proxy) for our zalando postgres clusters or if we have some other issue and how we can solve this.

Thank you !
/Cristi Vlad

CyberDem0n · 2021-09-28T09:35:32Z

> My question would be if this is because we are using istio injection (envoy proxy) for our zalando postgres clusters

Yes, it is exactly due to the intermediate proxy between primary and replicas. As a result, the client_addr in the pg_stat_replication doesn't match with the actual IPs of replicas.

> and how we can solve this.

In theory, we could base the check on pg_stat_replication.application_name, but it would be less strict: it is not possible to guarantee that it is actually the replica that is streaming and not something else that decided to use the same application_name.
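A minimal sketch of what a check keyed on application_name could look like, assuming the application_name equals the Patroni member name (as it does in the pg_stat_replication output in the report). `check_members` is a hypothetical helper for illustration, not spilo's actual code, and it inherits the weakness described above: any client that sets the same application_name would pass.

```python
MAX_LAG_BYTES = 16 * 1024 * 1024  # same threshold as in the reported excerpt


def check_members(rows, member_names, max_lag=MAX_LAG_BYTES):
    """Check replicas by name instead of by IP.

    rows: iterable of (application_name, lag_in_bytes) tuples.
    Returns {member_name: problem description} for members failing the check.
    """
    streaming = dict(rows)
    problems = {}
    for name in member_names:
        lag = streaming.get(name)
        if lag is None:
            problems[name] = "not streaming from the primary"
        elif lag > max_lag:
            problems[name] = "replication lag %s is too high" % lag
    return problems


# Illustrative rows: both replicas are streaming with no lag.
rows = [("hco-pg-1-2", 0), ("hco-pg-1-1", 0)]
print(check_members(rows, ["hco-pg-1-1", "hco-pg-1-2"]))
```

An empty result means every listed member was found and is within the lag threshold.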

sylvainOL · 2023-01-31T15:04:31Z

Hello, is there any news on this topic?
I've the exact same issue and I'd like to see how we can make it work!

thanks!

mblsf · 2023-05-24T08:02:27Z

We are facing the same issue in the environment of one of our customers.
Are there any plans to fix this sooner or later?

sylvainOL · 2023-05-24T08:45:06Z

I think we need to patch https://github.com/zalando/spilo/blob/master/postgres-appliance/major_upgrade/inplace_upgrade.py
I was thinking of checking whether an env var (USE_APPLICATION_NAME_IN_UPGRADE) is set and, if so, using pg_stat_replication.application_name instead of pg_stat_replication.client_addr.

@CyberDem0n, would it be OK?

I can make an MR for that.
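A rough sketch of the env-var switch proposed above, written as standalone functions. The set of accepted truthy values and the helper names (`replication_key_column`, `streaming_query`) are assumptions for illustration; the actual change in the linked pull requests may differ.

```python
import os

ENV_VAR = "USE_APPLICATION_NAME_IN_UPGRADE"


def replication_key_column(environ=os.environ):
    """Pick the pg_stat_replication column used to identify replicas.

    The accepted truthy values are an assumption made for this sketch.
    """
    value = environ.get(ENV_VAR, "").strip().lower()
    return "application_name" if value in ("1", "true", "yes", "on") else "client_addr"


def streaming_query(wal_name, lsn_name, environ=os.environ):
    """Build the lag query from the excerpt, with a selectable key column."""
    column = replication_key_column(environ)
    return ("SELECT {2}, pg_catalog.pg_{0}_{1}_diff(pg_catalog.pg_current_{0}_{1}(),"
            " COALESCE(replay_{1}, '0/0'))::bigint FROM pg_catalog.pg_stat_replication"
            ).format(wal_name, lsn_name, column)


# With the variable set, the query keys on application_name.
print(streaming_query("wal", "lsn", {ENV_VAR: "true"}))
```

With this approach the lookup side would also have to switch from the member's host to the member's name when the variable is set.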

sylvainOL · 2023-09-05T07:03:58Z

I've just made a simple change in spilo, so let's see what the maintainers think about it.

Jan-M · 2023-11-06T11:17:32Z

Sometimes using \x is very helpful for readable output :)

seekermarcel · 2025-02-12T11:12:44Z

Since @sylvainOL seems to be unreachable, I took over his changes.

sylvainOL · 2025-02-12T12:25:57Z

Hello,
Sorry, I didn't see the new comments. I'm not using the Zalando operator anymore, so please take over, @seekermarcel.

sorry for that

seekermarcel · 2025-02-17T15:36:10Z

@cristi-vlad it's working now in my pr. please verify it on your end

alfsch · 2025-02-21T08:15:38Z

The fix in zalando/spilo#1082 works for me very well. When could we expect an official release (@Jan-M)? I don't want to fiddle around with selfmade spilo images in production clusters ;-)

cp319391 · 2025-02-21T09:06:20Z

works for me as well. thanks @seekermarcel.
waiting for the release with the changes.

theBNT · 2025-02-21T10:03:49Z

This is awesome news, thanks so much! Since Jan has not been very active lately, is this something you can/want to chime in on, @FxKu @hughcapet?

FxKu · 2025-03-27T12:50:14Z

Sorry to keep you waiting. Let me get some focus time on this topic next week so we can include it in the upcoming release.

sylvainOL linked a pull request Sep 5, 2023 that will close this issue

allow to use application name for upgrade zalando/spilo#915

Open

seekermarcel linked a pull request Feb 12, 2025 that will close this issue

allow to use application name for upgrade (takeover) zalando/spilo#1082

Open

FxKu added this to the 1.15.0 milestone Mar 27, 2025
