
Persist pgstat file in PS to include it in basebackup and so do not … #349



Closed
wants to merge 16 commits into from

Conversation

knizhnik

Statistics are saved in a local file and so are lost on compute restart.

Persist them in the page server using the same AUX file mechanism used for replication slots.

See more about motivation in https://neondb.slack.com/archives/C04DGM6SMTM/p1703077676522789
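
For illustration, here is a minimal sketch of that mechanism, assuming the "neon-file:" prefix convention that appears in the diff below; the helper name, error handling, and buffer management are simplified and not taken from the actual patch:

/* Sketch: WAL-log a small local file (e.g. pgstat.stat) as a non-transactional
 * logical message so the pageserver can store it as an AUX file and return it
 * in the basebackup. */
#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#include "replication/message.h"
#include "storage/fd.h"

static void
wallog_small_file(const char *path)
{
	char		prefix[MAXPGPATH];
	char	   *buf;
	int			fd;
	off_t		size;

	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
	if (fd < 0)
		ereport(ERROR, (errmsg("could not open \"%s\": %m", path)));

	size = lseek(fd, 0, SEEK_END);
	(void) lseek(fd, 0, SEEK_SET);

	buf = palloc(size);
	if (read(fd, buf, size) != size)
		ereport(ERROR, (errmsg("could not read \"%s\": %m", path)));
	CloseTransientFile(fd);

	/* The path travels in the message prefix so the pageserver knows the target file. */
	snprintf(prefix, sizeof(prefix), "neon-file:%s", path);

	/* Non-transactional logical message (PG16 signature); decoded by the pageserver. */
	LogLogicalMessage(prefix, buf, size, false);
	pfree(buf);
}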

MMeent and others added 13 commits November 8, 2023 16:07
This prepares PostgreSQL for compatibility with Neon's storage.
Significant changes compared to the PostgreSQL 15 patchset include:

- Backported changes for users and roles are no longer required
- Use RM_NEON_ID for changes in WAL, instead of modifying core WAL records
* Neon logical replication support for PG16

* Log heap rewrite file after creation.

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
Co-authored-by: Arseny Sher <[email protected]>
* Update WAL buffers when restoring WAL at compute needed for LR

* Fix copying data in WAL buffers

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
PG16 adds a new function to the SMGR: zeroextend.
Its implementation in Neon actually WAL-logs the zero pages of the extended relation.
These zero pages are WAL-logged using XLOG_FPI.
Since the page is zero, the hole optimization (excluding from the image everything between pd_lower and pd_upper) doesn't work.

This PR allows setting the hole size to BLCKSZ in case of a zero page (PageIsNull() returns true).
---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
* Prevent output callbacks from hearing about neon-file messages
wallog_file_descriptor(char const* path, int fd)
{
char prefix[MAXPGPATH];
snprintf(prefix, sizeof(prefix), "neon-file:%s", path);


Static Code Analysis Risk: CWE-121 (Stack-based Buffer Overflow)

The software directly writes into a stack buffer. This might lead to a stack-based buffer overflow. Avoid directly writing into stack buffers without proper boundary checks. Replace unsafe functions like strcpy, strcat, wcscpy, and wcscat with their safer counterparts such as strlcpy, strlcat, wcslcpy, and wcslcat, and use functions like strncpy, stpncpy, and their wide-character variants with caution, ensuring manual null-termination and proper buffer size checks.

Severity: High 🚨
Status: Open 🔴

References:

  1. https://cwe.mitre.org/data/definitions/121
  2. https://github.com/googleprojectzero/weggli
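
As a hedged illustration of the bounds handling the advice above asks for: the snprintf() call in the quoted code is already length-limited, so a sketch would only add an explicit truncation check (names mirror the quoted diff; the error message is made up):

	char		prefix[MAXPGPATH];
	int			len;

	/* snprintf() never writes past sizeof(prefix); len reports how long the
	 * full string would have been, which makes truncation detectable. */
	len = snprintf(prefix, sizeof(prefix), "neon-file:%s", path);
	if (len < 0 || len >= (int) sizeof(prefix))
		elog(ERROR, "path too long to wal-log: %s", path);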


@MMeent commented Jan 25, 2024

Oppose: We should be logging less data, not more. Persisting this directly to S3 is preferred over sending this through safekeeper->Pageserver. The Pageserver (or Compute) can pull this directly from S3 whenever they need to move these files around.

Same with Logical Replication files (though that ship has sailed by now). We shouldn't have to WAL-log anything to update a replication slot: they're not related to persistent changes in data and should be able to catch up, but with WAL-logging of those files it's impossible to have a replication stream that's persistently up-to-date with the current insertion point: you need to log its progress to make it persistent, and logging implies the XLogPtr increases, so the replication slot is not up to date anymore.

@knizhnik (Author)

We should be logging less data, not more

This file is saved once, on instance termination, and its size is a few kB. I do not think that it noticeably affects performance or storage usage.

I agree that since this file does not need to be versioned, the PS and our object store may not be the best place for such data.
But we already have such a mechanism. And there are no clear better alternatives.

@hlinnaka suggested storing this information in the control plane. But I am afraid that it could become a bottleneck if thousands of tenants try to persist files in it.

@MMeent suggested writing directly to S3. I do not think it is a good idea to allow compute to write something directly to S3, for security reasons: please remember that compute is not assumed to be safe.

One of the main advantages of the current solution with AUX files is that it provides a very simple and efficient way to restore these files at compute: they are just included in the basebackup.

So, I agree that we may need to find some alternative storage for this data. But IMHO implementing a separate service just to handle these files is overkill. As for me, the current solution is the best compromise. And the fact that this information is actually versioned may be useful in the future. For example, it is not bad that when we take a snapshot (create a branch) at some particular LSN, we also get the statistics at that point.

Combining replication with branching is more tricky and is prohibited right now. But still, we may find some use for it in the future.

But my main argument is still that I do not see better alternatives.

@MMeent commented Jan 25, 2024

suggested writing directly to S3.

It doesn't have to be directly from Compute. As long as S3 is the storage medium, we could route the IOs just fine through the Pageserver connection (if so desired). The main argument I'm making is that it shouldn't be in the WAL, because it's a really ugly hack that's barely functional, and only works on branches with no replicas.

We need stats on read replicas, too, but with this that won't work correctly, as we'll always start with the primary's latest written statistics, instead of the secondary's statistics.

But my main argument is still that I do not see better alternatives.

My point is that these should be persisted as blobs on S3, not as WAL records in SK/PS, that is, we should have some persistent "scratch" storage for each compute on S3. This storage should be reset when the compute's timeline is changed (to prevent issues with replication switching timelines thus changing history; you can't reuse the history like that).

We can't realistically use the statistics files at different points in time: branches (or read replicas) created after a database load but before the primary shut down won't have the statistics for the new table; branches/read replicas created after truncating a table will have "full" statistics for an empty table.
I think we'd best have a cleanup job in our extension (or Compute image) that uploads known persistent-but-unlogged data files, like the statistics files, pg_stat_statements' database, and logical replication's data.

@knizhnik (Author)

How can we propagate this file to the PS?
Extend our PS API?
Also, the PS is non-reliable storage: if it is restarted, some recently uploaded changes can be lost (it doesn't matter whether we are storing them in local storage or in S3).
Sending them through the WAL enforces durability.
And it is a much easier solution.

What are the cons of this approach?
As far as I understand, only some extra storage consumption. But are you serious? 100 bytes per replication slot and 24 kB for pgstat.stat, which is saved only on shutdown. Can that cause any noticeable storage overhead?

@MMeent commented Jan 25, 2024

How can we propagate this file to the PS? Extend our PS API?

Yes, with a new "storeFilePersistently" call? Shouldn't be too hard to create, I think.
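
Roughly, such a call could look like the sketch below; the struct and function names are purely hypothetical and nothing like this exists in the pageserver API today:

/* Hypothetical request: compute hands a whole file to the pageserver, which
 * stores it as a blob on S3 on the compute's behalf, keeping authentication
 * and authorization on the pageserver side. */
typedef struct NeonStoreFileRequest
{
	const char *path;			/* file path relative to PGDATA */
	const char *data;			/* file contents */
	size_t		size;			/* length of data in bytes */
} NeonStoreFileRequest;

extern void neon_store_file_persistently(const NeonStoreFileRequest *req);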

Also, the PS is non-reliable storage: if it is restarted, some recently uploaded changes can be lost

... how? S3 doesn't lose data, right? Just use PS as a proxy for S3, which does the authentication and authorization for that compute node's "persistent ephemeral files".

(it doesn't matter whether we are storing them in local storage or in S3). Sending them through the WAL enforces durability. And it is a much easier solution.

"easy" as in, we need to encode the file into WAL records, then decode it from WAL, then make sure it's correctly located in the basebackup, we cannot delete the files easily, cannot ever have a consistent LSN in the file because the moment we've written the file it's out-of-date, etc.?

I'm not convinced it's a good solution. PostgreSQL doesn't WAL-log those files, and I think it's for good reasons.

What are the cons of this approach? As far as I understand, only some extra storage consumption. But are you serious? 100 bytes per replication slot and 24 kB for pgstat.stat, which is saved only on shutdown. Can that cause any noticeable storage overhead?

You're missing the pg_prewarm buffer dump (which, when used, could improve cold start performance significantly), whose size depends on shared_buffers (~20 B per buffer, times tens of thousands of buffers), and the size of the pgstat file depends on how many relations there are, etc.

Anyway, I'm not here to argue storage overhead. I'm here to argue that it's overhead in our WAL system that shouldn't be in our WAL system because it wasn't designed to be crash-safe through WAL.

@knizhnik (Author)

Yes, with a new "storeFilePersistently" call? Shouldn't be too hard to create, I think.

Definitely, it is possible. Actually, in my implementation of autoprewarm, neondatabase/neon#2734,
I already did it (adding a generic fcntl-like call which can be used for many purposes, for example allowing removal of the special methods for handling unlogged relations):

smgr_fcntl(RelationGetSmgr(rel), SMGR_FCNTL_CACHE_SNAPSHOT, block_info_array, num_blocks * sizeof(BlockInfoRecord));

Still, I think that if it is possible not to extend the API, then it is better not to do it.
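
For clarity, a hypothetical declaration reconstructed from the call quoted above; the real patch in neondatabase/neon#2734 may differ:

#include "storage/smgr.h"

/* fcntl(2)-style escape hatch on the storage manager, dispatching on an
 * operation code instead of adding a dedicated smgr method per feature. */
typedef enum SmgrFcntlOperation
{
	SMGR_FCNTL_CACHE_SNAPSHOT	/* persist a prewarm/buffer-cache snapshot */
} SmgrFcntlOperation;

extern void smgr_fcntl(SMgrRelation reln, SmgrFcntlOperation op,
					   void *arg, size_t arg_size);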

S3 doesn't lose data, right?

S3 is not losing data (I hope). But to be sure that the data is really saved, we need to send a sync request to the PS, the PS has to save it to S3 and wait for completion, then send a reply to compute, and only after receiving this reply can the backend continue processing.
I am afraid it can have an awful impact on the speed of some operations (including node suspension).

"easy" as in, we need to encode the file into WAL records, then decode it from WAL, then make sure it's correctly located in the basebackup, we cannot delete the files easily, cannot ever have a consistent LSN in the file because the moment we've written the file it's out-of-date, etc.?

All these operations (serialize, deserialize, collect for basebackup, ...) we need to implement in any case.
The only difference is that the current approach is already implemented, and you suggest doing it once again.

You're missing the pg_prewarm buffer dump

It is not supported now, and I do not see any intention to support it (my PR mentioned above was created 2 years ago).
And even if we want to use the same mechanism: 128 MB of shared buffers is 16384 8 kB buffers, which at 20 bytes per buffer is about 320 kB that needs to be saved on compute shutdown. Compare that, for example, with the serialized catalog we are storing in the PS, which is copied on each update (i.e. on each DDL).

Anyway, I'm not here to argue storage overhead. I'm here to argue that it's overhead in our WAL system that shouldn't be in our WAL system because it wasn't designed to be crash-safe through WAL.

Storing active XIDs in WAL is ok? PostgreSQL is doing it, and I think it's for good reasons.

Well, seriously, I am not 100% sure that this approach with logical messages is the best one. IMHO it has more pros than cons compared with the alternatives. If we store pgstat.stat in S3, then it means that at each node startup we have to download something from S3. That seems completely unacceptable. Store it in the local file system of the PS? That is even more awful and unsafe (although I agree with you that all this data is not so critical and can be lost).

@MMeent commented Jan 26, 2024

I already did it (adding a generic fcntl-like call which can be used for many purposes, for example allowing removal of the special methods for handling unlogged relations):

smgr_fcntl(RelationGetSmgr(rel), SMGR_FCNTL_CACHE_SNAPSHOT, block_info_array, num_blocks * sizeof(BlockInfoRecord));

Still, I think that if it is possible not to extend the API, then it is better not to do it.

I don't think smgr is the right place for generic API calls. Maybe not even for whole-file reads or writes; considering that it's built on page-sized IOs on relfiles.

Additionally, I think we shouldn't limit the API to only certain known files. If we can support arbitrary paths I think that'd be better in the long run, as it would reduce the code required in pageserver for every new unlogged file.

If we store pgstat.stat in S3, then it means that at each node startup we have to download something from S3. That seems completely unacceptable.

Why would that be unacceptable?
Downloading from S3 only needs to take a few milliseconds, which IMO is OK for statistics etc. We can pull all this in parallel to the basebackup, as we don't even need to wait for a consistent LSN or SK quorum to get all this data.

Compare that, for example, with the serialized catalog we are storing in the PS, which is copied on each update (i.e. on each DDL).

I do not believe we store serialized catalogs, updated using full COW, on the PS. That's 15+ MB of data per database; I do not believe that we do that.

knizhnik and others added 3 commits January 31, 2024 15:41
* Load SLRU segments on demand

refer #8763

* Fix errors in downloading SLRU segments

* Fix build problems

* Undo accidental changes

* Remove unintended changes

* Fix smgr_read_slru_segment

* Determine SLRU kind in extension

* Use ctl->PagePrecedes for SLRU page comparison in SimpleLruDownloadSegment to address wraparound

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
@knizhnik knizhnik force-pushed the persist_pgstat_file branch from cd1091a to e222bb0 Compare February 1, 2024 06:07
@knizhnik (Author) commented Feb 1, 2024

I don't think smgr is the right place for generic API calls.

If we want to send some request to the page server, it is hard to do it any other way, bypassing the SMGR. Also, the SMGR already has a connection to the PS, and establishing yet another connection does not seem like a good idea.

So we want to persist a file. Who else should be responsible for it except the storage manager?

Why would that be unacceptable?

S3 access latency is about 100–200 milliseconds, and we need to perform several requests.
It means that node restart time cannot be smaller than a few seconds.

I do not believe we store serialized catalogs, updated using full COW, on the PS. That's 15+ MB of data per database; I do not believe that we do that.

Unfortunately, that is really the current status quo. I also do not like this solution and am absolutely sure that we should fix it at some point.

@MMeent commented Feb 1, 2024

I don't think smgr is the right place for generic API calls.

If we want to send some request to the page server, it is hard to do it any other way, bypassing the SMGR. Also, the SMGR already has a connection to the PS, and establishing yet another connection does not seem like a good idea.

So we want to persist a file. Who else should be responsible for it except the storage manager?

I don't know, but it's currently not handled by the SMGR. I don't think we should make such large changes to our fork here; just a "write this to S3" hook should be sufficient IMO.

Why would that be unacceptable?

S3 access latency is about 100–200 milliseconds, and we need to perform several requests. It means that node restart time cannot be smaller than a few seconds.

Shouldn't parallel requests to S3 speed this up significantly? Also, 100ms sounds very high for these "small" objects, shouldn't it be closer to 10-50ms?

I do not believe we store serialized catalogs; updated using full COW, on PS. That's 15+MB of data per database; I do not believe that we do that.

Unfortunately, that is really the current status quo. I also do not like this solution and am absolutely sure that we should fix it at some point.

So every time DDL is run we write a new catalog to disk? That sounds very bad.

@knizhnik (Author) commented Feb 8, 2024

Replaced with #356

@knizhnik knizhnik closed this Feb 8, 2024