
Persist pgstat file in PS to include it in basebackup and so do not … #349



Closed
wants to merge 16 commits into from

Conversation

knizhnik

Statistics are saved in a local file and so are lost on compute restart.

Persist them in the page server using the same AUX file mechanism used for replication slots.

See more about motivation in https://neondb.slack.com/archives/C04DGM6SMTM/p1703077676522789
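
For illustration, here is a minimal sketch of that mechanism, assuming the "neon-file:" prefix convention that appears in the diff below; the helper name, error handling, and buffer management are simplified and not taken from the actual patch:

/* Sketch: WAL-log a small local file (e.g. pgstat.stat) as a non-transactional
 * logical message so the pageserver can store it as an AUX file and return it
 * in the basebackup. */
#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#include "replication/message.h"
#include "storage/fd.h"

static void
wallog_small_file(const char *path)
{
	char		prefix[MAXPGPATH];
	char	   *buf;
	int			fd;
	off_t		size;

	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
	if (fd < 0)
		ereport(ERROR, (errmsg("could not open \"%s\": %m", path)));

	size = lseek(fd, 0, SEEK_END);
	(void) lseek(fd, 0, SEEK_SET);

	buf = palloc(size);
	if (read(fd, buf, size) != size)
		ereport(ERROR, (errmsg("could not read \"%s\": %m", path)));
	CloseTransientFile(fd);

	/* The path travels in the message prefix so the pageserver knows the target file. */
	snprintf(prefix, sizeof(prefix), "neon-file:%s", path);

	/* Non-transactional logical message (PG16 signature); decoded by the pageserver. */
	LogLogicalMessage(prefix, buf, size, false);
	pfree(buf);
}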

MMeent and others added 13 commits November 8, 2023 16:07
This prepares PostgreSQL for compatibility with Neon's storage.
Significant changes compared to the PostgreSQL 15 patchset include:

- Backported changes for users and roles are no longer required
- Use RM_NEON_ID for changes in WAL, instead of modifying core WAL records
* Neon logical replication support for PG16

* Log heap rewrite file after creation.

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
Co-authored-by: Arseny Sher <[email protected]>
* Update WAL buffers when restoring WAL at compute needed for LR

* Fix copying data in WAL buffers

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
PG16 adds a new function to the SMGR: zeroextend.
Its implementation in Neon actually WAL-logs the zero pages of the extended relation.
These zero pages are WAL-logged using XLOG_FPI.
Since the page is zero, the hole optimization (excluding from the image everything between pd_lower and pd_upper) doesn't work.

This PR allows setting the hole size to BLCKSZ in case of a zero page (PageIsNull() returns true).
---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
* Prevent output callbacks from hearing about neon-file messages
wallog_file_descriptor(char const* path, int fd)
{
char prefix[MAXPGPATH];
snprintf(prefix, sizeof(prefix), "neon-file:%s", path);


Static Code Analysis Risk: CWE-121 (Stack-based Buffer Overflow)

The software directly writes into a stack buffer. This might lead to a stack-based buffer overflow. Avoid directly writing into stack buffers without proper boundary checks. Replace unsafe functions like strcpy, strcat, wcscpy, and wcscat with their safer counterparts such as strlcpy, strlcat, wcslcpy, and wcslcat, and use functions like strncpy, stpncpy, and their wide-character variants with caution, ensuring manual null-termination and proper buffer size checks.

Severity: High 🚨
Status: Open 🔴

References:

  1. https://cwe.mitre.org/data/definitions/121
  2. https://github.com/googleprojectzero/weggli
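
As a hedged illustration of the bounds handling the advice above asks for: the snprintf() call in the quoted code is already length-limited, so a sketch would only add an explicit truncation check (names mirror the quoted diff; the error message is made up):

	char		prefix[MAXPGPATH];
	int			len;

	/* snprintf() never writes past sizeof(prefix); len reports how long the
	 * full string would have been, which makes truncation detectable. */
	len = snprintf(prefix, sizeof(prefix), "neon-file:%s", path);
	if (len < 0 || len >= (int) sizeof(prefix))
		elog(ERROR, "path too long to wal-log: %s", path);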


@MMeent commented Jan 25, 2024

Oppose: We should be logging less data, not more. Persisting this directly to S3 is preferred over sending this through safekeeper->Pageserver. The Pageserver (or Compute) can pull this directly from S3 whenever they need to move these files around.

Same with Logical Replication files (though that ship has sailed by now). We shouldn't have to WAL-log anything to update a replication slot: they're not related to persistent changes in data and should be able to catch up, but with WAL-logging of those files it's impossible to have a replication stream that's persistently up-to-date with the current insertion point: you need to log its progress to make it persistent, and logging implies the XLogPtr increases, so the replication slot is not up to date anymore.

@knizhnik (Author)

We should be logging less data, not more

This file is saved once, on instance termination, and its size is a few kB. I do not think that it noticeably affects performance or storage usage.

I agree that since this file does not need to be versioned, the PS and our object store may not be the best place for such data.
But we already have such a mechanism. And there are no clear better alternatives.

@hlinnaka suggested storing this information in the control plane. But I am afraid that it could become a bottleneck if thousands of tenants try to persist files in it.

@MMeent suggested writing directly to S3. I do not think it is a good idea to allow compute to write something directly to S3, for security reasons: please remember that compute is not assumed to be safe.

One of the main advantages of the current solution with AUX files is that it provides a very simple and efficient way to restore these files at compute: they are just included in the basebackup.

So, I agree that we may need to find some alternative storage for this data. But IMHO implementing a separate service just to handle these files is overkill. As for me, the current solution is the best compromise. And the fact that this information is actually versioned may be useful in the future. For example, it is not bad that when we take a snapshot (create a branch) at some particular LSN, we also get the statistics at that point.

Combining replication with branching is more tricky and is prohibited right now. But still, we may find some use for it in the future.

But my main argument is still that I do not see better alternatives.

@MMeent commented Jan 25, 2024

suggested writing directly to S3.

It doesn't have to be directly from Compute. As long as S3 is the storage medium, we could route the IOs just fine through the Pageserver connection (if so desired). The main argument I'm making is that it shouldn't be in the WAL, because it's a really ugly hack that's barely functional, and only works on branches with no replicas.

We need stats on read replicas, too, but with this that won't work correctly, as we'll always start with the primary's latest written statistics, instead of the secondary's statistics.

But my main argument is still that I do not see better alternatives.

My point is that these should be persisted as blobs on S3, not as WAL records in SK/PS, that is, we should have some persistent "scratch" storage for each compute on S3. This storage should be reset when the compute's timeline is changed (to prevent issues with replication switching timelines thus changing history; you can't reuse the history like that).

We can't realistically use the statistics files at different points in time: branches (or read replicas) created after a database load but before the primary shut down won't have the statistics for the new table; branches/read replicas created after truncating a table will have "full" statistics for an empty table.
I think we'd best have a cleanup job in our extension (or Compute image) that uploads known persistent-but-unlogged data files, like the statistics files, pg_stat_statements' database, and logical replication's data.

@knizhnik (Author)

How can we propagate this file to the PS?
Extend our PS API?
Also, the PS is non-reliable storage: if it is restarted, some recently uploaded changes can be lost (it doesn't matter whether we are storing them in local storage or in S3).
Sending them through the WAL enforces durability.
And it is a much easier solution.

What are the cons of this approach?
As far as I understand, only some extra storage consumption. But are you serious? 100 bytes per replication slot and 24 kB for pgstat.stat, which is saved only on shutdown. Can that cause any noticeable storage overhead?

@MMeent commented Jan 25, 2024

How can we propagate this file to the PS? Extend our PS API?

Yes, with a new "storeFilePersistently" call? Shouldn't be too hard to create, I think.
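
Roughly, such a call could look like the sketch below; the struct and function names are purely hypothetical and nothing like this exists in the pageserver API today:

/* Hypothetical request: compute hands a whole file to the pageserver, which
 * stores it as a blob on S3 on the compute's behalf, keeping authentication
 * and authorization on the pageserver side. */
typedef struct NeonStoreFileRequest
{
	const char *path;			/* file path relative to PGDATA */
	const char *data;			/* file contents */
	size_t		size;			/* length of data in bytes */
} NeonStoreFileRequest;

extern void neon_store_file_persistently(const NeonStoreFileRequest *req);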

Also, the PS is non-reliable storage: if it is restarted, some recently uploaded changes can be lost

... how? S3 doesn't lose data, right? Just use PS as a proxy for S3, which does the authentication and authorization for that compute node's "persistent ephemeral files".

(it doesn't matter whether we are storing them in local storage or in S3). Sending them through the WAL enforces durability. And it is a much easier solution.

"easy" as in, we need to encode the file into WAL records, then decode it from WAL, then make sure it's correctly located in the basebackup, we cannot delete the files easily, cannot ever have a consistent LSN in the file because the moment we've written the file it's out-of-date, etc.?

I'm not convinced it's a good solution. PostgreSQL doesn't WAL-log those files, and I think it's for good reasons.

What are the cons of this approach? As far as I understand, only some extra storage consumption. But are you serious? 100 bytes per replication slot and 24 kB for pgstat.stat, which is saved only on shutdown. Can that cause any noticeable storage overhead?

You're missing the pg_prewarm buffer dump (which, when used, could improve cold start performance significantly), whose size depends on shared_buffers (~20 B per buffer, times tens of thousands of buffers), and the size of the pgstat file depends on how many relations there are, etc.

Anyway, I'm not here to argue storage overhead. I'm here to argue that it's overhead in our WAL system that shouldn't be in our WAL system because it wasn't designed to be crash-safe through WAL.

@knizhnik (Author)

Yes, with a new "storeFilePersistently" call? Shouldn't be too hard to create, I think.

Definitely, it is possible. Actually, in my implementation of autoprewarm, neondatabase/neon#2734,
I already did it (adding a generic fcntl-like call which can be used for many purposes, for example allowing removal of the special methods for handling unlogged relations):

smgr_fcntl(RelationGetSmgr(rel), SMGR_FCNTL_CACHE_SNAPSHOT, block_info_array, num_blocks * sizeof(BlockInfoRecord));

Still, I think that if it is possible not to extend the API, then it is better not to do it.
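
For clarity, a hypothetical declaration reconstructed from the call quoted above; the real patch in neondatabase/neon#2734 may differ:

#include "storage/smgr.h"

/* fcntl(2)-style escape hatch on the storage manager, dispatching on an
 * operation code instead of adding a dedicated smgr method per feature. */
typedef enum SmgrFcntlOperation
{
	SMGR_FCNTL_CACHE_SNAPSHOT	/* persist a prewarm/buffer-cache snapshot */
} SmgrFcntlOperation;

extern void smgr_fcntl(SMgrRelation reln, SmgrFcntlOperation op,
					   void *arg, size_t arg_size);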

S3 doesn't lose data, right?

S3 is not losing data (I hope). But to be sure that the data is really saved, we need to send a sync request to the PS, the PS has to save it to S3 and wait for completion, then send a reply to compute, and only after receiving this reply can the backend continue processing.
I am afraid it can have an awful impact on the speed of some operations (including node suspension).

"easy" as in, we need to encode the file into WAL records, then decode it from WAL, then make sure it's correctly located in the basebackup, we cannot delete the files easily, cannot ever have a consistent LSN in the file because the moment we've written the file it's out-of-date, etc.?

All these operations (serialize, deserialize, collect for basebackup, ...) we need to implement in any case.
The only difference is that the current approach is already implemented, and you suggest doing it once again.

You're missing the pg_prewarm buffer dump

It is not supported now, and I do not see any intention to support it (my PR mentioned above was created 2 years ago).
And even if we want to use the same mechanism: 128 MB of shared buffers is 16384 8 kB buffers, which at 20 bytes per buffer is about 320 kB that needs to be saved on compute shutdown. Compare that, for example, with the serialized catalog we are storing in the PS, which is copied on each update (i.e. on each DDL).

Anyway, I'm not here to argue storage overhead. I'm here to argue that it's overhead in our WAL system that shouldn't be in our WAL system because it wasn't designed to be crash-safe through WAL.

Storing active XIDs in WAL is ok? PostgreSQL is doing it, and I think it's for good reasons.

Well, seriously, I am not 100% sure that this approach with logical messages is the best one. IMHO it has more pros than cons compared with the alternatives. If we store pgstat.stat in S3, then it means that at each node startup we have to download something from S3. That seems completely unacceptable. Store it in the local file system of the PS? That is even more awful and unsafe (although I agree with you that all this data is not so critical and can be lost).

@MMeent commented Jan 26, 2024

I already did it (adding a generic fcntl-like call which can be used for many purposes, for example allowing removal of the special methods for handling unlogged relations):

smgr_fcntl(RelationGetSmgr(rel), SMGR_FCNTL_CACHE_SNAPSHOT, block_info_array, num_blocks * sizeof(BlockInfoRecord));

Still, I think that if it is possible not to extend the API, then it is better not to do it.

I don't think smgr is the right place for generic API calls. Maybe not even for whole-file reads or writes; considering that it's built on page-sized IOs on relfiles.

Additionally, I think we shouldn't limit the API to only certain known files. If we can support arbitrary paths I think that'd be better in the long run, as it would reduce the code required in pageserver for every new unlogged file.

If we store pgstat.stat in S3, then it means that at each node startup we have to download something from S3. That seems completely unacceptable.

Why would that be unacceptable?
Downloading from S3 only needs to take a few milliseconds, which IMO is OK for statistics etc. We can pull all this in parallel to the basebackup, as we don't even need to wait for a consistent LSN or SK quorum to get all this data.

Compare that, for example, with the serialized catalog we are storing in the PS, which is copied on each update (i.e. on each DDL).

I do not believe we store serialized catalogs, updated using full COW, on the PS. That's 15+ MB of data per database; I do not believe that we do that.

knizhnik and others added 3 commits January 31, 2024 15:41
* Load SLRU segments on demand

refer #8763

* Fix errors in downloading SLRU segments

* Fix build problems

* Undo accidental changes

* Remove unintended changes

* Fix smgr_read_slru_segment

* Determine SLRU kind in extension

* Use ctl->PagePrecedes for SLRU page comparison in SimpleLruDownloadSegment to address wraparound

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
@knizhnik knizhnik force-pushed the persist_pgstat_file branch from cd1091a to e222bb0 Compare February 1, 2024 06:07
@knizhnik (Author) commented Feb 1, 2024

I don't think smgr is the right place for generic API calls.

If we want to send some request to the page server, it is hard to do it any other way, bypassing the SMGR. Also, the SMGR already has a connection to the PS, and establishing yet another connection does not seem like a good idea.

So we want to persist a file. Who else should be responsible for it except the storage manager?

Why would that be unacceptable?

S3 access latency is about 100–200 milliseconds, and we need to perform several requests.
It means that node restart time cannot be smaller than a few seconds.

I do not believe we store serialized catalogs, updated using full COW, on the PS. That's 15+ MB of data per database; I do not believe that we do that.

Unfortunately, that is really the current status quo. I also do not like this solution and am absolutely sure that we should fix it at some point.

@MMeent commented Feb 1, 2024

I don't think smgr is the right place for generic API calls.

If we want to send some request to the page server, it is hard to do it any other way, bypassing the SMGR. Also, the SMGR already has a connection to the PS, and establishing yet another connection does not seem like a good idea.

So we want to persist a file. Who else should be responsible for it except the storage manager?

I don't know, but it's currently not handled by the SMGR. I don't think we should make such large changes to our fork here; just a "write this to S3" hook should be sufficient IMO.

Why would that be unacceptable?

S3 access latency is about 100–200 milliseconds, and we need to perform several requests. It means that node restart time cannot be smaller than a few seconds.

Shouldn't parallel requests to S3 speed this up significantly? Also, 100ms sounds very high for these "small" objects, shouldn't it be closer to 10-50ms?

I do not believe we store serialized catalogs; updated using full COW, on PS. That's 15+MB of data per database; I do not believe that we do that.

Unfortunately, that is really the current status quo. I also do not like this solution and am absolutely sure that we should fix it at some point.

So every time DDL is run we write a new catalog to disk? That sounds very bad.

@knizhnik (Author) commented Feb 8, 2024

Replaced with #356

@knizhnik knizhnik closed this Feb 8, 2024