Persist pgstat file in PS to include it in basebackup and so do not lose this information on compute restart #349
Conversation
This prepares PostgreSQL for compatibility with Neon's storage. Significant changes compared to the PostgreSQL 15 patchset include:
* Backported changes for users and roles are no longer required
* Use RM_NEON_ID for changes in WAL, instead of modifying core WAL records
Co-authored-by: Konstantin Knizhnik <[email protected]>
* Neon logical replication support for PG16
* Log heap rewrite file after creation.
---------
Co-authored-by: Konstantin Knizhnik <[email protected]>
Co-authored-by: Arseny Sher <[email protected]>
Co-authored-by: Konstantin Knizhnik <[email protected]>
* Update WAL buffers when restoring WAL at compute, needed for LR
* Fix copying data in WAL buffers
---------
Co-authored-by: Konstantin Knizhnik <[email protected]>
PG16 adds a new function to SMGR: zeroextend. Its implementation in Neon actually WAL-logs the zero pages of the extended relation, using XLOG_FPI. Since the page is all zeros, the hole optimization (excluding from the image everything between pd_lower and pd_upper) doesn't apply. This PR allows setting the hole size to BLCKSZ for a zero page (PageIsNull() returns true).
---------
Co-authored-by: Konstantin Knizhnik <[email protected]>
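A rough sketch of the special case being described; the field accesses follow PostgreSQL's standard page layout, but the variable names and surrounding logic are assumptions, not the actual patch:

/* Sketch of the hole computation described above.  For a page that is
 * entirely zeros there is nothing worth imaging, so the whole block is
 * treated as the hole; otherwise the hole is the standard free-space
 * gap between pd_lower and pd_upper. */
uint16 hole_offset;
uint16 hole_length;

if (PageIsNull(page))
{
	/* all-zero page: the full-page image payload can be empty */
	hole_offset = 0;
	hole_length = BLCKSZ;
}
else
{
	hole_offset = ((PageHeader) page)->pd_lower;
	hole_length = ((PageHeader) page)->pd_upper - hole_offset;
}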
* Prevent output callbacks from hearing about neon-file messages
wallog_file_descriptor(char const* path, int fd)
{
	char prefix[MAXPGPATH];
	snprintf(prefix, sizeof(prefix), "neon-file:%s", path);
Static Code Analysis Risk: CWE-121 (Stack-based Buffer Overflow)
The software directly writes into a stack buffer. This might lead to a stack-based buffer overflow. Avoid directly writing into stack buffers without proper boundary checks. Replace unsafe functions like strcpy, strcat, wcscpy, and wcscat with their safer counterparts such as strlcpy, strlcat, wcslcpy, and wcslcat, and use functions like strncpy, stpncpy, and their wide-character variants with caution, ensuring manual null-termination and proper buffer size checks.
Severity: High 🚨
Status: Open 🔴
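For reference, snprintf() with sizeof(prefix) already bounds the write, so the practical residual risk in the flagged line is silent truncation of a long path rather than an overflow. A minimal hardening sketch (the elog() check is an assumption, not part of this PR):

/* snprintf() never writes more than sizeof(prefix) bytes and always
 * NUL-terminates, so the write itself is bounded; checking the return
 * value catches the silent-truncation case for overly long paths. */
char prefix[MAXPGPATH];
int  n = snprintf(prefix, sizeof(prefix), "neon-file:%s", path);

if (n < 0 || n >= (int) sizeof(prefix))
	elog(ERROR, "neon-file path too long: \"%s\"", path);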
Oppose: We should be logging less data, not more. Persisting this directly to S3 is preferred over sending it through safekeeper -> pageserver. The pageserver (or compute) can pull it directly from S3 whenever these files need to be moved around. Same with the logical replication files (though that ship has sailed by now). We shouldn't have to WAL-log anything to update a replication slot: slots are not related to persistent changes in data and should be able to catch up. But with WAL-logging of those files it's impossible to have a replication stream that is persistently up to date with the current insertion point: you need to log the slot's progress to make it persistent, and logging implies the XLogPtr increases, so the slot is immediately out of date again.
This file is saved once, on instance termination, and its size is a few kB. I do not think it somehow affects performance or storage usage. I agree that since this file does not need to be versioned, PS and our object store may not be the best place for such data. @hlinnaka suggested storing this information in the control plane, but I am afraid it can become a bottleneck (if thousands of tenants try to persist files in it). @MMeent suggested writing directly to S3. I do not think it is a good idea to allow compute to write something directly to S3 (for security reasons: please remember that compute is not assumed to be safe). One of the main advantages of the current solution with AUX files is that it provides a very simple and efficient way to restore these files at compute: they are just included in the basebackup. So, I agree that we may need to find some alternative storage for this data, but IMHO implementing a separate service just to handle these files is overkill. To me the current solution is the best compromise. And the fact that this information is actually versioned may be useful in the future: for example, it is not bad that when we take a snapshot (create a branch) at some particular LSN, we also get the statistics at that point. Combining replication with branching is more tricky and right now it is prohibited, but we may still find some use for it in the future. My main argument, though, is still that I do not see better alternatives.
It doesn't have to be directly from Compute. As long as S3 is the storage medium, we could route the IOs just fine through the Pageserver connection (if so desired). The main argument I'm making is that it shouldn't be in the WAL, because it's a real ugly hack that's barely functional, and only works on branches with no replicas. We need stats on read replicas, too, but with this that won't work correctly, as we'll always start with the primary's latest written statistics, instead of the secondary's statistics.
My point is that these should be persisted as blobs on S3, not as WAL records in SK/PS, that is, we should have some persistent "scratch" storage for each compute on S3. This storage should be reset when the compute's timeline is changed (to prevent issues with replication switching timelines thus changing history; you can't reuse the history like that). We can't realistically use the statistics files at different points in time: branches (or read replicas) created after a database load but before the primary shut down won't have the statistics for the new table; branches/read replicas created after truncating a table will have "full" statistics for an empty table.
How can we propagate this file to PS? What are the cons of this approach?
Yes, with a new "storeFilePersistently" call? Shouldn't be too hard to create, I think.
... how? S3 doesn't lose data, right? Just use PS as a proxy for S3, which does the authentication and authorization for that compute node's "persistent ephemeral files".
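A minimal sketch of what such a pageserver-proxied call could look like; "storeFilePersistently" is only a name floated in this thread, and the whole signature below is hypothetical:

/* Hypothetical API shape for the proxied persistence discussed above.
 * Compute never talks to S3 itself: the pageserver authenticates the
 * tenant and durably stores the blob in that compute's scratch space. */
typedef enum
{
	PERSIST_OK,
	PERSIST_AUTH_FAILED,
	PERSIST_IO_ERROR
} PersistResult;

/* Synchronous: returns only after the pageserver confirms the object
 * is durably stored in S3. */
extern PersistResult storeFilePersistently(const char *path,
                                           const void *data,
                                           size_t size);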
"easy" as in, we need to encode the file into WAL records, then decode it from WAL, then make sure it's correctly located in the basebackup, we cannot delete the files easily, cannot ever have a consistent LSN in the file because the moment we've written the file it's out-of-date, etc.? I'm not convinced it's a good solution. PostgreSQL doesn't WAL-log those files, and I think it's for good reasons.
You're missing the pg_prewarm buffer dump (which, when used, could improve cold-start performance significantly), which scales with shared_buffers (~20 B per buffer, times tens of thousands of buffers), while the pgstat file scales with how many relations there are, etc. Anyway, I'm not here to argue storage overhead. I'm here to argue that it's overhead in our WAL system that shouldn't be in our WAL system, because it wasn't designed to be crash-safe through WAL.
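(For scale, an illustrative calculation with assumed numbers: at shared_buffers = 128 MB, i.e. 16,384 pages of 8 kB, a dump at ~20 B per buffer is roughly 320 kB per shutdown; larger shared_buffers scale this linearly.)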
Definitely it is possible; see my implementation of autoprewarm in neondatabase/neon#2734.
Still, I think that if it is possible not to extend the API, then it is better not to do it.
S3 does not lose data (I hope). But to be sure that the data is really saved, we need to send a sync request to the PS, have the PS save it to S3 and wait for completion, then send a reply to compute; only after receiving this reply can the backend continue processing.
All these operations (serialize, deserialize, collect for basebackup, ...) we need to implement in any case.
It is not supported now, and I do not see any intention to support it (that PR of mine mentioned above was created 2 years ago).
Is storing active XIDs in WAL ok? PostgreSQL is doing it, and I think it's for good reasons. Well, seriously, I am not 100% sure that this approach with logical messages is the best one. IMHO it has more pros than cons compared with the alternatives. If we store pgstat.stat in S3, then at each node startup we have to download something from S3, which seems completely unacceptable. Store it in the local file system of the PS? That is even more awful and unsafe (although I agree with you that all this data is not so critical and can be lost).
I don't think smgr is the right place for generic API calls. Maybe not even for whole-file reads or writes, considering that it's built on page-sized IOs on relfiles. Additionally, I think we shouldn't limit the API to only certain known files. If we can support arbitrary paths I think that'd be better in the long run, as it would reduce the code required in the pageserver for every new unlogged file.
Why would that be unacceptable?
I do not believe we store serialized catalogs, updated using full COW, on PS. That's 15+ MB of data per database; I do not believe that we do that.
* Load SLRU segments on demand (refer #8763)
* Fix errors in downloading SLRU segments
* Fix build problems
* Undo occasional changes
* Remove unintended changes
* Fix smgr_read_slru_segment
* Determine SLRU kind in extension
* Use ctl->PagePrecedes for SLRU page comparison in SimpleLruDownloadSegment to address wraparound
---------
Co-authored-by: Konstantin Knizhnik <[email protected]>
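The last item is the subtle one. A self-contained illustration (not the actual Neon code) of why SLRU page comparison must go through a wraparound-aware callback such as ctl->PagePrecedes rather than a plain "<":

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* SLRU page numbers wrap around, so plain integer comparison misorders
 * pages near the wraparound point.  Signed modular arithmetic of this
 * flavor underlies PostgreSQL's TransactionIdPrecedes, on which the
 * per-SLRU PagePrecedes callbacks are built. */
static bool
page_precedes_modulo(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;	/* true iff a is logically older */
}

int
main(void)
{
	uint32_t old_page = 0xFFFFFFF0;	/* logically older */
	uint32_t new_page = 0x00000010;	/* logically newer, numerically smaller */

	printf("plain <  says new precedes old: %d (wrong)\n", new_page < old_page);
	printf("modular  says old precedes new: %d (right)\n",
	       page_precedes_modulo(old_page, new_page));
	return 0;
}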
Force-pushed from cd1091a to e222bb0
If we want to send some request to the page server, it is hard to do it any other way, bypassing SMGR. Also, SMGR already has a connection with the PS, and establishing yet another connection does not seem like a good idea. So we want to persist a file: who else should be responsible for it except the storage manager?
S3 access latency is about 100–200 milliseconds, and we need to perform several requests.
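(Illustrative arithmetic with these assumed figures: three sequential round trips at ~150 ms each already add roughly 450 ms to any startup or shutdown path that must wait for them.)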
Unfortunately it really is the current status quo. I also do not like this solution, and I am absolutely sure that at some point we should fix it.
I don't know, but it's currently not handled by the SMGR. I don't think we should make such large changes to our fork here; just a "write this to
Shouldn't parallel requests to S3 speed this up significantly? Also, 100ms sounds very high for these "small" objects, shouldn't it be closer to 10-50ms?
So every time DDL is run we write a new catalog to disk? That sounds very bad.
Force-pushed from f7ea954 to 550cdd2
Replaced with #356
Statistics are saved in a local file and so are lost on compute restart.
Persist them in the page server using the same AUX file mechanism used for replication slots.
See more about the motivation in https://neondb.slack.com/archives/C04DGM6SMTM/p1703077676522789
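Putting the pieces together, a minimal sketch of the AUX-file mechanism as implied by the wallog_file_descriptor() fragment above. Only the "neon-file:" prefix and the use of logical messages come from this PR; the function name, error handling, and everything else below are assumptions (note that LogLogicalMessage takes this four-argument form through PG16):

#include "postgres.h"
#include "replication/message.h"

/* Sketch: WAL-log a file's contents as a non-transactional logical
 * message whose prefix carries the file path.  The pageserver stores
 * such messages as AUX files and includes them in the basebackup, so
 * the compute gets the file back on the next start. */
static void
wallog_file(const char *path, const char *content, size_t size)
{
	char prefix[MAXPGPATH];
	int  n = snprintf(prefix, sizeof(prefix), "neon-file:%s", path);

	if (n < 0 || n >= (int) sizeof(prefix))
		elog(ERROR, "neon-file path too long: \"%s\"", path);

	/* transactional = false: replayed regardless of the surrounding
	 * transaction's outcome */
	LogLogicalMessage(prefix, content, size, false);
}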