Free space management initial PR #5205
base: master
Conversation
    bool isInMemoryMode() const;
};

class PageChunkManager {
Not sure if this is the best name for this class; let me know if you have better ideas.
metadata.numPages += (newNumPages - numPages);
const auto numNewPages =
    writeValues(state, metadata.numValues, data, nullChunkData, 0 /*dataOffset*/, numValues);
KU_ASSERT(numNewPages == 0);
Is there actually a case where we append pages here? I only see this function being called when appending to dictionary chunks for in-place string writes, which I don't think should add new pages to the data file (the pages should all have been added when the data/offset chunks were initially allocated). I also haven't hit the assertion I added when running tests.
If there is a case where we do append pages, this way is wrong, since updateShadowedPageWithCursor just appends to the end of the file, which doesn't guarantee that all the pages for a data/offset chunk are contiguous. So if we don't use this logic, I'll just remove it.
@benjaminwinger can you double check here?
@@ -20,16 +20,16 @@ struct ChunkCheckpointState {
 struct ColumnCheckpointState {
     ColumnChunkData& persistentData;
     std::vector<ChunkCheckpointState> chunkCheckpointStates;
-    common::row_idx_t maxRowIdxToWrite;
+    common::row_idx_t endRowIdxToWrite;
Renamed this from maxRowIdxToWrite to endRowIdxToWrite: this value is exclusive, and the misleading name was causing it to be used incorrectly (inclusively).
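The exclusive convention can be illustrated with a toy helper (a sketch; the names and function are illustrative, not the actual kuzu code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using row_idx_t = uint64_t;

// Writes rows in [startRowIdx, endRowIdxToWrite) -- the end index is EXCLUSIVE.
// A caller writing rows 0..9 passes endRowIdxToWrite = 10 and exactly 10 rows
// are written; treating the bound inclusively would write one row too many.
inline size_t numRowsToWrite(row_idx_t startRowIdx, row_idx_t endRowIdxToWrite) {
    assert(endRowIdxToWrite >= startRowIdx);
    return static_cast<size_t>(endRowIdxToWrite - startRowIdx);
}
```

Naming the bound `end...` rather than `max...` matches the STL half-open-range convention, which is why the rename prevents off-by-one misuse.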
Codecov Report
Attention: Patch coverage is

Additional details and impacted files
@@ Coverage Diff @@
##           master    #5205      +/-   ##
==========================================
- Coverage   86.97%   86.95%   -0.02%
==========================================
  Files        1405     1411       +6
  Lines       61629    61864     +235
  Branches     7541     7570      +29
==========================================
+ Hits        53603    53796    +193
- Misses       7853     7895     +42
  Partials      173      173
Benchmarks
I tested on three FSM strategies:
Baselines
Below are the database sizes of some of the datasets with a single COPY (no FSM triggered):
Results
Observations
The space penalty of always allocating a power-of-2 number of pages generally outweighs the benefit of potentially having less fragmentation (even when there are fewer free pages left, the database size is still larger). Since the FSM seems to have minimal impact on query times, this strategy is not worth it. Instead I went with the first strategy, since the additional time penalty of needing to iterate is minimal. One thing to keep in mind, though, is that this strategy results in many tiny free chunks accumulating over time. Although these chunks may have a minimal impact on the space footprint, due to their size there is a low chance of them being reallocated. We should keep this in mind when we design our
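The iterate-for-a-fit strategy can be sketched roughly as a first-fit scan; `FreeChunk`, the container, and the function name here are hypothetical stand-ins for the PR's actual structures:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using page_idx_t = uint32_t;

struct FreeChunk {
    page_idx_t startPageIdx;
    page_idx_t numPages;
};

// First-fit: scan the free list and reuse the first chunk large enough for the
// request. If the chunk is bigger than needed, the remainder goes back as a
// smaller free chunk -- this split is where tiny leftover chunks accumulate
// over time, as noted in the observations above.
inline std::optional<page_idx_t> allocateFirstFit(std::vector<FreeChunk>& freeChunks,
    page_idx_t numPagesNeeded) {
    for (size_t i = 0; i < freeChunks.size(); ++i) {
        if (freeChunks[i].numPages >= numPagesNeeded) {
            const auto startPageIdx = freeChunks[i].startPageIdx;
            freeChunks[i].startPageIdx += numPagesNeeded;
            freeChunks[i].numPages -= numPagesNeeded;
            if (freeChunks[i].numPages == 0) {
                freeChunks.erase(freeChunks.begin() + static_cast<std::ptrdiff_t>(i));
            }
            return startPageIdx;
        }
    }
    return std::nullopt; // no fit: the caller appends new pages to the data file
}
```

The scan is linear in the number of free chunks, which is why its time penalty stays small as long as the free list stays short.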
Benchmark Result
Master commit hash:

Benchmark Result
Master commit hash:
… checkpoint offset + immediately make freed chunks reusable
src/storage/page_chunk_manager.cpp
Outdated
    const auto pageIdx = entry.startPageIdx + i;
    fileHandle->removePageFromFrameIfNecessary(pageIdx);
}
freeSpaceManager->addFreeChunk(entry);
Making a freed chunk immediately rewritable may not be fully robust, so I may need to change this to only make freed chunks rewritable after a checkpoint completes (the reasoning is similar to why we have shadow files). This will make the FSM slightly less efficient, though.
Otherwise, a case like the following might happen:
begin out-of-place update -> free existing chunk -> reclaim the same chunk -> start writing -> DB crashes
When the DB is reloaded, part of the existing chunk will already have been overwritten, so there is no way to recover.
…eManager->FileHandle
Thanks! This looks good to me at a high level. I have some initial comments here. I want to take a closer look after they're addressed.
@@ -234,6 +235,9 @@ class KUZU_API ColumnChunkData {

     void updateStats(const common::ValueVector* vector, const common::SelectionView& selVector);

+    virtual void reclaimAllocatedPages(PageChunkManager& pageChunkManager,
I don't think this function is necessary, as the reclaim is just based on ChunkState, and you don't need to go inside ColumnChunkData for that, as the ChunkState is already a function param.
Moved the logic to ChunkState.
Description
This PR adds the main infrastructure needed to support free space management. In this PR, space is only reclaimed after out-of-place updates of column chunks; reclaiming space after other operations (e.g. dropping columns) will be added in future PRs.
New classes:
- PageChunkManager: All data file page allocations are routed through this class. Consumers request sequential chunks of pages, which are either allocated by appending new pages to the data file or by reallocating free pages. Consumers also use this class to free pages when they are no longer needed.
- FreeSpaceManager: Keeps track of chunks of pages that have been freed. When page allocations are requested, this class can be used to reallocate previously freed pages. This class should not be used directly; instead, calls should be made to the PageChunkManager, which handles accesses to this class.

Contributor agreement
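To make the division of responsibilities described in this PR concrete, here is a minimal sketch of the allocation flow. Everything beyond the two class names (method names, the whole-chunk reuse, the page counter) is an assumption for illustration, not the actual interface:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using page_idx_t = uint32_t;

struct PageChunkEntry {
    page_idx_t startPageIdx;
    page_idx_t numPages;
};

// Tracks freed page chunks. Not used directly by consumers.
class FreeSpaceManager {
public:
    void addFreeChunk(PageChunkEntry entry) { freeChunks.push_back(entry); }

    // Returns the start of the first freed chunk large enough for the request.
    // For brevity this hands back the whole chunk and drops any remainder; a
    // real FSM would keep the remainder as a smaller free chunk.
    std::optional<page_idx_t> popFreeChunk(page_idx_t numPages) {
        for (size_t i = 0; i < freeChunks.size(); ++i) {
            if (freeChunks[i].numPages >= numPages) {
                const auto startPageIdx = freeChunks[i].startPageIdx;
                freeChunks.erase(freeChunks.begin() + static_cast<std::ptrdiff_t>(i));
                return startPageIdx;
            }
        }
        return std::nullopt;
    }

private:
    std::vector<PageChunkEntry> freeChunks;
};

// Single entry point for data file page allocation: try the free list first,
// and only append new pages to the end of the data file when nothing fits.
class PageChunkManager {
public:
    page_idx_t allocatePageChunk(page_idx_t numPages) {
        if (auto reused = freeSpaceManager.popFreeChunk(numPages)) {
            return *reused;
        }
        const auto startPageIdx = numFilePages;
        numFilePages += numPages; // append new pages to the data file
        return startPageIdx;
    }

    void freePageChunk(PageChunkEntry entry) { freeSpaceManager.addFreeChunk(entry); }

private:
    FreeSpaceManager freeSpaceManager;
    page_idx_t numFilePages = 0;
};
```

Routing both allocation and freeing through one manager keeps the invariant that every chunk handed out is contiguous, regardless of whether it came from reuse or from growing the file.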