additional aidb 4.0 additions #6722
Merged: timwaizenegger merged 24 commits into `DOCS-1430--aidb-version-4-0` from `DOCS-1430--aidb-version-4-0-processingmodes` on Apr 29, 2025.

Commits (24):
- 679a454 add relnotes and docs update for reranking enhancement (timwaizenegger)
- 4150aab update generated release notes (github-actions[bot])
- e81fc35 add relnotes for Improved debugging capabilities for AI models access… (timwaizenegger)
- ac029ce update generated release notes (github-actions[bot])
- ef923c7 processing and error log (timwaizenegger)
- 2377706 update generated release notes (github-actions[bot])
- 55484bf enhance pgfs docs (timwaizenegger)
- c3ed727 fix a path (timwaizenegger)
- e46eb14 typos / fixes (timwaizenegger)
- 76488b7 typos / fixes (timwaizenegger)
- 4a8a873 processing mode re-name (timwaizenegger)
- b4be30c relnote for auto processing (timwaizenegger)
- a1cd877 update generated release notes (github-actions[bot])
- 022ee1d update KB reference (timwaizenegger)
- d8fa501 fix links (timwaizenegger)
- 3c3c1a7 auto processing (timwaizenegger)
- a9f09be Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto… (timwaizenegger)
- 929bcaf Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto… (timwaizenegger)
- 317e5ac Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto… (timwaizenegger)
- 7336a9a Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto… (timwaizenegger)
- fc9effa Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto… (timwaizenegger)
- 9919b6b Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto… (timwaizenegger)
- 23e1b21 add reference for "kbstat" view (timwaizenegger)
- 969f2ad remove content I used for copy/pasting (timwaizenegger)
138 additions, 0 deletions: advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto-processing.mdx (new file)
---
title: "Pipeline Auto-Processing"
navTitle: "Auto-Processing"
description: "Pipeline Auto-Processing"
---

## Overview

Pipeline Auto-Processing is designed to keep source data and pipeline output in sync. Without this capability, users would have to
trigger processing manually or provide external scripts, schedulers, or triggers:

- **Full sync:** Inserts, updates, and deletes are all handled automatically. No lost updates, missing data, or stale records.
- **Change detection:** Only new or changed records are processed. No unnecessary re-processing of known records.
- **Batch processing:** Records are grouped into batches and processed concurrently. This reduces overhead and achieves optimal performance, e.g., with GPU-based AI model inference tasks.
- **Background processing:** When enabled, the pipeline runs in a background worker process so that it doesn't block or delay other DB operations. Ideal for processing huge datasets.
- **Live processing for Postgres tables:** When the data source is a Postgres table, live trigger-based auto-processing can be enabled so that pipeline results are always guaranteed to be up to date.
- **Quick turnaround:** Once a batch has finished processing, the results are immediately available. No full listing of the source data is needed to start processing. This is important for large external volumes where a full listing can take a long time.

### Example for Knowledge Base Pipeline

A knowledge base is created for a Postgres table containing products with product descriptions.
The user configures background auto-processing to always keep embeddings in sync without blocking or delaying any operations on the products table.

The pipeline processes any pre-existing product records in the background, and the user can query the statistics table to see the progress.

The background process runs when new data is inserted, and when existing data is modified or deleted.

Queries on the knowledge base (i.e., retrieval operations) always return accurate results, within a small background processing delay.
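A minimal sketch of this scenario, using only function and view names that appear on this page. The argument shape of `aidb.set_auto_knowledge_base` (pipeline name plus mode) and the pipeline name `kb_products` are assumptions for illustration; check the reference pages linked below for the actual signature.

```sql
-- Hypothetical call shape: switch the products knowledge base to
-- background auto-processing ('kb_products' is a made-up pipeline name).
SELECT aidb.set_auto_knowledge_base('kb_products', 'Background');

-- Watch the backlog drain while pre-existing records are processed.
SELECT * FROM aidb.kbstat;
```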

### Supported pipelines

The Knowledge Base pipeline supports all capabilities listed here.

The Preparer pipeline does not yet support batch processing or background auto-processing.

## Auto-Processing modes

We offer the following auto-processing modes to suit different requirements and use cases.

!!! Note
Live auto-processing is only available for table sources, not for volume sources.
!!!

### Live

We set up Postgres triggers on the source table to immediately process any changes. Processing happens within the trigger function.
This means it happens within the same transaction that modifies the data, guaranteeing up-to-date results.

#### Considerations

- Transactional guarantee / immediate results: pipeline results are always up to date with the source data.
- Blocks / delays operations on the source data: modifying transactions on the source data are delayed until processing is complete.

### Background

We start a Postgres background worker for each pipeline that uses this mode. The worker picks up changes asynchronously and processes them in batches, without blocking other operations on the source data.

!!! Note
Make sure Postgres allows running enough background workers for the number of pipelines where you wish to use this processing mode.
!!!

#### Considerations

- Processing doesn't block or delay modifying transactions on the source data.
- Pipeline results lag behind the source data by a small background processing delay.

### Disabled

No automatic processing takes place. Processing has to be triggered manually, e.g., with the `bulk_embedding()` operation.

For table sources, no change events are captured while auto-processing is disabled, so a manual run has to process all source records. For volume sources, change detection still works in this mode (see [Change detection](#change-detection)).

## Observability

We provide detailed status and progress output for all auto-processing modes.

A good place to get an overview is the statistics table.
Look up the view `aidb.knowledge_base_stats` or use its short alias, `aidb.kbstat`. The table shows all configured knowledge base pipelines,
which processing mode is set, and statistics about the processed records:

```sql
SELECT * FROM aidb.kbstat;
__OUTPUT__
     knowledge base     | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings)
------------------------+-----------------+-------------------------+-------------------------+-----------------------+-------------------
 kb_table_text_bg       | Background      |                       0 |                         |                    15 |                15
 kb_table_text_manual   | Disabled        |                       0 |                         |                    15 |                15
 kb_table_image_manual  | Disabled        |                       0 |                         |                     3 |                 3
 kb_table_text_live     | Live            |                       0 |                         |                    15 |                15
 kb_table_image_bg      | Background      |                       0 |                         |                     3 |                 3
 kb_volume_text_bg      | Background      |                         |                       6 |                     7 |                 7
 kb_volume_text_manual  | Disabled        |                         |                       0 |                     0 |                 0
 kb_volume_image_bg     | Background      |                         |                       4 |                   177 |                 6
 kb_volume_image_manual | Disabled        |                         |                       1 |                   177 |                 6
(9 rows)
```

The [change detection](#change-detection) mechanism is central to how auto-processing works. It differs between volume and table sources.
For this reason, the stats table has different columns for these two source types.

* `table: unprocessed rows`: How many unique rows are listed in the backlog of change events.
    * If auto-processing is disabled, no (new) change events are captured.
* `volume: scans completed`: How many full listings of the source have been completed so far.
* `count(source records)`: How many records exist in the source for this pipeline.
    * For table sources, this number is always accurate.
    * For volume sources, this number can only be updated after a full scan has completed.
* `count(embeddings)`: How many embeddings exist in the vector destination table for this pipeline.

## Configuration

Auto-processing can be configured at creation time:

- With [`aidb.create_table_knowledge_base`](../reference/knowledge_bases#aidbcreate_table_knowledge_base)
- With [`aidb.create_volume_knowledge_base`](../reference/knowledge_bases#aidbcreate_volume_knowledge_base)

As well as for existing pipelines:

- With [`aidb.set_auto_knowledge_base`](../reference/knowledge_bases#aidbset_auto_knowledge_base)

## Batch processing

In Background and Disabled modes, (auto) processing happens in batches of configurable size. Within each batch, records are processed concurrently.

## Change detection

AIDB auto-processing is designed around change detection mechanisms for table and volume data sources. This allows it to only
process data when necessary.

### Table sources

When background auto-processing is configured, Postgres triggers are set up on the source table to detect changes. These triggers are very lightweight.
They only record change events and insert them into a "change events" table. No actual processing happens in the trigger function.

The background worker then processes these events asynchronously.

!!! Note
When auto-processing is disabled, no change detection on tables is possible. This means the manual `bulk_embedding()` operation has to process
all source records.
!!!
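A manual run for such a pipeline can be sketched as follows. The call shape (the `aidb` schema qualification and a single knowledge-base-name argument) is an assumption based on the mention of `bulk_embedding()` above; consult the reference for the actual signature.

```sql
-- Hypothetical call shape: manually (re)process all source records for a
-- pipeline whose auto-processing is disabled. The pipeline name matches
-- one shown in the aidb.kbstat example output.
SELECT aidb.bulk_embedding('kb_table_text_manual');
```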

### Volume sources

This source type provides a `last_modified` timestamp for each source record. The system keeps track of those timestamps in a "state" table.
In each pipeline execution, the system lists the contents of the volume and compares them to the stored timestamps to see whether any records have changed or were added.

This mechanism works in both disabled and background auto-processing modes.

The system detects deleted objects only after a full listing is complete. Only then can it be certain that a previously processed record is no longer present in the source.

Unfortunately, object stores (and other external storage locations supported by our volumes) have limited query capabilities. This means:

!!! Note
Change detection for volumes is based on polling, i.e., repeated listing. This can be an expensive operation when using cloud object stores like AWS S3.
You can use a long `background_sync_interval` (such as once per day) on pipelines with volume sources to control this cost.
!!!
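As a sketch, a daily sync interval might be configured like this. Only the setting name `background_sync_interval` comes from the note above; the named-parameter syntax, the interval value format, and passing it through `aidb.set_auto_knowledge_base` are all assumptions, so verify against the reference.

```sql
-- Hypothetical: poll the volume source once per day to limit listing
-- costs on cloud object stores. Exact parameter syntax may differ.
SELECT aidb.set_auto_knowledge_base(
    'kb_volume_text_bg',
    'Background',
    background_sync_interval => '1 day'
);
```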
File renamed without changes.
26 additions, 0 deletions: advocacy_docs/edb-postgres-ai/ai-accelerator/pgfs/functions/local.mdx (new file)
---
title: "Pipelines PGFS with local file storage"
navTitle: "Local file storage"
description: "How to use Pipelines PGFS with local file storage."
---

## Overview: local file systems

PGFS uses the `file:` prefix to indicate a local file system.

The general syntax for using local file systems is this:

```sql
SELECT pgfs.create_storage_location(
    'local_images',
    'file:///var/lib/edb/pipelines/images'
);
```

!!! Note
Paths must always be absolute, i.e., they start at the root `/`. Together with the protocol prefix `file://`, this means your path has three slashes, as in the example above.
!!!

!!! Note
Any local path that you want to access must be allowlisted in the [PGFS settings](../settings).
!!!
Review comment: Reference content for this table is missing.