
Commit d81d392

Merge pull request #6722 from EnterpriseDB/DOCS-1430--aidb-version-4-0-processingmodes
additional aidb 4.0 additions
2 parents 81b5c60 + 969f2ad commit d81d392

21 files changed, +548 -274 lines changed
@@ -0,0 +1,138 @@
---
title: "Pipeline Auto-Processing"
navTitle: "Auto-Processing"
description: "Pipeline Auto-Processing"
---

## Overview

Pipeline Auto-Processing is designed to keep source data and pipeline output in sync. Without this capability, users would have to
trigger processing manually or provide external scripts, schedulers, or triggers:

- **Full sync:** Inserts, updates, and deletes are all handled automatically. No lost updates, missing data, or stale records.
- **Change detection:** Only new or changed records are processed. No unnecessary re-processing of known records.
- **Batch processing:** Records are grouped into batches and processed concurrently, reducing overhead and achieving optimal performance, e.g., with GPU-based AI model inference tasks.
- **Background processing:** When enabled, the pipeline runs in a background worker process so that it doesn't block or delay other DB operations. Ideal for processing huge datasets.
- **Live processing for Postgres tables:** When the data source is a Postgres table, live trigger-based auto-processing can be enabled so that pipeline results are always guaranteed to be up to date.
- **Quick turnaround:** Once a batch has finished processing, the results are immediately available. No full listing of the source data is needed to start processing. This is important for large external volumes where a full listing can take a long time.

### Example for Knowledge Base Pipeline

A knowledge base is created for a Postgres table containing products with product descriptions.
The user configures background auto-processing to always keep embeddings in sync without blocking or delaying any operations on the products table.

The pipeline processes any pre-existing product records in the background; the user can query the statistics table to see the progress.

The background process runs whenever new data is inserted, or existing data is modified or deleted.

Queries on the knowledge base (i.e., retrieval operations) always return accurate results, within a small background processing delay.

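A minimal sketch of how such a pipeline could be set up. The parameter names shown here are illustrative assumptions rather than the documented signature; see the [`aidb.create_table_knowledge_base`](../reference/knowledge_bases#aidbcreate_table_knowledge_base) reference for the exact arguments. It assumes an embedding model named `my_embeddings` and a `products` table with a `description` column:

```sql
SELECT aidb.create_table_knowledge_base(
    name => 'products_kb',               -- hypothetical pipeline name
    model_name => 'my_embeddings',       -- embedding model registered beforehand
    source_table => 'products',
    source_data_column => 'description',
    auto_processing => 'Background'      -- keep embeddings in sync via a background worker
);
```
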
### Supported pipelines

The Knowledge Base pipeline supports all capabilities listed here.

The Preparer pipeline does not yet support batch processing or background auto-processing.

## Auto-Processing modes

We offer the following auto-processing modes to suit different requirements and use cases.

!!! Note
Live auto-processing is only available for table sources, not for volume sources.
!!!

### Live

We set up Postgres triggers on the source table to immediately process any changes. Processing happens within the trigger function.
This means it happens within the same transaction that modifies the data, guaranteeing up-to-date results.

#### Considerations

- Transactional guarantee / immediate results: pipeline results are always up to date with the source data.
- Blocks / delays operations on the source data: modifying transactions on the source data are delayed until processing is complete.

### Background

We start a Postgres background worker for each pipeline configured with this mode. The worker picks up new and changed records and processes them in batches, asynchronously from the transactions that modify the source data.

!!! Note
Make sure Postgres allows running enough background workers for the number of pipelines where you wish to use this processing mode.
!!!

#### Considerations

- Does not block or delay operations on the source data; processing happens asynchronously.
- Results become available after a small background processing delay rather than within the modifying transaction.
- Ideal for processing huge datasets.

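To check and, if needed, raise the background worker budget, you can use standard Postgres settings. A sketch, assuming each background-processing pipeline counts against `max_worker_processes`:

```sql
-- Check how many worker processes the server allows in total.
SHOW max_worker_processes;

-- Raise the limit if many pipelines use background auto-processing.
-- Changing this setting requires a server restart.
ALTER SYSTEM SET max_worker_processes = 16;
```
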
### Disabled

No automatic processing takes place; processing has to be triggered manually. For table sources, no change detection is available in this mode, so a manual run has to process all source records. For volume sources, change detection still applies, so only new or changed records are processed.

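In this mode, processing is typically triggered with the manual `bulk_embedding()` operation described under [change detection](#change-detection). A sketch, assuming it is called as `aidb.bulk_embedding(<knowledge base name>)`:

```sql
-- Manually process the source records of a pipeline with auto-processing disabled.
SELECT aidb.bulk_embedding('kb_table_text_manual');
```
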
## Observability

We provide detailed status and progress output for all auto-processing modes.

A good place to get an overview is the statistics table.
Look up the view `aidb.knowledge_base_stats` or use its short alias `aidb.kbstat`. The view shows all configured knowledge base pipelines,
which processing mode is set, and statistics about the processed records:

```sql
SELECT * from aidb.kbstat;
__OUTPUT__
     knowledge base     | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings)
------------------------+-----------------+-------------------------+-------------------------+-----------------------+-------------------
 kb_table_text_bg       | Background      |                       0 |                         |                    15 |                15
 kb_table_text_manual   | Disabled        |                       0 |                         |                    15 |                15
 kb_table_image_manual  | Disabled        |                       0 |                         |                     3 |                 3
 kb_table_text_live     | Live            |                       0 |                         |                    15 |                15
 kb_table_image_bg      | Background      |                       0 |                         |                     3 |                 3
 kb_volume_text_bg      | Background      |                         |                       6 |                     7 |                 7
 kb_volume_text_manual  | Disabled        |                         |                       0 |                     0 |                 0
 kb_volume_image_bg     | Background      |                         |                       4 |                   177 |                 6
 kb_volume_image_manual | Disabled        |                         |                       1 |                   177 |                 6
(9 rows)
```

The [change detection](#change-detection) mechanism is central to how auto-processing works. It differs between volume and table sources.
For this reason, the stats table has different columns for these two source types.

* `table: unprocessed rows`: How many unique rows are listed in the backlog of change events.
    * If auto-processing is disabled, no (new) change events are captured.
* `volume: scans completed`: How many full listings of the source have been completed so far.
* `count(source records)`: How many records exist in the source for this pipeline.
    * For table sources, this number is always accurate.
    * For volume sources, this number can only be updated after a full scan has completed.
* `count(embeddings)`: How many embeddings exist in the vector destination table for this pipeline.

## Configuration

Auto-processing can be configured at creation time:

- With [`aidb.create_table_knowledge_base`](../reference/knowledge_bases#aidbcreate_table_knowledge_base)
- With [`aidb.create_volume_knowledge_base`](../reference/knowledge_bases#aidbcreate_volume_knowledge_base)

As well as for existing pipelines:

- With [`aidb.set_auto_knowledge_base`](../reference/knowledge_bases#aidbset_auto_knowledge_base)

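For example, switching an existing pipeline to background auto-processing could look like the following sketch. It assumes the function takes the knowledge base name and the desired mode; check the reference above for the exact signature:

```sql
-- Sketch: enable background auto-processing for an existing knowledge base.
SELECT aidb.set_auto_knowledge_base('products_kb', 'Background');
```
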
## Batch processing

In Background and Disabled modes, (auto) processing happens in batches of configurable size. Within each batch, records are processed concurrently, which reduces overhead and achieves better performance, e.g., with GPU-based AI model inference tasks.

## Change detection

AIDB auto-processing is designed around change detection mechanisms for table and volume data sources. This allows it to only
process data when necessary.

### Table sources

When background auto-processing is configured, Postgres triggers are set up on the source table to detect changes. These triggers are very lightweight.
They only record change events and insert them into a "change events" table. No actual processing happens in the trigger function.

The background worker will then process these events asynchronously.

!!! Notice
When auto-processing is disabled, no change detection on tables is possible. This means the manual `bulk_embedding()` operation has to process
all source records.
!!!

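Conceptually, the lightweight change capture resembles the following sketch. This is an illustration only, with hypothetical table, column, and function names; aidb creates and manages its own triggers and event tables:

```sql
-- Illustration only: record change events without doing any processing in the trigger.
CREATE TABLE product_change_events (
    record_id  bigint,
    operation  text,
    changed_at timestamptz DEFAULT now()
);

CREATE FUNCTION record_product_change() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Only record the event; the embedding work happens later, asynchronously.
    IF TG_OP = 'DELETE' THEN
        INSERT INTO product_change_events (record_id, operation) VALUES (OLD.id, TG_OP);
        RETURN OLD;
    ELSE
        INSERT INTO product_change_events (record_id, operation) VALUES (NEW.id, TG_OP);
        RETURN NEW;
    END IF;
END;
$$;

CREATE TRIGGER products_change_capture
    AFTER INSERT OR UPDATE OR DELETE ON products
    FOR EACH ROW EXECUTE FUNCTION record_product_change();
```
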
### Volume sources

This source type provides a `last_modified` timestamp for each source record. The system keeps track of those timestamps in a "state" table.
In each pipeline execution, the system lists the contents of the volume and compares it to the stored timestamps to see whether any records have changed or were added.

This mechanism works both with Disabled and with Background auto-processing.

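Conceptually, the comparison between the volume listing and the state table works like the following sketch. This is an illustration only; the table and column names are hypothetical, and aidb manages its state internally:

```sql
-- Illustration: find objects in the volume listing that are new or changed
-- compared to the recorded processing state.
SELECT listing.object_key
FROM volume_listing AS listing
LEFT JOIN processing_state AS state USING (object_key)
WHERE state.object_key IS NULL                      -- new object, never processed
   OR listing.last_modified > state.last_modified;  -- object changed since last processing
```
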
The system detects deleted objects after a full listing is complete. Only then can it be certain that a previously processed record is no longer present in the source.

Unfortunately, object stores (and other external storage locations supported by our volumes) have limited query capabilities. This means:

!!! Note
Change detection for volumes is based on polling, i.e., repeated listing. This can be an expensive operation when using cloud object stores like AWS S3.
You can use a long `background_sync_interval` (such as once per day) on pipelines with volume sources to control this cost.
!!!

advocacy_docs/edb-postgres-ai/ai-accelerator/index.mdx

+1
@@ -20,6 +20,7 @@ navigation:
   - preparers
   - knowledge_base
   - pgfs
+  - volumes
   - "#Pipelines resources"
   - reference
   - rel_notes

advocacy_docs/edb-postgres-ai/ai-accelerator/knowledge_base/usage.mdx

+2-2
@@ -28,7 +28,7 @@ create_table_knowledge_base(
     topk INTEGER DEFAULT 1,
     distance_operator aidb.distanceoperator DEFAULT 'L2',
     options JSONB DEFAULT '{}'::JSONB,
-    processing_mode aidb.PipelineProcessingMode DEFAULT 'Disabled'
+    auto_processing aidb.pipelineautoprocessingmode DEFAULT 'Disabled'
 )
 ```

@@ -96,7 +96,7 @@ aidb.create_volume_knowledge_base(
     topk INTEGER DEFAULT 1,
     distance_operator aidb.distanceoperator DEFAULT 'L2',
     options JSONB DEFAULT '{}'::JSONB
-    processing_mode aidb.PipelineProcessingMode DEFAULT 'Disabled'
+    auto_processing aidb.pipelineautoprocessingmode DEFAULT 'Disabled'
 )
 ```

advocacy_docs/edb-postgres-ai/ai-accelerator/models/index.mdx

+1-1
@@ -19,4 +19,4 @@ Pipelines has a model registry that manages configured instances of models. Any

 ## Next steps

-Once you are familiar with models, you can learn how to use those models with [knowledge bases](../knowledge_bases).
+Once you are familiar with models, you can learn how to use those models with [knowledge bases](../knowledge_base).

advocacy_docs/edb-postgres-ai/ai-accelerator/models/primitives.mdx

+1-1
@@ -4,7 +4,7 @@ navTitle: "Primitives"
 description: "The model primitives available in EDB Postgres AI - AI Accelerator Pipelines."
 ---

-For most use cases, we recommend that you use the aidb [knowledge bases](../knowledge_bases) to interact with models. They can manage creating embeddings and retrieving matching data for many applications.
+For most use cases, we recommend that you use the aidb [knowledge bases](../knowledge_base) to interact with models. They can manage creating embeddings and retrieving matching data for many applications.

 However, if you need to interact with models directly, you can use the following primitives. The encode functions generate embeddings for text and images, and the decode functions generate text from embeddings.

advocacy_docs/edb-postgres-ai/ai-accelerator/models/supported-models/nim_reranking.mdx

+20-5
Original file line numberDiff line numberDiff line change
@@ -21,24 +21,39 @@ Reranking is a method in text search that sorts results by relevance to make the
2121
* nvidia/llama-3.2-nv-rerankqa-1b-v2 (default)
2222

2323

24+
## Example
25+
The function accepts a string as the "rerank query" and an array of texts to rerank.
26+
The `id` column in the output refers to the index of the text in the input array.
27+
28+
```sql
29+
aidb=# SELECT * from aidb.rerank_text('my_nim_reranker', 'how can I open a door?', '{Ask for help, Push the handle, Lie down and wait, Shout at it}'::text[]) ORDER BY logit_score DESC;
30+
text | logit_score | id
31+
-------------------+--------------+----
32+
Push the handle | -3.697265625 | 1
33+
Ask for help | -6.2578125 | 0
34+
Shout at it | -7.39453125 | 3
35+
Lie down and wait | -11.375 | 2
36+
(4 rows)
37+
```
38+
2439

2540
## Creating the default model
2641

2742
```sql
2843
SELECT aidb.create_model(
29-
'my_nim_reranker',
44+
'my_nim_reranker',
3045
'nim_reranking',
31-
credentials=>'{"api_key": "<API_KEY_HERE>"'::JSONB
32-
);
46+
'{"url":"http://nim-nv-rerankqa-llama-l-1xgpu-g6-predictor.default.svc.cluster.local/v1/ranking", "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2"}'
47+
);
3348
```
3449

35-
There's only one model, the default `nvidia/nvclip`, so you don't need to specify the model in the configuration.
50+
This example uses a locally deployed NIM model that does not require credentials. Credentials and other configuration can be provided as described in [using models](../using-models).
3651

3752
## Model configuration settings
3853

3954
The following configuration settings are available for NIM models:
4055

41-
* `model` &mdash; The NIM model to use. The default is `nvidia/llama-3.2-nv-rerankqa-1b-v2` and is the only model available.
56+
* `model` &mdash; The NIM model to use. The default is `nvidia/llama-3.2-nv-rerankqa-1b-v2` and is the only supported model.
4257
* `url` &mdash; The URL of the model to use. This setting is optional and can be used to specify a custom model URL. The default is `https://ai.api.nvidia.com/v1/retrieval`.
4358

4459
## Model credentials

advocacy_docs/edb-postgres-ai/ai-accelerator/pgfs/functions.mdx renamed to advocacy_docs/edb-postgres-ai/ai-accelerator/pgfs/functions/index.mdx

+14-28
@@ -1,50 +1,36 @@
 ---
 title: "PGFS functions for Pipelines"
 navTitle: "PGFS functions"
-description: "How to use PGFS functions to access external storage in Pipelines."
+description: "How to use PGFS functions to manage external storage in Pipelines."
 ---

-## Using the PGFS functions
-
-The PGFS extension provides a set of functions to create and manage storage locations.
-
-### Creating a storage location
+## Creating a storage location

 Start with creating a storage location. A storage location is a reference to a location in an external file system. You can create a storage location with the `pgfs.create_storage_location` function:

 ```sql
-select pgfs.create_storage_location('my_storage', 's3://my_bucket','','{}'::JSONB,'{}'::JSONB);
+select pgfs.create_storage_location('storage_location_name', 'protocol://path', options => '{}'::JSONB, credentials => '{}'::JSONB);
 ```

-The `create_strorage_location` function takes a name for the storage location and then a URL for the location. Prefix the URL with `s3:` for an S3-compatible bucket or `file:` for a local file system.
-
-
-```sql
-select pgfs.create_storage_location('my_file_storage', 'file:///tmp/my_path', NULL, '{}'::json, '{}'::json );
-```
+### Storage provider types
+Detailed instructions for the supported storage providers can be found here:

-When using the `file:` schema, provide an absolute path, one that starts with `/`, for example `/tmp/my_path`). Together with the schema indicator `file://`, there are then three slashes at the beginning of the path.
-
-The function also takes an optional `msl_id` parameter, which isn't used. It also requires `options` and `credentials` parameters. If those are unused, you must pass them as empty JSON objects.
+- [S3-compatible storage](s3)
+- [Local file system](local)

 ### Creating a storage location with options and credentials

-Using the `options` and `credentials` parameters allows a range of other settings to be passed.
+Using the `options` and `credentials` parameters allows a range of settings to be passed.

 The `options` parameter is a JSON object that can be used to pass additional options to the storage location.
 The `credentials` parameter is a JSON object that can be used to pass credentials to the storage location.

 The difference between `options` and `credentials` is that options remain visible to users querying the extension while credentials are hidden to all users except superusers and the user that creates the storage location.

-For example, you can create a storage location with options and credentials like this:
-
-```sql
-select pgfs.create_storage_location('my_storage', 's3://my_private_bucket', null, '{"region": "eu-west"}'::JSONB, '{"access_key_id": "youraccesskeyid", "secret_access_key":"yoursecretaccesskey"}'::JSONB);
-```
-
-Once you've created a storage location, you can use it to create foreign tables and access files in the external file system. To use it with aidb, you need to create a volume from the storage location. To do that, see [Creating a volume](../knowledge_bases/usage#creating-a-new-volume).
+### Testing storage locations and using them with AIDB
+To use a storage location with aidb, you need to create a volume from the storage location. To do that, see [Creating a volume](../../volumes).

-### Listing storage locations
+## Listing storage locations

 You can list all storage locations with the `pgfs.list_storage_locations` function:

@@ -54,7 +40,7 @@ select * from pgfs.list_storage_locations();

 This command returns a table of currently defined storage locations. Credentials are shown only if the user has the necessary permissions. Otherwise the column is NULL.

-### Getting a storage location
+## Getting a storage location

 You can get the details of a specific storage location with the `pgfs.get_storage_location` function:

@@ -64,15 +50,15 @@ select * from pgfs.get_storage_location('my_storage');

 This command returns the details of the storage location named `my_storage`.

-### Updating a storage location
+## Updating a storage location

 You can update a storage location with the `pgfs.update_storage_location` function:

 ```sql
 select pgfs.update_storage_location('my_storage', 's3://my_bucket', null, '{"region": "eu-west"}');
 ```

-### Deleting a storage location
+## Deleting a storage location

 You can delete a storage location with the `pgfs.delete_storage_location` function:

@@ -0,0 +1,26 @@
---
title: "Pipelines PGFS with local file storage"
navTitle: "Local file storage"
description: "How to use Pipelines PGFS with local file storage."
---

## Overview: local file systems

PGFS uses the `file:` prefix to indicate a local file system.

The general syntax for using local file systems is this:

```sql
select pgfs.create_storage_location(
    'local_images',
    'file:///var/lib/edb/pipelines/images'
);
```

!!! Note
Paths must always be absolute, i.e., they start at the root `/`. Together with the protocol prefix `file://`, this means your path has three slashes, as in the example above.
!!!

!!! Note
Any local path that you want to access must be allowlisted in the [PGFS settings](../settings).
!!!