additional aidb 4.0 additions #6722


Merged
Changes from all commits
24 commits
679a454
add relnotes and docs update for reranking enhancement
timwaizenegger Apr 24, 2025
4150aab
update generated release notes
github-actions[bot] Apr 24, 2025
e81fc35
add relnotes for Improved debugging capabilities for AI models access…
timwaizenegger Apr 24, 2025
ac029ce
update generated release notes
github-actions[bot] Apr 24, 2025
ef923c7
processing and error log
timwaizenegger Apr 24, 2025
2377706
update generated release notes
github-actions[bot] Apr 24, 2025
55484bf
enhance pgfs docs
timwaizenegger Apr 25, 2025
c3ed727
fix a path
timwaizenegger Apr 25, 2025
e46eb14
typos / fixes
timwaizenegger Apr 25, 2025
76488b7
typos / fixes
timwaizenegger Apr 25, 2025
4a8a873
processing mode re-name
timwaizenegger Apr 25, 2025
b4be30c
relnote for auto processing
timwaizenegger Apr 25, 2025
a1cd877
update generated release notes
github-actions[bot] Apr 25, 2025
022ee1d
update KB reference
timwaizenegger Apr 25, 2025
d8fa501
fix links
timwaizenegger Apr 25, 2025
3c3c1a7
auto processing
timwaizenegger Apr 25, 2025
a9f09be
Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto…
timwaizenegger Apr 29, 2025
929bcaf
Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto…
timwaizenegger Apr 29, 2025
317e5ac
Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto…
timwaizenegger Apr 29, 2025
7336a9a
Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto…
timwaizenegger Apr 29, 2025
fc9effa
Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto…
timwaizenegger Apr 29, 2025
9919b6b
Update advocacy_docs/edb-postgres-ai/ai-accelerator/capabilities/auto…
timwaizenegger Apr 29, 2025
23e1b21
add reference for "kbstat" view
timwaizenegger Apr 29, 2025
969f2ad
remove content I used for copy/pasting
timwaizenegger Apr 29, 2025
@@ -0,0 +1,138 @@
---
title: "Pipeline Auto-Processing"
navTitle: "Auto-Processing"
description: "Pipeline Auto-Processing"
---

## Overview
Pipeline Auto-Processing is designed to keep source data and pipeline output in sync. Without this capability, users would have to
manually trigger processing or provide external scripts, schedulers, or triggers:
- **Full sync:** Inserts, updates, and deletes are all handled automatically. No lost updates, missing data, or stale records.
- **Change detection:** Only new or changed records are processed. Known records aren't re-processed unnecessarily.
- **Batch processing:** Records are grouped into batches and processed concurrently, reducing overhead and achieving optimal performance, for example with GPU-based AI model inference tasks.
- **Background processing:** When enabled, the pipeline runs in a background worker process so that it doesn't block or delay other DB operations. Ideal for processing huge datasets.
- **Live processing for Postgres tables:** When the data source is a Postgres table, live trigger-based auto-processing can be enabled so that pipeline results are guaranteed to be up to date.
- **Quick turnaround:** Once a batch has finished processing, the results are immediately available. No full listing of the source data is needed to start processing. This is important for large external volumes where a full listing can take a long time.

### Example for Knowledge Base Pipeline
A knowledge base is created for a Postgres table containing products with product descriptions.
The user configures background auto-processing to keep the embeddings in sync without blocking or delaying any operations on the products table.

The pipeline processes any pre-existing product records in the background, and the user can query the statistics table to see the progress.

The background process runs whenever new data is inserted, or existing data is modified or deleted.

Queries on the knowledge base (that is, retrieval operations) always return accurate results, subject only to a small background processing delay.
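
A minimal sketch of such a setup follows. Only the `auto_processing` parameter is taken from the reference; the other parameter names and all object names (`products_kb`, `products`, `description`, `my_embedding_model`) are illustrative, so check [`aidb.create_table_knowledge_base`](../reference/knowledge_bases#aidbcreate_table_knowledge_base) for the exact signature.

```sql
SELECT aidb.create_table_knowledge_base(
    name               => 'products_kb',        -- illustrative pipeline name
    model_name         => 'my_embedding_model', -- an embedding model registered beforehand with aidb.create_model()
    source_table       => 'products',           -- illustrative source table
    source_data_column => 'description',        -- illustrative text column to generate embeddings for
    auto_processing    => 'Background'          -- keep embeddings in sync without blocking writes on the table
);
```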


### Supported pipelines
The Knowledge Base pipeline supports all capabilities listed here.

The Preparer pipeline doesn't yet support batch processing or background auto-processing.

## Auto-Processing modes
We offer the following Auto-Processing modes to suit different requirements and use cases.

!!! Note
Live auto-processing is only available for table sources, not for volume sources.
!!!

### Live
We set up Postgres triggers on the source table to immediately process any changes. Processing happens within the trigger function.
This means it happens within the same transaction that modifies the data, guaranteeing up-to-date results.

#### Considerations
- Transactional guarantee / immediate results. Pipeline results are always up to date with the source data.
- Blocks / delays operations on the source data. Modifying transactions on the source data are delayed until processing is complete.

### Background
We start a Postgres background worker for each pipeline that has background auto-processing configured. The worker picks up new and changed records and processes them asynchronously, in batches, without blocking or delaying other operations on the source data.


!!! Note
Make sure Postgres allows running enough background workers for the number of pipelines where you wish to use this processing mode.
!!!
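
For example, you can check the current limit and raise it if needed. The value below is illustrative; `max_worker_processes` is shared with other extensions and parallel queries, and changing it requires a server restart:

```sql
SHOW max_worker_processes;

-- Illustrative value: leave at least one free worker slot per pipeline that uses Background mode.
ALTER SYSTEM SET max_worker_processes = 16;
```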
#### Considerations
- Processing is asynchronous. Results become available after a short delay rather than within the modifying transaction.
- Operations on the source data aren't blocked or delayed by the pipeline.
- Each pipeline using this mode occupies one of Postgres's background worker slots.

### Disabled
Auto-processing is turned off. Processing has to be triggered manually, for example with `aidb.bulk_embedding()`.

For table sources, no change detection takes place in this mode, so a manual run has to process all source records.
For volume sources, change detection still applies, so a manual run only processes new and changed records.



## Observability
We provide detailed status and progress output for all auto-processing modes.

A good place to get an overview is the statistics table.
Look up the view `aidb.knowledge_base_stats` or use its short alias `aidb.kbstat`. The view shows all configured knowledge base pipelines,
which processing mode is set, and statistics about the processed records:
```sql
SELECT * from aidb.kbstat;
__OUTPUT__
knowledge base | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings)
------------------------+-----------------+-------------------------+-------------------------+-----------------------+-------------------
kb_table_text_bg | Background | 0 | | 15 | 15
kb_table_text_manual | Disabled | 0 | | 15 | 15
kb_table_image_manual | Disabled | 0 | | 3 | 3
kb_table_text_live | Live | 0 | | 15 | 15
kb_table_image_bg | Background | 0 | | 3 | 3
kb_volume_text_bg | Background | | 6 | 7 | 7
kb_volume_text_manual | Disabled | | 0 | 0 | 0
kb_volume_image_bg | Background | | 4 | 177 | 6
kb_volume_image_manual | Disabled | | 1 | 177 | 6
(9 rows)
```

The [change detection](#change-detection) mechanism is central to how auto-processing works. It is different for volume and table sources.
For this reason, the stats table has different columns for these two source types.

* `table: unprocessed rows`: How many unique rows are listed in the backlog of change events.
* If auto-processing is disabled, no (new) change events are captured.
* `volume: scans completed`: How many full listings of the source have been completed so far.
* `count(source records)`: How many records exist in the source for this pipeline.
* for table sources, this number is always accurate.
* for volume sources, we can only update this number after a full scan has completed.
* `count(embeddings)`: How many embeddings exist in the vector destination table for this pipeline.



## Configuration
Auto-processing can be configured at creation time:
- With [`aidb.create_table_knowledge_base`](../reference/knowledge_bases#aidbcreate_table_knowledge_base)
- With [`aidb.create_volume_knowledge_base`](../reference/knowledge_bases#aidbcreate_volume_knowledge_base)

Auto-processing can also be changed on existing pipelines:
- With [`aidb.set_auto_knowledge_base`](../reference/knowledge_bases#aidbset_auto_knowledge_base)
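
For example, switching an existing knowledge base between modes might look like this. The pipeline name is illustrative and the call is a sketch; the exact signature is documented in the reference linked above.

```sql
-- Sketch: switch an existing pipeline to background auto-processing ...
SELECT aidb.set_auto_knowledge_base('products_kb', 'Background');

-- ... and turn auto-processing off again.
SELECT aidb.set_auto_knowledge_base('products_kb', 'Disabled');
```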

## Batch processing
In Background and Disabled modes, (auto) processing happens in batches of configurable size. Within each batch, records are processed together, which reduces overhead and lets operations like AI model inference run concurrently.

## Change detection
AIDB auto-processing is designed around change detection mechanisms for table and volume data sources. This allows it to only
process data when necessary.

### Table sources
When background auto-processing is configured, Postgres triggers are set up on the source table to detect changes. These triggers are very lightweight.
They only record change events and insert them into a "change events" table. No actual processing happens in the trigger function.

The background worker will then process these events asynchronously.

!!! Note
When auto-processing is disabled, no change detection on tables is possible. This means the manual `bulk_embedding()` operation has to process
all source records.
!!!
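
In that case, a manual run might look like the following sketch. `kb_table_text_manual` is one of the pipelines from the statistics output above; see the knowledge base reference for the exact signature of `aidb.bulk_embedding()`.

```sql
-- Process all source records of a pipeline that has auto-processing disabled.
SELECT aidb.bulk_embedding('kb_table_text_manual');
```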


### Volume sources
This source type provides a `last_modified` timestamp for each source record. The system keeps track of those timestamps in a "state" table.
In each pipeline execution, the system lists the contents of the volume and compares it to the timestamps to see whether any records have changed or were added.

This mechanism works in both Disabled and Background auto-processing modes.

The system detects deleted objects after a full listing is complete. Only then can it be certain that a previously processed record is no longer present in the source.

Unfortunately, object stores (and other external storage locations supported by our volumes) have limited query capabilities. This means:
!!! Note
Change detection for volumes is based on polling, i.e., repeated listing. This can be an expensive operation when using cloud object stores like AWS S3.
You can use a long `background_sync_interval` (such as once per day) on pipelines with volume sources to control this cost.
!!!
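
As a sketch, and assuming `background_sync_interval` can be passed to the configuration functions listed under [Configuration](#configuration) (check the reference for where the interval is actually set), a daily sync for a volume pipeline could look like this:

```sql
-- Hypothetical parameter placement; the pipeline name and the '1 day' interval are illustrative.
SELECT aidb.set_auto_knowledge_base('kb_volume_text_bg', 'Background', background_sync_interval => '1 day');
```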
File renamed without changes.
1 change: 1 addition & 0 deletions advocacy_docs/edb-postgres-ai/ai-accelerator/index.mdx
@@ -20,6 +20,7 @@ navigation:
- preparers
- knowledge_base
- pgfs
- volumes
- "#Pipelines resources"
- reference
- rel_notes
@@ -28,7 +28,7 @@ create_table_knowledge_base(
topk INTEGER DEFAULT 1,
distance_operator aidb.distanceoperator DEFAULT 'L2',
options JSONB DEFAULT '{}'::JSONB,
processing_mode aidb.PipelineProcessingMode DEFAULT 'Disabled'
auto_processing aidb.pipelineautoprocessingmode DEFAULT 'Disabled'
)
```

@@ -96,7 +96,7 @@ aidb.create_volume_knowledge_base(
topk INTEGER DEFAULT 1,
distance_operator aidb.distanceoperator DEFAULT 'L2',
options JSONB DEFAULT '{}'::JSONB
processing_mode aidb.PipelineProcessingMode DEFAULT 'Disabled'
auto_processing aidb.pipelineautoprocessingmode DEFAULT 'Disabled'
)
```

@@ -19,4 +19,4 @@ Pipelines has a model registry that manages configured instances of models. Any

## Next steps

Once you are familiar with models, you can learn how to use those models with [knowledge bases](../knowledge_bases).
Once you are familiar with models, you can learn how to use those models with [knowledge bases](../knowledge_base).
@@ -4,7 +4,7 @@ navTitle: "Primitives"
description: "The model primitives available in EDB Postgres AI - AI Accelerator Pipelines."
---

For most use cases, we recommend that you use the aidb [knowledge bases](../knowledge_bases) to interact with models. They can manage creating embeddings and retrieving matching data for many applications.
For most use cases, we recommend that you use the aidb [knowledge bases](../knowledge_base) to interact with models. They can manage creating embeddings and retrieving matching data for many applications.

However, if you need to interact with models directly, you can use the following primitives. The encode functions generate embeddings for text and images, and the decode functions generate text from embeddings.

@@ -21,24 +21,39 @@ Reranking is a method in text search that sorts results by relevance to make the
* nvidia/llama-3.2-nv-rerankqa-1b-v2 (default)


## Example
The function accepts a string as the "rerank query" and an array of texts to rerank.
The `id` column in the output refers to the index of the text in the input array.

```sql
aidb=# SELECT * from aidb.rerank_text('my_nim_reranker', 'how can I open a door?', '{Ask for help, Push the handle, Lie down and wait, Shout at it}'::text[]) ORDER BY logit_score DESC;
text | logit_score | id
-------------------+--------------+----
Push the handle | -3.697265625 | 1
Ask for help | -6.2578125 | 0
Shout at it | -7.39453125 | 3
Lie down and wait | -11.375 | 2
(4 rows)
```


## Creating the default model

```sql
SELECT aidb.create_model(
'my_nim_reranker',
'my_nim_reranker',
'nim_reranking',
credentials=>'{"api_key": "<API_KEY_HERE>"'::JSONB
);
'{"url":"http://nim-nv-rerankqa-llama-l-1xgpu-g6-predictor.default.svc.cluster.local/v1/ranking", "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2"}'
);
```

There's only one model, the default `nvidia/nvclip`, so you don't need to specify the model in the configuration.
This example uses a locally deployed NIM model that does not require credentials. Credentials and other configuration can be provided as described in [using models](../using-models).

## Model configuration settings

The following configuration settings are available for NIM models:

* `model` &mdash; The NIM model to use. The default is `nvidia/llama-3.2-nv-rerankqa-1b-v2` and is the only model available.
* `model` &mdash; The NIM model to use. The default is `nvidia/llama-3.2-nv-rerankqa-1b-v2` and is the only supported model.
* `url` &mdash; The URL of the model to use. This setting is optional and can be used to specify a custom model URL. The default is `https://ai.api.nvidia.com/v1/retrieval`.

## Model credentials
@@ -1,50 +1,36 @@
---
title: "PGFS functions for Pipelines"
navTitle: "PGFS functions"
description: "How to use PGFS functions to access external storage in Pipelines."
description: "How to use PGFS functions to manage external storage in Pipelines."
---

## Using the PGFS functions

The PGFS extension provides a set of functions to create and manage storage locations.

### Creating a storage location
## Creating a storage location

Start with creating a storage location. A storage location is a reference to a location in an external file system. You can create a storage location with the `pgfs.create_storage_location` function:

```sql
select pgfs.create_storage_location('my_storage', 's3://my_bucket','','{}'::JSONB,'{}'::JSONB);
select pgfs.create_storage_location('storage_location_name', 'protocol://path', options => '{}'::JSONB, credentials => '{}'::JSONB);
```

The `create_strorage_location` function takes a name for the storage location and then a URL for the location. Prefix the URL with `s3:` for an S3-compatible bucket or `file:` for a local file system.


```sql
select pgfs.create_storage_location('my_file_storage', 'file:///tmp/my_path', NULL, '{}'::json, '{}'::json );
```
### Storage provider types
Detailed instructions for the supported storage providers can be found here:

When using the `file:` schema, provide an absolute path, one that starts with `/`, for example `/tmp/my_path`). Together with the schema indicator `file://`, there are then three slashes at the beginning of the path.

The function also takes an optional `msl_id` parameter, which isn't used. It also requires `options` and `credentials` parameters. If those are unused, you must pass them as empty JSON objects.
- [S3-compatible storage](s3)
- [Local file system](local)

### Creating a storage location with options and credentials

Using the `options` and `credentials` parameters allows a range of other settings to be passed.
Using the `options` and `credentials` parameters allows a range of settings to be passed.

The `options` parameter is a JSON object that can be used to pass additional options to the storage location.
The `credentials` parameter is a JSON object that can be used to pass credentials to the storage location.

The difference between `options` and `credentials` is that options remain visible to users querying the extension, while credentials are hidden from all users except superusers and the user that created the storage location.

For example, you can create a storage location with options and credentials like this:

```sql
select pgfs.create_storage_location('my_storage', 's3://my_private_bucket', null, '{"region": "eu-west"}'::JSONB, '{"access_key_id": "youraccesskeyid", "secret_access_key":"yoursecretaccesskey"}'::JSONB);
```

Once you've created a storage location, you can use it to create foreign tables and access files in the external file system. To use it with aidb, you need to create a volume from the storage location. To do that, see [Creating a volume](../knowledge_bases/usage#creating-a-new-volume).
### Testing storage locations and using them with AIDB
To use a storage location with aidb, you need to create a volume from the storage location. To do that, see [Creating a volume](../../volumes).

### Listing storage locations
## Listing storage locations

You can list all storage locations with the `pgfs.list_storage_locations` function:

@@ -54,7 +40,7 @@ select * from pgfs.list_storage_locations();

This command returns a table of currently defined storage locations. Credentials are shown only if the user has the necessary permissions. Otherwise the column is NULL.

### Getting a storage location
## Getting a storage location

You can get the details of a specific storage location with the `pgfs.get_storage_location` function:

@@ -64,15 +50,15 @@ select * from pgfs.get_storage_location('my_storage');

This command returns the details of the storage location named `my_storage`.

### Updating a storage location
## Updating a storage location

You can update a storage location with the `pgfs.update_storage_location` function:

```sql
select pgfs.update_storage_location('my_storage', 's3://my_bucket', null, '{"region": "eu-west"}');
```

### Deleting a storage location
## Deleting a storage location

You can delete a storage location with the `pgfs.delete_storage_location` function:
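
A minimal sketch, assuming the function takes the storage location name as its only argument:

```sql
select pgfs.delete_storage_location('my_storage');
```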

@@ -0,0 +1,26 @@
---
title: "Pipelines PGFS with local file storage"
navTitle: "Local file storage"
description: "How to use Pipelines PGFS with local file storage."
---


## Overview: local file systems
PGFS uses the `file:` prefix to indicate a local file system.

The general syntax for using local file systems is this:
```sql
select pgfs.create_storage_location(
'local_images',
'file:///var/lib/edb/pipelines/images'
);
```

!!! Note
Paths must always be absolute, i.e., they start at the root `/`. Together with the protocol prefix `file://`, this means your path has three slashes, as in the example above.
!!!

!!! Note
Any local path that you want to access must be allowlisted in the [PGFS settings](../settings).
!!!
