| 1 | +--- |
| 2 | +title: "Pipeline Auto-Processing" |
| 3 | +navTitle: "Auto-Processing" |
| 4 | +description: "Pipeline Auto-Processing" |
| 5 | +--- |
| 6 | + |
| 7 | +## Overview |
| 8 | +Pipeline Auto-Processing keeps source data and pipeline output in sync. Without it, users would have to |
| 9 | +trigger processing manually or provide external scripts, schedulers, or triggers. It offers the following capabilities: |
| 10 | +- **Full sync:** Insert/delete/update are all handled automatically. No lost updates, missing data or stale records. |
| 11 | +- **Change detection:** Only new or changed records are processed. No unnecessary re-processing of known records. |
| 12 | +- **Batch processing:** Records are grouped into batches that are processed concurrently. This reduces overhead and improves performance, for example with GPU-based AI model inference tasks. |
| 13 | +- **Background processing:** When enabled, the pipeline runs in a background worker process so that it doesn't block or delay other DB operations. Ideal for processing huge datasets. |
| 14 | +- **Live processing for Postgres tables:** When the data source is a Postgres table, live trigger-based auto-processing can be enabled so that pipeline results are guaranteed to be up to date. |
| 15 | +- **Quick turnaround:** Once a batch has finished processing, the results are immediately available. No full listing of the source data is needed to start processing. This is important for large external volumes where a full listing can take a long time. |
| 16 | + |
| 17 | +### Example for Knowledge Base Pipeline |
| 18 | +A knowledge base is created for a Postgres table containing products with product descriptions. |
| 19 | +The user configures background auto-processing to always keep embeddings in sync without blocking or delaying any operations on the products table. |
| 20 | + |
| 21 | +The pipeline processes any pre-existing product records in the background. The user can query the statistics table to see the progress. |
| 22 | + |
| 23 | +The background process runs whenever data is inserted, modified, or deleted. |
| 24 | + |
| 25 | +Queries on the Knowledge Base (i.e. retrieval operations) return results that reflect the source data, subject to a small background processing delay. |
| 26 | + |
| 27 | + |
| 28 | +### Supported pipelines |
| 29 | +The Knowledge Base pipeline supports all capabilities listed here. |
| 30 | + |
| 31 | +The Preparer pipeline does not yet support batch processing or background auto-processing. |
| 32 | + |
| 33 | +## Auto-Processing modes |
| 34 | +We offer the following Auto-Processing modes to suit different requirements and use-cases. |
| 35 | + |
| 36 | +!!! Note |
| 37 | +Live auto-processing is available only for table sources, not for volume sources. |
| 38 | +!!! |
| 39 | + |
| 40 | +### Live |
| 41 | +We set up Postgres triggers on the source table to immediately process any changes. Processing happens within the trigger function. |
| 42 | +This means it happens within the same transaction that modifies the data, guaranteeing up-to-date results. |
| 43 | + |
| 44 | +#### Considerations |
| 45 | +- Transactional guarantee / immediate results. Pipeline results are always up to date with the source data. |
| 46 | +- Blocks / delays operations on the source data. Transactions that modify the source data are delayed until processing is complete. |
| 47 | + |
| 48 | +### Background |
| 49 | +We start a Postgres background worker for each pipeline TODO |
| 50 | + |
| 51 | + |
| 52 | +!!! Note |
| 53 | +Make sure Postgres allows running enough background workers for the number of pipelines where you wish to use this processing mode. |
| 54 | +!!! |
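
As a quick check for the note above, the standard Postgres setting `max_worker_processes` caps how many background worker processes the server can run in total. The sketch below reads the current value and raises it. The value `16` is only an illustrative assumption; how many workers are needed beyond one per Background pipeline depends on what else runs on the server.

```sql
-- Show the server-wide limit on background worker processes.
SHOW max_worker_processes;

-- Raise the limit (16 is an arbitrary example value, not a recommendation).
-- This setting only takes effect after a server restart.
ALTER SYSTEM SET max_worker_processes = 16;
```
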
| 55 | +#### Considerations |
| 56 | +TODO |
| 57 | + |
| 58 | +### Disabled |
| 59 | +TODO |
| 60 | + |
| 61 | + |
| 62 | + |
| 63 | +## Observability |
| 64 | +We provide detailed status and progress output for all auto-processing modes. |
| 65 | + |
| 66 | +A good place to get an overview is the statistics table. |
| 67 | +Query the view `aidb.knowledge_base_stats` or use its short alias `aidb.kbstat`. It shows all configured knowledge base pipelines, |
| 68 | +which processing mode is set, and statistics about the processed records: |
| 69 | +```sql |
| 70 | +SELECT * FROM aidb.kbstat; |
| 71 | +__OUTPUT__ |
| 72 | + knowledge base | auto processing | table: unprocessed rows | volume: scans completed | count(source records) | count(embeddings) |
| 73 | +------------------------+-----------------+-------------------------+-------------------------+-----------------------+------------------- |
| 74 | + kb_table_text_bg | Background | 0 | | 15 | 15 |
| 75 | + kb_table_text_manual | Disabled | 0 | | 15 | 15 |
| 76 | + kb_table_image_manual | Disabled | 0 | | 3 | 3 |
| 77 | + kb_table_text_live | Live | 0 | | 15 | 15 |
| 78 | + kb_table_image_bg | Background | 0 | | 3 | 3 |
| 79 | + kb_volume_text_bg | Background | | 6 | 7 | 7 |
| 80 | + kb_volume_text_manual | Disabled | | 0 | 0 | 0 |
| 81 | + kb_volume_image_bg | Background | | 4 | 177 | 6 |
| 82 | + kb_volume_image_manual | Disabled | | 1 | 177 | 6 |
| 83 | +(9 rows) |
| 84 | +``` |
| 85 | + |
| 86 | +The [change detection](#change-detection) mechanism is central to how auto-processing works. It is different for volume and table sources. |
| 87 | +For this reason, the stats table has different columns for these two source types. |
| 88 | + |
| 89 | +* `table: unprocessed rows`: How many unique rows are listed in the backlog of change events. |
| 90 | + * If auto-processing is disabled, no (new) change events are captured. |
| 91 | +* `volume: scans completed`: How many full listings of the source have been completed so far. |
| 92 | +* `count(source records)`: How many records exist in the source for this pipeline. |
| 93 | + * For table sources, this number is always accurate. |
| 94 | + * For volume sources, we can only update this number after a full scan has completed. |
| 95 | +* `count(embeddings)`: How many embeddings exist in the vector destination table for this pipeline. |
| 96 | + |
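
The columns described above can also be queried selectively. The following is a minimal sketch that lists only table-source pipelines with a pending backlog. It assumes the view's column names are exactly the headers shown in the sample output, which is why they are double-quoted.

```sql
-- Column names contain spaces and punctuation, so they must be double-quoted.
-- Assumption: the names match the headers in the sample output exactly.
SELECT "knowledge base",
       "auto processing",
       "table: unprocessed rows"
FROM aidb.kbstat
WHERE "table: unprocessed rows" > 0;
```

Volume-source pipelines are left out of this result because their `table: unprocessed rows` column is empty.
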
| 97 | + |
| 98 | + |
| 99 | +## Configuration |
| 100 | +Auto-processing can be configured at creation time: |
| 101 | +- With [`aidb.create_table_knowledge_base`](../reference/knowledge_bases#aidbcreate_table_knowledge_base) |
| 102 | +- With [`aidb.create_volume_knowledge_base`](../reference/knowledge_bases#aidbcreate_volume_knowledge_base) |
| 103 | + |
| 104 | +It can also be changed for existing pipelines: |
| 105 | +- With [`aidb.set_auto_knowledge_base`](../reference/knowledge_bases#aidbset_auto_knowledge_base) |
| 106 | + |
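
As an illustration of changing the mode on an existing pipeline, the calls below assume that `aidb.set_auto_knowledge_base` takes the knowledge base name followed by the mode, and that `background_sync_interval` can be passed as an optional named parameter. Both are assumptions; check the linked reference for the exact signature. The knowledge base names are taken from the sample statistics output.

```sql
-- Switch an existing knowledge base to Background mode.
-- Assumed argument order: (knowledge base name, mode).
SELECT aidb.set_auto_knowledge_base('kb_table_text_manual', 'Background');

-- Assumed optional named parameter; verify its name and type in the reference.
SELECT aidb.set_auto_knowledge_base(
    'kb_volume_text_manual',
    'Background',
    background_sync_interval => '1 day'
);
```
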
| 107 | +## Batch processing |
| 108 | +In Background and Disabled modes, (auto) processing happens in batches of configurable size. Within each batch, |
| 109 | + |
| 110 | +## Change detection |
| 111 | +AIDB auto-processing is designed around change detection mechanisms for table and volume data sources. This allows it to only |
| 112 | +process data when necessary. |
| 113 | + |
| 114 | +### Table sources |
| 115 | +When background auto-processing is configured, Postgres triggers are set up on the source table to detect changes. These triggers are very lightweight. |
| 116 | +They only record change events and insert them into a "change events" table. No actual processing happens in the trigger function. |
| 117 | + |
| 118 | +The background worker will then process these events asynchronously. |
| 119 | + |
| 120 | +!!! Notice |
| 121 | +When auto-processing is disabled, no change detection on tables is possible. This means the manual `bulk_embedding()` operation has to process |
| 122 | +all source records. |
| 123 | +!!! |
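
To make the idea of a lightweight change-capture trigger concrete, here is a generic PL/pgSQL sketch. It is not aidb's internal implementation: the table, function, and trigger names are all hypothetical, and the real "change events" table may look quite different.

```sql
-- Hypothetical source table and change-events table (not aidb's schema).
CREATE TABLE products (id bigint PRIMARY KEY, description text);

CREATE TABLE product_change_events (
    event_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_id  bigint NOT NULL,
    operation  text NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION record_product_change() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Only record the event; no embedding work happens in the trigger.
    IF TG_OP = 'DELETE' THEN
        INSERT INTO product_change_events (source_id, operation)
        VALUES (OLD.id, TG_OP);
    ELSE
        INSERT INTO product_change_events (source_id, operation)
        VALUES (NEW.id, TG_OP);
    END IF;
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$;

CREATE TRIGGER products_record_change
AFTER INSERT OR UPDATE OR DELETE ON products
FOR EACH ROW EXECUTE FUNCTION record_product_change();

-- Backlog size: unique source rows with pending change events.
SELECT count(DISTINCT source_id) FROM product_change_events;
```

Because the trigger only performs a single small insert, the modifying transaction is barely delayed; the expensive processing is left to a worker that reads the events table later.
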
| 124 | + |
| 125 | + |
| 126 | +### Volume sources |
| 127 | +This source type provides a `last_modified` timestamp for each source record. The system keeps track of those timestamps in a "state" table. |
| 128 | +In each pipeline execution, the system lists the contents of the volume and compares each record's `last_modified` value against the stored timestamps to see whether any records have changed or been added. |
| 129 | + |
| 130 | +This mechanism works in both Disabled and Background auto-processing modes. |
| 131 | + |
| 132 | +The system detects deleted objects after a full listing is complete. Only then can it be certain that a previously processed record is no longer present in the source. |
| 133 | + |
| 134 | +Unfortunately, object stores (and other external storage locations supported by our volumes) have limited query capabilities. This means: |
| 135 | +!!! Note |
| 136 | +Change detection for volumes is based on polling, that is, repeated listing. This can be an expensive operation when using cloud object stores like AWS S3. |
| 137 | +You can use a long `background_sync_interval` (such as once per day) on pipelines with volume sources to control this cost. |
| 138 | +!!! |
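
The timestamp comparison described in this section can be sketched in plain SQL. This is an illustration under assumptions, not aidb's actual schema: `volume_listing` stands for the result of the latest full listing and `pipeline_state` for the stored "state" table, and both names are hypothetical.

```sql
-- Hypothetical tables; not aidb's internal schema.
CREATE TABLE volume_listing (   -- result of the latest full listing
    object_key    text PRIMARY KEY,
    last_modified timestamptz NOT NULL
);
CREATE TABLE pipeline_state (   -- timestamps recorded when last processed
    object_key    text PRIMARY KEY,
    last_modified timestamptz NOT NULL
);

-- Classify every object that needs attention.
SELECT COALESCE(l.object_key, s.object_key) AS object_key,
       CASE
           WHEN s.object_key IS NULL THEN 'new'
           WHEN l.object_key IS NULL THEN 'deleted'  -- only valid after a full listing
           ELSE 'changed'
       END AS change
FROM volume_listing l
FULL OUTER JOIN pipeline_state s ON l.object_key = s.object_key
WHERE s.object_key IS NULL
   OR l.object_key IS NULL
   OR l.last_modified <> s.last_modified;
```

The `deleted` branch shows why deletions can only be detected once a listing is complete: an object missing from a partial listing may simply not have been reached yet.
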