Commit 6d3e3ec

Merge pull request #6359 from EnterpriseDB/release/2024-12-18a
Release: 2024-12-18a
2 parents cb3f515 + 5a2fcb0 commit 6d3e3ec

17 files changed: +488 -351 lines
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
---
title: Querying Delta Lake Tables in S3-compatible object storage
navTitle: External Tables
description: Access and query data stored as Delta Lake Tables in S3-compatible object storage using external tables
deepToC: true
---

## Overview

External tables allow you to access and query data stored in S3-compatible object storage using SQL. You can create an external table that references data in S3-compatible object storage and query the data using standard SQL commands.

## Prerequisites

* An EDB Postgres AI account and a Lakehouse node.
* An S3-compatible object storage location with data stored as Delta Lake Tables.
  * See [Bringing your own data](reference/loadingdata) for more information on how to prepare your data.
* Credentials to access the S3-compatible object storage location, unless it is a public bucket.
  * These credentials will be stored within the database. We recommend creating a separate user with limited permissions for this purpose.

!!! Note Regions, latency and cost
Using an S3 bucket that isn't in the same region as your node will:

* be slow because of cross-region latencies
* incur AWS costs (between $0.01 and $0.02/GB) for data transfer. Currently these egress costs are not passed through to you, but we do track them and reserve the right to terminate an instance.
!!!

## Creating an External Storage Location

The first step is to create an external storage location that references the S3-compatible object storage where your data resides. A storage location is a named object within the database that you refer to when accessing the data.

You create a named storage location with SQL by executing the `pgaa.create_storage_location` function.
`pgaa` is the name of the extension and namespace that provides the functionality to query external storage locations.
The `create_storage_location` function takes a name for the new storage location and the URI of the S3-compatible object storage location as parameters.
The function can optionally take a third parameter, `options`, which is a JSON object for specifying optional settings, detailed in the [functions reference](reference/functions#pgaacreate_storage_location).
For example, in the options, you can specify the access key ID and secret access key for the storage location to enable access to a private bucket.

The following example creates an external storage location that references a public S3-compatible bucket:

```sql
SELECT pgaa.create_storage_location('sample-data', 's3://pgaa-sample-data-eu-west-1');
```

The next example creates an external storage location that references a private bucket, passing credentials in the `options` parameter:

```sql
SELECT pgaa.create_storage_location('private-data', 's3://my-private-bucket', '{"access_key_id": "my-access-key-id","secret_access_key": "my-secret-access-key"}');
```

## Creating an External Table

After creating the external storage location, you can create an external table that references the data in the storage location.
The following example creates an external table that references a Delta Lake Table in the S3-compatible object storage location:

```sql
CREATE TABLE public.customer () USING PGAA WITH (pgaa.storage_location = 'sample-data', pgaa.path = 'tpch_sf_1/customer');
```

Note that no columns are defined in the `CREATE TABLE` statement. The pgaa extension derives the schema from the Delta Lake Table stored at the path specified in the `pgaa.path` option, and it infers the best Postgres-equivalent data types for the columns in the Delta Table.
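
If you want to confirm what was inferred, a standard catalog query works; the following is a minimal sketch using `information_schema` (the column names and types you see depend entirely on the Delta Lake Table at the referenced path):

```sql
-- Inspect the column names and Postgres data types inferred for the external table
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'customer'
ORDER BY ordinal_position;
```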

## Querying an External Table

After creating the external table, you can query the data in the external table using standard SQL commands. The following example queries the external table created in the previous step:

```sql
SELECT COUNT(*) FROM public.customer;
```
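
More involved analytical queries work the same way. As an illustrative sketch only, assuming the table follows the standard TPC-H `customer` layout (columns such as `c_mktsegment` and `c_acctbal`, which are not shown in the example above), an aggregation might look like this:

```sql
-- Hypothetical example: average account balance per market segment,
-- assuming standard TPC-H customer columns (c_mktsegment, c_acctbal)
SELECT c_mktsegment,
       COUNT(*)       AS customers,
       AVG(c_acctbal) AS avg_balance
FROM public.customer
GROUP BY c_mktsegment
ORDER BY avg_balance DESC;
```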

advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx

Lines changed: 15 additions & 32 deletions
@@ -81,50 +81,33 @@ Persistent data in system tables (users, roles, etc) is stored in an attached
block storage device and will survive a restart or backup/restore cycle.
* Only Postgres 16 is supported.

-For more notes about supported instance sizes,
-see [Reference - Supported AWS instances](./reference/#supported-aws-instances).
+For more notes about supported instance sizes, see [Reference - Supported AWS instances](./reference/instances).

## Operating a Lakehouse node

### Connect to the node

-You can connect to the Lakehouse node with any Postgres client, in the same way
-that you connect to any other cluster from EDB Postgres AI Cloud Service
-(formerly known as BigAnimal): navigate to the cluster detail page and copy its
-connection string.
+You can connect to the Lakehouse node with any Postgres client, in the same way that you connect to any other cluster from EDB Postgres AI Cloud Service (formerly known as BigAnimal): navigate to the cluster detail page and copy its connection string.

-For example, you might copy the `.pgpass` blob into `~/.pgpass` (making sure to
-replace `$YOUR_PASSWORD` with the password you provided when launching the
-cluster). Then you can copy the connection string and use it as an argument to
-`psql` or `pgcli`.
+For example, you might copy the `.pgpass` blob into `~/.pgpass` (making sure to replace `$YOUR_PASSWORD` with the password you provided when launching the cluster).
+Then you can copy the connection string and use it as an argument to `psql` or `pgcli`.

-In general, you should be able to connect to the database with any Postgres
-client. We expect all introspection queries to work, and if you find one that
-doesn't, then that's a bug.
+In general, you should be able to connect to the database with any Postgres client.
+We expect all introspection queries to work, and if you find one that doesn't, then that's a bug.
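
As a concrete illustration, a basic introspection query of this kind might look like the following sketch, using only the standard Postgres catalogs:

```sql
-- List user-visible tables and their schemas from the standard catalog view
SELECT schemaname, tablename
FROM pg_catalog.pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY schemaname, tablename;
```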

### Understand the constraints

-* Every cluster uses EPAS or PGE. So expect to see boilerplate tables from those
-flavors in the installation when you connect.
-* Queryable data (like the benchmarking datasets) is stored in object storage
-as Delta Tables. Every cluster comes pre-loaded to point to a storage bucket
-with benchmarking data inside (TPC-H, TPC-DS, Clickbench) at
-scale factors 1 and 10.
+* Every cluster uses EPAS or PGE. So expect to see boilerplate tables from those flavors in the installation when you connect.
+* Queryable data (like the benchmarking datasets) is stored in object storage as Delta Tables. Every cluster comes pre-loaded to point to a storage bucket with benchmarking data inside (TPC-H, TPC-DS, Clickbench) at scale factors from 1 to 1000.
* Only AWS is supported at the moment. Bring Your Own Account (BYOA) is not supported.
-* You can deploy a cluster in any region that is activated in
-your EDB Postgres AI Account. Each region has a bucket with a copy of the
-benchmarking data, and so when you launch a cluster, it will use the
-benchmarking data in the location closest to it.
-* The cluster is ephemeral. None of the data is stored on the hard drive,
-except for data in system tables, e.g. roles and users and grants.
-If you restart the cluster, or backup the cluster and then restore it,
-it will restore these system tables. But the data in object storage will
+* You can deploy a cluster in any region that is activated in your EDB Postgres AI Account. Each region has a bucket with a copy of the
+benchmarking data, and so when you launch a cluster, it will use the benchmarking data in the location closest to it.
+* The cluster is ephemeral. None of the data is stored on the hard drive, except for data in system tables, e.g. roles and users and grants.
+If you restart the cluster, or backup the cluster and then restore it, it will restore these system tables. But the data in object storage will
remain untouched.
-* The cluster supports READ ONLY queries of the data in object
-storage (but it supports write queries to system tables for creating users,
+* The cluster supports READ ONLY queries of the data in object storage (but it supports write queries to system tables for creating users,
etc.). You cannot write directly to object storage. You cannot create new tables.
-* If you want to load your own data into object storage,
-see [Reference - Bring your own data](./reference/#advanced-bring-your-own-data).
+* If you want to load your own data into object storage, see [Reference - Bring your own data](reference/loadingdata).
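
To illustrate the point above about write queries to system tables: creating a role is ordinary Postgres DDL, so a sketch like the following should work (the role name and grant are hypothetical):

```sql
-- Hypothetical example: writes to system tables (such as creating a role)
-- are allowed even though the object-storage data itself is read-only
CREATE ROLE analyst LOGIN PASSWORD 'change-me';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analyst;
```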

## Inspect the benchmark datasets

@@ -140,7 +123,7 @@ The available benchmarking datasets are:
* 1 Billion Row Challenge

For more details on benchmark datasets,
-see Reference - Available benchmarking datasets](./reference/#available-benchmarking-datasets).
+see [Reference - Available benchmarking datasets](./reference/datasets).

## Query the benchmark datasets