
Commit 22a31f4: Amazon Redshift Serverless RSQL ETL Framework

1 parent 28ad988


41 files changed, +4809 -6 lines

.gitignore

+9
```
*.js
!jest.config.js
*.d.ts
node_modules

# CDK asset staging directory
.cdk.staging
cdk.out
cdk.context.json
```

GraphView.png

125 KB

README.md

+59-6
*Replaces the default "My Project" README template.*

# Amazon Redshift Serverless RSQL ETL Framework

The goal of the Amazon Redshift Serverless RSQL ETL Framework project is to run complex ETL jobs implemented in Amazon Redshift RSQL scripts in the AWS Cloud without having to manage any infrastructure. The solution creates a fully serverless and cost-effective Amazon Redshift ETL orchestration framework. It uses Amazon Redshift RSQL and AWS services such as AWS Batch and AWS Step Functions.

The deployment is fully automated using the AWS Cloud Development Kit (AWS CDK) and comprises the following stacks:

1. `EcrRepositoryStack` - Creates a private Amazon Elastic Container Registry (Amazon ECR) repository that hosts our Docker image with Amazon Redshift RSQL
2. `RsqlDockerImageStack` - Builds our Docker image asset and uploads it to the ECR repository
3. `VpcStack` - Creates a VPC with isolated subnets, creates an Amazon Simple Storage Service (Amazon S3) VPC endpoint gateway, as well as Amazon ECR, Amazon Redshift, and Amazon CloudWatch VPC endpoint interfaces
4. `RedshiftStack` - Creates an Amazon Redshift cluster, enables encryption, enforces encryption in-transit, enables auditing, and deploys the Amazon Redshift cluster in isolated subnets
5. `BatchStack` - Creates a compute environment (using AWS Fargate), job queue, and job definition (using our Docker image with RSQL)
6. `S3Stack` - Creates data, scripts, and logging buckets; enables encryption at-rest; enforces secure transfer; enables object versioning; and disables public access
7. `SnsStack` - Creates an Amazon Simple Notification Service (Amazon SNS) topic and email subscription (email is passed as a parameter)
8. `StepFunctionsStack` - Creates a state machine to orchestrate serverless RSQL ETL jobs
9. `SampleDataDeploymentStack` - Deploys sample RSQL ETL scripts and sample TPC benchmark datasets

The following diagram shows the final architecture.

![Serverless RSQL ETL Framework Architecture](ServerlessRSQLETLFramework.png)

## Deploy AWS CDK stacks

To deploy the serverless RSQL ETL framework solution, use the following code. Replace `123456789012` with your AWS account number, `eu-west-1` with the AWS Region to which you want to deploy the solution, and `[email protected]` with the email address to which ETL success and failure notifications are sent.

```
git clone https://github.com/aws-samples/amazon-redshift-serverless-rsql-etl-framework
cd amazon-redshift-serverless-rsql-etl-framework
npm install
./cdk.sh 123456789012 eu-west-1 bootstrap
./cdk.sh 123456789012 eu-west-1 deploy --all --parameters SnsStack:[email protected]
```
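
The `cdk.sh` wrapper script (added later in this commit) exports the account and Region as `CDK_DEPLOY_ACCOUNT` and `CDK_DEPLOY_REGION` and passes the remaining arguments through to the CDK CLI.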

The whole process takes a few minutes.

## Execute Step Functions state machine

After AWS CDK finishes, a new state machine called `ServerlessRSQLETLFramework` is created in your account. To run it, complete the following steps:

1. Navigate to the Step Functions console.
2. Choose the function to open the details page.
3. Choose **Edit**, and then choose **Workflow Studio New**. The following screenshot shows our state machine.
   ![Sample State Machine](StateMachine.png)
4. Choose **Cancel** to leave Workflow Studio, then choose **Cancel** again to leave the edit mode. You will be brought back to the details page.
5. Choose **Start execution**. A dialog box appears. By default, the **Name** parameter is set to a random identifier, and the **Input** parameter is set to a sample JSON document.
6. Delete the **Input** parameter and choose **Start execution** to start the state machine.

The Graph view on the details page updates in real time. The state machine starts with a parallel state that has two branches. In the left branch, the first job loads customer data into a staging table, and the second job merges new and existing customer records. In the right branch, two smaller tables for regions and nations are loaded and then merged one after another. The parallel state waits until all branches are complete before moving to the vacuum-analyze state, which runs the VACUUM and ANALYZE commands on Amazon Redshift. The sample state machine also uses the Amazon SNS Publish API action to send notifications about success or failure, as sketched below.
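
The following is a minimal AWS CDK sketch of this shape, using `Pass` states as stand-ins for the AWS Batch job and SNS publish states. The construct and state names here are illustrative assumptions, not the ones used in `StepFunctionsStack`.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'SketchStack');

// Pass states stand in for the real AWS Batch job states and the SNS publish state.
const loadCustomers = new sfn.Pass(stack, 'LoadCustomers');
const mergeCustomers = new sfn.Pass(stack, 'MergeCustomers');
const loadRegions = new sfn.Pass(stack, 'LoadRegions');
const mergeRegions = new sfn.Pass(stack, 'MergeRegions');
const loadNations = new sfn.Pass(stack, 'LoadNations');
const mergeNations = new sfn.Pass(stack, 'MergeNations');
const vacuumAnalyze = new sfn.Pass(stack, 'VacuumAnalyze');
const notifySuccess = new sfn.Pass(stack, 'NotifySuccess');

// The parallel state waits for both branches before vacuum-analyze runs.
const loadAndMerge = new sfn.Parallel(stack, 'LoadAndMerge')
  .branch(loadCustomers.next(mergeCustomers))
  .branch(loadRegions.next(mergeRegions).next(loadNations).next(mergeNations));

new sfn.StateMachine(stack, 'ServerlessRSQLETLFramework', {
  definition: loadAndMerge.next(vacuumAnalyze).next(notifySuccess),
});
```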

From the Graph view, you can check the status of each state by choosing it. Every state that uses an external resource has a link to it on the Details tab. In our example, next to every AWS Batch job state, you can see a link to the AWS Batch job details page. Here, you can view the status, runtime, parameters, IAM roles, a link to the Amazon CloudWatch logs produced by the ETL scripts, and more.

![Graph View](GraphView.png)
### Execute Step Functions state machine using AWS CLI

To start the state machine using the AWS CLI, use the following code:

```
# fetch the state machine ARN from the CloudFormation stack
STATE_MACHINE_ARN=$(aws cloudformation describe-stacks --stack-name StepFunctionsStack --query "Stacks[0].Outputs[?OutputKey=='StateMachineArn'].OutputValue" --output text)
# start the state machine
aws stepfunctions start-execution --state-machine-arn $STATE_MACHINE_ARN
```
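
If you prefer to start executions from code, the following is a minimal sketch using the AWS SDK for JavaScript v3. It assumes the `@aws-sdk/client-sfn` package is installed and that the state machine ARN has been fetched as shown above.

```typescript
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';

async function startEtl(stateMachineArn: string): Promise<void> {
  const client = new SFNClient({});
  // Start the state machine; the response contains the new execution's ARN.
  const response = await client.send(
    new StartExecutionCommand({ stateMachineArn }),
  );
  console.log(`Started execution: ${response.executionArn}`);
}

startEtl(process.env.STATE_MACHINE_ARN!).catch(console.error);
```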

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This project is licensed under the Apache-2.0 License.

ServerlessRSQLETLFramework.png

34.1 KB

StateMachine.png

90.8 KB

bin/serverless_rsql_etl_framework.ts

+69
```typescript
#!/usr/bin/env node
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import { EcrRepositoryStack } from '../lib/ecr-stack';
import { RsqlDockerImageStack } from '../lib/rsql-docker-image-stack';
import { S3Stack } from '../lib/s3-stack';
import { VpcStack } from '../lib/vpc-stack';
import { RedshiftStack } from '../lib/redshift-stack';
import { BatchStack } from '../lib/batch-stack';
import { StepFunctionsStack } from '../lib/stepfunctions-stack';
import { SnsStack } from '../lib/sns-stack';
import { SampleDataDeploymentStack } from '../lib/sample-data-deployment-stack';
import { Tags } from 'aws-cdk-lib';

const env = {
  account: process.env.CDK_DEPLOY_ACCOUNT || process.env.CDK_DEFAULT_ACCOUNT,
  region: process.env.CDK_DEPLOY_REGION || process.env.CDK_DEFAULT_REGION
};

const app = new cdk.App();

Tags.of(app).add('purpose', 'aws-blog-demo-serverless-rsql-etl-framework');

const vpcStack = new VpcStack(app, 'VpcStack', {
  env: env,
});
const s3Stack = new S3Stack(app, 'S3Stack', {
  env: env,
});
const redshiftStack = new RedshiftStack(app, 'RedshiftStack', {
  env: env,
  vpc: vpcStack.vpc,
  loggingBucket: s3Stack.loggingBucket,
  scriptsBucket: s3Stack.scriptsBucket,
  dataBucket: s3Stack.dataBucket,
});
const ecrRepositoryStack = new EcrRepositoryStack(app, 'EcrRepositoryStack', {
  env: env,
});
const rsqlDockerImageStack = new RsqlDockerImageStack(app, 'RsqlDockerImageStack', {
  env: env,
  repository: ecrRepositoryStack.repository,
  redshift: redshiftStack.redshift,
});
const batchStack = new BatchStack(app, 'BatchStack', {
  env: env,
  redshift: redshiftStack.redshift,
  ecrRepository: ecrRepositoryStack.repository,
  vpc: vpcStack.vpc,
  scriptsBucket: s3Stack.scriptsBucket,
});
const snsStack = new SnsStack(app, 'SnsStack', {
  env: env,
});
const stepFunctionsStack = new StepFunctionsStack(app, 'StepFunctionsStack', {
  env: env,
  jobDefinition: batchStack.jobDefinition,
  jobQueue: batchStack.jobQueue,
  scriptsBucket: s3Stack.scriptsBucket,
  dataBucket: s3Stack.dataBucket,
  redshiftRole: redshiftStack.redshiftRole,
  snsTopic: snsStack.topic,
  snsKey: snsStack.key,
});
const sampleDataDeploymentStack = new SampleDataDeploymentStack(app, 'SampleDataDeploymentStack', {
  env: env,
  scriptsBucket: s3Stack.scriptsBucket,
  dataBucket: s3Stack.dataBucket,
});
```
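
In the `env` object above, `CDK_DEPLOY_ACCOUNT` and `CDK_DEPLOY_REGION` (exported by `cdk.sh`) take precedence over the `CDK_DEFAULT_ACCOUNT` and `CDK_DEFAULT_REGION` values that the CDK CLI resolves from the current credentials and configuration.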

cdk.json

+34
```json
{
  "app": "npx ts-node --prefer-ts-exts bin/serverless_rsql_etl_framework.ts",
  "watch": {
    "include": [
      "**"
    ],
    "exclude": [
      "README.md",
      "cdk*.json",
      "**/*.d.ts",
      "**/*.js",
      "tsconfig.json",
      "package*.json",
      "yarn.lock",
      "node_modules",
      "test"
    ]
  },
  "context": {
    "@aws-cdk/aws-apigateway:usagePlanKeyOrderInsensitiveId": true,
    "@aws-cdk/core:stackRelativeExports": true,
    "@aws-cdk/aws-rds:lowercaseDbIdentifier": true,
    "@aws-cdk/aws-lambda:recognizeVersionProps": true,
    "@aws-cdk/aws-cloudfront:defaultSecurityPolicyTLSv1.2_2021": true,
    "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true,
    "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true,
    "@aws-cdk/core:checkSecretUsage": true,
    "@aws-cdk/aws-iam:minimizePolicies": true,
    "@aws-cdk/core:target-partitions": [
      "aws",
      "aws-cn"
    ]
  }
}
```

cdk.sh

+12
```bash
#!/usr/bin/env bash
if [[ $# -ge 2 ]]; then
    export CDK_DEPLOY_ACCOUNT=$1
    export CDK_DEPLOY_REGION=$2
    shift; shift

    # pass the remaining args through to the CDK CLI
    # (invocation restored from the usage message below)
    npx cdk "$@"
    exit $?
else
    echo 1>&2 "Provide account and region as first two args."
    echo 1>&2 "Additional args are passed through to cdk deploy."
    exit 1
fi
```

jest.config.js

+8
```js
module.exports = {
  testEnvironment: 'node',
  roots: ['<rootDir>/test'],
  testMatch: ['**/*.test.ts'],
  transform: {
    '^.+\\.tsx?$': 'ts-jest'
  }
};
```

lib/amazonlinux-rsql/.odbc.ini

+10
```ini
[ODBC]
Trace=no

[etl]
Driver=/opt/amazon/redshiftodbc/lib/64/libamazonredshiftodbc64.so
Database=demo
DbUser=etl
ClusterID=redshiftblogdemo
Region=eu-west-1
IAM=1
```
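
The `[etl]` section defines the ODBC DSN that `fetch_and_run.sh` connects to with `rsql -D etl`; `IAM=1` makes the driver authenticate with temporary IAM-based credentials rather than a stored password.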

lib/amazonlinux-rsql/Dockerfile

+20
```dockerfile
FROM amazonlinux:2

ENV AMAZON_REDSHIFT_ODBC_VERSION=1.4.52.1000
ENV AMAZON_REDSHIFT_RSQL_VERSION=1.0.4

RUN yum install -y openssl unixODBC gettext awscli && \
    yum clean all

RUN rpm -i \
    https://s3.amazonaws.com/redshift-downloads/drivers/odbc/${AMAZON_REDSHIFT_ODBC_VERSION}/AmazonRedshiftODBC-64-bit-${AMAZON_REDSHIFT_ODBC_VERSION}-1.x86_64.rpm \
    https://s3.amazonaws.com/redshift-downloads/amazon-redshift-rsql/${AMAZON_REDSHIFT_RSQL_VERSION}/AmazonRedshiftRsql-${AMAZON_REDSHIFT_RSQL_VERSION}-1.x86_64.rpm

COPY .odbc.ini .odbc.ini
COPY fetch_and_run.sh /usr/local/bin/fetch_and_run.sh

ENV ODBCINI=.odbc.ini
ENV ODBCSYSINI=/opt/amazon/redshiftodbc/Setup
ENV AMAZONREDSHIFTODBCINI=/opt/amazon/redshiftodbc/lib/64/amazon.redshiftodbc.ini

ENTRYPOINT ["/usr/local/bin/fetch_and_run.sh"]
```
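
AWS Batch runs this image with `fetch_and_run.sh` as the entry point; the `ODBCINI` variables point `rsql` at the DSN defined in `.odbc.ini` above.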

lib/amazonlinux-rsql/fetch_and_run.sh

+23
```bash
#!/bin/bash

# This script expects the following env variables to be set:
# BATCH_SCRIPT_LOCATION - full S3 path to the RSQL script to run
# DATA_BUCKET_NAME - S3 bucket name with the data
# COPY_IAM_ROLE_ARN - IAM role ARN that will be used to copy the data from S3 to Redshift

PATH="/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin"

if [ -z "${BATCH_SCRIPT_LOCATION}" ] || [ -z "${DATA_BUCKET_NAME}" ] || [ -z "${COPY_IAM_ROLE_ARN}" ]; then
    echo "BATCH_SCRIPT_LOCATION/DATA_BUCKET_NAME/COPY_IAM_ROLE_ARN not set. No script to run."
    exit 1
fi

# download the script to a temp file
TEMP_SCRIPT_FILE=$(mktemp)
aws s3 cp "${BATCH_SCRIPT_LOCATION}" "${TEMP_SCRIPT_FILE}"

# execute the script
# envsubst replaces the ${DATA_BUCKET_NAME} and ${COPY_IAM_ROLE_ARN} placeholders with actual values
envsubst < "${TEMP_SCRIPT_FILE}" | rsql -D etl

exit $?
```
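
These environment variables are supplied per job by the state machine. The following is a minimal sketch of how a Step Functions AWS Batch task can pass them with the CDK; the ARNs, bucket names, and script path are placeholder assumptions, not the values wired by `StepFunctionsStack`.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'JobSketchStack');

// Submit one RSQL ETL job; fetch_and_run.sh reads these variables inside the container.
new tasks.BatchSubmitJob(stack, 'LoadCustomers', {
  jobName: 'load-customers',
  jobDefinitionArn: 'arn:aws:batch:eu-west-1:123456789012:job-definition/RSQLETLJobDefinition',
  jobQueueArn: 'arn:aws:batch:eu-west-1:123456789012:job-queue/ETLJobQueue',
  containerOverrides: {
    environment: {
      BATCH_SCRIPT_LOCATION: 's3://my-scripts-bucket/load_customers.sql',
      DATA_BUCKET_NAME: 'my-data-bucket',
      COPY_IAM_ROLE_ARN: 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole',
    },
  },
});
```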

lib/batch-stack.ts

+97
```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as batch from 'aws-cdk-lib/aws-batch';
import * as redshift from 'aws-cdk-lib/aws-redshift';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as ecr from 'aws-cdk-lib/aws-ecr';

interface BatchProps extends StackProps {
  vpc: ec2.Vpc;
  scriptsBucket: s3.Bucket;
  redshift: redshift.CfnCluster;
  ecrRepository: ecr.Repository;
}

export class BatchStack extends Stack {
  readonly computeEnvironment: batch.CfnComputeEnvironment;
  readonly jobQueue: batch.CfnJobQueue;
  readonly jobDefinition: batch.CfnJobDefinition;
  readonly batchExecutionRole: iam.Role;
  readonly batchJobRole: iam.Role;

  constructor(scope: Construct, id: string, props: BatchProps) {
    super(scope, id, props);

    this.batchExecutionRole = new iam.Role(this, 'DemoBatchExecutionRole', {
      assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
      managedPolicies: [iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonECSTaskExecutionRolePolicy')],
    });

    this.batchJobRole = new iam.Role(this, 'DemoBatchJobRole', {
      assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
    });

    // batch job role needs:
    // 1. permissions to call GetClusterCredentials for our Redshift cluster
    this.batchJobRole.addToPolicy(new iam.PolicyStatement({
      sid: 'DescribeClusters',
      resources: [
        `arn:aws:redshift:${Stack.of(this).region}:${Stack.of(this).account}:cluster:${props.redshift.clusterIdentifier}`,
      ],
      actions: ['redshift:DescribeClusters'],
    }));
    this.batchJobRole.addToPolicy(new iam.PolicyStatement({
      sid: 'GetRedshiftClusterCredentials',
      resources: [
        `arn:aws:redshift:${Stack.of(this).region}:${Stack.of(this).account}:dbname:${props.redshift.clusterIdentifier}/demo`,
        `arn:aws:redshift:${Stack.of(this).region}:${Stack.of(this).account}:dbuser:${props.redshift.clusterIdentifier}/etl`,
      ],
      actions: ['redshift:GetClusterCredentials'],
    }));
    // 2. permissions to read scripts from S3
    props.scriptsBucket.grantRead(this.batchJobRole);

    this.computeEnvironment = new batch.CfnComputeEnvironment(this, 'DemoComputeEnv', {
      computeEnvironmentName: 'DemoComputeEnv',
      type: 'MANAGED',
      computeResources: {
        subnets: props.vpc.isolatedSubnets.map(s => s.subnetId),
        securityGroupIds: [props.vpc.vpcDefaultSecurityGroup],
        maxvCpus: 256,
        type: 'FARGATE'
      }
    });

    this.jobQueue = new batch.CfnJobQueue(this, 'ETLJobQueue', {
      jobQueueName: 'ETLJobQueue',
      computeEnvironmentOrder: [{
        computeEnvironment: this.computeEnvironment.ref,
        order: 1,
      }],
      priority: 1
    });

    this.jobDefinition = new batch.CfnJobDefinition(this, 'RSQLETLJobDefinition', {
      jobDefinitionName: 'RSQLETLJobDefinition',
      type: 'container',
      containerProperties: {
        image: props.ecrRepository.repositoryUri,
        executionRoleArn: this.batchExecutionRole.roleArn,
        jobRoleArn: this.batchJobRole.roleArn,
        fargatePlatformConfiguration: {
          platformVersion: 'LATEST',
        },
        resourceRequirements: [{
          type: 'VCPU',
          value: '0.25',
        }, {
          type: 'MEMORY',
          value: '512',
        }]
      },
      platformCapabilities: ['FARGATE'],
    });
  }
}
```
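
The `GetClusterCredentials` statement is scoped to the `demo` database and the `etl` user, which matches the `Database` and `DbUser` entries in `.odbc.ini`, so the container can obtain temporary database credentials instead of using a stored password.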

lib/ecr-stack.ts

+16
```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ecr from 'aws-cdk-lib/aws-ecr';

export class EcrRepositoryStack extends Stack {
  readonly repository: ecr.Repository;

  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    this.repository = new ecr.Repository(this, 'amazonlinux-rsql', {
      imageScanOnPush: true,
      encryption: ecr.RepositoryEncryption.KMS,
    });
  }
}
```
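
The repository enables image scanning on push and KMS encryption; `RsqlDockerImageStack` builds the RSQL Docker image and uploads it here.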
