This project sets up a development environment for evaluating Apache Iceberg with Trino compared to traditional Parquet for storing and querying MultiQC data directly from your S3 bucket.
The setup includes (a sketch of how these pieces fit together follows the list):
- Direct connection to your AWS S3 bucket (s3://megaqc-test/)
- Trino as the SQL query engine
- Apache Iceberg for table format
- Hive Metastore for schema registry
- PostgreSQL for the metastore backend
- Jupyter Notebook for running queries and evaluations
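For orientation, here is a minimal sketch of how these services could be wired together in `docker-compose.yml`. The image names, tags, and environment values are illustrative assumptions, not the repository's actual configuration; only the ports (8888, 8080) and the `trino/etc/` mount come from this README:

```yaml
services:
  postgres:
    image: postgres:15                 # metastore backend (image/tag assumed)
    environment:
      POSTGRES_DB: metastore
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive

  hive-metastore:
    image: apache/hive:4.0.0           # schema registry (image/tag assumed)
    depends_on:
      - postgres

  trino-coordinator:
    image: trinodb/trino               # SQL query engine
    ports:
      - "8080:8080"                    # Trino UI
    volumes:
      - ./trino/etc:/etc/trino         # Trino configuration files
    env_file: .env                     # AWS credentials for S3 access

  jupyter:
    image: jupyter/base-notebook       # runs the evaluation notebooks
    ports:
      - "8888:8888"
    env_file: .env
    volumes:
      - ./notebooks:/home/jovyan/work  # notebook directory (path assumed)
```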
To run it you will need:

- Docker and Docker Compose
- Git
- AWS credentials with access to the s3://megaqc-test/ bucket (example `.env` below)
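Credentials are supplied through the `.env` file described under the project layout below. A minimal example, assuming the standard AWS environment variable names:

```
AWS_ACCESS_KEY_ID=<your-access-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
AWS_DEFAULT_REGION=<your-bucket-region>
```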
1. Clone this repository:

   ```bash
   git clone <repository-url>
   cd example-timeline
   ```

2. Start the Docker containers:

   ```bash
   docker compose up -d
   ```

3. Wait for all services to start. You can check the status with:

   ```bash
   docker compose ps
   ```

   A readiness check for Trino itself is sketched below.
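`docker compose ps` only tells you the containers are up; the Trino coordinator can still be initializing. One way to wait for it, assuming `curl` and `grep` are available, is to poll Trino's REST info endpoint (its `"starting"` field flips to `false` once the server is ready):

```bash
# Poll the coordinator's info endpoint until it reports it is no longer starting
until curl -sf http://localhost:8080/v1/info | grep -q '"starting":false'; do
  sleep 2
done
echo "Trino is ready"
```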
- Jupyter Notebook: http://localhost:8888
- Trino UI: http://localhost:8080
- Open Jupyter Notebook at http://localhost:8888
- Navigate to `notebooks/iceberg_evaluation.ipynb`
- Run all cells to execute the performance comparison (a sketch of querying Trino outside the notebook follows below)
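You can also drive the same queries from any Python session using the `trino` client (`pip install trino`). The catalog and schema names below are assumptions about this setup, not confirmed values:

```python
import trino

# Connect to the Trino coordinator started by docker compose
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="dev",          # Trino requires a user name; any string works here
    catalog="iceberg",   # catalog name is an assumption
    schema="multiqc",    # schema name is an assumption
)

cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```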
The notebook compares:
- Storage Time: How long it takes to write data to Parquet vs. Iceberg
- Query Performance (example queries below):
  - Filtering by metric name across all runs
  - Filtering by `run_id` and `module_id`
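Hedged SQL illustrating the two access patterns; the fully qualified table name and the `metric_name`/`metric_value` columns are placeholders, so substitute whatever the notebook actually creates:

```sql
-- Access pattern 1: one metric across all runs
SELECT run_id, module_id, metric_value
FROM iceberg.multiqc.metrics
WHERE metric_name = 'percent_duplicates';

-- Access pattern 2: everything for one run and module
SELECT *
FROM iceberg.multiqc.metrics
WHERE run_id = 'run_0001'
  AND module_id = 'fastqc';
```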
- `docker-compose.yml` - Container configuration
- `trino/etc/` - Trino configuration files
- `notebooks/` - Jupyter notebooks for evaluation
- `exploring/` - Original Parquet test notebooks
- `.env` - Environment variables for AWS credentials
- Adjust the dataset size in the notebook by changing `NUM_RUNS`, `NUM_MODULES`, etc. (see the sketch below)
- Modify query patterns to test different access patterns
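For example, these size parameters would typically sit near the top of the notebook; the default values shown here are made up:

```python
# Dataset size knobs for the synthetic MultiQC data (values are illustrative)
NUM_RUNS = 100     # number of MultiQC runs to generate
NUM_MODULES = 20   # modules per run
```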
Beyond raw performance, Iceberg also brings (see the SQL sketch after this list):

- Schema evolution capabilities
- Time travel (querying data as of a specific point in time)
- Better handling of small file problems
- Transactional consistency
- Partition evolution
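Two of these are easy to try directly in Trino's SQL dialect; the table name is the same placeholder as above:

```sql
-- Time travel: query the table as of an earlier point in time
SELECT *
FROM iceberg.multiqc.metrics
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';

-- Schema evolution: add a column without rewriting existing data files
ALTER TABLE iceberg.multiqc.metrics
ADD COLUMN pipeline_version VARCHAR;
```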
- If Trino fails to start, check the logs with `docker compose logs trino-coordinator`
- If connections to AWS S3 fail, verify your credentials in the `.env` file
- If the Hive Metastore isn't accessible, check that PostgreSQL is running correctly