This project sets up a development environment for evaluating Apache Iceberg with Trino compared to traditional Parquet for storing and querying MultiQC data directly from your S3 bucket.
The setup includes (a sketch of how these pieces fit together follows the list):
- Direct connection to your AWS S3 bucket (s3://megaqc-test/)
- Trino as the SQL query engine
- Apache Iceberg for table format
- Hive Metastore for schema registry
- PostgreSQL for the metastore backend
- Jupyter Notebook for running queries and evaluations
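For orientation, here is a minimal sketch of how these services could be wired together in `docker-compose.yml`. The image names, tags, and environment values are illustrative assumptions, not the repository's actual configuration; only the ports (8888, 8080) and the `trino/etc/` mount come from this README:

```yaml
services:
  postgres:
    image: postgres:15                 # metastore backend (image/tag assumed)
    environment:
      POSTGRES_DB: metastore
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive

  hive-metastore:
    image: apache/hive:4.0.0           # schema registry (image/tag assumed)
    depends_on:
      - postgres

  trino-coordinator:
    image: trinodb/trino               # SQL query engine
    ports:
      - "8080:8080"                    # Trino UI
    volumes:
      - ./trino/etc:/etc/trino         # Trino configuration files
    env_file: .env                     # AWS credentials for S3 access

  jupyter:
    image: jupyter/base-notebook       # runs the evaluation notebooks
    ports:
      - "8888:8888"
    env_file: .env
    volumes:
      - ./notebooks:/home/jovyan/work  # notebook directory (path assumed)
```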
To run it you will need:

- Docker and Docker Compose
- Git
- AWS credentials with access to the s3://megaqc-test/ bucket (example `.env` below)
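Credentials are supplied through the `.env` file described under the project layout below. A minimal example, assuming the standard AWS environment variable names:

```
AWS_ACCESS_KEY_ID=<your-access-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
AWS_DEFAULT_REGION=<your-bucket-region>
```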
1. Clone this repository:

   ```bash
   git clone <repository-url>
   cd example-timeline
   ```

2. Start the Docker containers:

   ```bash
   docker compose up -d
   ```

3. Wait for all services to start. You can check the status with:

   ```bash
   docker compose ps
   ```

   A readiness check for Trino itself is sketched below.
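`docker compose ps` only tells you the containers are up; the Trino coordinator can still be initializing. One way to wait for it, assuming `curl` and `grep` are available, is to poll Trino's REST info endpoint (its `"starting"` field flips to `false` once the server is ready):

```bash
# Poll the coordinator's info endpoint until it reports it is no longer starting
until curl -sf http://localhost:8080/v1/info | grep -q '"starting":false'; do
  sleep 2
done
echo "Trino is ready"
```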
- Jupyter Notebook: http://localhost:8888
- Trino UI: http://localhost:8080
- Open Jupyter Notebook at http://localhost:8888
- Navigate to `notebooks/iceberg_evaluation.ipynb`
- Run all cells to execute the performance comparison (a sketch of querying Trino outside the notebook follows below)
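You can also drive the same queries from any Python session using the `trino` client (`pip install trino`). The catalog and schema names below are assumptions about this setup, not confirmed values:

```python
import trino

# Connect to the Trino coordinator started by docker compose
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="dev",          # Trino requires a user name; any string works here
    catalog="iceberg",   # catalog name is an assumption
    schema="multiqc",    # schema name is an assumption
)

cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```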
The notebook compares:
- Storage Time: How long it takes to write data to Parquet vs. Iceberg
- Query Performance (example queries below):
  - Filtering by metric name across all runs
  - Filtering by `run_id` and `module_id`
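Hedged SQL illustrating the two access patterns; the fully qualified table name and the `metric_name`/`metric_value` columns are placeholders, so substitute whatever the notebook actually creates:

```sql
-- Access pattern 1: one metric across all runs
SELECT run_id, module_id, metric_value
FROM iceberg.multiqc.metrics
WHERE metric_name = 'percent_duplicates';

-- Access pattern 2: everything for one run and module
SELECT *
FROM iceberg.multiqc.metrics
WHERE run_id = 'run_0001'
  AND module_id = 'fastqc';
```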
- `docker-compose.yml` - Container configuration
- `trino/etc/` - Trino configuration files
- `notebooks/` - Jupyter notebooks for evaluation
- `exploring/` - Original Parquet test notebooks
- `.env` - Environment variables for AWS credentials
- Adjust the dataset size in the notebook by changing `NUM_RUNS`, `NUM_MODULES`, etc. (see the sketch below)
- Modify query patterns to test different access patterns
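For example, these size parameters would typically sit near the top of the notebook; the default values shown here are made up:

```python
# Dataset size knobs for the synthetic MultiQC data (values are illustrative)
NUM_RUNS = 100     # number of MultiQC runs to generate
NUM_MODULES = 20   # modules per run
```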
Beyond raw performance, Iceberg also brings (see the SQL sketch after this list):

- Schema evolution capabilities
- Time travel (querying data as of a specific point in time)
- Better handling of small file problems
- Transactional consistency
- Partition evolution
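Two of these are easy to try directly in Trino's SQL dialect; the table name is the same placeholder as above:

```sql
-- Time travel: query the table as of an earlier point in time
SELECT *
FROM iceberg.multiqc.metrics
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';

-- Schema evolution: add a column without rewriting existing data files
ALTER TABLE iceberg.multiqc.metrics
ADD COLUMN pipeline_version VARCHAR;
```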
- If Trino fails to start, check the logs with `docker compose logs trino-coordinator`
- If connections to AWS S3 fail, verify your credentials in the `.env` file
- If the Hive Metastore isn't accessible, check that PostgreSQL is running correctly