
Commit 9f1588a

Authored by LeonLuttenberger, kukushking, and jaidisido
docs: Include our ADRs in GitHub (#2215)
Co-authored-by: kukushking <[email protected]>
Co-authored-by: jaidisido <[email protected]>
1 parent 7939b0e · commit 9f1588a

15 files changed: +2781 -2453 lines changed
Diff for: .adr-dir (+1)

adr
Diff for: adr/0001-record-architecture-decisions.md (+19)

# 1. Record architecture decisions

Date: 2023-03-08

## Status

Accepted

## Context

We need to record the architectural decisions made on this project.

## Decision

We will use Architecture Decision Records, as [described by Michael Nygard](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions).

## Consequences

See Michael Nygard's article, linked above. For a lightweight ADR toolset, see Nat Pryce's [adr-tools](https://github.com/npryce/adr-tools).
# 2. Handling unsupported arguments in distributed mode

Date: 2023-03-09

## Status

Accepted

## Context

Many of the API functions allow the user to pass their own `boto3` session, which will then be used by all the underlying `boto3` calls. With distributed computing, one of the limitations we have is that we cannot pass the `boto3` session to the worker nodes.

Boto3 sessions are not thread-safe, and therefore cannot be passed to Ray workers. The credentials behind a `boto3` session cannot be sent to Ray workers either, since sending credentials over the network is considered a security risk.

This raises the question of what to do when, in distributed mode, the customer passes arguments that are normally supported but aren't supported in distributed mode.

## Decision

When a user passes arguments that are unsupported by distributed mode, the function should fail immediately.

The main alternative to this approach would be to use a passed parameter, such as a `boto3` session, wherever possible. This could result in a situation where, when reading Parquet files from S3, the process of listing the files uses the `boto3` session whereas the reading of the Parquet files doesn't. This could result in inconsistent behavior, as part of the function uses the extra parameters while the other part of it doesn't.

Another alternative would simply be to ignore the unsupported parameters, while potentially outputting a warning. The main issue with this approach is that if a customer tells our API functions to use certain parameters, they expect those parameters to be used. By ignoring them, the AWS SDK for pandas API would be doing something different from what the customer asked, without properly notifying them, and would thus lose the customer's trust.

## Consequences

In [PR#2051](https://github.com/aws/aws-sdk-pandas/pull/2051), the `validate_distributed_kwargs` annotation was introduced, which can check for the presence of arguments that are unsupported in distributed mode.

The annotation has also been applied for arguments such as `s3_additional_kwargs` and `version_id` when reading/writing data on S3.
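As an illustration of the fail-fast behavior described above, such a validation decorator could look roughly like the sketch below. The decorator name follows the ADR, but the module-level flag, signature, and example function are assumptions made for this sketch, not the library's actual implementation:

```python
import functools
from typing import Any, Callable, Tuple

# Hypothetical flag standing in for the library's real engine-detection logic.
DISTRIBUTED_MODE = False


def validate_distributed_kwargs(unsupported: Tuple[str, ...]) -> Callable:
    """Fail immediately if arguments that distributed mode cannot honor are passed."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if DISTRIBUTED_MODE:
                passed = [name for name in unsupported if kwargs.get(name) is not None]
                if passed:
                    raise ValueError(f"Arguments not supported in distributed mode: {passed}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical usage on an API function:
@validate_distributed_kwargs(unsupported=("boto3_session", "s3_additional_kwargs"))
def read_parquet(path: str, **kwargs: Any) -> None:
    ...  # the regular implementation would go here
```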
# 3. Use TypedDict to group similar parameters

Date: 2023-03-10

## Status

Accepted

## Context

*AWS SDK for pandas* API methods contain many parameters which are related to a specific behaviour or setting. For example, methods which have an option to update the AWS Glue Catalog, such as `to_csv` and `to_parquet`, contain a list of parameters that define the settings for the table in AWS Glue. These settings include the table description, column comments, the table type, etc.

As a consequence, some of our functions have grown to include dozens of parameters. When reading the function signatures, it can be unclear which parameters are related to which functionality. For example, it's not immediately obvious that the parameter `column_comments` in `s3.to_parquet` only writes the column comments into the AWS Glue Catalog, and not to S3.

## Decision

Parameters that are related to similar functionality will be replaced by a single parameter of type [TypedDict](https://peps.python.org/pep-0589/). This will allow us to reduce the number of parameters in our API functions, and also make it clearer that certain parameters are only related to specific functionalities.

For example, parameters related to Athena cache settings will be extracted into a parameter of type `AthenaCacheSettings`, parameters related to Ray settings will be extracted into `RayReadParquetSettings`, etc.

The usage of `TypedDict` allows the user to define the parameters as regular dictionaries with string keys, while empowering type checkers such as `mypy`. Alternatively, implementations such as `AthenaCacheSettings` can be instantiated as classes.
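To illustrate the shape of such a grouping, a `TypedDict` subclass might be defined roughly as follows; the field list here is an assumed, reduced subset for illustration rather than the complete real definition:

```python
from typing import TypedDict


class AthenaCacheSettings(TypedDict, total=False):
    """Illustrative, reduced subset of cache-related Athena query settings."""

    max_cache_seconds: int
    max_cache_query_inspections: int
    max_remote_cache_entries: int
```

Callers can then pass either a plain dictionary with these keys or an instance of the class, as shown in the usage example under Consequences below.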
### Alternatives

The main alternative that was considered was the idea of using `dataclass` instead of `TypedDict`. The advantage of this alternative would be that default values for parameters could be defined directly in the class signature, rather than needing to be defined in the function which uses the parameter.

On the other hand, the main issue with using `dataclass` is that it would require the customer to figure out which class needs to be imported. With `TypedDict`, this is just one of the options; the parameters can simply be passed as a typical Python dictionary.

This alternative was discussed in more detail as part of [PR#1855](https://github.com/aws/aws-sdk-pandas/pull/1855#issuecomment-1353618099).

## Consequences

Subclasses of `TypedDict` such as `GlueCatalogParameters`, `AthenaCacheSettings`, `AthenaUNLOADSettings`, `AthenaCTASSettings` and `RaySettings` have been created. They are defined in the `awswrangler.typing` module.

These parameter groupings can be used in either of the following two ways:

```python
wr.athena.read_sql_query(
    "SELECT * FROM ...",
    ctas_approach=True,
    athena_cache_settings={"max_cache_seconds": 900},
)

wr.athena.read_sql_query(
    "SELECT * FROM ...",
    ctas_approach=True,
    athena_cache_settings=wr.typing.AthenaCacheSettings(
        max_cache_seconds=900,
    ),
)
```

Many of our function signatures have been changed to take advantage of this refactor. Many of these are breaking changes which will be released as part of the next major version: `3.0.0`.

Diff for: adr/0004-no-alter-iam-permissions.md (+20)

# 4. AWS SDK for pandas does not alter IAM permissions

Date: 2023-03-15

## Status

Accepted

## Context

AWS SDK for pandas requires permissions to execute AWS API calls. Permissions are granted using AWS Identity and Access Management (IAM) policies that are attached to IAM entities (users or roles).

## Decision

AWS SDK for pandas does not alter (create, update, delete) IAM permissions policies attached to the IAM entities.

## Consequences

It is the user's responsibility to ensure that the IAM entities they are using to execute the calls have the required permissions.

Diff for: adr/0005-move-dependencies-to-optional.md (+28)

# 5. Move dependencies to optional

Date: 2023-03-15

## Status

Accepted

## Context

AWS SDK for pandas relies on external dependencies in some of its modules. These include `redshift-connector`, `gremlinpython` and `pymysql`, to name a few.

In versions 2.x and below, most of these packages were set as required, meaning they were installed regardless of whether the user actually needed them. This introduced two major risks and issues as the number of dependencies increased:

1. **Security risk**: Unused dependencies increase the attack surface to manage. Users must scan them and ensure that they are kept up to date even though they don't need them.
2. **Dependency hell**: Users must resolve dependencies for packages that they are not using. This can lead to dependency hell and prevent critical updates related to security patches and major bugs.

## Decision

A breaking change is introduced in version 3.x where the number of required dependencies is reduced to the most important ones, namely:

* boto3
* pandas
* numpy
* pyarrow
* typing-extensions

## Consequences

All other dependencies are moved to optional and must be installed by the user separately using `pip install awswrangler[dependency]`. For instance, the command to use the Redshift APIs is `pip install awswrangler[redshift]`. Failing to do so raises an exception informing the user that the package is missing and how to install it.
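The optional-dependency behavior described above typically relies on an import guard. A minimal sketch of one possible form follows; the helper name and message wording are assumptions for illustration, not the library's actual code:

```python
import importlib
from types import ModuleType


def _import_optional(name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or explain how to install it if missing (illustrative)."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"Missing optional dependency '{name}'. "
            f"Install it with: pip install awswrangler[{extra}]"
        ) from exc


# Example usage inside a module that needs the optional package:
# redshift_connector = _import_optional("redshift_connector", extra="redshift")
```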

Diff for: adr/0006-deprecate-s3-merge-upsert-table.md (+21)

# 6. Deprecate wr.s3.merge_upsert_table

Date: 2023-03-15

## Status

Accepted

## Context

AWS SDK for pandas `wr.s3.merge_upsert_table` is used to perform upsert (update else insert) onto an existing AWS Glue Data Catalog table. It is a much-simplified version of the upsert functionality that is supported natively by Apache Hudi and Athena Iceberg tables, and does not, for example, handle partitioned datasets.

## Decision

To avoid a poor user experience, `wr.s3.merge_upsert_table` is deprecated and will be removed in the 3.0 release.

## Consequences

In [PR#2076](https://github.com/aws/aws-sdk-pandas/pull/2076), the `wr.s3.merge_upsert_table` function was removed.

Diff for: adr/0007-design-of-engine-and-memory-format.md (+39)

# 7. Design of engine and memory format

Date: 2023-03-16

## Status

Accepted

## Context

Ray and Modin are the two frameworks used to support running `awswrangler` APIs at scale. Adding them to the codebase requires significant refactoring work. The original approach considered was to handle both distributed and non-distributed code within the same modules. This quickly turned out to be undesirable as it affected the readability, maintainability and scalability of the codebase.

## Decision

Version 3.x of the library introduces two new constructs, `engine` and `memory_format`, which are designed to address the aforementioned shortcomings of the original approach, but also provide additional functionality.

Currently `engine` takes one of two values: `python` (default) or `ray`, but additional engines could be onboarded in the future. The value is determined at import based on installed dependencies. The user can override this value with `wr.engine.set("engine_name")`. Likewise, `memory_format` can be set to `pandas` (default) or `modin` and overridden with `wr.memory_format.set("memory_format_name")`.

A custom dispatcher is used to register functions based on the execution engine and memory format values. For instance, if the `ray` engine is detected at import, then methods distributed with Ray are used instead of the default AWS SDK for pandas code.
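A rough sketch of how such engine-based dispatch could work is shown below; the registry layout, helper names, and example functions are illustrative assumptions, not the library's actual internals:

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry mapping (engine, function name) -> implementation.
_REGISTRY: Dict[Tuple[str, str], Callable[..., Any]] = {}
_CURRENT_ENGINE = "python"


def register(engine: str, name: str) -> Callable:
    """Register an implementation of `name` for the given engine."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        _REGISTRY[(engine, name)] = func
        return func
    return decorator


def dispatch(name: str, *args: Any, **kwargs: Any) -> Any:
    """Call the implementation registered for the active engine, falling back to the default."""
    func = _REGISTRY.get((_CURRENT_ENGINE, name), _REGISTRY[("python", name)])
    return func(*args, **kwargs)


@register("python", "read_parquet")
def _read_parquet_python(path: str) -> str:
    return f"reading {path} with the default (non-distributed) code path"


@register("ray", "read_parquet")
def _read_parquet_ray(path: str) -> str:
    return f"reading {path} with the Ray-distributed code path"


# With the default engine, dispatch("read_parquet", "s3://bucket/key") calls the python variant;
# switching _CURRENT_ENGINE to "ray" would route the same call to the Ray variant.
```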
## Consequences

__The good__:

*Clear separation of concerns*: Distributed methods live outside non-distributed code, eliminating ugly if conditionals, allowing both to scale independently and making them easier to maintain in the future

*Better dispatching*: Adding a new engine/memory format is as simple as creating a new directory with its methods and registering them with the custom dispatcher based on the value of the engine or memory format

*Custom engine/memory format classes*: Give more flexibility than config when it comes to interacting with the engine and managing its state (initialising, registering, get/setting...)

__The bad__:

*Managing state*: Adding a custom dispatcher means that we must maintain its state. For instance, unregistering methods when a user sets a different engine (e.g. moving from ray to dask at execution time) is currently unsupported

*Detecting the engine*: Conditionals are simpler/easier when it comes to detecting an engine. With a custom dispatcher, the registration and dispatching process is more opaque/convoluted. For example, there is a higher risk of not realising that we are using a given engine vs another

__The ugly__:

*Unused arguments*: Each method registered with the dispatcher must accept the union of both non-distributed and distributed arguments, even though some would be unused. As the list of supported engines grows, so does the number of unused arguments. It also means that we must maintain the same list of arguments across the different versions of the method

Diff for: docs/source/_ext/copy_adr.py (+12)

import shutil
from pathlib import Path


def setup(app):
    """Sphinx extension entry point: copy the ADR markdown files into docs/source/adr."""
    file_dir = Path(__file__).parent

    source_dir = file_dir.joinpath("../../../adr").resolve()
    destination_dir = file_dir.joinpath("../adr/").resolve()

    for file in source_dir.glob("*.md"):
        shutil.copy(file, destination_dir)

Diff for: docs/source/adr.rst (+13)

Architectural Decision Records
==============================

A collection of records for "architecturally significant" decisions:
those that affect the structure, non-functional characteristics, dependencies, interfaces, or construction techniques.

.. note:: You can also find all ADRs on `GitHub <https://github.com/aws/aws-sdk-pandas/tree/main/docs/adr>`_.

.. toctree::
   :maxdepth: 1
   :glob:

   adr/*

Diff for: docs/source/adr/.gitignore (+2)

*
!.gitignore

Diff for: docs/source/conf.py (+2)

@@ -44,7 +44,9 @@
     "sphinx.ext.napoleon",
     "nbsphinx",
     "nbsphinx_link",
+    "myst_parser",
     "copy_tutorials",
+    "copy_adr",
     "IPython.sphinxext.ipython_console_highlighting",
 ]

Diff for: docs/source/index.rst (+1)

@@ -68,6 +68,7 @@ Read The Docs
    install
    scale
    tutorials
+   adr
    api
    Community Resources <https://github.com/aws/aws-sdk-pandas#community-resources>
    Logging <https://github.com/aws/aws-sdk-pandas#logging>
