Commit 3ffb4f6: csv and excel

11 files changed, +933 −0 lines

.gitignore (+176 lines)

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

*.tmp*

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
#uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# Ruff stuff:
.ruff_cache/

# PyPI configuration file
.pypirc

README.md (+111 lines)

# CSV Analyzer with Grouping

A Python utility for loading, analyzing, and grouping CSV files based on common columns. This tool is particularly useful when you need to process multiple CSV files and group their data by a specific column while preserving the original column structure.

## Features

- Load CSV files from a directory or from specific file paths
- Group data by any common column across CSV files
- Preserve original column names without modification
- Track source files in the output
- Handle matched and unmatched files separately
- Export results to organized CSV files

## Requirements

- Python 3.6+
- pandas

(`pathlib` is also used, but it is part of the standard library and needs no installation.)

## Installation

1. Clone this repository or copy the `csv_analyzer.py` file into your project
2. Install the required dependency:

```bash
pip install pandas
```

## Usage

### Basic Usage

```python
from csv_analyzer import CSVAnalyzerGrouping

# Initialize the analyzer
analyzer = CSVAnalyzerGrouping()

# Load CSV files from a directory
analyzer.load_from_directory("path/to/your/csvs")

# Or load specific CSV files
analyzer.load_from_files(["file1.csv", "file2.csv"])

# Group data by a specific column
result = analyzer.grouped_data_by_column("category")

# Export the results
output_dir = ".tmp"
analyzer.export_matched_data(output_dir, result, "grouped_by_category")
analyzer.export_unmatched_data(output_dir, result)
```

### Output Structure

The tool creates:

- A combined CSV file containing all grouped data, with the original columns plus a `source_file` column
- The `source_file` column is always the last column in the output
- Original column names are preserved without any aggregation suffixes

### Example Output Format

For input CSV files containing the columns `name,category,link,tag,label,id,x_path`, the output keeps the same structure with `source_file` appended as the last column:

```
name,category,link,tag,label,id,x_path,source_file
```
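The combining behavior described above can be sketched in plain pandas. This is a minimal illustration of the column ordering, not the tool's actual implementation; the in-memory `files` dict stands in for CSV files on disk.

```python
# Minimal sketch (assumed behavior, not csv_analyzer.py itself): combine CSVs
# that share a grouping column and append source_file as the last column.
import io

import pandas as pd

# Stand-ins for two CSV files on disk.
files = {
    "data01.csv": "name,category\na,c1\nb,c2\n",
    "data02.csv": "name,category\nc,c1\n",
}

frames = []
for name, text in files.items():
    df = pd.read_csv(io.StringIO(text))
    df["source_file"] = name  # appended last, so it stays the final column
    frames.append(df)

combined = pd.concat(frames, ignore_index=True).sort_values("category")
print(list(combined.columns))  # ['name', 'category', 'source_file']
```

Because `source_file` is assigned after each frame is read, pandas keeps it as the trailing column through `concat`, matching the documented output layout.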
## Methods

### `load_from_directory(path: str)`

Loads all CSV files from the specified directory.

### `load_from_files(files: List[str])`

Loads the CSV files at the provided file paths.

### `grouped_data_by_column(column_name: str)`

Groups data by the specified column for the files that contain it.

### `export_matched_data(output_dir: str, dataset: Dict, output_prefix: str)`

Exports matched (grouped) data to a single combined CSV file.

### `export_unmatched_data(output_dir: str, dataset: Dict)`

Exports unmatched data to separate CSV files.

## Error Handling

The tool includes error handling for:

- Invalid directory paths
- File reading errors
- Grouping operation failures
- Export errors

Each operation provides clear feedback through console messages.
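The `csv_analyzer.py` source is not shown in this excerpt. As a rough guide, a class exposing the interface documented above might look like the following sketch; the method names come from this README, while all internals (the `frames` dict, the `matched`/`unmatched` result keys, the `unmatched_` filename prefix) are assumptions for illustration.

```python
# Hypothetical sketch of a CSVAnalyzerGrouping-style class; not the project's code.
from pathlib import Path
from typing import Dict, List

import pandas as pd


class CSVAnalyzerGrouping:
    """Loads CSVs, groups those sharing a column, and exports the results."""

    def __init__(self) -> None:
        self.frames: Dict[str, pd.DataFrame] = {}  # filename -> DataFrame

    def load_from_directory(self, path: str) -> None:
        directory = Path(path)
        if not directory.is_dir():
            raise NotADirectoryError(path)
        self.load_from_files([str(p) for p in sorted(directory.glob("*.csv"))])

    def load_from_files(self, files: List[str]) -> None:
        for f in files:
            self.frames[Path(f).name] = pd.read_csv(f)

    def grouped_data_by_column(self, column_name: str) -> Dict:
        matched, unmatched = [], {}
        for name, df in self.frames.items():
            if column_name in df.columns:
                tagged = df.copy()
                tagged["source_file"] = name  # always appended as the last column
                matched.append(tagged)
            else:
                unmatched[name] = df
        combined = (
            pd.concat(matched, ignore_index=True).sort_values(column_name, ignore_index=True)
            if matched
            else pd.DataFrame()
        )
        return {"matched": combined, "unmatched": unmatched}

    def export_matched_data(self, output_dir: str, dataset: Dict, output_prefix: str) -> None:
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        dataset["matched"].to_csv(out / f"{output_prefix}.csv", index=False)

    def export_unmatched_data(self, output_dir: str, dataset: Dict) -> None:
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        for name, df in dataset["unmatched"].items():
            df.to_csv(out / f"unmatched_{name}", index=False)
```

Keeping the grouped frames in a filename-keyed dict makes the `source_file` tagging trivial and lets unmatched files round-trip to disk unchanged.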
## Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

## License

This project is licensed under the MIT License; see the LICENSE file for details.

__tests__/testdata/data01.csv (+4 lines)

name,category,link,tag,label,id
data01-name-1,data01-category-1,data01-l-1,data01-t-1,data01-lab-1,data01-id-1
data01-name-1,data01-category-1,data01-l-1,data01-t-1,data01-lab-1,data01-id-1
data01-name-2,data01-category-2,data01-l-2,data01-t-2,data01-lab-2,data01-id-2

__tests__/testdata/data02.csv (+3 lines)

name,category,link,tag,label,id,x_path
data02-name-1,data02-category-1,data02-l-1,data02-t-1,data02-lab-1,data02-id-1,x
data02-name-2,data02-category-2,data02-l-2,data02-t-2,data02-lab-2,data02-id-2,y

__tests__/testdata/data03.csv (+3 lines)

name,category,link,tag,label,id
data03-name-1,data03-category-1,data03-l-1,data03-t-1,data03-lab-1,data03-id-1
data03-name-2,data03-category-2,data03-l-2,data03-t-2,data03-lab-2,data03-id-2

__tests__/testdata/track01.csv (+5 lines)

name,description,link,tag,label,id
track01-name-1,track01-d-1,track01-l-1,track01-t-1,track01-lab-1,track01-id-1
track01-name-2,track01-d-2,track01-l-2,track01-t-2,track01-lab-2,track01-id-2
track01-name-3,track01-d-3,track01-l-3,track01-t-3,track01-lab-3,track01-id-3
track01-name-3,track01-d-4,track01-l-4,track01-t-4,track01-lab-4,track01-id-4

__tests__/testdata/track02.csv (+3 lines)

name,description,link,tag,label,id
track02-name-1,track02-d-1,track02-l-1,track02-t-1,track02-lab-1,track02-id-1
track02-name-2,track02-d-2,track02-l-2,track02-t-2,track02-lab-2,track02-id-2
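These fixtures exercise the matched/unmatched split: the `data*.csv` files share a `category` column, while the `track*.csv` files have `description` instead. A small pandas sketch of that split (illustrative only, not the project's test code), with the fixture contents inlined so it is self-contained:

```python
# Sketch: files containing the grouping column are "matched"; others are not.
import io

import pandas as pd

data03 = pd.read_csv(io.StringIO(
    "name,category,link,tag,label,id\n"
    "data03-name-1,data03-category-1,data03-l-1,data03-t-1,data03-lab-1,data03-id-1\n"
    "data03-name-2,data03-category-2,data03-l-2,data03-t-2,data03-lab-2,data03-id-2\n"
))
track02 = pd.read_csv(io.StringIO(
    "name,description,link,tag,label,id\n"
    "track02-name-1,track02-d-1,track02-l-1,track02-t-1,track02-lab-1,track02-id-1\n"
    "track02-name-2,track02-d-2,track02-l-2,track02-t-2,track02-lab-2,track02-id-2\n"
))

frames = {"data03.csv": data03, "track02.csv": track02}
matched = [n for n, df in frames.items() if "category" in df.columns]
unmatched = [n for n, df in frames.items() if "category" not in df.columns]
print(matched, unmatched)  # ['data03.csv'] ['track02.csv']
```

Grouping by `category` would therefore combine the three `data*.csv` files and leave `track01.csv` and `track02.csv` to be exported separately as unmatched.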
