This multilanguage project leverages Python, Bash, and JavaScript to build a lightweight Extract, Transform, and Load (ETL) pipeline that automates the process of:
- Extracting data from Excel sheets (with special handling for checked rows)
- Generating hyperlinks for PDF files in the Excel workbook
- Downloading PDF files via browser automation using Puppeteer
- Uploading and linking PDFs stored in SharePoint
The pipeline has been designed for efficient processing of multiple sheets and files while preserving advanced Excel formatting and ensuring secure, authenticated access via SharePoint.
- Project Structure
- Prerequisites
- Installation
- Configuration
- Usage
- Security Considerations
- Troubleshooting
- Future Improvements
- License
excel_sharepoint_etl/
├── .env # Environment configuration (e.g., SharePoint URLs, local paths)
├── .gitignore # Files and folders to ignore in version control
├── config.json # Confidential configuration file (ignored by Git)
├── cookies.json # Cookies required for authentication
├── add_trusted_location.reg # Optional: Windows Registry file to trust the project folder
├── scripts/ # Bash and orchestration scripts
│ ├── run_all.sh # Main pipeline script that executes the entire process
│ ├── check_env.sh # (Optional) Script to verify environment setup
│ └── bootstrap.sh # Sets up the virtual environment and installs dependencies
├── python/ # Python scripts for data processing and hyperlink generation
│ ├── add_hyperlinks.py # Creates hyperlink sheets for individual Excel sheets
│ ├── load_config.py # Alternative or updated version for hyperlink generation across all sheets
│ └── extract_checked.py # Extracts checked rows from Excel and generates CSV files
├── js/ # JavaScript code for browser automation (Puppeteer)
│ └── download_pdfs.js # Downloads PDFs from URLs listed in CSV files
├── data/ # Data generated and used during the pipeline
│ ├── csv/ # CSV files (e.g., to_download_*.csv) generated from Excel data
│ ├── pdfs/ # Downloaded PDF files stored in respective subfolders
│ └── failed_downloads.csv # Log file for any PDF download errors
├── logs/ # Detailed logs of pipeline executions
└── requirements.txt # Python dependency list
- Python 3.8+: Required for running the Python scripts.
- Node.js (v12+): For running the Puppeteer-based PDF download script.
- SharePoint Account: Ensure your account has the necessary permissions to access the configured document libraries and folders.
- OneDrive/SharePoint Sync Client (optional): For ensuring local files are synced with SharePoint.
- Git: To clone the repository (optional).
- jq: A lightweight command-line JSON processor (used by run_all.sh to read config.json).
-
Clone the Repository:
git clone https://your-repo-url.git cd download_link_pdf
-
Set Up the Python Environment:
Run the bootstrap script to set up a virtual environment and install dependencies:
./scripts/bootstrap.sh
-
Install Node Dependencies:
Navigate to the
js/
directory and install Puppeteer (if not already installed):cd js npm install cd ..
-
Environment Variables (.env file):
Create a
.env
file in the project root (or update the existing one) with your configuration. Key variables include:# Excel file to update EXCEL_PATH=/path/to/your/excel_file.xlsx # Local directories for PDFs (each corresponding to a specific sheet) PDF_FOLDER_1=/path/to/local/pdf_folder_1 PDF_FOLDER_2=/path/to/local/pdf_folder_2 PDF_FOLDER_3=/path/to/local/pdf_folder_3 PDF_FOLDER_4=/path/to/local/pdf_folder_4 PDF_FOLDER_5=/path/to/local/pdf_folder_5 # SharePoint configuration for PDF access SHAREPOINT_BASE_URL=https://sharepointbaseurl.sharepoint.com SHAREPOINT_LIBRARY_URL=https://sharepointbaseurl.sharepoint.com/link/to/SharePoint/Library SHAREPOINT_PDF_FOLDER=/link/to/SharePoint/PDF/Folder # (Optional) Other configurations such as cookies file for Puppeteer COOKIES_FILE=cookies.json
Note:
- The
SHAREPOINT_LIBRARY_URL
should be in plain text with spaces URL‑encoded (i.e.,%20
for spaces). SHAREPOINT_PDF_FOLDER
is the common relative path where all PDF subfolders reside. The individual subfolder names are defined in the confidential configuration.
- The
-
Confidential Configuration (config.json):
To protect sensitive information, move confidential mappings out of your source code. Create a file named config.json in the project root with contents similar to:
{ "SHEET_CONFIGS": { "Sheet 1 - Rows": { "local_pdf_dir": "/path/to/local/pdf_folder_1", "sp_subfolder": "Sheet 1 Folder" }, "Sheet 2 - Rows": { "local_pdf_dir": "/path/to/local/pdf_folder_2", "sp_subfolder": "Sheet 2 Folder" }, "Sheet 3 - Rows": { "local_pdf_dir": "/path/to/local/pdf_folder_3", "sp_subfolder": "Sheet 3 Folder" }, "Sheet 4 - Rows": { "local_pdf_dir": "/path/to/local/pdf_folder_4", "sp_subfolder": "Sheet 4 Folder" }, "Sheet 5 - Rows": { "local_pdf_dir": "/path/to/local/pdf_folder_5", "sp_subfolder": "Sheet 5 Folder" } } }
Important:
- Add config.json to your
.gitignore
so that confidential data isn’t pushed to GitHub. - Optionally, include a config.example.json in the repository for reference with dummy values.
- Add config.json to your
-
SharePoint Folder Structure:
Ensure that the PDFs are organized into subfolders under the common PDF folder specified in
SHAREPOINT_PDF_FOLDER
. Your sheets will map to these subfolders.
The main pipeline is orchestrated by the run_all.sh
script in the scripts/
folder. It performs the following steps for each configured sheet:
-
Extract Data:
Runs the Python scriptextract_checked.py
to extract rows with a "checked" status into a CSV file. -
Download PDFs:
Uses the Node.js scriptdownload_pdfs.js
to download PDFs from the URLs listed in the CSV file. -
Generate Hyperlink Sheets:
Calls the Python scriptadd_hyperlinks.py
to create new sheets in the Excel workbook that contain hyperlinks pointing to the PDFs stored on SharePoint.
To run the full pipeline:
./scripts/run_all.sh
-
Extract Checked Rows:
python3 python/extract_checked.py "Sheet Name"
-
Download PDFs:
node js/download_pdfs.js "/path/to/output/folder" "/path/to/csv_file.csv"
-
Add Hyperlinks to Excel:
python3 python/add_hyperlinks.py
A Tkinter-based GUI is available for users who prefer a clickable interface. The GUI application is located in the gui/
folder.
To launch the GUI:
-
Ensure you’ve set up the project as described above.
-
Navigate to the gui/ directory and run the application:
cd gui python3 app.py
The GUI provides a button to run the pipeline and a log display area to show the output in real-time.
-
Access Control:
Ensure that your SharePoint document libraries and folders have restricted access to only your team. Regularly audit permissions. -
Confidential Data:
Sensitive configuration data (such as sheet mappings) is stored in a separateconfig.json
file that is excluded from version control. -
Environment Variables:
Keep your.env
file out of version control and consider using a secrets manager if needed. -
HTTPS:
All communications use HTTPS to protect data in transit.
-
Hyperlink Issues in Excel:
If hyperlinks do not open the correct PDF, verify that your SharePoint environment variables (SHAREPOINT_BASE_URL
,SHAREPOINT_LIBRARY_URL
, andSHAREPOINT_PDF_FOLDER
) are set correctly and that the folder structure in SharePoint matches your configuration. -
Expired Cookies / CSRF Errors:
Ensure your SharePoint session is active or update your cookies file (COOKIES_FILE
) to include valid session cookies if encountering authentication issues. -
File Not Found:
If a hyperlink cell does not generate a URL, check that the local PDF exists in the designated folder and that the naming convention in Excel matches the file name exactly. -
Config Issues:
If the sheet names do not match or the configuration isn’t read properly, verify that yourconfig.json
is correctly formatted and that you’re usingjq
to parse it.
-
Enhanced Error Reporting:
Improve logging and error reporting to provide more detailed diagnostics for failed downloads or file access issues. -
User Interface Enhancements:
Expand the GUI to include additional controls, progress indicators, and error displays for a richer user experience. -
Automated Sync Checks:
Add a mechanism to verify that PDFs have successfully synced to SharePoint before generating hyperlinks.
This project is licensed under the MIT License. See the LICENSE file for details.
Developed by Dylan Picart at Partnership With Children.
For questions or contributions, contact [email protected].