This repository contains code for a feature based machine learning tool for authenticating ancient DNA samples. The pipeline uses FALCON scores and quantitative features (relative size, CG content, N content) extracted from the DNA sequences.
- Normalized Relative Compression: Utilizes FALCON to generate similarity scores between the input DNA sequence and a reference database of DNA sequences.
- Quantitative Features: Extracts features like relative size, CG content, and N content from the DNA sequences.
- Machine Learning Models: Trains and evaluates different machine learning models (XGBoost, KNN, Neural Network, SVM, Gaussian Naive Bayes) for ancient DNA authentication.
- Binary Classification: Predicts whether a sample is above or below a certain age threshold.
- Performance Evaluation: Includes metrics like accuracy, precision, recall, F1-score, AUROC, and AUPRC.
- Python 3.x
- scikit-learn
- xgboost
- pandas
- matplotlib
- joblib
- FALCON (needs to be installed separately)
git clone [email protected]:viromelab/amtDNA-Authenticator.git
cd amtDNA-Authenticator
pip install -r requirements.txt
pip install -e .
git clone https://github.com/cobilab/falcon.git
cd falcon/src/
cmake .
make
cp FALCON ../../
cd ../../
cd ./scripts/
chmod +x build_examples.sh
source build_examples.sh
cd ..
It will build the examples for each mode of the program in the output folder examples
. After running, the examples will be ready to run. Therefore, you do not need to prepare data, it is provided in the example in the correct format.
-
Create a
multifasta.fa
file containing ancient DNA sequences for training, validation and testing. Place it in a first directory. This will be yourcontext
directory. -
Create a
multifasta.fa
file containing ancient DNA sequences for authentication. Place it in a second directory. This will be yourauth
directory. -
DNA sequences headers must have the format
>ID_AGE
.
You can jump to any step in the pipeline, since all data necessary to run in any mode is already available in the examples folders (e.g., data generated in step "2. Run the extract-features mode" that is needed to run step "3. Run the train mode" is already available in the context directory of the training example).
This mode will process the multifasta in the context folder, capitalizing the characters to ensure uniformity, removing duplicate samples, and dividing the data into training, validation and testing sets.
cd examples/example_process_multifasta/
authpipe --verbose process-multifasta --context_path context
This command will load the data processed in the processing-multifasta mode and stored in the context folder, processing it further to extract features for the input vector of the machine learning model.
Step 2 can take several hours.
cd examples/example_extract_features/
authpipe --verbose extract-features --context_path ./context
- Use the flag --falcon_verbose [-f] to show FALCON verbose
This command will train XGBoost model instances on the extracted features with thresholds from 100 to 6000 years old and window size of 100 years. At the end of the run, it will plot the results.
cd examples/example_train/
authpipe --verbose train --context_path ./context --lbound 100 --rbound 6000 --window 100 --model XGB -p
- Change
--model
argument to use a different machine learning model. - The
--window
,--rbound
, and--lbound
arguments control the age thresholds used for binary classification. - Use flag
--plot_results
[-p] to plot results from training phase
This command authenticates new amtDNA sequences in multi-FASTA format, classifying them as ANCIENT/MODERN with respect to a 2000 years threshold, based on the model instances trained in the previous step.
cd examples/example_authenticate_multifasta/
authpipe --verbose authenticate --context_path ./context --auth_path ./auth --threshold 2000 --model XGB
- You can choose the model and threshold accordingly to the existent in the
context/models
folder.
This command authenticates a new amtDNA sequence in a single FASTA format, classifying it as ANCIENT/MODERN with respect to a 2000 years threshold, based on the model instances trained in the previous step.
cd examples/example_authenticate_single_fasta/
authpipe --verbose authenticate --context_path ./context --auth_path ./auth --threshold 2000 --model XGB --single_path I2473.fa
- You can choose the model and threshold accordingly to the existent in the
context/models
folder.
- The trained model is saved in the
models
directory inside the context directory passed in command line. - Performance metrics and plots are generated using pyplot and can be saved locally.
- Authentication files are saved in the
PATH-TO-AUTH-DIRECTORY
folder.
GPLv3 License
Copyright (c) [2025] [Denis Yamunaque]
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.