Barzooka

Barzooka screens publications and detects different graph types. The following graph types can be detected:

bar graphs of continuous data
bar graphs of counts or proportions
bar graphs with dot plots
pie charts
dot plots
box plots
histograms
violin plots
flow charts

Description

Barzooka is based on a deep convolutional network trained using the fastai Python package (https://docs.fast.ai/). It screens a publication at the page level and can detect multiple graph types per page.

Authors

Nico Riedel, Vladislav Nachev ([email protected]), Robert Schulz, Vartan Kazezian, Tracey Weissgerber - QUEST Center for Responsible Research, Berlin Institute of Health (BIH) at Charité – Universitätsmedizin Berlin, Berlin, Germany

Installation

It is recommended to set up a separate environment first, e.g. using conda:

conda create -n env_name python=3.10
conda activate env_name

If it is not already installed, install the poppler library, which is used to convert the PDF files to images for screening:

conda install poppler

Then run the following command from the package folder:

pip install .

Usage

Barzooka consists of a model file (.pkl) that stores the trained network parameters and a Python class file with a simple interface. As Barzooka is trained using fastai, make sure you have installed fastai.

Create Barzooka object and load model file:

import barzooka
b = barzooka.Barzooka()

Predict from single image:

b.predict_from_img('./barzooka/examples/img/box1.jpg')

Returns a list of all classes detected in the image.

Predict from image folder:

b.predict_from_img_folder('./barzooka/examples/img/')

Use this function if you have a folder with images of individual pages. This returns the filenames and the detected classes per filename.

Predict from PDF file:

b.predict_from_file('./barzooka/examples/pdf/doc.pdf')

Returns a dict with, for each class, the number of pages on which that class was detected.
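A result of this kind could be post-processed as in the following minimal sketch. It assumes that the dict described above maps class labels to page counts; the helper name detected_graph_types and the example dict are hypothetical and not part of Barzooka. Entries that are not integer counts are skipped, since the exact contents of the returned dict are not specified here.

```python
def detected_graph_types(page_counts, ignore=("text", "other")):
    """Return class labels with at least one detected page, most frequent first."""
    hits = {
        label: n
        for label, n in page_counts.items()
        # Skip ignored classes and any entry that is not a positive integer count.
        if label not in ignore and isinstance(n, int) and n > 0
    }
    return sorted(hits, key=hits.get, reverse=True)


# Hypothetical result, shaped like the dict described above (not real output).
example = {"bar": 3, "pie": 0, "box": 1, "text": 7, "other": 2}
print(detected_graph_types(example))  # ['bar', 'box']
```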

Predict from PDF folder:

b.predict_from_folder('./barzooka/examples/pdf/', 'results.csv', tmp_folder='./tmp/')

Use this function if you have a folder with PDFs. Each PDF is temporarily converted to page images (stored in tmp_folder, which is created if it does not exist), and the prediction results for all pages of the PDF are combined. The results are saved in CSV format under save_filename ('results.csv' in the call above). The temporary images are deleted afterwards. The PDF conversion uses the command-line tool pdftocairo, so please make sure that it is installed on your system.
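Before screening a folder of PDFs, it can be worth checking that pdftocairo is actually reachable. The following is a minimal sketch using only the Python standard library; the function name pdftocairo_available is hypothetical and not part of Barzooka.

```python
import shutil


def pdftocairo_available(tool="pdftocairo"):
    """Return True if the given command-line tool is found on the PATH."""
    return shutil.which(tool) is not None


if not pdftocairo_available():
    print("pdftocairo not found; install poppler before screening PDFs")
```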

The result files use the following encoding for the different classes:

class label	description
approp	Bar graphs of counts or proportions
bar	Bar graphs of continuous data
bardot	Bar graphs with dot plots
box	Box plot
dot	Dot plot
flowno	Flow chart without numbers
flowyes	Flow chart with numbers
hist	Histogram
other	Other non-text pages that do not fit any of the other graph types
pie	Pie chart
text	Pages with only text
violin	Violin plot
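For reporting, the table above can be mirrored as a lookup from class label to description. This is only an illustrative sketch: the names CLASS_DESCRIPTIONS and describe are hypothetical and are not provided by Barzooka.

```python
# Class labels and descriptions copied from the table above.
CLASS_DESCRIPTIONS = {
    "approp": "Bar graphs of counts or proportions",
    "bar": "Bar graphs of continuous data",
    "bardot": "Bar graphs with dot plots",
    "box": "Box plot",
    "dot": "Dot plot",
    "flowno": "Flow chart without numbers",
    "flowyes": "Flow chart with numbers",
    "hist": "Histogram",
    "other": "Other non-text pages that do not fit any of the other graph types",
    "pie": "Pie chart",
    "text": "Pages with only text",
    "violin": "Violin plot",
}


def describe(labels):
    """Translate class labels into descriptions; unknown labels pass through unchanged."""
    return [CLASS_DESCRIPTIONS.get(label, label) for label in labels]


print(describe(["box", "violin"]))  # ['Box plot', 'Violin plot']
```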

If the following error occurs on Windows machines:

raise NotImplementedError("cannot instantiate %r on your system")
NotImplementedError: cannot instantiate 'PosixPath' on your system

try to first execute the following lines to redirect the path:

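import pathlib
# Keep a reference to the original class so that it can be restored later
temp = pathlib.PosixPath
# Let PosixPath objects stored in the model file load as WindowsPath
pathlib.PosixPath = pathlib.WindowsPath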
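The same redirect can be limited to Windows and undone automatically afterwards. The following is a minimal sketch of that idea using a context manager; the name posix_path_as_windows_path is hypothetical and not part of Barzooka, and the usage lines are commented out because they assume that barzooka is installed.

```python
import pathlib
import platform
from contextlib import contextmanager


@contextmanager
def posix_path_as_windows_path():
    """Temporarily alias pathlib.PosixPath to WindowsPath, on Windows only."""
    original = pathlib.PosixPath
    if platform.system() == "Windows":
        pathlib.PosixPath = pathlib.WindowsPath
    try:
        yield
    finally:
        # Always restore the original class, even if loading fails.
        pathlib.PosixPath = original


# Usage (assumes barzooka is installed):
# with posix_path_as_windows_path():
#     import barzooka
#     b = barzooka.Barzooka()
```

On other operating systems the context manager changes nothing, so the same script can run unchanged everywhere.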

Performance

The algorithm was trained on a set of 37,784 pages derived from a set of biomedical open access publications from PubMed Central and eLife. The number of examples per class ranges from 837 (flowno) to 8615 (other). For internal validation, a set of 3812 pages gathered from the same sources but not used for training was used. Additionally, two separate validation datasets were used: 1107 bioRxiv preprints and 1000 publications by authors affiliated with Charité – Universitätsmedizin Berlin (only for flow & pie charts).

Performance on the internal validation dataset:

Class	Manually labeled cases	False positives	precision	recall	F1 score
Bar graph of counts or proportions (appropriate)	407	65	0.84	0.86	0.85
Bar graph of continuous data (inappropriate)	671	35	0.95	0.91	0.93
Bar graph with dot plot	149	10	0.93	0.91	0.92
Dot plot	393	33	0.91	0.85	0.88
Box plot	368	29	0.92	0.88	0.90
Violin plot	340	13	0.96	0.96	0.95
Histogram	238	32	0.86	0.86	0.83
Flow chart	276	32	0.89	0.91	0.90
Pie chart	160	5	0.97	0.92	0.94

Performance on the bioRxiv validation dataset:

Class	Manually labeled cases	False positives	precision	recall	F1 score
Bar graph of counts or proportions (appropriate)	345	60	0.82	0.81	0.82
Bar graph of continuous data (inappropriate)	405	25	0.94	0.92	0.93
Bar graph with dot plot	74	37	0.63	0.86	0.73
Dot plot	257	51	0.80	0.80	0.80
Box plot	255	36	0.87	0.91	0.89
Violin plot	57	27	0.65	0.89	0.76
Histogram	198	66	0.72	0.85	0.78
Flow chart	20	26	0.40	0.85	0.54
Pie chart	71	12	0.83	0.85	0.84

Performance on the Charité validation dataset:

Class	Manually labeled cases	False positives	precision	recall	F1 score
Flow chart	123	20	0.84	0.87	0.86
Pie chart	38	4	0.89	0.87	0.88
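For readers who want to relate the columns of these tables to each other, the following minimal sketch computes precision, recall and F1 score from raw counts using the standard definitions. It assumes that "Manually labeled cases" corresponds to true positives plus false negatives. The example counts are hypothetical: they are chosen to be consistent with the Pie chart row of the Charité table (38 labeled cases, 4 false positives), but the true-positive count of 33 is an assumption and is not stated in the tables.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp)  # share of detections that are correct
    recall = tp / (tp + fn)     # share of labeled cases that were detected
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both
    return precision, recall, f1


# Hypothetical counts: 38 labeled cases, of which 33 are assumed to be detected.
p, r, f1 = precision_recall_f1(tp=33, fp=4, fn=5)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 0.89 0.87 0.88
```

The sketch does not guard against zero denominators, which would occur for a class with no detections or no labeled cases.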
