
Commit 82cb317

Initial Commit
Author: Gal Ben David
0 parents, commit 82cb317

12 files changed, +2164 -0 lines changed

.clang-format

+63
@@ -0,0 +1,63 @@
---
BasedOnStyle: LLVM
AccessModifierOffset: '0'
AlignConsecutiveMacros: 'false'
AlignConsecutiveAssignments: 'false'
AlignConsecutiveDeclarations: 'false'
AlignEscapedNewlines: Left
AlignTrailingComments: 'true'
AllowAllArgumentsOnNextLine: 'true'
AllowAllConstructorInitializersOnNextLine: 'false'
AllowAllParametersOfDeclarationOnNextLine: 'true'
AllowShortBlocksOnASingleLine: 'false'
AllowShortCaseLabelsOnASingleLine: 'false'
AllowShortFunctionsOnASingleLine: None
AllowShortIfStatementsOnASingleLine: Never
AllowShortLambdasOnASingleLine: None
AllowShortLoopsOnASingleLine: 'false'
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: 'false'
AlwaysBreakTemplateDeclarations: 'Yes'
BinPackArguments: 'false'
BinPackParameters: 'false'
BreakBeforeBraces: Attach
BreakBeforeTernaryOperators: 'false'
BreakConstructorInitializers: AfterColon
BreakInheritanceList: AfterColon
ColumnLimit: '0'
CompactNamespaces: 'false'
ConstructorInitializerAllOnOneLineOrOnePerLine: 'false'
Cpp11BracedListStyle: 'true'
FixNamespaceComments: 'false'
IncludeBlocks: Regroup
IndentCaseLabels: 'true'
IndentPPDirectives: BeforeHash
IndentWidth: '4'
IndentWrappedFunctionNames: 'false'
JavaScriptQuotes: Double
KeepEmptyLinesAtTheStartOfBlocks: 'false'
Language: Cpp
MaxEmptyLinesToKeep: '1'
NamespaceIndentation: All
PointerAlignment: Middle
SortIncludes: 'true'
SpaceAfterCStyleCast: 'false'
SpaceAfterLogicalNot: 'false'
SpaceAfterTemplateKeyword: 'false'
SpaceBeforeAssignmentOperators: 'true'
SpaceBeforeCpp11BracedList: 'false'
SpaceBeforeCtorInitializerColon: 'false'
SpaceBeforeInheritanceColon: 'false'
SpaceBeforeParens: Never
SpaceBeforeRangeBasedForLoopColon: 'false'
SpaceInEmptyParentheses: 'false'
SpacesInAngles: 'false'
SpacesInCStyleCastParentheses: 'false'
SpacesInContainerLiterals: 'false'
SpacesInParentheses: 'false'
SpacesInSquareBrackets: 'false'
Standard: Cpp11
TabWidth: '4'
UseTab: Never

...

.github/workflows/pythonpackage.yml

+37
@@ -0,0 +1,37 @@
name: Build
on:
  push:
    tags:
      - '*'

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      max-parallel: 4
      matrix:
        python-version: [3.7]

    steps:
    - uses: actions/checkout@v1
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v1
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install Ubuntu packages
      run: >-
        sudo apt install python3-dev g++-8;
        sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 800 --slave /usr/bin/g++ g++ /usr/bin/g++-8;
    - name: Build a binary wheel and a source tarball
      run: >-
        python3 -m pip install --user --upgrade setuptools pybind11;
        python3 setup.py sdist;
    - name: Test module
      run: >-
        python3 setup.py test
    - name: Publish distribution 📦 to PyPI
      if: github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags')
      uses: pypa/gh-action-pypi-publish@master
      with:
        password: ${{ secrets.pypi_password }}

.gitignore

+133
@@ -0,0 +1,133 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

*.cppimporthash
.rendered.*
.vscode

LICENSE

+21
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2019 Gal Ben David

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

MANIFEST.in

+4
@@ -0,0 +1,4 @@
include README.md
include images/logo.png
graft tests
recursive-include src *

README.md

+126
@@ -0,0 +1,126 @@
<p align="center">
    <a href="https://github.com/wavenator/PySubstringSearch">
        <img src="https://raw.githubusercontent.com/wavenator/PySubstringSearch/master/images/logo.png" alt="Logo">
    </a>
    <h3 align="center">
        Python library for fast substring/pattern search, written in C++ and leveraging the suffix array algorithm
    </h3>
</p>

![license](https://img.shields.io/badge/MIT-License-blue)
![Python](https://img.shields.io/badge/Python-3.6%20%7C%203.7%20%7C%203.8-blue)
![Build](https://github.com/wavenator/PySubstringSearch/workflows/Build/badge.svg)
[![PyPi](https://img.shields.io/pypi/v/PySubstringSearch.svg)](https://pypi.org/project/PySubstringSearch/)

## Table of Contents

- [Table of Contents](#table-of-contents)
- [About The Project](#about-the-project)
  - [Built With](#built-with)
  - [Performance](#performance)
    - [High number of results](#high-number-of-results)
    - [Low number of results](#low-number-of-results)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
- [Usage](#usage)
- [License](#license)
- [Contact](#contact)


## About The Project

PySubstringSearch is a library for searching an index file for substring patterns. It is written in C++ for speed and efficiency, and it uses the [Msufsort](https://github.com/michaelmaniscalco/msufsort) suffix array construction library for string indexing. The created index consists of the original text and a 32-bit suffix array. The library relies on a custom container format that holds the original text along with the index in chunks of 512 MB, to work around the size limitation of the suffix array construction implementation.
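
To illustrate the general idea of suffix array search, here is a minimal pure-Python sketch. It is not the library's implementation: the `SEPARATOR` character, the naive construction by sorting, and the helper names `build_index`, `_first_index`, and `search` are all assumptions made for this example, and the 512 MB chunking is not modeled.

```python
# Hypothetical delimiter; assumed never to occur in entries or patterns.
SEPARATOR = "\x00"


def build_index(entries):
    """Join the entries into one text and build a suffix array over it."""
    text = SEPARATOR.join(entries)
    # Naive construction by sorting suffixes: fine for a toy input, far too
    # slow for real data, which is the job of a dedicated construction library.
    suffix_array = sorted(range(len(text)), key=lambda start: text[start:])
    return text, suffix_array


def _first_index(text, suffix_array, pattern, strict):
    """Binary search for the first suffix whose prefix is >= pattern
    (or > pattern when strict is True)."""
    low, high = 0, len(suffix_array)
    while low < high:
        middle = (low + high) // 2
        start = suffix_array[middle]
        prefix = text[start:start + len(pattern)]
        if prefix < pattern or (strict and prefix == pattern):
            low = middle + 1
        else:
            high = middle
    return low


def search(text, suffix_array, pattern):
    """Return the entries that contain pattern, in insertion order."""
    first = _first_index(text, suffix_array, pattern, strict=False)
    last = _first_index(text, suffix_array, pattern, strict=True)
    entry_starts = set()
    for position in suffix_array[first:last]:
        # Walk back to the separator preceding the hit to find its entry.
        entry_starts.add(text.rfind(SEPARATOR, 0, position) + 1)
    results = []
    for start in sorted(entry_starts):
        end = text.find(SEPARATOR, start)
        results.append(text[start:end if end != -1 else len(text)])
    return results


text, suffix_array = build_index([
    "some short string",
    "another but now a longer string",
    "more text to add",
])
print(search(text, suffix_array, "short"))
print(search(text, suffix_array, "string"))
```

Because the suffixes are sorted, every suffix that starts with the pattern sits in one contiguous range, so two binary searches locate all hits without scanning the text.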


### Built With

* [Msufsort](https://github.com/michaelmaniscalco/msufsort)


### Performance

The benchmarks below were measured on a file containing 500 MB of text.

#### High number of results
| Library | Function | Time | #Results | Improvement Factor |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| [ripgrepy](https://pypi.org/project/ripgrepy/) | Ripgrepy('text', '500mb').run().as_string | 82.1 ms ± 1.15 ms per loop | 10737 | 1.0x |
| [PySubstringSearch](https://github.com/wavenator/PySubstringSearch) | reader.search('text') | 2.31 ms ± 142 µs per loop | 10737 | 35.5x |

#### Low number of results
| Library | Function | Time | #Results | Improvement Factor |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| [ripgrepy](https://pypi.org/project/ripgrepy/) | Ripgrepy('text', '500mb').run().as_string | 101 ms ± 526 µs per loop | 251 | 1.0x |
| [PySubstringSearch](https://github.com/wavenator/PySubstringSearch) | reader.search('text') | 55.9 µs ± 464 ns per loop | 251 | 1803.0x |
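
The improvement factor in the tables above reads as the baseline time divided by the PySubstringSearch time. The short sketch below redoes that arithmetic from the rounded timings shown; the helper name `improvement_factor` is an assumption made for this example. With these rounded inputs the second ratio comes out slightly above the 1803.0x in the table, presumably because the reported figure was computed from unrounded measurements.

```python
def improvement_factor(baseline_seconds, candidate_seconds):
    """Ratio of the baseline time to the candidate time (higher is faster)."""
    return baseline_seconds / candidate_seconds


# Mean timings copied from the tables above, converted to seconds.
high_results = improvement_factor(82.1e-3, 2.31e-3)
low_results = improvement_factor(101e-3, 55.9e-6)

print(f"High number of results: {high_results:.1f}x")  # 35.5x
print(f"Low number of results: {low_results:.1f}x")    # 1806.8x
```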

### Prerequisites

To compile this package, you need GCC and the Python development package installed.
* Fedora
```sh
sudo dnf install python3-devel gcc-c++
```
* Ubuntu 18.04
```sh
sudo apt install python3-dev g++-8
```

### Installation

```sh
pip3 install PySubstringSearch
```



## Usage

Create an index
```python
import pysubstringsearch

# create a new index file
# if a file with this name already exists, it will be overwritten
writer = pysubstringsearch.Writer(
    index_file_path='output.idx',
)

# add entries to the new index
writer.add_entry('some short string')
writer.add_entry('another but now a longer string')
writer.add_entry('more text to add')

# make sure the data is dumped to the file
writer.finalize()
```

Search for a substring within an index
```python
import pysubstringsearch

# open an index file for searching
reader = pysubstringsearch.Reader(
    index_file_path='output.idx',
)

# look up a substring
reader.search('short')
# ['some short string']

# look up a substring
reader.search('string')
# ['some short string', 'another but now a longer string']
```



## License

Distributed under the MIT License. See `LICENSE` for more information.


## Contact

Gal Ben David - [email protected]

Project Link: [https://github.com/wavenator/PySubstringSearch](https://github.com/wavenator/PySubstringSearch)

images/logo.png

59.2 KB