Commit 1d8828c

update dependency
1 parent d549fdf commit 1d8828c

File tree

5 files changed

+213
-26
lines changed

.moban.d/README.rst

Lines changed: 14 additions & 5 deletions
@@ -1,8 +1,17 @@
1-
{% extends "BASIC-README.rst.jj2" %}
2-
3-
{%block constraint%}
4-
{%endblock%}
1+
{% extends "README.rst.jj2" %}
52

63
{%block features %}
7-
**{{name}}** does {{description}}.
4+
{%include "feature.rst"%}
85
{%endblock%}
6+
7+
{% block write_to_file %}
8+
{% endblock %}
9+
10+
{% block write_to_memory %}
11+
{% endblock %}
12+
13+
{% block pyexcel_write_to_file%}
14+
{% endblock %}
15+
16+
{% block pyexcel_write_to_memory%}
17+
{% endblock %}

README.rst

Lines changed: 192 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
================================================================================
2-
pyexcel-pdfr - Let you focus on data, instead of file formats
2+
pyexcel-pdfr - Let you focus on data, instead of pdf format
33
================================================================================
44

55
.. image:: https://raw.githubusercontent.com/pyexcel/pyexcel.github.io/master/images/patreon.png
@@ -17,6 +17,30 @@ pyexcel-pdfr - Let you focus on data, instead of file formats
1717
.. image:: https://readthedocs.org/projects/pyexcel-pdfr/badge/?version=latest
1818
:target: http://pyexcel-pdfr.readthedocs.org/en/latest/
1919

20+
21+
Known constraints
22+
==================
23+
24+
Fonts, colors and charts are not supported.
25+
26+
Installation
27+
================================================================================
28+
29+
You can install it via pip:
30+
31+
.. code-block:: bash
32+
33+
$ pip install pyexcel-pdfr
34+
35+
36+
or clone it and install it:
37+
38+
.. code-block:: bash
39+
40+
$ git clone https://github.com/pyexcel/pyexcel-pdfr.git
41+
$ cd pyexcel-pdfr
42+
$ python setup.py install
43+
2044
Support the project
2145
================================================================================
2246

@@ -32,35 +56,183 @@ With your financial support, I will be able to invest
3256
a little bit more time in coding, documentation and writing interesting posts.
3357

3458

35-
36-
Introduction
59+
Usage
3760
================================================================================
38-
**pyexcel-pdfr** does Read tables in pdf files as tabular data.
3961

62+
As a standalone library
63+
--------------------------------------------------------------------------------
4064

65+
.. testcode::
66+
:hide:
4167

42-
Installation
43-
================================================================================
44-
You can install it via pip:
68+
>>> import os
69+
>>> import sys
70+
>>> if sys.version_info[0] < 3:
71+
...     from StringIO import StringIO
72+
... else:
73+
...     from io import BytesIO as StringIO
74+
>>> PY2 = sys.version_info[0] == 2
75+
>>> if PY2 and sys.version_info[1] < 7:
76+
...     from ordereddict import OrderedDict
77+
... else:
78+
...     from collections import OrderedDict
4579

46-
.. code-block:: bash
4780

48-
$ pip install pyexcel-pdfr
81+
Read from a pdf file
82+
********************************************************************************
4983

84+
Here's the sample code:
5085

51-
or clone it and install it:
86+
.. code-block:: python
5287
53-
.. code-block:: bash
88+
>>> from pyexcel_pdfr import get_data
89+
>>> data = get_data("your_file.pdf")
90+
>>> import json
91+
>>> print(json.dumps(data))
92+
{"Sheet 1": [[1, 2, 3], [4, 5, 6]], "Sheet 2": [["row 1", "row 2", "row 3"]]}
5493
55-
$ git clone https://github.com/pyexcel/pyexcel-pdfr.git
56-
$ cd pyexcel-pdfr
57-
$ python setup.py install
5894
5995
6096
61-
Development guide
97+
Read a pdf from memory
98+
********************************************************************************
99+
100+
Continuing from the previous example:
101+
102+
.. code-block:: python
103+
104+
>>> # This is just an illustration
105+
>>> # In reality, you might deal with pdf file upload
106+
>>> # where you will read from requests.FILES['YOUR_PDF_FILE']
107+
>>> data = get_data(io)
108+
>>> print(json.dumps(data))
109+
{"Sheet 1": [[1, 2, 3], [4, 5, 6]], "Sheet 2": [[7, 8, 9], [10, 11, 12]]}
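The ``io`` name passed to ``get_data`` above is not defined anywhere in the visible lines; it is presumably a binary file-like object holding the pdf bytes. A minimal sketch of building such a stream with the standard library follows. The byte string is a placeholder rather than a real pdf, and whether ``get_data`` accepts the stream in exactly this form is an assumption that this diff does not confirm.

```python
from io import BytesIO

# Placeholder bytes standing in for an uploaded pdf (hypothetical content,
# not a valid pdf document).
content = b"%PDF-1.4 placeholder"

# Wrap the raw bytes in a binary, file-like stream.
stream = BytesIO(content)

# Peek at the first four bytes, then rewind so that a later reader
# starts from the beginning of the stream.
header = stream.read(4)
stream.seek(0)
```

Rewinding matters because a stream that has already been read to the end yields no data to the next consumer.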
110+
111+
112+
Pagination feature
113+
********************************************************************************
114+
115+
116+
117+
Let's assume the following file is a huge pdf file:
118+
119+
.. code-block:: python
120+
121+
>>> huge_data = [
122+
... [1, 21, 31],
123+
... [2, 22, 32],
124+
... [3, 23, 33],
125+
... [4, 24, 34],
126+
... [5, 25, 35],
127+
... [6, 26, 36]
128+
... ]
129+
>>> sheetx = {
130+
... "huge": huge_data
131+
... }
132+
>>> save_data("huge_file.pdf", sheetx)
133+
134+
And let's pretend to read partial data:
135+
136+
.. code-block:: python
137+
138+
>>> partial_data = get_data("huge_file.pdf", start_row=2, row_limit=3)
139+
>>> print(json.dumps(partial_data))
140+
{"huge": [[3, 23, 33], [4, 24, 34], [5, 25, 35]]}
141+
142+
You can do the same for columns:
143+
144+
.. code-block:: python
145+
146+
>>> partial_data = get_data("huge_file.pdf", start_column=1, column_limit=2)
147+
>>> print(json.dumps(partial_data))
148+
{"huge": [[21, 31], [22, 32], [23, 33], [24, 34], [25, 35], [26, 36]]}
149+
150+
Obviously, you can do both at the same time:
151+
152+
.. code-block:: python
153+
154+
>>> partial_data = get_data("huge_file.pdf",
155+
... start_row=2, row_limit=3,
156+
... start_column=1, column_limit=2)
157+
>>> print(json.dumps(partial_data))
158+
{"huge": [[23, 33], [24, 34], [25, 35]]}
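The outputs above suggest that ``start_row`` and ``start_column`` are zero-based offsets and that the two limits count how many rows or columns to keep. A plain-Python sketch of that windowing is shown here. ``paginate`` is a hypothetical helper and not part of pyexcel-pdfr, and treating a negative limit as "no limit" is an assumption of this sketch.

```python
def paginate(rows, start_row=0, row_limit=-1, start_column=0, column_limit=-1):
    """Return the window of `rows` selected by zero-based starts and limits.

    A negative limit means "no limit" (an assumption of this sketch).
    """
    row_end = None if row_limit < 0 else start_row + row_limit
    col_end = None if column_limit < 0 else start_column + column_limit
    # Slice the rows first, then slice each remaining row by column.
    return [row[start_column:col_end] for row in rows[start_row:row_end]]


huge_data = [
    [1, 21, 31],
    [2, 22, 32],
    [3, 23, 33],
    [4, 24, 34],
    [5, 25, 35],
    [6, 26, 36],
]

rows_only = paginate(huge_data, start_row=2, row_limit=3)
columns_only = paginate(huge_data, start_column=1, column_limit=2)
both = paginate(huge_data, start_row=2, row_limit=3,
                start_column=1, column_limit=2)
```

With the sample data, the three results match the three outputs printed in the examples above.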
159+
160+
.. testcode::
161+
:hide:
162+
163+
>>> os.unlink("huge_file.pdf")
164+
165+
166+
As a pyexcel plugin
167+
--------------------------------------------------------------------------------
168+
169+
Explicit import is no longer needed since pyexcel version 0.2.2. Instead,
170+
this library is auto-loaded. So if you want to read data in pdf format,
171+
installing it is enough.
172+
173+
174+
Reading from a pdf file
175+
********************************************************************************
176+
177+
Here is the sample code:
178+
179+
.. code-block:: python
180+
181+
>>> import pyexcel as pe
182+
>>> book = pe.get_book(file_name="your_file.pdf")
183+
>>> book
184+
Sheet 1:
185+
+---+---+---+
186+
| 1 | 2 | 3 |
187+
+---+---+---+
188+
| 4 | 5 | 6 |
189+
+---+---+---+
190+
Sheet 2:
191+
+-------+-------+-------+
192+
| row 1 | row 2 | row 3 |
193+
+-------+-------+-------+
194+
195+
196+
197+
198+
Reading from an IO instance
199+
********************************************************************************
200+
201+
You need to wrap the binary content with a stream to get pdf working:
202+
203+
.. code-block:: python
204+
205+
>>> # This is just an illustration
206+
>>> # In reality, you might deal with pdf file upload
207+
>>> # where you will read from requests.FILES['YOUR_PDF_FILE']
208+
>>> pdffile = "another_file.pdf"
209+
>>> with open(pdffile, "rb") as f:
210+
...     content = f.read()
211+
...     r = pe.get_book(file_type="pdf", file_content=content)
212+
...     print(r)
213+
...
214+
Sheet 1:
215+
+---+---+---+
216+
| 1 | 2 | 3 |
217+
+---+---+---+
218+
| 4 | 5 | 6 |
219+
+---+---+---+
220+
Sheet 2:
221+
+-------+-------+-------+
222+
| row 1 | row 2 | row 3 |
223+
+-------+-------+-------+
224+
225+
226+
227+
228+
License
62229
================================================================================
63230

231+
New BSD License
232+
233+
Developer guide
234+
==================
235+
64236
Development steps for code changes
65237

66238
#. git clone https://github.com/pyexcel/pyexcel-pdfr.git
@@ -132,8 +304,9 @@ Acceptance criteria
132304
#. Agree on NEW BSD License for your contribution
133305

134306

307+
.. testcode::
308+
:hide:
135309

136-
License
137-
================================================================================
138-
139-
New BSD License
310+
>>> import os
311+
>>> os.unlink("your_file.pdf")
312+
>>> os.unlink("another_file.pdf")

pyexcel-pdfr.yml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,5 +4,7 @@ nick_name: "pdf"
44
version: "0.0.1"
55
current_version: "0.0.1"
66
release: "0.0.1"
7-
dependencies: []
7+
file_type: "pdf"
8+
dependencies:
9+
- pdftables
810
description: "Read tables in pdf files as tabular data"

requirements.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
pdftables

setup.py

Lines changed: 3 additions & 1 deletion
@@ -40,11 +40,13 @@
4040
]
4141

4242
INSTALL_REQUIRES = [
43+
'pdftables',
4344
]
4445

4546

4647
PACKAGES = find_packages(exclude=['ez_setup', 'examples', 'tests'])
47-
EXTRAS_REQUIRE = {}
48+
EXTRAS_REQUIRE = {
49+
}
4850

4951

5052
def read_files(*files):
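
This commit declares ``pdftables`` in three places: ``pyexcel-pdfr.yml``, ``requirements.txt`` and ``INSTALL_REQUIRES`` in ``setup.py``. A minimal sketch of a check that the last two stay in sync is given here. ``parse_requirements`` is a hypothetical helper that is not part of the repository, and the values are copied from the diff rather than read from disk.

```python
def parse_requirements(text):
    """Return requirement entries from requirements.txt-style text.

    Blank lines and '#' comments are skipped. This is a simplified parser
    for illustration; it does not handle pip options or environment markers.
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            names.append(line)
    return names


# Values copied from the diff above.
requirements_txt = "pdftables\n"
INSTALL_REQUIRES = ["pdftables"]

in_sync = parse_requirements(requirements_txt) == INSTALL_REQUIRES
```

A check of this kind could run in a test suite to flag a dependency added to one file but not the other.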
