================================================================================
- pyexcel-pdfr - Let you focus on data, instead of file formats
+ pyexcel-pdfr - Let you focus on data, instead of pdf format
================================================================================

.. image:: https://raw.githubusercontent.com/pyexcel/pyexcel.github.io/master/images/patreon.png
@@ -17,6 +17,30 @@ pyexcel-pdfr - Let you focus on data, instead of file formats
.. image:: https://readthedocs.org/projects/pyexcel-pdfr/badge/?version=latest
   :target: http://pyexcel-pdfr.readthedocs.org/en/latest/

+
+ Known constraints
+ ==================
+
+ Fonts, colors and charts are not supported.
+
+ Installation
+ ================================================================================
+
+ You can install it via pip:
+
+ .. code-block:: bash
+
+     $ pip install pyexcel-pdfr
+
+
+ or clone it and install it:
+
+ .. code-block:: bash
+
+     $ git clone https://github.com/pyexcel/pyexcel-pdfr.git
+     $ cd pyexcel-pdfr
+     $ python setup.py install
+
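After either installation route, it can help to confirm that the package is visible to the interpreter. The following is a minimal sketch using only the standard library; the import name ``pyexcel_pdfr`` is an assumption based on the package name, and ``is_installed`` is a hypothetical helper, not part of the library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "pyexcel_pdfr" is the assumed import name for the pyexcel-pdfr package.
    print(is_installed("pyexcel_pdfr"))
```

The check only looks the module up; it does not import it, so it has no side effects.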
Support the project
================================================================================
@@ -32,35 +56,183 @@ With your financial support, I will be able to invest
a little bit more time in coding, documentation and writing interesting posts.


-
- Introduction
+ Usage
================================================================================
- **pyexcel-pdfr** does Read tables in pdf files as tabular data.

+ As a standalone library
+ --------------------------------------------------------------------------------

+ .. testcode::
+     :hide:

- Installation
- ================================================================================
- You can install it via pip:
+     >>> import os
+     >>> import sys
+     >>> if sys.version_info[0] < 3:
+     ...     from StringIO import StringIO
+     ... else:
+     ...     from io import BytesIO as StringIO
+     >>> PY2 = sys.version_info[0] == 2
+     >>> if PY2 and sys.version_info[1] < 7:
+     ...     from ordereddict import OrderedDict
+     ... else:
+     ...     from collections import OrderedDict

- .. code-block:: bash

-     $ pip install pyexcel-pdfr
+ Read from a pdf file
+ ********************************************************************************

+ Here's the sample code:

- or clone it and install it:
+ .. code-block:: python

- .. code-block:: bash
+     >>> from pyexcel_pdfr import get_data
+     >>> data = get_data("your_file.pdf")
+     >>> import json
+     >>> print(json.dumps(data))
+     {"Sheet 1": [[1, 2, 3], [4, 5, 6]], "Sheet 2": [["row 1", "row 2", "row 3"]]}

-     $ git clone https://github.com/pyexcel/pyexcel-pdfr.git
-     $ cd pyexcel-pdfr
-     $ python setup.py install



- Development guide
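The output in the file-reading example shows that ``get_data`` returns a mapping from sheet name to a list of rows. The sketch below walks such a mapping; the data is copied from that sample output rather than read from a real pdf, and ``summarize`` is a hypothetical helper, not part of the library.

```python
from collections import OrderedDict

# Sample data copied from the output shown above; in real use this would be
# the return value of get_data("your_file.pdf").
data = OrderedDict([
    ("Sheet 1", [[1, 2, 3], [4, 5, 6]]),
    ("Sheet 2", [["row 1", "row 2", "row 3"]]),
])


def summarize(book):
    """Map each sheet name to (row count, length of the widest row)."""
    return {
        name: (len(rows), max((len(row) for row in rows), default=0))
        for name, rows in book.items()
    }


summary = summarize(data)
for name, (n_rows, n_cols) in summary.items():
    print(f"{name}: {n_rows} row(s), up to {n_cols} column(s)")
```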
+ Read from a pdf in memory
+ ********************************************************************************
+
+ Continuing from the previous example:
+
+ .. code-block:: python
+
+     >>> # This is just an illustration
+     >>> # In reality, you might deal with pdf file upload
+     >>> # where you will read from requests.FILES['YOUR_PDF_FILE']
+     >>> data = get_data(io)
+     >>> print(json.dumps(data))
+     {"Sheet 1": [[1, 2, 3], [4, 5, 6]], "Sheet 2": [[7, 8, 9], [10, 11, 12]]}
+
+
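The in-memory example passes an ``io`` object without showing how it is built. One plausible way to obtain such an object from uploaded bytes is ``io.BytesIO`` from the standard library, sketched below. The bytes are a placeholder, not a valid pdf, and whether ``get_data`` accepts the stream without further arguments is an assumption, so that call is left commented out.

```python
from io import BytesIO


def to_stream(content: bytes) -> BytesIO:
    """Wrap raw bytes in a seekable, file-like object."""
    return BytesIO(content)


# Placeholder bytes standing in for an uploaded pdf; not a valid document.
fake_upload = b"%PDF-1.4 placeholder"
stream = to_stream(fake_upload)

# A reader can consume the stream like an open binary file.
header = stream.read(5)
stream.seek(0)  # rewind so the next reader starts from the beginning

# Assumption: get_data accepts such a stream, as the example above suggests.
# data = get_data(stream)
```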
+ Pagination feature
+ ********************************************************************************
+
+
+
+ Let's assume the following file is a huge pdf file:
+
+ .. code-block:: python
+
+     >>> huge_data = [
+     ...     [1, 21, 31],
+     ...     [2, 22, 32],
+     ...     [3, 23, 33],
+     ...     [4, 24, 34],
+     ...     [5, 25, 35],
+     ...     [6, 26, 36]
+     ... ]
+     >>> sheetx = {
+     ...     "huge": huge_data
+     ... }
+     >>> save_data("huge_file.pdf", sheetx)
+
+ And let's pretend to read partial data:
+
+ .. code-block:: python
+
+     >>> partial_data = get_data("huge_file.pdf", start_row=2, row_limit=3)
+     >>> print(json.dumps(partial_data))
+     {"huge": [[3, 23, 33], [4, 24, 34], [5, 25, 35]]}
+
+ You could do the same for columns:
+
+ .. code-block:: python
+
+     >>> partial_data = get_data("huge_file.pdf", start_column=1, column_limit=2)
+     >>> print(json.dumps(partial_data))
+     {"huge": [[21, 31], [22, 32], [23, 33], [24, 34], [25, 35], [26, 36]]}
+
+ Obviously, you could do both at the same time:
+
+ .. code-block:: python
+
+     >>> partial_data = get_data("huge_file.pdf",
+     ...     start_row=2, row_limit=3,
+     ...     start_column=1, column_limit=2)
+     >>> print(json.dumps(partial_data))
+     {"huge": [[23, 33], [24, 34], [25, 35]]}
+
+ .. testcode::
+     :hide:
+
+     >>> os.unlink("huge_file.pdf")
+
+
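The effect of the four pagination keywords in the examples above can be reproduced with plain list slicing. The sketch below is only an illustration of the windowing they select, not the library's implementation; ``paginate`` is a hypothetical helper, and treating ``-1`` as "no limit" is an assumption of this sketch.

```python
def paginate(rows, start_row=0, row_limit=-1, start_column=0, column_limit=-1):
    """Return the window of `rows` selected by the four keyword arguments.

    A limit of -1 means "no limit" (an assumption of this sketch).
    """
    row_end = None if row_limit < 0 else start_row + row_limit
    col_end = None if column_limit < 0 else start_column + column_limit
    return [row[start_column:col_end] for row in rows[start_row:row_end]]


huge_data = [
    [1, 21, 31],
    [2, 22, 32],
    [3, 23, 33],
    [4, 24, 34],
    [5, 25, 35],
    [6, 26, 36],
]

rows_only = paginate(huge_data, start_row=2, row_limit=3)
cols_only = paginate(huge_data, start_column=1, column_limit=2)
both = paginate(huge_data, start_row=2, row_limit=3,
                start_column=1, column_limit=2)
```

Row and column indices are zero-based here, which matches the outputs shown in the examples: ``start_row=2`` begins at the third row.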
+ As a pyexcel plugin
+ --------------------------------------------------------------------------------
+
+ Since pyexcel version 0.2.2, an explicit import is no longer needed. Instead,
+ this library is auto-loaded. So if you want to read data in pdf format,
+ installing it is enough.
+
+
+ Reading from a pdf file
+ ********************************************************************************
+
+ Here is the sample code:
+
+ .. code-block:: python
+
+     >>> import pyexcel as pe
+     >>> sheet = pe.get_book(file_name="your_file.pdf")
+     >>> sheet
+     Sheet 1:
+     +---+---+---+
+     | 1 | 2 | 3 |
+     +---+---+---+
+     | 4 | 5 | 6 |
+     +---+---+---+
+     Sheet 2:
+     +-------+-------+-------+
+     | row 1 | row 2 | row 3 |
+     +-------+-------+-------+
+
+
+
+
+ Reading from an IO instance
+ ********************************************************************************
+
+ You have to wrap the binary content in a stream to get pdf working:
+
+ .. code-block:: python
+
+     >>> # This is just an illustration
+     >>> # In reality, you might deal with pdf file upload
+     >>> # where you will read from requests.FILES['YOUR_PDF_FILE']
+     >>> pdffile = "another_file.pdf"
+     >>> with open(pdffile, "rb") as f:
+     ...     content = f.read()
+     ...     r = pe.get_book(file_type="pdf", file_content=content)
+     ...     print(r)
+     ...
+     Sheet 1:
+     +---+---+---+
+     | 1 | 2 | 3 |
+     +---+---+---+
+     | 4 | 5 | 6 |
+     +---+---+---+
+     Sheet 2:
+     +-------+-------+-------+
+     | row 1 | row 2 | row 3 |
+     +-------+-------+-------+
+
+
+
+
+ License
================================================================================

+ New BSD License
+
+ Developer guide
+ ==================
+
Development steps for code changes

#. git clone https://github.com/pyexcel/pyexcel-pdfr.git
@@ -132,8 +304,9 @@ Acceptance criteria
#. Agree on NEW BSD License for your contribution


+ .. testcode::
+     :hide:

- License
- ================================================================================
-
- New BSD License
+     >>> import os
+     >>> os.unlink("your_file.pdf")
+     >>> os.unlink("another_file.pdf")