Commit b458e51

author
Pedro
committed
github impoort
0 parents  commit b458e51

8 files changed

+305
-0
lines changed

LICENSE

+22
@@ -0,0 +1,22 @@
+The MIT License (MIT)
+
+Copyright Pedro (c) 2014
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+

README.md

+55
@@ -0,0 +1,55 @@
+# tq
+
+tq is a command line utility that performs an HTML element selection on HTML content passed to stdin, using the CSS selectors that everybody knows.
+
+Since input comes from stdin and output is sent to stdout, it can easily be used inside traditional UNIX pipelines to extract content from webpages and HTML files.
+
+tq provides extra formatting options such as JSON encoding or newline squashing, so it can play nicely with everyone's favourite command line tooling.
+
+
+## Installation
+
+sudo pip install https://github.com/plainas/tq/zipball/stable
+
+
+## Example usage
+
+Get headlines from Hacker News
+
+curl https://news.ycombinator.com/news | tq -tj ".title a"
+
+Get the title of an HTML document stored in a file
+
+cat mydocument.html | tq -t title
+
+Get all the images from a webpage
+
+TODO: add this example
+
+
+Note that tq doesn't make HTTP requests or read files itself. You can use your favorite HTTP client, or provide the HTML from any source you want.
+
+For a modern, user-friendly HTTP client, check out httpie. Or you can just use curl, wget, netcat, etc.
+
+## Command options
+
+* `selector`
+A css selector
+
+* `-t, --text`
+Outputs only the inner text of the selected elements.
+
+* `-q, --squash`
+Squash lines.
+
+* `-s, --squash-space`
+Squash spaces.
+
+* `-j, --json-lines`
+JSON encode each match.
+
+* `-J, --json`
+Output as json array of strings.
+
+* `-v, --version`
+Prints tq version
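
The formatting options listed above are described only briefly ("Squash lines.", "Squash spaces."), so the following is a minimal sketch of one plausible reading, using only the Python standard library. The helper names (`squash_lines`, `squash_spaces`, `as_json_lines`, `as_json_array`) are hypothetical and not part of tq, and the sample strings are made up for illustration.

```python
import json


def squash_lines(text):
    """Replace newlines with spaces and drop carriage returns."""
    return text.replace("\n", " ").replace("\r", "")


def squash_spaces(text):
    """Collapse each run of spaces into one space (also trims the ends)."""
    return " ".join(part for part in text.split(" ") if part)


def as_json_lines(matches):
    """Assumed shape of -j: one JSON-encoded string per match, one per line."""
    return "\n".join(json.dumps(match) for match in matches)


def as_json_array(matches):
    """Assumed shape of -J: all matches as a single JSON array of strings."""
    return json.dumps(matches, indent=1)


# Invented sample data standing in for the text of two selected elements.
matches = ["Hello\n   world", 'say "hi"']
cleaned = [squash_spaces(squash_lines(match)) for match in matches]

print(as_json_lines(cleaned))
print(as_json_array(cleaned))
```

Under this reading, `-j` keeps each match on its own line, which suits line-oriented tools, while `-J` produces a single document that a JSON parser can load in one call.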

bin/tq

+5
@@ -0,0 +1,5 @@
+#!/usr/bin/python3
+import tq
+
+if __name__ == '__main__':
+    tq.main()

doc/compile_manpage.sh

+8
@@ -0,0 +1,8 @@
+#!/bin/sh
+
+# ronn is used to turn the markdown into a manpage.
+# Get ronn at https://github.com/rtomayko/ronn
+# Alternately, since ronn is a Ruby gem, you can just
+# `gem install ronn`
+
+ronn --roff tq.1.md

doc/tq.1

+55
@@ -0,0 +1,55 @@
+.\" generated with Ronn/v0.7.3
+.\" http://github.com/rtomayko/ronn/tree/0.7.3
+.
+.TH "TQ" "1" "September 2015" "" ""
+.
+.SH "NAME"
+\fBtq\fR \- Terminal based HTML query tool
+.
+.SH "SYNOPSIS"
+cat file\.html | \fBtq\fR [\fIoptions\fR] SELECTOR
+.
+.SH "DESCRIPTION"
+Perform a css query with SELECTOR on an html document passed to the standard input\.
+.
+.SH "OPTIONS"
+.
+.IP "\(bu" 4
+\fBselector\fR A css selector
+.
+.IP "\(bu" 4
+\fB\-t, \-\-text\fR Outputs only the inner text of the selected elements\.
+.
+.IP "\(bu" 4
+\fB\-q, \-\-squash\fR Squash lines\.
+.
+.IP "\(bu" 4
+\fB\-s, \-\-squash\-space\fR Squash spaces\.
+.
+.IP "\(bu" 4
+\fB\-j, \-\-json\-lines\fR JSON encode each match\.
+.
+.IP "\(bu" 4
+\fB\-J, \-\-json\fR Output as json array of strings\.
+.
+.IP "\(bu" 4
+\fB\-v, \-\-version\fR Prints tq version
+.
+.IP "" 0
+.
+.SH "EXAMPLES"
+.
+.SS "Get headlines from hacker news"
+curl https://news\.ycombinator\.com/news | tq \-tj "\.title a"
+.
+.SS "Download a gallery of nice forest pictures from flickr"
+curl \-s \'https://www\.flickr\.com/photos/tgerus/galleries/72157622468645106/\' | tq "\.gallery\-photos img"
+.
+.SH "AUTHORS"
+\fBtq\fR was written by Pedro \fIpedro@example\.com\fR\.
+.
+.SH "DISTRIBUTION"
+The latest version of tq may be downloaded from https://github\.com/plainas/tq
+.
+.SH "SEE ALSO"
+curl(1), wget(1), jq(1)

doc/tq.1.md

+57
@@ -0,0 +1,57 @@
+tq(1) -- Terminal based HTML query tool
+=============================================
+
+## SYNOPSIS
+
+cat file.html | `tq` [<options>] SELECTOR
+
+## DESCRIPTION
+
+Perform a css query with SELECTOR on an html document passed to the standard input.
+
+## OPTIONS
+
+* `selector`
+A css selector
+
+* `-t, --text`
+Outputs only the inner text of the selected elements.
+
+* `-q, --squash`
+Squash lines.
+
+* `-s, --squash-space`
+Squash spaces.
+
+* `-j, --json-lines`
+JSON encode each match.
+
+* `-J, --json`
+Output as json array of strings.
+
+* `-v, --version`
+Prints tq version
+
+
+## EXAMPLES
+
+
+### Get headlines from hacker news
+
+curl https://news.ycombinator.com/news | tq -tj ".title a"
+
+### Download a gallery of nice forest pictures from flickr
+
+curl -s 'https://www.flickr.com/photos/tgerus/galleries/72157622468645106/' | tq ".gallery-photos img"
+
+
+## AUTHORS
+
+`tq` was written by Pedro <pedro@example.com>.
+
+## DISTRIBUTION
+The latest version of tq may be downloaded from https://github.com/plainas/tq
+
+## SEE ALSO
+
+curl(1), wget(1), jq(1)
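
The flickr example under EXAMPLES selects `img` elements but stops short of extracting their URLs. As a hedged sketch of one possible next step, the snippet below reads the one-JSON-string-per-line output that `-j` is documented to produce and collects each `src` attribute with the standard library's `html.parser`. `ImgSrcCollector` and `image_sources` are hypothetical names, and the sample input is invented rather than taken from a real tq run.

```python
import json
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also end up here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(json_lines):
    """Decode JSON-encoded HTML fragments (one per line) and return img URLs."""
    collector = ImgSrcCollector()
    for line in json_lines:
        line = line.strip()
        if line:
            collector.feed(json.loads(line))
    collector.close()
    return collector.sources


# Invented stand-in for the lines a `tq -j "img"` call might print.
sample = [
    json.dumps('<img src="a.jpg"/>'),
    json.dumps('<img alt="forest" src="b.png">'),
]
print(image_sources(sample))
```

In a pipeline, `image_sources(sys.stdin)` could consume the lines directly, and the resulting URLs could then be handed to curl or wget.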

setup.py

+15
@@ -0,0 +1,15 @@
+#!/usr/bin/python3
+
+from setuptools import setup  # distutils ignores install_requires
+
+setup(
+    name='tq',
+    version='0.1',
+    description='command line css selector',
+    author='Pedro',
+    author_email='pedro@example.com',
+    url='https://github.com/plainas/tq',
+    packages=['tq'],
+    scripts=['bin/tq'],
+    install_requires=["beautifulsoup4==4.4.0"]
+)

tq/__init__.py

+88
@@ -0,0 +1,88 @@
+"""
+Test non unicode input with:
+curl https://www.flashback.org/| ./tq.py -Jt ".td_forum"
+
+Test unicode input
+curl https://news.ycombinator.com/news| ./tq.py -Jt ".title a"
+
+curl https://www.flashback.org/t2494391| ./tq.py -j ".post_message"
+
+"""
+
+#TODO: use add_mutually_exclusive_group()
+#TODO: help2man proved inefficient. Copy and import this script and use it. Instructions are in the source, pretty straightforward
+
+#https://github.com/pwman3/pwman3/blob/d718a01fa8038893e42416b59cdfcda3935fe878/build_manpage.py
+
+
+import sys
+from bs4 import BeautifulSoup
+import argparse
+import json
+import codecs
+import io
+
+version = "0.0.1"
+
+#parser = argparse.ArgumentParser(description="Performs a css selection on an HTML document.", prog= "TQ", usage='curl url | tq [options]')
+parser = argparse.ArgumentParser(description="Performs a css selection on an HTML document.", prog= "tq")
+parser.add_argument("selector", nargs="?", help="A css selector")  # optional so that -v works without a selector
+parser.add_argument("-t", "--text", action="store_true", help="Outputs only the inner text of the selected elements.")
+parser.add_argument("-q", "--squash", action="store_true", help="Squash lines.")
+parser.add_argument("-s", "--squash-space", action="store_true", help="Squash spaces.")
+parser.add_argument("-j", "--json-lines", action="store_true", help="JSON encode each match.")
+parser.add_argument("-J", "--json", action="store_true", help="Output as json array of strings.")
+parser.add_argument("-v", "--version", action="store_true", help="Prints tq version")
+
+args = parser.parse_args()
+
+
+def get_parser(formatter_class=argparse.HelpFormatter):
+    """
+    this is here just to be picked up by build_manpage
+    """
+    return parser
+
+
+
+def main():
+
+    if args.version:
+        print(version)
+        sys.exit()
+
+    if not args.selector:
+        sys.exit("ERROR! No selector")
+
+    if args.json and args.json_lines:
+        sys.exit("ERROR! --json and --json-lines options cannot be used simultaneously")
+
+
+    def get_els(css_selector):
+        #input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8', errors='ignore')
+        input_stream = io.TextIOWrapper(sys.stdin.buffer, errors='ignore')
+        soup = BeautifulSoup(input_stream, "html.parser")
+        return soup.select(css_selector)
+
+
+    selected_els = get_els(args.selector)
+
+    if args.text:
+        selected_els = [el.get_text() for el in selected_els]
+
+    if args.squash:
+        selected_els = [str(el).replace('\n', ' ').replace('\r', '') for el in selected_els]  # str() also covers tags when -t is not given
+
+    if args.squash_space:
+        selected_els = [' '.join(part for part in str(el).split(' ') if part) for el in selected_els]  # drop empty parts so runs of spaces collapse
+
+    if args.json_lines:
+        selected_els = [json.dumps(str(el_text)) for el_text in selected_els]
+
+
+    if args.json:
+        sys.stdout.write(json.dumps([str(el_text) for el_text in selected_els], indent=1))  # encode once, as an array of strings
+        sys.stdout.write("\n")
+    else:
+        for el_text in selected_els:
+            sys.stdout.write(str(el_text) + "\n")
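
The TODO near the top of this file mentions `add_mutually_exclusive_group()`. A minimal sketch of how that could replace the manual `--json` / `--json-lines` check follows. `build_parser` is a hypothetical helper, and only the selector and the two conflicting options are reproduced here; the remaining options would be added to the parser as before.

```python
import argparse


def build_parser():
    """Build a reduced tq-like parser where -j and -J exclude each other."""
    parser = argparse.ArgumentParser(
        prog="tq",
        description="Performs a css selection on an HTML document.",
    )
    parser.add_argument("selector", help="A css selector")

    # argparse rejects any command line that uses more than one option
    # from the same mutually exclusive group.
    output = parser.add_mutually_exclusive_group()
    output.add_argument("-j", "--json-lines", action="store_true",
                        help="JSON encode each match.")
    output.add_argument("-J", "--json", action="store_true",
                        help="Output as json array of strings.")
    return parser


parser = build_parser()
args = parser.parse_args(["-j", ".title a"])
print(args.json_lines, args.json)
```

Passing both `-j` and `-J` to this parser makes argparse print a usage error and exit with status 2, so the explicit check in `main()` would no longer be needed.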
