Commit f04dba0

Edit README and add examples
1 parent 000dade commit f04dba0

8 files changed: 239 additions, 19 deletions

README.md

Lines changed: 200 additions & 1 deletion
@@ -1 +1,200 @@
-# metadata-scraper
+<div align="center">
+
+# metadata-scraper
+
+[![GitHub](https://img.shields.io/github/license/mashape/apistatus.svg)](https://github.com/BetaHuhn/metadata-scraper/blob/master/LICENSE) ![David](https://img.shields.io/david/betahuhn/metadata-scraper) [![npm](https://img.shields.io/npm/v/metadata-scraper)](https://www.npmjs.com/package/metadata-scraper)
+
+A JavaScript library for scraping/parsing metadata from a web page.
+
+</div>
+
+## 👋 Introduction
+
+[metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) is a JavaScript library which scrapes/parses metadata from web pages. You only need to supply it with a URL or an HTML string and it will use different rules to find the most relevant metadata like:
+
+- Title
+- Description
+- Favicons/Images
+- Language
+- Keywords
+- Author
+- and more (full list [below](#))
+
+
+## 🚀 Get started
+
+Install [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) via npm:
+```shell
+npm install metadata-scraper
+```
+
+## 📚 Usage
+
+Import `metadata-scraper` and pass it a URL or an options object:
+
+```js
+const getMetaData = require('metadata-scraper')
+
+const url = 'https://github.com/BetaHuhn/metadata-scraper'
+
+getMetaData(url).then((data) => {
+    console.log(data)
+})
+```
+
+Or with `async`/`await`:
+
+```js
+const getMetaData = require('metadata-scraper')
+
+async function run() {
+    const url = 'https://github.com/BetaHuhn/metadata-scraper'
+    const data = await getMetaData(url)
+    console.log(data)
+}
+
+run()
+```
+
+This will return:
+
+```js
+{
+    title: 'BetaHuhn/metadata-scraper',
+    description: 'A Javascript library for scraping/parsing metadata from a web page.',
+    language: 'en',
+    url: 'https://github.com/BetaHuhn/metadata-scraper',
+    provider: 'GitHub',
+    twitter: '@github',
+    image: 'https://avatars1.githubusercontent.com/u/51766171?s=400&v=4',
+    icon: 'https://github.githubassets.com/favicons/favicon.svg'
+}
+```
+
+## ⚙️ Configuration
+
+You can change the behaviour of [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) by passing an options object:
+
+```js
+const getMetaData = require('metadata-scraper')
+
+const options = {
+    url: 'https://github.com/BetaHuhn/metadata-scraper', // URL of web page
+    maxRedirects: 0, // Maximum number of redirects to follow (default: 5)
+    ua: 'MyApp', // User-Agent header
+    timeout: 1000, // Request timeout in milliseconds (default: 10000ms)
+    forceImageHttps: false, // Force all image URLs to use https (default: true)
+    customRules: {} // more info below
+}
+
+getMetaData(options).then((data) => {
+    console.log(data)
+})
+```
+
+You can specify the URL by either passing it as the first parameter, or by setting it in the options object.
+
+## 📖 Examples
+
+Here are some examples of how to use [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper):
+
+### Basic
+
+Pass a URL as the first parameter and [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) automatically scrapes it and returns everything it finds:
+
+```js
+const url = 'https://github.com/BetaHuhn/metadata-scraper'
+const data = await getMetaData(url)
+```
+
+Example file located at [examples/basic.js](/examples/basic.js).
+
+---
+
+### HTML String
+
+If you already have an HTML string and don't want [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) to make an HTTP request, specify it in the options object:
+
+```js
+const html = `
+    <meta name="og:title" content="Example">
+    <meta name="og:description" content="This is an example.">
+`
+
+const options = {
+    html: html,
+    url: 'https://example.com' // Optional URL to make relative image paths absolute
+}
+
+const data = await getMetaData(options)
+```
+
+Example file located at [examples/html.js](/examples/html.js).
+
+---
+
+### Custom Rules
+
+Look at the `rules.ts` file in the `src` directory to see all rules which will be used.
+
+You can expand [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) easily by specifying custom rules:
+
+```js
+const options = {
+    url: 'https://github.com/BetaHuhn/metadata-scraper',
+    customRules: {
+        name: {
+            rules: [
+                [ 'meta[name="customName"][content]', (element) => element.getAttribute('content') ]
+            ],
+            processor: (text) => text.toLowerCase()
+        }
+    }
+}
+
+const data = await getMetaData(options)
+```
+
+`customRules` needs to contain one or more objects, where the key (`name` above) is the key which later gets returned. You can then specify different rules for that item in the `rules` array.
+
+The first item is the query which gets passed to the browser's `querySelector` function, and the second item is a function which gets passed the matched HTML element:
+
+```js
+[ 'querySelector', (element) => element.innerText ]
+```
+
+You can also specify a `processor` function which will process/transform the result of one of the matched rules:
+
+```js
+{
+    processor: (text) => text.toLowerCase()
+}
+```
+
+If you find a useful metatag/rule, let me know and I will add it (or create a PR yourself).
+
+Example file located at [examples/custom.js](/examples/custom.js).
+
+## 💻 Development
+
+Issues and PRs are very welcome!
+
+Please check out the [contributing guide](CONTRIBUTING.md) before you start.
+
+This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). To see differences with previous versions refer to the [CHANGELOG](CHANGELOG.md).
+
+## ❔ About
+
+This library was developed by me ([@betahuhn](https://github.com/BetaHuhn)) in my free time. If you want to support me:
+
+[![Donate via PayPal](https://img.shields.io/badge/paypal-donate-009cde.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=394RTSBEEEFEE)
+
+### Credits
+
+The loader is based on [file-loader](https://github.com/webpack-contrib/file-loader).
+
+## License
+
+Copyright 2020 Maximilian Schiller
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

examples/basic.js

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 const getMetaData = require('../lib')

 const run = async function() {
-    const url = 'https://www.sueddeutsche.de/politik/usa-joe-biden-ron-klain-1.5113555'
+    const url = 'https://github.com/BetaHuhn/metadata-scraper'
     const data = await getMetaData(url)
     console.log(data)
 }

examples/custom.js

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+const getMetaData = require('../lib')
+
+const options = {
+    customRules: {
+        title: {
+            rules: [
+                [ 'meta[property="customTitle"][content]', (element) => element.getAttribute('content') ]
+            ],
+            processor: (text) => text.toLowerCase()
+        }
+    }
+}
+
+const run = async function() {
+    const data = await getMetaData('https://github.com/BetaHuhn/metadata-scraper', options)
+    console.log(data)
+}
+
+run()
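
The custom rule in `examples/custom.js` pairs a selector with an extractor function and an optional `processor`. As a rough illustration of how such a rule set could be evaluated, here is a minimal sketch. The helper `applyRuleSet` and the stub `fakeDoc` are hypothetical names introduced for this example; the library's own `runRule` body is not shown in this commit, and the "first matching rule wins" behaviour is an assumption.

```javascript
// Hypothetical helper: NOT the library's runRule, whose body is not shown in this commit.
// Assumption: rules are tried in order and the first non-empty result is used.
function applyRuleSet(ruleSet, doc) {
    for (const [selector, extract] of ruleSet.rules) {
        const element = doc.querySelector(selector)
        if (!element) continue // selector matched nothing, try the next rule
        const value = extract(element)
        if (!value) continue // extractor returned nothing usable
        // Run the optional processor on the extracted value
        return ruleSet.processor ? ruleSet.processor(value) : value
    }
    return undefined
}

// Stand-in for a parsed document so the sketch runs without a DOM library.
const fakeDoc = {
    querySelector: (selector) =>
        selector === 'meta[property="customTitle"][content]'
            ? { getAttribute: (name) => (name === 'content' ? 'My Custom TITLE' : null) }
            : null
}

// Same shape as the rule set in examples/custom.js
const titleRuleSet = {
    rules: [
        [ 'meta[property="customTitle"][content]', (element) => element.getAttribute('content') ]
    ],
    processor: (text) => text.toLowerCase()
}

console.log(applyRuleSet(titleRuleSet, fakeDoc)) // prints: my custom title
```

Keeping the extractor and the processor separate means the same clean-up step can be shared by every selector listed under one key.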

examples/html.js

Lines changed: 2 additions & 1 deletion
@@ -5,7 +5,8 @@ const run = async function() {
     <meta name="og:title" content="Example">
     <meta name="og:description" content="This is an example.">
     `
-    const data = await getMetaData(html, { html: true, url: 'https://example.com' })
+
+    const data = await getMetaData({ html: html, url: 'https://example.com' })
     console.log(data)
 }

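
The README describes the optional `url` passed alongside `html` as a way to make relative image paths absolute, and `forceImageHttps` as forcing image URLs to https. How the library does this internally is not shown in the commit; the sketch below only illustrates the idea with the standard `URL` class. The helper name `resolveImageUrl` and its parameters are assumptions made for this example.

```javascript
// Hypothetical helper built on the standard WHATWG URL class
// (available as a global in Node.js and in browsers).
function resolveImageUrl(imagePath, baseUrl, forceHttps = true) {
    let resolved
    try {
        // new URL(input, base) resolves relative inputs against the base;
        // an already absolute input ignores the base.
        resolved = new URL(imagePath, baseUrl || undefined)
    } catch (err) {
        // A relative path without a usable base cannot be resolved:
        // return the input unchanged.
        return imagePath
    }
    if (forceHttps && resolved.protocol === 'http:') {
        resolved.protocol = 'https:'
    }
    return resolved.href
}

console.log(resolveImageUrl('/img/logo.png', 'https://example.com'))
// prints: https://example.com/img/logo.png
console.log(resolveImageUrl('http://example.com/a.png'))
// prints: https://example.com/a.png
console.log(resolveImageUrl('img/logo.png'))
// prints: img/logo.png
```

Without a base URL a relative path stays relative, which is presumably why the README suggests passing `url` together with `html`.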

examples/options.js

Lines changed: 4 additions & 11 deletions
@@ -1,23 +1,16 @@
 const getMetaData = require('../lib')

 const options = {
+    url: 'https://github.com/BetaHuhn/metadata-scraper',
     maxRedirects: 0, // default: 5
-    ua: 'MyApp', // default: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
+    ua: 'MyApp',
     timeout: 1000, // default: 10000
     forceImageHttps: false, // default: true
-    customRules: {
-        title: {
-            rules: [
-                [ 'meta[property="customTitle"][content]', (element) => element.getAttribute('content') ]
-            ],
-            processor: (text) => text.toLowerCase()
-        }
-    }
+    customRules: {}
 }

 const run = async function() {
-    const url = 'https://www.sueddeutsche.de/politik/usa-joe-biden-ron-klain-1.5113555'
-    const data = await getMetaData(url, options)
+    const data = await getMetaData(options)
     console.log(data)
 }


package.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "name": "metadata-scraper",
   "version": "0.1.0",
-  "description": "",
+  "description": "A Javascript library for scraping/parsing metadata from a web page. ",
   "main": "lib/index.js",
   "types": "lib/index.d.ts",
   "scripts": {

src/index.ts

Lines changed: 11 additions & 3 deletions
@@ -55,7 +55,16 @@ const runRule = function(ruleSet: RuleSet, doc: Document, context: Context) {
     return undefined
 }

-const getMetaData = async function(url: string, inputOptions: Partial<Options> = {}) {
+const getMetaData = async function(input: string | Partial<Options>, inputOptions: Partial<Options> = {}) {
+
+    let url
+    if (typeof input === 'object') {
+        inputOptions = input
+        url = input.url || ''
+    } else {
+        url = input
+    }
+
     const options = Object.assign({}, defaultOptions, inputOptions)

     const rules: Record<string, RuleSet> = { ...metaDataRules }
@@ -78,8 +87,7 @@ const getMetaData = async function(url: string, inputOptions: Partial<Options> =
         })
         html = response.body
     } else {
-        html = url
-        url = options.url || ''
+        html = options.html
     }

     const metadata: MetaData = {}
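
The change to `getMetaData` lets the first argument be either a URL string or an options object. The stand-alone sketch below mirrors that branching so the two call forms can be compared side by side. `normalizeInput` and `assumedDefaults` are hypothetical names: the real `defaultOptions` object is not shown in this commit, so the defaults here are taken from the README comments. The `input !== null` guard is an addition of this sketch and is not in the diff.

```javascript
// Assumed defaults, copied from the README comments in this commit.
// The library's actual defaultOptions object is not shown.
const assumedDefaults = { maxRedirects: 5, timeout: 10000, forceImageHttps: true }

// Hypothetical stand-alone version of the argument handling added to getMetaData.
function normalizeInput(input, inputOptions = {}) {
    let url
    if (typeof input === 'object' && input !== null) {
        // Options-object form: the object replaces the second argument
        inputOptions = input
        url = input.url || ''
    } else {
        // String form: the first argument is the URL
        url = input
    }
    const options = Object.assign({}, assumedDefaults, inputOptions)
    return { url, options }
}

// Both call forms yield the same URL and the same timeout override.
const fromString = normalizeInput('https://example.com', { timeout: 1000 })
const fromObject = normalizeInput({ url: 'https://example.com', timeout: 1000 })
console.log(fromString.url === fromObject.url) // prints: true
console.log(fromString.options.timeout, fromObject.options.timeout) // prints: 1000 1000

// With only an HTML string, the URL falls back to an empty string.
console.log(normalizeInput({ html: '<title>Example</title>' }).url === '') // prints: true
```

Note that when an object is passed as the first argument, any second argument is overwritten and therefore ignored, which matches the branching in the diff.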

src/types.ts

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ export interface Options {
     ua?: string
     timeout?: number
     forceImageHttps?: boolean
-    html?: boolean
+    html?: string
     url?: string
     customRules?: Record<string, RuleSet>
 }
