<div align="center">

# metadata-scraper

[License](https://github.com/BetaHuhn/metadata-scraper/blob/master/LICENSE) [npm package](https://www.npmjs.com/package/metadata-scraper)

A JavaScript library for scraping/parsing metadata from a web page.

</div>

## 👋 Introduction

[metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) is a JavaScript library which scrapes/parses metadata from web pages. You only need to supply it with a URL or an HTML string and it will use different rules to find the most relevant metadata, like:

- Title
- Description
- Favicons/Images
- Language
- Keywords
- Author
- and more (full list [below](#))


## 🚀 Get started

Install [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) via npm:

```shell
npm install metadata-scraper
```

## 📚 Usage

Import `metadata-scraper` and pass it a URL or an options object:

```js
const getMetaData = require('metadata-scraper')

const url = 'https://github.com/BetaHuhn/metadata-scraper'

getMetaData(url).then((data) => {
	console.log(data)
})
```

Or with `async`/`await`:

```js
const getMetaData = require('metadata-scraper')

async function run() {
	const url = 'https://github.com/BetaHuhn/metadata-scraper'
	const data = await getMetaData(url)
	console.log(data)
}

run()
```

This will return:

```js
{
	title: 'BetaHuhn/metadata-scraper',
	description: 'A Javascript library for scraping/parsing metadata from a web page.',
	language: 'en',
	url: 'https://github.com/BetaHuhn/metadata-scraper',
	provider: 'GitHub',
	twitter: '@github',
	image: 'https://avatars1.githubusercontent.com/u/51766171?s=400&v=4',
	icon: 'https://github.githubassets.com/favicons/favicon.svg'
}
```

## ⚙️ Configuration

You can change the behaviour of [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) by passing an options object:

```js
const getMetaData = require('metadata-scraper')

const options = {
	url: 'https://github.com/BetaHuhn/metadata-scraper', // URL of the web page
	maxRedirects: 0, // Maximum number of redirects to follow (default: 5)
	ua: 'MyApp', // User-Agent header
	timeout: 1000, // Request timeout in milliseconds (default: 10000ms)
	forceImageHttps: false, // Force all image URLs to use https (default: true)
	customRules: {} // More info below
}

getMetaData(options).then((data) => {
	console.log(data)
})
```

You can specify the URL either by passing it as the first parameter or by setting it in the options object.

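The two call styles are interchangeable. As a rough sketch of the idea (a hypothetical `normalizeInput` helper, not the library's actual internals), a string argument can simply be folded into an options object:

```javascript
// Hypothetical sketch: fold a URL-or-options argument into a single
// options object. This is NOT the library's own code.
function normalizeInput(input) {
	// A plain string is treated as the URL; anything else is
	// assumed to already be an options object.
	return typeof input === 'string' ? { url: input } : input
}

console.log(normalizeInput('https://example.com'))
// { url: 'https://example.com' }
console.log(normalizeInput({ url: 'https://example.com', timeout: 1000 }))
// { url: 'https://example.com', timeout: 1000 }
```
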
## 📖 Examples

Here are some examples of how to use [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper):

### Basic

Pass a URL as the first parameter and [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) automatically scrapes it and returns everything it finds:

```js
const url = 'https://github.com/BetaHuhn/metadata-scraper'
const data = await getMetaData(url)
```

Example file located at [examples/basic.js](/examples/basic.js).

---

### HTML String

If you already have an HTML string and don't want [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) to make an HTTP request, specify it in the options object:

```js
const html = `
	<meta name="og:title" content="Example">
	<meta name="og:description" content="This is an example.">
`

const options = {
	html: html,
	url: 'https://example.com' // Optional URL to make relative image paths absolute
}

const data = await getMetaData(options)
```

Example file located at [examples/html.js](/examples/html.js).

---

### Custom Rules

Take a look at the `rules.ts` file in the `src` directory to see all the rules which will be used.

You can extend [metadata-scraper](https://github.com/BetaHuhn/metadata-scraper) easily by specifying custom rules:

```js
const options = {
	url: 'https://github.com/BetaHuhn/metadata-scraper',
	customRules: {
		name: {
			rules: [
				[ 'meta[name="customName"][content]', (element) => element.getAttribute('content') ]
			],
			processor: (text) => text.toLowerCase()
		}
	}
}

const data = await getMetaData(options)
```

`customRules` needs to contain one or more objects, where the key (`name` above) is the key under which the result is later returned. You can then specify different rules for that item in the `rules` array.

The first item of each rule is a query which gets passed to the browser's `querySelector` function, and the second item is a function which gets passed the matching HTML element:

```js
[ 'querySelector', (element) => element.innerText ]
```

You can also specify a `processor` function which will process/transform the result of one of the matched rules:

```js
{
	processor: (text) => text.toLowerCase()
}
```
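
To see how a rule and its `processor` fit together, here is a self-contained sketch. It uses a hand-rolled mock element instead of a real DOM node, and the `applyRule` helper is hypothetical, not part of the library:

```javascript
// Hypothetical sketch of applying a [selector, extractor] rule plus a
// processor. mockElement stands in for what querySelector would return;
// applyRule is NOT part of metadata-scraper.
const nameRule = {
	rules: [
		[ 'meta[name="customName"][content]', (element) => element.getAttribute('content') ]
	],
	processor: (text) => text.toLowerCase()
}

// Minimal stand-in for a matched <meta> element.
const mockElement = {
	getAttribute: (name) => (name === 'content' ? 'Hello World' : null)
}

function applyRule(rule, element) {
	const [, extractor] = rule.rules[0] // ignore the selector, use the extractor
	const raw = extractor(element)      // 'Hello World'
	return rule.processor ? rule.processor(raw) : raw
}

console.log(applyRule(nameRule, mockElement)) // 'hello world'
```
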

If you find a useful meta tag/rule, let me know and I will add it (or create a PR yourself).

Example file located at [examples/custom.js](/examples/custom.js).

## 💻 Development

Issues and PRs are very welcome!

Please check out the [contributing guide](CONTRIBUTING.md) before you start.

This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). To see differences with previous versions, refer to the [CHANGELOG](CHANGELOG.md).

## ❔ About

This library was developed by me ([@betahuhn](https://github.com/BetaHuhn)) in my free time. If you want to support me:

[Donate via PayPal](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=394RTSBEEEFEE)

### Credits

The loader is based on [file-loader](https://github.com/webpack-contrib/file-loader).

## License

Copyright 2020 Maximilian Schiller

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.