## CHANGES

v0.8.1
+
- Logging and Behavior Tweaks by @ikreymer in https://github.com/webrecorder/browsertrix-crawler/pull/229
- Fix typos by @stavares843 in https://github.com/webrecorder/browsertrix-crawler/pull/232
- Add crawl log to WACZ by @ikreymer in https://github.com/webrecorder/browsertrix-crawler/pull/231

v0.8.0
+
- Switch to Chrome/Chromium 109
- Convert to ESM module
- Add ad blocking via request interception (#173)
@@ -25,11 +27,13 @@ v0.8.0
- update behaviors to 0.4.1, rename 'Behavior line' -> 'Behavior log' by @ikreymer in https://github.com/webrecorder/browsertrix-crawler/pull/223

v0.7.1
+
- Fix for warcio.js by @ikreymer in #178
- Guard against pre-existing user/group by @edsu in #176
- Fix incorrect combineWARCs property in README.md by @Georift in #180

v0.7.0
+
- Update to Chrome/Chromium 101 - (0.7.0 Beta 0) by @ikreymer in #144
- Add --netIdleWait, bump dependencies (0.7.0-beta.2) by @ikreymer in #145
- Update README.md by @atomotic in #147
@@ -41,7 +45,6 @@
- Interrupt Handling Fixes by @ikreymer in #167
- Run in Docker as User by @edsu in #171

-
v0.6.0

- Add a --waitOnDone option, which has browsertrix crawler wait when finished (for use with Browsertrix Cloud)
@@ -56,8 +59,8 @@
- Fixes to interrupting a single instance in a shared state crawl
- force all cookies, including session cookies, to fixed duration in days, configurable via --cookieDays

-
v0.5.0
+
- Scope: support for `scopeType: domain` to include all subdomains and ignore 'www.' if specified in the seed.
- Profiles: support loading remote profile from URL as well as local file
- Non-HTML Pages: Load non-200 responses in browser, even if non-html, fix waiting issues with non-HTML pages (e.g. PDFs)
@@ -75,8 +78,8 @@
- Signing: Support for optional signing of WACZ
- Dependencies: update to latest pywb, wacz and browsertrix-behaviors packages

-
v0.4.4
+
- Page Block Rules Fix: 'request already handled' errors by avoiding adding duplicate handlers to same page.
- Page Block Rules Fix: await all continue/abort() calls and catch errors.
- Page Block Rules: Don't apply to top-level page, print warning and recommend scope rules instead.
@@ -86,18 +89,21 @@ v0.4.4
- README: Update old type -> scopeType, list new scope types.

v0.4.3
+
- BlockRules Fixes: When considering the 'inFrameUrl' for a navigation request for an iframe, use URL of parent frame.
- BlockRules Fixes: Always allow pywb proxy scripts.
- Logging: Improved debug logging for block rules (log blocked requests and conditional iframe requests) when 'debug' set in 'logging'

v0.4.2
+
- Compose/docs: Build latest image by default, update README to refer to latest image
- Fix typo in `crawler.capturePrefix` that resulted in `directFetchCapture()` always failing
- Tests: Update all tests to use `test-crawls` directory
- extractLinks() just extracts links from default selectors, allows custom driver to filter results
- loadPage() accepts a list of selector options with selector, extract, and isAttribute settings for further customization of link extraction

v0.4.1
+
- BlockRules Optimizations: don't intercept requests if no blockRules
- Profile Creation: Support extending existing profile by passing a --profile param to load on startup
- Profile Creation: Set default window size to 1600x900, add --windowSize param for setting custom size
@@ -107,6 +113,7 @@ v0.4.1
- CI: Build a multi-platform (amd64 and arm64) image on each release

v0.4.0
+
- YAML based config, specifiable via --config property or via stdin (with '--config stdin')
- Support for different scope types ('page', 'prefix', 'host', 'any', 'none') + crawl depth at crawl level
- Per-Seed scoping, including different scope types, or depth and include/exclude rules configurable per seed in 'seeds' list via YAML config
@@ -120,16 +127,17 @@ v0.4.0
- Update to latest pywb (2.5.0b4), browsertrix-behaviors (0.2.3), py-wacz (0.3.1)

v0.3.2
- - Added a `--urlFile` option: Allows users to specify a .txt file list of exact URLs to crawl (one URL per line).

+ - Added a `--urlFile` option: Allows users to specify a .txt file list of exact URLs to crawl (one URL per line).

v0.3.1
+
- Improved shutdown wait: Instead of waiting for 5 secs, wait until all pending requests are written to WARCs
- Bug fix: Use async APIs for combine WARC to avoid spurious issues with multiple crawls
- Behaviors: Update to Behaviors 0.2.1, with support for Facebook pages

-
v0.3.0
+
- WARC Combining: `--combineWARC` and `--rolloverSize` flags for generating combined WARC at end of crawl, each WARC up to specified rolloverSize
- Profiles: Support for creating reusable browser profiles, stored as tarballs, and running crawl with a login profile (see README for more info)
- Behaviors: Switch to Browsertrix Behaviors v0.1.1 for in-page behaviors