--allowHashUrls option silently does nothing #790

Open · Mr0grog opened this issue Mar 8, 2025 · 0 comments

Mr0grog commented Mar 8, 2025

The crawler has a documented --allowHashUrls option, but it doesn’t appear to do anything. Searching the codebase, I can’t find any references to it except in the argument parser, so it doesn’t seem to actually be used.
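For reference, here’s roughly how I’m passing the flag (a sketch assuming the Docker invocation from the project README; the URL here is just a placeholder):

```bash
# Sketch: pass --allowHashUrls on the command line. The parser accepts it,
# but as far as I can tell nothing downstream ever reads it.
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
  crawl --url 'https://example.com/#/some-page' \
  --scopeType page --allowHashUrls
```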

I expected this to allow a seed I’d listed with a hash URL to be captured. For example, using the following config:

```yaml
scopeType: page
allowHashUrls: true

seeds:
  - 'https://www.eia.gov/naturalgas/ngqs/#?report=RP9&year1=2017&year2=2017&company=Name'
```

Is this something that’s just not hooked up, or maybe a vestigial feature that was supposed to be removed?

The workarounds I’m currently trying are:

```yaml
scopeType: page
include: ['.*']
```

or:

```yaml
scopeType: custom
depth: 0
include: ['.*']
```

or:

```yaml
scopeType: custom
depth: 0

seeds:
  - url: 'https://www.eia.gov/naturalgas/ngqs/#?report=RP9&year1=2017&year2=2017&company=Name'
    allowHash: true
```

(Side note: I’d hoped I could use allowHash on the seed with scopeType: page at the top level, but it looks like that scope type always prevents allowHash from being configured, which seems less than ideal.)
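For concreteness, this is the shape of config I’d hoped would work, but as far as I can tell scopeType: page always disables allowHash on the seed:

```yaml
scopeType: page

seeds:
  - url: 'https://www.eia.gov/naturalgas/ngqs/#?report=RP9&year1=2017&year2=2017&company=Name'
    allowHash: true
```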
