[arca.live] Add extractor skeleton #7100

Merged
merged 10 commits into from
Mar 14, 2025

Conversation

hdk5 (Contributor) commented Mar 2, 2025

This is a draft that only contains the API calls so far.
The actual extraction, i.e. parsing the responses, is not implemented yet. Help would be appreciated.

For specific post examples and caveats, see https://github.com/hdk5/danbooru/blob/arca-live/test/unit/sources/arca_live_test.rb.
The board extractor is expected to work with search params, e.g. https://arca.live/b/bluearchive?target=nickname&keyword=horuhara.
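
As an illustration of the search-params case, here is a minimal sketch of how a board URL could be split into the board name and its query parameters. The helper name `parse_board_url` is hypothetical and not part of this PR, and the rule "exactly two path segments" is an assumption meant to keep URLs with an extra segment (such as post URLs) from being treated as boards.

```python
from urllib.parse import urlsplit, parse_qs


def parse_board_url(url):
    """Split a board URL into (board, params).

    Hypothetical helper: assumes board URLs have the form /b/<board>
    with optional search parameters in the query string.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Require exactly /b/<board>; anything longer is not a board listing
    if len(segments) != 2 or segments[0] != "b":
        raise ValueError("not a board URL: " + url)
    # parse_qs returns a list per key; keep the last value of each
    params = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    return segments[1], params


board, params = parse_board_url(
    "https://arca.live/b/bluearchive?target=nickname&keyword=horuhara")
```

The resulting `params` dict could then be passed along unchanged to whatever API call lists the board's articles.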

User pages only return a few of the most recent results, so they don't make much sense as an extraction target, and a global search by name is preferable. I therefore chose not to implement a user extractor.
If needed, the API for this is at https://arca.live/api/app/users/recent?nickname=...&publicId=... for https://arca.live/u/@nickname/publicId, where publicId is not always present.
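
For reference, the mapping from a user page URL to that API URL could look roughly like the sketch below. The function name `user_recent_api_url` is hypothetical, and omitting the `publicId` parameter entirely when the URL has none is an assumption; the API's actual behavior in that case is not confirmed here.

```python
from urllib.parse import urlsplit, urlencode, unquote

API_ROOT = "https://arca.live"


def user_recent_api_url(user_url):
    """Build the 'recent' API URL for a /u/@nickname[/publicId] page URL."""
    parts = urlsplit(user_url)
    segments = [unquote(s) for s in parts.path.split("/") if s]
    if (len(segments) < 2 or segments[0] != "u"
            or not segments[1].startswith("@")):
        raise ValueError("not a user URL: " + user_url)
    params = {"nickname": segments[1][1:]}  # strip the leading '@'
    if len(segments) > 2:
        # publicId is not always present (assumption: leave it out if missing)
        params["publicId"] = segments[2]
    return API_ROOT + "/api/app/users/recent?" + urlencode(params)
```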

        return self._pagination(endpoint, params, "articles")

    def post(self, post_id):
        endpoint = "/api/app/view/article/breaking/" + str(post_id)
Contributor Author

Note: `breaking` covers the whole site. Any post that is on a specific board is also always on `breaking`, or at least I believe it is.

mikf (Owner) commented Mar 2, 2025

Related issue: #5657

mikf added 9 commits March 11, 2025 17:51
compile and cache regex on demand
- extract 'data-originalurl' URLs if available
- replace URL query strings with 'type=orig'
- ignore emoticons by default
- include 'title' in filenames
- use 0.5-1.5s delay between requests
so it doesn't also match 'post' URLs
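
The behavior these commit messages describe can be sketched roughly as follows. This is not the code from the commits: the class and function names are hypothetical, the `arca-emoticon` class name used to spot emoticons is an assumption, and so is applying `type=orig` to `data-originalurl` values as well as to `src` values.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def with_orig_type(url):
    """Replace the URL's query string with 'type=orig'."""
    parts = urlsplit(url)
    # Protocol-relative URLs ('//host/path') get an https scheme
    return urlunsplit(parts._replace(
        scheme=parts.scheme or "https", query="type=orig", fragment=""))


class MediaCollector(HTMLParser):
    """Collect media URLs from post HTML (assumed <img>/<video> tags)."""

    def __init__(self, emoticons=False):
        super().__init__()
        self.emoticons = emoticons
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "video"):
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Ignore emoticons by default (class name is an assumption)
        if not self.emoticons and "arca-emoticon" in classes:
            return
        # Prefer 'data-originalurl' if available, else fall back to 'src'
        url = attrs.get("data-originalurl") or attrs.get("src")
        if url:
            self.urls.append(with_orig_type(url))


def request_delay():
    """Pick a delay between requests, in seconds, from 0.5 to 1.5."""
    return random.uniform(0.5, 1.5)


def extract_media(html, emoticons=False):
    collector = MediaCollector(emoticons)
    collector.feed(html)
    return collector.urls
```

A caller would sleep for `request_delay()` seconds between API requests and pass each post's HTML content to `extract_media`.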

mikf (Owner) commented Mar 13, 2025

Thanks for the initial code and all the resources you provided.

I made a few updates, and most of the Danbooru tests now pass. The exception is the fake .mp4 GIF one, and I'm not sure I want the extractor to do an extra HEAD request for every potential mp4 -> gif.
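
To make the cost of that extra request concrete, here is one way such a check could look. It is only a sketch under assumptions: that the GIF candidate is derived by swapping the `.mp4` extension for `.gif`, and that a successful HEAD response means the GIF exists. All names are hypothetical, and nothing like this is part of the merged code.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def gif_candidate(url):
    """Return the URL with '.mp4' swapped for '.gif', or None if not an mp4."""
    parts = urlsplit(url)
    if not parts.path.lower().endswith(".mp4"):
        return None
    return urlunsplit(parts._replace(path=parts.path[:-4] + ".gif"))


def head_ok(url, timeout=10):
    """Send a HEAD request and report whether it got a 2xx response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError and timeouts
        return False


def resolve_media_url(url, check=head_ok):
    """Prefer the GIF variant of an mp4 URL if the check says it exists."""
    candidate = gif_candidate(url)
    if candidate and check(candidate):
        return candidate
    return url
```

Each mp4 URL costs one additional round trip through `check`, which is the overhead in question; the `check` parameter exists so the logic can be exercised without network access.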

mikf merged commit d900e86 into mikf:master on Mar 14, 2025
8 checks passed