feat: add utility to load and parse sitemaps, and SitemapRequestLoader #1169

Open · wants to merge 45 commits into base: master

Commits (45) · Changes from all commits

6cf67ba
init sitemap
Mantisus Apr 21, 2025
c96572a
implementation
Mantisus Apr 22, 2025
e7063a5
update
Mantisus Apr 24, 2025
fcbca23
optimization uvicorn paths
Mantisus Apr 24, 2025
1c284ac
Merge branch 'master' into sitemap
Mantisus May 22, 2025
f0b089c
add tests
Mantisus May 30, 2025
43f204d
Merge branch 'master' into sitemap
Mantisus May 30, 2025
c2dbb73
integrate sitemap to robots.txt
Mantisus May 30, 2025
8eb1eaa
Merge branch 'master' into sitemap
Mantisus Jun 3, 2025
3279aa6
add implementation `SitemapRequestLoader`
Mantisus Jun 3, 2025
b1910f1
add tests
Mantisus Jun 3, 2025
65e4a38
update docs
Mantisus Jun 3, 2025
d432941
fix uvicorn path
Mantisus Jun 3, 2025
df554fd
Merge branch 'master' into sitemap
Mantisus Jun 9, 2025
4d61c12
unification echo_content
Mantisus Jun 9, 2025
04cd366
update endpoints
Mantisus Jun 9, 2025
a446bb1
clear extra property in `SitemapRequestLoader`
Mantisus Jun 9, 2025
a82985a
implementation of stream method
Mantisus Jun 10, 2025
987d5e9
Merge branch 'sitemap' into test-sitemap
Mantisus Jun 10, 2025
8a7aa4a
add chunk_size parameter for `iter_bytes`
Mantisus Jun 10, 2025
f7a93cb
Merge branch 'master' into stream-http-client
Mantisus Jun 10, 2025
594604f
add support timeout for stream
Mantisus Jun 10, 2025
0607be6
add test
Mantisus Jun 10, 2025
2be9383
update docstrings
Mantisus Jun 10, 2025
95579fe
remove `chunk_size`
Mantisus Jun 11, 2025
1942822
iter_bytesread_stream
Mantisus Jun 11, 2025
4b52abf
Merge branch 'master' into stream-http-client
Mantisus Jun 11, 2025
59166e1
update for use `HttpClient` with `stream`
Mantisus Jun 11, 2025
b2913f7
Merge branch 'stream-http-client' into test-sitemap
Mantisus Jun 11, 2025
eaf185b
update with `read_stream`
Mantisus Jun 11, 2025
03b0317
add activate property
Mantisus Jun 12, 2025
f5948a0
Merge branch 'master' into test-sitemap
Mantisus Jun 12, 2025
a9aecf2
Merge branch 'master' into stream-http-client
Mantisus Jun 12, 2025
fac5ca1
add active property
Mantisus Jun 12, 2025
8abd28e
Update src/crawlee/crawlers/_playwright/_types.py
Mantisus Jun 12, 2025
3c53169
raise Error for `read_stream` if `stream` is consumed
Mantisus Jun 12, 2025
54ed3f8
update Error
Mantisus Jun 12, 2025
f933954
add reuse tests for context manager
Mantisus Jun 12, 2025
b501739
Use context manager in test
Mantisus Jun 18, 2025
9638c2d
Merge branch 'master' into stream-http-client
Mantisus Jun 18, 2025
2c16d99
Merge branch 'stream-http-client' into test-sitemap
Mantisus Jun 18, 2025
6e9149e
Merge branch 'master' into sitemap
Mantisus Jun 19, 2025
e50f4cc
update
Mantisus Jun 19, 2025
391af8e
add stream decoder for UTF-8
Mantisus Jun 24, 2025
5926211
add `SitemapRequestLoader` in docs guide
Mantisus Jun 27, 2025
28 changes: 28 additions & 0 deletions docs/guides/code_examples/request_loaders/sitemap_example.py
@@ -0,0 +1,28 @@
import asyncio
import re

from crawlee.http_clients import HttpxHttpClient
from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    # Create an HTTP client for fetching sitemaps
    async with HttpxHttpClient() as http_client:
        # Create a sitemap request loader with URL filtering
        sitemap_loader = SitemapRequestLoader(
            sitemap_urls=['https://crawlee.dev/sitemap.xml'],
            http_client=http_client,
            # Exclude all URLs that do not contain 'blog'
            exclude=[re.compile(r'^((?!blog).)*$')],
            max_buffer_size=500,  # Buffer up to 500 URLs in memory
        )

        while request := await sitemap_loader.fetch_next_request():
            # Do something with it...

            # And mark it as handled.
            await sitemap_loader.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
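
The example above only drains the loader and marks each request as handled. In practice the loader would usually feed a crawler. The sketch below is one way that might look; it is not part of this PR's diff and assumes the `to_tandem()` helper and the crawler's `request_manager` option described elsewhere in the request loaders guide.

```python
import asyncio
import re

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    async with HttpxHttpClient() as http_client:
        sitemap_loader = SitemapRequestLoader(
            sitemap_urls=['https://crawlee.dev/sitemap.xml'],
            http_client=http_client,
            # Same filter as the example above: keep only URLs containing 'blog'.
            exclude=[re.compile(r'^((?!blog).)*$')],
        )

        # Wrap the read-only loader in a writable request manager (tandem),
        # so the crawler can enqueue new requests and reclaim failed ones.
        request_manager = await sitemap_loader.to_tandem()

        crawler = ParselCrawler(request_manager=request_manager)

        @crawler.router.default_handler
        async def handler(context: ParselCrawlingContext) -> None:
            context.log.info(f'Processing {context.request.url}')

        await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
```

Wrapping the loader in a tandem gives the crawler a writable queue behind the sitemap URLs, so retries and newly discovered links can be handled the usual way.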
20 changes: 18 additions & 2 deletions docs/guides/request_loaders.mdx
@@ -10,6 +10,7 @@ import TabItem from '@theme/TabItem';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import RlBasicExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/rl_basic_example.py';
import SitemapExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/sitemap_example.py';
import TandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/tandem_example.py';
import ExplicitTandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/tandem_example_explicit.py';

@@ -23,9 +24,10 @@ The [`request_loaders`](https://github.com/apify/crawlee-python/tree/master/src/
- <ApiLink to="class/RequestManager">`RequestManager`</ApiLink>: Extends `RequestLoader` with write capabilities.
- <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink>: Combines a read-only `RequestLoader` with a writable `RequestManager`.

And one specific request loader:
And specific request loaders:

- <ApiLink to="class/RequestList">`RequestList`</ApiLink>: A lightweight implementation of request loader for managing a static list of URLs.
- <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink>: A request loader that reads URLs from XML sitemaps with filtering capabilities.

Below is a class diagram that illustrates the relationships between these components and the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>:

@@ -83,6 +85,11 @@ class RequestList {
    _methods_()
}

class SitemapRequestLoader {
    _attributes_
    _methods_()
}

class RequestManagerTandem {
    _attributes_
    _methods_()
@@ -97,6 +104,7 @@ RequestManager <|-- RequestQueue

RequestLoader <|-- RequestManager
RequestLoader <|-- RequestList
RequestLoader <|-- SitemapRequestLoader
RequestManager <|-- RequestManagerTandem
```

@@ -112,6 +120,14 @@ Here is a basic example of working with the <ApiLink to="class/RequestList">`Req
{RlBasicExample}
</RunnableCodeBlock>

## Sitemap request loader

The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> is a specialized request loader that reads URLs from XML sitemaps. It's particularly useful when you want to crawl a website systematically by following its sitemap structure. The loader supports filtering URLs using glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> provides streaming processing of sitemaps, which ensures efficient memory usage without loading the entire sitemap into memory.

<RunnableCodeBlock className="language-python" language="python">
{SitemapExample}
</RunnableCodeBlock>

## Request manager

The <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> extends `RequestLoader` with write capabilities. In addition to reading requests, a request manager can add or reclaim them. This is important for dynamic crawling projects, where new URLs may emerge during the crawl process. Or when certain requests may failed and need to be retried. For more details refer to the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> API reference.
@@ -139,4 +155,4 @@ This section describes the combination of the <ApiLink to="class/RequestList">`

## Conclusion

This guide explained the `request_loaders` sub-package, which extends the functionality of the `RequestQueue` with additional tools for managing URLs. You learned about the `RequestLoader`, `RequestManager`, and `RequestManagerTandem` classes, as well as the `RequestList` class. You also saw examples of how to work with these classes in practice. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
This guide explained the `request_loaders` sub-package, which extends the functionality of the `RequestQueue` with additional tools for managing URLs. You learned about the `RequestLoader`, `RequestManager`, and `RequestManagerTandem` classes, as well as the `RequestList` and `SitemapRequestLoader` classes. You also saw examples of how to work with these classes in practice. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
22 changes: 20 additions & 2 deletions src/crawlee/_utils/robots.py
@@ -5,6 +5,7 @@
from protego import Protego
from yarl import URL

from crawlee._utils.sitemap import Sitemap
from crawlee._utils.web import is_status_code_client_error

if TYPE_CHECKING:
@@ -15,9 +16,13 @@


class RobotsTxtFile:
    def __init__(self, url: str, robots: Protego) -> None:
    def __init__(
        self, url: str, robots: Protego, http_client: HttpClient | None = None, proxy_info: ProxyInfo | None = None
    ) -> None:
        self._robots = robots
        self._original_url = URL(url).origin()
        self._http_client = http_client
        self._proxy_info = proxy_info

    @classmethod
    async def from_content(cls, url: str, content: str) -> Self:
@@ -56,7 +61,7 @@ async def load(cls, url: str, http_client: HttpClient, proxy_info: ProxyInfo | N

        robots = Protego.parse(body.decode('utf-8'))

        return cls(url, robots)
        return cls(url, robots, http_client=http_client, proxy_info=proxy_info)

    def is_allowed(self, url: str, user_agent: str = '*') -> bool:
        """Check if the given URL is allowed for the given user agent.
@@ -83,3 +88,16 @@ def get_crawl_delay(self, user_agent: str = '*') -> int | None:
"""
crawl_delay = self._robots.crawl_delay(user_agent)
return int(crawl_delay) if crawl_delay is not None else None

    async def parse_sitemaps(self) -> Sitemap:
        """Parse the sitemaps from the robots.txt file and return a `Sitemap` instance."""
        sitemaps = self.get_sitemaps()
        if not self._http_client:
            raise ValueError('HTTP client is required to parse sitemaps.')

        return await Sitemap.load(sitemaps, self._http_client, self._proxy_info)

    async def parse_urls_from_sitemaps(self) -> list[str]:
        """Parse the sitemaps in the robots.txt file and return a list of URLs."""
        sitemap = await self.parse_sitemaps()
        return sitemap.urls
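
For context, here is a minimal sketch of how the new `parse_sitemaps()` / `parse_urls_from_sitemaps()` methods could be exercised, based only on the signatures in the diff above; the robots.txt URL is a placeholder and this snippet is not part of the PR.

```python
import asyncio

from crawlee._utils.robots import RobotsTxtFile
from crawlee.http_clients import HttpxHttpClient


async def main() -> None:
    async with HttpxHttpClient() as http_client:
        # Passing the HTTP client here is what enables sitemap parsing later,
        # since `load()` now stores the client on the instance.
        robots = await RobotsTxtFile.load('https://crawlee.dev/robots.txt', http_client)

        # Fetch and parse every sitemap referenced by robots.txt, then list its URLs.
        urls = await robots.parse_urls_from_sitemaps()
        print(f'Found {len(urls)} URLs in the sitemaps')


if __name__ == '__main__':
    asyncio.run(main())
```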