This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.
Feature: Configurable content source for markdown generation
Changes Made:
- Added
content_source: str = "cleaned_html"
parameter toMarkdownGenerationStrategy
class - Updated
DefaultMarkdownGenerator
to accept and pass the content source parameter - Renamed the
cleaned_html
parameter toinput_html
in thegenerate_markdown
method - Modified
AsyncWebCrawler.aprocess_html
to select the appropriate HTML source based on the generator's config - Added
preprocess_html_for_schema
import inasync_webcrawler.py
Implementation Details:
- Added a new
content_source
parameter to specify which HTML input to use for markdown generation - Options include: "cleaned_html" (default), "raw_html", and "fit_html"
- Used a dictionary dispatch pattern in
aprocess_html
to select the appropriate HTML source - Added proper error handling with fallback to cleaned_html if content source selection fails
- Ensured backward compatibility by defaulting to "cleaned_html" option
Files Modified:
crawl4ai/markdown_generation_strategy.py
: Added content_source parameter and updated the method signaturecrawl4ai/async_webcrawler.py
: Added HTML source selection logic and updated imports
Examples:
- Created
docs/examples/content_source_example.py
demonstrating how to use the new parameter
Challenges:
- Maintaining backward compatibility while reorganizing the parameter flow
- Ensuring proper error handling for all content source options
- Making the change with minimal code modifications
Why This Feature: The content source selection feature allows users to choose which HTML content to use as input for markdown generation:
- "cleaned_html" - Uses the post-processed HTML after scraping strategy (original behavior)
- "raw_html" - Uses the original raw HTML directly from the web page
- "fit_html" - Uses the preprocessed HTML optimized for schema extraction
This feature provides greater flexibility in how users generate markdown, enabling them to:
- Capture more detailed content from the original HTML when needed
- Use schema-optimized HTML when working with structured data
- Choose the approach that best suits their specific use case
Feature: Comprehensive stress testing framework using arun_many
and the dispatcher system to evaluate performance, concurrency handling, and identify potential issues under high-volume crawling scenarios.
Changes Made:
- Created a dedicated stress testing framework in the
benchmarking/
(or similar) directory. - Implemented local test site generation (
SiteGenerator
) with configurable heavy HTML pages. - Added basic memory usage tracking (
SimpleMemoryTracker
) using platform-specific commands (avoidingpsutil
dependency for this specific test). - Utilized
CrawlerMonitor
fromcrawl4ai
for rich terminal UI and real-time monitoring of test progress and dispatcher activity. - Implemented detailed result summary saving (JSON) and memory sample logging (CSV).
- Developed
run_benchmark.py
to orchestrate tests with predefined configurations. - Created
run_all.sh
as a simple wrapper forrun_benchmark.py
.
Implementation Details:
- Generates a local test site with configurable pages containing heavy text and image content.
- Uses Python's built-in
http.server
for local serving, minimizing network variance. - Leverages
crawl4ai
'sarun_many
method for processing URLs. - Utilizes
MemoryAdaptiveDispatcher
to manage concurrency via themax_sessions
parameter (note: memory adaptation features requirepsutil
, not used bySimpleMemoryTracker
). - Tracks memory usage via
SimpleMemoryTracker
, recording samples throughout test execution to a CSV file. - Uses
CrawlerMonitor
(which uses therich
library) for clear terminal visualization and progress reporting directly from the dispatcher. - Stores detailed final metrics in a JSON summary file.
Files Created/Updated:
stress_test_sdk.py
: Main stress testing implementation usingarun_many
.benchmark_report.py
: (Assumed) Report generator for comparing test results.run_benchmark.py
: Test runner script with predefined configurations.run_all.sh
: Simple bash script wrapper forrun_benchmark.py
.USAGE.md
: Comprehensive documentation on usage and interpretation (updated).
Testing Approach:
- Creates a controlled, reproducible test environment with a local HTTP server.
- Processes URLs using
arun_many
, allowing the dispatcher to manage concurrency up tomax_sessions
. - Optionally logs per-batch summaries (when not in streaming mode) after processing chunks.
- Supports different test sizes via
run_benchmark.py
configurations. - Records memory samples via platform commands for basic trend analysis.
- Includes cleanup functionality for the test environment.
Challenges:
- Ensuring proper cleanup of HTTP server processes.
- Getting reliable memory tracking across platforms without adding heavy dependencies (
psutil
) to this specific test script. - Designing
run_benchmark.py
to correctly pass arguments tostress_test_sdk.py
.
Why This Feature:
The high volume stress testing solution addresses critical needs for ensuring Crawl4AI's arun_many
reliability:
- Provides a reproducible way to evaluate performance under concurrent load.
- Allows testing the dispatcher's concurrency control (
max_session_permit
) and queue management. - Enables performance tuning by observing throughput (
URLs/sec
) under differentmax_sessions
settings. - Creates a controlled environment for testing
arun_many
behavior. - Supports continuous integration by providing deterministic test conditions for
arun_many
.
Design Decisions:
- Chose local site generation for reproducibility and isolation from network issues.
- Utilized the built-in
CrawlerMonitor
for real-time feedback, leveraging itsrich
integration. - Implemented optional per-batch logging in
stress_test_sdk.py
(when not streaming) to provide chunk-level summaries alongside the continuous monitor. - Adopted
arun_many
with aMemoryAdaptiveDispatcher
as the core mechanism for parallel execution, reflecting the intended SDK usage. - Created
run_benchmark.py
to simplify running standard test configurations. - Used
SimpleMemoryTracker
to provide basic memory insights without requiringpsutil
for this particular test runner.
Future Enhancements to Consider:
- Create a separate test variant that does use
psutil
to specifically stress the memory-adaptive features of the dispatcher. - Add support for generated JavaScript content.
- Add support for Docker-based testing with explicit memory limits.
- Enhance
benchmark_report.py
to provide more sophisticated analysis of performance and memory trends from the generated JSON/CSV files.
Changes Made:
- Corrected
run_benchmark.py
andstress_test_sdk.py
to use--max-sessions
instead of the incorrect--workers
parameter, accurately reflecting dispatcher configuration. - Updated
run_benchmark.py
argument handling to correctly pass all relevant custom parameters (including--stream
,--monitor-mode
, etc.) tostress_test_sdk.py
. - (Assuming changes in
benchmark_report.py
) Applied dark theme to benchmark reports for better readability. - (Assuming changes in
benchmark_report.py
) Improved visualization code to eliminate matplotlib warnings. - Updated
run_benchmark.py
to provide clickablefile://
links to generated reports in the terminal output. - Updated
USAGE.md
with comprehensive parameter descriptions reflecting the final script arguments. - Updated
run_all.sh
wrapper to correctly invokerun_benchmark.py
with flexible arguments.
Details of Changes:
-
Parameter Correction (
--max-sessions
):- Identified the fundamental misunderstanding where
--workers
was used incorrectly. - Refactored
stress_test_sdk.py
to accept--max-sessions
and configure theMemoryAdaptiveDispatcher
'smax_session_permit
accordingly. - Updated
run_benchmark.py
argument parsing and command construction to use--max-sessions
. - Updated
TEST_CONFIGS
inrun_benchmark.py
to usemax_sessions
.
- Identified the fundamental misunderstanding where
-
Argument Handling (
run_benchmark.py
):- Improved logic to collect all command-line arguments provided to
run_benchmark.py
. - Ensured all relevant arguments (like
--stream
,--monitor-mode
,--port
,--use-rate-limiter
, etc.) are correctly forwarded when callingstress_test_sdk.py
as a subprocess.
- Improved logic to collect all command-line arguments provided to
-
Dark Theme & Visualization Fixes (Assumed in
benchmark_report.py
):- (Describes changes assumed to be made in the separate reporting script).
-
Clickable Links (
run_benchmark.py
):- Added logic to find the latest HTML report and PNG chart in the
benchmark_reports
directory afterbenchmark_report.py
runs. - Used
pathlib
to generate correctfile://
URLs for terminal output.
- Added logic to find the latest HTML report and PNG chart in the
-
Documentation Improvements (
USAGE.md
):- Rewrote sections to explain
arun_many
, dispatchers, and--max-sessions
. - Updated parameter tables for all scripts (
stress_test_sdk.py
,run_benchmark.py
). - Clarified the difference between batch and streaming modes and their effect on logging.
- Updated examples to use correct arguments.
- Rewrote sections to explain
Files Modified:
stress_test_sdk.py
: Changed--workers
to--max-sessions
, added new arguments, usedarun_many
.run_benchmark.py
: Changed argument handling, updated configs, callsstress_test_sdk.py
.run_all.sh
: Updated to callrun_benchmark.py
correctly.USAGE.md
: Updated documentation extensively.benchmark_report.py
: (Assumed modifications for dark theme and viz fixes).
Testing:
- Verified that
--max-sessions
correctly limits concurrency via theCrawlerMonitor
output. - Confirmed that custom arguments passed to
run_benchmark.py
are forwarded tostress_test_sdk.py
. - Validated clickable links work in supporting terminals.
- Ensured documentation matches the final script parameters and behavior.
Why These Changes:
These refinements correct the fundamental approach of the stress test to align with crawl4ai
's actual architecture and intended usage:
- Ensures the test evaluates the correct components (
arun_many
,MemoryAdaptiveDispatcher
). - Makes test configurations more accurate and flexible.
- Improves the usability of the testing framework through better argument handling and documentation.
Future Enhancements to Consider:
- Add support for generated JavaScript content to test JS rendering performance
- Implement more sophisticated memory analysis like generational garbage collection tracking
- Add support for Docker-based testing with memory limits to force OOM conditions
- Create visualization tools for analyzing memory usage patterns across test runs
- Add benchmark comparisons between different crawler versions or configurations
Changes Made:
- Fixed custom parameter handling in run_benchmark.py
- Applied dark theme to benchmark reports for better readability
- Improved visualization code to eliminate matplotlib warnings
- Added clickable links to generated reports in terminal output
- Enhanced documentation with comprehensive parameter descriptions
Details of Changes:
-
Custom Parameter Handling Fix
- Identified bug where custom URL count was being ignored in run_benchmark.py
- Rewrote argument handling to use a custom args dictionary
- Properly passed parameters to the test_simple_stress.py command
- Added better UI indication of custom parameters in use
-
Dark Theme Implementation
- Added complete dark theme to HTML benchmark reports
- Applied dark styling to all visualization components
- Used Nord-inspired color palette for charts and graphs
- Improved contrast and readability for data visualization
- Updated text colors and backgrounds for better eye comfort
-
Matplotlib Warning Fixes
- Resolved warnings related to improper use of set_xticklabels()
- Implemented correct x-axis positioning for bar charts
- Ensured proper alignment of bar labels and data points
- Updated plotting code to use modern matplotlib practices
-
Documentation Improvements
- Created comprehensive USAGE.md with detailed instructions
- Added parameter documentation for all scripts
- Included examples for all common use cases
- Provided detailed explanations for interpreting results
- Added troubleshooting guide for common issues
Files Modified:
tests/memory/run_benchmark.py
: Fixed custom parameter handlingtests/memory/benchmark_report.py
: Added dark theme and fixed visualization warningstests/memory/run_all.sh
: Added clickable links to reportstests/memory/USAGE.md
: Created comprehensive documentation
Testing:
- Verified that custom URL counts are now correctly used
- Confirmed dark theme is properly applied to all report elements
- Checked that matplotlib warnings are no longer appearing
- Validated clickable links to reports work in terminals that support them
Why These Changes: These improvements address several usability issues with the stress testing system:
- Better parameter handling ensures test configurations work as expected
- Dark theme reduces eye strain during extended test review sessions
- Fixing visualization warnings improves code quality and output clarity
- Enhanced documentation makes the system more accessible for future use
Future Enhancements:
- Add additional visualization options for different types of analysis
- Implement theme toggle to support both light and dark preferences
- Add export options for embedding reports in other documentation
- Create dedicated CI/CD integration templates for automated testing
Feature: MHTML snapshot capture of crawled pages
Changes Made:
- Added
capture_mhtml: bool = False
parameter toCrawlerRunConfig
class - Added
mhtml: Optional[str] = None
field toCrawlResult
model - Added
mhtml_data: Optional[str] = None
field toAsyncCrawlResponse
class - Implemented
capture_mhtml()
method inAsyncPlaywrightCrawlerStrategy
class to capture MHTML via CDP - Modified the crawler to capture MHTML when enabled and pass it to the result
Implementation Details:
- MHTML capture uses Chrome DevTools Protocol (CDP) via Playwright's CDP session API
- The implementation waits for page to fully load before capturing MHTML content
- Enhanced waiting for JavaScript content with requestAnimationFrame for better JS content capture
- We ensure all browser resources are properly cleaned up after capture
Files Modified:
crawl4ai/models.py
: Added the mhtml field to CrawlResultcrawl4ai/async_configs.py
: Added capture_mhtml parameter to CrawlerRunConfigcrawl4ai/async_crawler_strategy.py
: Implemented MHTML capture logiccrawl4ai/async_webcrawler.py
: Added mapping from AsyncCrawlResponse.mhtml_data to CrawlResult.mhtml
Testing:
- Created comprehensive tests in
tests/20241401/test_mhtml.py
covering:- Capturing MHTML when enabled
- Ensuring mhtml is None when disabled explicitly
- Ensuring mhtml is None by default
- Capturing MHTML on JavaScript-enabled pages
Challenges:
- Had to improve page loading detection to ensure JavaScript content was fully rendered
- Tests needed to be run independently due to Playwright browser instance management
- Modified test expected content to match actual MHTML output
Why This Feature: The MHTML capture feature allows users to capture complete web pages including all resources (CSS, images, etc.) in a single file. This is valuable for:
- Offline viewing of captured pages
- Creating permanent snapshots of web content for archival
- Ensuring consistent content for later analysis, even if the original site changes
Future Enhancements to Consider:
- Add option to save MHTML to file
- Support for filtering what resources get included in MHTML
- Add support for specifying MHTML capture options
Feature: Comprehensive capturing of network requests/responses and browser console messages during crawling
Changes Made:
- Added
capture_network_requests: bool = False
andcapture_console_messages: bool = False
parameters toCrawlerRunConfig
class - Added
network_requests: Optional[List[Dict[str, Any]]] = None
andconsole_messages: Optional[List[Dict[str, Any]]] = None
fields to bothAsyncCrawlResponse
andCrawlResult
models - Implemented event listeners in
AsyncPlaywrightCrawlerStrategy._crawl_web()
to capture browser network events and console messages - Added proper event listener cleanup in the finally block to prevent resource leaks
- Modified the crawler flow to pass captured data from AsyncCrawlResponse to CrawlResult
Implementation Details:
- Network capture uses Playwright event listeners (
request
,response
, andrequestfailed
) to record all network activity - Console capture uses Playwright event listeners (
console
andpageerror
) to record console messages and errors - Each network event includes metadata like URL, headers, status, and timing information
- Each console message includes type, text content, and source location when available
- All captured events include timestamps for chronological analysis
- Error handling ensures even failed capture attempts won't crash the main crawling process
Files Modified:
crawl4ai/models.py
: Added new fields to AsyncCrawlResponse and CrawlResultcrawl4ai/async_configs.py
: Added new configuration parameters to CrawlerRunConfigcrawl4ai/async_crawler_strategy.py
: Implemented capture logic using event listenerscrawl4ai/async_webcrawler.py
: Added data transfer from AsyncCrawlResponse to CrawlResult
Documentation:
- Created detailed documentation in
docs/md_v2/advanced/network-console-capture.md
- Added feature to site navigation in
mkdocs.yml
- Updated CrawlResult documentation in
docs/md_v2/api/crawl-result.md
- Created comprehensive example in
docs/examples/network_console_capture_example.py
Testing:
- Created
tests/general/test_network_console_capture.py
with tests for:- Verifying capture is disabled by default
- Testing network request capturing
- Testing console message capturing
- Ensuring both capture types can be enabled simultaneously
- Checking correct content is captured in expected formats
Challenges:
- Initial implementation had synchronous/asynchronous mismatches in event handlers
- Needed to fix type of property access vs. method calls in handlers
- Required careful cleanup of event listeners to prevent memory leaks
Why This Feature: The network and console capture feature provides deep visibility into web page activity, enabling:
- Debugging complex web applications by seeing all network requests and errors
- Security analysis to detect unexpected third-party requests and data flows
- Performance profiling to identify slow-loading resources
- API discovery in single-page applications
- Comprehensive analysis of web application behavior
Future Enhancements to Consider:
- Option to filter captured events by type, domain, or content
- Support for capturing response bodies (with size limits)
- Aggregate statistics calculation for performance metrics
- Integration with visualization tools for network waterfall analysis
- Exporting captures in HAR format for use with external tools