Pagelyzer is a tool that compares two versions of a web page and decides whether they are similar.
It is based on:
- a web page segmentation algorithm
- a combination of structural and visual comparison methods embedded in a statistical discriminative model
- a visual similarity measure designed for Web pages that improves change detection
- a supervised feature selection method adapted to Web archiving
We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not.
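The decision step described above can be illustrated with a minimal sketch. It is not Pagelyzer's actual implementation: it assumes each pair of versions is reduced to a two-element vector of similarity scores (a hypothetical structural score and a hypothetical visual score, both in [0, 1]), uses made-up training data, and swaps in a small linear SVM trained by sub-gradient descent on the hinge loss so that the example needs no external library.

```python
# Illustrative only: a tiny linear SVM (primal form, hinge loss) trained by
# batch sub-gradient descent. The feature layout and the data are assumptions
# made for this sketch, not Pagelyzer's real features or training set.

# Each vector is (structural_similarity, visual_similarity) for one pair of
# page versions; label +1 means "similar", -1 means "dissimilar".
SCORE_VECTORS = [
    (0.95, 0.90), (0.85, 0.92), (0.90, 0.80), (0.80, 0.88),
    (0.20, 0.30), (0.35, 0.15), (0.10, 0.25), (0.30, 0.40),
]
LABELS = [1, 1, 1, 1, -1, -1, -1, -1]


def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=2000):
    """Return (weights, bias) that approximately minimise
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # Gradient of the regularisation term.
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:
                # This sample violates the margin: add its hinge sub-gradient.
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, score_vector):
    """Return +1 if the two versions are judged similar, -1 otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, score_vector)) + b
    return 1 if score >= 0 else -1
```

A model trained with `train_linear_svm(SCORE_VECTORS, LABELS)` can then be applied to the score vector of any new pair of versions through `predict`. In practice the score vectors would come from the structural and visual comparison methods listed above, and a mature SVM library would normally replace the hand-written training loop.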
Pagelyzer offers the following benefits to the end user:
- Crawler optimization, e.g. discovering new crawl strategies based on patterns.
- Quality assurance for crawlers, for example, by comparing the live version of the page with the one just crawled.
- Detecting format obsolescence due to evolving technologies: is a web page rendered identically across different versions of a browser, or across different browsers?
- Archive maintenance: operations such as format migration can change how archived page versions are rendered.
Pagelyzer is for:
- Content holders, such as national libraries
- Institutions interested in web archiving
- Web archivists
- Archive managers
- Digital preservation specialists
Rendering issues caused by format obsolescence in the archive's technological landscape can be detected by comparing web pages. This scenario explores possible solutions, such as using image comparison to detect rendering errors within web archives and to compare reference snapshots of web pages across browser versions.
The quality of web crawlers can also be tested by comparing the crawled version of a page with the version online immediately after crawling.
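Both scenarios above come down to comparing two renderings of the same page. The following is a naive baseline sketch of such an image comparison, not the visual similarity measure Pagelyzer uses. It assumes the two snapshots have already been rendered and converted to equally sized grids of grey levels (0 to 255); the function name `pixel_similarity` and the `tolerance` parameter are hypothetical.

```python
# Illustrative only: a naive pixel-level comparison of two rendered snapshots.
# Snapshots are assumed to be lists of rows, each row a list of grey levels.

def pixel_similarity(snapshot_a, snapshot_b, tolerance=10):
    """Return the fraction of pixels whose grey levels differ by at most
    `tolerance` between the two snapshots (1.0 means visually identical
    under this crude measure)."""
    same_shape = len(snapshot_a) == len(snapshot_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(snapshot_a, snapshot_b)
    )
    if not same_shape:
        raise ValueError("snapshots must have the same dimensions")

    total = 0
    matching = 0
    for row_a, row_b in zip(snapshot_a, snapshot_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            if abs(pixel_a - pixel_b) <= tolerance:
                matching += 1
    if total == 0:
        raise ValueError("snapshots must not be empty")
    return matching / total
```

For crawler quality assurance, `snapshot_a` could be the rendering of the crawled copy and `snapshot_b` the rendering of the live page; for obsolescence checks, the two renderings could come from different browser versions. A score well below 1.0 would flag the pair for closer inspection, although a pixel-level measure like this one is sensitive to small layout shifts.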
Responsible for this Workpackage:
- Matthieu CORD/UPMC
- Stéphane GANÇARSKI/UPMC
Contributors:
- Andrés Sanoja (web page segmentation, content based)
- Marc Law (supervised framework, image based, SVM, marcalizer)
- Zeynep Pehlivan (general info)
Contributors no longer working for the project:
- Alexis Lechervy
- Myriam Ben Saad
- Carlos Sureda
- Jordi Creus
LIP6 / Université Pierre et Marie Curie
Some parts of this package are adapted from the BrowserShot project developed by IM, France. https://github.com/sbarton/browser-shot-tool-mapred
- Sanoja A, Gançarski S. "Block-o-Matic: A Web Page Segmentation Framework". Paper accepted for oral presentation at the International Conference on Multimedia Computing and Systems (ICMCS'14). Morocco, April 2014.
- Sanoja A, Gançarski S. "Block-o-Matic: a Web Page Segmentation Tool and its Evaluation". BDA. Nantes, France. 2013. http://hal.archives-ouvertes.fr/hal-00881693/
- Saad M B, Gançarski S, Pehlivan Z. "A Novel Web Archiving Approach based on Visual Pages Analysis". In 9th International Web Archiving Workshop (IWAW), ECDL 2009.
- Sanoja A, Gançarski S. "Yet another Web Page Segmentation Tool". Proceedings iPRES 2012. Toronto, Canada, 2012.
- Pehlivan Z, Saad M B, Gançarski S. "Understanding Web Pages Changes". DEXA (1) 2010: 1-15.
- Teva Law M, Sureda C, Thome N, Gançarski S, Cord M. "Structural and Visual Similarity Learning for Web Page Archiving". Workshop CBMI 2012.
Blog Posts:
- SCAPE QA Tool: Technologies behind Pagelyzer - I Support Vector Machine http://www.openplanetsfoundation.org/blogs/2014-02-07-scape-qa-tool-technologies-behind-pagelyzer-i-support-vector-machine
- SCAPE QA Tool: Technologies behind Pagelyzer - II Web Page Segmentation http://www.openplanetsfoundation.org/blogs/2014-02-12-scape-qa-tool-technologies-behind-pagelyzer-ii-web-page-segmentation
This work was partially supported by the SCAPE project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).