# Puppeteer Ripper
Puppeteer ripper is an alternative to the Phantom and Jsoup rippers.
It uses Puppeteer (a headless Chrome API for Node.js) as its core technology.
This article describes the process of installation, configuration, and usage of the Puppeteer ripper.
# Installation
To have the `mrf-puppeteer` command on your machine, pull `MarfeelXP` and execute the following in a terminal:
mrf-env -R
# Available commands
- `mrf-puppeteer extract`: executes Puppeteer Ripper to extract tenant data.
- `mrf-puppeteer launch`: starts a headless Chrome instance.

By default, the `mrf-puppeteer extract` command tries to connect to that browser instance. If the connection fails, it launches its own browser instance.
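The connect-then-launch fallback described above can be sketched as follows. This is a minimal illustration, not the actual `mrf-puppeteer` implementation: `getBrowser`, `connect`, and `launch` are hypothetical names, and the two callbacks stand in for Puppeteer's `puppeteer.connect()` and `puppeteer.launch()`.

```javascript
// Hypothetical sketch: try to reuse a running browser, otherwise start one.
// `connect` and `launch` are injected so the logic can be read (and tested)
// without Puppeteer installed; in practice they could wrap
// puppeteer.connect(...) and puppeteer.launch(...).
async function getBrowser(connect, launch) {
  try {
    // First choice: attach to the instance started by `mrf-puppeteer launch`.
    return { browser: await connect(), owned: false };
  } catch (err) {
    // Connection failed: fall back to a private browser instance.
    return { browser: await launch(), owned: true };
  }
}
```

The `owned` flag is an assumption added here so that a caller could decide whether to close the browser or only disconnect from it once extraction finishes.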
# `mrf-puppeteer extract` command flags
To see all supported flags, use the `mrf-puppeteer --help` command.
# --uri
Required. The URI of the web page that the headless Chrome instance connects to in order to perform content extraction.
mrf-puppeteer extract --uri http://tenant.com/section1
# --scriptPath
Required. Absolute path to the WhiteCollar script that is injected into the page.
mrf-puppeteer extract --scriptPath ~/path/to/whiteCollarScript.js
# --dev
Activates dev mode, which opens the browser UI so you can visually follow the command execution. Useful for debugging, since you get access to all the JavaScript injected into the page.
mrf-puppeteer extract --dev
# --metadataProviderFiles
Injects metadata provider files into the page. It must be followed by a comma-separated list of absolute paths to the metadata provider files.
mrf-puppeteer extract --metadataProviderFiles ~/path/to/metadata/provider1.js,~/path/to/metadata/provider2.js
TIP
Use this flag only for local testing; in production the metadata is always available.
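Most shells expand `~` only at the start of a word, so the second path in a comma-separated list like the one above may reach the command with a literal `~`. The sketch below shows one way such a list could be split and normalized to absolute paths. `parseProviderPaths` is a hypothetical helper, and whether `mrf-puppeteer` does any of this itself is not documented here.

```javascript
const os = require("os");
const path = require("path");

// Hypothetical helper: split a comma-separated list of provider files and
// expand a leading "~" so that every entry ends up as an absolute path.
function parseProviderPaths(list, home = os.homedir()) {
  return list
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) =>
      // Only a bare "~" or a "~/" prefix is expanded; "~user" is left alone.
      entry === "~" || entry.startsWith("~/")
        ? path.join(home, entry.slice(1))
        : entry
    )
    .map((entry) => {
      // The flag requires absolute paths, so reject anything else early.
      if (!path.isAbsolute(entry)) {
        throw new Error(`Metadata provider path is not absolute: ${entry}`);
      }
      return entry;
    });
}
```

Passing fully expanded absolute paths on the command line avoids the issue without any helper.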
# --enableExternalScriptRequests
Enables external scripts to load.
By default, `mrf-puppeteer` prevents external scripts (from other domains) from loading.
WARNING
This flag should only be used in very rare edge cases; `--whiteListedDomains` should be enough.
Before adding it, check with the content-platform chapter.
mrf-puppeteer extract --enableExternalScriptRequests true
# --whiteListedDomains
Enables external scripts to be loaded from specific domains. It must be followed by a comma-separated list of domains.
Useful when the page relies on libraries hosted on CDN servers (e.g. jQuery) that it needs in order to function correctly.
mrf-puppeteer extract --whiteListedDomains domain1.com,domain2.com
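The interplay between the default script blocking, `--whiteListedDomains`, and `--enableExternalScriptRequests` can be pictured as a single decision per script request. The function below is a sketch under stated assumptions, not the real `mrf-puppeteer` code: `isScriptAllowed` and its option names are hypothetical, and treating subdomains of a whitelisted domain as allowed is an assumption.

```javascript
// Hypothetical decision function for one script request.
// pageUrl: the page being extracted; scriptUrl: the script being requested.
function isScriptAllowed(pageUrl, scriptUrl, options = {}) {
  const {
    enableExternalScriptRequests = false, // mirrors the CLI flag
    whiteListedDomains = [], // parsed from the comma-separated list
  } = options;

  const pageHost = new URL(pageUrl).hostname;
  const scriptHost = new URL(scriptUrl).hostname;

  // Scripts served by the page's own host are not external.
  if (scriptHost === pageHost) return true;
  // The escape hatch allows every external script.
  if (enableExternalScriptRequests) return true;
  // Assumption: a whitelisted domain also covers its subdomains.
  return whiteListedDomains.some(
    (domain) => scriptHost === domain || scriptHost.endsWith("." + domain)
  );
}
```

With Puppeteer's request interception (`page.setRequestInterception(true)`), a check like this could decide between `request.continue()` and `request.abort()` for each script request.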
# --enableImageRequests
Enables image file requests, which `mrf-puppeteer` disables by default for performance reasons.
mrf-puppeteer extract --enableImageRequests true
# --enableFontRequests
Enables font file requests, which `mrf-puppeteer` disables by default for performance reasons.
mrf-puppeteer extract --enableFontRequests true
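The image and font flags, together with `--enableStylesheetRequests`, amount to a per-resource-type switch. Here is a minimal sketch, assuming the defaults stated in this article. `shouldBlockResource` and `DEFAULT_RESOURCE_FLAGS` are hypothetical names; the type strings follow what Puppeteer's `request.resourceType()` returns.

```javascript
// Defaults as described in this article: images and fonts are blocked,
// stylesheets are allowed.
const DEFAULT_RESOURCE_FLAGS = {
  image: false,
  font: false,
  stylesheet: true,
};

// Returns true when a request of the given resource type should be aborted.
// Types without a flag (e.g. "document", "xhr") are never blocked here.
function shouldBlockResource(resourceType, flags = {}) {
  const effective = { ...DEFAULT_RESOURCE_FLAGS, ...flags };
  if (!Object.prototype.hasOwnProperty.call(effective, resourceType)) {
    return false;
  }
  return effective[resourceType] === false;
}
```

For example, `shouldBlockResource("image", { image: true })` models a run with `--enableImageRequests true`.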
# --enableStylesheetRequests
Enables stylesheet file requests. By default, `mrf-puppeteer` already enables them.
mrf-puppeteer extract --enableStylesheetRequests true
# --pageSelectorPrev
The CSS selector used to detect the link to the previous page. Default: [rel='prev']
# --pageSelectorNext
The CSS selector used to detect the link to the next page. Default: [rel='next']
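The `--pageSelectorPrev` and `--pageSelectorNext` selectors could be applied inside the page roughly as shown below. `findPaginationLinks` is a hypothetical helper, and reading the `href` attribute of the matched element is an assumption about what the ripper does with it.

```javascript
// Hypothetical helper: look up the previous/next page links with the
// configured selectors, falling back to the documented defaults.
function findPaginationLinks(root, selectors = {}) {
  const {
    pageSelectorPrev = "[rel='prev']",
    pageSelectorNext = "[rel='next']",
  } = selectors;

  // Returns the href of the first match, or null when nothing matches.
  const hrefOf = (selector) => {
    const element = root.querySelector(selector);
    return element ? element.getAttribute("href") : null;
  };

  return { prev: hrefOf(pageSelectorPrev), next: hrefOf(pageSelectorNext) };
}
```

In a browser context, the call would be `findPaginationLinks(document)`.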
# --sortItemsByDOM
Use DOM order to sort the items. By default, the order is based on relevance.
mrf-puppeteer extract --sortItemsByDOM true
# --waitUntil
Configures the point at which the extraction starts.
The default is `domcontentloaded`: the headless Chrome instance waits for the DOM to be loaded before starting the extraction process.
Possible values:
- `domcontentloaded`: when the `DOMContentLoaded` event is fired.
- `load`: when the `load` event is fired.
- `networkidle0`: when there are no more than 0 network connections for at least 500ms.
- `networkidle2`: when there are no more than 2 network connections for at least 500ms.
mrf-puppeteer extract --waitUntil networkidle2
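These values match the `waitUntil` option accepted by Puppeteer's `page.goto()`. The sketch below validates a flag value and builds the navigation options. `buildNavigationOptions` is a hypothetical helper and not part of `mrf-puppeteer`.

```javascript
// The four values listed in this article.
const WAIT_UNTIL_VALUES = [
  "domcontentloaded",
  "load",
  "networkidle0",
  "networkidle2",
];

// Hypothetical helper: validate the flag and return navigation options.
function buildNavigationOptions(waitUntil = "domcontentloaded") {
  if (!WAIT_UNTIL_VALUES.includes(waitUntil)) {
    throw new Error(
      `Unsupported --waitUntil value "${waitUntil}". ` +
        `Expected one of: ${WAIT_UNTIL_VALUES.join(", ")}`
    );
  }
  // The result could be passed as the second argument of page.goto(uri, options).
  return { waitUntil };
}
```

Rejecting unknown values up front gives a clearer error than letting the navigation call fail later.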
# --userAgent
Configures the user agent to be used in the extraction. By default, the headless Chrome instance sets its own default user agent.
Possible values:
- `mobile`: sets the mobile user agent.
- `marfeel`: sets `Marfeel-crawler` as the user agent.
- Custom string: any other value is used as-is as the user agent.
mrf-puppeteer extract --userAgent "mobile"
or for a custom string
mrf-puppeteer extract --userAgent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
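The three cases can be summarized in a small resolver. This is a sketch only: `resolveUserAgent` is a hypothetical name, and the mobile user agent string is a placeholder because the actual value is not given in this article.

```javascript
// Placeholder: the real mobile user agent used by mrf-puppeteer is not
// documented here.
const MOBILE_USER_AGENT_PLACEHOLDER = "<mobile user agent string>";

// Hypothetical resolver for the --userAgent flag value.
function resolveUserAgent(value) {
  // No flag given: keep the browser's default user agent.
  if (value === undefined || value === "") return null;
  if (value === "mobile") return MOBILE_USER_AGENT_PLACEHOLDER;
  if (value === "marfeel") return "Marfeel-crawler";
  // Any other value is a custom string and is used as-is.
  return value;
}
```

A non-null result could then be handed to Puppeteer's `page.setUserAgent()` before navigation.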
# Enable Puppeteer Ripper
To enable Puppeteer ripper for a tenant, set the `feedRipper` property in the configuration of `definition.json`:
"configuration": {
  ...
  "feedRipper": "puppeteerRipper",
  ...
}
or in the section definition to enable it for a specific section:
"sectionDefinitions": [
  {
    ...
    "feedDefinitions": [
      {
        ...
        "alibabaDefinition": {
          "configuration": {
            "feedRipper": "puppeteerRipper"
          }
        }
        ...
      }
    ]
  }
]
# Puppeteer Ripper Flags in Production
To enable any of the flags described above in production, use the following syntax in the configuration of `definition.json`:
"puppeteerRipper:<FLAG>": "<VALUE>"
For example:
"configuration": {
  ...
  "puppeteerRipper:whiteListedDomains": "randomDomain.com"
  ...
}
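The `puppeteerRipper:<FLAG>` convention maps one-to-one onto the command-line flags. As an illustration only (how the platform actually reads these entries is not described here), a hypothetical `configurationToFlags` helper could translate a configuration object into the equivalent `mrf-puppeteer extract` arguments:

```javascript
const FLAG_PREFIX = "puppeteerRipper:";

// Hypothetical helper: turn "puppeteerRipper:<FLAG>" entries into CLI-style
// arguments, ignoring every other configuration key.
function configurationToFlags(configuration) {
  const args = [];
  for (const [key, value] of Object.entries(configuration)) {
    if (!key.startsWith(FLAG_PREFIX)) continue; // unrelated setting
    args.push(`--${key.slice(FLAG_PREFIX.length)}`, String(value));
  }
  return args;
}
```

Reproducing a production configuration locally this way can help when debugging an extraction with `--dev`.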
# User Interaction Library
User Interaction Library is an API registered on the `window` object and accessible through `window.userInteraction`. It gives Puppeteer access to the `scrollPage` function.
This function lets Puppeteer simulate a user scrolling to the end of the document, so that all content (including lazy-loaded content) is loaded by the time extraction starts.
To enable it, `scrollPage` has to be called in the setup of WhiteCollar.
Example of usage
async setupFunction(callback) {
await window.userInteraction.scrollPage(350, 200, 10);
return callback();
}
# scrollPage function
window.userInteraction.scrollPage(pageScrollAwaitPeriod, articleLoadAwaitInterval, maxArticleLoadScrolls): Promise
Scrolls to the bottom of the DOM, waits for new content to load, and checks whether more content was loaded. If so, it repeats the iteration.
Configuration parameters:
# pageScrollAwaitPeriod
Defines the wait time (for articles to load) after the first scroll.
Default value: 350 ms
# articleLoadAwaitInterval
Defines the wait time (for articles to load) once the document end is reached.
Default value: 2000 ms
# maxArticleLoadScrolls
Defines the maximum number of content loads allowed.
Default value: 10
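Putting the three parameters together, the scroll loop could look roughly like this. It is a sketch of the documented behaviour, not the library's source. `scrollPageSketch` and the `env` object (with hypothetical `scrollToBottom`, `getHeight`, and `wait` members) are introduced here so that the logic does not depend on a real browser. Detecting new content by comparing document heights is an assumption.

```javascript
// Sketch of the scroll loop; returns the number of content loads observed.
async function scrollPageSketch(
  env,
  pageScrollAwaitPeriod = 350,
  articleLoadAwaitInterval = 2000,
  maxArticleLoadScrolls = 10
) {
  // First scroll, then give the page time to react.
  let previousHeight = env.getHeight();
  env.scrollToBottom();
  await env.wait(pageScrollAwaitPeriod);

  let loads = 0;
  while (loads < maxArticleLoadScrolls) {
    // Document end reached: wait for lazy-loaded articles to arrive.
    await env.wait(articleLoadAwaitInterval);
    const currentHeight = env.getHeight();
    // Assumption: a taller document means more content was loaded.
    if (currentHeight <= previousHeight) break;
    loads += 1;
    previousHeight = currentHeight;
    env.scrollToBottom();
  }
  return loads;
}
```

In a page, `env.getHeight` could read `document.documentElement.scrollHeight`, `env.scrollToBottom` could call `window.scrollTo(0, document.documentElement.scrollHeight)`, and `env.wait` could wrap `setTimeout` in a Promise.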