# Extraction flags
The following are all the flags that can be defined in a Tenant's definition.json regarding extraction.
# allowJavascriptLoad, alibabaWaitPageOpen, allowExternalJavascriptLoad
WARNING
Before using these flags (allowJavascriptLoad, alibabaWaitPageOpen, allowExternalJavascriptLoad), make sure you fully understand that they load unknown JavaScript, which might seriously affect Marfeel behaviour and even prevent it from working.
Used on section pages that are loaded with JavaScript. Each flag is, in order, a more effective but more costly way of extracting pages rendered with JavaScript.
Include the following flags one by one, checking the result after each. If extraction is still unsuccessful, add the next one, until all three are used simultaneously.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"allowJavascriptLoad": "true",
"alibabaWaitPageOpen":"true",
"allowExternalJavascriptLoad":"true",
...
},
...
}
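The escalation procedure described above can be sketched in Python. This is only an illustration: `escalation_steps` is a hypothetical helper, not part of any Marfeel tooling, and it assumes the flags are added cumulatively (first one flag, then two, then all three), which is one reading of the instructions.

```python
import json

# Flags in the order the section says to try them.
JS_FLAGS = ["allowJavascriptLoad", "alibabaWaitPageOpen", "allowExternalJavascriptLoad"]


def escalation_steps(flags=JS_FLAGS):
    """Return the configuration fragments to test, cheapest first."""
    # Assumption: each attempt keeps the previous flags and adds the next one.
    return [{flag: True for flag in flags[:count]} for count in range(1, len(flags) + 1)]


for attempt, fragment in enumerate(escalation_steps(), start=1):
    # Each fragment would be merged into "configuration" in definition.json.
    print(attempt, json.dumps(fragment))
```

Stop at the first attempt that extracts the section correctly, so that no more unknown JavaScript is loaded than necessary.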
# articlePathLastParts
If an article path is short, this flag defines the minimum number of words that the last part of the path must contain for the page to be considered an article.
- Type:
number
- Default:
1 if the path has 2 parts or fewer, and 2 for longer paths.
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"articlePathLastParts": 3,
...
},
...
}
WARNING
Before using these flags (articlePathParts and articlePathLastParts), make sure you fully understand how they work.
These flags are meant to be used in URLs following a pattern and NEVER for a single URL case. If you want a specific URL to be detected as a section instead of an article you might want to develop a static section for this specific case.
If you still need to use these flags, keep in mind that you are changing how this tenant is identifying articles and sections, so please make sure to test several articles and sections to check everything is still working properly.
# articlePathParts
Defines the patterns to use to check whether a page is an article or not.
- Type:
number
- Default:
4
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"articlePathParts": 2,
...
},
...
}
# authorBio (MarfeelPress-specific)
Used only on MarfeelPress Tenants. Adds the author's bio in the article details.
- Type:
string
- Format: One of:
  - top: The bio is added before the content of the article.
  - bottom: The bio is added after the content of the article.
Example:
{
...,
"title" : "Title of the awesome example site",
"uri" : "www.example.com",
"configuration" : {
...,
"authorBio" : "bottom",
...
},
...
}
MarfeelPress specific
This flag is only active with the MarfeelPressFetcher.
# blacklist
Avoids the extraction of elements from article pages. See more information in the documentation about blacklist and whitelist.
- Type:
string
- Default:
undefined
- Template: comma-separated list
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"blacklist": "desktop-footer,==off-phones",
...
},
...
}
# blacklistedUrlPatterns
Defines blacklisting content based on URL patterns.
WARNING
The patterns only check against the path, not the domain or the protocol.
It defines an AntPathMatcher.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"blacklistedUrlPatterns": "/*/example-url-pattern.shtml**",
...
},
...
}
WARNING
When blacklisting a whole section, validate that it's not defined within definition.json or, if it is, that it's of type EXTERNAL.
When the tenant is using MarfeelCDN, the pattern has to be blacklisted at CDN level.
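To illustrate that patterns are checked against the path only, here is a minimal Python sketch. It is a simplified approximation of Ant-style matching (`?` for one character, `*` within a path segment, `**` across segments), not the actual AntPathMatcher implementation; `ant_to_regex` and `is_blacklisted` are hypothetical names.

```python
import re
from urllib.parse import urlsplit


def ant_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified Ant-style pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # '**' spans any characters, including '/'
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # '*' stays within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # '?' matches exactly one character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_blacklisted(url: str, pattern: str) -> bool:
    """Match only the path of the URL, ignoring protocol and domain."""
    path = urlsplit(url).path
    return ant_to_regex(pattern).fullmatch(path) is not None


PATTERN = "/*/example-url-pattern.shtml**"
print(is_blacklisted("https://www.example.com/news/example-url-pattern.shtml", PATTERN))
print(is_blacklisted("https://www.example.com/a/b/example-url-pattern.shtml", PATTERN))
```

Under these simplified rules the first URL matches, while the second does not because a single `*` does not cross a `/`.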
# boilerEnableSecureConnections
Enables the secure connections on the Boilerpipe for articles.
This flag is not necessary if:
- The definition already has the hasHttps flag set to true.
- The first section from sectionDefinition uses https.

- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"boilerEnableSecureConnections": "true",
...
},
...
}
# boilerpipeFetcher
Adds a custom RSS fetcher for the Boilerpipe. Review the fetchers article to know more about it.
- Type:
string
- Default:
htmlFetcher
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"boilerpipeFetcher": "tenantRssFetcher",
...
},
...
}
# boilerpipeIgnoreInlineImageDimensions
Sets the order of the getImageDimension methods during the extraction to:
- QueryParam
- FromPath
- File headers
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"boilerpipeIgnoreInlineImageDimensions": true,
...
},
...
}
WARNING
Remove this flag completely to disable it.
Setting it to false won't work.
# boilerpipeIgnoreImageNameDimensions
Sets the order of the getImageDimension methods during the extraction to:
- CustomWidthAndHeightAttr
- WidthAndHeightAttr
- StylesAttr
- File headers
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"boilerpipeIgnoreImageNameDimensions": "true",
...
},
...
}
WARNING
Remove this flag completely to disable it.
Setting it to false won't work.
# boilerpipeUserAgent
Specifies the User Agent that Boilerpipe has to use to browse the site's HTML as rendered in a specified device.
- Type:
string
- Default:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 MarfeelMan
- Format: must be a valid user agent. It is used as-is, with MarfeelMan appended at the end.
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"boilerpipeUserAgent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36",
...
},
...
}
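As a rough illustration of the format rule, the following Python sketch shows the string that would result from the configured value. `effective_user_agent` is a hypothetical name, and the single-space separator is an assumption based on how the default value ends.

```python
MARFEEL_SUFFIX = "MarfeelMan"


def effective_user_agent(configured: str) -> str:
    """Return the configured user agent with the Marfeel suffix appended."""
    # Assumption: the suffix is separated by one space, as in the default value.
    return f"{configured} {MARFEEL_SUFFIX}"


configured = (
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36"
)
print(effective_user_agent(configured))
```

This is why the configured value should not include MarfeelMan itself.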
# cleanerFetcherBlacklist
Defines the blacklist for the cleaner fetcher if it's selected as the Boilerpipe fetcher.
- Type:
string
- Default:
undefined
- Format: comma-separated list of DOMString
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"cleanerFetcherBlacklist": ".aside-inner, .block.comments",
...
},
...
}
# cronRefresh
Defines the frequency of section reloads according to cronmaker.
- Type:
string
- Default: Every hour (at a random exact time to balance the load on our servers)
- Format: Must be a cron expression.
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"cronRefresh": "0 0/3 * 1/1 * ? *",
...
},
...
}
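The example value has seven space-separated fields, which is consistent with a Quartz-style cron expression (seconds, minutes, hours, day of month, month, day of week, optional year). Assuming that format, this Python sketch names each field so the expression is easier to read; `split_cron` is a hypothetical helper and does not validate field contents.

```python
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week", "year"]


def split_cron(expression: str) -> dict:
    """Label the fields of a Quartz-style cron expression (year is optional)."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


for name, value in split_cron("0 0/3 * 1/1 * ? *").items():
    print(f"{name:>12}: {value}")
```

Read this way, the minutes field `0/3` means every 3 minutes starting at minute 0.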
TIP
To configure for a specific section, refer to this article
# customTagActions
Transforms an HTML tag into another element.
- Type:
string
- Default:
undefined
- Format: One of:
GenericVideoAttrElement
ImageElement
PinterestElement
CustomStyleElement
IgnorableElement
IframeElement
DIVElement
ScriptMetadataElement
NextPageTagAction
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"customTagActions": "PICTURE:DIVElement",
...
},
...
}
In this case, it will transform the PICTURE HTML elements into DIV elements.
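A value such as PICTURE:DIVElement can be checked before deploying with a small Python sketch like the one below. `parse_custom_tag_actions` is a hypothetical helper, and the comma separator for multiple TAG:Action pairs is an assumption that the source does not confirm.

```python
# Actions listed in the Format section above.
KNOWN_ACTIONS = {
    "GenericVideoAttrElement", "ImageElement", "PinterestElement",
    "CustomStyleElement", "IgnorableElement", "IframeElement",
    "DIVElement", "ScriptMetadataElement", "NextPageTagAction",
}


def parse_custom_tag_actions(value: str) -> dict:
    """Parse 'TAG:Action' pairs and reject actions that are not listed."""
    mapping = {}
    for pair in value.split(","):  # assumption: pairs are comma-separated
        tag, _, action = pair.strip().partition(":")
        if action not in KNOWN_ACTIONS:
            raise ValueError(f"unknown action {action!r} for tag {tag!r}")
        mapping[tag.upper()] = action
    return mapping


print(parse_custom_tag_actions("PICTURE:DIVElement"))
```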
# defaultTopMediaMediaSelectorStrategy
Selects the Top Media based on a specified option. The available options are included in MediaSelector.java in Gutenberg.
- Type:
string
- Default:
FORCE_DETAIL
- Format: One of:
  - HINT_OR_DETAIL: The image is extracted from section pages. If not there, it is extracted from article pages.
  - DETAIL_OR_HINT: This is the default value. With this strategy, Marfeel first tries to extract the image from article pages, before moving on to section pages.
  - FORCE_DETAIL: The image is extracted from article pages.
  - FORCE_HINT: The image is extracted from section pages.
  - TOPMEDIA: The image is extracted from the top media.
  - META_OR_DETAILS: The meta image is extracted. If not there, the image is extracted from article pages.
  - HINT_OR_META: The image is extracted from section pages. If not there, the meta image is extracted.
  - DETAIL_OR_HINT_OR_META: The image is extracted first from article pages, then section pages, and then the meta.
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"defaultTopMediaMediaSelectorStrategy": "DETAIL_OR_HINT",
...
},
...
},
...
}
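The fallback order of each strategy can be sketched in Python as follows. This only models the descriptions above and is not the MediaSelector.java implementation. `select_image` and its parameters are hypothetical: `hint` stands for the image found on the section page, `detail` for the one on the article page, and `meta` for the meta image. TOPMEDIA is left out because its source is not described in enough detail.

```python
from typing import Optional

# Candidate sources per strategy, in the order they are tried.
STRATEGY_ORDER = {
    "HINT_OR_DETAIL": ("hint", "detail"),
    "DETAIL_OR_HINT": ("detail", "hint"),
    "FORCE_DETAIL": ("detail",),
    "FORCE_HINT": ("hint",),
    "META_OR_DETAILS": ("meta", "detail"),
    "HINT_OR_META": ("hint", "meta"),
    "DETAIL_OR_HINT_OR_META": ("detail", "hint", "meta"),
}


def select_image(strategy: str, hint: Optional[str] = None,
                 detail: Optional[str] = None, meta: Optional[str] = None) -> Optional[str]:
    """Return the first available image according to the strategy, or None."""
    if strategy not in STRATEGY_ORDER:
        raise ValueError(f"unsupported strategy: {strategy}")
    sources = {"hint": hint, "detail": detail, "meta": meta}
    return next((sources[name] for name in STRATEGY_ORDER[strategy] if sources[name]), None)


print(select_image("DETAIL_OR_HINT", hint="section.jpg", detail="article.jpg"))
print(select_image("HINT_OR_META", meta="og.jpg"))
```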
# defaultMediaSelectorStrategy
Defines how the image used in section pages is selected. If only a certain group of articles needs it, it is recommended to use forceStrategy in the whiteCollar instead.
- Type:
string
- Default:
HINT_OR_DETAIL
- Format: Same values as defaultTopMediaMediaSelectorStrategy
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"defaultMediaSelectorStrategy": "META_OR_DETAILS",
...
},
...
},
...
}
# detailsExcerpt (MarfeelPress-specific)
Used only on MarfeelPress Tenants. Adds the excerpt returned by the boilerpipePressExtractor.
- Type:
boolean
- Default:
false
Example:
{
...,
"title" : "Title of the awesome example site",
"uri" : "www.example.com",
"configuration" : {
...,
"detailsExcerpt" : "true",
...
},
...
}
MarfeelPress specific
This flag is only active with the MarfeelPressFetcher.
# detailItemsProcessor
Allows choosing a different item processor. When a webpage is slow or there is a lot of content to extract, using the throttled processor makes the process more robust.
- Type:
string
- Default:
detailItemsProcessor
- Format: Must be an implementation of DetailItemsProcessor, such as throttledDetailItemsProcessor
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"detailItemsProcessor": "throttledDetailItemsProcessor",
...
},
...
}
# disableAMPCacheForImages
If set to true, the src of the image will be AMP_CACHE_URL_imageURL, where AMP_CACHE_URL is https://cdn.ampproject.org/i/
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"disableAMPCacheForImages": true,
...
},
...
},
...
}
WARNING
Remove this flag completely to disable it.
Setting it to false won't work.
# disabledConsumerInvalidation
- Type:
boolean
- Default:
false
Disables default article invalidation configuration when true.
Default invalidation
When a consumer gets a request for an article older than 24h, it extracts it again.
Not necessary with MarfeelPress, since that content is refreshed via API calls.
# pageNumberStartsFromZero
- Type:
boolean
- Default:
false
This pagination flag alters section page number calculation. When set to true, it will consider that the page number starts from zero instead of one.
Example:
page 1 -> https://www.diariodemorelos.com/noticias/categories/virales
page 2 -> https://www.diariodemorelos.com/noticias/categories/virales?page=1
page 3 -> https://www.diariodemorelos.com/noticias/categories/virales?page=2
In this example, by default the pagination links will show the wrong label and trigger a page links inconsistency exception and an alarm. Setting pageNumberStartsFromZero to true allows us to support this behaviour properly.
Example:
{
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
"pageNumberStartsFromZero": "true"
}
}
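The effect on page numbering can be sketched in Python. `page_label` is a hypothetical helper and the `page` parameter name is taken from the example URLs above. It only models the arithmetic, not how Gutenberg computes the labels.

```python
from urllib.parse import parse_qs, urlsplit


def page_label(url: str, starts_from_zero: bool = False, param: str = "page") -> int:
    """Return the human-readable page number for a section URL."""
    values = parse_qs(urlsplit(url).query).get(param)
    if not values:
        return 1  # no page parameter: this is the first page
    number = int(values[0])
    # With zero-based pagination, ?page=1 is actually the second page.
    return number + 1 if starts_from_zero else number


base = "https://www.diariodemorelos.com/noticias/categories/virales"
for url in (base, base + "?page=1", base + "?page=2"):
    print(page_label(url, starts_from_zero=True), url)
```

Without the flag, `?page=1` would be labelled as page 1, the same label as the first page, which is the inconsistency described above.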
# disableDefaultPagePattern
- Type:
boolean
- Default:
false
This flag disables the default page pattern set by Gutenberg: "/page/([0-9]+)/?".
Only the page patterns defined in the tenant's definition configuration and in the pagePatterns of specific section definitions will be applied.
# disablePhantomDiskCache
When enabled, disables cache when using PhantomJs.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"disablePhantomDiskCache": true,
...
},
...
}
# disableProxyScripts
When set to true, scripts do not go through the cache.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"disableProxyScripts": true,
...
},
...
},
...
}
# disableSectionValidation
Disables section validation, which would normally avoid duplicated sections.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"disableSectionValidation": true,
...
},
...
}
WARNING
Remove this flag completely to disable it.
Setting it to false won't work.
# dynamicItemContentConfiguration
Extracts the specified content block from the DOM of the client. Later it can be consumed from any JSP file that you specify.
- Type:
string
- Default:
undefined
- Format: ;-separated list of pairs, each a CSS selector followed by a comma and the widget name: .exampleSelector,exampleWidget;#someID > div,otherWidget
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"dynamicItemContentConfiguration": ".generic-widget > .discounts,dynamicContentWidget;.news-related,newsRelatedWidget",
...
},
...
}
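To make the format concrete, this Python sketch splits a value into (selector, widget name) pairs. `parse_dynamic_item_content` is a hypothetical helper, and it assumes the widget name is whatever follows the last comma of each entry, so that a selector may itself contain commas.

```python
from typing import List, Tuple


def parse_dynamic_item_content(value: str) -> List[Tuple[str, str]]:
    """Split 'selector,widget;selector,widget' into (selector, widget) pairs."""
    pairs = []
    for entry in value.split(";"):
        # Assumption: the widget name comes after the last comma of the entry.
        selector, _, widget = entry.rpartition(",")
        pairs.append((selector.strip(), widget.strip()))
    return pairs


value = ".generic-widget > .discounts,dynamicContentWidget;.news-related,newsRelatedWidget"
for selector, widget in parse_dynamic_item_content(value):
    print(f"{widget}: {selector}")
```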
How to use it: the following example shows how to get the HTML of the previously selected content block in the JSP file, following the example .generic-widget > .discounts,dynamicContentWidget:
<%@taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<c:set var="detailWidgetDiscounts" value="${item.getDetailItem().getWidget(null, '', 'dynamicContentWidget')}" scope="request" />
<c:if test="${detailWidgetDiscounts != null}">
${detailWidgetDiscounts.getHtml()}
</c:if>
For AMP, you will also need to add the jigsaw:ampTranslator tag:
<%@taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%@taglib prefix="jigsaw" uri="http://dev.marfeel.com/jsp/mrf/jigsaw" %>
<c:set var="detailWidgetDiscounts" value="${item.getDetailItem().getWidget(null, '', 'dynamicContentWidget')}" scope="request" />
<c:if test="${detailWidgetDiscounts != null}">
<jigsaw:ampTranslator>
${detailWidgetDiscounts.getHtml()}
</jigsaw:ampTranslator>
</c:if>
See a real tenant example on GitHub.
# dynamicSectionAllowedQueryParams
Avoids stripping the specified query parameters when extracting a section. Use it if a section path is different depending on query parameters. This flag is expected to only work with dynamic sections, but it is possible to use it with the default section type too.
{
"name" : "section_name",
"title" : "Section title",
"configuration" : {
"dynamicSectionAllowedQueryParams" : "tag"
},
...
}
Be mindful of the query parameter
Don't use this flag for just any query parameter. If you plan to keep a parameter called page or p whose value is a number, be careful: you might be re-creating section pagination!
# enableUnsecureMedia
By default, Marfeel forces all media sources to go through https. With this flag, Marfeel keeps http if it is used in the original site.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"enableUnsecureMedia": true,
...
},
...
},
...
}
# extractImagesFromNoScript
Enables the extraction of images located inside <noscript> tags.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"extractImagesFromNoScript": "true",
...
},
...
}
# extractionQueryParams
This flag adds parameters to the extraction query.
- Type:
string
- Default:
undefined
- Format: query string
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"extractionQueryParams": "mrfCacheBuster={timestamp}&key=value",
...
},
...
}
{timestamp} is automatically replaced by a timestamp at extraction time.
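The substitution can be illustrated with a short Python sketch. `with_extraction_params` is a hypothetical helper, and using epoch milliseconds for the timestamp is an assumption, since the source does not state the unit.

```python
import time
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def with_extraction_params(url: str, template: str, now_ms: Optional[int] = None) -> str:
    """Append the configured query string to a URL, filling in {timestamp}."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # assumption: epoch milliseconds
    extra = template.replace("{timestamp}", str(now_ms))
    parts = urlsplit(url)
    # Keep any query string the URL already has and append the extra params.
    query = f"{parts.query}&{extra}" if parts.query else extra
    return urlunsplit(parts._replace(query=query))


print(with_extraction_params(
    "https://www.example.com/section",
    "mrfCacheBuster={timestamp}&key=value",
))
```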
Partial Deprecation
Using this flag only as a cache buster is deprecated. Use the mrfCacheBuster flag for this purpose.
# mrfCacheBuster
This flag adds the mrfCacheBuster=${actualTimestamp} query parameter to the extraction query. It is a simplified way of using the extractionQueryParams flag, and it is recommended that you use mrfCacheBuster instead of extractionQueryParams when you include the timestamp parameter only to avoid cache issues.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"mrfCacheBuster": true,
...
},
...
}
# feedRipper
It defines the way articles are extracted.
In the case of rssRipper, the uri in sectionDefinitions must be in XML format. New tenants should not use this option.
- Type:
string
- Default:
whiteCollarRipper
- Format: One of:
marfeelPressRipper
whiteCollarRipper
jsoupRipper
rssRipper
puppeteerRipper
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"feedRipper": "jsoupRipper",
...
},
...
}
# galleryBlackList
Prevents an image from being processed as a gallery image in an article page, and treats it as part of the article's textual content. This is especially useful for images that are links or buttons.
- Type:
string
- Default:
undefined
- Format: DOMString
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"galleryBlackList": ".author img,[src*='gravatar']",
...
},
...
}
# greedyWhitelist
It prioritizes the whitelist over the blacklist. If set to true, an element that is both blacklisted and whitelisted will show in article pages.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"greedyWhitelist": "true",
...
},
...
}
# hasHttps
This flag is used when editing the code of a Tenant to avoid discrepancies between HTTPS sites and the local environment.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"hasHttps": "true",
...
},
...
}
# ignoreSSLErrors
Ignores SSL errors on the PhantomJS command.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"ignoreSSLErrors": true,
...
},
...
}
WARNING
Remove this flag completely to disable it.
Setting it to false won't work.
# imageCaptionFromAttributes
Specifies the attribute name from the HTML element to be used for the image caption.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"imageCaptionFromAttributes": "data-source-name",
...
},
...
}
# imageResizer
This flag removes the mrf-detailsMedia and mrf-rDetailsMedia classes from an image and adds mrf-noResizeImage.
- Type:
string
- Default:
undefined
- Format: DOMString
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"imageResizer": ".journalist-photo",
...
},
...
}
It is used with the imageResizer($width, $height); mixin in custom.scss to set a new size. For example:
@include imageResizer(78px, 78px);
# imageRulerSizeAttribute
Custom attribute to get the image dimensions from. Useful for example with some lazy-loading strategies.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"imageRulerSizeAttribute": "data-mrf-width,data-mrf-height",
...
},
...
}
# imageSrcAttribute
The attribute to use to get image sources, instead of src. Useful, for example, with some lazy-loading strategies.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"imageSrcAttribute": "href",
...
},
...
}
# imageSrcSetAttribute
If the images of a Tenant have invalid srcset links but have valid links inside a data-srcset attribute, this flag can be used.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"imageSrcSetAttribute": "data-lazy-srcset",
...
},
...
}
# itemExtractorType
Chooses between premium (paid content) and Boilerpipe extractors.
- Type:
string
- Default:
boilerpipeExtractor
- Format: One of:
boilerpipeExtractor
boilerpipePressExtractor
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"itemExtractorType": "boilerpipePressExtractor",
...
},
...
}
# boilerpipeCharset
This flag allows you to parametrize the charset used to extract the details HTML. Use it when the Marfeel version shows wrong text or wrong characters, which can happen when extraction uses a different charset than the tenant uses on desktop.
- Type:
string
- Default:
UTF-8
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"boilerpipeCharset": "iso-8859-1",
...
},
...
}
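The kind of problem this flag addresses can be reproduced with a few lines of Python. This is a generic illustration of a charset mismatch, not Marfeel code; `decode_details` is a hypothetical name.

```python
def decode_details(raw: bytes, charset: str = "utf-8") -> str:
    """Decode fetched HTML bytes, replacing bytes that are invalid in the charset."""
    return raw.decode(charset, errors="replace")


# A page served as ISO-8859-1 ...
raw = "Título del artículo".encode("iso-8859-1")

# ... decoded with the default UTF-8 shows replacement characters,
print(decode_details(raw))
# while decoding with the tenant's real charset restores the text.
print(decode_details(raw, "iso-8859-1"))
```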
# jsoupCharset
Same as boilerpipeCharset, but it applies to the JSOUP ripper.
- Type:
string
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"jsoupCharset": "iso-8859-1",
...
},
...
}
# jsoupImageSrcAttribute
Same as imageSrcAttribute, but it is used when extracting with jsoup instead of the whitecollar.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"jsoupImageSrcAttribute": "src",
...
},
...
}
# marfeelPressToken (Deprecated)
Deprecated
This flag is deprecated. There is no need to add it for new tenants, and it can be safely removed from any definition.json.
Used only on MarfeelPress Tenants. It defines the Marfeel API Token that is needed to authenticate with WordPress sites.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"marfeelPressToken" : "12345678ABCDEFGH",
...
},
...
}
# maxConcurrentExtractionRequests
Defines the maximum number of concurrent extractions of article pages. Useful to throttle the extraction.
- Type:
number
- Default:
3
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"maxConcurrentExtractionRequests": 1,
...
},
...
}
# minImageSize
Defines the minimum height and width used to filter images to keep in the Boilerpipe MinSizeFilter.java.
- Type:
number
- Default:
90
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"minImageSize": 40,
...
},
...
}
# minWordsToConsiderFar
Defines the minimum number of words that must precede an image in the article body for it to be considered far from the top. An image used as top media that is at least this far into the article is also displayed, duplicated, within the body of the text.
- Type:
string
- Default:
70
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"minWordsToConsiderFar": "300",
...
},
...
},
...
}
# multipageGenerator
Defines the query selector multipage generator for the tenant.
- Type:
string
- Default:
undefined
- Format: DOMString
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"multipageGenerator": ".md-item-media,.swiper-slide",
...
},
...
}
# multipageTitleSelector
Defines the query selector for the multipage title.
- Type:
string
- Default:
undefined
- Format: DOMString
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"multipageTitleSelector": ".titleRanking",
...
},
...
}
# multipageUriGenerator
Defines a URI generator according to the string entered.
- Type:
string
- Default:
idUriGenerator
- Format: Must be an implementation of AbstractUriGenerator.
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"multipageUriGenerator": "pageIndexUriGenerator",
...
},
...
}
# nextArticlesStrategy
Defines how the next articles are selected and filtered.
- Type:
string
- Default:
VALID_ITEM
- Format: One of:
NO_FILTER
NO_WIDGET
HAS_DETAILS
VALID_ITEM
HAS_VALID_ITEMS
WIDGET_ITEM
Example: If the nextArticles were to only use specific widget items, it would resemble the following:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"nextArticlesStrategy" : "WIDGET_ITEM,envivoIframe",
...
},
...
},
...
}
# nextPageBlacklist
Defines the elements to omit from the subsequent pages of an article.
Works like the blacklist.
"nextPageBlacklist": "next_pages_elements_to_remove"
# nextPageWhitelist
Defines the elements to include from subsequent pages of an article.
Works like the whitelist.
"nextPageWhitelist": "next_pages_elements_to_include"
# nextPageLimit
Defines the maximum number of next pages to be extracted.
- Type:
number
- Default:
35
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"nextPageLimit": 100,
...
},
...
}
# notSelectableImages
Defines the images not to be used as Top Media (for example, images used in a photo slider or avatars for authors).
- Type:
string
- Default:
undefined
- Format: DOMString
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"notSelectableImages": ".rslides img",
...
},
...
}
See a usage example in the embed Gallery guide.
# pagePattern
Sets up a global pagePattern that will be applied to all the sections. It needs to contain the path that defines a page and the matching group needed in order to find the page number.
This global pagePattern won't be applied on sections with "enablePagination": "false".
- Type:
string
- Default:
"/page/([0-9]+)/?"
- Format: regular expression
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"pagePattern": "/p/([0-9]+)",
...
},
...
}
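How a page pattern yields a page number can be sketched with Python's `re` module. `page_number` is a hypothetical helper, and searching anywhere in the path (rather than anchoring the match) is an assumption about how the pattern is applied.

```python
import re
from typing import Optional

DEFAULT_PAGE_PATTERN = r"/page/([0-9]+)/?"


def page_number(path: str, pattern: str = DEFAULT_PAGE_PATTERN) -> Optional[int]:
    """Return the page number captured by the first group, or None."""
    match = re.search(pattern, path)
    return int(match.group(1)) if match else None


print(page_number("/sports/page/3/"))             # default pattern
print(page_number("/sports/p/7", r"/p/([0-9]+)"))  # pattern from the example
print(page_number("/sports/"))                     # not a paginated URL
```

The matching group in parentheses is what makes the page number recoverable, so a custom pattern without one cannot work.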
WARNING
The global pagePattern of a definition's configuration only applies to dynamic sections.
WARNING
The global pagePattern of a definition's configuration is not applied on sections that have more than one feedDefinition.
# quartzInvalidation
Enables or disables the invalidation scheduler (scheduleSectionInvalidationTasks).
- Type:
boolean
- Default:
true
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"quartzInvalidation": "false",
...
},
...
}
TIP
This flag should be false if the tenant is using the invalidation API. Tenants using the MarfeelPress Plugin use the invalidation API by default.
# queryParamsWhitelist
Defines the allowed, but not mandatory, query params for a URL on article extraction. The rest of the query params will be excluded.
For example, it can be used to extract article pages that use URL query params. In the following URL the query param is page:
https://www.example.com/example-url.html?page=0%2C2
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"queryParamsWhitelist": "page",
...
},
...
}
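The filtering this flag describes can be approximated with Python's `urllib.parse`. `keep_whitelisted_params` is a hypothetical helper, and treating the flag value as a comma-separated list of parameter names is an assumption, since the source only shows a single name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_whitelisted_params(url: str, whitelist: str) -> str:
    """Drop every query parameter that is not named in the whitelist."""
    allowed = {name.strip() for name in whitelist.split(",")}  # assumed format
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(keep_whitelisted_params(
    "https://www.example.com/example-url.html?page=0%2C2&ref=home", "page"
))
```

The whitelisted parameter is allowed but not mandatory: a URL without `page` simply comes out with no query string.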
TIP
This flag can be used to allow query params that Gutenberg blacklists, like utm query params (for example, utm_source).
# respectTopMediaRatio
Forces Top Media to have the same ratio as the original image.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"respectTopMediaRatio": true,
...
},
...
},
...
}
# sanitizeContent
When enabled, the HTMLDocumentProcessor class sanitizes HTML.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"sanitizeContent": "true",
...
},
...
}
# showCategoriesInDetails
If set to true, categories will show on article pages.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"showCategoriesInDetails": true,
...
},
...
}
MarfeelPress specific
This flag is only active with the MarfeelPressFetcher.
# showBreadcrumbsInDetails (MarfeelPress-specific)
If set to true, breadcrumbs will be generated on article pages.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"showBreadcrumbsInDetails": true,
...
},
...
}
MarfeelPress specific
This flag is only active with the MarfeelPressFetcher.
# skipAmpCssCheck
Deactivates the AMP file size check (AMP has a 50,000-byte limit).
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"skipAmpCssCheck": true,
...
},
...
},
...
}
Invalid AMP Pages
This flag leads to invalid AMP pages.
# skipDate (MarfeelPress-specific)
Used only on MarfeelPress Tenants. When set to true, the date does not appear in the article details.
- Type:
boolean
- Default:
false
Example:
{
...,
"title" : "Title of the awesome example site",
"uri" : "www.example.com",
"configuration" : {
...,
"skipDate" : "true",
...
},
...
}
MarfeelPress specific
This flag is only active with the MarfeelPressFetcher.
# skipSubtitle (MarfeelPress-specific)
By default, MarfeelPress always displays article tags as subtitles. If this flag is on, article tags are not extracted and never displayed in an article.
- Type:
boolean
- Default:
false
Usage:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"skipSubtitle": true,
...
},
...
}
# useLegacyAlibaba
When enabled, the old Alibaba version is used.
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"useLegacyAlibaba": true,
...
},
...
}
# useSniVerifier
Enables Server Name Indication verifications (that is, it uses the HTMLfetcher SNI verification).
- Type:
boolean
- Default:
false
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"useSniVerifier": true,
...
},
...
}
# videoProviders
List of the video providers useful for the current tenant.
- Type:
string
- Default:
undefined
- Format: comma-separated list. Contains the name property of any implementation of the VideoDetector.
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"videoProviders": "brightcoveAllYou,brightcoveAds",
...
},
...
}
# whiteCollarUserAgent
Specifies the User-Agent that whiteCollar has to use to browse the site's HTML as rendered in a specified device.
- Type:
string
- Default:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 MarfeelMan
- Format: One of:
  - mobile: translates to "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4 MarfeelMan".
  - marfeel: translates to "Marfeel-crawler".
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"whiteCollarUserAgent": "mobile",
...
},
...
}
Different values
Some existing definitions set different values for this flag.
Those cases will always fall back to the default.
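The alias resolution and fallback described above can be summarised in a short Python sketch. The user-agent strings are the ones listed in this section; `resolve_white_collar_user_agent` is a hypothetical name, not the actual implementation.

```python
from typing import Optional

DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 MarfeelMan"
)

USER_AGENT_ALIASES = {
    "mobile": (
        "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 "
        "(KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4 MarfeelMan"
    ),
    "marfeel": "Marfeel-crawler",
}


def resolve_white_collar_user_agent(value: Optional[str] = None) -> str:
    """Map a whiteCollarUserAgent value to the user agent actually sent."""
    # Any value other than the two known aliases falls back to the default.
    return USER_AGENT_ALIASES.get(value, DEFAULT_USER_AGENT)


for value in ("mobile", "marfeel", "tablet", None):
    print(repr(value), "->", resolve_white_collar_user_agent(value))
```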
# whitelist
Enables the extraction of elements from article pages. See more information in the documentation about blacklist and whitelist.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"whitelist": "[href=/author/],slideshow-subtitle",
...
},
...
}
# whiteCollarScript
Establishes the path of the default WhiteCollar file to be used by the sections on sectionDefinitions.
Needs to be placed in the configuration of the definition.json.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"whiteCollarScript": "index/src/whiteCollar/main.js",
...
},
...
}
TIP
To configure for a specific section, refer to this article
# widgets
Defines the widgets to be used.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"userInterface":{
...,
"features":{
...,
"widgets": "mostRead",
...
},
...
},
...
}
# validArticleQueryParams
Some Tenants have articles that are built with query parameters. To replicate these articles on the customer's Marfeel PWA, this flag has to be used with the definitions to identify a valid article.
- Type:
string
- Default:
undefined
Example:
{
...,
"title":"Title of the awesome example site",
"uri":"www.example.com",
"configuration":{
...,
"validArticleQueryParams": "&aid=,&MAID=,&MFID=",
...
},
...
}