# Article pages extraction
Article pages are extracted through Marfeel Boilerpipe.
Generally, article pages are transformed according to these principles:
- Only content that is relevant for mobile is extracted. This avoids oversized elements from the desktop version, as well as links from the header, sidebar, or footer that are not useful. The extraction is automatic, but it can fail to extract parts of the text that are needed. To add or remove content, use the `whitelist` and `blacklist` flags in `definition.json`.
- The Tenant's images and video embeds are detected and kept in Marfeel's database. This enables features that enhance the UX. For example, images can be turned into a gallery.
- Links in multipage articles are followed. We detect where the different links are and follow them to show the entire content.
- Headings and other crucial HTML tags are distinguished from the rest of the content.
- Image articles are turned into a gallery.
Specifically, an article processed by Boilerpipe goes through:
- Fetchers
- Extractor
- SAXProcessors
# Fetchers
Fetchers are the first piece of code to run. They send a request to the Tenant's site, retrieve the content, and bring it into Marfeel.
- HTMLFetcher: Obtains through an HTTP request the content from the Tenant's web servers.
- MarfeelPressFetcher: Obtains through an HTTP request the content from the Tenant's WordPress server.
- JsoupFetcher: Extracts the content from the Tenant's web servers with an HTTP request made through JSOUP.
- CleanerFetcher: Allows the extraction of complicated elements that don't have regular ids or classes, but it performs poorly. It must only be used as a last resort.
- ReactFetcher: Enriches the received HTML before creating an HTMLDocument.
- StringFetcher: Only used by the Marfeel SEO service to create "lorem ipsum" pages.
- FileSystemHtmlFetcher: Only used in tests to read HTML from a file.
# MarfeelExtractor
The MarfeelExtractor in Boilerpipe scans the HTML for text. It disregards any markup related to images, tags, and so on.
WARNING
By default, we always use `Boilerpipe`. It is important to know that we cannot mix boilerpipes depending on sections: one article can belong to multiple sections, which would lead to inconsistencies.
To determine the remainder of the text to extract, Boilerpipe is governed by several heuristics. For example, to identify the first paragraph of the article, it searches and identifies the largest block of text close to the title.
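As a rough illustration of the "largest block of text close to the title" idea, the sketch below picks the longest text block within a few positions after the title. The function name, the `maxDistance` parameter, and the sample blocks are assumptions made for this example; they are not Boilerpipe's actual heuristics.

```javascript
// Hypothetical heuristic: among the text blocks that follow the title within
// `maxDistance` positions, return the longest one (or null if there are none).
function findFirstParagraph(blocks, titleIndex, maxDistance = 3) {
  let best = null;
  const end = Math.min(blocks.length, titleIndex + 1 + maxDistance);
  for (let i = titleIndex + 1; i < end; i += 1) {
    if (best === null || blocks[i].length > best.length) {
      best = blocks[i];
    }
  }
  return best;
}

// Sample page, flattened into text blocks in document order (made-up data).
const blocks = [
  'Home | News | Sports',
  'Local team wins the cup', // the title, at index 1
  'By Jane Doe',
  'The local team won the cup on Sunday after a dramatic final that went to penalties.',
  'Share',
  'Related: ten other stories you might like, from transfers and injuries to match previews and league tables.',
];

const firstParagraph = findFirstParagraph(blocks, 1);
console.log(firstParagraph);
```

The last block is the longest on the page, but it sits outside the window after the title, so the opening sentence of the article is chosen instead.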
Some of the processors are:
- TitleProcessor: Detects the title of the page or article by searching for a string of text that is similar to the page URL.
- ContentProcessor: Detects what is content and what is not. It looks for the biggest paragraph in the body.
- PremiumProcessor: Checks if there is a premium paywall.
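To make the TitleProcessor description more concrete, here is a minimal sketch that scores candidate strings by how many words they share with the URL path. The function names, the tokenization, and the Jaccard-style score are assumptions chosen for illustration, not Marfeel's implementation.

```javascript
// Hypothetical helper: split a string into a set of lowercase word tokens.
function tokenize(text) {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Hypothetical score: shared tokens divided by all distinct tokens (Jaccard index).
function urlSimilarity(candidate, url) {
  const slugTokens = tokenize(new URL(url).pathname);
  const candidateTokens = tokenize(candidate);
  const union = new Set([...slugTokens, ...candidateTokens]);
  if (union.size === 0) return 0;
  let shared = 0;
  for (const token of candidateTokens) {
    if (slugTokens.has(token)) shared += 1;
  }
  return shared / union.size;
}

// Return the candidate most similar to the URL, or null if there are no candidates.
function detectTitle(candidates, url) {
  let best = null;
  let bestScore = -1;
  for (const candidate of candidates) {
    const score = urlSimilarity(candidate, url);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}

// Made-up page URL and candidate strings.
const pageUrl = 'https://example.com/sports/local-team-wins-the-cup';
const titleCandidates = ['Subscribe to our newsletter', 'Local team wins the cup', 'Sports'];
console.log(detectTitle(titleCandidates, pageUrl));
```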
# SAXProcessors
A SAXProcessor processes the HTML as a stream and acts only on specific HTML tags, so the unnecessary parts of the document are never kept in memory.
This is done with TagActions.
TagActions process one tag at a time, so it is not necessary to keep the whole HTML in memory. This strategy is faster and uses memory more efficiently.
Every TagAction can have:
- StartElement
- Characters
- EndElement
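The three callbacks can be sketched as follows. Only the callback names come from the list above; the event format, the `runEvents` driver, and the paragraph-collecting action are hypothetical stand-ins for illustration, not Marfeel's TagAction API.

```javascript
// Hypothetical tag action: collects the text of each <p> as events stream by.
function createParagraphAction() {
  const paragraphs = [];
  let insideParagraph = false;
  let buffer = '';
  return {
    paragraphs,
    startElement(name) {
      if (name === 'p') {
        insideParagraph = true;
        buffer = '';
      }
    },
    characters(text) {
      // Text outside a <p> is dropped immediately and never stored.
      if (insideParagraph) buffer += text;
    },
    endElement(name) {
      if (name === 'p' && insideParagraph) {
        paragraphs.push(buffer.trim());
        insideParagraph = false;
      }
    },
  };
}

// Hypothetical driver: hands events to the action one at a time.
function runEvents(events, action) {
  for (const event of events) {
    if (event.type === 'start') action.startElement(event.name);
    else if (event.type === 'text') action.characters(event.text);
    else if (event.type === 'end') action.endElement(event.name);
  }
}

// Events a SAX-style parser could emit for:
// <div><p>Hello <b>world</b></p><span>ad</span><p>Bye</p></div>
const events = [
  { type: 'start', name: 'div' },
  { type: 'start', name: 'p' },
  { type: 'text', text: 'Hello ' },
  { type: 'start', name: 'b' },
  { type: 'text', text: 'world' },
  { type: 'end', name: 'b' },
  { type: 'end', name: 'p' },
  { type: 'start', name: 'span' },
  { type: 'text', text: 'ad' },
  { type: 'end', name: 'span' },
  { type: 'start', name: 'p' },
  { type: 'text', text: 'Bye' },
  { type: 'end', name: 'p' },
  { type: 'end', name: 'div' },
];

const action = createParagraphAction();
runEvents(events, action);
console.log(action.paragraphs);
```

The action only keeps the text it cares about, which is the property that lets this style of processing avoid holding the whole document.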
There are two SAXProcessors in Marfeel:
# ImageDocumentSAXProcessor
Detects the media and other important elements on the Tenant's page, keeping track of their correct order. Specifically, the ImageDocumentSAXProcessor detects the videos, images, and commenting systems supported by Marfeel in a Tenant's article.
This detection process is done by Detectors.
# HTMLDocumentSAXProcessor
Once all the important elements have been detected, HTMLDocumentSAXProcessor is responsible for replacing the markup of each of them.
For example, if we detect multiple images but only want to show some of them, we change the HTML so that what is shown is a span marked mrf-image with different characteristics.
This is the step where the whitelist, the blacklist, and the other configurations set in the Tenant's definition.json file are applied.
This replacement process is done by Replacers.
# DocumentModifiers
DocumentModifiers implement the features that boost engagement in the Marfeel solution.
For example, images within the same container are detected, extracted, and turned into collapsed galleries in the publisher's Marfeel PWA to optimize the UI. A DocumentModifier works on a JSOUP element rather than as a SAXProcessor, which is what allows this kind of change.
There are various types of DocumentModifiers with different features. Each is defined in the Tenant's `definition.json` file. Some of the modifiers are:
- CollapsedGalleries (GalleryGrid)
- MultipageGenerator
- ImageMarker
- ImageBlacklist
- ImageNotSelectable
# IframeDocumentModifier
IframeDocumentModifier is a DocumentModifier that enables publishers to add an iframe within Marfeel that is not on their desktop version.
To enable it, the tenant needs to add the `data-mrf-iframe` attribute to the HTML element that needs to be replaced by an iframe. As a value, add the iframe source.

```html
<div data-mrf-iframe="www.myIframeSource.com/cool-url-here"></div>
```
WARNING
No further action is required for it to work, and tenants have full control over this feature. However, we will probably need to add the new element to the `whitelist` so it can be extracted.
To set the desired dimensions:
- Add `width` and/or `height` to the attributes of the HTML element where `data-mrf-iframe` was added.

```html
<div data-mrf-iframe="www.myIframeSource.com/cool-url-here" height="myHeight" width="myWidth"></div>
```
TIP
The values for `width` and `height` must follow the HTML specification for iframe attributes: they are strings, and default to `px` if the unit is not present.
- Add this postMessage call at the very end of the iframe's HTML.

```javascript
// Report the rendered height of the iframe's content to the parent page.
window.parent.postMessage({
  sentinel: 'amp',
  type: 'embed-size',
  height: document.body.scrollHeight
}, '*');
```
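For context, the page hosting the iframe has to listen for this message and apply the reported height. The sketch below shows one way a host page could do that. The `applyEmbedSize` helper and the way the sending iframe is looked up are assumptions for illustration; Marfeel's own handling may differ.

```javascript
// Hypothetical helper: validate an 'embed-size' message and resize the iframe.
// Returns true if the height was applied, false if the message was ignored.
function applyEmbedSize(iframe, data) {
  if (!data || data.sentinel !== 'amp' || data.type !== 'embed-size') return false;
  if (typeof data.height !== 'number' || !Number.isFinite(data.height) || data.height <= 0) {
    return false;
  }
  iframe.style.height = `${Math.ceil(data.height)}px`;
  return true;
}

// Browser-only wiring: find the iframe whose window sent the message.
if (typeof window !== 'undefined') {
  window.addEventListener('message', (event) => {
    for (const iframe of document.querySelectorAll('iframe')) {
      if (iframe.contentWindow === event.source) {
        applyEmbedSize(iframe, event.data);
      }
    }
  });
}
```

Matching `event.source` against each iframe's `contentWindow` keeps a message from one embed from resizing another. Because the snippet above posts to `'*'`, the host side is where any validation has to happen.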