# Building blocks
Content extraction at Marfeel happens in two separate modules:
- MarfeelAlibaba for section pages,
- MarfeelBoilerpipe for article pages.
# Section extraction details
sequenceDiagram
participant Developer
participant MarfeelAlibaba
participant MarfeelBoilerpipe
Developer ->>+ MarfeelAlibaba: call to invalidate a section
MarfeelAlibaba ->>+ MarfeelBoilerpipe: Request articles invalidation
MarfeelBoilerpipe -->> MarfeelBoilerpipe: Extract articles
MarfeelBoilerpipe ->>- MarfeelAlibaba: Return updated articles info
MarfeelAlibaba ->>- Developer: Updated section content
The section extraction entrypoint selects which ripper to use depending on the current section and the tenant's configuration.
Different rippers can coexist in a tenant, for different sections. They are responsible both for fetching the pages (through a fetcher) and extracting the content from them, following the tenant's configuration.
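As a rough illustration of how a per-section ripper choice could look, here is a minimal sketch. The `Ripper` interface, the `TenantConfig` class, and the fallback to a default ripper are all hypothetical names and assumptions made for this example; they are not the actual Marfeel classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface: the real ripper contract is not shown in this document.
interface Ripper {
    String name();
}

// Hypothetical tenant configuration: maps a section name to the ripper configured for it.
final class TenantConfig {
    private final Map<String, Ripper> rippersBySection = new LinkedHashMap<>();
    private final Ripper defaultRipper; // assumption: a tenant has a fallback ripper

    TenantConfig(Ripper defaultRipper) {
        this.defaultRipper = defaultRipper;
    }

    // Registers a ripper for one section; different sections may use different rippers.
    TenantConfig withRipper(String section, Ripper ripper) {
        rippersBySection.put(section, ripper);
        return this;
    }

    // Returns the ripper configured for the section, or the tenant default.
    Ripper ripperFor(String section) {
        return rippersBySection.getOrDefault(section, defaultRipper);
    }
}

final class RipperSelectionDemo {
    public static void main(String[] args) {
        Ripper html = () -> "html";
        Ripper rss = () -> "rss";
        TenantConfig config = new TenantConfig(html).withRipper("sports", rss);
        System.out.println(config.ripperFor("sports").name()); // section with its own ripper
        System.out.println(config.ripperFor("home").name());   // falls back to the default
    }
}
```

Keeping the lookup inside the configuration object means the entrypoint only has to ask for the ripper of the current section, whatever the tenant has configured.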
All rippers return a `SectionExecutionResult` object, which can be printed as a JSON object.
Among other things, this object contains a list of extracted items: this is a flat list containing all the items extracted following the tenant's configuration.
Each item (or article) from this list is individually invalidated, to guarantee that the information visible on a section page is up to date. This invalidation happens through the `DetailsItemProcessor`.
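The item-by-item invalidation described above can be pictured with the following sketch. `SectionResultSketch` and `ItemProcessorSketch` are hypothetical stand-ins for `SectionExecutionResult` and `DetailsItemProcessor`; their real fields and method signatures are not shown in this document, so only the flat list of items and a single `invalidate` call are modelled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for SectionExecutionResult: only the flat item list is modelled.
final class SectionResultSketch {
    private final List<String> itemUrls;

    SectionResultSketch(List<String> itemUrls) {
        this.itemUrls = List.copyOf(itemUrls);
    }

    List<String> itemUrls() {
        return itemUrls;
    }
}

// Hypothetical stand-in for DetailsItemProcessor: invalidates a single item.
interface ItemProcessorSketch {
    void invalidate(String itemUrl);
}

final class SectionInvalidationSketch {
    // Invalidates every item of the flat list, one by one, and returns how many were processed.
    static int invalidateAll(SectionResultSketch result, ItemProcessorSketch processor) {
        int processed = 0;
        for (String url : result.itemUrls()) {
            processor.invalidate(url);
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        SectionResultSketch result = new SectionResultSketch(
                List.of("https://example.com/a", "https://example.com/b"));
        int count = invalidateAll(result, seen::add);
        System.out.println(count + " items invalidated: " + seen);
    }
}
```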
# Article extraction details
sequenceDiagram
participant Developer
participant ItemService
participant MarfeelBoilerpipe
Developer ->>+ ItemService: call to invalidate an article
ItemService ->> ItemService: Is the item expired?
ItemService -->> Developer: Same article content
ItemService ->>+ MarfeelBoilerpipe: Article is expired
MarfeelBoilerpipe ->> MarfeelBoilerpipe: Update article
MarfeelBoilerpipe ->>- ItemService: Updated article content
ItemService ->>- Developer: Updated article content
The article extraction entrypoint selects a Boilerpipe implementation to use depending on the article characteristics (premium or not).
The `BoilerpipeExtractor` judges whether the URL requested for extraction is good enough to be marfeelized (e.g. long enough to be an actual article) and, if so, goes through all the page content to build the Marfeel page.
This includes extracting or building all the relevant metadata, as well as processing the content.
The extraction of an article does not trigger the invalidation of the section pages it appears in.
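The expiry check from the sequence diagram in this section can be sketched as follows. This is only an illustration: `ItemServiceSketch` and `CachedArticle` are hypothetical names, the time-to-live rule is an assumption (the document does not say how expiry is decided), and the `UnaryOperator` merely stands in for the call to MarfeelBoilerpipe.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.UnaryOperator;

// Hypothetical cached article: its content plus the time it was last extracted.
final class CachedArticle {
    final String content;
    final Instant extractedAt;

    CachedArticle(String content, Instant extractedAt) {
        this.content = content;
        this.extractedAt = extractedAt;
    }
}

// Hypothetical stand-in for ItemService.
final class ItemServiceSketch {
    private final Duration timeToLive;             // assumption: expiry is TTL-based
    private final UnaryOperator<String> extractor; // stands in for MarfeelBoilerpipe

    ItemServiceSketch(Duration timeToLive, UnaryOperator<String> extractor) {
        this.timeToLive = timeToLive;
        this.extractor = extractor;
    }

    // Returns the same article if it is still fresh, otherwise a re-extracted copy.
    CachedArticle invalidate(CachedArticle article, Instant now) {
        boolean expired = article.extractedAt.plus(timeToLive).isBefore(now);
        if (!expired) {
            return article; // "Same article content" branch of the diagram
        }
        return new CachedArticle(extractor.apply(article.content), now);
    }
}

final class ArticleInvalidationDemo {
    public static void main(String[] args) {
        ItemServiceSketch service =
                new ItemServiceSketch(Duration.ofMinutes(10), old -> "fresh content");
        Instant extracted = Instant.parse("2024-01-01T00:00:00Z");
        CachedArticle cached = new CachedArticle("stale content", extracted);
        System.out.println(service.invalidate(cached, extracted.plus(Duration.ofMinutes(5))).content);
        System.out.println(service.invalidate(cached, extracted.plus(Duration.ofMinutes(15))).content);
    }
}
```

Nothing in this sketch touches section pages, in line with the article flow being independent from section invalidation.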
# Debug extraction
The entrypoint to debug the extraction of a section page is the `com.marfeel.pressSystem.SectionDetailPress.getSection` method.
The entrypoint to debug the extraction of an article page is the `com.marfeel.gutenberg.invalidations.item.ItemInvalidationService.invalidate` method.
For extractions happening in production, check the logs in Kibana:
- The `nginx` index stores the logs of all invalidations requested via the MarfeelInsight endpoint.
- The `invalidations` index stores all the invalidation logs, for each page, and whether they succeeded or failed.
TIP
Learn how to debug MarfeelPress extractions with these guides: