# Building blocks

Content extraction at Marfeel happens in two separate modules:

  • MarfeelAlibaba for section pages,
  • MarfeelBoilerpipe for article pages.

# Section extraction details

sequenceDiagram
    participant Developer
    participant MarfeelAlibaba
    participant MarfeelBoilerpipe

    Developer ->>+ MarfeelAlibaba: call to invalidate a section
    MarfeelAlibaba ->>+ MarfeelBoilerpipe: Request articles invalidation
    MarfeelBoilerpipe -->> MarfeelBoilerpipe: Extract articles
    MarfeelBoilerpipe ->>- MarfeelAlibaba: Return updated articles info
    MarfeelAlibaba ->>- Developer: Updated section content

The section extraction entrypoint selects which ripper to use depending on the current section and the tenant's configuration.

Different rippers can coexist within a tenant, each handling different sections. A ripper is responsible both for fetching the pages (through a fetcher) and for extracting their content according to the tenant's configuration.
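As an illustration only, the selection step described above could look like the sketch below. Every type in it (`Ripper`, `TenantConfig`, `select`) and the ripper names `rss`, `html` and `json` are hypothetical stand-ins rather than the actual Marfeel classes; the point is simply that one tenant can map different sections to different rippers, with a default for the rest.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are the actual Marfeel classes.
class RipperSelectionSketch {

    /** A ripper fetches a section page and extracts its content. */
    interface Ripper {
        String name();
    }

    static final class NamedRipper implements Ripper {
        private final String name;

        NamedRipper(String name) {
            this.name = name;
        }

        @Override
        public String name() {
            return name;
        }
    }

    /** Assumed tenant configuration: section path -> ripper name, plus a default. */
    static final class TenantConfig {
        private final Map<String, String> ripperBySection = new LinkedHashMap<>();
        private final String defaultRipper;

        TenantConfig(String defaultRipper) {
            this.defaultRipper = defaultRipper;
        }

        TenantConfig use(String section, String ripperName) {
            ripperBySection.put(section, ripperName);
            return this;
        }

        String ripperFor(String section) {
            return ripperBySection.getOrDefault(section, defaultRipper);
        }
    }

    /** Picks the ripper for a section, falling back to the tenant default. */
    static Ripper select(TenantConfig config, String section) {
        return new NamedRipper(config.ripperFor(section));
    }

    public static void main(String[] args) {
        TenantConfig config = new TenantConfig("html")
                .use("/sports", "rss")
                .use("/opinion", "json");
        System.out.println(select(config, "/sports").name());   // prints rss
        System.out.println(select(config, "/politics").name()); // prints html
    }
}
```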

All rippers return a SectionExecutionResult object, which can be printed as a JSON object.

Among other things, this object contains a flat list of all the items extracted according to the tenant's configuration.

Each item (or article) in this list is invalidated individually, to guarantee that the information visible on a section page is up to date. This invalidation happens through the DetailsItemProcessor.
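The per-item invalidation can be pictured with the minimal sketch below. It is not the real `DetailsItemProcessor`: `Item`, `SectionResult` and `ItemProcessor` are hypothetical types that only model the flat item list and the fact that each entry is processed on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types only mimic the roles described above.
class SectionInvalidationSketch {

    /** One extracted item (article), identified here only by its URL. */
    static final class Item {
        final String url;

        Item(String url) {
            this.url = url;
        }
    }

    /** Stand-in for SectionExecutionResult; only the flat item list is modeled. */
    static final class SectionResult {
        final List<Item> items;

        SectionResult(List<Item> items) {
            this.items = items;
        }
    }

    /** Stand-in for the role played by DetailsItemProcessor. */
    interface ItemProcessor {
        void invalidate(Item item);
    }

    /** Invalidates every item of the section, one at a time, and returns the count. */
    static int invalidateAll(SectionResult result, ItemProcessor processor) {
        int processed = 0;
        for (Item item : result.items) {
            processor.invalidate(item);
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        SectionResult result = new SectionResult(List.of(
                new Item("/sports/match-report"),
                new Item("/sports/transfer-news")));
        List<String> invalidated = new ArrayList<>();
        int count = invalidateAll(result, item -> invalidated.add(item.url));
        System.out.println(count + " items invalidated: " + invalidated);
    }
}
```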

# Article extraction details

sequenceDiagram
    participant Developer
    participant ItemService
    participant MarfeelBoilerpipe

    Developer ->>+ ItemService: call to invalidate an article
    ItemService ->> ItemService: Is the item expired?
    ItemService -->> Developer: Same article content
    ItemService ->>+ MarfeelBoilerpipe: Article is expired
    MarfeelBoilerpipe ->> MarfeelBoilerpipe: Update article
    MarfeelBoilerpipe ->>- ItemService: Updated article content
    ItemService ->>- Developer: Updated article content

The article extraction entrypoint selects a Boilerpipe implementation to use depending on the article characteristics (premium or not).
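The flow in the diagram, combined with the premium/non-premium choice, can be sketched roughly as follows. This is a simplified model under stated assumptions: the time-to-live expiry rule, the `Extractor` interface and every class name are hypothetical, since this page does not describe how expiry is actually decided or how the Boilerpipe implementations are exposed.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not the actual ItemService or MarfeelBoilerpipe code.
class ArticleInvalidationSketch {

    static final class Article {
        final String url;
        final String content;
        final Instant extractedAt;

        Article(String url, String content, Instant extractedAt) {
            this.url = url;
            this.content = content;
            this.extractedAt = extractedAt;
        }
    }

    /** Stand-in for a Boilerpipe implementation. */
    interface Extractor {
        String extract(String url);
    }

    private final Map<String, Article> stored = new HashMap<>();
    private final Duration timeToLive; // assumed expiry rule, for illustration only
    private final Extractor standardExtractor;
    private final Extractor premiumExtractor;

    ArticleInvalidationSketch(Duration timeToLive,
                              Extractor standardExtractor,
                              Extractor premiumExtractor) {
        this.timeToLive = timeToLive;
        this.standardExtractor = standardExtractor;
        this.premiumExtractor = premiumExtractor;
    }

    /** Returns the stored article if it is still fresh, otherwise re-extracts it. */
    Article invalidate(String url, boolean premium, Instant now) {
        Article current = stored.get(url);
        boolean expired = current == null
                || !current.extractedAt.plus(timeToLive).isAfter(now);
        if (!expired) {
            return current; // "Same article content" branch of the diagram
        }
        // "Article is expired" branch: pick the implementation, then update.
        Extractor extractor = premium ? premiumExtractor : standardExtractor;
        Article updated = new Article(url, extractor.extract(url), now);
        stored.put(url, updated);
        return updated;
    }
}
```

Passing the current time as a parameter keeps the expiry check deterministic, which makes the two branches of the diagram easy to exercise in a test.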

The BoilerpipeExtractor judges whether the URL requested for extraction is good enough to be marfeelized (e.g. long enough to be an actual article) and, if so, goes through all the page content to build the Marfeel page. This includes extracting or building all the relevant metadata, as well as processing the content.
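To make the "long enough to be an actual article" check concrete, here is a deliberately naive sketch. The word-count heuristic and the caller-supplied threshold are assumptions made for illustration; the real BoilerpipeExtractor criteria are not described on this page and are likely richer.

```java
// Hypothetical sketch: a naive length check, not the BoilerpipeExtractor logic.
class ArticleEligibilitySketch {

    /** Counts whitespace-separated words in the extracted text. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** True when the text has at least minWords words (threshold chosen by the caller). */
    static boolean isLongEnough(String text, int minWords) {
        return countWords(text) >= minWords;
    }

    public static void main(String[] args) {
        String teaser = "Read more about this story";
        System.out.println(isLongEnough(teaser, 3));   // prints true
        System.out.println(isLongEnough(teaser, 150)); // prints false
    }
}
```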

The extraction of an article does not trigger the invalidation of the section pages it appears in.

# Debug extraction

The entrypoint to debug the extraction of a section page is the com.marfeel.pressSystem.SectionDetailPress.getSection method.

The entrypoint to debug the extraction of an article page is the com.marfeel.gutenberg.invalidations.item.ItemInvalidationService.invalidate method.

For extractions happening in production, check the logs in Kibana.

TIP

Learn how to debug MarfeelPress extractions with these guides: