# Extraction configuration migration
Any tenant at Marfeel can use several extraction strategies.
For example, one section can be extracted with the Whitecollar ripper, while the others work with the MarfeelPress ripper.
Section and article extraction configurations can also be mixed. For example, a site can extract all its articles with `BoilerpipePressExtractor` and all its section pages with the Jsoup ripper. There are as many possible configurations as there are tenants.
This article describes how to change a tenant's article extraction strategy and section extraction strategy.
WARNING
Changing the configuration for a tenant using `BoilerpipePressExtractor` or `marfeelPressRipper` is discouraged and should be done only as a last resort, because it lowers performance.
TIP
"Removing MarfeelPress" and "migrating the section" are different ways of saying the same thing: modifying the extraction configuration.
For example, if you "remove MarfeelPress for the home section" of a site, you're migrating the home section to a different ripper.
If you "remove MarfeelPress from articles", you're changing the Article extraction strategy.
# Article extraction strategy
Article extraction is handled by Boilerpipe, which processes the content to produce the Marfeel version. By default, regular tenants use `BoilerpipeExtractor` and MarfeelPress tenants use `BoilerpipePressExtractor`.
To change the article extraction strategy of a tenant, use the `itemExtractorType` flag in the global configuration of the `definition.json` file. As its value, set the desired extractor.
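As an illustration, here is a minimal sketch of the global configuration in `definition.json` with the extractor set explicitly. Only the `itemExtractorType` key and the extractor name come from this article. Placing the key at the top level of the file is an assumption, and the tenant's other settings are omitted.

```json
{
  "itemExtractorType": "BoilerpipeExtractor"
}
```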
The Fetcher is the Boilerpipe component in charge of retrieving the content from the tenant's site.
`BoilerpipePressExtractor` is bound to `MarfeelPressFetcher`, so its Fetcher can't be changed.
`BoilerpipeExtractor` can use other fetchers, described in the Fetchers section of the Article pages extraction article. To change the Fetcher used by a tenant, add the `boilerpipeFetcher` flag in the global configuration of the `definition.json` file.
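A minimal sketch of the Fetcher override, assuming the flag sits next to `itemExtractorType` at the top level of `definition.json`. `<FetcherName>` is a placeholder, not a real value: replace it with one of the fetchers listed in the Fetchers section of the Article pages extraction article.

```json
{
  "itemExtractorType": "BoilerpipeExtractor",
  "boilerpipeFetcher": "<FetcherName>"
}
```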
WARNING
All sections of a tenant must use the same Extractor and Fetcher. That's why they are set in the global configuration, not in the section configuration.
# Section extraction strategy
Section extraction is handled by Alibaba, which can be configured to use different Rippers depending on the tenant's needs.
The different Rippers available are described in the Sections pages extraction article.
WARNING
Rippers can be configured for a specific section, so a single tenant can use multiple Rippers.
To change the Ripper configuration of a section, use the `feedRipper` flag in the desired section configuration.
TIP
The Ripper can also be set in the global configuration of a tenant. In that case, it is used by default when a section doesn't have a specific Ripper configured.
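The following sketch combines a global default Ripper with a per-section override. Only the `feedRipper` flag comes from this article. The `sections` array, the `name` field, and both `<...>` values are hypothetical placeholders used to show the two levels, so check an existing tenant's `definition.json` for the actual layout and Ripper names.

```json
{
  "feedRipper": "<DefaultRipperName>",
  "sections": [
    { "name": "home", "feedRipper": "<HomeRipperName>" },
    { "name": "sports" }
  ]
}
```

In this sketch, the home section uses its own Ripper, while the sports section has none configured and would fall back to the global default.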
# Global configuration
# Programmed invalidations
Tenants using the MarfeelPress plugin don't need programmed invalidations, as they have a different strategy to detect when the content needs to be extracted.
Tenants without the MarfeelPress plugin, however, use scheduled invalidations.
When an active tenant installs the MarfeelPress plugin, the scheduled invalidations can be disabled. This will reduce the load on the tenant's servers.
To disable the scheduled invalidations:
- Add the `quartzInvalidation` flag and set it to `false`.
- Add the `disabledConsumerInvalidation` flag and set it to `true`.
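Put together, the two flags might look like this in the global configuration of `definition.json`. The flag names and values are the ones listed above. Their placement at the top level of the file is an assumption, and the tenant's other settings are omitted.

```json
{
  "quartzInvalidation": false,
  "disabledConsumerInvalidation": true
}
```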
If an active tenant using the MarfeelPress plugin uninstalls it, the scheduled invalidations must be enabled. In that case, set `quartzInvalidation` to `true` and `disabledConsumerInvalidation` to `false`.
# Common issues
When migrating a tenant to use `BoilerpipePressExtractor`, the following flags may be required to guarantee a successful extraction:
- Add the `detailItemsProcessor` flag.
- Add `mrf-toc` to the `whitelist`.
TIP
These flags are configured by default for tenants that install the MarfeelPress plugin.