# Content group filler
Content group filler is a whitecollar feature that automatically groups items by a key. The developer selects which elements are items and which are content groups. The relation between each item and the content group it belongs to is then resolved without any extra work from the developer.
This feature only works with puppeteerRipper whitecollars.
# Benefits
Item extraction no longer needs any logic related to content groups, which is a good step towards very simple whitecollars.
# Interface to implement
getContentGroups is an optional array in the whitecollar script (see the whitecollar article).
This article details all the properties each content group can have, for example:
{
  selector: ".balcon",
  extractors: {
    name: ".title"
  }
}
# selector
Mandatory string, passed to querySelectorAll under the hood to select all the content groups.
It can contain several comma-separated selectors, each wrapping the information of a content group (e.g. #latest-news ARTICLE, .featured-items .post).
# prefix
Optional property that prefixes the keys of the content groups selected by this definition. Every content group detected will have its name prefixed with that value. This is really useful in conjunction with the startsWith feature of the layoutDescriptor.
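As a minimal sketch of the idea, assuming the prefix is simply concatenated in front of the sanitized name (the exact joining rule is not stated here) and that startsWith matching behaves like JavaScript's `String.prototype.startsWith`. The helper `applyPrefix`, the prefix `home-` and the sample keys are hypothetical:

```javascript
// Hypothetical helper: prepend the optional prefix to a content group key.
// Assumption: plain concatenation, no separator added.
function applyPrefix(key, prefix = "") {
  return prefix + key;
}

const keys = ["breakingnews", "sports"].map((k) => applyPrefix(k, "home-"));

// A single startsWith rule can then target the whole family of keys.
const matched = keys.filter((k) => k.startsWith("home-"));
console.log(keys, matched);
```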
# extractors
Mandatory object containing all the instructions on how to extract the content group properties. Every extractor instruction is applied to each content group node found by the selector.
# name
Selector evaluated inside each content group node. The text it matches is sanitized and used as the identifier (key) of the content group.
For example:
<div class="separator">
<span class="title">Breaking news</span>
</div>
With name: ".title" the content group name will be extracted as breakingnews.
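The sanitization rules are not spelled out in this article. The sketch below is only inferred from the examples shown here ("Breaking news" becomes breakingnews): it lowercases the text and drops everything that is not a letter or digit. The function name `sanitizeName` and the accent stripping are assumptions, not the actual implementation:

```javascript
// Hypothetical sanitizer inferred from the examples in this article.
// The real rules may differ.
function sanitizeName(text) {
  return text
    .toLowerCase()
    .normalize("NFD")                 // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks (assumption)
    .replace(/[^a-z0-9]/g, "");       // drop spaces, punctuation and symbols
}

console.log(sanitizeName("Breaking news"));    // breakingnews
console.log(sanitizeName("Ultimas noticias")); // ultimasnoticias
```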
The special keyword INNER_TEXT allows you to get the text content of the selected group.
This is especially useful for content groups without children, for example:
<div class="group-title">Breaking news</div>
<article></article>
<article></article>
<article></article>
<div class="group-title">Repairing news</div>
<article></article>
<article></article>
With name: "INNER_TEXT", it will extract two content groups: breakingnews with 3 articles and repairingnews with 2 articles.
WARNING
Take into account that INNER_TEXT reads the innerText property of the HTML node matched by the selector.
For content groups with children it will therefore return the innerText of the whole content group.
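The difference between a regular selector and INNER_TEXT can be sketched as follows. This is an illustration under assumptions, not the real extractor: `extractText` is a hypothetical name, and the plain objects stand in for DOM nodes so the snippet runs outside a browser.

```javascript
// Hypothetical resolver: `node` is any DOM-like element.
function extractText(node, extractor) {
  if (extractor === "INNER_TEXT") {
    return node.innerText; // text of the content group node itself
  }
  const match = node.querySelector(extractor);
  return match ? match.innerText : null;
}

// Minimal stand-ins for DOM nodes.
const title = { innerText: "Breaking news" };
const withChildren = {
  innerText: "Breaking news\nFirst article\nSecond article",
  querySelector: (sel) => (sel === ".title" ? title : null),
};
const withoutChildren = {
  innerText: "Breaking news",
  querySelector: () => null,
};

console.log(extractText(withoutChildren, "INNER_TEXT")); // Breaking news
console.log(extractText(withChildren, ".title"));        // Breaking news
console.log(extractText(withChildren, "INNER_TEXT"));    // title plus article text
```

The last call shows the case the warning describes: on a content group with children, INNER_TEXT also picks up the text of the nested articles.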
# title
Selector to extract the title of the content group. Not sanitized.
TIP
You can also use INNER_TEXT for the title property.
# children
This is a special property to define the type of content group.
There are two types of content groups: with children or without.
Defining the type correctly matters, because the strategy used to link items to a content group is different for each type.
# Content group with children
In this case, the property children has to be set to true.
Example:
<div class="content-group">
<div class="title">Breaking news</div>
<article>....</article>
<article>....</article>
</div>
# Content group without children
In this case the property children can be omitted, as it defaults to false.
Example:
<div class="content-group">
<div class="title">Breaking news</div>
</div>
<article>....</article>
<article>....</article>
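The properties described so far can be summarized with a small validation sketch. It only encodes what this article states (selector and extractors are mandatory, prefix and children are optional). The function `validateContentGroup` is hypothetical, and treating prefix as a string is an assumption:

```javascript
// Hypothetical validator; returns a list of problems (empty when valid).
function validateContentGroup(def) {
  const errors = [];
  if (typeof def.selector !== "string" || def.selector.trim() === "") {
    errors.push("selector is mandatory and must be a non-empty string");
  }
  const ex = def.extractors;
  if (ex === null || typeof ex !== "object" || Array.isArray(ex)) {
    errors.push("extractors is mandatory and must be an object");
  }
  // Assumption: prefix is a string.
  if (def.prefix !== undefined && typeof def.prefix !== "string") {
    errors.push("prefix, when present, must be a string");
  }
  if (def.children !== undefined && typeof def.children !== "boolean") {
    errors.push("children, when present, must be a boolean");
  }
  return errors;
}

const ok = validateContentGroup({
  selector: ".content-group",
  children: true,
  extractors: { name: ".title" },
});
const broken = validateContentGroup({ children: "yes" });
console.log(ok.length, broken.length); // 0 3
```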
# What happens under the hood
If everything is configured correctly, each item will carry the key of the content group it belongs to.
For example:
<div class="content-group">
<div class="title">Ultimas noticias</div>
</div>
<article>....</article>
<article>....</article>
<article>....</article>
<article>....</article>
With this configuration:
getContentGroups: [
  {
    selector: ".content-group",
    extractors: {
      name: ".title"
    }
  }
]
The extracted items carry the content group key in pocket.key:
[
  {
    "title": "El Síndic achaca las largas listas de espera en Sanidad a los pacientes del resto de España",
    "uri": "https://www.vozpopuli.com/elliberal/politica/Sindic-achaca-Sanidad-pacientes-Espana_0_1307869371.html",
    "subtitle": "12:03",
    "relevance": 9189,
    "column": 1,
    "media": null,
    "pocket": {
      "key": "ultimasnoticias"
    }
  },
  {
    "title": "El PNV advierte al PSOE: también es “importante” que avancen las conversaciones con ellos",
    "uri": "https://www.vozpopuli.com/politica/PNV-advierte-PSOE-importante-conversaciones_0_1307869397.html",
    "subtitle": "11:38",
    "relevance": 9330,
    "column": 1,
    "media": null,
    "pocket": {
      "key": "ultimasnoticias"
    }
  }
]
On the layoutDescriptor you can then create a content group with that key.
# How to disable it
In general, if you don't define getContentGroups in the whitecollar, nothing will be applied.
To disable it for a particular item, create a pocket with a key on that item. When an item already has a pocket with a key, content group filler won't do anything for that particular item.
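The opt-out rule can be sketched like this. It is an illustration only: `fillPockets` and `resolveKey` are hypothetical names, with `resolveKey` standing in for whatever logic decides which content group an item belongs to.

```javascript
// Hypothetical filler step, not the actual implementation.
function fillPockets(items, resolveKey) {
  return items.map((item) => {
    // An item that already carries a pocket key is left untouched.
    if (item.pocket && item.pocket.key) return item;
    const key = resolveKey(item);
    return key ? { ...item, pocket: { ...item.pocket, key } } : item;
  });
}

const items = [
  { title: "A" },
  { title: "B", pocket: { key: "manual" } },
];
const filled = fillPockets(items, () => "ultimasnoticias");
console.log(filled.map((i) => i.pocket.key)); // [ 'ultimasnoticias', 'manual' ]
```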
# How it works
The algorithm saves all detected content groups and for every item tries to assign it to them. In order to do that it has two strategies:
# Contains strategy
Imagine this scenario:
<div class="content-group">
<div class="title">Breaking news</div>
<article>....</article>
<article>....</article>
<article>....</article>
<article>....</article>
</div>
In this case, the content group has to be defined with children: true. Something like:
{
  "selector": ".content-group",
  "children": true,
  "extractors": {
    "name": ".title"
  }
},
The strategy to decide whether an item belongs to that particular content group is to check whether the item's DOM element is inside the content group element.
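A minimal sketch of the containment check, assuming each detected content group keeps a reference to its node. The names `isInside` and `assignByContainment` are hypothetical. In a browser, the built-in `Node.contains` performs the same ancestor test; the manual walk is used here so the snippet also runs on plain objects.

```javascript
// Walks up the parent chain. Works on real DOM nodes and on the plain
// objects below, since both expose `parentNode`.
function isInside(groupNode, itemNode) {
  for (let n = itemNode.parentNode; n; n = n.parentNode) {
    if (n === groupNode) return true;
  }
  return false;
}

// Returns the key of the first content group containing the item, or null.
function assignByContainment(groups, itemNode) {
  const owner = groups.find((g) => isInside(g.node, itemNode));
  return owner ? owner.key : null;
}

const groupNode = { parentNode: null };
const otherNode = { parentNode: null };
const inner = { parentNode: { parentNode: groupNode } }; // nested two levels deep
const outer = { parentNode: otherNode };
const groups = [{ key: "breakingnews", node: groupNode }];

console.log(assignByContainment(groups, inner)); // breakingnews
console.log(assignByContainment(groups, outer)); // null
```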
# Position strategy
Imagine this other scenario:
<div class="content-group">
<div class="title">Breaking news</div>
</div>
<article>....</article>
<article>....</article>
<article>....</article>
<article>....</article>
In this case, the content group has to be defined with children: false, which is the default. Something like:
{
  "selector": ".content-group",
  "extractors": {
    "name": ".title"
  }
},
The strategy here is a little more complex: we rely on the computed CSS layout to know where each item is positioned on the screen. Put simply, if an item sits under a content group, we say it belongs to that content group. The calculation takes different columns and similar layout details into account in order to relate each item to the proper key.
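A simplified sketch of that idea follows. It is not the actual algorithm: it assumes an item belongs to the nearest content group located above it whose horizontal extent overlaps the item's (the same column). The names `overlapsHorizontally` and `assignByPosition` and the sample coordinates are hypothetical. The rectangles mimic the `top`, `left` and `right` values returned by `getBoundingClientRect()`.

```javascript
// True when the two rectangles share some horizontal range (same column).
function overlapsHorizontally(a, b) {
  return a.left < b.right && b.left < a.right;
}

// Assumption: pick the closest content group above the item in its column.
function assignByPosition(groups, itemRect) {
  let best = null;
  for (const g of groups) {
    const above = g.rect.top <= itemRect.top;
    if (!above || !overlapsHorizontally(g.rect, itemRect)) continue;
    if (!best || g.rect.top > best.rect.top) best = g;
  }
  return best ? best.key : null;
}

const groups = [
  { key: "breakingnews", rect: { top: 0, left: 0, right: 300 } },
  { key: "sports", rect: { top: 0, left: 320, right: 620 } },
  { key: "opinion", rect: { top: 400, left: 0, right: 300 } },
];

console.log(assignByPosition(groups, { top: 120, left: 0, right: 300 }));   // breakingnews
console.log(assignByPosition(groups, { top: 120, left: 320, right: 620 })); // sports
console.log(assignByPosition(groups, { top: 450, left: 0, right: 300 }));   // opinion
```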