# Debug article extraction
This article describes two methods to debug Boilerpipe execution, the process that retrieves the HTML content of an article and Marfeelizes it.
To debug article extraction, you can either debug Gutenberg's extraction directly using IntelliJ or use the `GenerateTestFixtures` test suite. The first option lets you see the rendered version of the article, while the fixtures are faster when you only need to check the HTML output.
This guide can be used to:
- Go through all methods executed during the Boilerpipe process and better understand if there are any issues.
- Validate whether a modification in Boilerpipe produces the expected output.
- Validate whether a configuration flag has the expected behavior.
- Make sure a change that the tenant should make will solve the issue, e.g. adding a class or removing a malformed element.
MarfeelPress
To debug the MarfeelPress Extractor API, follow this dedicated guide.
# Debug Gutenberg execution
Check this video to learn how to debug Gutenberg using IntelliJ:
IntelliJ Remote debug parameters:
- Name: remoteDebug
- Host: localhost
- Port: 49285
- Command line arguments for remote JVM: `-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=49285`
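Before attaching IntelliJ, it can help to confirm that something is listening on the debug port. The following is a minimal sketch, not part of Gutenberg: the `DebugPortCheck` class and its `isListening` helper are hypothetical names, and the host and port are taken from the parameters above. It only opens and closes a TCP connection, so the target JVM may log a failed debugger handshake when probed.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Gutenberg: probes a TCP port.
class DebugPortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = "localhost"; // Host from the remote debug parameters
        int port = 49285;          // Port from the remote debug parameters
        if (isListening(host, port, 1000)) {
            System.out.println("A process is listening on " + host + ":" + port);
        } else {
            System.out.println("Nothing is listening on " + host + ":" + port
                    + "; check that the JVM was started with the -agentlib:jdwp arguments");
        }
    }
}
```

A successful connection only shows that the port is open. It does not prove that the listener is the Gutenberg JVM, so treat a positive result as a hint rather than a guarantee.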
# Debug using test fixtures
The `GenerateTestFixtures` class executes Boilerpipe on the HTML of an article and outputs the resulting HTML.
TIP
A test fixture is an environment used to consistently test some items.
Follow this guide to successfully debug the HTML processing on any locally modified HTML file using `GenerateTestFixtures.java`.
1. Set up a local server to serve the desired HTML from a local URL. To do so, install the Live Server extension in Visual Studio Code.
2. Create a new HTML file in the Visual Studio Code editor and fill it with the source code of the tenant's target article. Modify the HTML according to your needs, e.g. add a new class to an image or remove an element that may be breaking the extraction.
3. With the new HTML file open, click the Go Live button. This button appears at the bottom right corner of Visual Studio Code once the Live Server extension is installed. Restart Visual Studio Code if it doesn't.
4. This opens the HTML file in a browser, served from the local server set up by the extension. Keep it open for now, as you'll need the URL in a later step.
5. Using IntelliJ, open the Gutenberg project. Then, open the `GenerateTestFixtures.java` file.
Shortcut
To find the file, use the `cmd + o` shortcut to open the search console in IntelliJ. There, type the class name `GenerateTestFixtures`.
6. Assign the URL generated in step 4 to the `URL` variable of the `GenerateTestFixtures.java` file.
7. Set the tenant's extraction configuration flags in the `getOptions()` private method.
8. Run the `GenerateTestFixtures` class, in `GenerateTestFixtures.java`, to obtain Boilerpipe's HTML output. Select the `run` option to launch the test, or the `debug` option if you want the execution to stop at the breakpoints.
9. Find the output file in the `MarfeelGutenberg/MarfeelBoilerpipe/src/test/resources/` folder. Open it with your preferred IDE to see its content.
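When validating a Boilerpipe modification or a configuration flag, it can be useful to compare the output generated before and after the change. The sketch below is one possible approach and is not part of Gutenberg: `FixtureOutputDiff` and `firstDifference` are hypothetical names, and it assumes you saved a copy of the output file before re-running the test. It reports the first line where the two files diverge.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of Gutenberg: compares two saved HTML outputs.
class FixtureOutputDiff {

    // Returns the zero-based index of the first differing line, or -1 if both lists are identical.
    static int firstDifference(List<String> before, List<String> after) {
        int shared = Math.min(before.size(), after.size());
        for (int i = 0; i < shared; i++) {
            if (!before.get(i).equals(after.get(i))) {
                return i;
            }
        }
        // Same prefix: either identical, or one file has extra lines starting at index 'shared'.
        return before.size() == after.size() ? -1 : shared;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java FixtureOutputDiff <before.html> <after.html>");
            return;
        }
        List<String> before = Files.readAllLines(Paths.get(args[0]));
        List<String> after = Files.readAllLines(Paths.get(args[1]));
        int index = firstDifference(before, after);
        if (index < 0) {
            System.out.println("The two outputs are identical");
        } else {
            System.out.println("Outputs first differ at line " + (index + 1));
        }
    }
}
```

A line-based comparison is sensitive to whitespace and formatting changes, so a reported difference is a starting point for inspection rather than proof of a behavioral change.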
TIP
When debugging the test execution, there are two main operations to validate:
- The Boilerpipe HTML process, which you can debug by adding a breakpoint to the `HTMLDocumentProcessor(fetcher).process(args)` function. Once the debugger is stopped, you can step into the Boilerpipe process and isolate the part you are interested in.
- The Structured Data process, which you can debug by adding a breakpoint to the line where `fStructuredData` is created.
Test
Keep in mind this is a test.
Some parts might behave differently than in production. For example, the article URL is not the same.