Web - Select Custom Thumbnails & Content
Category: Data Sources
Learn to optimize search results with custom thumbnails, clean content, and precise XPath extraction for visually appealing and highly relevant product pages.
Optimizing Crawling of Products on www.shure.com
Let’s have a look at how to optimize the crawling of the products on www.shure.com. This is what it looks like when we get started. As you can see, we don’t have a nice thumbnail, and the content includes the menu and footer of the page.
And this is what the same results will look like after this tutorial. We now have a thumbnail that depicts the product, and the content does not include menus.
Example: SM58 Microphone
Let's use the SM58 microphone as an example. As a first step to optimize the search results for such a product page, let us use the product image as a thumbnail for the search results. To do this, we need to identify the image from the page that we want to use. Mindbreeze uses XPath for finding elements in the HTML.
What is XPath?
XPath is a query language for selecting nodes in XML documents, and it is commonly applied to HTML as well. You can configure XPaths for extracting the title and the content of a search result. Additionally, you can extract metadata under any name you choose. If the XPath matches, the extracted value is written to that metadata field: an existing default field is overwritten, and a new field is created automatically. If the XPath does not match, the field stays empty or keeps its default value.
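To make the idea of "selecting elements with a path" concrete, here is a minimal Python sketch using only the standard library. The markup is a hypothetical, well-formed stand-in for a product page (not the real shure.com HTML), and ElementTree supports only a subset of XPath, so this illustrates the concept rather than how Mindbreeze evaluates XPath internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a product page.
PAGE = """
<html>
  <head><title>SM58 Vocal Microphone</title></head>
  <body>
    <nav>Products | Support</nav>
    <main><h1>SM58</h1><p>Dynamic vocal microphone.</p></main>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# ".//title" is the ElementTree spelling of the XPath "//title".
title = root.findtext(".//title")

# A path that matches nothing yields None: no match, no extracted value.
missing = root.findtext(".//aside")

print(title)
print(missing)
```

A path that matches returns the element's text, while a path that matches nothing returns no value, which corresponds to the "no match" case described above.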
Step 1: Inspect the Product Page
To set the thumbnail, let's begin by inspecting the product page in our web browser.
Open the browser developer tools by pressing F12.
Use the Inspector Tool to hover over the product image.
In our example, the product image is in an img tag. However, some image sources have an unusable extension, such as .delay. By exploring further, we find li elements containing images with valid extensions, such as .png or .jpeg.
To target these images, we'll construct an XPath query. For instance:
//img[contains(@src,'.png') or contains(@src,'.jpg') or contains(@src,'.jpeg')]
Once written, this query can be tested directly in the browser console, for example by wrapping it in $x("...").
This will highlight all matching images on the page, and we can verify the selected elements visually by hovering over them.
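Outside the browser, the effect of this filter can be approximated with a short Python sketch. ElementTree's XPath subset has no contains() function, so the substring test is written in plain Python instead. The markup and file names below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the file names are invented for this example.
PAGE = """
<body>
  <img src="/img/placeholder.delay"/>
  <ul>
    <li><img src="/img/sm58-front.png"/></li>
    <li><img src="/img/sm58-side.jpeg"/></li>
  </ul>
  <footer><img src="/img/logo.svg"/></footer>
</body>
"""

EXTENSIONS = (".png", ".jpg", ".jpeg")


def has_image_extension(src: str) -> bool:
    # Same idea as contains(@src,'.png') or ... in the XPath: a substring match.
    return any(ext in src for ext in EXTENSIONS)


root = ET.fromstring(PAGE.strip())

# root.iter("img") visits every img element in document order, like "//img".
matches = [
    img.get("src")
    for img in root.iter("img")
    if has_image_extension(img.get("src", ""))
]
print(matches)
```

In this sample, the .delay and .svg sources are skipped and only the two images with accepted extensions remain.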
Step 2: Refine the XPath Query
Next, we refine the XPath to narrow its scope. For example, we can restrict the search to images inside a specific <div> container, such as the one whose class attribute contains product-img.
The final XPath query would look like this:
//div[contains(@class,'product-img')]//img[contains(@src,'.png') or contains(@src,'.jpg') or contains(@src,'.jpeg')]
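The effect of scoping the query to one container can be sketched in Python as follows. The markup is hypothetical, and the filtering is done in plain Python because ElementTree does not implement contains(). The sketch assumes that matching div elements are not nested inside each other.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the second div carries the "product-img" class.
PAGE = """
<body>
  <div class="header-logo"><img src="/img/logo.png"/></div>
  <div class="product-img gallery">
    <ul><li><img src="/img/sm58-front.png"/></li></ul>
  </div>
  <div class="related"><img src="/img/sm57.jpg"/></div>
</body>
"""

EXTENSIONS = (".png", ".jpg", ".jpeg")


def product_image_urls(root):
    """Collect image sources found inside divs whose class contains 'product-img'."""
    urls = []
    for div in root.iter("div"):
        # Equivalent in spirit to div[contains(@class,'product-img')].
        if "product-img" not in div.get("class", ""):
            continue
        # div.iter("img") walks all descendants, like the "//img" step.
        for img in div.iter("img"):
            src = img.get("src", "")
            if any(ext in src for ext in EXTENSIONS):
                urls.append(src)
    return urls


root = ET.fromstring(PAGE.strip())
urls = product_image_urls(root)
print(urls)
```

The logo and the related-product image also have valid extensions, but they sit outside the product container and are therefore no longer selected.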
Step 3: Implement the XPath in Mindbreeze
To implement this in Mindbreeze, follow these steps:
- Copy the final XPath query from the browser console.
- Open the "Shure" Index with Advanced Settings.
- Navigate to the Connector settings in the Mindbreeze configuration.
- Locate the Content Extraction section.
- Add a new custom property, mesthumbnailurl (a predefined metadata field for thumbnail extraction), and paste the XPath query here.
- Ensure the query is clean: remove any $x("") wrapper and surrounding whitespace from the copied XPath.
- Add the src attribute to point the thumbnailer directly to the image URL:
//div[contains(@class,'product-img')]//img[contains(@src,'.png') or contains(@src,'.jpg') or contains(@src,'.jpeg')]//@src
- Set the thumbnail metadata type to URL, since we are extracting image URLs.
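The two clean-up steps from the list above (removing the console wrapper and appending the src attribute) can be sketched as a small Python helper. The function names are hypothetical and not part of Mindbreeze; this is only a convenience for preparing the string before pasting it into the configuration.

```python
import re


def clean_console_xpath(copied: str) -> str:
    """Strip a $x("...") or $x('...') console wrapper and surrounding whitespace."""
    text = copied.strip()
    match = re.fullmatch(r"""\$x\(\s*(["'])(.*)\1\s*\)""", text, flags=re.DOTALL)
    if match:
        text = match.group(2).strip()
    return text


def with_src_attribute(xpath: str) -> str:
    # Append the attribute step used in this tutorial unless it is already there.
    return xpath if xpath.endswith("@src") else xpath + "//@src"


copied = """  $x("//div[contains(@class,'product-img')]//img[contains(@src,'.png') or contains(@src,'.jpg') or contains(@src,'.jpeg')]")  """

result = with_src_attribute(clean_console_xpath(copied))
print(result)
```

The printed string is the same query shown in the step above, ready to be pasted into the mesthumbnailurl property.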
Once configured, save the settings and reindex the dataset. Depending on the size of your dataset, reindexing may take some time. Once complete, the extracted image will replace the default screenshot as the thumbnail in your search results.
So, this is what it looks like with nice thumbnails:
Refining the Content
Similarly, we'll source the relevant content for the description to make the results more compelling and informative.
With Mindbreeze, we have several ways to achieve this. Content, for instance, can be further enriched using tools like AI Content Optimization. However, in this case, we'll focus on the expert method: utilizing XPath to extract specific elements from web pages.
Step 4: Refine the Content Extraction
We've identified that the main text lies within the page's main element. Using the same approach as before, we construct an XPath query to target this container:
//main
Test it in the browser console to ensure it highlights the desired content without including headers or footers.
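As a rough illustration of why //main removes the menu and footer, the following Python sketch compares the text of a whole page with the text of its main element. The markup is a hypothetical, well-formed example, and ElementTree is used here only as a stand-in for the real extraction.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with a header, a main area, and a footer.
PAGE = """
<html><body>
  <header><nav>Products Support Contact</nav></header>
  <main><h1>SM58</h1><p>Dynamic vocal microphone.</p></main>
  <footer>Privacy Terms</footer>
</body></html>
"""


def text_of(element) -> str:
    # Join all text nodes below the element and normalise whitespace.
    return " ".join(" ".join(element.itertext()).split())


root = ET.fromstring(PAGE.strip())

whole_page = text_of(root)                 # includes menu and footer text
main_only = text_of(root.find(".//main"))  # ".//main" corresponds to "//main"

print(whole_page)
print(main_only)
```

The first line printed still contains the navigation and footer text, while the second contains only the heading and description from the main element.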
Step 5: Implement the Content Extraction in Mindbreeze
Return to the Content Extraction settings in Mindbreeze:
- Locate the predefined field for content.
- Paste the new XPath query here.
- Save the configuration, restart the crawler, and reindex as before.
Once reindexing is complete, the search result shows the correct product image as its thumbnail (e.g., the SM58 microphone), and its content no longer includes headers and footers.
Content Extraction Before and After
Before:
After:
Further Fine-Tuning
Further fine-tuning is always possible. For instance, adjust the XPath to extract additional metadata fields, such as product descriptions, prices, or ratings. You can also overwrite existing metadata by using matching field names in the configuration.
If the XPath doesn't match, the system falls back to the default value or leaves the metadata field empty.
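The overwrite, create, and fallback behavior described in this tutorial can be modeled with a short Python sketch. This is an illustration under stated assumptions, not Mindbreeze's implementation: the field names price and rating, the markup, and the price value are invented, and ElementTree paths stand in for full XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page; the price value is made up.
PAGE = """
<html>
  <head><title>SM58 | Shure</title></head>
  <body>
    <main>
      <h1>SM58 Vocal Microphone</h1>
      <span class="price">99.00</span>
    </main>
  </body>
</html>
"""

# Field name -> ElementTree path. "price" and "rating" are hypothetical fields.
RULES = {
    "title": ".//main/h1",                 # overwrites the default title
    "price": ".//span[@class='price']",    # creates a new field
    "rating": ".//span[@class='rating']",  # no match in this page
}


def apply_rules(root, defaults, rules):
    """Start from the default metadata and apply every rule that matches."""
    metadata = dict(defaults)
    for name, path in rules.items():
        value = root.findtext(path)
        if value is not None:  # no match: keep the default or leave it unset
            metadata[name] = value.strip()
    return metadata


root = ET.fromstring(PAGE.strip())
defaults = {"title": root.findtext(".//title")}
metadata = apply_rules(root, defaults, RULES)
print(metadata)
```

In this sample, the title is overwritten by the matching rule, price is added as a new field, and rating stays unset because its path matches nothing.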