Introducing the HtmlElement Rule Property

If you have a paid subscription (Standard plan or above), you can create rules that control which links our crawler follows and which it ignores:

[Image: HtmlElement Rule Property]

Previously, a rule could only be based on the URL of a link (example: Url ENDSWITH ".gif"). It’s now also possible to include or exclude links based on where they are found in the HTML code. As an example, the following exclude rule makes our crawler ignore all <a> tags found inside HTML elements whose class attribute includes footer (for example class="footer"):

HtmlElement = ".footer a"
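
To make the meaning of this selector concrete, here is a minimal Python sketch of which links `.footer a` selects. It only illustrates the selector and makes no claim about how the crawler works internally. The class name `FooterLinkFinder` and the sample HTML are made up for this example, and the code uses only the standard library's `html.parser`.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FooterLinkFinder(HTMLParser):
    """Sorts <a href> values by whether ".footer a" would select the tag."""

    def __init__(self):
        super().__init__()
        self.stack = []         # one (tag, has_footer_class) entry per open element
        self.footer_links = []  # links that ".footer a" selects
        self.other_links = []   # links the selector does not select

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Descendant combinator: an ancestor at any depth may carry the class.
            inside_footer = any(has_footer for _, has_footer in self.stack)
            target = self.footer_links if inside_footer else self.other_links
            target.append(attrs["href"])
        if tag not in VOID_TAGS:
            classes = (attrs.get("class") or "").split()
            self.stack.append((tag, "footer" in classes))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


# Hypothetical page fragment, used only for this illustration.
SAMPLE = """
<div class="content"><a href="/products">Products</a></div>
<div class="site footer"><p><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></p></div>
"""

finder = FooterLinkFinder()
finder.feed(SAMPLE)
print(finder.footer_links)  # ['/terms', '/privacy']
print(finder.other_links)   # ['/products']
```

Note that the second `<div>` has two classes (`site footer`) and still matches, because `.footer` only requires that footer is one of the element's classes.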

If you have worked with CSS before, the syntax should already be familiar to you. The table below lists the supported ways of selecting HTML elements:

Selector         Description
.class           Selects all elements that have the specified class
#id              Selects the element whose ID attribute has the specified value
element          Selects all elements that have the specified tag name
[attr]           Selects all elements that have an attribute with the specified name
[attr=value]     Selects all elements that have an attribute with the specified name and exactly the specified value
[attr~=value]    Selects all elements that have an attribute with the specified name and a value that is a space-separated list of words, one of which is the specified word
[attr|=value]    Selects all elements that have an attribute with the specified name and a value equal to the specified string or beginning with that string followed by a hyphen (-)
[attr^=value]    Selects all elements that have an attribute with the specified name and a value beginning with the specified string
[attr$=value]    Selects all elements that have an attribute with the specified name and a value ending with the specified string
[attr*=value]    Selects all elements that have an attribute with the specified name and a value containing the specified string
*                Selects all elements
A B              Selects all elements selected by B that are inside elements selected by A (at any depth)
A > B            Selects all elements selected by B whose parent is an element selected by A
A ~ B            Selects all elements selected by B that come after an element selected by A (with the same parent)
A + B            Selects all elements selected by B that immediately follow an element selected by A (with the same parent)
A, B             Selects all elements selected by either A or B
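
The attribute operators are easy to mix up, so here is a small Python sketch of how each one compares an attribute value, following the standard CSS definitions. The function name `attribute_matches` is hypothetical, and the sketch assumes case-sensitive comparison; it is an illustration, not the crawler's code.

```python
def attribute_matches(operator, actual, expected=""):
    """Return True if an attribute value satisfies a CSS attribute operator.

    `actual` is the attribute's value, or None when the attribute is absent.
    `operator` is "" for [attr], or one of "=", "~=", "|=", "^=", "$=", "*=".
    """
    if actual is None:
        return False  # every form requires the attribute to be present
    if operator == "":
        return True   # [attr]: presence is enough
    if operator == "=":
        return actual == expected
    if operator == "~=":
        # Value is a whitespace-separated word list; one word must be equal.
        return expected in actual.split()
    if operator == "|=":
        return actual == expected or actual.startswith(expected + "-")
    if expected == "":
        return False  # in CSS, ^=, $= and *= never match an empty string
    if operator == "^=":
        return actual.startswith(expected)
    if operator == "$=":
        return actual.endswith(expected)
    if operator == "*=":
        return expected in actual
    raise ValueError(f"unknown operator: {operator!r}")


print(attribute_matches("~=", "nofollow noopener", "nofollow"))  # True
print(attribute_matches("~=", "nofollowed", "nofollow"))         # False
print(attribute_matches("*=", "nofollowed", "nofollow"))         # True
print(attribute_matches("|=", "en-us", "en"))                    # True
print(attribute_matches("|=", "english", "en"))                  # False
print(attribute_matches("$=", "/images/logo.png", ".png"))       # True
```

The second and third calls show the difference between `~=` (whole word) and `*=` (any substring).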

Equipped with this knowledge, it’s possible to construct quite powerful rules:

Example                                                      Matched HTML elements
HtmlElement = "#search > a, #filter > a"                     <a> tags directly under elements with the IDs search or filter
HtmlElement = "a[rel~=nofollow]"                             <a> tags whose rel attribute contains the word nofollow (for example rel="nofollow")
HtmlElement = "img[src$=.png]"                               <img> tags with a src attribute value ending in .png
HtmlElement = ".comments *"                                  Everything inside elements with a comments class
HtmlElement = "head > link[rel=alternate][hreflang=en-us]"   All <link rel="alternate" hreflang="en-us"> elements directly inside the <head> tag
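
As a rough illustration of the first example, the Python sketch below shows which links `#search > a, #filter > a` selects in a small page fragment. The class name `DirectChildLinkFinder` and the HTML are invented for this example, and the code says nothing about how the crawler itself evaluates rules; it only uses the standard library's `html.parser`.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DirectChildLinkFinder(HTMLParser):
    """Collects href values of <a> tags whose parent has one of the given IDs."""

    def __init__(self, parent_ids):
        super().__init__()
        self.parent_ids = set(parent_ids)  # the comma in the selector acts as "or"
        self.stack = []                    # one (tag, id) entry per open element
        self.matched = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Child combinator: only the immediate parent (top of the stack) counts.
        if tag == "a" and self.stack and self.stack[-1][1] in self.parent_ids:
            self.matched.append(attrs.get("href"))
        if tag not in VOID_TAGS:
            self.stack.append((tag, attrs.get("id")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


# Hypothetical page fragment, used only for this illustration.
SAMPLE = """
<div id="search"><a href="/search?q=shoes">Shoes</a><span><a href="/help">Help</a></span></div>
<nav id="filter"><a href="/filter/red">Red</a></nav>
<ul id="sort"><li><a href="/sort/price">Price</a></li></ul>
"""

finder = DirectChildLinkFinder(["search", "filter"])
finder.feed(SAMPLE)
print(finder.matched)  # ['/search?q=shoes', '/filter/red']
```

The `/help` link is not selected even though it is inside `#search`, because it is nested in a `<span>` and is therefore not a direct child. A rule written as `#search a` would select it as well.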

This feature has been in beta for a while and has proven to be quite a valuable addition. We hope you will find it as useful as we do. If you run into any problems or have a suggestion, please drop us a note.

