So you have a requirement that needs you to work with real-world HTML: crawl web pages, fetch the data you want and feed it to your perpetually information-hungry database. What do you do?
You approach your friend Google, tell it what’s troubling you and ask it to make your life easier, as it has been doing for over a decade now. Google delightfully transports you to the gates of yet another Java library. The sign on the board reads Jsoup.
You enter and find out that Jsoup provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. As you read through, you’re suddenly struck by the realization that your ginormous requirement was just a figment of your imagination.
Here’s how it’s done. Start by adding the Jsoup jar as a dependency in your BuildConfig.groovy.
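A minimal sketch of what that dependency entry might look like, assuming a Grails 2.x BuildConfig.groovy that resolves from Maven Central. The org.jsoup:jsoup coordinates are Jsoup's Maven coordinates; the version number shown is only an assumption for illustration, so swap in whichever release you actually want:
[groovy]
grails.project.dependency.resolution = {
    repositories {
        mavenCentral()
    }
    dependencies {
        // Version is an assumption for illustration; use the release you need
        compile 'org.jsoup:jsoup:1.7.2'
    }
}
[/groovy]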
So basically, Jsoup parses HTML to the same DOM as modern browsers do. It helps you to:
i) Scrape and parse HTML from a URL, file, or string
ii) Find and extract data, using DOM traversal or CSS selectors
iii) Manipulate the HTML elements, attributes, and text
iv) Clean user-submitted content against a safe whitelist, to prevent XSS attacks
v) Output tidy HTML
This post is mainly concerned with the first two features, and the sections below walk you through using Jsoup for them. If your requirements call for the other Jsoup features as well, please refer to the full Jsoup documentation.
I. How to load a Document from a URL
To load an instance of a Document object, use the Jsoup.connect(String url) method, as shown below:
[groovy]Document document = Jsoup.connect("http://example.com/").get();[/groovy]
Here is how the above piece of code works:
The connect(String url) method creates a new Connection for you, and get() executes the request as an HTTP GET and parses the returned HTML into a Document.
The Connection interface is designed for method chaining, so you can build up a request with exactly the settings you need for the given URL. For example, the following sets a user agent, a cookie and a timeout of 3000 milliseconds, then sends the request as a POST:
[groovy]Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").cookie("auth", "token").timeout(3000).post();[/groovy]
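Since fetching a page can fail (network errors, timeouts, bad responses), it is worth wrapping the call. The following is only a rough sketch of one way to do that; the URL, the timeout value and the error message are placeholders, not a recommended configuration:
[groovy]
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

try {
    // Build the request step by step, then execute it as a GET
    Document doc = Jsoup.connect("http://example.com/")
            .userAgent("Mozilla")
            .timeout(5000)   // timeout in milliseconds (placeholder value)
            .get();
    // title() returns the text of the page's <title> element
    println doc.title();
} catch (IOException e) {
    // get() throws IOException when the page cannot be fetched
    println "Could not fetch page: " + e.message;
}
[/groovy]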
II. How to use selector-syntax to find elements
So now you have successfully fetched the HTML content you desire. Next, you want to find the elements that hold the content you are after. With Jsoup you do this using a CSS or jQuery-like selector syntax.
For this, Jsoup provides the
a) Element.select(String selector) method:
[groovy]Element element = document.select("div.title").first();[/groovy]
This returns the first div element with class “title”, or null if nothing matches. Calling element.text() on it then gives you the text content of the div.
b) Elements.select(String selector) method:
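To try the selector without touching the network, here is a small sketch that parses an inline HTML string with Jsoup.parse(String html). The markup and variable names are made up purely for illustration:
[groovy]
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

// Hypothetical markup standing in for a fetched page
String html = '<html><body><div class="title">Hello Jsoup</div><div>Other</div></body></html>';
Document document = Jsoup.parse(html);

// first() returns null when the selector matches nothing, so check before use
Element title = document.select("div.title").first();
if (title != null) {
    println title.text();   // prints "Hello Jsoup"
}
[/groovy]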
[groovy]Elements elements = document.select("a[href]");[/groovy]
This will return a list of all the anchor tags (links) in the HTML that have an href attribute. Calling element.attr("href") on any of them then gives you the URL that the anchor tag points to.
The select method is available on a Document, an Element, or an Elements instance. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls. Once you have the desired Element or list of Elements, you can extract an attribute with the .attr("attribute") method, or the content as a String with the .text() method.
To dig further into Jsoup, refer to the following link:
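As a rough illustration of chaining select calls and reading attributes, the sketch below narrows the search to one container and then lists its links. The HTML, the "nav" id and the base URI are assumptions invented for this example; the base URI is passed to Jsoup.parse so that the "abs:" attribute prefix can resolve relative links:
[groovy]
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element
import org.jsoup.select.Elements

// Hypothetical markup: one link inside the nav div, one link outside it
String html = '<div id="nav"><a href="/about">About</a><a>No link</a></div>' +
              '<p><a href="http://example.org/x">Elsewhere</a></p>';
Document document = Jsoup.parse(html, "http://example.com/");

// First select the container, then select only anchors with an href inside it
Elements navLinks = document.select("div#nav").select("a[href]");
for (Element link : navLinks) {
    // attr("href") is the raw value; attr("abs:href") is resolved against the base URI
    println link.text() + ": " + link.attr("href") + " -> " + link.attr("abs:href");
}
[/groovy]
Only the link inside the nav div is listed here, because the second select runs in the context of the elements matched by the first.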