jsoup – Parsing HTML using jsoup


jsoup is an easy-to-use yet powerful Java library for extracting and manipulating HTML data. This post covers basic usage of jsoup, with sample code for parsing an HTML table.

jsoup is an open-source Java library, distributed under the MIT licence, for extracting and manipulating data using the best of DOM, CSS, and jQuery-like methods. Read more about jsoup at jsoup.org.

jsoup is useful for tasks such as scraping web pages and traversing a page to extract its links.

Selecting HTML elements using jsoup

Using jsoup is easy: elements are selected with CSS/jQuery-like selectors, e.g. div#id.

[java] Elements customer = doc.select("div#customer"); // div with id "customer"
Elements links = doc.select("a"); // all links
[/java]

Read more about jsoup selectors.
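
To make the selectors concrete, here is a minimal sketch that runs them against a small inline HTML string instead of a live page. The markup, the class name SelectorSketch, and the count helper are made up for illustration; only Jsoup.parse, select, text, and attr come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {
    // Hypothetical markup, inlined so the sketch needs no network access.
    static final String HTML =
            "<div id=\"customer\">"
            + "<a href=\"/profile.html\">Profile</a>"
            + "<a href=\"/logo.png\">Logo</a>"
            + "</div>"
            + "<div class=\"footer\"><a href=\"/about.html\">About</a></div>";

    // Returns how many elements in the sample markup match the CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Descendant selector: links inside the div with id "customer".
        for (Element link : doc.select("div#customer a")) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }

        System.out.println(count("a[href]"));       // every link with an href: 3
        System.out.println(count("a[href$=.png]")); // href ending in .png: 1
        System.out.println(count("div.footer a"));  // links inside div.footer: 1
    }
}
```

Parsing from a string like this is also a convenient way to try out a selector before pointing the code at a real URL.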

Manipulating Data using jsoup

HTML can also be manipulated in a jQuery-like way. The following example sets an attribute and adds a class on every matching element:

[java] doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
[/java]
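
The one-liner above assumes an existing doc. As a minimal, self-contained sketch, the same chained call can be tried on an inline fragment; the class ManipulationSketch, the decorate helper, and the sample markup are hypothetical names used only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulationSketch {
    // Applies the chained attr/addClass call and returns the first matching
    // div, or null when the markup contains no div.masthead.
    static Element decorate(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
        return doc.selectFirst("div.masthead");
    }

    public static void main(String[] args) {
        Element div = decorate("<div class=\"masthead\">Welcome</div>");
        System.out.println(div.attr("title"));         // jsoup
        System.out.println(div.hasClass("round-box")); // true
        System.out.println(div.outerHtml());           // the modified markup
    }
}
```

Because select returns an Elements collection, attr and addClass are applied to every match, and the call is a harmless no-op when nothing matches.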

Parsing HTML Table using jsoup

The following example demonstrates parsing a simple HTML table using jsoup.

Consider a scenario in which an HTML table needs to be parsed and stored in CSV or Excel format. jsoup handles such scenarios with ease.

Sample Data

Company | Contact | Country
Alfreds Futterkiste | Maria Anders | Germany
Centro comercial Moctezuma | Francisco Chang | Mexico
Ernst Handel | Roland Mendel | Austria
Island Trading | Helen Bennett | UK
Laughing Bacchus Winecellars | Yoshi Tannamuri | Canada
Magazzini Alimentari Riuniti | Giovanni Rovelli | Italy

Solution

1. Get the HTML document.
2. Select the table element using a selector.
3. Select the tr elements from the table.
4. Iterate over the table data.
5. Store the data in the required format.

[java] public ArrayList<Elements> parseTable(String url) {
    ArrayList<Elements> data = new ArrayList<>();
    try {
        Document doc = Jsoup.connect(url).get(); // Fetch and parse the page
        Element table = doc.select("table#customers").get(0); // Select the table
        Elements rows = table.select("tr"); // Select the tr elements
        data.add(rows.get(0).select("th")); // First row holds the table headings
        for (int j = 1; j < rows.size(); j++) { // Iterate through the data rows
            data.add(rows.get(j).select("td")); // Store each row's cells in the list
        }
    } catch (Exception e) { // Network failure, or no matching table on the page
        e.printStackTrace();
    }
    return data;
}[/java]
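
The method above needs a live URL and returns jsoup Elements. As a variation for experimenting offline, the sketch below parses an inline HTML string and returns the cell text as plain strings. It is an illustration under stated assumptions, not the post's original method: the class TableTextSketch, the SAMPLE markup, and the List<List<String>> return type are all made up here, and a missing table yields an empty list instead of an exception.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableTextSketch {
    // Hypothetical sample markup: a shortened version of the table in this post.
    static final String SAMPLE =
            "<table id=\"customers\">"
            + "<tr><th>Company</th><th>Contact</th><th>Country</th></tr>"
            + "<tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>"
            + "<tr><td>Island Trading</td><td>Helen Bennett</td><td>UK</td></tr>"
            + "</table>";

    // Returns one list of cell texts per table row; empty if the table is absent.
    static List<List<String>> parseTable(String html) {
        List<List<String>> rows = new ArrayList<>();
        Document doc = Jsoup.parse(html);
        Element table = doc.selectFirst("table#customers");
        if (table == null) {
            return rows;
        }
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            // "th, td" matches heading and data cells alike, in document order.
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text()); // empty cells are kept as "" to preserve columns
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (List<String> row : parseTable(SAMPLE)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Returning plain strings keeps the rest of the program independent of jsoup types, which can make the later storage step simpler.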

Complete Code

[java] import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseTable {
    public ArrayList<Elements> parseTable(String url) {
        ArrayList<Elements> data = new ArrayList<>();
        try {
            Document doc = Jsoup.connect(url).get(); // Fetch and parse the page
            Element table = doc.select("table#customers").get(0); // Select the table
            Elements rows = table.select("tr"); // Select the tr elements
            data.add(rows.get(0).select("th")); // First row holds the table headings
            for (int j = 1; j < rows.size(); j++) { // Iterate through the data rows
                data.add(rows.get(j).select("td")); // Store each row's cells in the list
            }
        } catch (Exception e) { // Network failure, or no matching table on the page
            e.printStackTrace();
        }
        return data;
    }

    public static void main(String[] args) {
        ParseTable parseTable = new ParseTable();
        ArrayList<Elements> tableData = parseTable.parseTable("https://www.evertechie.com/jsoup-parsing-html-using-jsoup/");
        for (Elements elements : tableData) {
            for (Element element : elements) {
                System.out.println(element.text()); // One cell per line
            }
            System.out.println(); // Blank line between rows
        }
    }
}
[/java]

Output

[bash] Company
Contact
Country

Alfreds Futterkiste
Maria Anders
Germany

Centro comercial Moctezuma
Francisco Chang
Mexico

Ernst Handel
Roland Mendel
Austria

Island Trading
Helen Bennett
UK

Laughing Bacchus Winecellars
Yoshi Tannamuri
Canada

Magazzini Alimentari Riuniti
Giovanni Rovelli
Italy

[/bash]
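
The last step of the solution, storing the data in the required format, is not shown in the code above. A minimal sketch for the CSV case follows, assuming each row has already been converted to a list of strings (for example by calling text() on every cell). The class CsvSketch and its escape and toCsv helpers are hypothetical names for this illustration; the quoting follows the common RFC 4180 convention and uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CsvSketch {
    // Wraps a field in double quotes when it contains a comma, a quote,
    // or a line break; embedded quotes are doubled.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins each row with commas and terminates every row with a newline.
    static String toCsv(List<List<String>> rows) {
        StringBuilder csv = new StringBuilder();
        for (List<String> row : rows) {
            List<String> escaped = new ArrayList<>();
            for (String cell : row) {
                escaped.add(escape(cell));
            }
            csv.append(String.join(",", escaped)).append("\n");
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Company", "Contact", "Country"),
                Arrays.asList("Alfreds Futterkiste", "Maria Anders", "Germany"),
                Arrays.asList("Island Trading", "Helen Bennett", "UK"));
        System.out.print(toCsv(rows));
    }
}
```

The resulting string can be written to a file with the standard java.nio.file.Files API. For Excel output a dedicated library would be needed, which is outside the scope of this sketch.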
