jsoup is an easy yet powerful tool for extracting and manipulating HTML data in Java. This post covers basic usage of jsoup, with sample code for parsing an HTML table.
jsoup is an open source Java library, distributed under the MIT licence, for extracting and manipulating data using the best of DOM, CSS, and jquery-like methods. Read more about jsoup here.
jsoup is particularly useful for scraping web pages and for traversing and extracting links from them.
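As a rough illustration of the link-extraction use case, the sketch below parses a small inline HTML snippet and resolves every link against a base URL. The class name `LinkExtractionSketch`, the sample markup, and the `https://example.com/` base URL are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractionSketch {

    // Returns the absolute URL of every <a> element that has an href attribute.
    static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative links such as "/docs".
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup, inlined so the sketch needs no network access.
        String html = "<a href=\"/docs\">Docs</a>"
                + "<a href=\"https://jsoup.org/\">jsoup</a>"
                + "<a>no href</a>";
        for (String url : extractLinks(html, "https://example.com/")) {
            System.out.println(url);
        }
    }
}
```

For a live page, the `Document` would typically come from `Jsoup.connect(url).get()` instead of `Jsoup.parse`.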
Selecting HTML elements using jsoup
Using jsoup is easy: elements can be selected with CSS/jquery-like selectors, e.g. div#id.
[java]
Elements customer = doc.select("div#customer"); // div with id "customer"
Elements links = doc.select("a"); // select all links
[/java]
Read more about jsoup selectors.
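To make the selector syntax a little more concrete, here is a small self-contained sketch that runs an id selector, a class selector, and an attribute selector against an inline document. The class name `SelectorSketch`, its helper methods, and the sample markup are hypothetical and exist only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {

    // Hypothetical markup, inlined so the sketch needs no network access.
    static final String HTML =
            "<div id=\"customer\" class=\"card\">Maria Anders</div>"
            + "<div class=\"card\">Helen Bennett</div>"
            + "<a href=\"logo.png\">Logo</a><a href=\"/help\">Help</a>";

    // Text of the first element matching the query, or null if nothing matches.
    static String textOfFirst(Document doc, String cssQuery) {
        Element el = doc.select(cssQuery).first();
        return el == null ? null : el.text();
    }

    // Number of elements matching the query.
    static int count(Document doc, String cssQuery) {
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(textOfFirst(doc, "div#customer")); // Maria Anders
        System.out.println(count(doc, "div.card"));           // 2
        System.out.println(count(doc, "a[href$=.png]"));      // 1
    }
}
```

Checking the result of `first()` for null avoids the exception that `get(0)` would throw when a selector matches nothing.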
Manipulating Data using jsoup
HTML can also be manipulated in a jquery-like fashion, for example by setting attributes, adding classes, or appending markup to the selected elements.
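A minimal sketch of such manipulation is given below, assuming a small inline HTML snippet. The class name `ManipulateSketch`, the `decorate` method, and the `highlight` class are made up for this example; the jsoup calls used are `attr`, `addClass`, and `append` on `Elements`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ManipulateSketch {

    // Parses the HTML, modifies it in place, and returns the resulting document.
    static Document decorate(String html) {
        Document doc = Jsoup.parse(html);
        // Set rel="nofollow" on every link.
        doc.select("a").attr("rel", "nofollow");
        // Add a CSS class to the customer div and append a child element to it.
        doc.select("div#customer")
                .addClass("highlight")
                .append("<span>VIP</span>");
        return doc;
    }

    public static void main(String[] args) {
        // Hypothetical markup used only for this illustration.
        String html = "<div id=\"customer\"><a href=\"/orders\">Orders</a></div>";
        Document doc = decorate(html);
        System.out.println(doc.body().html());
    }
}
```

As in jquery, each call applies to every element in the selection and returns the selection, so the calls can be chained.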
Parsing HTML Table using jsoup
The following example demonstrates parsing a simple HTML table using jsoup.
Consider a scenario in which an HTML table needs to be parsed and stored in CSV or Excel format.
jsoup handles such scenarios with ease.
Sample Data
Company | Contact | Country |
---|---|---|
Alfreds Futterkiste | Maria Anders | Germany |
Centro comercial Moctezuma | Francisco Chang | Mexico |
Ernst Handel | Roland Mendel | Austria |
Island Trading | Helen Bennett | UK |
Laughing Bacchus Winecellars | Yoshi Tannamuri | Canada |
Magazzini Alimentari Riuniti | Giovanni Rovelli | Italy |
Solution
1. Get the HTML document.
2. Select the table element using a selector.
3. Select the tr elements (rows) from the table.
4. Iterate over the table data.
5. Store the data in the required format.
[java]
public ArrayList<Elements> parseTable(String URL) {
    ArrayList<Elements> data = new ArrayList<>();
    try {
        Document doc = Jsoup.connect(URL).get(); // Get data from input location
        Element table = doc.select("table#customers").get(0); // Select table
        Elements rows = table.select("tr"); // Select tr's
        data.add(rows.get(0).select("th")); // Select table heading
        for (int j = 1; j < rows.size(); j++) { // Iterate through table data
            data.add(rows.get(j).select("td")); // Store result in ArrayList
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return data;
}
[/java]
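The last step, storing the data in the required format, is not shown above. One possible sketch for the CSV case follows; it parses an inline table rather than fetching a URL so that it runs without network access. The class name `TableToCsvSketch`, the `escape` and `toCsvLines` helpers, and the quoting rules (quote a field when it contains a comma, a quote, or a line break) are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableToCsvSketch {

    // Quotes a field if it contains a comma, a double quote, or a line break.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Returns one CSV line per table row; empty if the table is not found.
    static List<String> toCsvLines(String html, String tableSelector) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        Element table = doc.select(tableSelector).first();
        if (table == null) {
            return lines;
        }
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            // Header cells (th) and data cells (td) are handled alike.
            for (Element cell : row.select("th, td")) {
                cells.add(escape(cell.text()));
            }
            lines.add(String.join(",", cells));
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<table id=\"customers\">"
                + "<tr><th>Company</th><th>Contact</th><th>Country</th></tr>"
                + "<tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>"
                + "<tr><td>Island Trading</td><td>Helen Bennett</td><td>UK</td></tr>"
                + "</table>";
        for (String line : toCsvLines(html, "table#customers")) {
            System.out.println(line);
        }
    }
}
```

The resulting lines could then be written to a file, for instance with `java.nio.file.Files.write`.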
Complete Code
[java]
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseTable {

    public ArrayList<Elements> parseTable(String URL) {
        ArrayList<Elements> data = new ArrayList<>();
        try {
            Document doc = Jsoup.connect(URL).get(); // Get data from input location
            Element table = doc.select("table#customers").get(0); // Select table
            Elements rows = table.select("tr"); // Select tr's
            data.add(rows.get(0).select("th")); // Select table heading
            for (int j = 1; j < rows.size(); j++) { // Iterate through table data
                data.add(rows.get(j).select("td")); // Store result in ArrayList
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return data;
    }

    public static void main(String[] args) {
        ParseTable parseTable = new ParseTable();
        ArrayList<Elements> tableData = parseTable.parseTable("https://www.evertechie.com/jsoup-parsing-html-using-jsoup/");
        for (Elements elements : tableData) {
            for (Element element : elements) { // Print one cell per line
                System.out.println(element.text());
            }
            System.out.println("\n"); // Separate rows
        }
    }
}
[/java]
Output
[bash]
Company
Contact
Country
Alfreds Futterkiste
Maria Anders
Germany
Centro comercial Moctezuma
Francisco Chang
Mexico
Ernst Handel
Roland Mendel
Austria
Island Trading
Helen Bennett
UK
Laughing Bacchus Winecellars
Yoshi Tannamuri
Canada
Magazzini Alimentari Riuniti
Giovanni Rovelli
Italy
[/bash]