Fetching Pages from the Web using Jsoup.connect

Jsoup provides a mechanism for connecting to a web server and fetching pages, so no additional libraries are required for connection or request/response handling.

In my previous post, Parsing HTML using jsoup, I covered how jsoup can be used for parsing and scraping HTML pages. In this post I explore the different connection methods and cookie handling in jsoup.

Fetching Pages using Jsoup (HTTP requests)

Jsoup has a Connection interface (known implementation: the HttpConnection class) that can be used to fetch pages from a web server. A new connection is created with Jsoup.connect(String url).

Making a simple GET/POST request using Jsoup

[java]// GET request
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();[/java]
[java]// POST request
Document doc = Jsoup.connect("http://en.wikipedia.org/").post();[/java]
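
For reference, a self-contained version of the GET request might look like the sketch below. The class name FetchExample and the helper prepare are hypothetical names chosen for illustration; the jsoup calls themselves (Jsoup.connect, method, execute, statusCode, parse) are part of the Connection API. The sketch uses execute() instead of get() so that the status code can be inspected before the body is parsed.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchExample {

    // Hypothetical helper: builds a connection but does not send anything yet.
    static Connection prepare(String url, Connection.Method method) {
        return Jsoup.connect(url).method(method);
    }

    public static void main(String[] args) throws IOException {
        // execute() sends the request and returns the raw response.
        Connection.Response response =
                prepare("http://en.wikipedia.org/", Connection.Method.GET).execute();
        System.out.println("Status: " + response.statusCode());

        // parse() turns the response body into a Document.
        Document doc = response.parse();
        System.out.println("Title: " + doc.title());
    }
}
```

Calling get() or post() directly, as in the one-liners, is shorter when only the parsed Document is needed.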

Adding data to a POST request

The simplest way to add data to a POST request with jsoup is data("key", "value"). To add several values at once, pass a Map<String, String> or a Collection<Connection.KeyVal> to data(...) instead.

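[java]// The target URL was lost from the original snippet; http://example.com/ is a placeholder.
Document doc = Jsoup.connect("http://example.com/").data("key", "value").post();[/java]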
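
As a rough sketch of the Map variant, the example below prepares a POST request with several form fields and prints them without sending anything. The class name PostDataExample, the helper prepareLogin, the example.com URL and the "remember" field are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

class PostDataExample {

    // Hypothetical helper: prepares a POST request carrying form fields.
    // Nothing is sent until post() or execute() is called on the result.
    static Connection prepareLogin(String url) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "username");
        fields.put("password", "pass");

        return Jsoup.connect(url)
                .data(fields)                  // several fields at once
                .data("remember", "true")      // one more key/value pair (made-up field)
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        Connection conn = prepareLogin("http://example.com/login"); // placeholder URL
        // Inspect the form data attached to the request; no network access happens here.
        for (Connection.KeyVal kv : conn.request().data()) {
            System.out.println(kv.key() + " = " + kv.value());
        }
    }
}
```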

Using Cookies

Similar to adding POST data, a cookie can be added to a request using cookie("key", "value"). To add several cookies at once, pass a Map<String, String> to cookies(...).

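[java]// Fragment of a Jsoup.connect(...) call chain; sessionId is assumed to be defined elsewhere.
.cookie("SESSIONID", sessionId)[/java]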
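
A minimal sketch of both variants together is shown below. The class name CookieExample, the helper withCookies, the example.com URL and the "lang" and "theme" cookies are hypothetical; cookie(String, String) and cookies(Map) are the jsoup calls being illustrated. The request is only prepared and inspected, not sent.

```java
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

class CookieExample {

    // Hypothetical helper: attaches one session cookie plus a map of extra cookies.
    static Connection withCookies(String url, String sessionId) {
        Map<String, String> extra = new HashMap<>();
        extra.put("lang", "en");      // made-up cookie for illustration
        extra.put("theme", "dark");   // made-up cookie for illustration

        return Jsoup.connect(url)
                .cookie("SESSIONID", sessionId)  // single cookie
                .cookies(extra);                 // several cookies at once
    }

    public static void main(String[] args) {
        Connection conn = withCookies("http://example.com/", "abc123"); // placeholder values
        // Print the cookies that would be sent with the request.
        System.out.println(conn.request().cookies());
    }
}
```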

Authentication using jsoup

At times, cookies must be sent with each request. This is easy to handle by reusing the cookies from a Connection.Response object.

In the following example, cookies are first obtained by requesting the home page. The obtained cookies are then sent with the subsequent request to the web server.

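[java]// Getting cookies
// NOTE: the URLs were lost from the original snippet; the example.com addresses are placeholders.
Connection.Response home = Jsoup
.connect("http://example.com/")
.method(Connection.Method.GET)
.execute();

// Using them in the next request
Document document = Jsoup
.connect("http://example.com/login")
.data("name", "username")
.data("password", "pass")
.cookies(home.cookies())
.post();[/java]

Specifying User Agent and Timeout for jsoup requests

  1. User Agent

    The user agent for a request can be set with the userAgent(String) method. This is necessary when the web server serves different pages to mobile and desktop clients. Complete lists of user agent strings are available online.

  2. Timeout

    According to the javadocs, the default timeout for an org.jsoup.Connection is 3 seconds (newer jsoup releases document a default of 30 seconds, so check the javadoc for your version). The value passed to timeout(int) is in milliseconds. You might need to increase it when a page takes longer than the default to download.

[java]// Fragment of a Jsoup.connect(...) call chain
.userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36")
.timeout(12000) // 12 seconds, in milliseconds[/java]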
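
Cookie reuse, the user agent and the timeout can be combined so that every request in a session is configured the same way. The sketch below is one possible arrangement, not a definitive implementation: the class name SessionExample, the helpers newConnection and login, and the example.com URLs are assumptions made for illustration, and the form field names are taken from the authentication example. A real site will have its own login URL and field names, and the placeholder URLs are not expected to accept the POST.

```java
import java.io.IOException;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SessionExample {

    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36";
    static final int TIMEOUT_MILLIS = 12000; // 12 seconds

    // Hypothetical helper: applies the same user agent and timeout to every request.
    static Connection newConnection(String url) {
        return Jsoup.connect(url).userAgent(USER_AGENT).timeout(TIMEOUT_MILLIS);
    }

    // Hypothetical helper: fetches the home page, then posts the login form
    // with the cookies the server handed out.
    static Document login(String homeUrl, String loginUrl) throws IOException {
        Connection.Response home = newConnection(homeUrl)
                .method(Connection.Method.GET)
                .execute();
        Map<String, String> cookies = home.cookies();

        return newConnection(loginUrl)
                .data("name", "username")
                .data("password", "pass")
                .cookies(cookies)
                .post();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URLs; replace them with the real site's addresses.
        Document page = login("http://example.com/", "http://example.com/login");
        System.out.println(page.title());
    }
}
```

Keeping the shared settings in one helper avoids repeating the userAgent and timeout calls on every request.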

Using jsoup for HTTPS requests

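You might get an exception when requesting HTTPS content. A simple workaround is to set the jsse.enableSNIExtension property to false. Place the following line above the Jsoup.connect statement. Note that this is a JVM-wide system property, so it turns off SNI for every TLS connection the process makes; some servers require SNI, so use it only when it is actually needed.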

[java]System.setProperty("jsse.enableSNIExtension", "false");[/java]
