Introduction to Jsoup
Jsoup is a powerful Java library that works with real-world HTML. It provides a convenient API for extracting and manipulating data using DOM, CSS, and jQuery-like methods. Jsoup can handle HTML parsing, content extraction, DOM traversal, and much more.
Installation
Adding Jsoup to Your Project
To use Jsoup with Maven, add the following dependency to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version> <!-- or the latest version -->
</dependency>
For Gradle:
implementation 'org.jsoup:jsoup:1.17.2' // or the latest version
Basic Usage
Parsing HTML from a URL
Jsoup allows you to parse HTML from a URL and extract data easily.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
public class JsoupExample {
    public static void main(String[] args) {
        try {
            // Fetch the page over HTTP and parse it into a Document
            Document document = Jsoup.connect("https://example.com").get();
            System.out.println(document.title());
            // selectFirst returns null when nothing matches, so check before use
            Element element = document.selectFirst("h1");
            if (element != null) {
                System.out.println("First h1 element: " + element.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Explanation: This example demonstrates how to parse HTML from a URL and extract the document title and the first h1 element.
Output:
Example Domain
First h1 element: Example Domain
Parsing HTML from a String
Jsoup can also parse HTML from a string.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JsoupStringExample {
    public static void main(String[] args) {
        String html = "<html><head><title>My Page</title></head>"
                + "<body><p>Hello, Amit!</p></body></html>";
        Document document = Jsoup.parse(html);
        System.out.println(document.title());
        Element body = document.body();
        System.out.println("Body text: " + body.text());
    }
}
Explanation: This example demonstrates how to parse HTML from a string and extract the document title and body text.
Output:
My Page
Body text: Hello, Amit!
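Real-world input is often a fragment with unclosed tags rather than a complete page. The following is a minimal sketch of how such input could be handled with Jsoup.parseBodyFragment; the class name JsoupFragmentSketch and the helper bodyText are illustrative names chosen here, not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class JsoupFragmentSketch {
    // Hypothetical helper: parses a body fragment and returns the text of the repaired body.
    static String bodyText(String fragment) {
        Document document = Jsoup.parseBodyFragment(fragment);
        return document.body().text();
    }

    public static void main(String[] args) {
        // The <p> and <b> tags are never closed; the parser closes them while building the tree.
        String fragment = "<p>Hello, <b>Amit";
        Document document = Jsoup.parseBodyFragment(fragment);
        // Print the repaired markup, then the plain text.
        System.out.println(document.body().html());
        System.out.println(bodyText(fragment));
    }
}
```

Because the parser always produces a full Document, the same select and text calls shown above work on fragments as well.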
Advanced Features
Selecting Elements
Jsoup provides powerful methods to select elements using CSS selectors.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class JsoupSelectExample {
    public static void main(String[] args) {
        try {
            Document document = Jsoup.connect("https://example.com").get();
            // Select all paragraphs
            Elements paragraphs = document.select("p");
            for (Element paragraph : paragraphs) {
                System.out.println("Paragraph: " + paragraph.text());
            }
            // Select element by ID; getElementById returns null if no element has that ID
            Element div = document.getElementById("main");
            if (div != null) {
                System.out.println("Element with ID 'main': " + div.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Explanation: This example demonstrates how to select elements using CSS selectors and extract their text.
Output:
Paragraph: ...
Element with ID 'main': ...
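Beyond plain tag names, select accepts richer CSS queries. The sketch below runs a few common selector forms against a small inline document; the markup, the class name JsoupSelectorSketch, and the helper texts are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.List;

class JsoupSelectorSketch {
    // Sample markup invented for this sketch.
    static final String HTML =
        "<div id='main'>"
        + "<p class='intro'>Welcome</p>"
        + "<p>Details</p>"
        + "<a href='https://example.com/a'>First</a>"
        + "<a>No target</a>"
        + "</div>"
        + "<p>Footer</p>";

    // Hypothetical helper: returns the text of every element matched by the CSS query.
    static List<String> texts(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("p.intro"));            // tag plus class
        System.out.println(texts("#main > p"));          // direct children of the element with id "main"
        System.out.println(texts("a[href]"));            // links that carry an href attribute
        System.out.println(texts("p:contains(Footer)")); // paragraphs containing the given text
    }
}
```

Note that select never returns null; when nothing matches it returns an empty Elements collection, so only selectFirst and getElementById need a null check.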
Extracting Attributes
Jsoup allows you to extract attributes from elements.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
public class JsoupAttributesExample {
    public static void main(String[] args) {
        try {
            Document document = Jsoup.connect("https://example.com").get();
            // Select the first link
            Element link = document.selectFirst("a");
            if (link != null) {
                System.out.println("Link text: " + link.text());
                System.out.println("Link href: " + link.attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Explanation: This example demonstrates how to extract the href attribute from a link.
Output:
Link text: More information...
Link href: https://www.iana.org/domains/example
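Pages often use relative links, in which case attr("href") returns the value exactly as written in the markup. A minimal sketch of resolving such a link with absUrl follows; the base URI, the sample markup, and the names JsoupAbsUrlSketch and firstAbsoluteLink are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupAbsUrlSketch {
    // Hypothetical helper: resolves the first link's href against the given base URI.
    static String firstAbsoluteLink(String html, String baseUri) {
        // The second argument tells Jsoup what relative URLs should be resolved against.
        Document document = Jsoup.parse(html, baseUri);
        Element link = document.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='guide.html'>Guide</a>";
        String base = "https://example.com/docs/";
        System.out.println(firstAbsoluteLink(html, base)); // prints https://example.com/docs/guide.html
    }
}
```

When a document is fetched with Jsoup.connect(...).get(), the base URI is taken from the fetched URL, so absUrl works without passing one explicitly.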
Modifying HTML
Jsoup allows you to modify the HTML content of a document.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JsoupModifyExample {
    public static void main(String[] args) {
        String html = "<html><head><title>My Page</title></head>"
                + "<body><p>Hello, Vikas!</p></body></html>";
        Document document = Jsoup.parse(html);
        Element body = document.body();
        // Modify the body text
        body.text("Hello, Priya!");
        System.out.println(document.html());
    }
}
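Explanation: This example demonstrates how to modify the text of an element in the document. Note that text(...) replaces all of the element's child nodes, so the original <p> element is removed from the body.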
Output:
<html>
<head>
<title>My Page</title>
</head>
<body>
Hello, Priya!
</body>
</html>
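To change one element while leaving the rest of the body intact, the edit can target that element directly. The sketch below is one way to do this, assuming the page has at least one paragraph; the class name JsoupTargetedEditSketch, the helper edit, and the sample markup are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupTargetedEditSketch {
    // Hypothetical helper: edits the first paragraph and appends another one,
    // leaving sibling elements such as headings untouched.
    static Document edit(String html) {
        Document document = Jsoup.parse(html);
        Element paragraph = document.selectFirst("p");
        if (paragraph != null) {
            paragraph.text("Hello, Priya!");  // replace only this element's text
            paragraph.addClass("greeting");   // add a CSS class
            paragraph.attr("lang", "en");     // set an attribute
        }
        // appendElement creates a new child and returns it, so calls can be chained.
        document.body().appendElement("p").text("Added at the end");
        return document;
    }

    public static void main(String[] args) {
        String html = "<body><h1>Title</h1><p>Hello, Vikas!</p></body>";
        System.out.println(edit(html).body().html());
    }
}
```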
Extracting Data from Tables
Jsoup can be used to extract data from tables in HTML.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupTableExample {
    public static void main(String[] args) {
        String html = "<table><tr><th>Name</th><th>Age</th></tr>"
                + "<tr><td>Amit</td><td>30</td></tr>"
                + "<tr><td>Priya</td><td>28</td></tr></table>";
        Document document = Jsoup.parse(html);
        Elements rows = document.select("table tr");
        for (Element row : rows) {
            Elements cells = row.select("th, td");
            for (Element cell : cells) {
                System.out.print(cell.text() + " ");
            }
            System.out.println();
        }
    }
}
Explanation: This example demonstrates how to extract data from an HTML table and print it.
Output:
Name Age
Amit 30
Priya 28
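Printing cells is enough for a quick look, but downstream code usually wants structured records. The following sketch maps each data row to a map keyed by the header text, assuming the table has a single header row of th cells; the class name JsoupTableRecordsSketch and the helper toRecords are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JsoupTableRecordsSketch {
    // Hypothetical helper: turns each data row into a header-to-value map.
    static List<Map<String, String>> toRecords(String html) {
        Document document = Jsoup.parse(html);
        // Collect header names in column order.
        List<String> headers = new ArrayList<>();
        for (Element header : document.select("table tr th")) {
            headers.add(header.text());
        }
        List<Map<String, String>> records = new ArrayList<>();
        // tr:has(td) skips the header row, which contains only th cells.
        for (Element row : document.select("table tr:has(td)")) {
            Elements cells = row.select("td");
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < cells.size() && i < headers.size(); i++) {
                record.put(headers.get(i), cells.get(i).text());
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String html = "<table><tr><th>Name</th><th>Age</th></tr>"
                + "<tr><td>Amit</td><td>30</td></tr>"
                + "<tr><td>Priya</td><td>28</td></tr></table>";
        System.out.println(toRecords(html)); // prints [{Name=Amit, Age=30}, {Name=Priya, Age=28}]
    }
}
```

A LinkedHashMap is used so that each record keeps the column order of the table.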
Complex Examples
Web Scraping
Jsoup can be used to scrape data from web pages.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class JsoupWebScrapingExample {
    public static void main(String[] args) {
        try {
            // Connect to the website and get the document
            Document document = Jsoup.connect("https://en.wikipedia.org/wiki/List_of_Indian_people").get();
            // Select all people in the list
            Elements people = document.select(".mw-parser-output ul li");
            for (Element person : people) {
                System.out.println("Person: " + person.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Explanation: This example demonstrates how to scrape a list of names from a Wikipedia page.
Output:
Person: ...
Person: ...
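When scraping real sites it is common to set a user agent and a timeout on the connection before fetching. The sketch below shows one way to configure a Connection; the user agent string, the 10-second timeout, and the names JsoupConnectionSketch, configure, and fetch are placeholders chosen for this example.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

class JsoupConnectionSketch {
    // Hypothetical helper: builds a configured connection.
    // No request is sent until get() or execute() is called on it.
    static Connection configure(String url) {
        return Jsoup.connect(url)
            .userAgent("ExampleScraper/1.0") // placeholder identity, replace with your own
            .timeout(10_000)                 // timeout in milliseconds
            .followRedirects(true);
    }

    // Hypothetical helper: performs the HTTP request and parses the response.
    static Document fetch(String url) throws IOException {
        return configure(url).get();
    }

    public static void main(String[] args) {
        // Inspect the configured request without touching the network.
        Connection connection = configure("https://example.com");
        System.out.println("Timeout (ms): " + connection.request().timeout());
        // Calling fetch("https://example.com") would send the request.
    }
}
```

By default, get() throws an org.jsoup.HttpStatusException (a subclass of IOException) for error responses such as 404, so the try/catch pattern used above also covers missing pages.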
Handling Forms
Jsoup can handle form submissions and extract form data.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
public class JsoupFormExample {
    public static void main(String[] args) {
        try {
            Document document = Jsoup.connect("https://example.com/login").get();
            Element form = document.selectFirst("form");
            if (form != null) {
                Map<String, String> formData = formData(form);
                for (Map.Entry<String, String> entry : formData.entrySet()) {
                    System.out.println(entry.getKey() + ": " + entry.getValue());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    // Collects the name and value of every named input in the form
    private static Map<String, String> formData(Element form) {
        // LinkedHashMap keeps the fields in document order
        Map<String, String> data = new LinkedHashMap<>();
        for (Element input : form.select("input")) {
            String name = input.attr("name");
            String value = input.attr("value");
            if (!name.isEmpty()) {
                data.put(name, value);
            }
        }
        return data;
    }
}
Explanation: This example demonstrates how to extract form data from a webpage.
Output (for a login page with empty username and password inputs):
username:
password:
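Jsoup also models forms directly through FormElement, whose formData() method returns the key/value pairs the form would submit. The sketch below uses it on an inline form; the markup, the base URI, and the names JsoupFormElementSketch and submittableData are assumptions made for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JsoupFormElementSketch {
    // Hypothetical helper: returns the submittable fields of the first form, in document order.
    static Map<String, String> submittableData(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<FormElement> forms = document.select("form").forms();
        Map<String, String> data = new LinkedHashMap<>();
        if (forms.isEmpty()) {
            return data;
        }
        for (Connection.KeyVal field : forms.get(0).formData()) {
            data.put(field.key(), field.value());
        }
        return data;
    }

    public static void main(String[] args) {
        // Sample form invented for this sketch; the checkbox is left unchecked.
        String html = "<form action='/login' method='post'>"
                + "<input name='username' value='amit'>"
                + "<input type='hidden' name='token' value='abc123'>"
                + "<input type='checkbox' name='remember' value='yes'>"
                + "</form>";
        System.out.println(submittableData(html, "https://example.com/"));
    }
}
```

Unlike the hand-written loop above, formData() leaves out controls a browser would not submit, such as unchecked checkboxes and disabled inputs.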
Parsing and Modifying Large HTML Documents
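Jsoup can also parse HTML files from disk. Jsoup.parse(File, String) builds the whole document tree in memory, so very large files need enough heap to hold it; once parsed, elements can be selected and modified as usual.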
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.File;
import java.io.IOException;
public class JsoupLargeDocumentExample {
    public static void main(String[] args) {
        try {
            // Parse a large HTML file
            File inputFile = new File("path/to/large-file.html");
            Document document = Jsoup.parse(inputFile, "UTF-8");
            // Extract and modify a specific element
            Element element = document.selectFirst("div.content");
            if (element != null) {
                element.text("Updated content");
            }
            System.out.println(document.html());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Explanation: This example demonstrates how to parse an HTML file from disk and modify one element in it.
Output (assuming the file contains a div with the class content):
<!DOCTYPE html>
<html>
<head>
<title>Large Document</title>
</head>
<body>
<div class="content">Updated content</div>
</body>
</html>
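For a large file, printing the whole document to the console is rarely what is wanted; writing the result back to disk is more typical. The sketch below is one possible approach, using temporary files so that it runs as-is; the names JsoupSaveSketch and updateAndSave, and the choice to disable pretty-printing, are assumptions for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class JsoupSaveSketch {
    // Hypothetical helper: updates div.content in the input file and writes the result to output.
    // Returns false, and writes nothing, when the element is not found.
    static boolean updateAndSave(Path input, Path output, String newText) throws IOException {
        Document document = Jsoup.parse(input.toFile(), "UTF-8");
        Element element = document.selectFirst("div.content");
        if (element == null) {
            return false;
        }
        element.text(newText);
        // Skip re-indentation so the serialized output stays compact.
        document.outputSettings().prettyPrint(false);
        Files.write(output, document.outerHtml().getBytes(StandardCharsets.UTF_8));
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for a real large document.
        Path input = Files.createTempFile("jsoup-input", ".html");
        Path output = Files.createTempFile("jsoup-output", ".html");
        Files.write(input, "<div class='content'>Old</div>".getBytes(StandardCharsets.UTF_8));
        System.out.println(updateAndSave(input, output, "Updated content"));
        System.out.println(new String(Files.readAllBytes(output), StandardCharsets.UTF_8));
    }
}
```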
Conclusion
Jsoup is a versatile and powerful library that works with HTML in Java. This guide covered the basics of parsing HTML from a URL and a string, selecting elements, extracting attributes, modifying HTML, extracting data from tables, handling forms, and more complex examples like web scraping and working with large documents.
By leveraging Jsoup, you can simplify and enhance your HTML data extraction and manipulation tasks in Java applications. For more detailed information and advanced features, refer to the official Jsoup documentation.