Using Java JSoup Web Scraping to Extract Website Data on FreeBSD

Java is one of the most popular and widely used programming languages and has a large community, which makes it well suited for building highly scalable, reliable, multi-threaded data extraction solutions. In this article we walk through the main concepts of web scraping with Java using JSoup, one of the most popular libraries for extracting data from websites.

JSoup is a Java library designed to work with real-world HTML; it simplifies working with HTML and XML so developers can easily read and analyze data. JSoup can parse HTML from URLs, files, or strings, and it offers easy-to-use APIs for fetching URLs and for parsing, extracting, and manipulating data using the DOM API and CSS-style selectors.

It is an excellent library for simple web scraping because of its straightforward API and its ability to parse HTML the same way a browser does, which means you can use the CSS selectors you already know.

JSoup works much like a modern browser: it implements the WHATWG HTML5 specification and parses HTML into a DOM. On top of that, JSoup provides several capabilities that an ordinary browser does not:
  1. Clean user-submitted content against a safelist to prevent XSS attacks.
  2. Find and extract data using DOM traversal or CSS selectors.
  3. Scrape and parse HTML from a URL, file, or string.
  4. Output tidy HTML.
  5. Manipulate HTML elements, attributes, and text.
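
To make these points concrete, here is a minimal sketch that exercises a few of these capabilities: parsing HTML from a string, selecting elements with CSS selectors, manipulating an element, cleaning untrusted markup against a safelist, and printing tidy HTML. The HTML string and class name used here are made up for illustration and are not part of the project built later in this article.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupFeaturesDemo {
    public static void main(String[] args) {
        // Parse HTML from a string (a URL or file works the same way).
        String html = "<p>Hello <a href='https://example.com'>world</a>"
                + "<script>alert('x')</script></p>";
        Document doc = Jsoup.parse(html);

        // Find and extract data with a CSS selector.
        Element link = doc.selectFirst("a[href]");
        System.out.println("Text: " + link.text());
        System.out.println("Href: " + link.attr("href"));

        // Manipulate elements, attributes, and text.
        link.attr("rel", "nofollow").text("example");

        // Clean user-submitted content against a safelist (removes the <script> tag).
        System.out.println(Jsoup.clean(html, Safelist.basic()));

        // Output tidy, normalised HTML.
        System.out.println(doc.outerHtml());
    }
}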

In this article we will learn, step by step, how to extract data from a website using Java, JSoup, and Maven. This tutorial uses a FreeBSD server that already has Java and Maven installed.


1. What is Web Scraping

Web scraping is one of the simplest, most effective, and most useful ways to extract data from websites. Some websites contain very large amounts of valuable data, and this is where web scraping helps you read and analyze it.

Web scraping refers to the extraction of data from a website, whether on a large or small scale. The information is collected and then exported into a format that is easier for users to work with, be it a spreadsheet or an API.

By using web scraping we can obtain specific data such as images, tables, posts, or source code from a website's content. The data obtained can be used for various purposes such as data collection, research, and analysis.

A web scraper can obtain all the data on a website or blog. To get this data we need to provide the URL of the website we want to scrape. We recommend deciding in advance what type of data you want to scrape so that the process is fast and efficient.

For example, if we only want images from a website, we specify that we only need elements with the img tag. The scraper then extracts every img tag found on the page at the provided URL. Web scrapers load all the HTML code from a URL, and some advanced scrapers can even render CSS and JavaScript. The extracted data can be saved in an Excel, CSV, or JSON file.
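
As a rough illustration of that workflow, the sketch below fetches a page with JSoup, keeps only the img elements, and writes their attributes to a CSV file. The URL and output file name are placeholders for illustration and are not part of the project we build later.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.PrintWriter;

public class ImageListExample {
    public static void main(String[] args) throws IOException {
        // Placeholder URL; replace it with the page you want to scrape.
        Document doc = Jsoup.connect("https://example.com").get();

        // Keep only <img> elements and export their attributes to a CSV file.
        try (PrintWriter csv = new PrintWriter("images.csv")) {
            csv.println("alt,src");
            for (Element img : doc.select("img")) {
                csv.println(img.attr("alt") + "," + img.absUrl("src"));
            }
        }
    }
}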


2. Installing JSoup

To use JSoup on a FreeBSD server, make sure your server has Java and Maven installed, because this article uses Maven as the build system for JSoup. Read our previous article on how to install Maven and Java on FreeBSD.


The FreeBSD repository does not provide JSoup, so you can install it from GitHub. The commands below will guide you through installing JSoup on FreeBSD. Because Maven is already installed, we place the JSoup directory in the Maven directory "/usr/local/etc/maven-wrapper/instances.d".
root@ns7:~ # cd /usr/local/etc/maven-wrapper/instances.d
root@ns7:/usr/local/etc/maven-wrapper/instances.d # git clone https://github.com/jhy/jsoup.git
root@ns7:/usr/local/etc/maven-wrapper/instances.d # cd jsoup
root@ns7:/usr/local/etc/maven-wrapper/instances.d/jsoup #
The commands above clone the JSoup source from GitHub to your local FreeBSD server. Now we build and install JSoup.
root@ns7:/usr/local/etc/maven-wrapper/instances.d/jsoup # mvn install
or, to run the build with debug output:
root@ns7:/usr/local/etc/maven-wrapper/instances.d/jsoup # mvn install -X


3. Using JSoup for Web Scraping

Web scraping should always start with a human touch: before you scrape a website, you need to understand its HTML structure. Understanding that structure will give you an idea of how to traverse the HTML tags when you implement the scraper.
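
For example, the blog we scrape below wraps each post in a container with the class "p-10" that holds an h4 title, a link, and images. The following sketch parses a simplified, made-up stand-in for that markup and traverses it the same way the real scraper will; the exact HTML is an assumption for illustration only.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StructureDemo {
    public static void main(String[] args) {
        // Simplified, made-up stand-in for one blog post card on the target page.
        String html = "<div class='p-10'>"
                + "<a href='/blog/example-post'><h4>Example post</h4></a>"
                + "<img src='/images/header.png'>"
                + "</div>";

        Document doc = Jsoup.parse(html);

        // Traverse the card the same way the scraper below does.
        Element card = doc.selectFirst("div.p-10");
        System.out.println("TITLE: " + card.select("h4").text());
        System.out.println("LINK: " + card.select("a").attr("href"));
        System.out.println("IMAGE: " + card.selectFirst("img").attr("src"));
    }
}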

We will use Maven to build the web scraping project. To create a new Maven project, open a terminal (for example PuTTY, if you are connecting from Windows) and run the following command.

root@ns7:/usr/local/etc/maven-wrapper/instances.d # mvn archetype:generate -DgroupId=com.example.jsoupexample -DartifactId=jsoup-example -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------< org.apache.maven:standalone-pom >-------------------
[INFO] Building Maven Stub Project (No POM) 1
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] >>> archetype:3.2.1:generate (default-cli) > generate-sources @ standalone-pom >>>
[INFO]
[INFO] <<< archetype:3.2.1:generate (default-cli) < generate-sources @ standalone-pom <<<
[INFO]
[INFO]
[INFO] --- archetype:3.2.1:generate (default-cli) @ standalone-pom ---
[INFO] Generating project in Batch mode
[INFO] ----------------------------------------------------------------------------
[INFO] Using following parameters for creating project from Archetype: maven-archetype-quickstart:1.4
[INFO] ----------------------------------------------------------------------------
[INFO] Parameter: groupId, Value: com.example.jsoupexample
[INFO] Parameter: artifactId, Value: jsoup-example
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: package, Value: com.example.jsoupexample
[INFO] Parameter: packageInPathFormat, Value: com/example/jsoupexample
[INFO] Parameter: package, Value: com.example.jsoupexample
[INFO] Parameter: groupId, Value: com.example.jsoupexample
[INFO] Parameter: artifactId, Value: jsoup-example
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Project created from Archetype in dir: /usr/local/etc/maven-wrapper/instances.d/jsoup-example
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  5.997 s
[INFO] Finished at: 2023-12-30T16:42:32+07:00
[INFO] ------------------------------------------------------------------------
Edit the "/usr/local/etc/maven-wrapper/instances.d/jsoup-example/pom.xml" file, delete its entire contents, and replace it with the XML below.

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.example.jsoupexample</groupId>
  <artifactId>jsoup-example</artifactId>
  <version>1.0-SNAPSHOT</version>

  <name>jsoup-example</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.14.3</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.example.jsoupexample.App</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>  
                        <phase>package</phase>  
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>


</project>


Likewise, delete the entire contents of the "/usr/local/etc/maven-wrapper/instances.d/jsoup-example/src/main/java/com/example/jsoupexample/App.java" file and replace it with the Java code below.

package com.example.jsoupexample;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;


public class App {

    public static void main(String[] args) {
        // Scrape the first four pages of the ScrapingBee blog.
        for (int i = 1; i <= 4; ++i) {
            System.out.println("PAGE " + i);
            try {
                // Page 1 lives at /blog; later pages at /blog/page/<n>.
                String url = (i == 1) ? "https://www.scrapingbee.com/blog" : "https://www.scrapingbee.com/blog/page/" + i;

                // Fetch and parse the page, with a 5-second timeout.
                Document document = Jsoup.connect(url)
                                         .timeout(5000)
                                         .get();

                // Each blog post card on the page carries the CSS class "p-10".
                Elements blogs = document.getElementsByClass("p-10");
                for (Element blog : blogs) {
                    // The post title is inside an <h4> element.
                    String title = blog.select("h4").text();
                    System.out.println("TITLE: " + title);

                    // The first <a> element holds the link to the post.
                    String link = blog.select("a").attr("href");
                    System.out.println("LINK: " + link);

                    // The first <img> is the header image; the author image URL contains "authors".
                    String headerImage = blog.selectFirst("img").attr("src");
                    System.out.println("HEADER IMAGE: " + headerImage);
                    String authorImage = blog.select("img[src*=authors]").attr("src");
                    System.out.println("AUTHOR IMAGE: " + authorImage);

                    System.out.println();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

To run the web scraper, open your terminal, navigate to the "/usr/local/etc/maven-wrapper/instances.d/jsoup-example" directory, and run the following command.
root@ns7:/usr/local/etc/maven-wrapper/instances.d # cd jsoup-example
root@ns7:/usr/local/etc/maven-wrapper/instances.d/jsoup-example # mvn compile && mvn package && mvn install




Run the web scraper from the assembled jar file.
root@ns7:/usr/local/etc/maven-wrapper/instances.d/jsoup-example # java -jar target/jsoup-example-1.0-SNAPSHOT-jar-with-dependencies.jar



You can see all the scripts used in this article in full on GitHub.

This example shows only a small part of what JSoup is capable of. JSoup is an excellent choice for web scraping in Java. While this article introduced the library, you can find out more about it in the JSoup documentation. Even though JSoup is easy to use and efficient, it has its drawbacks: it cannot run JavaScript code, which means it cannot be used to scrape dynamic web pages and single-page applications. In those cases, you will need to use something like Selenium.
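
For reference, here is a minimal sketch of that approach, assuming the selenium-java dependency and a matching ChromeDriver are installed (neither is part of this project's pom.xml): Selenium renders the page, and the rendered HTML is then handed to JSoup for parsing.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class DynamicPageExample {
    public static void main(String[] args) {
        // Assumes selenium-java is on the classpath and chromedriver is installed.
        WebDriver driver = new ChromeDriver();
        try {
            // Let the browser execute the page's JavaScript.
            driver.get("https://example.com");

            // Hand the rendered HTML to JSoup for parsing and extraction.
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println(doc.title());
        } finally {
            driver.quit();
        }
    }
}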
Iwan Setiawan
