Create a GoogleBot-Style Web Crawler with Node.js on FreeBSD

Web scrapers and search engines rely on web crawling to extract information from the web. Crawling is also how page URLs get discovered and quickly appear in search engines such as Google. As a result, web crawlers are becoming increasingly popular.

Creating a web spider project is easy with the right platform. Here you will learn how to build a JavaScript web crawler on Node.js, the most popular JavaScript runtime for this kind of work.

In this article we will create a web scraper with Node.js: we will demonstrate how to build a crawler that scrapes a website and stores the retrieved data, and we will look at how Node.js worker threads can carry the CPU-heavy parts of the job.

In this article we will cover the following points about the Node.js web crawler:
  1. Web crawling with a Node.js crawler.
  2. Why JavaScript makes Node.js well suited to web scraping.
  3. Web scraping with worker threads in Node.js.
  4. Implementing a basic crawler with Cheerio and Axios (a brief sketch follows this list).
  5. Best practices and techniques for improving a crawler.
  6. Browser automation and proxies for evading blocks.
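
As a first taste of point 4, below is a minimal sketch of a single-page crawl with Axios and Cheerio. It is not part of the FreeBSD project used later in this article; the target URL is only a placeholder, and both packages would first need to be installed with "npm install axios cheerio".

// basic-crawler.js - minimal sketch: fetch one page and list the links it contains.
// Assumes axios and cheerio are installed (npm install axios cheerio);
// the URL below is only a placeholder.
const axios = require("axios");
const cheerio = require("cheerio");

async function crawlPage(url) {
  // Download the raw HTML of the page
  const { data: html } = await axios.get(url);

  // Load the HTML into Cheerio for jQuery-style traversal
  const $ = cheerio.load(html);

  // Collect the href attribute of every anchor tag
  const links = [];
  $("a[href]").each((_, element) => {
    links.push($(element).attr("href"));
  });
  return links;
}

crawlPage("https://example.com")
  .then((links) => console.log(links))
  .catch((err) => console.error(err.message));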


1. Web Crawling in Node.js

Apart from indexing the World Wide Web, crawling can also collect data from websites. This is known as web scraping. Web scraping can be very taxing on the computer's CPU, depending on the structure of the site and the complexity of the data being extracted. In Node.js you can use worker threads to spread out the CPU-intensive operations that web scraping requires.
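
As an illustration, the following is a minimal sketch, not code from the crawler project used later, of handing a CPU-heavy processing step to a worker thread with Node's built-in worker_threads module; the "work" done in the worker is just a stand-in.

// worker-demo.js - sketch of offloading CPU-heavy work to a worker thread.
// Uses only the built-in worker_threads module; the processing step is a stand-in
// for an expensive parsing or data-extraction job.
const { Worker, isMainThread, parentPort, workerData } = require("worker_threads");

if (isMainThread) {
  // Main thread: spawn a worker and hand it some raw HTML (a placeholder string here)
  const worker = new Worker(__filename, {
    workerData: "<html>...placeholder page...</html>",
  });
  worker.on("message", (result) => console.log("Worker finished:", result));
  worker.on("error", (err) => console.error("Worker failed:", err));
} else {
  // Worker thread: do the heavy processing without blocking the main event loop
  const htmlLength = workerData.length; // stand-in for real parsing work
  parentPort.postMessage({ htmlLength });
}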

Thanks to its JavaScript support, Node.js is widely used because it is a lightweight, high-performance, efficient and easy-to-configure platform.

The event loop in JavaScript is a critical component that allows Node.js to perform large-scale web crawling tasks efficiently. JavaScript executes on a single thread and does one thing at a time, unlike languages such as C or C++, which commonly use multiple threads to do many things in parallel.

The key point about JavaScript is that it cannot run code in parallel within a single thread, but it can run many operations concurrently. That single thread, on its own, cannot process every web crawling job at the same time.

For web scraping tasks in JavaScript and Node.js, this means Node.js can start a task (such as making an HTTP request) and continue executing other code without waiting for that task to complete. This non-blocking nature allows Node.js to manage multiple operations at once efficiently.

In carrying out web crawling tasks, Node.js uses an event-based architecture. When an asynchronous operation, such as a network request or a file read, completes, it triggers an event. The event loop listens for these events and dispatches the corresponding callback function to handle them. This event-based model ensures that Node.js can manage multiple tasks simultaneously without blocking.
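
To make the idea concrete, here is a small sketch that starts several HTTP requests at once on a single thread; the URLs are placeholders, and the built-in global fetch assumed here requires Node.js 18 or newer.

// concurrent-fetch.js - sketch of non-blocking requests on a single thread.
// Assumes Node.js 18+ for the built-in global fetch; the URLs are placeholders.
const urls = [
  "https://example.com/",
  "https://example.org/",
  "https://example.net/",
];

async function fetchAll() {
  // All three requests are started immediately; the event loop resumes each
  // one as its response arrives, so no request blocks the others.
  const pages = await Promise.all(
    urls.map(async (url) => {
      const response = await fetch(url);
      const body = await response.text();
      return { url, bytes: body.length };
    })
  );
  console.log(pages);
}

fetchAll().catch(console.error);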


2. Installing NodeJS on FreeBSD

Node.js does not require any fancy or expensive hardware to run; most computers of this era can run Node.js efficiently. Even the smallest computers, such as the BeagleBone or Arduino YÚN, can run Node.js. However, a lot still depends on how much memory the other software running on the same system consumes.

Each operating system has its own installation method, and the core configuration files differ between platforms. However, the Node.js maintainers have been careful to provide the necessary packages for each operating system.

On FreeBSD, Node.js is normally used together with the Node Package Manager (NPM), so you need to install both Node and NPM. The following is a guide to installing them on FreeBSD. Start with the dependencies:
root@ns7:~ # pkg install brotli && pkg install icu && pkg install libuv
After the dependencies are installed, build Node.js from the ports tree:
root@ns7:~ # cd /usr/ports/www/node
root@ns7:/usr/ports/www/node # make install clean
The last package you have to install is the Node Package Manager (NPM) alongside Node.js. NPM is the open source package manager and registry for Node.js packages. To install NPM on FreeBSD, use the following commands:
root@ns7:~ # cd /usr/ports/www/npm-node18
root@ns7:/usr/ports/www/npm-node18 # make install clean
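Once both ports have been built and installed, you can confirm that the runtime and the package manager are available (the version numbers printed will depend on the ports you built):
root@ns7:~ # node -v
root@ns7:~ # npm -v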
You can read a complete explanation of how to install NodeJS in the previous article.



3. Web Crawler NodeJS

This CLI application crawls a base URL, which the user provides, as well as all of its subpages. The Node.js web crawler then divides the links it finds into internal and external links. Internal links point to other pages on the same domain; external links point to pages on other domains. For each link, it counts the number of occurrences and displays the final result in descending order. The output is split into internal and external links, printed to the terminal, and also saved to a .csv file.
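
The repository's package.json (shown further below) lists jsdom and json2csv among its dependencies. The following is a simplified sketch, not the project's actual code, of how links on a page can be resolved, classified as internal or external, and counted with jsdom; the sample HTML and base URL are only illustrations.

// link-report.js - simplified sketch of internal/external link counting,
// not the project's actual implementation. Assumes jsdom is installed.
const { JSDOM } = require("jsdom");

function countLinks(html, baseURL) {
  const dom = new JSDOM(html);
  const base = new URL(baseURL);
  const counts = { internal: {}, external: {} };

  for (const anchor of dom.window.document.querySelectorAll("a[href]")) {
    let resolved;
    try {
      // Resolve relative hrefs against the base URL; skip malformed ones
      resolved = new URL(anchor.getAttribute("href"), baseURL);
    } catch {
      continue;
    }
    // Same hostname as the base URL means internal, otherwise external
    const bucket = resolved.hostname === base.hostname ? "internal" : "external";
    counts[bucket][resolved.href] = (counts[bucket][resolved.href] || 0) + 1;
  }
  return counts;
}

// Tiny inline example; the real crawler fetches each page over HTTP instead
const sample = '<a href="/about">About</a><a href="https://example.org/">Ext</a>';
console.log(countLinks(sample, "https://www.unixwinbsd.site"));

The project itself then sorts counts like these in descending order and writes them to the .csv report, presumably via its json2csv dependency.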

To get started with the Node.js web crawler, we have published the script on GitHub. Clone the repository to your computer and change into the cloned directory:
root@ns7:~ # cd /var
root@ns7:/var # git clone https://github.com/unixwinbsd/WebCrawler-NodeJS-FreeBSD.git
root@ns7:/var # cd WebCrawler-NodeJS-FreeBSD
root@ns7:/var/WebCrawler-NodeJS-FreeBSD #
Use the "ci" command to start the installation.
root@ns7:/var/WebCrawler-NodeJS-FreeBSD # npm ci
Use the "init" command to initialize the application.
root@ns7:/var/WebCrawler-NodeJS-FreeBSD # npm init
This utility will walk you through creating a package.json file.
It only covers the most common items, and tries to guess sensible defaults.

See `npm help init` for definitive documentation on these fields
and exactly what they do.

Use `npm install <pkg>` afterwards to install a package and
save it as a dependency in the package.json file.

Press ^C at any time to quit.
package name: (web-crawler_js)
version: (1.0.0)
keywords:
author:
license: (ISC)
About to write to /var/WebCrawler-NodeJS-FreeBSD/package.json:

{
  "name": "web-crawler_js",
  "version": "1.0.0",
  "description": "You must have Node.js installed on your computer. This project was developed using Node.js v18.7.0",
  "main": "index.js",
  "scripts": {
    "test": "jest",
    "start": "node index.js"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "jest": "^29.3.1"
  },
  "repository": {
    "type": "git",
    "url": "git+https://github.com/Sunkio/web-crawler_js.git"
  },
  "bugs": {
    "url": "https://github.com/Sunkio/web-crawler_js/issues"
  },
  "homepage": "https://github.com/Sunkio/web-crawler_js#readme",
  "dependencies": {
    "jsdom": "^21.1.2",
    "json2csv": "^6.0.0-alpha.2"
  }
}


Is this OK? (yes) yes
In the "npm init" command, there are several options that you have to fill in, just press enter, to speed up the process. Finally, type "yes" and continue with the enter key to end.

Run the following commands to install the remaining dependencies.
root@ns7:/var/WebCrawler-NodeJS-FreeBSD # npm install jest --save-dev
root@ns7:/var/WebCrawler-NodeJS-FreeBSD # npm install jsdom
root@ns7:/var/WebCrawler-NodeJS-FreeBSD # npm install json2csv
Run the “npm start” command to start the NodeJS Web crawler.
root@ns7:/var/WebCrawler-NodeJS-FreeBSD # npm start

> web-crawler_js@1.0.0 start
> node index.js

Which URL do you want to crawl?https://www.unixwinbsd.site
Then just follow the instructions. The program will ask for the URL that you want to crawl; enter it at the prompt. The program will then crawl that URL and all of its subpages. The results are printed to the terminal and exported to a .csv file, which is saved in the reports directory.



In this tutorial, we have learned how to create a web crawler that extracts link data from a website and saves the results to a report file. Crawl regularly so that your website's page URLs are indexed quickly and appear in search engines.
Iwan Setiawan
