Create a Googlebot-Style Web Crawler With Colly on FreeBSD

Web crawling is a technique for examining, parsing, and extracting data that may otherwise be difficult to access on a blog or website. Crawling is done systematically, starting from a “seed” URL and recursively visiting the links the crawler finds on each page.

The Colly library is a Go package that can be used to create web scrapers and web crawlers. It is built on Go's net/http package (for network communication) and goquery (which lets you target HTML elements with “jQuery-like” syntax).

With Colly you can easily extract structured data from a blog or web app. You can use this data for various applications, such as data mining, data processing, or data archiving.
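
As a quick illustration of what extracting structured data looks like, the minimal sketch below collects the text of every h2 heading on a page into a slice. The selector and the target URL are only examples; adjust them to the markup of your own blog.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Slice to hold the headings found on the page
	var titles []string

	c := colly.NewCollector()

	// The "h2" selector is only an example; match it to your own markup
	c.OnHTML("h2", func(e *colly.HTMLElement) {
		titles = append(titles, e.Text)
	})

	// Replace the URL with the page you want to scrape
	if err := c.Visit("https://www.unixwinbsd.site/"); err != nil {
		fmt.Println(err)
	}

	fmt.Println(titles)
}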

Main Features of Colly
* Clean API
* Google App Engine support
* Fast (capable of performing >1k requests/second on a single core)
* Manages request delays and maximum concurrency per domain
* Sync/async/parallel scraping
* Distributed scraping
* Can cache data
* Automatic encoding of non-unicode responses
* Robots.txt support
* Automatic cookie and session handling
* Configuration via environment variables
* Extensions
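
Several of these features can be enabled with just a few lines of code. The sketch below is only an example of how asynchronous requests, response caching, per-domain limits, and the extensions package might be combined; the cache directory and target URL are assumptions.

package main

import (
	"time"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	c := colly.NewCollector(
		// Turn on asynchronous requests
		colly.Async(true),
		// Cache responses on disk; the path is only an example
		colly.CacheDir("./colly_cache"),
	)

	// Limit concurrency and add a delay per matching domain
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       1 * time.Second,
	})

	// Extension that sets a random User-Agent header on each request
	extensions.RandomUserAgent(c)

	c.Visit("https://www.unixwinbsd.site/")
	// Wait for the asynchronous requests to finish
	c.Wait()
}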

In this article, we will use the Go programming language, a popular language developed at Google, together with the Colly package. Creating a crawler with Go and Colly is easy. I'll show you how to create a basic program that crawls the links across a website and reports their HTTP status codes, and I'll also share some tips that will help you build a more professional crawler.


1. Installing Colly

Because Colly is a Go library, the first step is to install Go. To speed up the installation, you can use the FreeBSD pkg package manager.
root@ns7:~ # pkg install go
Now it is time to install Colly. To get the complete Colly library, we use the FreeBSD ports system.
root@ns7:~ # cd /usr/ports/www/colly
root@ns7:/usr/ports/www/colly # make install clean
We will create a new directory to store all the projects we will work on with Colly.
root@ns7:/usr/ports/www/colly # mkdir -p /usr/local/etc/colly
root@ns7:/usr/ports/www/colly # cd /usr/local/etc/colly
To create a new scraper, we can use the command below.
root@ns7:/usr/local/etc/colly # colly new crawlingblogsite.go
root@ns7:/usr/local/etc/colly # ls
crawlingblogsite.go
You have successfully created a scraper file named "/usr/local/etc/colly/crawlingblogsite.go"; its contents are shown below.

package main

import (
        "log"

        "github.com/gocolly/colly/v2"
)

func main() {
        // Create a new collector with default settings
        c := colly.NewCollector()

        // Start crawling from the seed URL
        c.Visit("https://yourdomain.com/")
}

Change the domain name to your own domain, for example "https://www.unixwinbsd.site".
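
Note that this generated template imports "github.com/gocolly/colly/v2". If you want to run it as-is with Go modules, you could use something like the commands below; the module name is just a placeholder.
root@ns7:/usr/local/etc/colly # go mod init crawlingblogsite
root@ns7:/usr/local/etc/colly # go get github.com/gocolly/colly/v2
root@ns7:/usr/local/etc/colly # go run crawlingblogsite.go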


2. Colly Web Crawler

Colly's main entity is the Collector struct. The collector keeps track of every page queued to be visited, manages network communication, and is responsible for executing the attached callbacks while pages are being scraped.

Colly provides a number of optional callbacks. They are called at various stages of the crawl job, and we decide which ones to register based on our needs. Following is a list of the callbacks in the order in which they are called; a small sketch follows the list.
  1. OnRequest(f RequestCallback) - Called before a request.
  2. OnError(f ErrorCallback) - Called if an error occurred during the request.
  3. OnResponse(f ResponseCallback) - Called after a response is received.
  4. OnHTML(goquerySelector string, f HTMLCallback) - Called right after OnResponse if the received content is HTML.
  5. OnXML(xpathQuery string, f XMLCallback) - Called right after OnHTML if the received content is HTML or XML.
  6. OnScraped(f ScrapedCallback) - Called after the OnXML callbacks.
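
Here is a minimal sketch that registers several of these callbacks on a single collector, so you can see the order in which they fire; the target URL is only an example.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// 1. Called before the request is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("OnRequest:", r.URL)
	})

	// 2. Called only if the request fails
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("OnError:", err)
	})

	// 3. Called after the response is received
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("OnResponse:", r.StatusCode)
	})

	// 4. Called for every element matching the selector, after OnResponse
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("OnHTML:", e.Text)
	})

	// 5. Called after all other callbacks have finished
	c.OnScraped(func(r *colly.Response) {
		fmt.Println("OnScraped:", r.Request.URL)
	})

	c.Visit("https://www.unixwinbsd.site/")
}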
Now let's start creating the full web crawler with Colly. The first step is to edit the file "/usr/local/etc/colly/crawlingblogsite.go": delete all of its contents and replace them with the script below.

package main

import (
	"flag"
	"fmt"
	"net/url"
	"os"
	"regexp"
	"time"

	"github.com/gocolly/colly"
	_ "go.uber.org/automaxprocs"
)

var (
	domain, header            string
	parallelism, delay, sleep int
	daemon                    bool
)

func init() {
	flag.StringVar(&domain, "domain", "", "Set url for crawling. Example: https://example.com")
	flag.IntVar(&parallelism, "parallelism", 2, "Parallelism is the number of the maximum allowed concurrent requests of the matching domains")
	flag.IntVar(&delay, "delay", 1, "Delay is the duration to wait before creating a new request to the matching domains")
	flag.BoolVar(&daemon, "daemon", false, "Run crawler on daemon mode")
	flag.IntVar(&sleep, "sleep", 60, "Time in seconds to wait before run crawler again")
	flag.StringVar(&header, "header", "", "Set header for crawler request. Example: header_name:header_value")
}

func crawler() {
	u, err := url.Parse(domain)

	if err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}

	// Instantiate default collector
	c := colly.NewCollector(
		// Turn on asynchronous requests
		colly.Async(true),
		// Visit only domain
		colly.AllowedDomains(u.Host),
	)

	// Limit the number of threads
	c.Limit(&colly.LimitRule{
		DomainGlob:  u.Host,
		Parallelism: parallelism,
		Delay:       time.Duration(delay) * time.Second,
	})

	// On every a element which has href attribute call callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		// Visit link found on page
		// Only those links are visited which are in AllowedDomains
		c.Visit(e.Request.AbsoluteURL(link))
	})

	if len(header) > 0 {
		c.OnRequest(func(r *colly.Request) {
			reg := regexp.MustCompile(`(.*):(.*)`)
			headerName := reg.ReplaceAllString(header, "${1}")
			headerValue := reg.ReplaceAllString(header, "${2}")
			r.Headers.Set(headerName, headerValue)
		})
	}

	c.OnResponse(func(r *colly.Response) {
		fmt.Println(r.Request.URL, "\t", r.StatusCode)
	})

	c.OnError(func(r *colly.Response, err error) {
		fmt.Println(r.Request.URL, "\t", r.StatusCode, "\nError:", err)
	})

	fmt.Print("Started crawler\n")
	// Start scraping
	c.Visit(domain)
	// Wait until threads are finished
	c.Wait()
}

func main() {
	flag.Parse()

	if len(domain) == 0 {
		fmt.Fprintf(os.Stderr, "Flag -domain required\n")
		os.Exit(1)
	}

	if daemon {
		for {
			crawler()
			fmt.Printf("Sleep %v seconds before run crawler again\n", sleep)
			time.Sleep(time.Duration(sleep) * time.Second)
		}
	} else {
		crawler()
	}
}

Continue with the following commands.
root@ns7:/usr/local/etc/colly # go mod init colly
root@ns7:/usr/local/etc/colly # go get go.uber.org/automaxprocs
root@ns7:/usr/local/etc/colly # go get github.com/gocolly/colly
To run the web crawler, we can pass several flags:
-daemon: Run the crawler in daemon mode
-delay int: Duration in seconds to wait before creating a new request to the matching domains (default 1)
-domain string: Set the URL to crawl. Example: https://example.com
-header string: Set a header for the crawler request. Example: header_name:header_value
-parallelism int: The maximum number of allowed concurrent requests to the matching domains (default 2)
-sleep int: Time in seconds to wait before running the crawler again (default 60)

Example of use:
root@ns7:/usr/local/etc/colly # go run crawlingblogsite.go -domain https://www.unixwinbsd.site -header header_name:header_value -daemon




Alternatively, as a shortcut, you can clone the Colly web crawler repository from GitHub.
root@ns7:/usr/local/etc/colly # git clone https://github.com/unixwinbsd/CollyWebCrawling.git

Now you have your own web crawler, ready to use at any time. Crawl your blog or website regularly so that Googlebot can quickly index its URLs.