Web Crawler

JacobViti
4 min read · Nov 18, 2021


I just finished my first big PHP project. I decided to make it a Google clone named Sparks. It was a really fun project, and I learned a lot about PHP while building it. My favorite part is that I got to make a web crawler and learn how they work. Let's go through it piece by piece and see how my crawler works.

Follow Links

Here is the beginning of the crawler. We call followLinks and take in a $url. We create a new instance of DomDocumentParser; we will go over that class later. For now, we get all the <a> tags at the $url site. An <a> tag is anything on the site that is clickable and will reroute us to a different URL. For each link we get, we have to filter out the other things that come back as <a> tags, such as JavaScript functions that open something on the same page. We make sure we didn't already crawl the site before, and then we get more details about it. At the bottom, we recursively call this function with any new URL we find while crawling, so if we discover a new page, we get the information off that page as well.
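A minimal sketch of what followLinks does, assuming the structure described above; the visited list and the filtering details are simplifications, and relative-URL handling is left out:

```php
<?php
// Sketch of followLinks, not the exact code from the project
$alreadyCrawled = array();

function followLinks($url) {
    global $alreadyCrawled;

    $parser = new DomDocumentParser($url);
    $linkList = $parser->getLinks();
    $newUrls = array();

    foreach ($linkList as $link) {
        $href = $link->getAttribute("href");

        // Filter out javascript: handlers and in-page anchors that also come back as <a> tags
        if (strpos($href, "#") !== false || substr($href, 0, 11) == "javascript:") {
            continue;
        }

        // Only keep pages we haven't crawled before
        if (!in_array($href, $alreadyCrawled)) {
            $alreadyCrawled[] = $href;
            $newUrls[] = $href;
            getDetails($href); // collect title, description, keywords, images
        }
    }

    // Recursively crawl every new page found on this one
    foreach ($newUrls as $newUrl) {
        followLinks($newUrl);
    }
}
```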

Start of getDetails

This is how getDetails starts. Here we take in the $url and create a new instance of the DomDocumentParser class. We call a function named getTitletags from that class. We create a new instance so we can call the functions inside DomDocumentParser from getDetails as well as from followLinks.
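In code, the opening of getDetails looks something like this sketch:

```php
<?php
// Sketch of how getDetails begins: a fresh parser just for the page we're describing
function getDetails($url) {
    $parser = new DomDocumentParser($url);

    $titleArray = $parser->getTitletags();

    // ...filtering the title and reading the <meta> tags continues below
}
```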

DomDocumentParser

Here is how the DomDocumentParser class works. We first construct the class. If you remember, when we created a new instance of this class, we passed in the $url. We then set some options that tell web servers who is accessing their site and what they are doing. In our case, we use GET requests and identify the crawler as SparksBot. We then create a stream context from those options and load the HTML from the URL to get its contents.
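A rough sketch of that constructor; the exact User-Agent string is an assumption, but the shape is the same: build the options, create a context, and load the HTML.

```php
<?php
class DomDocumentParser {
    private $doc;

    public function __construct($url) {
        // Tell web servers who is visiting (SparksBot) and that we only send GET requests
        $options = array(
            'http' => array(
                'method' => "GET",
                'header' => "User-Agent: SparksBot/0.1\n"
            )
        );
        $context = stream_context_create($options);

        // Fetch the page through that context and parse its HTML;
        // the @ hides parser warnings about badly formatted markup
        $this->doc = new DomDocument();
        @$this->doc->loadHTML(file_get_contents($url, false, $context));
    }
}
```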

Here’s your fun fact about loadHTML(). It’s a really forgiving function: the HTML it takes in doesn’t even have to be formatted correctly, and it will still parse it into a document you can work with. The catch is that calling it statically was deprecated, and in PHP 8.0.0, which is what I built this app on, a static call throws an Error exception instead of just a warning. So theoretically, the crawler could keep gathering information indefinitely, but for me, that error would stop it after a short time. I didn’t know about the deprecation until after the project.
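A quick illustration of both points: loadHTML() happily parses broken markup when called on an instance, but a static call is what PHP 8.0.0 turns into an Error.

```php
<?php
$doc = new DomDocument();

// Instance call: works in every PHP version, even with badly formatted HTML
// (the @ just hides the parser warnings about the broken markup)
@$doc->loadHTML("<p>an unclosed paragraph");

// Static call: deprecated for years, and as of PHP 8.0.0 it throws an Error
// DomDocument::loadHTML("<p>an unclosed paragraph");
```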

The rest of the functions in DomDocumentParser get certain tags from the HTML of the sites the crawler visited. I use these tags either to jump to other sites or to display information about them.
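They all follow the same pattern; inside the class from the sketch above, they would look roughly like this:

```php
// Inside DomDocumentParser, continuing the sketch above
public function getLinks() {
    return $this->doc->getElementsByTagName("a");      // every <a> tag on the page
}

public function getTitletags() {
    return $this->doc->getElementsByTagName("title");  // the page's <title>
}

public function getMetatags() {
    return $this->doc->getElementsByTagName("meta");   // description, keywords, etc.
}

public function getImages() {
    return $this->doc->getElementsByTagName("img");    // pictures to index
}
```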

getDetails from right after $titleArray = $parser->getTitletags();

After getting the title tags, we have to filter out the bad results. We mainly look for pages whose title is empty; if there isn’t anything there, we end the function and move on to the next URL. If there is, then we need to get the <meta> tags. <meta> tags hold information about the page, such as a description of the site or keywords for Google to find it. We then get that information and put it into variables for us to use later.
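Sketched out, that middle part of getDetails could look like this (the exact cleanup on the title is an assumption):

```php
// Skip pages that have no usable <title>
if (sizeof($titleArray) == 0 || $titleArray->item(0) == NULL) {
    return;
}
$title = trim(preg_replace("/\s+/", " ", $titleArray->item(0)->nodeValue));

// Pull the description and keywords out of the <meta> tags
$description = "";
$keywords = "";

foreach ($parser->getMetatags() as $meta) {
    if ($meta->getAttribute("name") == "description") {
        $description = $meta->getAttribute("content");
    }
    if ($meta->getAttribute("name") == "keywords") {
        $keywords = $meta->getAttribute("content");
    }
}
```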

Code to the end of getDetails

We have to check if the $url is in our database. We use a database to store all the URLs, the information about each site, and the images for our search engine. If the URL is already there, we skip the insert. If not, we call insertLink, another function that puts it in the database.
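A sketch of that last step, assuming a PDO connection in $con and a sites table with url, title, description, and keywords columns (the table and column names are assumptions):

```php
<?php
// Returns true if we already stored this url
function linkExists($url) {
    global $con;
    $query = $con->prepare("SELECT url FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);
    $query->execute();
    return $query->rowCount() != 0;
}

// Stores a newly crawled page
function insertLink($url, $title, $description, $keywords) {
    global $con;
    $query = $con->prepare("INSERT INTO sites(url, title, description, keywords)
                            VALUES(:url, :title, :description, :keywords)");
    $query->bindParam(":url", $url);
    $query->bindParam(":title", $title);
    $query->bindParam(":description", $description);
    $query->bindParam(":keywords", $keywords);
    return $query->execute();
}

// At the end of getDetails: only insert urls we have not seen before
if (!linkExists($url)) {
    insertLink($url, $title, $description, $keywords);
}
```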

We also do the same with images as we did with links. We look for any <img> tag on a site and collect the picture along with its data. We then insert all of that into the images part of the database.
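Sketched the same way, with an insertImage helper that mirrors insertLink (the helper and the data it stores are assumptions):

```php
// Collect every <img> on the page and store it alongside the page it came from
$imageArray = $parser->getImages();

foreach ($imageArray as $image) {
    $src      = $image->getAttribute("src");
    $alt      = $image->getAttribute("alt");
    $imgTitle = $image->getAttribute("title");

    if (!$src) {
        continue; // nothing to store without a source
    }

    insertImage($url, $src, $alt, $imgTitle); // hypothetical helper, works like insertLink
}
```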

Picture of Site

Now that we have that information in the database, we can make calls to the database and display information to the user based on their input. This was a really fun project. I hope to get this project online so you can take a closer look at it. I learned a lot about how PHP works. I already have another website clone that I want to do; however, whether that is my next project or something I put on the back burner for now has yet to be decided. I hope you have a good day. Check out my GitHub to see more of my projects: https://github.com/JakeKViti, and follow my Twitter, where I plan to post more about what I am coding: https://twitter.com/JakeKViti. Thank you, and have a great day!
