I just finished my first big PHP project: a Google clone named Sparks. It was a really fun project, and I learned a lot about PHP from it. My favorite part was building a web crawler and learning how crawlers work. Let’s go line by line and see how my crawler works.
This is how getDetails starts. Here we take in the $url and create a new instance of a class called DomDocumentParser. We then call a function named getTitletags from that class. We create the instance so that we can call the functions inside DomDocumentParser from getDetails as well as from followLinks.
Here is how the DomDocumentParser class works. We first construct the class. If you remember, when we created a new instance of this class, we passed in the $url. We then set some options that tell web servers who is accessing their site and what they are doing. In our case, we send GET requests and identify the sender as SparksBot. We then build a context from those options, fetch the contents of the URL, and load the resulting HTML.
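To make the constructor concrete, here is a minimal sketch of what such a class could look like. It is a reconstruction, not the exact code from Sparks: the user-agent string `SparksBot/0.1`, the error handling, and the property name are my assumptions.

```php
<?php
// Sketch of a DomDocumentParser-style class (details are assumptions).
class DomDocumentParser
{
    private DOMDocument $doc;

    public function __construct(string $url)
    {
        // Tell the server which method we use and who is asking.
        // The user-agent value is a made-up example.
        $options = [
            'http' => [
                'method' => 'GET',
                'header' => "User-Agent: SparksBot/0.1\r\n",
            ],
        ];
        $context = stream_context_create($options);

        // Fetch the page body using the context built above.
        $html = file_get_contents($url, false, $context);
        if ($html === false || trim($html) === '') {
            throw new RuntimeException("Could not load $url");
        }

        $this->doc = new DOMDocument();
        // Real pages are rarely valid HTML, so collect parser
        // warnings internally instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $this->doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    // Returns every <title> element found in the document.
    public function getTitletags(): DOMNodeList
    {
        return $this->doc->getElementsByTagName('title');
    }
}
```

With this in place, `new DomDocumentParser('https://example.com')` would fetch that page and `getTitletags()` would return its title elements.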
Here’s your fun fact about loadHTML(). It’s a really good function that takes in HTML, and that HTML doesn’t even have to be formatted correctly. It parses the markup into a document that we can then search for tags. In PHP 8.0.0, which is what I built this app on, calling the method statically throws an Error; before that, a static call only raised a deprecation notice. So theoretically, the crawler could gather information indefinitely, but for me, an error would stop it after a short time. I assumed this change was the reason, and I didn’t know about it until after the project, but since it only affects static calls, the cause may have been something else.
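A small experiment shows how forgiving loadHTML() is. The markup below is deliberately broken (unclosed tags, no closing `</html>`), and the variable names are only for illustration. The sketch also tries a static call, which is the case PHP 8 rejects with an Error.

```php
<?php
// Deliberately malformed HTML: several tags are never closed.
$messy = '<html><head><title>Sparks</title><body><p>Unclosed paragraph<div>Hi</div>';

$doc = new DOMDocument();

// Keep parser warnings out of the output while we load the page.
$previous = libxml_use_internal_errors(true);
$loaded = $doc->loadHTML($messy);   // instance call: the supported way
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser repaired the structure well enough to query it.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, PHP_EOL;

// Calling the method statically is what PHP 8 turns into an Error.
$staticCallFailed = false;
try {
    $previous = libxml_use_internal_errors(true);
    DOMDocument::loadHTML($messy);
} catch (Error $e) {
    $staticCallFailed = true;
} finally {
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
}
```

On PHP 8, `$staticCallFailed` ends up true while the instance call above it still works.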
The rest of the functions in DomDocumentParser get certain tags from the HTML of the sites the crawler visited. I use these tags either to jump to the linked sites or to display information about those sites.
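As an illustration of the “jump to those sites” part, here is a sketch of pulling link targets out of a page. The function name `extractLinkTargets` and the rules for which links to skip are my assumptions, not necessarily what Sparks does.

```php
<?php
// Collects the href of every usable <a> tag in a page (hypothetical helper).
function extractLinkTargets(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $targets = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));

        // Skip empty links, in-page anchors, and javascript: pseudo-links,
        // since none of them lead to a page we can crawl.
        if ($href === '' || $href[0] === '#' || stripos($href, 'javascript:') === 0) {
            continue;
        }
        $targets[] = $href;
    }

    // Drop duplicates so each target is visited only once.
    return array_values(array_unique($targets));
}
```

A real crawler would also need to turn relative links such as `/about` into full URLs before following them; that step is left out here.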
After getting the title tags, we have to filter out the bad results. We mainly look for pages whose title is missing or empty. If there is no title, we end the function and move on to the next URL. If there is one, then we need to get the <meta> tags. <meta> tags hold information about the page, such as a description of the site or keywords that help Google find it. We then put that information into variables to use later.
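The filtering described above could be sketched like this. The function name `extractPageDetails` and the choice to return null for a page without a usable title are assumptions made for the example.

```php
<?php
// Returns title, description and keywords for a page,
// or null when the page has no usable title (hypothetical helper).
function extractPageDetails(string $html): ?array
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // No <title> at all, or an empty one: treat the page as a bad result.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }
    $title = trim($titles->item(0)->textContent);
    if ($title === '') {
        return null;
    }

    // Pick the description and keywords out of the <meta> tags.
    $description = '';
    $keywords = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description') {
            $description = trim($meta->getAttribute('content'));
        } elseif ($name === 'keywords') {
            $keywords = trim($meta->getAttribute('content'));
        }
    }

    return [
        'title' => $title,
        'description' => $description,
        'keywords' => $keywords,
    ];
}
```

The caller can simply skip to the next URL whenever the function returns null.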
Next, we have to check whether the $url is already in our database. We use a database to store all the URLs, the information about each site, and the images for our search engine. If the URL is already there, we skip the insert. If not, we call insertLink, a separate function that puts it in the database.
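Here is a sketch of the check-then-insert step using PDO. To keep it runnable without a database server it uses an in-memory SQLite database, which is not necessarily what Sparks uses; the table name `sites`, its columns, and the helpers `linkExists` and `saveIfNew` are also assumptions. Only insertLink is a name from the project, and its signature here is a guess.

```php
<?php
// True when the URL is already stored (hypothetical helper).
function linkExists(PDO $con, string $url): bool
{
    $query = $con->prepare('SELECT 1 FROM sites WHERE url = :url LIMIT 1');
    $query->execute([':url' => $url]);
    return $query->fetchColumn() !== false;
}

// Stores one crawled page; the parameter list is an assumption.
function insertLink(PDO $con, string $url, string $title, string $description, string $keywords): bool
{
    $query = $con->prepare(
        'INSERT INTO sites (url, title, description, keywords)
         VALUES (:url, :title, :description, :keywords)'
    );
    return $query->execute([
        ':url' => $url,
        ':title' => $title,
        ':description' => $description,
        ':keywords' => $keywords,
    ]);
}

// Inserts the page only if its URL is new; returns whether it inserted.
function saveIfNew(PDO $con, string $url, string $title, string $description, string $keywords): bool
{
    if (linkExists($con, $url)) {
        return false;
    }
    return insertLink($con, $url, $title, $description, $keywords);
}

// In-memory SQLite stands in for the real database in this sketch.
$con = new PDO('sqlite::memory:');
$con->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$con->exec(
    'CREATE TABLE sites (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT NOT NULL UNIQUE,
        title TEXT,
        description TEXT,
        keywords TEXT
    )'
);
```

Prepared statements with named placeholders keep crawled text, which can contain anything, from being interpreted as SQL.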
We do the same with images as we did with links. We look for any <img> tags on a site, then collect each picture along with its data. We then insert all of this data into the database for images.
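The image step might look like the sketch below. The function name `extractImages` is hypothetical, and skipping images that have neither alt nor title text is my assumption: without some text, a search engine has nothing to match the picture against.

```php
<?php
// Collects src, alt and title for each describable <img> (hypothetical helper).
function extractImages(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        $alt = trim($img->getAttribute('alt'));
        $title = trim($img->getAttribute('title'));

        // Skip images with no source, or with no text to search on.
        if ($src === '' || ($alt === '' && $title === '')) {
            continue;
        }
        $images[] = ['src' => $src, 'alt' => $alt, 'title' => $title];
    }

    return $images;
}
```

Each returned row could then be written to the database in the same check-then-insert way as the links.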
Now that we have that information in the database, we can query it and display results to the user based on their input. This was a really fun project, and I learned a lot about how PHP works. I hope to get it online so you can take a closer look at it. I already have another website clone that I want to do; however, whether that is my next project or goes on the back burner for now has yet to be decided. Check my GitHub to see more of my projects: https://github.com/JakeKViti, and follow my Twitter, where I plan to post more about what I am coding: https://twitter.com/JakeKViti. Thank you, and have a great day!