Have you ever wondered what keeps people scrolling up and down on different social media platforms?
It is the news feed that keeps people engaged with these platforms. More specifically, when people see content that is relevant to them, they tend to spend more time on the platform. Businesses leverage this fact to improve user traffic and the rate of returning visitors.
I got the opportunity to build a curated news feed for CADashboard, India’s first cloud-based integrated platform for Professionals & SME clients. The task was to build a news feed filtered on specific keywords that aligned with the business objectives. Web scraping was the way to go for an in-house solution that could be easily deployed and customized to the requirements. You can have a look at the news feed here - https://www.cadashboard.com/news_corner
There are several packages available in Python for web scraping. I have personally used Beautiful Soup, a powerful and user-friendly package with exhaustive documentation. It lets us parse the HTML content of a page and access its elements by their tags and attributes. We also need the requests library to make HTTP requests and retrieve the HTML, which is then passed to Beautiful Soup for further processing.
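As a quick illustration (not the production code), here is a minimal sketch of how requests and Beautiful Soup fit together. The `<article>` and `<a>` selectors and the URL are assumptions for this example; the real selectors depend entirely on the target site's markup:

```python
import requests
from bs4 import BeautifulSoup

def extract_articles(html):
    """Parse an HTML page and pull out article titles and links.
    The 'article' and 'a' tags are illustrative placeholders."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for item in soup.find_all("article"):
        link = item.find("a")
        if link is not None:
            articles.append({
                "title": link.get_text(strip=True),
                "url": link.get("href"),
            })
    return articles

if __name__ == "__main__":
    # Fetch a page and parse it; the URL is a placeholder.
    response = requests.get("https://example.com/news")
    print(extract_articles(response.text))
```

The same pattern extends to thumbnails and publication dates - find the enclosing tag, then read the attribute you need.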
I used a MySQL database for storing the scraped news. Each news item comprises a title, a thumbnail, the source of the article, the date published, and the URL of the complete article. We also need a MySQL driver to communicate with the database from Python.
To install all of the above, run the following commands in your terminal -
pip install beautifulsoup4
pip install requests
pip install mysql-connector-python
Step 1 - Collecting the URL sources for scraping
The first step is to get a list of all the websites that provide news articles relevant to the business. You also need to check whether the data you are scraping is publicly available. Most websites have a robots.txt file that specifies which pages a crawler (user agent) is allowed to visit.
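One way to honour robots.txt from Python is the standard library's urllib.robotparser. A small sketch - the rules and URLs here are made up purely for illustration:

```python
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(robots_txt, url, agent="*"):
    """Check a robots.txt body to see whether `url` may be crawled
    by the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    # Against a live site you would instead call:
    #   parser.set_url("https://example.com/robots.txt"); parser.read()
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_to_fetch(rules, "https://example.com/news"))
```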
Step 2 - Writing the code
The next step is to write the code that scrapes the news articles and extracts the required attributes. I used regular expressions (RegEx) to filter the articles against the list of keywords.
The code is not shared due to confidentiality reasons.
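Since the actual code stays private, here is a generic, hypothetical sketch of what a RegEx-based keyword filter can look like. The keyword list below is purely illustrative, not the real business list:

```python
import re

# Illustrative keywords only - the real list is confidential.
KEYWORDS = ["GST", "income tax", "MCA"]

def matches_keywords(title, keywords=KEYWORDS):
    """Return True if the title mentions any keyword.
    Case-insensitive, whole-word matching via \\b boundaries,
    with re.escape so keywords are treated literally."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keywords)) + r")\b",
        re.IGNORECASE,
    )
    return bool(pattern.search(title))
```

Only articles whose title (or body, if you extend the check) matches a keyword are kept for storage.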
Step 3 - Storing the data
After we have collected the relevant news articles, it is time to store them in the database. The business application then fetches the data from the database and displays it on the webpage in a structured format.
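A minimal sketch of the storage step, assuming a hypothetical news_articles table whose columns mirror the feed format described earlier; the table name, column names, and connection details are all placeholders. Parameterized queries are used so scraped text cannot inject SQL:

```python
# Parameterized INSERT matching the feed format (title, thumbnail,
# source, date published, URL); names here are illustrative.
INSERT_SQL = (
    "INSERT INTO news_articles (title, thumbnail, source, published, url) "
    "VALUES (%s, %s, %s, %s, %s)"
)

def save_articles(connection, articles):
    """Insert scraped articles into MySQL in one batch, using
    parameterized queries rather than string formatting."""
    cursor = connection.cursor()
    rows = [
        (a["title"], a["thumbnail"], a["source"], a["published"], a["url"])
        for a in articles
    ]
    cursor.executemany(INSERT_SQL, rows)
    connection.commit()
    cursor.close()

if __name__ == "__main__":
    import mysql.connector  # pip install mysql-connector-python
    # Placeholder credentials - replace with your own.
    conn = mysql.connector.connect(
        host="localhost", user="feed_user",
        password="change-me", database="newsfeed",
    )
    save_articles(conn, [])
    conn.close()
```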
Putting it all together
Finally, we need to create a job that runs every day and fetches the latest news. On Windows, the script can be scheduled to run daily using Task Scheduler.
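For reference, a daily task can also be registered from the command line with schtasks; the task name, time, and paths below are placeholders:

```shell
# Run the scraper every day at 07:00; adjust name, time, and paths.
schtasks /Create /SC DAILY /TN "NewsFeedScraper" /TR "python C:\scripts\scrape_news.py" /ST 07:00
```

On Linux, a cron entry would serve the same purpose.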
Web scraping has been fun, and it has great potential for creating small utilities that can have a significant impact on your business. I hope you enjoyed this blog post on creating an automated news feed for your business!