Thursday, November 28, 2019

A Python-based web scraper for blogs written at blogspot.com

If you want to programmatically extract data from a website, what you'd be doing is known as web scraping.  Some websites, such as Facebook, provide an API for accessing their data; the API provided by Facebook is the Graph API.  Only a handful of websites provide such APIs.  For programmatically extracting data from all other websites, we have to resort to web scraping.
Let's look at this Python-based web scraper I have prepared for extracting data from blogs written at blogspot.com.

The web scraping we are going to do consists of two steps:
1. Send an HTTP (or HTTPS) request to the web server.  The web server responds by returning the HTML content of the URL.
2. Parse the HTML content that is received, so that we obtain the data we were looking for.
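In miniature, the two steps look like this (example.com is just a stand-in URL, not part of the program we will build):
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/")       # step 1: send the HTTP request
soup = BeautifulSoup(response.content, 'html5lib')    # step 2: parse the returned HTML
print soup.title.text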

Let us write a Python program for this purpose.  We will need a Linux system on which to write and run the program.  We will also need three external Python libraries, listed below.  Before we start writing our program, let us check whether those libraries are available on the system, and install them if not.
The three external Python libraries that we are going to use are:
1. requests
2. BeautifulSoup (installed as the beautifulsoup4 package, imported as bs4)
3. html5lib

Let us check whether these external Python libraries are available.  At the Linux command prompt, type python and hit Enter.  At the Python prompt, type import requests and press Enter.  Here is an example.
# python
Python 2.7.5 (default, Jun 11 2019, 14:33:56)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>>
If you do not see any error, that Python library is available.  If you see an ImportError, you need to install that library.  You can use pip to install the required Python libraries.

# pip install requests beautifulsoup4 html5lib
When we have the required external Python libraries available, it is time to write our Python program.

import requests
from bs4 import BeautifulSoup
import re

blog_url = "https://ysawant.blogspot.com/"
r_blog = requests.get(blog_url)          # step 1: fetch the page

blog_soup = BeautifulSoup(r_blog.content, 'html5lib')          # step 2: parse the HTML
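One optional check, not part of the original program: requests provides raise_for_status(), which makes the script abort early if the server returned an error instead of the page.
r_blog.raise_for_status()          # raises an exception on HTTP errors (404, 500, ...)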
Now let us obtain a list of the URLs mentioned in this webpage, along with a count of how many there are.
Links are marked up using the HTML anchor tag.  Here is an example from my blog.
<a class='timestamp-link' href='https://ysawant.blogspot.com/2019/10/how-to-disable-weak-arcfour-cipher-in.html' rel='bookmark' title='permanent link'><abbr class='published' title='2019-10-30T20:14:00+05:30'>8:14 PM</abbr></a>
print 'A list of _all_ links on this webpage :'
link_count = 0
for link in blog_soup.find_all('a'):
    href = link.get('href')
    if href is None:
        # This anchor has no href attribute, so skip it.
        continue
    if re.search("^http", href):
        print href
        link_count += 1
print '\nTotal', link_count, 'links found.\n'

You'd notice that some URLs are listed more than once.  A possible improvement to our code, then, is to remove the duplicate URLs: store all the URLs in a list, remove the duplicate entries, and then print the remaining elements.  I leave refining this as an exercise for the reader, but one possible sketch follows.
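In this sketch, seen and all_links are my own variable names; a set drops the duplicates while the list preserves the original order.
seen = set()          # hrefs we have already recorded
all_links = []        # unique hrefs, in order of first appearance
for link in blog_soup.find_all('a'):
    href = link.get('href')
    if href and href.startswith('http') and href not in seen:
        seen.add(href)
        all_links.append(href)
for href in all_links:
    print href
print '\nTotal', len(all_links), 'unique links found.\n'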

Next, let us obtain the dates on which articles were published in this blog.  Blogs written at blogspot.com carry this detail in the <h2 class='date-header'> HTML tag.  Here is an example from my blog.
<h2 class='date-header'><span>Wednesday, October 30, 2019</span></h2>

How did I get to know this?  By looking at the page source.  We, the programmers, have to decide exactly what data to grab from all the HTML content that is available, and for that we have to look closely at the HTML of the webpage.
print '\nArticles in this webpage were written on these dates :'
all_dates = blog_soup.find_all('h2', attrs = {'class':'date-header'})
for a_date in all_dates:
    print a_date.text
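If you need real date objects rather than plain text, datetime.strptime can parse these headers; this is a sketch that assumes every header follows the 'Wednesday, October 30, 2019' pattern shown above.
from datetime import datetime

for a_date in all_dates:
    # '%A, %B %d, %Y' matches e.g. 'Wednesday, October 30, 2019'
    parsed = datetime.strptime(a_date.text.strip(), '%A, %B %d, %Y')
    print parsed.date()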

Next, let us obtain the titles of the articles published in this blog.  In blogs written at blogspot.com, the titles of the articles are present in the <h3 class='post-title entry-title'> HTML tag.  Here is an example from my blog.
<h3 class='post-title entry-title'>
<a href='https://ysawant.blogspot.com/2019/10/how-to-disable-weak-arcfour-cipher-in.html'>How to disable the weak arcfour cipher in Linux</a>
</h3>

And how did I get to know this?  By looking at the page source.
print '\nTitles of the Articles in this webpage :'
all_titles = blog_soup.find_all('h3', attrs = {'class':'post-title entry-title'})
for a_title in all_titles:
    print a_title.text
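Since each title's h3 tag wraps an anchor tag, we can also pull the article URL alongside its title; here is a small sketch (the None check guards against a title that has no link).
for a_title in all_titles:
    anchor = a_title.find('a')
    if anchor is not None:
        print a_title.text.strip(), '->', anchor.get('href')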
I checked this Python program with a few of the blogs I know at blogspot.com:
blog_url = "http://sudhirdeore29.blogspot.com/"
blog_url = "http://bhadkamkar.blogspot.com/"
blog_url = "https://navinraomhatre.blogspot.com/"
blog_url = "https://pakhandkhandinee.blogspot.com/"

We have our own Python-based web scraper, albeit a simple one.  Please note that this web scraper works only with blogs written at blogspot.com.  For other websites, we'll have to write scrapers tailored to the HTML content of each website.


Let us look at another web scraper.  The wikinews.org website is full of news from around the world.  The Main Page of this website lists the latest news as short one-liners.  Here is a web scraper for obtaining those one-liner headlines.
import requests
from bs4 import BeautifulSoup
from datetime import date

wikinews_url = "https://en.wikinews.org/wiki/Main_Page"

r_wikinews = requests.get(wikinews_url)

wikinews_soup = BeautifulSoup(r_wikinews.content, 'html5lib')
# print(wikinews_soup.prettify())

today = date.today()
print "Latest news on", today
latest_news = wikinews_soup.find_all('div', attrs = {'class':'latest_news_text'})
for news in latest_news:
    if news.text:
        print(news.text)
For writing this web scraper, I looked at the page source of the Main Page at wikinews.org and identified the HTML content that needs to be fetched.
Our desired content is included in <div class="latest_news_text" id="MainPage_latest_news_text">
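Since that div also carries an id, an alternative (equivalent here) is to look it up by its id instead of its class; a small sketch:
latest_div = wikinews_soup.find('div', attrs = {'id':'MainPage_latest_news_text'})
if latest_div is not None:
    print(latest_div.text)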

You'd notice that the news at wikinews.org is not updated daily, and in my opinion it is not very useful either.  I find that the Main Page at wikipedia.org has some brief news items that are updated regularly.  So here is another Python program, this time to grab the news from the Main Page at wikipedia.org.
import requests
from bs4 import BeautifulSoup
from datetime import date
import re

wikipedia_url = "https://en.wikipedia.org/wiki/Main_Page"

r_wikipedia = requests.get(wikipedia_url)

wikipedia_soup = BeautifulSoup(r_wikipedia.content, 'html5lib')
# print(wikipedia_soup.prettify())

today = date.today()
print "In the news, ", today
in_the_news = wikipedia_soup.find_all('div', attrs = {'id':'mp-itn'})
count = 0
for news in in_the_news:
    for line in news.find_all('ul'):
        if count == 0:
            print(line.text)
            count = count + 1
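The count variable above imitates a "take only the first match" rule.  An equivalent, arguably simpler sketch uses find, which returns just the first matching tag:
itn_div = wikipedia_soup.find('div', attrs = {'id':'mp-itn'})
if itn_div is not None:
    first_list = itn_div.find('ul')
    if first_list is not None:
        print(first_list.text)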

I looked at the page source of the Main Page at wikipedia.org and found the HTML content that is useful in this case.

<div id="mp-itn" style="padding:0.1em 0.6em;"><div role="figure" class="itn-img" style="float: right; margin-left: 0.5em;">