Tuesday, May 12, 2020

Python script to find broken links in a web page

Today I was asked to find the broken links in a web page.  This could be done manually, by opening the page in a web browser and clicking each link, but that is slow and error-prone.  I wrote a Python script for the job instead.  Here it is.

import sys
import requests
from bs4 import BeautifulSoup
try:
    # Python 2
    from urlparse import urlparse, urljoin
except ImportError:
    # Python 3
    from urllib.parse import urlparse, urljoin


link_count = 0
searched_links = []
broken_links = []

def getLinksFromHTML(html):
    def getLink(el):
        return el["href"]
    return list(map(getLink, BeautifulSoup(html, features="html.parser").select("a[href]")))



def check_links(domainToSearch, URL, parentURL, depth):
    global link_count
    if depth == 2:
        # We do not want to search all links recursively.
        return
    # Skip links already checked, mail and JavaScript links, and images.
    if URL in searched_links or URL.startswith("mailto:") or "javascript:" in URL:
        return
    if URL.endswith((".png", ".jpg", ".jpeg")):
        return
    try:
        requestObj = requests.get(URL)
        searched_links.append(URL)
        link_count = link_count + 1
        if requestObj.status_code == 404:
            broken_links.append("Broken: link " + URL + " from " + parentURL)
            print(broken_links[-1])
        else:
            print("Not broken: link " + URL + " from " + parentURL)
            # Only follow links found on pages of the starting domain.
            if urlparse(URL).netloc == domainToSearch:
                for link in getLinksFromHTML(requestObj.text):
                    check_links(domainToSearch, urljoin(URL, link), URL, depth + 1)
    except Exception as e:
        print("ERROR: " + str(e))
        # Remember the failing URL so that it is not requested again.
        searched_links.append(URL)



# Written by y.sawant @ gmail.com
if (len(sys.argv) != 2):
    print("Please provide a URL.\n")
    sys.exit(1)
depth = 0
check_links(urlparse(sys.argv[1]).netloc, sys.argv[1], "", depth)

print("\n--- Checked " + str(link_count) + " links ---\n")

if not broken_links:
    print("No broken links are found.")
else:
    print("Broken links are listed below:")
    for link in broken_links:
        print("\t" + link)


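I know this script is not perfect, but it gets the job done.  For example, it reports only HTTP 404 responses as broken, and it checks only the links found on the starting page.  If you want something more thorough, I leave that part to you.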
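
To see what the two URL helpers contribute, here is a small standalone sketch.  It is only an illustration: the helper name resolve_and_classify and the sample links are hypothetical and not part of the script above.  urljoin turns an href found on a page into an absolute URL, and a netloc comparison similar to the one in the script tells whether a URL belongs to the same host.

```python
try:
    from urlparse import urlparse, urljoin       # Python 2
except ImportError:
    from urllib.parse import urlparse, urljoin   # Python 3


def resolve_and_classify(page_url, href):
    """Hypothetical helper: resolve href against page_url and report
    whether the result is on the same host as the page."""
    absolute = urljoin(page_url, href)
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return absolute, same_host


# Sample page and links, chosen only for illustration.
page = "https://www.google.co.in/intl/en/about.html"
for href in ["/preferences?hl=en", "policies/terms/", "https://mail.google.com/mail/"]:
    absolute, same_host = resolve_and_classify(page, href)
    print(absolute + (" (same host)" if same_host else " (other host)"))
```

A link that starts with a slash is resolved against the site root, a relative link against the directory of the page, and an absolute link is left unchanged.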

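The two external Python libraries that I used in this script are:
1. requests
2. BeautifulSoup (imported as bs4)

The urlparse module is also used, but it is part of the Python 2 standard library (in Python 3 it is urllib.parse), so it does not need to be installed.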

To check whether an external Python library is available, start the interpreter and try to import it.  At the Linux command prompt, type python and press Enter.  At the Python prompt, type import followed by the name of the library and press Enter.  For example, to check whether the requests library is available, type import requests.  For BeautifulSoup, the module name to import is bs4.  Here is an example.

    # python
    Python 2.7.5 (default, Jun 11 2019, 14:33:56)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import requests
    >>>

If you do not see any error, the library is available.  If you see an ImportError, you need to install it.  You can use pip to install the required libraries (the BeautifulSoup package is named beautifulsoup4).

    # pip install requests
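
The same check can also be done from a small Python snippet instead of the interactive prompt.  This is a minimal sketch, not part of the original script: the function name is_available is hypothetical, and it simply tries to import each module and reports the result.

```python
import importlib


def is_available(module_name):
    """Return True if module_name can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


# bs4 is the module name under which BeautifulSoup is imported.
for name in ("requests", "bs4"):
    print("%-10s %s" % (name, "available" if is_available(name) else "missing"))
```

Any module reported as missing can then be installed with pip as shown above.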

I used my script to check whether https://www.google.co.in contains any broken links.  Here is the output.

# python find_broken_links.py "https://www.google.co.in"
Not broken: link https://www.google.co.in from
Not broken: link https://www.google.co.in/imghp?hl=en&tab=wi from https://www.google.co.in
Not broken: link https://maps.google.co.in/maps?hl=en&tab=wl from https://www.google.co.in
Not broken: link https://play.google.com/?hl=en&tab=w8 from https://www.google.co.in
Not broken: link https://www.youtube.com/?gl=IN&tab=w1 from https://www.google.co.in
Not broken: link https://news.google.co.in/nwshp?hl=en&tab=wn from https://www.google.co.in
Not broken: link https://mail.google.com/mail/?tab=wm from https://www.google.co.in
Not broken: link https://drive.google.com/?tab=wo from https://www.google.co.in
Not broken: link https://www.google.co.in/intl/en/about/products?tab=wh from https://www.google.co.in
Not broken: link http://www.google.co.in/history/optout?hl=en from https://www.google.co.in
Not broken: link https://www.google.co.in/preferences?hl=en from https://www.google.co.in
Not broken: link https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.co.in/ from https://www.google.co.in
Not broken: link https://www.google.co.in/advanced_search?hl=en-IN&authuser=0 from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefs?sig=0_V3GwPGYEtv57hWTx9gHH5SRjSjo%3D&hl=hi&source=homepage&sa=X&ved=0ahUKEwijhYvltK7pAhWPxTgGHfF2AZ8Q2ZgBCAU from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefs?sig=0_V3GwPGYEtv57hWTx9gHH5SRjSjo%3D&hl=bn&source=homepage&sa=X&ved=0ahUKEwijhYvltK7pAhWPxTgGHfF2AZ8Q2ZgBCAY from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefs?sig=0_V3GwPGYEtv57hWTx9gHH5SRjSjo%3D&hl=te&source=homepage&sa=X&ved=0ahUKEwijhYvltK7pAhWPxTgGHfF2AZ8Q2ZgBCAc from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefs?sig=0_V3GwPGYEtv57hWTx9gHH5SRjSjo%3D&hl=mr&source=homepage&sa=X&ved=0ahUKEwijhYvltK7pAhWPxTgGHfF2AZ8Q2ZgBCAg from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefs?sig=0_V3GwPGYEtv57hWTx9gHH5SRjSjo%3D&hl=ta&source=homepage&sa=X&ved=0ahUKEwijhYvltK7pAhWPxTgGHfF2AZ8Q2ZgBCAk from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefs?sig=0_V3GwPGYEtv57hWTx9gHH5SRjSjo%3D&hl=gu&source=homepage&sa=X&ved=0ahUKEwijhYvltK7pAhWPxTgGHfF2AZ8Q2ZgBCAo from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefs?sig=0_V3GwPGYEtv57hWTx9gHH5SRjSjo%3D&hl=kn&source=homepage&sa=X&ved=0ahUKEwijhYvltK7pAhWPxTgGHfF2AZ8Q2ZgBCAs from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefs?sig=0_V3GwPGYEtv57hWTx9gHH5SRjSjo%3D&hl=ml&source=homepage&sa=X&ved=0ahUKEwijhYvltK7pAhWPxTgGHfF2AZ8Q2ZgBCAw from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefs?sig=0_V3GwPGYEtv57hWTx9gHH5SRjSjo%3D&hl=pa&source=homepage&sa=X&ved=0ahUKEwijhYvltK7pAhWPxTgGHfF2AZ8Q2ZgBCA0 from https://www.google.co.in
Not broken: link https://www.google.co.in/intl/en/ads/ from https://www.google.co.in
Not broken: link http://www.google.co.in/services/ from https://www.google.co.in
Not broken: link https://www.google.co.in/intl/en/about.html from https://www.google.co.in
Not broken: link https://www.google.co.in/setprefdomain?prefdom=US&sig=K_8Rgf8LawsO1reHRLJSf5TzwNn9E%3D from https://www.google.co.in
Not broken: link https://www.google.co.in/intl/en/policies/privacy/ from https://www.google.co.in
Not broken: link https://www.google.co.in/intl/en/policies/terms/ from https://www.google.co.in

--- Checked 28 links ---

No broken links are found.
#