Hands on with Scraping and Plotting Data in Python

Syed Saad Ahmed
3 min read · May 16, 2020


Recently, I came across a requirement to scrape some content from a website (Wikipedia, in this case) and then try to get something useful out of that content. Getting something out of it is another story. While working on the scraping part, I came across a Python library named “BeautifulSoup”. There are other libraries as well, such as “scrapy”, but for this task I opted for BeautifulSoup. It is indeed a good library for scraping content from a web page within a couple of minutes.

Introduction to BeautifulSoup

Beautiful Soup is a simple library in Python for extracting and getting data out of HTML and XML files. It commonly saves programmers hours or days of work, providing them Pythonic idioms for iterating, searching, and modifying the parse tree.
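
To make the three idioms mentioned above concrete, here is a minimal sketch of searching, iterating over, and modifying a parse tree. The HTML fragment and its contents are made up for illustration and have nothing to do with the Wikipedia page used later.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, used only for illustration
html = """
<ul id="fruits">
  <li class="item">Apple</li>
  <li class="item">Banana</li>
  <li class="item">Cherry</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching: collect every <li> tag with class "item"
items = soup.find_all("li", class_="item")

# Iterating: read the text of each matched tag
names = [li.get_text(strip=True) for li in items]
print(names)

# Modifying: replace the text of the first item in the parse tree
items[0].string = "Apricot"
print(soup.find("li").get_text())
```

The same find_all call is what the script in the next section relies on, only with a table tag instead of a list item.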

Scraping a Wikipedia Page with tables in it.

We start with a simple project: extracting the names of countries and their scores from a table on the Wikipedia page given in the URL variable in the script.

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://en.wikipedia.org/wiki/World_Happiness_Report#2019_report'

# Download the page and parse the HTML
material = requests.get(URL)
soup = BeautifulSoup(material.content, 'html.parser')

# Find every table whose class attribute is "wikitable sortable"
tables = soup.find_all('table', class_='wikitable sortable')

The script starts by importing the required modules, including BeautifulSoup, as seen in the first few lines.

Next, we request the content at the URL with a GET request using the requests module. We then pass the response to BeautifulSoup, which parses the content with the HTML parser. Finally, the find_all function finds the tables on the page by the class name given in the HTML. The class attribute of the table is “wikitable sortable”, as seen in the screenshot below:

Screenshot: name of the class of the table on the Wikipedia page.
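
One detail of matching on the class name is worth a small sketch. When class_ is given a string with several class names, Beautiful Soup compares it against the exact value of the class attribute, so a table that carries an extra class would not match. A CSS selector matches any table that has both classes. The HTML below is made up for illustration, and the extra class name "static" is an assumption, not something taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two tables whose class attributes differ slightly
html = """
<table class="wikitable sortable"><tr><th>A</th></tr></table>
<table class="wikitable sortable static"><tr><th>B</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-word class_ string matches only the exact attribute value
exact = soup.find_all("table", class_="wikitable sortable")

# A CSS selector matches any table that carries both classes
both = soup.select("table.wikitable.sortable")

print(len(exact), len(both))
```

If the exact-string search ever returns nothing on a real page, switching to the select call may be worth a try.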

After this part of the script runs, the tables variable holds the tables matched by class name. The main task now is to iterate over the extracted table: looping over its tr (row) and td (cell) elements gives us the values.

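# Pick the first table whose leading column headings match the ones we want
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:2] == ['Country or region', 'Score']:
        break

# Collect one dict per data row; rows with fewer than two <td> cells
# (such as the header row) are skipped
rows = []
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) < 2:
        continue
    country_name, score = [td.text.strip() for td in tds[:2]]
    rows.append({
        'Country or region': country_name,
        'Score': score,
    })

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df = pd.DataFrame(rows, columns=['Country or region', 'Score'])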

With the country names and scores extracted, the next step is to put them in a DataFrame so that the data can be plotted and simple calculations run on it. The pandas library in Python makes this easy.
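
As a rough sketch of the "easy calculations" mentioned above, the snippet below builds a DataFrame from a list of row dictionaries and computes the mean score and the highest-scoring entry. The rows are placeholders with invented values, standing in for whatever the scraper returns. They are not figures from the report.

```python
import pandas as pd

# Placeholder rows with invented values, standing in for the scraped data
rows = [
    {"Country or region": "Country A", "Score": "7.5"},
    {"Country or region": "Country B", "Score": "6.0"},
    {"Country or region": "Country C", "Score": "4.5"},
]

df = pd.DataFrame(rows, columns=["Country or region", "Score"])

# Scraped values are strings, so convert them before doing arithmetic
df["Score"] = df["Score"].astype(float)

# Average score across all rows
mean_score = df["Score"].mean()

# Name of the row with the highest score
best = df.loc[df["Score"].idxmax(), "Country or region"]

print(mean_score, best)
```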

Once the data is in a DataFrame, we can plot it with matplotlib or any other plotting library. Here is an example with matplotlib:

import matplotlib.pyplot as plot

# Scores are scraped as strings, so convert them before plotting
df['Score'] = df['Score'].astype(float)
df.plot.bar(x='Country or region', y='Score', rot=90,
            title="Distribution of Happiness score per Country", legend=False)
plot.ylabel("Happiness Score")
plot.xlabel("Countries")
plot.show()

With the country names on the x-axis and the Happiness scores on the y-axis, we can plot the Happiness score per country. An example plot is shown below for a better understanding:

Plot of Happiness Score vs Countries

As there are a lot of countries in the table, the x-axis of the plot is a bit messy. In a nutshell, merging all of the pieces of code above, the overall script looks something like this:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plot

URL = 'https://en.wikipedia.org/wiki/World_Happiness_Report#2019_report'
material = requests.get(URL)
soup = BeautifulSoup(material.content, 'html.parser')
tables = soup.find_all('table', class_='wikitable sortable')

# Pick the first table whose leading column headings match the ones we want
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:2] == ['Country or region', 'Score']:
        break

# Collect one dict per data row; rows with fewer than two <td> cells
# (such as the header row) are skipped
rows = []
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) < 2:
        continue
    country_name, score = [td.text.strip() for td in tds[:2]]
    rows.append({
        'Country or region': country_name,
        'Score': score,
    })

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df = pd.DataFrame(rows, columns=['Country or region', 'Score'])

# Scores are scraped as strings, so convert them before plotting
df['Score'] = df['Score'].astype(float)
df.plot.bar(x='Country or region', y='Score', rot=90,
            title="Distribution of Happiness score per Country", legend=False)
plot.ylabel("Happiness Score")
plot.xlabel("Countries")
plot.show()
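
One possible way to tidy up the crowded x-axis is to plot only the highest-scoring countries as a horizontal bar chart. The sketch below is an optional variation, not part of the original script. The helper names top_n and plot_top_n are hypothetical, and the sample rows are placeholders with invented values.

```python
import pandas as pd


def top_n(df, n=10):
    """Return the n rows with the highest Score, highest first."""
    return df.sort_values("Score", ascending=False).head(n)


def plot_top_n(df, n=10):
    """Draw a horizontal bar chart of the n highest scores and return the Axes."""
    # Reverse the order so the highest score is drawn at the top of the chart
    subset = top_n(df, n).iloc[::-1]
    ax = subset.plot.barh(x="Country or region", y="Score", legend=False,
                          title=f"Top {n} Happiness scores")
    ax.set_xlabel("Happiness Score")
    ax.set_ylabel("Countries")
    return ax


# Placeholder data with invented values, only to show the helper in use
sample = pd.DataFrame({
    "Country or region": ["Country A", "Country B", "Country C", "Country D"],
    "Score": [5.0, 7.0, 6.0, 4.0],
})
print(top_n(sample, 2))
```

With the scraped DataFrame, calling plot_top_n(df, 20) and then plot.show() should give a chart with readable labels, assuming matplotlib is imported as plot as in the script.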

Happy Coding and Scraping !!!

Written by Syed Saad Ahmed

Python, DevOps, Cryptography, Infrastructure Automation. https://thesaadahmed.com/
