Web Scraping in Python || Tool for Digital Marketing
Updated: Jan 25, 2021
The fuel for digital marketing, as we all know, is data, and the Internet is full of it. As per statistics, "it is believed that over 2.5 quintillion bytes (2.5e+9 GB) of data are created every day" (source: takeo.ai). With this much data generated every day, it is really difficult to find exactly what you need.
There are multiple ways to get publicly available data, and web scraping is one of them. In this post I will walk you through, step by step, how to extract the information you need from websites.
Use Case: Extract the email IDs of company directors.
Assumptions:
Company CIN - will be used as input to our script for finding the email.
ChromeDriver installed - the script is based on the Chrome driver.
Necessary packages installed.
XPaths of the elements used are extracted by inspecting elements or by using available XPath extraction tools (ChroPath is what I prefer).
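Before wiring an XPath into Selenium, it can help to sanity-check it offline. Python's built-in ElementTree supports the simple attribute predicates this script relies on. The HTML fragment below is made up to mimic the search form, not copied from the real site:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the search form; the real page markup will differ.
page = ET.fromstring(
    "<html><body>"
    "<input id='search-com' />"
    "<button id='edit-submit--3'>Search</button>"
    "</body></html>"
)

# The same attribute-based XPaths the script passes to Selenium.
print(len(page.findall(".//input[@id='search-com']")))     # 1
print(len(page.findall(".//button[@id='edit-submit--3']")))  # 1
```

As a rule of thumb, attribute-based XPaths like these survive page redesigns far better than long absolute paths such as /html/body/div[1]/....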
Step 1: Import necessary packages using import statements.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By  # locator strategies for find_element
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC  # readable wait conditions
import time
Step 2: Reading data to pass as input to the script.
path = './Data_file.csv'
April_cin_dir = {"CIN": [], "Director": []}
df = pd.read_csv(path)
cin_list = [row[0] for row in df.values.tolist()]  # df.values.tolist() returns one list per row; this assumes the CIN is in the first column
url = 'https://www.zaubacorp.com/'  # change this to your desired website
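The shape of cin_list depends on your CSV. As a quick, self-contained illustration (the column name "CIN" and the sample values below are made up for this sketch, not taken from the real Data_file.csv):

```python
import io
import pandas as pd

# Hypothetical stand-in for Data_file.csv; column name and values are assumptions.
sample_csv = io.StringIO("CIN\nU12345MH2015PTC000001\nU67890DL2018PTC000002\n")

df = pd.read_csv(sample_csv)
# df.values.tolist() yields a list of rows (each row itself a list),
# so flatten it to get one CIN string per entry.
cin_list = [row[0] for row in df.values.tolist()]
print(cin_list)  # ['U12345MH2015PTC000001', 'U67890DL2018PTC000002']
```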
Step 3: Initializing the Chrome driver to open the webpage.
driver = webdriver.Chrome()
driver.maximize_window()
Step 4: Defining a function for opening the webpage.
def open_app(url):
    driver.get(url)
Step 5: Defining a function for extracting the email ID.
def get_director(cin_list):
    wait = WebDriverWait(driver, 10)  # explicit wait: polls for the element for up to 10 seconds
    for i in cin_list:
        April_cin_dir["CIN"].append(i)  # append the CIN first so both lists stay the same length
        try:
            search_box = wait.until(EC.presence_of_element_located((By.XPATH, "//input[@id='search-com']")))  # finding the search box on the page
            search_box.clear()  # clear the previous query before entering the next CIN
            search_box.send_keys(i)  # entering the search criteria
            driver.find_element(By.XPATH, "//button[@id='edit-submit--3']").click()  # finding the Submit button and sending a click event
            time.sleep(5)  # this becomes a savior in most cases, as webpages take time to load after the click
            April_cin_dir["Director"].append(driver.find_element(By.XPATH, "/html/body/div[1]/div[1]/div/div/div[2]/div[10]/table/tbody/tr/td[2]/strong/a").text)  # traversing to the email field and extracting its text
        except Exception:  # handles the case where the email is not present for a CIN
            April_cin_dir["Director"].append("Not Present")
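The skip-and-continue idea in get_director — record "Not Present" when a lookup fails and move on to the remaining CINs — can be sketched without a browser. Here lookup is a hypothetical stand-in for the block of Selenium calls:

```python
def collect(cin_list, lookup):
    """Apply lookup to each CIN; on failure record "Not Present" and continue.

    lookup is any callable that returns a director string or raises an
    exception - in the real script it is the block of Selenium calls.
    """
    result = {"CIN": [], "Director": []}
    for cin in cin_list:
        result["CIN"].append(cin)  # append first so both lists stay the same length
        try:
            result["Director"].append(lookup(cin))
        except Exception:
            result["Director"].append("Not Present")
    return result

# Toy lookup: pretend the second CIN has no record on the site.
directors = {"CIN1": "A. Sharma", "CIN3": "R. Mehta"}
out = collect(["CIN1", "CIN2", "CIN3"], lambda c: directors[c])
print(out["Director"])  # ['A. Sharma', 'Not Present', 'R. Mehta']
```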
Step 6: Calling the custom functions to run the email extraction script and storing the extracted data in a CSV.
open_app(url)
get_director(cin_list)
driver.quit()  # ends the browser session
df_dir = pd.DataFrame.from_dict(April_cin_dir)
df_dir.to_csv("Nov_cin_director_opc1.csv", index=False)  # index=False keeps the pandas row index out of the file
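Note that DataFrame.from_dict raises a ValueError if the two lists end up with different lengths, which is why the function above appends to both on every iteration. A minimal, file-free sketch of this final step with placeholder data (written to an in-memory buffer instead of disk):

```python
import io
import pandas as pd

# Placeholder results standing in for the scraped dictionary.
April_cin_dir = {"CIN": ["CIN1", "CIN2"], "Director": ["A. Sharma", "Not Present"]}

df_dir = pd.DataFrame.from_dict(April_cin_dir)

buf = io.StringIO()
df_dir.to_csv(buf, index=False)  # index=False drops the pandas row index column
print(buf.getvalue())
```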
I have assumed you already know how to find XPaths. If you would like me to write about that, please comment and I will share insights.
Thanks for reading.