
Web Scraping in Python || Tool for Digital Marketing

Updated: Jan 25, 2021

The fuel for Digital Marketing, as we all know, is "data", and the Internet is full of it. As per statistics, "it is believed that over 2.5 quintillion bytes (roughly 2.5 billion GB) of data are created every day" (source: takeo.ai). With this much data generated every day, it is really difficult to find exactly what you need.

There are multiple ways to get publicly available data, and "web scraping" is one of them. In this post I will walk you through, step by step, how to extract the information you need from websites.


Use Case: Extract the email IDs of company directors.

Assumptions:

  1. Company CIN - will be used as the input to our script for finding the email.

  2. ChromeDriver installed - the script is based on the Chrome driver.

  3. Necessary packages installed (a quick sanity check follows this list).

  4. XPaths of the elements used are extracted by inspecting the elements or by using an available XPath extraction tool (ChroPath is what I prefer).
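Before running anything, it helps to confirm that the packages from assumption 3 are actually importable. Below is a minimal sanity check, assuming pandas and Selenium were installed with pip; the printed versions will naturally differ with your setup.

#Minimal sanity check that the required packages are importable.
#Assumes they were installed beforehand, e.g. with: pip install pandas selenium
import pandas
import selenium
print("pandas:", pandas.__version__)
print("selenium:", selenium.__version__)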

Step 1: Import necessary packages using import statements.


import pandas as pd #For reading the CSV input and writing the results
from selenium import webdriver #Browser automation
from selenium.webdriver.support.ui import WebDriverWait #Explicit waits
import time #Simple sleeps while pages load

Step 2: Reading data to pass as input to the script.


path='./Data_file.csv' #CSV file containing the CINs to look up
April_cin_dir={"CIN":[],"Director":[]} #Accumulator for the extracted results
df=pd.read_csv(path)
cin_list=df.values.tolist() #Each CSV row becomes one entry in the list
url='https://www.zaubacorp.com/' #Change with your desired website
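If you want to see exactly what will be fed to the browser, a quick preview helps. This assumes Data_file.csv sits next to the script and that the code from Step 2 has already run; note that each entry of cin_list is one CSV row represented as a list.

#Optional preview of the input data (assumes Step 2 has run).
print(df.head()) #First few rows of the CSV
print(cin_list[:3]) #Each entry is one CSV row as a list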

Step 3: Chrome driver initialization to open the webpage.


driver=webdriver.Chrome() #Assumes chromedriver is available on your PATH
driver.maximize_window()
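If chromedriver is not on your PATH, or you prefer the scrape to run without a visible browser window, Chrome options cover both. The lines below are an optional variation rather than part of the original script, and the driver path shown is only a placeholder to replace with your own location.

#Optional variation: point Selenium at a specific chromedriver binary and run headless.
from selenium import webdriver

options=webdriver.ChromeOptions()
options.add_argument("--headless") #Run Chrome without opening a window
options.add_argument("--window-size=1920,1080") #Give the headless browser a real viewport
driver=webdriver.Chrome(executable_path="/path/to/chromedriver",options=options) #Placeholder path - change it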


Step 4: Defining a function for opening the webpage.


def open_app(url):
    driver.get(url) #Navigate the browser to the target website
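Since WebDriverWait is already imported in Step 1, a variation of open_app could wait explicitly for the search box instead of relying only on fixed sleeps. This is just a sketch; it assumes the search box keeps the id 'search-com' that the XPaths in Step 5 use.

#Optional variation: wait up to 10 seconds for the search box to appear.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def open_app(url):
    driver.get(url)
    WebDriverWait(driver,10).until(EC.presence_of_element_located((By.ID,"search-com"))) #'search-com' is the search box id used in Step 5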

Step 5: Defining a function for extracting the email ID.



def get_director(cin_list):
    for cin in cin_list:
        April_cin_dir["CIN"].append(cin)
        try:
            driver.find_element_by_xpath("//input[@id='search-com']").send_keys(cin) #Finding the search box on the page and entering the search criteria.
            driver.find_element_by_xpath("//button[@id='edit-submit--3']").click() #Finding the Submit button and sending a click event
            time.sleep(5) #This becomes a savior in most cases, as webpages take time to load after the click
            April_cin_dir["Director"].append(driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/div[2]/div[10]/table/tbody/tr/td[2]/strong/a").text) #Traversing to the field on the results page and extracting its text
        except Exception: #If the field is not present, record that and move on to the next CIN
            April_cin_dir["Director"].append("Not Present")
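One note in case your Selenium install is newer than the one this post was written against: the find_element_by_xpath helpers were removed in Selenium 4, so on a 4.x install the same lookups are written with a By locator instead, as sketched below with a placeholder CIN.

#Selenium 4.x equivalent of the lookups above (the XPaths are unchanged).
from selenium.webdriver.common.by import By

driver.find_element(By.XPATH,"//input[@id='search-com']").send_keys("YOUR_CIN_HERE") #Placeholder value
driver.find_element(By.XPATH,"//button[@id='edit-submit--3']").click()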
        

Step 6: Calling the custom function to run the email extraction script and storing the extracted data in a CSV.


open_app(url) #Open the website
get_director(cin_list) #Run the extraction for every CIN in the input file
driver.close() #Close the browser window

df_dir=pd.DataFrame.from_dict(April_cin_dir)
df_dir.to_csv("Nov_cin_director_opc1.csv")
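If you prefer a cleaner output file, you can also preview the frame and drop the pandas index column before writing; index=False is a standard to_csv option and an optional tweak rather than part of the original script.

#Optional: preview the results and write the CSV without the pandas index column.
print(df_dir.head())
df_dir.to_csv("Nov_cin_director_opc1.csv",index=False)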

I have assumed that finding XPaths is already known to you. In case you want me to write about that, please comment and I will share insights on it.


Thanks for reading through.



