Acquiring Data

From GEST-S482 Digital Business

Data and the Web

When data is available online, two challenges must be tackled:

  1. Identifying where to retrieve the data
  2. Aggregating the data collected

To retrieve data, we can for example use search engines, web scraping, or API calls.

Wayback Machine

What is it and why could it be very useful? Everyone has experienced this phenomenon: you saw something on a website's main page, and the next day you want to look it up again, but the website has been updated, so the page you were looking for has moved somewhere else or, perhaps even worse, is gone… Annoying, isn't it? A wayback machine is a name given to websites that allow you to go back in time and retrieve a page as it appeared on a particular day. That could be yesterday, but just as well a month ago or even decades ago (if the website already existed). An example is the website http://www.archive.org .
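Beyond browsing the archive by hand, archive.org also exposes a small availability API that returns the snapshot closest to a given date. The sketch below follows its public documentation, but treat the endpoint and the JSON field names as assumptions that may evolve rather than a guaranteed interface:

import requests

# Ask the Wayback Machine for the snapshot of example.com closest to 1 January 2015
params = {"url": "example.com", "timestamp": "20150101"}
response = requests.get("https://archive.org/wayback/available", params=params)

# The reply is JSON; "archived_snapshots" -> "closest" holds the nearest capture, if any
snapshot = response.json().get("archived_snapshots", {}).get("closest")
if snapshot:
    print(snapshot["url"])        # address of the archived copy
    print(snapshot["timestamp"])  # when that copy was captured
else:
    print("No snapshot found for this URL")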

Search Engines

Search engine companies are in the business of understanding the meaning of a user's query, storing content and information about many web pages, and matching the user to the pages that best fit their intent (while maximizing advertising revenue in the process). This is a very difficult task: they have to figure out what the user wants, often before the user can articulate it, and return a relevant result.

When you have a clear idea of what you want to search for, you can weaponize your search query with a few techniques.

Inclusion

Sometimes, search engines return results that do not match the exact term you were searching for, because the engine second-guesses you and assumes you are looking for something else. To make the search engine match the term exactly, without inferring other terms, put it in quotation marks: "term looked for".
To make sure a term will not appear in the results, put a dash right before it: -term not looked for.

Combinations

We can combine search terms using Boolean logic (OR, AND).

For example, when you search for dogs AND cats, or even just dogs cats, the search engine will try to find resources containing as many of the keywords as possible. However, if you use the OR keyword, as in cats OR dogs, you will receive results that contain either word (and some that contain both).

Site

In order to get results from a specific source only, we can simply add site:the domain name of the source before our search term.

For example, to search for forecasts coming from the World Bank, we write: site:worldbank.org “forecast”.

Filetype

If we want a specific type of file (pdf, xls, ppt, etc., but interestingly not csv), we can use filetype:the format you want. This is very useful when you are looking for quantitative data.
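These operators can be combined in a single query. For instance, a query along the following lines (purely illustrative) restricts results to PDF files on the World Bank site that contain the exact phrase and excludes a term:

site:worldbank.org filetype:pdf "GDP forecast" -draft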

Web Scraping

Web scraping is really useful for extracting information from a huge quantity of pages. In the case we are about to see, we will use the Beautiful Soup module to extract information from an HTML web page and then process it with Python. Knowledge of HTML, CSS and object-oriented Python is a prerequisite to understand the last part.

Web scraping programs usually work in 3 steps:

  1. Send a request to the server for a web page and receive the server's reply. This is usually outsourced to a library; here we will use requests, which lets us send a GET or a POST request and get the answer back (already decoded to text).
  2. Transform this reply into a tree (the HTML tree) that you can then explore (= parsing the document). This is also outsourced to a library, in our case the Beautiful Soup Python library. It makes parsing and retrieving a breeze and is fast enough to be used on big projects.
  3. Identify the interesting parts of the tree using the classes and ids on the page.

Basic code

# to receive the page content from a url:
import requests
my_url = "http://.../"
page_content = requests.get(my_url).content

# transform it into a tree
from bs4 import BeautifulSoup
page_tree = BeautifulSoup(page_content, "html.parser")

# find an element by id
target_content = page_tree.find(id="idName")

Explanations (statement by statement):

  1. We import the requests module, which will download a page from the internet.
  2. We store in the variable my_url a URL as a string (between " " or ' ').
  3. We store in the variable page_content the content of the web page: requests.get(my_url) fetches the page at that URL, and its .content attribute gives us the page content.
  4. The BeautifulSoup class is imported from the bs4 module.
  5. We build the tree by calling BeautifulSoup with two parameters: the content we previously downloaded and the parser to use, here "html.parser". The result is stored in page_tree.
  6. Finally, we call the find method on the tree, which looks for an element with the id idName on the downloaded page. We do not have to worry about getting several results, for two reasons: find() returns only the first element found, and an id may be used by only one tag on a page (unlike a class). So there is no need to handle a list here to get the first value.
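If the element you want is marked with a class rather than an id, find and find_all accept a class_ argument (the trailing underscore avoids clashing with Python's class keyword). A minimal sketch, assuming a hypothetical class name "price":

items = page_tree.find_all("span", class_="price")   # every <span class="price"> on the page
for item in items:
    print(item.text)                                  # the text inside each matching tag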

Navigating the tree

If the web page is really well structured and exactly what you are looking for has an id, you are really lucky. In most cases it won't be... Finding what you are looking for is still possible through other means.

If an element with an id is close to what you are looking for, you can try the following attributes:

interesting_content = target_content.next_element        #to get the next element
interesting_content = target_content.previous_element    #to get the previous element
interesting_content = target_content.previous_sibling    #to get the previous sibling
...                                                      #you get the point...

Tips

  • Note that these attributes are used on an element of the tree (here target_content).
  • Do not hesitate to chain them, e.g. take the previous element of a previous element.
  • For other ways to navigate the tree, see the Beautiful Soup website.

- Official documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
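As a small illustration of chaining navigation steps (the tag names here are hypothetical, not taken from a specific page):

heading = page_tree.find("h2")          # first <h2> on the page
table = heading.find_next("table")      # first <table> that appears after that heading
container = table.parent                # the tag that directly encloses the table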

API calls

API stands for Application Programming Interface. It is a piece of software that allows two applications or systems to share data and use it. It thus creates a link between different entities without having to store the data of one inside the other, and it allows automatic updates. API calls are mainly used to request data that is not formatted in HTML. There are two parts in an API call:

  1. The call itself
  2. The receipt of the resource you asked for.

An advantage of this approach is that you only need to execute the call correctly: the API will usually return the data in an easy-to-handle format, containing only the data relevant to you. The challenge is therefore to build the call correctly so that you retrieve the data you were looking for. Two types of data can be retrieved: near-real-time data, also called a data flow, or historical data (on which we focus in this course).

Postman is a program that helps test API calls: it sends the request for you and gives an idea of what you are going to get back. More generally, it lets you test HTTP requests.

If you want more information, check out these five short videos. Here are the links, enjoy!

  1. Sending a request: https://www.youtube.com/watch?list=PLM-7VG-sgbtBsenu0CM-UF3NZj3hQFs7E&v=YKalL1rVDOE&feature=emb_logo
  2. Authorizing a request: https://www.youtube.com/watch?v=d519r1stILE&list=PLM-7VG-sgbtBsenu0CM-UF3NZj3hQFs7E&index=2
  3. Writing a test: https://www.youtube.com/watch?v=6Cp4Ez5dwbM&list=PLM-7VG-sgbtBsenu0CM-UF3NZj3hQFs7E&index=3
  4. Running a collection: https://www.youtube.com/watch?v=oZHd226DZPU&list=PLM-7VG-sgbtBsenu0CM-UF3NZj3hQFs7E&index=4
  5. Chaining requests: https://www.youtube.com/watch?v=shYn3Ys3ygE&list=PLM-7VG-sgbtBsenu0CM-UF3NZj3hQFs7E&index=5

Some APIs need to receive options to return the correct information. When you perform a GET request (which is the case for most APIs), the options, or parameters, must be passed in the URL: after the root of the API we put a question mark (?) and then key-value pairs (key1=value1) separated by ampersands (&). An example with requests is shown below.
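With requests, you do not have to build that URL by hand: pass the parameters as a dictionary and the library appends them after the ? for you. The endpoint and parameter names below are made up for illustration:

import requests

# Hypothetical API root and parameters; requests turns the dict into ?country=BE&year=2020
params = {"country": "BE", "year": "2020"}
response = requests.get("https://api.example.com/v1/indicators", params=params)

print(response.url)      # the full URL that was actually called, parameters included
data = response.json()   # most APIs reply with JSON, which becomes a Python dict or list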

Web Crawling

A web crawler is a program that browses the World Wide Web in an automated, methodical manner. This process is called web crawling. The key difference with web scraping is the goal: a crawler visits pages and creates a copy of what is there, while a scraper extracts specific data from specific websites or pages for analysis, or to create something new. Another view of web crawling, given in the course, is that it is the process of searching for information using the result of your previous information retrieval.

Example

The following code counts the number of appearances of the word "digital" in the class wiki.

import requests
from bs4 import BeautifulSoup

def count_digital(text):
    words = text.split()
    digital_count = 0
    for word in words:
        if word.lower() == "digital":
            digital_count += 1
    return digital_count - 1

# Try to figure out why I have to subtract 1 from the tally here

def filter_urls(urls):
    url_base = "/index.php/" # guarantees we only index our wiki
    
    # note: x.get("class") returns a list of class names, so we test membership rather than equality
    check_incl = lambda x : True if x.get('href') is not None and \
    url_base == x.get("href")[:11] and \
    "#" not in x.get("href") and \
    "Special:" not in x.get("href") and \
    "action=" not in x.get('href') and \
    "new" not in (x.get("class") or []) else False
    
    urls = filter(check_incl, urls)
    urls = [url["href"] for url in urls]
    
    return list(urls)

def treat_page(url):
    base = "https://www.thedigitalfirm.eu"
    page_content = requests.get(base + url).content
    tree = BeautifulSoup(page_content, "html.parser")
    content_page = tree.find(id="content")
    digital_count = count_digital(content_page.text)
    retrieved_urls = filter_urls(content_page.find_all("a"))
    return digital_count, retrieved_urls

Explanation of the code:

  • word.lower(): .lower() converts the whole word to lowercase. Otherwise "Digital" would not be counted, since "Digital" == "digital" is False because of the capital letter.
  • return digital_count - 1: one "digital" is hidden in the code of the page, so we do not see it when we look at the page and we have to remove it from the tally.
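The functions above only treat a single page; the crawling itself, i.e. feeding the URLs retrieved from one page back into the next requests, is done by a driver loop. A minimal sketch of such a loop (the starting URL "/index.php/Main_Page" and the page limit are assumptions, not taken from the course):

def crawl(start_url, max_pages=50):
    to_visit = [start_url]   # frontier: pages discovered but not yet treated
    visited = set()          # pages already treated, to avoid loops
    total = 0
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)
        count, links = treat_page(url)
        total += count
        to_visit.extend(link for link in links if link not in visited)
    return total

print(crawl("/index.php/Main_Page"))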

Link to the assistant web page

Where to go?

Main page · Exercises · Next session: Analysing Data