Web scraping with Python: lesson 1
Introduction
Web scraping refers to the process of extracting, copying and storing web content or sections of source code with the aim of analysing or otherwise exploiting them. With the help of web scraping procedures, many different types of information can be collected and analysed automatically, for example contact, network or behavioural data. Web scraping is no longer a novelty in empirical research: typical motivations include the repeated collection of cross-sectional web data, the replicability of data collection processes, and the time-efficient and less error-prone collection of online data (Munzert and Nyhuis 2019).
Web scraping with Python
As a low-threshold introduction to web scraping, we use Python to collect articles that are freely available online from the art magazine Monopol, specifically from its section Art Market. For this we need the Python libraries BeautifulSoup, requests and pandas, as well as the library openpyxl for saving the collected data later on. Since I use RStudio as IDE (Integrated Development Environment) for Python, I load the R package reticulate in an R environment and turn off the package's (to me annoying) start-up messages. Finally, I use the function use_python() to tell R under which path my Python binary can be found. Users who work directly in Python can skip this step.
library(reticulate)
options(reticulate.repl.quiet = TRUE)
use_python("~/Library/r-miniconda/bin/python")
In a further step, the Python libraries mentioned above are loaded into the working environment. If the libraries have not yet been installed, this can be done from R with the reticulate function py_install(). In Python, libraries can be installed directly from the command line with pip install.
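For reference, the corresponding command for a plain Python setup, to be run in a terminal rather than inside the Python session, would look roughly like this (note that the PyPI package for BeautifulSoup is called beautifulsoup4, while the import name is bs4):
# in a terminal / shell, not inside the Python session:
# pip install beautifulsoup4 requests pandas openpyxl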
#py_install("pandas")
#py_install("bs4")
#py_install("openpyxl")
from bs4 import BeautifulSoup
import requests
import pandas as pd
import openpyxl
Once this is done, we define the URL where the links to the ten newest articles of the section “Art Market” are listed. The defined URL is passed to the function requests.get(), which should return the status code 200: HTTP status code 200 indicates that the request was processed successfully, i.e. the requested data was located on the web server and is being transferred to the client.
# request the overview page of the Art Market section
url = "https://www.monopol-magazin.de/ressort/kunstmarkt"
res = requests.get(url)
print(res)
## <Response [200]>
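If the server does not answer with status 200, all subsequent parsing would operate on an error page. A minimal optional safeguard, not part of the original workflow, could look like this: raise_for_status() from the requests library throws an exception for 4xx/5xx responses.
# optional: stop with an exception if the request failed (4xx/5xx status codes)
res.raise_for_status()
# or check the numeric status code explicitly
if res.status_code != 200:
    print("Request failed with status code:", res.status_code)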
If we pass the attribute text of the response object res generated by requests.get() to the function BeautifulSoup(), we get a BeautifulSoup object that transforms the HTML document into an easily readable nested data structure.
soup = BeautifulSoup(res.text, "html.parser")  # naming a parser explicitly avoids a bs4 warning
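To get a feeling for this nested structure, the soup object can be inspected directly, for example via the title tag of the page or a short excerpt of the pretty-printed HTML (a purely optional check):
# optional inspection of the parsed document
print(soup.title.text)          # <title> of the overview page
print(soup.prettify()[:300])    # first 300 characters of the indented HTML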
With the help of the SelectorGadget browser extension we determine the CSS selector that selects the first ten articles of the Art Market section. In this case it is #block-kf-bootstrap-content header.
articles = soup.select("#block-kf-bootstrap-content header")
From the list articles created in this way, the titles of the ten articles are now extracted as plain text elements with a list comprehension and written to the list articles_titles.
articles_titles = [i.text for i in articles]
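A quick sanity check, again optional, confirms that ten titles were extracted and shows the newest one:
# optional sanity check
print(len(articles_titles))   # expected: 10
print(articles_titles[0])     # title of the newest Art Market article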
Next, another for-loop iterates over the soup object to extract the links to the full texts of the ten articles. First an empty list called articles_urls is created, to which all partial URLs are appended that can be found inside the HTML elements of the class article__title, more precisely in their a elements. Since the a elements of the HTML code only store the respective paths, but not the main domain of the URL, the domain is finally joined with each path to form a single string.
articles_urls = []
for article__title in soup.find_all(attrs={'class': 'article__title'}):
    for link in article__title.find_all('a'):
        # prepend the domain, since href only contains the relative path
        url = "https://www.monopol-magazin.de" + link.get('href')
        articles_urls.append(url)
type(articles_urls)
## <class 'list'>
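Instead of concatenating the domain and the path by hand, the same result can be obtained with urljoin from Python's standard library, which also handles hrefs that are already absolute. This is an alternative sketch, not the approach used above:
from urllib.parse import urljoin

base = "https://www.monopol-magazin.de"
articles_urls = []
for article__title in soup.find_all(attrs={'class': 'article__title'}):
    for link in article__title.find_all('a'):
        # urljoin resolves relative paths and leaves absolute URLs untouched
        articles_urls.append(urljoin(base, link.get('href')))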
Now the text elements of the ten articles can be collected with a for-loop. Again, an empty list is created, this time with the name data. The for-loop iterates over the ten URLs extracted above, stored in the list articles_urls. For each of the ten URLs, the function requests.get() is called first. If the respective article page returns the status code 200, we create, as above, a soup object with the function BeautifulSoup(), which puts the HTML code of the respective page into a clear structure for us. Then, with the help of SelectorGadget, we determine the CSS selector that leads to the body text of each article. In this case it is .field--type-text-long.field--label-hidden. Finally, the collected text elements are normalised to the Unicode standard using the function normalize() of the library unicodedata and a list comprehension. This is done with the NFKD algorithm.
data = []
for url in articles_urls:
    res = requests.get(url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, "html.parser")
        for element in soup.select('.field--type-text-long.field--label-hidden'):
            data.append(element.get_text())

# clean text: normalise the collected strings to the NFKD Unicode form
import unicodedata
data = [unicodedata.normalize("NFKD", i) for i in data]
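To illustrate what the NFKD normalisation does: compatibility characters such as non-breaking spaces are replaced by their plain equivalents. A small standalone example with a made-up string:
import unicodedata

raw = "Kunstmarkt\xa02022"              # contains a non-breaking space (U+00A0)
clean = unicodedata.normalize("NFKD", raw)
print(clean == "Kunstmarkt 2022")        # True: the non-breaking space became a plain space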
Voilà! Our code created little spiders that roam the URLs we specified and reliably collect all the information we instructed them to gather. With a few lines of Python, these little spiders accomplish in seconds, and without slips, work that would take us humans a good quarter of an hour by hand.
We now combine the three collections, the article titles, the URLs and the actual text corpus (restricted to its first ten text elements), first into a dictionary called d and then into a data frame called df. Furthermore, we use the replace() and strip() string methods to remove all line breaks and leading or trailing whitespace from the string columns.
data_sub = data[0:10]
d = {'titles':articles_titles,'urls':articles_urls,'texts':data_sub}
df = pd.DataFrame(d)
df[df.columns] = df.apply(lambda x: x.str.replace("\n", ""))
df[df.columns] = df.apply(lambda x: x.str.strip())
print(df)
## titles ... texts
## 0 Spark Art Fair in Wien Kunstmesse mit sel... ... Die Spark Art Fair lohnt den Besuch bei Positi...
## 1 Auktion in New York Venedig-Gemälde von M... ... Ein Venedig-Gemälde des impressionistischen M...
## 2 Erste Ausgabe im Oktober Art Basel gibt N... ... Die neue Messe der Art Basel in Paris, die im ...
## 3 Auktion in New York Gedichtband von junge... ... Ein unveröffentlichter Gedichtband und Liebes...
## 4 Urteil im Streit um Purrmann-Gemälde "Die... ... Der Enkel des Malers Hans Purrmann klagt auf F...
## 5 Online-Auktion Initiative "Artists for Uk... ... Katharina Garbers-von Boehm ist Partnerin bei ...
## 6 Auktion in New York Warhol-Werk könnte be... ... Zur Unterstützung von Hilfsprojekten für Gef...
## 7 Rekordpreis Zeichnung von René Magritte e... ... Ein von Künstler Andy Warhol (1928-1987) ange...
## 8 Auktion in den USA Banksy-Kunstwerk für 5... ... Das Werk "Die Vergewaltigung" von René Magrit...
## 9 Benefizversteigerung Ketterer-Auktion für... ... Ein Werk des britischen Street-Art-Künstlers ...
##
## [10 rows x 3 columns]
Finally, we export the data frame created in this way to the .xlsx format with the function ExcelWriter(). To do this, we create the writer for the desired output file, write the object df into an Excel sheet and close the writer, which saves the file (in older pandas versions this last step was writer.save(); it has since been replaced by writer.close()).
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer)
writer.close()  # finalises and saves the file; replaces the deprecated writer.save()
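An equivalent and somewhat more robust variant, assuming the same output file name, uses a context manager, which closes the writer and saves the file automatically:
# alternative export using a context manager; the file is saved when the block ends
with pd.ExcelWriter('output.xlsx') as writer:
    df.to_excel(writer)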
There are many possibilities for analysing the data obtained in this way; some of them will be presented here in the medium term.
References
Munzert, Simon, and Dominic Nyhuis. 2019. “Die Nutzung von Webdaten in den Sozialwissenschaften.” In Handbuch Methoden der Politikwissenschaft, 373–97. Wiesbaden: Springer. https://doi.org/10.1007/978-3-658-16937-4_22-1.