Web scraping with Python: lesson 1
Introduction
Web scraping refers to the process of extracting, copying and storing web content or sections of source code with the aim of analysing or otherwise exploiting them. With the help of web scraping procedures, many different types of information can be collected and analysed automatically, for example contact, network or behavioural data. Web scraping is no longer a novelty in empirical research: typical motivations include the repeated collection of cross-sectional web data, the replicability of data collection processes, and the time-efficient and less error-prone collection of online data (Munzert and Nyhuis 2019).
Web scraping with Python
As a low-threshold introduction to web scraping, we use Python to collect articles that are freely available online from the art magazine Monopol, specifically from its section Art Market. For this we need the Python libraries BeautifulSoup, requests and pandas, as well as the library openpyxl for saving the collected data later on. Since I use RStudio as IDE (Integrated Development Environment) for Python, I load the R package reticulate in an R environment and turn off the package's (to me annoying) start-up messages. Finally, I use the function use_python() to tell R under which path my Python binary can be found. Users who work directly in Python can skip this step.
library(reticulate)
options(reticulate.repl.quiet = TRUE)
use_python("~/Library/r-miniconda/bin/python")
In a further step, the Python libraries mentioned above are loaded into the working environment. If the libraries have not yet been installed, this can be done from R with the reticulate function py_install(). In Python, libraries can be installed directly from the command line with pip install.
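For reference, the corresponding command for a plain Python setup, to be run in a terminal rather than inside the Python session, would look roughly like this (note that the PyPI package for BeautifulSoup is called beautifulsoup4, while the import name is bs4):
# in a terminal / shell, not inside the Python session:
# pip install beautifulsoup4 requests pandas openpyxl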
#py_install("pandas")
#py_install("bs4")
#py_install("openpyxl")
from bs4 import BeautifulSoup
import requests
import pandas as pd
import openpyxl
Once this is done, we define the URL where the links to the ten newest articles of the section “Art Market” are listed. The defined URL is passed to the function requests.get(), which should return the status code 200: HTTP status code 200 indicates that the request was processed successfully, i.e. the requested data was located on the web server and is being transferred to the client.
# request the overview page of the Art Market section
url = "https://www.monopol-magazin.de/ressort/kunstmarkt"
res = requests.get(url)
print(res)
## <Response [200]>
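If the server does not answer with status 200, all subsequent parsing would operate on an error page. A minimal optional safeguard, not part of the original workflow, could look like this: raise_for_status() from the requests library throws an exception for 4xx/5xx responses.
# optional: stop with an exception if the request failed (4xx/5xx status codes)
res.raise_for_status()
# or check the numeric status code explicitly
if res.status_code != 200:
    print("Request failed with status code:", res.status_code)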
If we pass the attribute text of the response object res generated by requests.get() to the function BeautifulSoup(), we get a BeautifulSoup object that transforms the HTML document into an easily readable nested data structure.
soup = BeautifulSoup(res.text, "html.parser")  # naming a parser explicitly avoids a bs4 warning
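To get a feeling for this nested structure, the soup object can be inspected directly, for example via the title tag of the page or a short excerpt of the pretty-printed HTML (a purely optional check):
# optional inspection of the parsed document
print(soup.title.text)          # <title> of the overview page
print(soup.prettify()[:300])    # first 300 characters of the indented HTML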
With the help of the SelectorGadget browser extension we determine the CSS selector that selects the first ten articles of the Art Market section. In this case it is #block-kf-bootstrap-content header.
articles = soup.select("#block-kf-bootstrap-content header")
From the list articles created in this way, the titles of the ten articles are now extracted as plain text elements with a list comprehension and written to the list articles_titles.
articles_titles = [i.text for i in articles]
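A quick sanity check, again optional, confirms that ten titles were extracted and shows the newest one:
# optional sanity check
print(len(articles_titles))   # expected: 10
print(articles_titles[0])     # title of the newest Art Market article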
Next, another for-loop iterates over the soup object to extract the links to the full texts of the ten articles. First an empty list called articles_urls is created, to which all partial URLs are appended that can be found inside the HTML elements of the class article__title, more precisely in their a elements. Since the a elements of the HTML code only store the respective paths, but not the main domain of the URL, the domain is finally joined with each path to form a single string.
articles_urls = []
for article__title in soup.find_all(attrs={'class': 'article__title'}):
    for link in article__title.find_all('a'):
        # prepend the domain, since href only contains the relative path
        url = "https://www.monopol-magazin.de" + link.get('href')
        articles_urls.append(url)
type(articles_urls)
## <class 'list'>
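Instead of concatenating the domain and the path by hand, the same result can be obtained with urljoin from Python's standard library, which also handles hrefs that are already absolute. This is an alternative sketch, not the approach used above:
from urllib.parse import urljoin

base = "https://www.monopol-magazin.de"
articles_urls = []
for article__title in soup.find_all(attrs={'class': 'article__title'}):
    for link in article__title.find_all('a'):
        # urljoin resolves relative paths and leaves absolute URLs untouched
        articles_urls.append(urljoin(base, link.get('href')))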
Now the text elements of the ten articles can be collected with a for-loop. Again, an empty list is created, this time with the name data. The for-loop iterates over the ten URLs extracted above, stored in the list articles_urls. For each of the ten URLs, the function requests.get() is called first. If the respective article page returns the status code 200, we create, as above, a soup object with the function BeautifulSoup(), which puts the HTML code of the respective page into a clear structure for us. Then, with the help of SelectorGadget, we determine the CSS selector that leads to the body text of each article. In this case it is .field--type-text-long.field--label-hidden. Finally, the collected text elements are normalised to the Unicode standard using the function normalize() of the library unicodedata and a list comprehension. This is done with the NFKD algorithm.
data = []
for url in articles_urls:
    res = requests.get(url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, "html.parser")
        for element in soup.select('.field--type-text-long.field--label-hidden'):
            data.append(element.get_text())

# clean text: normalise the collected strings to the NFKD Unicode form
import unicodedata
data = [unicodedata.normalize("NFKD", i) for i in data]
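To illustrate what the NFKD normalisation does: compatibility characters such as non-breaking spaces are replaced by their plain equivalents. A small standalone example with a made-up string:
import unicodedata

raw = "Kunstmarkt\xa02022"              # contains a non-breaking space (U+00A0)
clean = unicodedata.normalize("NFKD", raw)
print(clean == "Kunstmarkt 2022")        # True: the non-breaking space became a plain space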
Voilà! Our code created little spiders that roam the URLs we specified and reliably collect all the information we instructed them to gather. With a few lines of Python, these little spiders accomplish in seconds, and without slips, work that would take us humans a good quarter of an hour by hand.
We now combine the three collections, the article titles, the URLs and the actual text corpus (restricted to its first ten text elements), first into a dictionary called d and then into a data frame called df. Furthermore, we use the replace() and strip() string methods to remove all line breaks and leading or trailing whitespace from the string columns.
data_sub = data[0:10]
d = {'titles':articles_titles,'urls':articles_urls,'texts':data_sub}
df = pd.DataFrame(d)
df[df.columns] = df.apply(lambda x: x.str.replace("\n", ""))
df[df.columns] = df.apply(lambda x: x.str.strip())
print(df)
## titles ... texts
## 0 Spark Art Fair in Wien Kunstmesse mit sel... ... Die Spark Art Fair lohnt den Besuch bei Positi...
## 1 Auktion in New York Venedig-Gemälde von M... ... Ein Venedig-Gemälde des impressionistischen M...
## 2 Erste Ausgabe im Oktober Art Basel gibt N... ... Die neue Messe der Art Basel in Paris, die im ...
## 3 Auktion in New York Gedichtband von junge... ... Ein unveröffentlichter Gedichtband und Liebes...
## 4 Urteil im Streit um Purrmann-Gemälde "Die... ... Der Enkel des Malers Hans Purrmann klagt auf F...
## 5 Online-Auktion Initiative "Artists for Uk... ... Katharina Garbers-von Boehm ist Partnerin bei ...
## 6 Auktion in New York Warhol-Werk könnte be... ... Zur Unterstützung von Hilfsprojekten für Gef...
## 7 Rekordpreis Zeichnung von René Magritte e... ... Ein von Künstler Andy Warhol (1928-1987) ange...
## 8 Auktion in den USA Banksy-Kunstwerk für 5... ... Das Werk "Die Vergewaltigung" von René Magrit...
## 9 Benefizversteigerung Ketterer-Auktion für... ... Ein Werk des britischen Street-Art-Künstlers ...
##
## [10 rows x 3 columns]
Finally, we export the data frame created in this way to the .xlsx format with the function ExcelWriter(). To do this, we create the writer for the desired output file, write the object df into an Excel sheet and close the writer, which saves the file (in older pandas versions this last step was writer.save(); it has since been replaced by writer.close()).
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer)
writer.close()  # finalises and saves the file; replaces the deprecated writer.save()
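An equivalent and somewhat more robust variant, assuming the same output file name, uses a context manager, which closes the writer and saves the file automatically:
# alternative export using a context manager; the file is saved when the block ends
with pd.ExcelWriter('output.xlsx') as writer:
    df.to_excel(writer)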
There are many possibilities for analysing the data obtained in this way; some of them will be presented here in the medium term.
References
Munzert, Simon, and Dominic Nyhuis. 2019. “Die Nutzung von Webdaten in den Sozialwissenschaften.” In Handbuch Methoden der Politikwissenschaft, 373–97. Wiesbaden: Springer. https://doi.org/10.1007/978-3-658-16937-4_22-1.