library(reticulate)
options(reticulate.repl.quiet = TRUE)
use_python("~/Library/r-miniconda/bin/python")
Introduction
Web scraping refers to the process of extracting, copying and storing web content or sections of source code in order to analyse or otherwise exploit them. With the help of web scraping procedures, many different types of information can be collected and analysed automatically, for example contact, network or behavioural data. Web scraping is no longer a novelty in empirical research. There are numerous reasons why web scraping methods are used in empirical research: for example, the repeated collection of cross-sectional web data, the desire for replicable data collection processes, or the desire for time-efficient and less error-prone collection of online data (Munzert and Nyhuis 2019).
Web scraping with Python
As a low-threshold introduction to web scraping, we will use Python to collect various freely accessible articles from the Art Market section of the art magazine Monopol. For this we need the Python libraries BeautifulSoup, requests, pandas and, for later saving of our collected data, the library openpyxl. Since I myself use RStudio as IDE (Integrated Development Environment) for Python, I load the R package reticulate in an R environment and turn off the (to me annoying) notification messages of the package. Finally, I use the function use_python() to tell R under which path my Python binary can be found. Users who work directly in Python can skip this step.
In a further step, the Python libraries mentioned above are loaded into the working environment. If the libraries have not yet been installed, this can be done from R with the reticulate function py_install(). In Python, libraries can be installed directly with the command pip install.
#py_install("pandas")
#py_install("bs4")
#py_install("openpyxl")
from bs4 import BeautifulSoup
import requests
import pandas as pd
import openpyxl
Once this is done, we define the URL at which the links to the ten newest articles of the section “Art Market” can be found. The defined URL is passed to the function requests.get(), which should return status code 200: HTTP status code 200 indicates that the request was processed successfully, i.e. all requested data was located on the web server and is being transferred to the client.
#load url
url = "https://www.monopol-magazin.de/ressort/kunstmarkt"
res = requests.get(url)
print(res)
<Response [200]>
If we pass the text attribute of the Response object res returned by the requests.get() function to the BeautifulSoup() function, we get a BeautifulSoup object that transforms the HTML document into an easily readable nested data structure.
soup = BeautifulSoup(res.text)
With the help of SelectorGadget we determine the CSS selector that leads to the URLs of the first ten articles of the Art Market section. In this case it is: #block-kf-bootstrap-content header.
articles = soup.select("#block-kf-bootstrap-content header")
From the list articles created in this way, the titles of the ten articles are now extracted as plain text elements with the help of a list comprehension and written to the list articles_titles.
articles_titles = [i.text for i in articles]
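To see what select() and the list comprehension do in isolation, here is a minimal, self-contained sketch on hypothetical HTML (the markup and selector here are invented for illustration; the real Monopol page differs):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for illustration
html = """
<div id="content">
  <header><a href="/artikel/a">First title</a></header>
  <header><a href="/artikel/b">Second title</a></header>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags
headers = soup.select("#content header")

# extract the visible text of each tag, as in articles_titles above
titles = [h.get_text(strip=True) for h in headers]
print(titles)  # -> ['First title', 'Second title']
```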
Furthermore, another for-loop iterates over the soup object to extract the links to the full texts of the ten articles. An empty list called articles_urls is created, to which all partial URLs are appended that can be found in the HTML elements of the class article__title, and there in the a elements. Since the a elements of the HTML code store only the respective paths, not the main domain of the URL, the domain is finally merged with each path into a single string.
articles_urls = []

for article__title in soup.find_all(attrs={'class':'article__title'}):
    for link in article__title.find_all('a'):
        url = "https://www.monopol-magazin.de" + link.get('href')
        articles_urls.append(url)

type(articles_urls)
<class 'list'>
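String concatenation works here because all hrefs are root-relative paths. A more robust alternative from the standard library is urllib.parse.urljoin, which also copes with hrefs that are already absolute (the paths below are invented examples):

```python
from urllib.parse import urljoin

base = "https://www.monopol-magazin.de"

# a root-relative path is appended to the domain ...
print(urljoin(base, "/artikel/some-article"))
# -> https://www.monopol-magazin.de/artikel/some-article

# ... while an already absolute href is left untouched
print(urljoin(base, "https://example.org/elsewhere"))
# -> https://example.org/elsewhere
```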
Now the text elements of the ten articles can be collected with a for-loop. Again, an empty list is created, this time with the name data. The for-loop iterates over the ten URLs extracted above, which are stored in the list articles_urls. The requests.get() function is first called on each of the ten URLs. If the respective article page returns status code 200, we create, as above, a soup object with the function BeautifulSoup(), which puts the HTML code of the respective page into a clear structure for us. Then, with the help of SelectorGadget, we determine the CSS selector that leads to the body text of each article of the Art Market section. In this case it is: .field--type-text-long.field--label-hidden. Finally, the collected text elements are normalised to the Unicode standard using the function normalize() of the library unicodedata and a list comprehension. This is done with the NFKD algorithm.
data = []

for url in articles_urls:
    res = requests.get(url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text)
        for element in soup.select('.field--type-text-long.field--label-hidden'):
            data.append(element.get_text())

#clean text
import unicodedata
data = [unicodedata.normalize("NFKD", i) for i in data]
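What NFKD normalisation does can be seen on a toy string: compatibility characters such as the non-breaking space (U+00A0), which often appears in scraped HTML, are replaced by their plain equivalents:

```python
import unicodedata

# a toy string containing a non-breaking space (U+00A0)
raw = "Gallery\u00a0Weekend"

# NFKD replaces the non-breaking space with a regular space
clean = unicodedata.normalize("NFKD", raw)
print(clean == "Gallery Weekend")  # -> True
```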
Voilà! Our code created little spiders that roam the URLs we specified and reliably collect all the information we instructed them to collect. In this way, the little spiders that we brought to life with a few lines of Python code accomplish in seconds, and with greater accuracy, work that would take us humans a quarter of an hour.
We now merge the three lists with the titles of the articles, the URLs and the actual text corpus, first into a dictionary called d and then into a data frame called df. Furthermore, we use the replace() function and the strip() function to remove all line breaks and leading or trailing whitespace from all columns of type string.
data_sub = data[0:10]
d = {'titles':articles_titles,'urls':articles_urls,'texts':data_sub}
df = pd.DataFrame(d)

df[df.columns] = df.apply(lambda x: x.str.replace("\n", ""))
df[df.columns] = df.apply(lambda x: x.str.strip())
print(df)
titles ... texts
0 Gallery Weekend Berlin Wohin in der Potsd... ... Gallery Weekend Berlin, 28. bis 30. AprilWeite...
1 Kunstmesse Paper Positions in Berlin Stab... ... Parallel zum Gallery Weekend in Berlin zeigt d...
2 Gallery Weekend Berlin Wohin im Westen? ... Cave-Ayumi Gallery: Naoto Kumagai "Integrate",...
3 Gallery Weekend Berlin Wohin in Mitte? ... Kräftig und klar wirken dagegen die Siebdruck...
4 Art Brussels Eine Messe für Spürnasen ... Māksla XO: Helena Heinrihsone "Red", 2022
5 Kunstmesse 20 Jahre Art Karlsruhe ... Auf der Messe werden mehrere Preise für die b...
6 Auktionen im September Privatsammlung von... ... Paper Positions, Berlin, 27. bis 30. April
7 Sotheby's Riesenspinne und Klimt-Premiere... ... Gallery Weekend Berlin, 28. bis 30. AprilWeite...
8 Galeristin Xiaochan Hua "Freiheit ist mir... ... Gallery Weekend Berlin, 28. bis 30. AprilWeite...
9 Kunstmesse Art Brussels Das nächste große... ... Die gut informierte lokale Sammlerszene liebt ...
[10 rows x 3 columns]
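The cleaning step can also be written with the vectorised .str accessor chained directly on a column instead of apply(); a sketch on toy data (hypothetical strings, not the scraped corpus):

```python
import pandas as pd

# toy column with line breaks and surrounding whitespace
df_toy = pd.DataFrame({"texts": ["  First\nline  ", "Second\ntext "]})

# chain the vectorised string methods: drop line breaks, then trim whitespace
df_toy["texts"] = df_toy["texts"].str.replace("\n", "", regex=False).str.strip()
print(df_toy["texts"].tolist())  # -> ['Firstline', 'Secondtext']
```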
Finally, we export the data frame created in this way to the .xlsx format with the function ExcelWriter(). To do this, we define the desired output file, write the object df into an Excel sheet and save it with writer.save(). Note that in recent pandas versions writer.save() is deprecated in favour of writer.close(), as the FutureWarning below indicates.
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer)
writer.save()
<string>:1: FutureWarning: save is not part of the public API, usage can give unexpected results and will be removed in a future version
There are many possibilities for analysing the data obtained in this way, some of which will be presented here in future posts.