Correspondence Analysis

R
data visualization
data exploration
analysis
Author

Malte Hückstädt

Published

March 30, 2023

Introduction

Correspondence analysis (CA) is a primarily geometric, descriptive and exploratory method based on the chi-square statistic that can be used to examine latent structures in multivariate data. Like cluster analysis and principal component analysis, it is a data reduction procedure; unlike them, it is used in particular to describe categorical data arranged in a contingency table (Blasius 2001). In contrast to cluster analysis, the categories or subjects are not assigned to clusters but are located and related to one another in a space spanned by latent dimensions, which are represented by continuously scaled axes. The goal of CA is to determine an optimal subspace (for pragmatic reasons usually two-dimensional) of a multidimensional space. The dimensions of this subspace are chosen so that they explain as much of the variation in the data as possible (Blasius 2001).

CA is primarily interpreted graphically, while numerical interpretation tends to be of lower priority (Backhaus et al. 2016). The graphical output is probably also one of the main advantages of CA: it represents data structures clearly, reduces complexity on the basis of latent dimensions, and thus makes complex, unknown relationships visible at a glance. The general maxim of exploratory data analysis, “Let the data speak for themselves!” (Le Roux and Rouanet 2010, 4), is thus largely fulfilled by the graphical representations of CA. However, the comparatively accessible “geometric maps” (Lenger, Schneickert, and Schumacher 2013, 207) that result from the geometric solution of a CA also harbor a number of interpretational hurdles that even Pierre Bourdieu was not always able to overcome without error (Blasius and Winkler 1989; Gollac 2015).

Data

CA is, at least outside France, known and recognized as Bourdieu’s statistical method (Le Roux and Rouanet 2010, 4). In fact, however, it was not Pierre Bourdieu who invented this empirical method, which is often still perceived as “strange” (Hepp and Kergel 2014), but his close friend, the analyst and linguist Jean-Paul Benzécri, who first introduced CA in France in the 1960s.

In order to stay close to the origins of CA, we use data in the following that Pierre Bourdieu (1987) himself subjected to CA in his famous major work Distinction: A Social Critique of the Judgement of Taste. We explore the relationship between social class membership, aesthetic attitudes, and favorite painters (Bourdieu 2012, 822–23).

Data processing

To do this, we first load the corresponding contingency table into our workspace using the read_excel() function of the readxl package. With the help of a pipe, we then use column_to_rownames() to turn the variable holding the social classes, which is already named rownames in our Excel sheet, into the row names of the data frame.

In a further step, we print the contingency table (in transposed form for clarity) using the kable() function of the knitr package.

library(readxl)
library(tidyverse)
library(kableExtra)
df <- read_excel("Bourdieu_p822.xlsx") %>% column_to_rownames("rownames")

knitr::kable(t(df), "pipe")
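|                 | working classes | middle classes | upper classes |
|:----------------|----------------:|---------------:|--------------:|
| sunset          | 90 | 84 | 64 |
| communion       | 50 | 31 | 18 |
| folkloric dance | 63 | 52 | 37 |
| girl with cat   | 56 | 60 | 50 |
| lactating woman | 44 | 55 | 53 |
| tree bark       | 17 | 32 | 48 |
| steel frame     | 6 | 12 | 23 |
| pregnant woman  | 11 | 16 | 19 |
| cabbages        | 7 | 11 | 18 |
| car accident    | 0 | 1 | 4 |
| Raphael         | 32 | 27 | 18 |
| Buffet          | 8 | 17 | 9 |
| Utrillo         | 20 | 18 | 20 |
| Vlamink         | 6 | 12 | 11 |
| Watteau         | 16 | 19 | 16 |
| Renoir          | 49 | 51 | 48 |
| Van Gogh        | 48 | 47 | 49 |
| Dali            | 3 | 4 | 5 |
| Braque          | 5 | 7 | 9 |
| Goya            | 16 | 19 | 31 |
| Brueghel        | 1 | 12 | 27 |
| Kandinsky       | 0 | 2 | 4 |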
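Readers who do not have the Excel file at hand can rebuild the table directly in R. The following is a minimal sketch that transcribes the counts printed above; the object names categories, counts, and df_manual are chosen here for illustration and are not part of the original workflow.

```r
# Category labels exactly as printed in the table above (motifs first, then painters).
categories <- c(
  "sunset", "communion", "folkloric dance", "girl with cat", "lactating woman",
  "tree bark", "steel frame", "pregnant woman", "cabbages", "car accident",
  "Raphael", "Buffet", "Utrillo", "Vlamink", "Watteau", "Renoir",
  "Van Gogh", "Dali", "Braque", "Goya", "Brueghel", "Kandinsky"
)

# One row per social class, one column per category (3 x 22).
counts <- rbind(
  "working classes" = c(90, 50, 63, 56, 44, 17, 6, 11, 7, 0,
                        32, 8, 20, 6, 16, 49, 48, 3, 5, 16, 1, 0),
  "middle classes"  = c(84, 31, 52, 60, 55, 32, 12, 16, 11, 1,
                        27, 17, 18, 12, 19, 51, 47, 4, 7, 19, 12, 2),
  "upper classes"   = c(64, 18, 37, 50, 53, 48, 23, 19, 18, 4,
                        18, 9, 20, 11, 16, 48, 49, 5, 9, 31, 27, 4)
)
colnames(counts) <- categories

# Data frame with the social classes as row names, mirroring the structure of df.
df_manual <- as.data.frame(counts)
dim(df_manual)
```

Assuming the transcription is correct, df_manual can be passed to the same functions as df in the rest of this post.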

Correspondence analysis

As we can see, our 3x22 contingency table is so complex that associations between class position, aesthetic attitudes, and preferences for specific painters can hardly be read off the table with the naked eye. For this reason, we specify a CA. We load the package FactoMineR into our workspace, feed our data to the FactoMineR function CA(), and store the results in the object ca. Since the automated graphs of the FactoMineR package are not publication-ready, especially for complex correspondence analyses, we set the graph argument to FALSE.

library(FactoMineR)
ca <- CA(df, graph = FALSE)
ca
**Results of the Correspondence Analysis (CA)**
The row variable has  3  categories; the column variable has 22 categories
The chi square of independence between the two variables is equal to 115.082 (p-value =  1.008015e-08 ).
*The results are available in the following objects:

   name              description                   
1  "$eig"            "eigenvalues"                 
2  "$col"            "results for the columns"     
3  "$col$coord"      "coord. for the columns"      
4  "$col$cos2"       "cos2 for the columns"        
5  "$col$contrib"    "contributions of the columns"
6  "$row"            "results for the rows"        
7  "$row$coord"      "coord. for the rows"         
8  "$row$cos2"       "cos2 for the rows"           
9  "$row$contrib"    "contributions of the rows"   
10 "$call"           "summary called parameters"   
11 "$call$marge.col" "weights of the columns"      
12 "$call$marge.row" "weights of the rows"         

If we print the object ca, we see that our CA is based on a 3x22 contingency table whose row and column variables are statistically significantly related (chi-square = 115.082, p-value = 1.008015e-08). Furthermore, we see where the central results of the specified CA are stored. In order to determine how many axes of the CA to take into account, we first print the eigenvalues and the proportions of variance “explained” by the different axes of the CA.
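The chi-square test of independence reported by CA() can also be run directly in base R with chisq.test(). The following is a minimal sketch on an illustrative five-motif subset of the table, so the statistic differs from the value reported for the full table; the object names tab and test are chosen here for illustration.

```r
# Illustrative subset of the contingency table (five motifs only).
tab <- rbind(
  "working classes" = c(90, 50, 56, 17, 6),
  "middle classes"  = c(84, 31, 60, 32, 12),
  "upper classes"   = c(64, 18, 50, 48, 23)
)
colnames(tab) <- c("sunset", "communion", "girl with cat", "tree bark", "steel frame")

# Pearson chi-square test of independence between classes and motifs.
test <- chisq.test(tab)
test$statistic   # chi-square statistic
test$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
test$p.value     # p-value

# The statistic is the sum of the squared Pearson residuals.
sum(test$residuals^2)
```

On the full table, chisq.test() may warn that the chi-square approximation could be inaccurate, because a few categories (for example car accident and Kandinsky) have very small counts.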

knitr::kable(as.data.frame(ca$eig), "pipe")
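|       | eigenvalue | percentage of variance | cumulative percentage of variance |
|:------|-----------:|-----------------------:|----------------------------------:|
| dim 1 | 0.0624176 | 93.179969 | 93.17997 |
| dim 2 | 0.0045685 | 6.820031 | 100.00000 |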

The eigenvalues correspond to the amount of information each axis contains. The dimensions are sorted in decreasing order of the variance they explain in the final solution of the CA. Dimension 1 of our CA explains 93.18% of the variance, and dimension 2 explains 6.82% (Hjellbrekke 2018). Because the table has only three rows, a CA can extract at most min(3, 22) - 1 = 2 dimensions, so the first two axes necessarily account for 100% of the variation in the contingency table. Accordingly, we use both axes for the graphical representation of the geometric solution of the CA.
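The link between the chi-square statistic, the eigenvalues, and the number of dimensions can be checked by hand. The following base-R sketch is an illustration of the standard CA computation via a singular value decomposition, not FactoMineR's internal code. It uses an illustrative five-motif subset of the table, so its numbers differ from the full analysis; the object names (tab, P, S, eig, pct) are chosen here for illustration.

```r
# Illustrative subset of the contingency table (five motifs only).
tab <- rbind(
  "working classes" = c(90, 50, 56, 17, 6),
  "middle classes"  = c(84, 31, 60, 32, 12),
  "upper classes"   = c(64, 18, 50, 48, 23)
)
colnames(tab) <- c("sunset", "communion", "girl with cat", "tree bark", "steel frame")

P  <- tab / sum(tab)   # correspondence matrix (relative frequencies)
r  <- rowSums(P)       # row masses
cm <- colSums(P)       # column masses

# Standardized residuals: deviations from independence, scaled by the masses.
S <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))

# Eigenvalues (principal inertias) are the squared singular values of S.
eig <- svd(S)$d^2
eig   # the third value is numerically zero: 3 rows allow at most 2 dimensions

# Percentage of variance per dimension.
pct <- 100 * eig / sum(eig)
round(pct[1:2], 2)

# Total inertia equals the chi-square statistic divided by the sample size.
sum(eig)
unname(chisq.test(tab)$statistic) / sum(tab)
```

The same identity holds for the full table: the two eigenvalues reported above sum to about 0.067, which is the chi-square value of 115.082 divided by the table total.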

To do this, we use the mutate() and select() functions of the dplyr package to create new data frames containing the coordinates of the row and column profiles (ca$row$coord and ca$col$coord), as well as their squared cosine values (cos2) and contribution values (contrib). The cos2 values measure how well a row or column category is represented by a particular dimension of the CA; the contrib values indicate which row and column categories contribute most to the variability captured by a dimension. Row and column categories of low importance are characterized by the fact that they do not make a high contribution to any of the first dimensions of a CA (Le Roux and Rouanet 2010).

ca_row <- as.data.frame(ca$row$coord) %>% select(`Dim 1`,`Dim 2`) %>%
  mutate(type=rep("social class",3),
         cos2_dim1=ca$row$cos2[,1],
         cos2_dim2=ca$row$cos2[,2],
         contr_dim1=ca$row$contrib[,1],
         contr_dim2=ca$row$contrib[,2])

ca_col <- as.data.frame(ca$col$coord) %>% select(`Dim 1`,`Dim 2`) %>%
  mutate(type=rep("taste",22),
         cos2_dim1=ca$col$cos2[,1],
         cos2_dim2=ca$col$cos2[,2],
         contr_dim1=ca$col$contrib[,1],
         contr_dim2=ca$col$contrib[,2])
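To make the meaning of cos2 and contrib concrete, both can be derived by hand from the coordinates, the masses, and the eigenvalues. The following base-R sketch does this for the row categories of an illustrative five-motif subset of the table; it follows the standard CA definitions and is not taken from FactoMineR's source code. All object names are chosen here for illustration.

```r
# Illustrative subset of the contingency table (five motifs only).
tab <- rbind(
  "working classes" = c(90, 50, 56, 17, 6),
  "middle classes"  = c(84, 31, 60, 32, 12),
  "upper classes"   = c(64, 18, 50, 48, 23)
)
colnames(tab) <- c("sunset", "communion", "girl with cat", "tree bark", "steel frame")

P  <- tab / sum(tab)   # correspondence matrix
r  <- rowSums(P)       # row masses
cm <- colSums(P)       # column masses
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)

k   <- 1:2             # a table with 3 rows has 2 non-trivial dimensions
eig <- dec$d[k]^2      # eigenvalues of the two dimensions

# Principal coordinates of the rows (signs of the axes are arbitrary).
row_coord <- (dec$u[, k] %*% diag(dec$d[k])) / sqrt(r)
dimnames(row_coord) <- list(rownames(tab), c("Dim 1", "Dim 2"))

# contrib: share (in %) of each row in the inertia of a dimension.
contrib <- 100 * sweep(r * row_coord^2, 2, eig, "/")

# cos2: share of a row's squared distance from the origin captured by a dimension.
cos2 <- row_coord^2 / rowSums(row_coord^2)

round(contrib, 2)   # each column sums to 100
round(cos2, 3)      # each row sums to 1 when all dimensions are kept
```

The column categories can be treated analogously by swapping the roles of rows and columns (dec$v and the column masses).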

The two data frames ca_row and ca_col, which contain the extracted information of the CA, are then stacked row-wise using the function rbind(). For the following graphical representation, we also store the names of the row and column profiles in a column called rownames.

ca_df <- rbind(ca_row,ca_col)
ca_df$rownames <- rownames(ca_df)
knitr::kable(ca_df, "pipe")
|                 | Dim 1 | Dim 2 | type | cos2_dim1 | cos2_dim2 | contr_dim1 | contr_dim2 | rownames |
|:----------------|------:|------:|:-----|----------:|----------:|-----------:|-----------:|:---------|
| working classes | -0.3020220 | -0.0554749 | social class | 0.9673633 | 0.0326367 | 46.6152120 | 21.4872327 | working classes |
| middle classes  | -0.0274330 | 0.0932834 | social class | 0.0796004 | 0.9203996 | 0.4133641 | 65.3025847 | middle classes |
| upper classes   | 0.3126783 | -0.0422438 | social class | 0.9820743 | 0.0179257 | 52.9714239 | 13.2101826 | upper classes |
| sunset          | -0.1593478 | 0.0086692 | taste | 0.9970489 | 0.0029511 | 5.6355867 | 0.2278978 | sunset |
| communion       | -0.4173789 | -0.0959956 | taste | 0.9497594 | 0.0502406 | 16.0829650 | 11.6236890 | communion |
| folkloric dance | -0.2339650 | -0.0201684 | taste | 0.9926239 | 0.0073761 | 7.7591702 | 0.7877601 | folkloric dance |
| girl with cat   | -0.0705356 | 0.0337091 | taste | 0.8140734 | 0.1859266 | 0.7701848 | 2.4033062 | girl with cat |
| lactating woman | 0.0467194 | 0.0438757 | taste | 0.5313588 | 0.4686412 | 0.3093915 | 3.7281835 | lactating woman |
| tree bark       | 0.3712275 | 0.0021798 | taste | 0.9999655 | 0.0000345 | 12.4658518 | 0.0058725 | tree bark |
| steel frame     | 0.4930348 | -0.0667783 | taste | 0.9819855 | 0.0180145 | 9.2941342 | 2.3294917 | steel frame |
| pregnant woman  | 0.1896658 | 0.0256267 | taste | 0.9820712 | 0.0179288 | 1.5431415 | 0.3849018 | pregnant woman |
| cabbages        | 0.3571570 | -0.0503835 | taste | 0.9804881 | 0.0195119 | 4.2824346 | 1.1643537 | cabbages |
| car accident    | 0.9792702 | -0.2239721 | taste | 0.9502905 | 0.0497095 | 4.4714152 | 3.1956868 | car accident |
| Raphael         | -0.2483293 | -0.0032539 | taste | 0.9998283 | 0.0001717 | 4.4280915 | 0.0103874 | Raphael |
| Buffet          | -0.0080562 | 0.3315053 | taste | 0.0005902 | 0.9994098 | 0.0020578 | 47.6064729 | Buffet |
| Utrillo         | -0.0193692 | -0.0702185 | taste | 0.0707084 | 0.9292916 | 0.0202918 | 3.6436581 | Utrillo |
| Vlamink         | 0.1791711 | 0.1642087 | taste | 0.5434912 | 0.4565088 | 0.8681677 | 9.9631457 | Vlamink |
| Watteau         | -0.0275261 | 0.0605970 | taste | 0.1710476 | 0.8289524 | 0.0360354 | 2.3860424 | Watteau |
| Renoir          | -0.0321726 | 0.0011474 | taste | 0.9987296 | 0.0012704 | 0.1428582 | 0.0024827 | Renoir |
| Van Gogh        | -0.0129299 | -0.0357981 | taste | 0.1154028 | 0.8845972 | 0.0224504 | 2.3512007 | Van Gogh |
| Dali            | 0.1826517 | -0.0055608 | taste | 0.9990740 | 0.0009260 | 0.3733348 | 0.0047279 | Dali |
| Braque          | 0.2119424 | -0.0032304 | taste | 0.9997677 | 0.0002323 | 0.8796806 | 0.0027922 | Braque |
| Goya            | 0.2631705 | -0.0952198 | taste | 0.8842418 | 0.1157582 | 4.2627339 | 7.6243880 | Goya |
| Brueghel        | 0.7816252 | -0.0283536 | taste | 0.9986858 | 0.0013142 | 22.7890942 | 0.4097152 | Brueghel |
| Kandinsky       | 0.7977577 | 0.0433777 | taste | 0.9970521 | 0.0029479 | 3.5609282 | 0.1438439 | Kandinsky |

Now we are ready to plot the graphical solution of the CA. For this we use the package ggplot2, together with ggrepel to keep the labels from overlapping.

library(ggrepel)
ggplot(ca_df, aes(`Dim 1`, `Dim 2`, color = type)) + # map the coordinates to x and y, color by type
  geom_vline(xintercept = 0, color = "grey", linetype = "dashed") + # vertical reference line at x = 0
  geom_hline(yintercept = 0, color = "grey", linetype = "dashed") + # horizontal reference line at y = 0
  geom_point(aes(shape = type), color = "red") + # red points at the 'Dim 1' and 'Dim 2' coordinates, shape by type
  geom_text_repel(aes(label = rownames), min.segment.length = 0, seed = 42, box.padding = 0.5, max.overlaps = Inf) + # avoid overlapping labels
  scale_color_manual(values = c("#d80000", "#2450a4")) + # manual colors for the labels of the two types
  theme(legend.position = "none") + # hide legend
  ggtitle("factor map of correspondence analysis") + # add main title
  xlab("Dim 1 (93.18%)") + ylab("Dim 2 (6.82%)") # add axis titles

As indicated above, there are some pitfalls to consider when interpreting the graphical solution of a CA. Carroll et al. (1986) put it aptly: “[…] it is legitimate to interpret distances among elements of one set of points. […] It is also legitimate to interpret the relative positions of one point of one set with respect to all the points in the other set. Except in special cases, it is extremely dangerous to interpret the proximity of two points corresponding to different sets of points.” (Carroll, Green, and Schaffer 1986, 46)

With this in mind, the graphical solution of our CA can be interpreted as follows:

The relatively wide gaps between the three class profiles (working class, middle class, upper class) indicate that the taste profiles of the three classes differ relatively strongly from each other, with the strongest difference between the working class and the upper class. Members of the working class choose a communion, a sunset, and a folkloric dance as motifs for a beautiful picture with above-average frequency. In contrast, members of the upper class favor less classical motifs such as tree bark, a steel frame, or cabbages. Finally, members of the middle class favor harmonious motifs such as a girl with a cat, a lactating woman, or a pregnant woman more often than average. The correspondences between social classes and favored painters can be interpreted analogously.
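Distances between the class points can be read this way because, when all dimensions are kept, the Euclidean distances between row points in principal coordinates equal the chi-square distances between the corresponding row profiles. The following base-R sketch checks this property on an illustrative five-motif subset of the table; the object names are chosen here for illustration, and the numbers differ from the full analysis.

```r
# Illustrative subset of the contingency table (five motifs only).
tab <- rbind(
  "working classes" = c(90, 50, 56, 17, 6),
  "middle classes"  = c(84, 31, 60, 32, 12),
  "upper classes"   = c(64, 18, 50, 48, 23)
)
colnames(tab) <- c("sunset", "communion", "girl with cat", "tree bark", "steel frame")

P  <- tab / sum(tab)   # correspondence matrix
r  <- rowSums(P)       # row masses
cm <- colSums(P)       # column masses (the average row profile)

# Row profiles: distribution over the motifs within each class.
profiles <- P / r

# Chi-square distances: Euclidean distances after weighting each column by 1 / sqrt(mass).
chi_dist <- dist(sweep(profiles, 2, sqrt(cm), "/"))

# Row principal coordinates from the SVD of the standardized residuals.
S   <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)
row_coord <- (dec$u %*% diag(dec$d)) / sqrt(r)
rownames(row_coord) <- rownames(tab)
map_dist <- dist(row_coord)

chi_dist   # distances between the class profiles
map_dist   # distances between the class points on the map
all.equal(as.vector(chi_dist), as.vector(map_dist))
```

In this subset, as in the full analysis, the largest distance is the one between the working classes and the upper classes.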

Caution: It would be unjustified to interpret the distance between motifs or painters and classes in the sense that most of the respondents who favor, for example, Kandinsky are also members of the upper class. Rather, we must assume that Kandinsky is simply more highly regarded, relatively speaking, among upper class respondents than among middle or working class respondents (Carroll, Green, and Schaffer 1986).
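The relative nature of these positions can be illustrated by comparing each class profile with the average profile. The following base-R sketch uses an illustrative five-motif subset of the table, so the shares are computed within that subset only; the object names are chosen here for illustration.

```r
# Illustrative subset of the contingency table (five motifs only).
tab <- rbind(
  "working classes" = c(90, 50, 56, 17, 6),
  "middle classes"  = c(84, 31, 60, 32, 12),
  "upper classes"   = c(64, 18, 50, 48, 23)
)
colnames(tab) <- c("sunset", "communion", "girl with cat", "tree bark", "steel frame")

# Row profiles: share of each motif within a class.
row_profiles <- prop.table(tab, margin = 1)

# Average profile: share of each motif across all classes.
avg_profile <- colSums(tab) / sum(tab)

# Ratio > 1: motif is over-represented in the class; ratio < 1: under-represented.
ratio <- sweep(row_profiles, 2, avg_profile, "/")

round(100 * row_profiles, 1)
round(ratio, 2)
```

Within this subset, the sunset is the most frequently chosen motif in the upper classes, yet its ratio there is below 1. A point's position on the map therefore reflects deviations from the average profile, not absolute popularity.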

In addition to interpreting the relations between the row and column profiles, we are interested in their squared cosine values (cos2) and contribution values (contrib). As mentioned above, the cos2 value measures how well a row or column category is represented by a particular dimension of a CA. The contrib value indicates which row and column categories contribute most to the variability captured by a dimension.

Since the first dimension of our CA captures over 93% of the variability in the data, we focus our exploration of the cos2 and contrib values on the first axis of the CA (see footnote 1).

ggplot(ca_df, aes(`Dim 1`, `Dim 2`, color = cos2_dim1)) + # map the coordinates to x and y, color by cos2 on dimension 1
  geom_vline(xintercept = 0, color = "grey", linetype = "dashed") + # vertical reference line at x = 0
  geom_hline(yintercept = 0, color = "grey", linetype = "dashed") + # horizontal reference line at y = 0
  geom_point(aes(size = cos2_dim1, color = cos2_dim1), alpha = .5) + # points sized and colored by cos2 on dimension 1
  geom_text_repel(aes(label = rownames), min.segment.length = 0, seed = 42, box.padding = 0.5, max.overlaps = Inf) + # avoid overlapping labels
  scale_color_gradient(low = "#2450a4", high = "#d80000") + # color gradient from blue (low) to red (high)
  ggtitle("squared cosines values map of correspondence analysis") + # add main title
  xlab("Dim 1 (93.18%)") + ylab("Dim 2 (6.82%)") # add axis titles

The possible values of cos2 lie between 0 and 1. If a row or column category has a cos2 value of 1, it is perfectly represented by a dimension. A cos2 of 0, on the other hand, indicates that the category is not represented by the corresponding dimension at all. As the squared cosines map shows, most of the categories of the contingency table are closely associated with the first dimension and are accordingly well represented by it. Exceptions are the middle classes and painters such as Buffet, Utrillo, Watteau, and Van Gogh, which are represented mainly by the second dimension.

ggplot(ca_df, aes(`Dim 1`, `Dim 2`, color = contr_dim1)) + # map the coordinates to x and y, color by contribution to dimension 1
  geom_vline(xintercept = 0, color = "grey", linetype = "dashed") + # vertical reference line at x = 0
  geom_hline(yintercept = 0, color = "grey", linetype = "dashed") + # horizontal reference line at y = 0
  geom_point(aes(size = contr_dim1, color = contr_dim1), alpha = .5) + # points sized and colored by contribution to dimension 1
  geom_text_repel(aes(label = rownames), min.segment.length = 0, seed = 42, box.padding = 0.5, max.overlaps = Inf) + # avoid overlapping labels
  scale_color_gradient(low = "#2450a4", high = "#d80000") + # color gradient from blue (low) to red (high)
  ggtitle("contribution values map of correspondence analysis") + # add main title
  xlab("Dim 1 (93.18%)") + ylab("Dim 2 (6.82%)") # add axis titles

The contribution values map gives an idea of the pole of a dimension to which the different row and column categories contribute. The working class makes a central contribution to the negative pole of the first dimension (46.6% of the row contributions), while the upper class makes a large contribution to its positive pole (53.0%). The plot thus shows that the first dimension is mainly determined by the opposition between the working class and the upper class.

Conclusion

We have seen how to perform a correspondence analysis with R and how to interpret it in a basic way. Although its results appear comparatively easy to read, CA comes with several challenges and pitfalls that need to be considered for a correct application. For an accessible introduction, I recommend reading Le Roux and Rouanet (2010) or, especially for social scientists, Hjellbrekke (2018).

In case you are a novice in R, the package factoextra offers a beginner-friendly introduction to the visualization of the results of a ca.
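As a rough starting point, the following sketch shows how the factoextra helpers get_eigenvalue(), fviz_screeplot(), and fviz_ca_biplot() might be applied to a CA fitted with FactoMineR. It uses an illustrative five-motif subset of the table, and the object names tab_df and ca_small are chosen here for illustration. The code only runs the analysis if both packages are installed.

```r
# Illustrative subset of the contingency table (five motifs only).
tab_df <- data.frame(
  sunset = c(90, 84, 64),
  communion = c(50, 31, 18),
  `girl with cat` = c(56, 60, 50),
  `tree bark` = c(17, 32, 48),
  `steel frame` = c(6, 12, 23),
  row.names = c("working classes", "middle classes", "upper classes"),
  check.names = FALSE   # keep the spaces in the column names
)

# Run the example only if FactoMineR and factoextra are available.
if (requireNamespace("FactoMineR", quietly = TRUE) &&
    requireNamespace("factoextra", quietly = TRUE)) {
  ca_small <- FactoMineR::CA(tab_df, graph = FALSE)

  # Eigenvalues and percentages of variance per dimension.
  print(factoextra::get_eigenvalue(ca_small))

  # Scree plot of the explained variance.
  print(factoextra::fviz_screeplot(ca_small, addlabels = TRUE))

  # Biplot of row and column points with non-overlapping labels.
  print(factoextra::fviz_ca_biplot(ca_small, repel = TRUE))
}
```

The interpretational caveats discussed above apply to these plots in the same way as to the manually built ggplot2 maps.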

References

Backhaus, Klaus, Bernd Erichson, Rolf Weiber, and Wulff Plinke. 2016. “Korrespondenzanalyse.” In Multivariate Analysemethoden: Eine anwendungsorientierte Einführung, edited by Klaus Backhaus, Bernd Erichson, Wulff Plinke, and Rolf Weiber, 619–27. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-46076-4_17.
Blasius, Jörg. 2001. Korrespondenzanalyse. Internationale Standardlehrbücher der Wirtschafts- und Sozialwissenschaften. München: Oldenbourg.
Blasius, Jörg, and Joachim Winkler. 1989. “Gibt Es Die "Feinen Unterschiede"? Eine Empirische Ueberpruefung Der Bourdieuschen Theorie.” Kölner Zeitschrift Für Soziologie Und Sozialpsychologie 41 (1): 72–94.
Bourdieu, Pierre. 1987. Distinction: A Social Critique of the Judgement of Taste. Reprint Edition. Cambridge, Mass: Harvard University Press.
Bourdieu, Pierre. 2012. Die feinen Unterschiede: Kritik der gesellschaftlichen Urteilskraft. 22nd ed. Frankfurt am Main: Suhrkamp.
Carroll, J. Douglas, Paul E. Green, and Catherine M. Schaffer. 1986. “Interpoint Distance Comparisons in Correspondence Analysis.” Journal of Marketing Research 23 (3): 271–80. https://doi.org/10.2307/3151485.
Gollac, Michael. 2015. “Eine Fröhliche Wissenschaft. Über Pierre Bourdieus Gebrauch Quantitativer Methoden.” In Pierre Bourdieu. Kunst Und Kultur. Kunst Und Künstlerisches Feld. Schriften Zur Kultursoziologie 4, edited by Franz Schultheis and Stephan Egger. Frankfurt am Main: Suhrkamp.
Hepp, Rolf-Dieter, and Sabine Kergel. 2014. “Epistemologische Wachsamkeit.” In Handbuch Bourdieu, edited by Boike Rehbein, Gernot Saalmann, and Hermann Schwengel, 94–99. Konstanz: UVK. https://doi.org/10.1007/978-3-476-01379-8_20.
Hjellbrekke, Johs. 2018. Multiple Correspondence Analysis for Social Sciences. New York: Routledge.
Le Roux, Brigitte, and Henry Rouanet. 2010. Multiple Correspondence Analysis. Quantitative Applications in the Social Sciences 163. Thousand Oaks, Calif: Sage Publications.
Lenger, Alexander, Christian Schneickert, and Florian Schumacher. 2013. “Pierre Bourdieus Konzeption des Habitus.” In Pierre Bourdieus Konzeption des Habitus: Grundlagen, Zugänge, Forschungsperspektiven, edited by Alexander Lenger, Christian Schneickert, and Florian Schumacher, 11–41. Wiesbaden: Springer Fachmedien. https://doi.org/10.1007/978-3-531-18669-6_1.

Footnotes

  1. Of course, it is recommended to investigate and report the role of the row and column categories for all other extracted dimensions as well.