Network Analysis

I am learning natural language processing using R. In this post, I run a network analysis on research article titles, using the "mantle xenolith" dataset from PetDB.

Network analysis is a method for investigating the relationships between objects (here, words). The figure shows the result of a network analysis of papers about mantle peridotites. Black circles (nodes) are the target words, and lines (edges) show the relationships between words.
For example, peridotite in the center is connected to mantle, ophiolite, xenolith, isotope, etc.
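
As a minimal sketch of what nodes and edges mean, the toy graph below is built by hand with igraph. The edge list only mirrors the example words mentioned above; it is an illustration, not data read from the figure, and the names edges and toy_graph are made up for this sketch.

```r
library(igraph)

# Hypothetical edge list mirroring the example in the text:
# "peridotite" is linked to four other words.
edges <- data.frame(
  from = c("peridotite", "peridotite", "peridotite", "peridotite"),
  to   = c("mantle", "ophiolite", "xenolith", "isotope"),
  stringsAsFactors = FALSE
)

# Build an undirected graph: each distinct word becomes a node,
# each row of `edges` becomes an edge.
toy_graph <- graph_from_data_frame(edges, directed = FALSE)

vcount(toy_graph)   # number of nodes (words)
ecount(toy_graph)   # number of edges (relationships)
degree(toy_graph)   # number of edges attached to each node
```

Here peridotite has the highest degree because every edge touches it, which is why it sits in the center of a drawing of this graph.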

So many research papers are published nowadays that it is hard to keep track of a whole field. Applying Natural Language Processing (NLP) to research article titles helps to give an overview of the field and of how its topics relate to one another.






Library

library(tidyverse)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)


Read Dataset

data <- read.csv("/PATH/petdb_mantle_xenolith_title_network.csv")
head(data)

##   year
## 1 2016
## 2 2012
## 3 2016
## 4 2006
## 5 2009
## 6 2014
##                                                                                                                                                                                            text
## 1                                                             A FORE-ARC SETTING OF THE GERF OPHIOLITE, EASTERN DESERT, EGYPT: EVIDENCE FROM MINERAL CHEMISTRY AND GEOCHEMISTRY OF ULTRAMAFITES
## 2                                                     THE DUNITIC MANTLE-CRUST TRANSITION ZONE IN THE OMAN OPHIOLITE: RESIDUE OF MELT-ROCK INTERACTION, CUMULATES FROM HIGH-MGO MELTS, OR BOTH?
## 3                                 SOFTENING OF SUB-CONTINENTAL LITHOSPHERE PRIOR RIFTING: EVIDENCE FROM CLINOPYROXENE CHEMISTRY IN PERIDOTITE XENOLITHS FROM NATASH VOLCANIC PROVINCE, SE EGYPT
## 4                                                           SERPENTINIZATION AND DEHYDRATION IN THE UPPER MANTLE BENEATH FUERTEVENTURA (EASTERN CANARY ISLANDS): EVIDENCE FROM MANTLE XENOLITHS
## 5                                         GEOCHEMISTRY OF FE-RICH PERIDOTITES AND ASSOCIATED PYROXENITES FROM HORNÍ BORY, BOHEMIAN MASSIF: INSIGHTS INTO SUBDUCTION-RELATED MELT–ROCK REACTIONS
## 6 GEOCHEMICAL AND PETROLOGICAL CONSTRAINTS ON MANTLE COMPOSITION OF THE OHRE (EGER) RIFT, BOHEMIAN MASSIF: PERIDOTITE XENOLITHS FROM THE CESKE STREDOHORI VOLCANIC COMPLEX AND NORTHERN BOHEMIA
##                                          journal  paper
## 1                                         LITHOS paper1
## 2                                        GEOLOGY paper2
## 3 JOURNAL OF VOLCANOLOGY AND GEOTHERMAL RESEARCH paper3
## 4                                         LITHOS paper4
## 5                               CHEMICAL GEOLOGY paper5
## 6                 INT J EARTH SCI (GEOL RUNDSCH) paper6
The dataset consists of four columns: year (publication year), text (paper title), journal, and paper (paper ID).


data %>% head(10) %>% pull(text)
##  [1] "A FORE-ARC SETTING OF THE GERF OPHIOLITE, EASTERN DESERT, EGYPT: EVIDENCE FROM MINERAL CHEMISTRY AND GEOCHEMISTRY OF ULTRAMAFITES"                                                            
##  [2] "THE DUNITIC MANTLE-CRUST TRANSITION ZONE IN THE OMAN OPHIOLITE: RESIDUE OF MELT-ROCK INTERACTION, CUMULATES FROM HIGH-MGO MELTS, OR BOTH?"                                                    
##  [3] "SOFTENING OF SUB-CONTINENTAL LITHOSPHERE PRIOR RIFTING: EVIDENCE FROM CLINOPYROXENE CHEMISTRY IN PERIDOTITE XENOLITHS FROM NATASH VOLCANIC PROVINCE, SE EGYPT"                                
##  [4] "SERPENTINIZATION AND DEHYDRATION IN THE UPPER MANTLE BENEATH FUERTEVENTURA (EASTERN CANARY ISLANDS): EVIDENCE FROM MANTLE XENOLITHS"                                                          
##  [5] "GEOCHEMISTRY OF FE-RICH PERIDOTITES AND ASSOCIATED PYROXENITES FROM HORNÍ BORY, BOHEMIAN MASSIF: INSIGHTS INTO SUBDUCTION-RELATED MELT–ROCK REACTIONS"                                        
##  [6] "GEOCHEMICAL AND PETROLOGICAL CONSTRAINTS ON MANTLE COMPOSITION OF THE OHRE (EGER) RIFT, BOHEMIAN MASSIF: PERIDOTITE XENOLITHS FROM THE CESKE STREDOHORI VOLCANIC COMPLEX AND NORTHERN BOHEMIA"
##  [7] "HIGHLY SIDEROPHILE ELEMENT GEOCHEMISTRY OF PERIDOTITES AND PYROXENITES FROM HORNÍ BORY, BOHEMIAN MASSIF: IMPLICATIONS FOR HSE BEHAVIOUR IN SUBDUCTION-RELATED UPPER MANTLE"                   
##  [8] "ALKALINE AND CARBONATE-RICH MELT METASOMATISM AND MELTING OF SUBCONTINENTAL LITHOSPERIC MANTLE: EVIDENCE FROM MANTLE XENOLITHS, NE BAVARIA, BOHEMIAN MASSIF"                                  
##  [9] "METASOMATISM IN LITHOSPHERIC MANTLE ROOTS: CONSTRAINTS FROM WHOLE-ROCK AND MINERAL CHEMICAL COMPOSITION OF DEFORMED PERIDOTITE XENOLITHS FROM KIMBERLITE PIPE UDACHNAYA"                      
## [10] "GEOCHEMISTRY OF ECLOGITE XENOLITHS FROM THE UDACHNAYA KIMBERLITE PIPE: SECTION OF ANCIENT OCEANIC CRUST SAMPLED"

Split each title into individual words and count how many times each word is used:

data %>%
  unnest_tokens(output = word, input = text) %>%
  count(word, sort = TRUE)
##                    word   n
## 1                   the 870
## 2                    of 773
## 3                   and 669
## 4                mantle 539
## 5                  from 507
## 6                    in 481
## 7             xenoliths 295
## 8            peridotite 188
## 9                   for 149
## 10             evidence 128
.....
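
To see what the tokenizing step does in isolation, here is a small sketch on two made-up titles. The data frame toy and its contents are hypothetical and only serve as an illustration; unnest_tokens() is used with its default settings, which lowercase the words and strip punctuation.

```r
library(dplyr)
library(tidytext)

# Two made-up titles, used only for illustration (not from the dataset).
toy <- data.frame(
  paper = c("paperA", "paperB"),
  text  = c("MANTLE XENOLITHS FROM THE RIFT",
            "MELTING OF THE MANTLE"),
  stringsAsFactors = FALSE
)

# One row per word; the other columns (here: paper) are carried along.
toy_words <- toy %>%
  unnest_tokens(output = word, input = text)

toy_words

# Count occurrences across both titles.
toy_counts <- toy_words %>%
  count(word, sort = TRUE)

toy_counts
```

Because the paper column is kept next to each word, the same table can later be used to ask which words occur together in the same paper.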

Remove stopwords

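Words such as "the", "of", and "and" carry little meaning on their own, so we want to remove them. The tidytext package provides a list of such words in the stop_words data frame.
# stop words shipped with tidytext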
stop_words
## # A tibble: 1,149 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # … with 1,139 more rows

data %>% 
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
##                    word   n
## 1                mantle 539
## 2             xenoliths 295
## 3            peridotite 188
## 4              evidence 128
## 5           peridotites 118
## 6               beneath 115
## 7          lithospheric 105
## 8          implications  88
## 9          geochemistry  85
## 10               craton  83

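Now we can see that mantle, xenoliths, and peridotite are the words that appear most often in the titles. Note that singular and plural forms (peridotite, peridotites) are counted as separate words.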
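
The following sketch shows the anti_join() step on its own, using a hand-made word list (toy_words is a hypothetical name and the words are picked for illustration). It also counts the rows per lexicon, since stop_words combines several stop-word lists.

```r
library(dplyr)
library(tidytext)

# Hand-made tokens: three stop words mixed with three content words.
toy_words <- data.frame(
  word = c("the", "mantle", "of", "peridotite", "and", "xenoliths"),
  stringsAsFactors = FALSE
)

# anti_join() keeps only the rows of toy_words whose word
# does NOT appear in stop_words.
kept <- toy_words %>%
  anti_join(stop_words, by = "word")

kept

# stop_words bundles several lexicons; count the rows in each.
stop_words %>%
  count(lexicon)
```

A word is dropped if it appears in any of the lexicons, so the filter is fairly aggressive. Domain words such as mantle are not in these general-purpose lists and therefore survive.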



Filtering the words

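Keep only the words that appear in at least 20 papers, using filter(paper_n >= 20). The code below needs data_words, the tokenized titles with stop words removed.

# Assumed definition of data_words (not shown in the steps above):
# tokenized titles without stop words, one row per paper and word
data_words <- data %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word") %>%
  distinct(paper, word)

# Number of papers each word appears in; keep words found in 20 or more
data_paper <- data_words %>%
  count(word, name = "paper_n") %>%
  filter(paper_n >= 20)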


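# Correlation between words, based on which papers they appear in;
# keep only word pairs with a correlation of at least 0.3
word_correlations <- data_words %>%
  semi_join(data_paper, by = "word") %>%
  pairwise_cor(item = word, feature = paper) %>%
  filter(correlation >= 0.3)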
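
Text Mining with R describes the measure behind pairwise_cor() as the phi coefficient, which for presence/absence data is equivalent to the Pearson correlation. As a rough sketch of that idea, the base R example below computes phi by hand for two words and compares it with cor(). The presence vectors are made up for illustration and are not taken from the dataset.

```r
# Hypothetical presence/absence of two words in six papers
# (1 = the word occurs in the title, 0 = it does not).
mantle   <- c(1, 1, 1, 0, 0, 1)
xenolith <- c(1, 1, 0, 0, 0, 1)

# Cells of the 2 x 2 contingency table
n11 <- sum(mantle == 1 & xenolith == 1)  # papers with both words
n10 <- sum(mantle == 1 & xenolith == 0)  # only the first word
n01 <- sum(mantle == 0 & xenolith == 1)  # only the second word
n00 <- sum(mantle == 0 & xenolith == 0)  # neither word

# Phi coefficient from the four counts
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

phi
cor(mantle, xenolith)  # Pearson correlation of the same 0/1 vectors
```

Both values agree, so a high correlation means two words tend to appear in the same titles and to be absent from the same titles. A pair that co-occurs often only because both words are very common gets a lower value than its raw co-occurrence count would suggest.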


Word graph

# Basic graph: every word in data_paper becomes a node,
# including words that have no edge
graph_from_data_frame(d = word_correlations,
                      vertices = data_paper) %>%
  ggraph(layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name))



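# Refined graph: keep only words with at least one edge,
# fade weak edges, and color the labels by paper count
graph_from_data_frame(d = word_correlations,
                      vertices = data_paper %>%
                        semi_join(word_correlations, by = c("word" = "item1"))) %>%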
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation)) +
  geom_node_point() +
  geom_node_text(aes(color = paper_n, label = name), repel = TRUE)
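
The "fr" (Fruchterman-Reingold) layout starts from random positions, so the same graph can be drawn differently on each run. A small sketch of how a fixed seed makes the layout repeatable is given below. It uses igraph directly on a hand-made graph; the seed value and the toy graph are arbitrary choices for illustration.

```r
library(igraph)

# Small hand-made graph, used only to demonstrate the layout behaviour.
toy_graph <- graph_from_literal(peridotite - mantle,
                                peridotite - xenolith,
                                peridotite - ophiolite,
                                mantle - xenolith)

set.seed(2024)                         # arbitrary seed
layout_a <- layout_with_fr(toy_graph)  # one row of x/y coordinates per node

set.seed(2024)                         # same seed again
layout_b <- layout_with_fr(toy_graph)

# With the same seed, both runs give the same coordinates.
all.equal(layout_a, layout_b)
```

For the plots above, this would amount to calling set.seed() right before the ggraph(layout = "fr") pipeline.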



Making a function to generate a word graph

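# data_words: one row per paper and word (columns paper and word)
# minimum_paper_n: keep words that appear in at least this many papers
# minimum_correlation: keep word pairs with at least this correlation
generate_word_graph <- function(data_words,
                                minimum_paper_n = 5,
                                minimum_correlation = 0.2){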
  
  data_paper <- data_words %>%
    count(word, name = "paper_n") %>%
    filter(paper_n >= minimum_paper_n)
  
  word_correlations <- data_words %>%
    semi_join(data_paper, by = "word") %>%
    pairwise_cor(item = word, feature = paper) %>%
    filter(correlation >= minimum_correlation)
  
  
  
  graph_from_data_frame(d = word_correlations,
                        vertices = data_paper %>%
                          semi_join(word_correlations, by = c("word" = "item1"))) %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(alpha = correlation)) +
    geom_node_point() +
    geom_node_text(aes(color = paper_n, label = name), repel = TRUE)
}


Visualizing word graph (using the function)

# Test 1: words that appear in at least 10 papers
data_words %>%
  generate_word_graph(minimum_paper_n = 10,
                      minimum_correlation = 0.2)


# Test 2: words that appear in at least 20 papers (fewer nodes)
data_words %>%
  generate_word_graph(minimum_paper_n = 20,
                      minimum_correlation = 0.2)
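
To get a feeling for how the two parameters thin out the graph, the base R sketch below counts the surviving words and edges for a hypothetical set of paper counts and correlations. All numbers and the helper count_survivors() are made up for illustration. It is also a simplification: the real function recomputes the correlations after filtering the words, while this sketch reuses a fixed table.

```r
# Hypothetical per-word paper counts (illustration only).
toy_paper <- data.frame(
  word    = c("mantle", "xenoliths", "craton", "kimberlite", "ophiolite"),
  paper_n = c(300, 150, 25, 12, 8),
  stringsAsFactors = FALSE
)

# Hypothetical pairwise correlations (illustration only).
toy_cor <- data.frame(
  item1       = c("mantle", "craton", "craton", "ophiolite"),
  item2       = c("xenoliths", "kimberlite", "mantle", "mantle"),
  correlation = c(0.35, 0.55, 0.22, 0.21),
  stringsAsFactors = FALSE
)

# Count the words and edges that survive one parameter combination.
count_survivors <- function(min_n, min_cor) {
  keep_words <- toy_paper$word[toy_paper$paper_n >= min_n]
  keep_edges <- toy_cor$item1 %in% keep_words &
    toy_cor$item2 %in% keep_words &
    toy_cor$correlation >= min_cor
  c(words = length(keep_words), edges = sum(keep_edges))
}

count_survivors(min_n = 10, min_cor = 0.2)
count_survivors(min_n = 20, min_cor = 0.2)
```

Raising minimum_paper_n removes rare words together with all of their edges, which is why the second test above is expected to give a sparser graph than the first.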


References



Silge, J. and Robinson, D. Text Mining with R: A Tidy Approach. O'Reilly Media, 2017.