Network Analysis
I am learning natural language processing (NLP) with R. In this post, I run a network analysis on research article titles. The dataset is "mantle xenolith" from PetDB.
Network analysis is a method for investigating the relationships between objects (here, words). The figure shows the result of a network analysis of papers about mantle peridotites. Black circles (nodes) are the target words, and lines (edges) show the relationships between words.
For example, "peridotite" in the center is connected to "mantle", "ophiolite", "xenolith", "isotope", etc.
So many research papers are published nowadays that it is hard to keep track of a field. Applying NLP to research article titles helps to get an overview of a research field and to understand how its topics relate to each other.
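To make the node and edge terminology concrete, here is a minimal sketch with igraph. The edge list is made up for illustration and is not taken from the PetDB results; the names edges and toy_graph are my own.

```r
library(igraph)

# Toy edge list (illustrative only, not taken from the PetDB results)
edges <- data.frame(
  from = c("peridotite", "peridotite", "peridotite", "mantle"),
  to   = c("mantle", "ophiolite", "xenolith", "xenolith")
)

# Build an undirected graph: each distinct word becomes a node,
# each row of the edge list becomes an edge
toy_graph <- graph_from_data_frame(edges, directed = FALSE)

# Words directly connected to "peridotite"
peridotite_neighbors <- neighbors(toy_graph, "peridotite")$name
print(peridotite_neighbors)

# Number of edges attached to each node
node_degree <- degree(toy_graph)
print(node_degree)
```

A word with many edges, like "peridotite" here, ends up near the center of the plotted network.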
Library
library(tidyverse)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
Read Dataset
data <- read.csv("/PATH/petdb_mantle_xenolith_title_network.csv")
head(data)
## year
## 1 2016
## 2 2012
## 3 2016
## 4 2006
## 5 2009
## 6 2014
## text
## 1 A FORE-ARC SETTING OF THE GERF OPHIOLITE, EASTERN DESERT, EGYPT: EVIDENCE FROM MINERAL CHEMISTRY AND GEOCHEMISTRY OF ULTRAMAFITES
## 2 THE DUNITIC MANTLE-CRUST TRANSITION ZONE IN THE OMAN OPHIOLITE: RESIDUE OF MELT-ROCK INTERACTION, CUMULATES FROM HIGH-MGO MELTS, OR BOTH?
## 3 SOFTENING OF SUB-CONTINENTAL LITHOSPHERE PRIOR RIFTING: EVIDENCE FROM CLINOPYROXENE CHEMISTRY IN PERIDOTITE XENOLITHS FROM NATASH VOLCANIC PROVINCE, SE EGYPT
## 4 SERPENTINIZATION AND DEHYDRATION IN THE UPPER MANTLE BENEATH FUERTEVENTURA (EASTERN CANARY ISLANDS): EVIDENCE FROM MANTLE XENOLITHS
## 5 GEOCHEMISTRY OF FE-RICH PERIDOTITES AND ASSOCIATED PYROXENITES FROM HORNÍ BORY, BOHEMIAN MASSIF: INSIGHTS INTO SUBDUCTION-RELATED MELT–ROCK REACTIONS
## 6 GEOCHEMICAL AND PETROLOGICAL CONSTRAINTS ON MANTLE COMPOSITION OF THE OHRE (EGER) RIFT, BOHEMIAN MASSIF: PERIDOTITE XENOLITHS FROM THE CESKE STREDOHORI VOLCANIC COMPLEX AND NORTHERN BOHEMIA
## journal paper
## 1 LITHOS paper1
## 2 GEOLOGY paper2
## 3 JOURNAL OF VOLCANOLOGY AND GEOTHERMAL RESEARCH paper3
## 4 LITHOS paper4
## 5 CHEMICAL GEOLOGY paper5
## 6 INT J EARTH SCI (GEOL RUNDSCH) paper6
The dataset consists of the publication year (year), the paper title (text), the journal, and a paper ID (paper).
data %>% head(10) %>% pull(text)
## [1] "A FORE-ARC SETTING OF THE GERF OPHIOLITE, EASTERN DESERT, EGYPT: EVIDENCE FROM MINERAL CHEMISTRY AND GEOCHEMISTRY OF ULTRAMAFITES"
## [2] "THE DUNITIC MANTLE-CRUST TRANSITION ZONE IN THE OMAN OPHIOLITE: RESIDUE OF MELT-ROCK INTERACTION, CUMULATES FROM HIGH-MGO MELTS, OR BOTH?"
## [3] "SOFTENING OF SUB-CONTINENTAL LITHOSPHERE PRIOR RIFTING: EVIDENCE FROM CLINOPYROXENE CHEMISTRY IN PERIDOTITE XENOLITHS FROM NATASH VOLCANIC PROVINCE, SE EGYPT"
## [4] "SERPENTINIZATION AND DEHYDRATION IN THE UPPER MANTLE BENEATH FUERTEVENTURA (EASTERN CANARY ISLANDS): EVIDENCE FROM MANTLE XENOLITHS"
## [5] "GEOCHEMISTRY OF FE-RICH PERIDOTITES AND ASSOCIATED PYROXENITES FROM HORNÍ BORY, BOHEMIAN MASSIF: INSIGHTS INTO SUBDUCTION-RELATED MELT–ROCK REACTIONS"
## [6] "GEOCHEMICAL AND PETROLOGICAL CONSTRAINTS ON MANTLE COMPOSITION OF THE OHRE (EGER) RIFT, BOHEMIAN MASSIF: PERIDOTITE XENOLITHS FROM THE CESKE STREDOHORI VOLCANIC COMPLEX AND NORTHERN BOHEMIA"
## [7] "HIGHLY SIDEROPHILE ELEMENT GEOCHEMISTRY OF PERIDOTITES AND PYROXENITES FROM HORNÍ BORY, BOHEMIAN MASSIF: IMPLICATIONS FOR HSE BEHAVIOUR IN SUBDUCTION-RELATED UPPER MANTLE"
## [8] "ALKALINE AND CARBONATE-RICH MELT METASOMATISM AND MELTING OF SUBCONTINENTAL LITHOSPERIC MANTLE: EVIDENCE FROM MANTLE XENOLITHS, NE BAVARIA, BOHEMIAN MASSIF"
## [9] "METASOMATISM IN LITHOSPHERIC MANTLE ROOTS: CONSTRAINTS FROM WHOLE-ROCK AND MINERAL CHEMICAL COMPOSITION OF DEFORMED PERIDOTITE XENOLITHS FROM KIMBERLITE PIPE UDACHNAYA"
## [10] "GEOCHEMISTRY OF ECLOGITE XENOLITHS FROM THE UDACHNAYA KIMBERLITE PIPE: SECTION OF ANCIENT OCEANIC CRUST SAMPLED"
Next, split each title into individual words and count how many times each word is used.
data %>%
  unnest_tokens(output = word, input = text) %>%
  count(word, sort = TRUE)
## word n
## 1 the 870
## 2 of 773
## 3 and 669
## 4 mantle 539
## 5 from 507
## 6 in 481
## 7 xenoliths 295
## 8 peridotite 188
## 9 for 149
## 10 evidence 128
.....
Remove stopwords
Words such as "the", "of", and "and" carry little meaning on their own, so we want to remove them.
# stop word list provided by tidytext
stop_words
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
data %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
## word n
## 1 mantle 539
## 2 xenoliths 295
## 3 peridotite 188
## 4 evidence 128
## 5 peridotites 118
## 6 beneath 115
## 7 lithospheric 105
## 8 implications 88
## 9 geochemistry 85
## 10 craton 83
Now we can see that "mantle", "xenoliths", and "peridotite" are the words that appear most often in the titles.
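To see what tokenizing and stop word removal do on a small scale, here is a sketch with two made-up titles. The data frame toy and its contents are hypothetical and only serve as an illustration.

```r
library(dplyr)
library(tidytext)

# Two made-up titles (hypothetical, for illustration only)
toy <- data.frame(
  paper = c("paperA", "paperB"),
  text  = c("EVIDENCE FROM MANTLE XENOLITHS",
            "GEOCHEMISTRY OF THE MANTLE"),
  stringsAsFactors = FALSE
)

# One row per word; unnest_tokens() converts words to lowercase by default
toy_words <- toy %>%
  unnest_tokens(output = word, input = text)

# Drop the words that are listed in stop_words ("from", "of", "the")
toy_clean <- toy_words %>%
  anti_join(stop_words, by = "word")

# Count the remaining words
toy_counts <- toy_clean %>%
  count(word, sort = TRUE)
print(toy_counts)
```

The paper column is carried along for every word, which is what later makes it possible to tell which words share a title.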
Filtering the words
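filter(paper_n >= 20): keep only the words that appear in at least 20 papers.
# Note: data_words is used below but is not defined earlier in this post.
# Assumption: it is the tokenized titles with stop words removed.
data_words <- data %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word")
data_paper <- data_words %>%
  count(word, name = "paper_n") %>%
  filter(paper_n >= 20)
# Correlation between words, based on the papers they appear in
word_correlations <- data_words %>%
  semi_join(data_paper, by = "word") %>%
  pairwise_cor(item = word, feature = paper) %>%
  filter(correlation >= 0.3)
Word graph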
# Basic word graph: words are nodes, correlations are edges
graph_from_data_frame(d = word_correlations,
                      vertices = data_paper) %>%
  ggraph(layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name))
# Refined word graph: drop words without any edge, show correlation as
# edge transparency and paper_n as label color
graph_from_data_frame(d = word_correlations,
                      vertices = data_paper %>%
                        semi_join(word_correlations, by = c("word" = "item1"))) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation)) +
  geom_node_point() +
  geom_node_text(aes(color = paper_n, label = name), repel = TRUE)
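The edges in the graphs above come from pairwise_cor(), so it helps to see what it returns on a tiny example. This is a minimal sketch with a made-up word/paper table (toy_words and its contents are hypothetical). As far as I understand, with this kind of presence/absence data the correlation tells us how often two words share a title compared with how often they appear separately.

```r
library(dplyr)
library(widyr)

# Made-up word/paper table (hypothetical, for illustration only)
toy_words <- data.frame(
  paper = c("paperA", "paperA", "paperB", "paperB",
            "paperC", "paperC", "paperD", "paperD"),
  word  = c("mantle", "xenoliths", "mantle", "xenoliths",
            "ophiolite", "oman", "mantle", "ophiolite"),
  stringsAsFactors = FALSE
)

# One row per ordered word pair, with a correlation column
toy_cor <- toy_words %>%
  pairwise_cor(item = word, feature = paper, sort = TRUE)
print(toy_cor)

# Keep only positively correlated pairs, as done with the threshold above
toy_edges <- toy_cor %>%
  filter(correlation > 0)
print(toy_edges)
```

In this toy table, "mantle" and "xenoliths" share two titles and get a positive correlation, while "xenoliths" and "ophiolite" never share a title and get a negative one, so only the first pair would become an edge.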
Making a function to generate a word graph
generate_word_graph <- function(data_words,
                                minimum_paper_n = 5,
                                minimum_correlation = 0.2) {
  # Keep the words that appear in at least minimum_paper_n papers
  data_paper <- data_words %>%
    count(word, name = "paper_n") %>%
    filter(paper_n >= minimum_paper_n)
  # Keep the word pairs with a correlation of at least minimum_correlation
  word_correlations <- data_words %>%
    semi_join(data_paper, by = "word") %>%
    pairwise_cor(item = word, feature = paper) %>%
    filter(correlation >= minimum_correlation)
  # Draw the graph, dropping words that have no edge
  graph_from_data_frame(d = word_correlations,
                        vertices = data_paper %>%
                          semi_join(word_correlations, by = c("word" = "item1"))) %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(alpha = correlation)) +
    geom_node_point() +
    geom_node_text(aes(color = paper_n, label = name), repel = TRUE)
}
Visualizing word graph (using the function)
# test1
data_words %>%
  generate_word_graph(minimum_paper_n = 10,
                      minimum_correlation = 0.2)
# test2
data_words %>%
  generate_word_graph(minimum_paper_n = 20,
                      minimum_correlation = 0.2)



