Computational tools for extracting relationships from text (often referred to as “natural language processing” tools) are increasingly powerful. Here, I analyze the content of a series of abstracts from members of the Science family of journals using a natural language package in R called quanteda. The approach begins with a relatively large body (often referred to as a corpus) of abstracts. The frequency of different words within this corpus is analyzed. The similarity between every pair of abstracts is then calculated based on the fraction of words that the pair have in common, with words that are rare in the corpus being weighted more heavily than common words. These similarity scores are then converted into “distances,” ranging from 0 (very similar) to 1 (no similarity). To visualize and analyze these distances, a two-dimensional space is constructed that maintains the relative distances between pairs of points as faithfully as possible. Points in this two-dimensional space can then be plotted and examined.
Constructing the initial corpus of abstracts from Science
To build the corpus, abstracts for more than 2200 research papers from Science from 2013 through 2015 were assembled. The first task is to read in the data set and convert the data into a corpus for analysis.
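The original analysis was carried out with quanteda in R, and that code is not reproduced here. As a rough, language-neutral illustration of the reading step, the Python sketch below loads a CSV of abstracts into a list of documents. The column names `title` and `abstract`, the helper `read_corpus`, and the two sample rows are assumptions made for illustration; the layout of the real .csv files is not described in this post.

```python
import csv
import io

# Hypothetical stand-in for one of the abstract .csv files. The column names
# "title" and "abstract" and both rows are invented for this illustration.
SAMPLE_CSV = """title,abstract
Paper A,"Graphene sheets show unusual electron transport."
Paper B,"T cells respond to tumor antigens in mice."
"""


def read_corpus(handle, text_field="abstract"):
    """Return a list of (doc_id, text) pairs, one per non-empty abstract."""
    corpus = []
    for i, row in enumerate(csv.DictReader(handle), start=1):
        text = (row.get(text_field) or "").strip()
        if text:  # skip rows with a missing or blank abstract
            corpus.append((f"doc{i}", text))
    return corpus


corpus = read_corpus(io.StringIO(SAMPLE_CSV))
print(len(corpus), "documents loaded")
```

To read a file on disk instead of the in-memory sample, the same function can be given a handle from `open(path, newline="", encoding="utf-8")`.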
Calculating similarities between all pairs of abstracts
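The next step is to calculate the similarity matrix. Each abstract is first represented as a vector of TF-IDF (term frequency–inverse document frequency) weights, which count terms that occur rarely in the corpus more heavily than common terms. The similarity index is the so-called “cosine similarity,” the cosine of the angle between two such vectors. This similarity must subsequently be converted to a distance metric.
With this similarity matrix in hand, we calculate distances using the formula distance = 2*arccos(similarity)/pi. Because TF-IDF weights are non-negative, the cosine similarity lies between 0 and 1, so the distance runs from 0 (vectors pointing in the same direction) to 1 (no weighted terms in common).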
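To make the weighting and the similarity index concrete, here is a minimal Python sketch of TF-IDF followed by cosine similarity. It is not the quanteda code used for the analysis. The tokenizer, the choice of raw counts times log10(N/df) as the TF-IDF variant, and the three toy sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens (a crude choice)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Weight = raw term count * log10(N / document frequency), one common
    TF-IDF variant, assumed here rather than taken from the original code.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Iterating over a Counter yields each distinct term once per document.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log10(n_docs / k) for t, k in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents, invented for this example.
docs = [
    "the protein binds the receptor",
    "the receptor signals the cell",
    "the glacier melts in the ocean",
]
vecs = tfidf_vectors(docs)
sim = [[cosine(u, v) for v in vecs] for u in vecs]
for row in sim:
    print(["%.3f" % s for s in row])
```

In this toy corpus the word "the" appears in every document, so its weight is zero and it contributes nothing to any similarity. The first two documents share the rarer word "receptor" and therefore have a positive similarity, while the third shares no weighted term with either.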
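The formula above is short enough to check by hand. The Python sketch below applies it to a few values; the function name `similarity_to_distance` and the clipping step (a guard against floating-point values marginally outside the valid range) are additions made for illustration.

```python
import math


def similarity_to_distance(similarity):
    """Angular distance: 2 * arccos(similarity) / pi."""
    # Rounding error can push a cosine slightly past 1.0, where acos is
    # undefined, so clip to the valid domain first.
    clipped = max(-1.0, min(1.0, similarity))
    return 2.0 * math.acos(clipped) / math.pi


for s in (1.0, math.sqrt(2) / 2, 0.0):
    print("similarity %.3f -> distance %.3f" % (s, similarity_to_distance(s)))
```

A similarity of 1 gives a distance of 0, a similarity of 0 gives a distance of 1, and a 45-degree angle between the two vectors lands halfway at 0.5.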
Representing the relative distances in two dimensions
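This post does not say which method was used to build the two-dimensional space. Classical (Torgerson) multidimensional scaling is one standard way to do it and is assumed in the Python sketch below: the squared distances are double-centered, and the two leading eigenvectors, scaled by the square roots of their eigenvalues, become the coordinates. The eigenvectors are found with a simple power iteration to keep the example dependency-free; a real analysis would use a library routine. The four test points are invented.

```python
import math


def double_center(d):
    """Return B = -1/2 * J * D^2 * J for a symmetric distance matrix d."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]  # equals column means by symmetry
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpair(b, iters=500):
    """Power iteration: eigenpair with the largest-magnitude eigenvalue.

    Assumes that eigenvalue is positive, which holds when the distances
    are (close to) Euclidean.
    """
    n = len(b)
    v = [1.0 / (i + 1) for i in range(n)]  # fixed, non-constant start vector
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * bv[i] for i in range(n)), v


def classical_mds(d, dims=2):
    """Embed n points in `dims` dimensions from an n x n distance matrix."""
    n = len(d)
    b = double_center(d)
    axes = []
    for _ in range(dims):
        lam, v = top_eigenpair(b)
        scale = math.sqrt(max(lam, 0.0))
        axes.append([scale * x for x in v])
        # Deflate: remove the component just found before the next pass.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return [tuple(axis[i] for axis in axes) for i in range(n)]


# Invented test case: corners of a 4 x 3 rectangle, distances computed directly.
pts = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
d = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in pts] for p in pts]
coords = classical_mds(d)
for c in coords:
    print("%.3f, %.3f" % c)
```

The recovered coordinates generally differ from the input points by a rotation, reflection, or shift, but the pairwise distances between them match the input distances. That is the sense in which only the shape of such a plot carries meaning.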
The results can then be plotted.
This plot reveals an interesting three-pointed structure. Note that only the shape of this figure is meaningful; the orientation is arbitrary. Examination of the abstracts located at the three points shows that they represent biomedical sciences, physical sciences, and Earth sciences.
Extending the analysis to the other Science family journals
With this framework in place, we can now expand the corpus to include papers published in the other Science family journals. For this purpose, we will use most of the papers published in Science Advances, Science Signaling, Science Translational Medicine, Science Immunology, and Science Robotics in 2016.
We now plot each journal separately.
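One way to organize per-journal plots is to compute a single embedding for the combined corpus and then split the resulting points by journal, so that every panel shares the same axes. The Python sketch below shows only the splitting step. The coordinate values are placeholders invented for illustration, not results from this analysis, and `group_by_journal` is a hypothetical helper.

```python
from collections import defaultdict

# Placeholder rows of (journal, x, y). The numbers are made up purely to
# show the data shape and are NOT output of the analysis in this post.
points = [
    ("Science", 0.12, -0.30),
    ("Science Advances", 0.10, -0.25),
    ("Science Immunology", -0.40, 0.05),
    ("Science", 0.35, 0.20),
]


def group_by_journal(rows):
    """Map each journal name to the list of (x, y) points it contributed."""
    panels = defaultdict(list)
    for journal, x, y in rows:
        panels[journal].append((x, y))
    return dict(panels)


panels = group_by_journal(points)
for journal, coords in panels.items():
    print(journal, len(coords), "points")
```

Each entry of `panels` can then be drawn as its own panel, with identical axis limits across panels so that positions remain comparable between journals.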
The orientations of these figures are slightly different from that produced by the initial Science-only corpus but, as noted above, this orientation is arbitrary. Several points emerge from examining these plots. First, the breadth of disciplines covered by Science Advances is essentially the same as that covered by Science. Comparison with more papers from Science Advances may reveal differences in emphasis between these two broad journals. Second, the papers from Science Signaling, Science Translational Medicine, and Science Immunology lie in the same general region in the biomedical arm of the plot. More detailed analysis should reveal more nuanced differences between the content of these journals.
This analysis represents a first step toward using these tools for unbiased analysis of the contents of the Science family of journals. More refined analysis is in progress.
Additional documents and code
The abstracts used in this analysis are available in six .csv files. The R Markdown file that generates this post including the analysis is also available.
- Science research abstracts 2013-2015 (for personal use)
- Science Advances abstracts 2016 (for personal use)
- Science Signaling abstracts 2016 (for personal use)
- Science Translational Medicine abstracts 2016 (for personal use)
- Science Immunology abstracts 2016 (for personal use)
- Science Robotics abstracts 2016 (for personal use)
- R Markdown file