Introduction to Sciencehound
Welcome to my new blog at Science. I began blogging when I was Director of the National Institute of General Medical Sciences (NIGMS) at the US National Institutes of Health (NIH). Our blog was called the NIGMS Feedback Loop. I found this to be a very effective way of sharing information and data with NIGMS stakeholders. A couple of years after leaving NIGMS, I started a new blog called Datahound. There, I have continued sharing data and analyses about programs of interest to the scientific community. I greatly appreciated those who took the time to comment, providing feedback and sometimes raising important questions. I am starting Sciencehound with the same intent, providing data and analyses and, importantly, initiating discussions with the readers of Science and the Science family of journals. Enjoy and join in!
Journal impact factors
Journal impact factors are used as metrics for the quality of academic journals. In addition, they are (ab)used as metrics for individual publications or individual scientists (see my editorial in Science). The journal impact factor is defined as the average number of times articles published in a given journal over the past 2 years are cited in a given year. This average is derived from a relatively broad distribution of publications with different numbers of citations. Recently, Larivière et al. posted on bioRxiv a proposal recommending that journals share these full distributions. This manuscript includes 2015 distributions for 11 journals (in a readily downloadable format). The distribution for Science magazine is shown below:
Note that the point at 100 represents the sum of the numbers of all papers that received 100 or more citations.
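Because the last point pools every paper with 100 or more citations, a mean computed directly from the plotted distribution can only be a lower bound on the true mean. Here is a minimal Python sketch of that calculation, using a made-up toy histogram (the counts and the function name mean_citations are illustrative and are not taken from the Larivière et al. data):

```python
def mean_citations(counts):
    """Mean citations per paper from a histogram.

    counts[c] is the number of papers with exactly c citations. The last
    entry is treated as an overflow bin ("this many or more"), so the
    result is a lower bound on the true mean.
    """
    total_papers = sum(counts)
    total_citations = sum(c * n for c, n in enumerate(counts))
    return total_citations / total_papers


# Made-up toy histogram: bins 0, 1, 2, 3 and an overflow bin "4 or more".
toy_counts = [2, 5, 4, 2, 1]
lower_bound = mean_citations(toy_counts)
print(f"mean citations (lower bound): {lower_bound:.3f}")
```

For the real data the overflow bin sits at 100 rather than 4, but the logic is the same: every paper in that bin is counted as if it had exactly 100 citations.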
Fitting citation curves as the difference of exponential functions
This curve rises quickly and then falls more slowly. As a chemist, this reminded me of the curves representing the concentration of an intermediate B in a reaction of the form
A -> B -> C.
The concentration of B rises when A is converted to B and then falls when B is transformed into C.
Solving equations for the kinetics of this scheme results in a function that is the difference between two exponential functions with negative exponents, that is,
P(c) = N(exp(-k1c) – exp(-k2c)) with k1 < k2.
Here, c is the number of citations, P(c) is the population of papers with c citations, k1 and k2 are adjustable constants, and N is a scale factor. The curve rises from zero with an initial slope of N(k2 – k1) and, at large c, falls approximately exponentially as exp(-k1c).
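To get a feel for the shape of this function, it can help to evaluate it at a few points. The sketch below is in Python, and the parameter values are arbitrary illustrations, not fits to any journal. Setting the derivative of P(c) to zero gives the peak position c = ln(k2/k1)/(k2 – k1), which the helper peak_position (a name introduced here) computes.

```python
import math


def difference_of_exponentials(c, n, k1, k2):
    """P(c) = N (exp(-k1 c) - exp(-k2 c)), intended for k1 < k2."""
    return n * (math.exp(-k1 * c) - math.exp(-k2 * c))


def peak_position(k1, k2):
    """Value of c where P(c) is largest, from setting dP/dc = 0."""
    return math.log(k2 / k1) / (k2 - k1)


# Illustrative values only, not fitted to any journal.
n, k1, k2 = 1.0, 0.1, 0.5
c_peak = peak_position(k1, k2)
for c in (0, c_peak, 20, 40):
    value = difference_of_exponentials(c, n, k1, k2)
    print(f"c = {c:6.2f}  P(c) = {value:.4f}")
```

The function is zero at c = 0, reaches its maximum at c_peak, and beyond that the ratio P(40)/P(20) is close to exp(-20 k1), as expected once the exp(-k2c) term has died away.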
Before fitting the citation curve to this function, we first normalize the curve so that the area under the curve is 1.0 and the y-axis is the fraction of the number of total papers.
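The normalization step amounts to dividing each count by the total number of papers. With bins one citation wide, the sum of the resulting fractions equals the area under the curve. A small Python sketch, again with made-up counts and an illustrative function name:

```python
def normalize(counts):
    """Convert paper counts per citation bin into fractions of all papers."""
    total = sum(counts)
    return [n / total for n in counts]


# Made-up counts for illustration only.
toy_counts = [2, 5, 4, 2, 1]
fractions = normalize(toy_counts)
print(fractions)
print(sum(fractions))
```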
This normalized curve can now be fit to the difference of exponential functions. It is easy to show that the normalization constant for the difference of exponential functions is N = k1k2/(k2 – k1) (see mathematical appendix).
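One way to see where this normalization constant comes from, treating c as a continuous variable (the linked appendix may proceed differently), is to require that the area under the curve equal 1:

```latex
\int_0^\infty \left( e^{-k_1 c} - e^{-k_2 c} \right) dc
  = \frac{1}{k_1} - \frac{1}{k_2}
  = \frac{k_2 - k_1}{k_1 k_2}
\quad\Longrightarrow\quad
N = \frac{k_1 k_2}{k_2 - k_1}
```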
The best fit occurs with k1 = 0.05 and k2 = 0.19.
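To illustrate what "best fit" means here, the sketch below does a brute-force least-squares search over a grid of (k1, k2) values. It is written in Python and is not the fitting routine used for this post, which lives in the R Markdown file. The grid step, the upper limit k_max, and all function names are assumptions made for the example. It is tested on synthetic data generated from the model itself, not on the journal data.

```python
import math


def model(c, k1, k2):
    """Normalized difference of exponentials with N = k1 k2 / (k2 - k1)."""
    n = k1 * k2 / (k2 - k1)
    return n * (math.exp(-k1 * c) - math.exp(-k2 * c))


def sum_squared_error(fractions, k1, k2):
    """Squared error between observed fractions and the model at c = 0, 1, 2, ..."""
    return sum((f - model(c, k1, k2)) ** 2 for c, f in enumerate(fractions))


def grid_fit(fractions, step=0.01, k_max=1.0):
    """Try every grid pair with 0 < k1 < k2 <= k_max and keep the best one."""
    steps = round(k_max / step)
    best = None
    for i in range(1, steps):
        k1 = i * step
        for j in range(i + 1, steps + 1):
            k2 = j * step
            err = sum_squared_error(fractions, k1, k2)
            if best is None or err < best[0]:
                best = (err, k1, k2)
    return best[1], best[2]


# Synthetic "data" generated from the model, so the true answer is known.
synthetic = [model(c, 0.05, 0.19) for c in range(100)]
k1_hat, k2_hat = grid_fit(synthetic, step=0.01, k_max=0.5)
print(k1_hat, k2_hat)
```

A grid search is slow and limited to the grid resolution; a proper nonlinear least-squares routine would normally be preferred. The point of the sketch is only to make the objective being minimized explicit.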
The apparent journal impact factor can be calculated from these parameters (see mathematical appendix). It can be shown that the journal impact factor (JIF) is:
JIF = (k1 + k2) / (k1k2).
The calculated JIF = 25.3.
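For readers who want the intermediate steps, here is one route to the JIF formula, again treating c as continuous (the linked appendix may take a different path). The JIF in this model is the mean of the normalized distribution:

```latex
\mathrm{JIF}
  = \int_0^\infty c\, N \left( e^{-k_1 c} - e^{-k_2 c} \right) dc
  = N \left( \frac{1}{k_1^2} - \frac{1}{k_2^2} \right)
  = \frac{k_1 k_2}{k_2 - k_1} \cdot \frac{k_2^2 - k_1^2}{k_1^2 k_2^2}
  = \frac{k_1 + k_2}{k_1 k_2}
  = \frac{1}{k_1} + \frac{1}{k_2}
```

With the rounded Science parameters, this gives (0.05 + 0.19)/(0.05 × 0.19) = 0.24/0.0095 ≈ 25.3.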
Note that this value is smaller than the journal impact factor that is reported (34.7). This is because highly cited papers (with more than 100 citations) have a substantial effect on the journal impact factor but are not well fit by the difference of exponential functions.
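One way to see how little weight the fitted curve places on very highly cited papers is to integrate the model beyond 100 citations. The Python sketch below does this for the rounded Science parameters, treating c as continuous. The function name tail_fraction is introduced here for illustration.

```python
import math


def tail_fraction(c0, k1, k2):
    """Fraction of the normalized model lying above c0 citations.

    Integrates N (exp(-k1 c) - exp(-k2 c)) from c0 to infinity,
    with N = k1 k2 / (k2 - k1).
    """
    n = k1 * k2 / (k2 - k1)
    return n * (math.exp(-k1 * c0) / k1 - math.exp(-k2 * c0) / k2)


# Rounded Science parameters quoted in the post.
science_tail = tail_fraction(100, 0.05, 0.19)
print(f"model fraction above 100 citations: {science_tail:.4f}")
```

With these parameters the model places a bit under 1% of papers above 100 citations, so any larger population of such papers in the real data contributes citations that the fitted curve does not capture.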
Results for a collection of journals
With this fitting protocol in place, we can now fit the distributions for the other 10 journals.
Nature
The best fit occurs with k1 = 0.07 and k2 = 0.08.
The calculated JIF = 26.8.
eLife
The best fit occurs with k1 = 0.16 and k2 = 0.65.
The calculated JIF = 7.8.
PLOS ONE
The best fit occurs with k1 = 0.31 and k2 = 2.
The calculated JIF = 3.7.
PLOS Biology
The best fit occurs with k1 = 0.16 and k2 = 0.57.
The calculated JIF = 8.0.
PLOS Genetics
The best fit occurs with k1 = 0.18 and k2 = 0.92.
The calculated JIF = 6.6.
Nature Communications
The best fit occurs with k1 = 0.13 and k2 = 0.66.
The calculated JIF = 9.2.
EMBO Journal
The best fit occurs with k1 = 0.16 and k2 = 0.37.
The calculated JIF = 9.0.
Proceedings of the Royal Society of London B
The best fit occurs with k1 = 0.24 and k2 = 1.42.
The calculated JIF = 4.9.
Journal of Informetrics
The best fit occurs with k1 = 0.32 and k2 = 2.
The calculated JIF = 3.6.
Scientific Reports
The best fit occurs with k1 = 0.22 and k2 = 2.
The calculated JIF = 5.0.
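As a quick consistency check, the calculated JIF values above can be recomputed from the rounded fit parameters using JIF = (k1 + k2)/(k1k2). The sketch below is in Python for illustration (the post's own code is in R). The parameter values are copied from the text above.

```python
# (k1, k2) pairs as quoted in the post, rounded to two decimals there.
fits = {
    "Science": (0.05, 0.19),
    "Nature": (0.07, 0.08),
    "eLife": (0.16, 0.65),
    "PLOS ONE": (0.31, 2.0),
    "PLOS Biology": (0.16, 0.57),
    "PLOS Genetics": (0.18, 0.92),
    "Nature Communications": (0.13, 0.66),
    "EMBO Journal": (0.16, 0.37),
    "Proceedings of the Royal Society of London B": (0.24, 1.42),
    "Journal of Informetrics": (0.32, 2.0),
    "Scientific Reports": (0.22, 2.0),
}


def model_jif(k1, k2):
    """Mean of the normalized difference-of-exponentials distribution."""
    return (k1 + k2) / (k1 * k2)


calculated = {name: round(model_jif(k1, k2), 1) for name, (k1, k2) in fits.items()}
for name, jif in calculated.items():
    print(f"{name:45s} {jif:5.1f}")
```

Even with the parameters rounded to two decimals, each recomputed value matches the one quoted above to one decimal place.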
Analysis of calculated and observed journal impact factors
The calculated journal impact factors are well correlated with the observed values as shown below:
A line with slope 1 is shown for comparison. The overall Pearson correlation coefficient is 0.999. Fitting all 11 data points to a line through the origin yields a slope of 0.746. The fact that this slope is substantially less than 1 is largely driven by the values for Science and Nature, whose calculated values fall below the reported ones because the fit does not capture the effect of papers with more than 100 citations. If these two points are eliminated, the slope of the fitted line increases to 0.924.
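For reference, the two summary statistics used here are simple to compute. For a line through the origin, the least-squares slope is sum(x*y)/sum(x*x). The Python sketch below assumes the observed JIF is on the x-axis and the calculated JIF on the y-axis, which is consistent with a slope below 1 when calculated values fall short of reported ones. The numbers are placeholders invented for the example, not the values behind the figure.

```python
import math


def slope_through_origin(x, y):
    """Least-squares slope of y = b x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only.
observed = [2.0, 4.0, 10.0]
calculated = [1.8, 3.9, 8.0]
print(f"slope through origin: {slope_through_origin(observed, calculated):.3f}")
print(f"Pearson r:            {pearson_r(observed, calculated):.3f}")
```

Note that a high correlation coefficient and a slope well below 1 are compatible: the correlation measures how closely the points follow some line, while the slope measures how far that line is from y = x.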
Conclusions
We have demonstrated that a function formed as the difference of two exponential functions can be used to fit observed distributions of the numbers of papers with different numbers of citations. Fitting this functional form to data from 11 journals reproduces the curves well and generates journal impact factors that agree well with published values. The largest differences are in journals such as Science and Nature that have substantial numbers of papers with more than 100 citations over the 2-year period. This emphasizes again how these outlier papers can affect journal impact factor values.
Next post
In my next post, I will demonstrate how this model can be refined to produce an algorithm that will generate a unique citation distribution given a journal impact factor. This sets the stage for more interesting analyses.
Available code and documents
The R Markdown file that generates this post including the R code for fitting the citation distributions is available. The data from Larivière et al. is provided as a .csv file. A mathematical appendix showing the derivation of some key formulae is also available.