### Introduction to *Sciencehound*

Welcome to my new blog at *Science*. I began blogging when I was Director of the National Institute of General Medical Sciences (NIGMS) at the US National Institutes of Health (NIH). Our blog was called the *NIGMS Feedback Loop*. I found this to be a very effective way of sharing information and data with NIGMS stakeholders. A couple of years after leaving NIGMS, I started a new blog called *Datahound*. There, I have continued sharing data and analyses about programs of interest to the scientific community. I greatly appreciated those who took the time to comment, providing feedback and sometimes raising important questions. I am starting *Sciencehound* with the same intent, providing data and analyses and, importantly, initiating discussions with the readers of *Science* and the *Science* family of journals. Enjoy and join in!

### Journal impact factors

Journal impact factors are used as metrics for the quality of academic journals. In addition, they are (ab)used as metrics for individual publications or individual scientists (see my editorial in *Science*). The journal impact factor is defined as the average number of times articles published in a given journal over the past 2 years are cited in a given year. This average is derived from a relatively broad distribution of publications with different numbers of citations. Recently, Larivière *et al.* posted on bioRxiv a proposal recommending that journals share these full distributions. This manuscript includes 2015 distributions for 11 journals (in a readily downloadable format). The distribution for *Science* magazine is shown below:

Note that the point at 100 represents the sum of the numbers of all papers that received 100 or more citations.
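To make the relationship between the distribution and the average concrete, here is a minimal Python sketch (the post's own code is in R and is not reproduced here). The counts are made up and the function name `binned_mean` is hypothetical. Because the last bin lumps together all papers at or above the cap, the mean computed this way is only a lower bound on the true average.

```python
def binned_mean(counts):
    """Mean citations per paper from a histogram.

    counts[c] is the number of papers with c citations; the final bin is
    treated as exactly len(counts) - 1 citations, even if it is an
    overflow bin ("this many or more"), so the result is a lower bound.
    """
    total_papers = sum(counts)
    total_citations = sum(c * n for c, n in enumerate(counts))
    return total_citations / total_papers

# Made-up toy histogram (not from any journal): bins 0, 1, 2, 3, 4, "5 or more".
toy_counts = [10, 20, 30, 20, 10, 10]
lower_bound = binned_mean(toy_counts)
print(f"lower bound on the mean: {lower_bound:.2f}")
```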

### Fitting citation curves as the difference of exponential functions

This curve rises quickly and then falls more slowly. As a chemist, I was reminded of the curves representing the concentration of an intermediate B in a reaction of the form

#### A -> B -> C.

The concentration of B rises when A is converted to B and then falls when B is transformed into C.
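For readers less familiar with chemical kinetics, the standard textbook rate equations for two consecutive first-order steps are sketched below. The symbols are assumptions made for illustration: time t, an initial concentration A_0 of A with no B or C present, and rate constants k_a (for A -> B) and k_b (for B -> C) that differ from each other.

$$
\frac{d[A]}{dt} = -k_a [A], \qquad
\frac{d[B]}{dt} = k_a [A] - k_b [B], \qquad
\frac{d[C]}{dt} = k_b [B]
$$

$$
[B](t) = A_0 \, \frac{k_a}{k_b - k_a} \left( e^{-k_a t} - e^{-k_b t} \right)
$$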

Solving equations for the kinetics of this scheme results in a function that is the difference between two exponential functions with negative exponents, that is,

#### P(c) = N(exp(-k_{1}c) - exp(-k_{2}c)) with k_{1} < k_{2}.

Here, c is the number of citations, P(c) is the population of papers with c citations, k_{1} and k_{2} are adjustable constants, and N is a scale factor. The curve rises from zero with an initial slope of N(k_{2} - k_{1}) and, at large c, falls exponentially approximately as exp(-k_{1}c).
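The shape of this function can be explored with a short Python sketch. The function names and the parameter values are illustrative assumptions. The peak position follows from setting the derivative of P(c) to zero, and the ratio of successive values far out in the tail should approach exp(-k_{1}).

```python
import math

def diff_exp(c, k1, k2, n=1.0):
    """P(c) = n * (exp(-k1*c) - exp(-k2*c)), assuming 0 < k1 < k2."""
    return n * (math.exp(-k1 * c) - math.exp(-k2 * c))

def peak_position(k1, k2):
    """Value of c at the maximum, obtained from dP/dc = 0."""
    return math.log(k2 / k1) / (k2 - k1)

k1, k2 = 0.05, 0.19  # illustrative values only
c_peak = peak_position(k1, k2)

# Far from the peak the faster exponential has died away, so each extra
# citation multiplies P(c) by roughly exp(-k1).
tail_ratio = diff_exp(81, k1, k2) / diff_exp(80, k1, k2)

print(f"peak near c = {c_peak:.1f}")
print(f"tail ratio = {tail_ratio:.4f}, exp(-k1) = {math.exp(-k1):.4f}")
```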

Before fitting the citation curve to this function, we first normalize the curve so that the area under the curve is 1.0 and the *y-axis* is the fraction of the number of total papers.
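A minimal sketch of this normalization step in Python, assuming the data arrive as a list of paper counts with one bin per citation number (so the bin width is 1 and the sum of the fractions equals the area under the curve). The counts are made up and the function name is hypothetical.

```python
def normalize(counts):
    """Convert paper counts per citation bin into fractions that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("distribution contains no papers")
    return [n / total for n in counts]

# Made-up counts for illustration only.
toy_counts = [10, 20, 30, 20, 10, 10]
fractions = normalize(toy_counts)
print(fractions)
```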

This normalized curve can now be fit to the difference of exponential functions. It is easy to show that the normalization constant for the difference of exponential functions is N = k_{1}k_{2}/(k_{2} - k_{1}) (see mathematical appendix).
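One way to see where this constant comes from, treating c as a continuous variable (an assumption; the appendix may proceed differently), is to require that the area under P(c) equal 1:

$$
\int_0^\infty N \left( e^{-k_1 c} - e^{-k_2 c} \right) dc
= N \left( \frac{1}{k_1} - \frac{1}{k_2} \right)
= N \, \frac{k_2 - k_1}{k_1 k_2} = 1
\quad \Longrightarrow \quad
N = \frac{k_1 k_2}{k_2 - k_1}
$$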

The best fit occurs with k_{1} = 0.05 and k_{2} = 0.19.
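The post does not say which fitting routine was used, so the following Python sketch is only a stand-in: a brute-force least-squares search over a coarse grid of (k_{1}, k_{2}) values. To keep it self-contained, it fits synthetic data generated from the model itself rather than the real *Science* distribution. The grid limits and step size are arbitrary assumptions.

```python
import math

def model(c, k1, k2):
    """Normalized difference of exponentials with N = k1*k2/(k2 - k1)."""
    n = k1 * k2 / (k2 - k1)
    return n * (math.exp(-k1 * c) - math.exp(-k2 * c))

def sse(data, k1, k2):
    """Sum of squared residuals; data[c] is the fraction of papers with c citations."""
    return sum((y - model(c, k1, k2)) ** 2 for c, y in enumerate(data))

def grid_fit(data, step=0.01, k1_max=0.30, k2_max=1.00):
    """Exhaustive search over grid points with k1 < k2; returns the best (k1, k2)."""
    best = None
    n1 = round(k1_max / step)
    n2 = round(k2_max / step)
    for i in range(1, n1 + 1):
        for j in range(i + 1, n2 + 1):
            k1, k2 = i * step, j * step
            err = sse(data, k1, k2)
            if best is None or err < best[0]:
                best = (err, k1, k2)
    return best[1], best[2]

# Synthetic "observations" generated from the model, so the answer is known.
synthetic = [model(c, 0.05, 0.19) for c in range(100)]
k1_fit, k2_fit = grid_fit(synthetic)
print(f"k1 = {k1_fit:.2f}, k2 = {k2_fit:.2f}")
```

A gradient-based nonlinear least-squares routine would be faster and more precise on real data; the grid search is used here only because it is easy to read and has no dependencies.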

The apparent journal impact factor can be calculated from these parameters (see mathematical appendix). It can be shown that the journal impact factor (JIF) is:

#### JIF = (k_{1} + k_{2}) / (k_{1}k_{2}).
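A sketch of why this holds, again treating c as continuous (an assumption; the appendix may differ in detail): the JIF is the mean of the normalized distribution, and each exponential term contributes an integral of the form 1/k^2.

$$
\mathrm{JIF} = \int_0^\infty c \, P(c) \, dc
= N \left( \frac{1}{k_1^2} - \frac{1}{k_2^2} \right)
= \frac{k_1 k_2}{k_2 - k_1} \cdot \frac{k_2^2 - k_1^2}{k_1^2 k_2^2}
= \frac{k_1 + k_2}{k_1 k_2}
= \frac{1}{k_1} + \frac{1}{k_2}
$$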

The calculated JIF = 25.3.
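As a numerical sanity check, the closed-form expression can be compared with a direct trapezoidal integration of c P(c). This is a sketch; the integration limit and step count are arbitrary assumptions chosen so that the neglected tail is negligible.

```python
import math

def model(c, k1, k2):
    """Normalized difference of exponentials with N = k1*k2/(k2 - k1)."""
    n = k1 * k2 / (k2 - k1)
    return n * (math.exp(-k1 * c) - math.exp(-k2 * c))

def numeric_mean(k1, k2, upper=1000.0, steps=200_000):
    """Trapezoidal estimate of the integral of c * P(c) from 0 to `upper`."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        c = i * h
        weight = 0.5 if i in (0, steps) else 1.0  # end points count half
        total += weight * c * model(c, k1, k2)
    return total * h

k1, k2 = 0.05, 0.19
closed_form = (k1 + k2) / (k1 * k2)
estimate = numeric_mean(k1, k2)
print(f"closed form: {closed_form:.3f}, numerical: {estimate:.3f}")
```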

Note that this value is smaller than the journal impact factor that is reported (34.7). This is because highly cited papers (with more than 100 citations) have a substantial effect on the journal impact factor but are not well fit by the difference of exponential functions.
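The leverage of a few highly cited papers on an average is easy to illustrate with made-up numbers. The figures below are purely hypothetical and are not taken from any journal; the function name is also an assumption.

```python
def mean_with_outliers(base_mean, base_papers, outlier_citations, outlier_papers):
    """Mean after adding a small group of highly cited papers to a larger set."""
    total = base_mean * base_papers + outlier_citations * outlier_papers
    return total / (base_papers + outlier_papers)

# Hypothetical: 1,000 papers averaging 25 citations, plus 10 papers with 400 each.
shifted = mean_with_outliers(25.0, 1000, 400.0, 10)
print(f"mean rises from 25.0 to {shifted:.1f}")
```

In this toy case, about 1% of the papers raise the average by nearly 4 citations per paper.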

### Results for a collection of journals

With this fitting protocol in place, we can now fit the distributions for the other 10 journals.

*Nature*

The best fit occurs with k_{1} = 0.07 and k_{2} = 0.08.

The calculated JIF = 26.8.

*eLife*

The best fit occurs with k_{1} = 0.16 and k_{2} = 0.65.

The calculated JIF = 7.8.

*PLOS ONE*

The best fit occurs with k_{1} = 0.31 and k_{2} = 2.

The calculated JIF = 3.7.

*PLOS Biology*

The best fit occurs with k_{1} = 0.16 and k_{2} = 0.57.

The calculated JIF = 8.0.

*PLOS Genetics*

The best fit occurs with k_{1} = 0.18 and k_{2} = 0.92.

The calculated JIF = 6.6.

*Nature Communications*

The best fit occurs with k_{1} = 0.13 and k_{2} = 0.66.

The calculated JIF = 9.2.

*EMBO Journal*

The best fit occurs with k_{1} = 0.16 and k_{2} = 0.37.

The calculated JIF = 9.0.

*Proceedings of the Royal Society of London B*

The best fit occurs with k_{1} = 0.24 and k_{2} = 1.42.

The calculated JIF = 4.9.

*Journal of Informetrics*

The best fit occurs with k_{1} = 0.32 and k_{2} = 2.

The calculated JIF = 3.6.

*Scientific Reports*

The best fit occurs with k_{1} = 0.22 and k_{2} = 2.

The calculated JIF = 5.0.
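The calculated values above can be reproduced from the quoted parameters with a few lines of Python. The parameters and quoted JIFs are copied from the text; because the published k_{1} and k_{2} are rounded to two decimal places, a recomputed value can differ from the quoted one by a few hundredths.

```python
def jif_from_rates(k1, k2):
    """Mean of the normalized difference of exponentials: (k1 + k2) / (k1 * k2)."""
    return (k1 + k2) / (k1 * k2)

# journal: (k1, k2, calculated JIF as quoted in the text)
FITS = {
    "Science": (0.05, 0.19, 25.3),
    "Nature": (0.07, 0.08, 26.8),
    "eLife": (0.16, 0.65, 7.8),
    "PLOS ONE": (0.31, 2.0, 3.7),
    "PLOS Biology": (0.16, 0.57, 8.0),
    "PLOS Genetics": (0.18, 0.92, 6.6),
    "Nature Communications": (0.13, 0.66, 9.2),
    "EMBO Journal": (0.16, 0.37, 9.0),
    "Proceedings of the Royal Society of London B": (0.24, 1.42, 4.9),
    "Journal of Informetrics": (0.32, 2.0, 3.6),
    "Scientific Reports": (0.22, 2.0, 5.0),
}

for name, (k1, k2, quoted) in FITS.items():
    recomputed = jif_from_rates(k1, k2)
    print(f"{name:<45} {recomputed:6.2f}  (quoted {quoted})")
```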

### Analysis of calculated and observed journal impact factors

The calculated journal impact factors are well correlated with the observed values as shown below:

A line with slope 1 is shown for comparison. The overall Pearson correlation coefficient is 0.999. Fitting all 11 data points to a line through the origin yields a slope of 0.746. The fact that this slope is substantially less than 1 is largely driven by the values for *Science* and *Nature* which, as noted above, are lower than the reported values owing to the elimination of the effect of papers with more than 100 citations. If these two points are eliminated, the slope of a fitted line increases to 0.924.
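The two summary statistics used here can be computed as in the Python sketch below. The observed impact factors are not listed in this post, so the input values are made up solely to exercise the functions. They also show that a very high correlation can coexist with a slope well below 1.

```python
import math

def slope_through_origin(x, y):
    """Least-squares slope b for the model y = b * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up values (not the journals' data): calculated is exactly half of observed.
observed = [2.0, 4.0, 6.0, 8.0]
calculated = [1.0, 2.0, 3.0, 4.0]

slope = slope_through_origin(observed, calculated)
r = pearson(observed, calculated)
print(f"slope = {slope:.3f}, r = {r:.3f}")
```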

### Conclusions

We have demonstrated that a function formed as the difference of two exponential functions can be used to fit observed distributions of the numbers of papers with different numbers of citations. Fitting this functional form to data from 11 journals reproduces the curves well and generates journal impact factors that agree well with published values. The largest differences are in journals such as *Science* and *Nature* that have substantial numbers of papers with more than 100 citations over the 2-year period. This emphasizes again how these outlier papers can affect journal impact factor values.

### Next post

In my next post, I will demonstrate how this model can be refined to produce an algorithm that will generate a unique citation distribution given a journal impact factor. This sets the stage for more interesting analyses.

### Available code and documents

The R Markdown file that generates this post including the R code for fitting the citation distributions is available. The data from Larivière *et al.* is provided as a .csv file. A mathematical appendix showing the derivation of some key formulae is also available.