
Journal impact factors – Fitting citation distribution curves

Introduction to Sciencehound

Welcome to my new blog at Science. I began blogging when I was Director of the National Institute of General Medical Sciences (NIGMS) at the US National Institutes of Health (NIH). Our blog was called the NIGMS Feedback Loop. I found this to be a very effective way of sharing information and data with NIGMS stakeholders. A couple of years after leaving NIGMS, I started a new blog called Datahound. There, I have continued sharing data and analyses about programs of interest to the scientific community. I greatly appreciated those who took the time to comment, providing feedback and sometimes raising important questions. I am starting Sciencehound with the same intent, providing data and analyses and, importantly, initiating discussions with the readers of Science and the Science family of journals. Enjoy and join in!

Journal impact factors

Journal impact factors are used as metrics for the quality of academic journals. In addition, they are (ab)used as metrics for individual publications or individual scientists (see my editorial in Science). The journal impact factor is defined as the average number of times that articles published in a given journal over the previous 2 years are cited in a given year. This average is derived from a relatively broad distribution of publications with different numbers of citations. Recently, Larivière et al. posted a proposal on bioRxiv recommending that these full distributions be shared. This manuscript includes 2015 distributions for 11 journals (in a readily downloadable format). The distribution for Science magazine is shown below:

[Figure: 2015 citation distribution for Science]

Note that the point at 100 represents the total number of papers that received 100 or more citations.
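For readers who want to explore the data, the distributions can be read into R directly from the downloadable file. A minimal sketch follows; the file name and column names are assumptions, not necessarily the actual layout used by Larivière et al.:

    # Read the 2015 citation distributions and plot the Science column.
    dist <- read.csv("citation_distributions_2015.csv")   # assumed file name
    plot(dist$citations, dist$Science, type = "h",
         xlab = "Number of citations in 2015", ylab = "Number of papers",
         main = "Science citation distribution")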

Fitting citation curves as the difference of exponential functions

This curve rises quickly and then falls more slowly. As a chemist, I was reminded of the curves representing the concentration of an intermediate B in a reaction of the form

A -> B -> C.

The concentration of B rises when A is converted to B and then falls when B is transformed into C.
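For reference, with rate constants k1 for A -> B and k2 for B -> C and an initial concentration [A]0, the standard textbook result for the intermediate is

[B](t) = [A]0 (k1/(k2 – k1)) (exp(-k1t) – exp(-k2t)).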

Solving the rate equations for this scheme thus yields the difference between two exponential functions with negative exponents. Adopting the same functional form for the citation distribution, with the number of citations playing the role of time, gives

P(c) = N(exp(-k1c) – exp(-k2c)) with k1 < k2.

Here, c is the number of citations, P(c) is the population of papers with c citations, k1 and k2 are adjustable constants, and N is a scale factor. The curve rises with an initial slope of N(k2 – k1) and falls approximately exponentially as exp(-k1c).
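As a quick sketch of this function in R (the names are mine, not those in the post's own code):

    # Difference-of-exponentials curve: N is a scale factor, with k1 < k2.
    cite_curve <- function(cites, N, k1, k2) {
      N * (exp(-k1 * cites) - exp(-k2 * cites))
    }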

Before fitting the citation curve to this function, we first normalize the curve so that the area under the curve is 1.0 and the y-axis is the fraction of the total number of papers.
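Because the bins are one citation wide, this normalization simply divides each count by the total number of papers. Continuing the sketch above (column names again assumed):

    # Fraction of papers at each citation count; the fractions sum to 1.
    frac <- dist$Science / sum(dist$Science)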

[Figure: normalized 2015 citation distribution for Science]

This normalized curve can now be fit to the difference of exponential functions. It is easy to show that the corresponding normalization constant is N = k1k2/(k2 – k1) (see mathematical appendix).
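For completeness: the integral of exp(-k1c) – exp(-k2c) from 0 to infinity is 1/k1 – 1/k2 = (k2 – k1)/(k1k2), so requiring the curve to integrate to 1 gives N = k1k2/(k2 – k1).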

[Figure: difference-of-exponentials fit to the normalized Science citation distribution]

The best fit occurs with k1 = 0.05 and k2 = 0.19.
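The post's own fitting code is in the R Markdown file mentioned at the end; one way to reproduce a fit of this kind is ordinary nonlinear least squares, sketched below. Dropping the aggregated 100-or-more bin is my choice for the sketch, not necessarily what was done for the published fit.

    # Nonlinear least-squares fit of the normalized Science distribution.
    d <- data.frame(cites = dist$citations, frac = frac)
    d <- d[d$cites < 100, ]   # drop the aggregated 100+ point (assumption)
    fit <- nls(frac ~ (k1 * k2 / (k2 - k1)) *
                 (exp(-k1 * cites) - exp(-k2 * cites)),
               data = d, start = list(k1 = 0.05, k2 = 0.2))
    coef(fit)   # best-fit k1 and k2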

The apparent journal impact factor can be calculated from these parameters (see mathematical appendix). It can be shown that the journal impact factor (JIF) is:

JIF = (k1 + k2)/(k1k2).

The calculated JIF = 25.3.
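As a check, treating the number of citations as a continuous variable, the journal impact factor is simply the mean of the normalized distribution: the integral of cP(c) from 0 to infinity is N(1/k1^2 – 1/k2^2) = (k1 + k2)/(k1k2) = 1/k1 + 1/k2, and 1/0.05 + 1/0.19 ≈ 25.3.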

Note that this value is smaller than the journal impact factor that is reported (34.7). This is because highly cited papers (with more than 100 citations) have a substantial effect on the journal impact factor but are not well fit by the difference of exponential functions.

Results for a collection of journals

With this fitting protocol in place, we can now fit the distributions for the other 10 journals.
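In R, this amounts to looping the same nonlinear fit over each journal's normalized distribution. A sketch, assuming a data frame dist_norm with the citation counts in a column cites and one normalized column per journal (names mine):

    # Fit each journal and compute the implied impact factor, 1/k1 + 1/k2.
    fit_journal <- function(cites, frac) {
      nls(frac ~ (k1 * k2 / (k2 - k1)) * (exp(-k1 * cites) - exp(-k2 * cites)),
          start = list(k1 = 0.1, k2 = 0.5))
    }
    journal_cols <- setdiff(names(dist_norm), "cites")
    fits <- lapply(journal_cols, function(j)
      fit_journal(dist_norm$cites, dist_norm[[j]]))
    jif_calc <- setNames(sapply(fits, function(f) sum(1 / coef(f))), journal_cols)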

Nature

[Figure: fit to the normalized Nature citation distribution]

The best fit occurs with k1 = 0.07 and k2 = 0.08.

The calculated JIF = 26.8.

eLife

[Figure: fit to the normalized eLife citation distribution]

The best fit occurs with k1 = 0.16 and k2 = 0.65.

The calculated JIF = 7.8.

PLOS ONE

[Figure: fit to the normalized PLOS ONE citation distribution]

The best fit occurs with k1 = 0.31 and k2 = 2.

The calculated JIF = 3.7.

PLOS Biology

[Figure: fit to the normalized PLOS Biology citation distribution]

The best fit occurs with k1 = 0.16 and k2 = 0.57.

The calculated JIF = 8.0.

PLOS Genetics

[Figure: fit to the normalized PLOS Genetics citation distribution]

The best fit occurs with k1 = 0.18 and k2 = 0.92.

The calculated JIF = 6.6.

Nature Communications

[Figure: fit to the normalized Nature Communications citation distribution]

The best fit occurs with k1 = 0.13 and k2 = 0.66.

The calculated JIF = 9.2.

EMBO Journal

[Figure: fit to the normalized EMBO Journal citation distribution]

The best fit occurs with k1 = 0.16 and k2 = 0.37.

The calculated JIF = 9.0.

Proceedings of the Royal Society of London B

[Figure: fit to the normalized Proceedings of the Royal Society of London B citation distribution]

The best fit occurs with k1 = 0.24 and k2 = 1.42.

The calculated JIF = 4.9.

Journal of Informetrics

[Figure: fit to the normalized Journal of Informetrics citation distribution]

The best fit occurs with k1 = 0.32 and k2 = 2.

The calculated JIF = 3.6.

Scientific Reports

[Figure: fit to the normalized Scientific Reports citation distribution]

The best fit occurs with k1 = 0.22 and k2 = 2.

The calculated JIF = 5.0.

Analysis of calculated and observed journal impact factors

The calculated journal impact factors are well correlated with the observed values as shown below:

[Figure: calculated versus reported journal impact factors]

A line with slope 1 is shown for comparison. The overall Pearson correlation coefficient is 0.999. Fitting all 11 data points to a line through the origin yields a slope of 0.746. The fact that this slope is substantially less than 1 is largely driven by the values for Science and Nature, which, as noted above, are lower than the reported values owing to the elimination of the effect of papers with more than 100 citations. If these two points are eliminated, the slope of a fitted line increases to 0.924.
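For reference, the comparison itself takes only a couple of lines in R (jif_calc from the fits above, jif_reported holding the published values; names mine):

    cor(jif_calc, jif_reported)               # Pearson correlation (~0.999)
    coef(lm(jif_calc ~ 0 + jif_reported))     # slope of a line through the origin (~0.746)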

Conclusions

We have demonstrated that a function formed as the difference of two exponential functions can be used to fit observed distributions of the numbers of papers with different numbers of citations. Fitting this functional form to data from 11 journals reproduces the curves well and generates journal impact factors that agree well with published values. The largest differences are in journals such as Science and Nature that have substantial numbers of papers with more than 100 citations over the 2-year period. This emphasizes again how these outlier papers can affect journal impact factor values.

Available code and documents

The R Markdown file that generates this post, including the R code for fitting the citation distributions, is available. The data from Larivière et al. are provided as a .csv file. A mathematical appendix showing the derivation of some key formulae is also available.

Comments

  • SaG

    Do you think that the use of this equation tells us anything deeper about journal citations? Or is it just happenstance that a chemical reaction formula works so well?

    • Jeremy Berg

      Welcome SaG. I do not think there is any terribly deep reason that the formula works. The tail going to higher citation number is approximately exponential so the fact that my formula fits that part is not surprising. The rise at low citation numbers occurs over only a few points so that many functional forms could probably be made to work. I will continue to think (and perhaps model) why these distributions have the shapes that they do.

      • Laura

        Do you have a way to ignore the autocitations in the calculations? I think the initial rise may be due to the fact that harvesting a few citations is easy, b/c they are mainly autocitations. If that is so, the rise would be less sharp for low numbers of citations, I think.

        • Jeremy Berg

          This is an interesting hypothesis. I can check to see if anyone has looked at this and, if not, do some initial analysis.

          • Laura

            Let me know! 😀

  • NATHAN

    how related is this model to Greenwood 2007? http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2206035/

    • Jeremy Berg

      Nathan: Thanks for pointing this out. I was not familiar with this paper. I need to look at it more carefully, but I think this is a deeper analysis of the precision (or lack thereof) of journal impact factors. This is a separate point (although I certainly agree with the author’s conclusions based on my analysis). My analysis more empirically fits the distribution underlying journal impact factors.

  • Gopher63

    Jeremy, (ab)use (love the use of the term) of personal citation metrics, including the h-index, etc., is also controversial, with a rapidly developing literature. Bibliometrics and altmetrics in general are also rapidly expanding fields, with at least two books published recently. (I’ve reviewed the books and published an article in J Chem Ed on the pedagogic application of these metrics.) Will you also be discussing these controversies?

    • Jeremy Berg

      They are certainly on my list for future topics (but it is a long list).

  • Jeremy Berg

    As noted on Twitter (see https://twitter.com/stuartcantrill/status/761962007646134272), two older blog posts discuss some of the problems of concluding much about the number of citations for any given paper based on the journal impact factor. See https://stuartcantrill.com/2016/01/23/imperfect-impact/ and http://occamstypewriter.org/scurry/2012/08/13/sick-of-impact-factors/ . There are others as well.
