In the previous post, I developed a tool that could generate an approximate citation distribution for a journal given its journal impact factor (JIF). We can now use this tool to address a key question:
If you select one paper randomly from a distribution associated with a journal impact factor JIF_1 (Journal_1) and another paper randomly from a distribution associated with a journal impact factor JIF_2 (Journal_2), what is the probability that the first paper has more citations than the second paper?
In the first post in this series, I demonstrated that normalized plots of the number of publications versus the number of citations could be fit to functions of the form:
P(c) = N(exp(-k1c) – exp(-k2c)) with k1 < k2
where c is the number of citations, P(c) is the population of papers with c citations, and N is a normalization factor set so that the integrated total population is 1.
Using the tool developed in the previous post, values for k1 and k2 can be estimated given only a value for the JIF.
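The normalization factor can be written out explicitly: integrating exp(-k1c) – exp(-k2c) from 0 to infinity gives (1/k1) – (1/k2), so N = k1k2/(k2 – k1). The short Python sketch below checks this numerically. It is only an illustration (the code behind this post is in R), and the values of k1 and k2 are arbitrary placeholders, not parameters fitted to any real JIF.

```python
import math

def normalization(k1, k2):
    """N such that N*(exp(-k1*c) - exp(-k2*c)) integrates to 1 over c >= 0."""
    if not 0 < k1 < k2:
        raise ValueError("requires 0 < k1 < k2")
    return k1 * k2 / (k2 - k1)

def citation_density(c, k1, k2):
    """Normalized citation curve P(c)."""
    return normalization(k1, k2) * (math.exp(-k1 * c) - math.exp(-k2 * c))

def integrate(f, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Arbitrary illustrative parameters (assumption, not fitted values).
k1, k2 = 0.2, 1.0

# The upper limit 200 stands in for infinity; the tail beyond it is negligible here.
total = integrate(lambda c: citation_density(c, k1, k2), 0.0, 200.0)
print(total)  # should be very close to 1
```

The upper integration limit stands in for infinity, so it needs to be large relative to 1/k1 for whatever parameters are used.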
We now turn to the question at hand. Suppose that a paper randomly selected from Journal_1 has x citations. The probability that this paper has more citations than a paper from Journal_2 is shown graphically below:
The blue shaded region represents the fraction of papers in Journal_2 that have x or fewer citations. Note that the citation distribution curves are presented as continuous functions for clarity (although in actuality, the numbers of citations must be integers). The use of continuous functions will also substantially simplify subsequent analysis and is unlikely to change any conclusions significantly.
In mathematical terms, the area of the blue shaded region is given by F(x) = ∫P2(c)dc from 0 to x, where P2(c) is the normalized citation curve for Journal_2, that is, P2(c) = N2(exp(-k12c) – exp(-k22c)) with k12 and k22 determined from JIF_2 as described in the previous post.
The integral can be readily solved analytically as shown in the mathematical appendix. In this way, it can be shown that the fraction of papers from Journal_2 with x or fewer citations is given by
F(x) = N2((1/k12)(1 – exp(-k12x)) – (1/k22)(1 – exp(-k22x))).
This function is shown graphically below:
This fraction curve has the expected shape. For small values of x, only a small fraction of papers in the other journal have x or fewer citations. For larger values of x, this fraction increases. Finally, for the largest values of x, the fraction approaches 1.00, that is, almost all papers in Journal_2 have x or fewer citations.
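The behavior of F(x) described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming N2 = k12k22/(k22 – k12) (which follows from requiring the curve to integrate to 1) and using arbitrary placeholder values for k12 and k22 rather than values fitted to a real JIF.

```python
import math

def fraction_at_or_below(x, k1, k2):
    """F(x): fraction of papers with x or fewer citations (continuous model)."""
    n = k1 * k2 / (k2 - k1)  # normalization factor
    return n * ((1 - math.exp(-k1 * x)) / k1 - (1 - math.exp(-k2 * x)) / k2)

# Arbitrary illustrative parameters for Journal_2 (assumption, not fitted values).
k12, k22 = 0.2, 1.0

# F starts at 0, rises with x, and levels off near 1 for large x.
for x in (0, 1, 5, 20, 100):
    print(x, round(fraction_at_or_below(x, k12, k22), 4))
```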
The probability that a paper with x citations is randomly chosen from Journal_1 is given by P1(x) = N1(exp(-k11x) – exp(-k21x)). Thus, to answer our question, we need only calculate the average of the fraction curve, F(x), weighted by the probability of different values of x. This is given by the integral Probability(JIF_1, JIF_2) = ∫P1(x)F(x) dx from 0 to infinity.
The functions to be integrated, P1(x)F(x), are plotted below for JIF_2 = 3 and JIF_1 = 1, 5, 10, and 20.
The areas under these curves can be estimated. From the graph, the anticipated results are relatively clear. For JIF_1 = 1, the area is relatively small, so the probability is relatively low. For JIF_1 = 3, the area is larger: because JIF_1 = JIF_2 at this point, the two distributions are identical and the area should be 0.50, that is, a paper from either journal should have more citations than a paper from the other 50% of the time. The areas continue to increase for JIF_1 = 5 through JIF_1 = 20, approaching 1.00.
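The area argument can be checked by integrating P1(x)F(x) numerically. The Python sketch below is an illustration only: each journal is represented by a hypothetical (k1, k2) pair chosen by hand, not by parameters derived from an actual JIF, and a simple midpoint rule with a finite upper limit stands in for the integral to infinity. A smaller k1 shifts the distribution toward more citations.

```python
import math

def density(x, k1, k2):
    """Normalized citation curve P(x)."""
    n = k1 * k2 / (k2 - k1)
    return n * (math.exp(-k1 * x) - math.exp(-k2 * x))

def fraction_at_or_below(x, k1, k2):
    """F(x): integral of the normalized curve from 0 to x."""
    n = k1 * k2 / (k2 - k1)
    return n * ((1 - math.exp(-k1 * x)) / k1 - (1 - math.exp(-k2 * x)) / k2)

def probability_numeric(journal_1, journal_2, upper=400.0, steps=200_000):
    """Midpoint-rule estimate of the integral of P1(x)*F2(x) from 0 to `upper`."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += density(x, *journal_1) * fraction_at_or_below(x, *journal_2)
    return total * h

# Hypothetical (k1, k2) pairs, chosen only for illustration.
journal_2 = (0.2, 1.0)
same = probability_numeric(journal_2, journal_2)         # identical distributions
more_cited = probability_numeric((0.1, 1.0), journal_2)  # Journal_1 shifted to more citations
less_cited = probability_numeric((0.4, 1.0), journal_2)  # Journal_1 shifted to fewer citations

print(same, more_cited, less_cited)
```

With identical parameters for the two journals the estimate should come out at essentially 0.50, and it should move above or below 0.50 as the first journal's distribution shifts toward more or fewer citations.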
The expression can be integrated analytically in a straightforward way (as shown in the mathematical appendix), although the algebra is a bit involved. Thus, the desired probability (the integrated value) is given by
Probability(JIF_1, JIF_2) = N1N2((1/k12)((1/k11) – (1/k21) – (1/(k11 + k12)) + (1/(k21 + k12))) – (1/k22)((1/k11) – (1/k21) – (1/(k11 + k22)) + (1/(k21 + k22)))).
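The closed-form expression for Probability(JIF_1, JIF_2) can be sanity-checked with two properties that must hold in the continuous model: identical distributions give exactly 0.50, and swapping the two journals gives the complementary probability. The Python sketch below does this, assuming N = k1k2/(k2 – k1) for each journal and using hypothetical parameter values rather than fitted ones.

```python
def normalization(k1, k2):
    """N such that N*(exp(-k1*c) - exp(-k2*c)) integrates to 1 over c >= 0."""
    return k1 * k2 / (k2 - k1)

def probability_closed_form(k11, k21, k12, k22):
    """Probability that a Journal_1 paper has more citations than a Journal_2 paper."""
    n1 = normalization(k11, k21)
    n2 = normalization(k12, k22)

    def bracket(k):
        # Integral of (exp(-k11*x) - exp(-k21*x)) * (1 - exp(-k*x)) from 0 to infinity.
        return 1 / k11 - 1 / k21 - 1 / (k11 + k) + 1 / (k21 + k)

    return n1 * n2 * (bracket(k12) / k12 - bracket(k22) / k22)

# Hypothetical parameters (assumption, not fitted values).
a = (0.1, 1.0)
b = (0.2, 1.0)

print(probability_closed_form(*b, *b))                                  # identical journals
print(probability_closed_form(*a, *b) + probability_closed_form(*b, *a))  # complementary pair
```

The first value should be 0.50 and the second 1.00, up to floating-point rounding.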
This function is plotted below for JIF_2 = 3 with JIF_1 values ranging from 1 to 30, with the values corresponding to the plot above indicated.
The curve represents the answer to our question for a journal with a JIF of 3. For example, a paper selected randomly from a journal with a JIF of 5 would be expected to have more citations than a randomly selected paper from a journal with a JIF of 3 only 65% of the time. This is only 15 percentage points above the 50% expected if the outcome were completely random, illustrating the lack of justification for interpreting small (or even fairly large) differences in JIFs when judging individual papers.
To amplify this further, the analogous plot for JIF_2 = 10 is shown below:
This plot demonstrates that a paper randomly selected from a journal with JIF = 5 will have more citations than a randomly selected paper from a journal with JIF = 10 approximately 30% of the time; equivalently, the paper from the journal with JIF = 10 will have more citations only about 70% of the time. Similarly, a paper randomly selected from a journal with JIF = 10 will have fewer citations than a randomly selected paper from a journal with JIF = 20 approximately 75% of the time. These modest differences in probability for doubling JIFs highlight the folly of interpreting small differences in JIFs that are sometimes reported with three decimal places. From a scientific perspective, such false precision is utterly inappropriate. I hope that this analysis, along with the many other extant criticisms of the use and abuse of JIFs, will encourage scientists and administrators to use JIFs only in contexts for which they are appropriate.
Available code and documents
The R Markdown file that generates this post including the R code is available. The parameters from the linear fits from the previous post are available as a .csv file. A mathematical appendix showing the derivation of key formulae is also available.