###Generation of predicted citation distributions from journal impact factor values
In [my previous post](http://blogs.sciencemag.org/sciencehound/2016/08/04/journal-impact-factors-fitting-citation-distribution-curves/), I demonstrated how the citation distribution for a given journal could be fit to a function defined as the difference of two exponentials. This function is characterized by two parameters, k~1~ and k~2~. I showed how these two parameters could be used to derive the journal impact factor (JIF). However, for a variety of purposes, it will be useful to be able to generate an approximate citation distribution given the JIF value, that is, to solve the inverse problem. This should be possible if we can discern relationships between the JIF value and the parameters k~1~ and k~2~.
###The relationships between k~1~ and k~2~ and JIF
First, let us plot k~1~ values versus the JIF values calculated from the fits to the observed distributions. I chose to use the calculated JIF values rather than the actual JIF values since the later are distorted by the same number of papers with more that 100 citations and these lie off the distribution.
```{r Read in data and plot k1 versus JIF_calc, echo = FALSE, cache = FALSE}
suppressMessages(library(ggplot2))
Exp_coeff <- read.csv("Exp_coeff.csv")
p_1 <- ggplot(Exp_coeff, aes(x = k1, y = JIF_calc))
p_1 <- p_1 + xlab("Journal impact factor")
p_1 <- p_1 + ylab("k1")
p_1 <- p_1 + geom_point(aes(x = k1, y = JIF_calc))
# Save png of plot for blog
suppressMessages(ggsave("k1_vs_JIF_plot.png", dpi = 600))
print(p_1)
```
As might be anticipated, the relationship between k~1~ and the JIF value is approximately exponential. This can be confirmed by plotting k~1~ versus the logarithm of the JIF value. The results can be fit to a line.
```{r Plot k1 versus log(JIF_calc), echo = FALSE, cache = FALSE}
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
Exp_coeff <- mutate(Exp_coeff, Log_JIF_calc = log(JIF_calc))
fit_k1_log_JIF <- lm(k1 ~ Log_JIF_calc, data = Exp_coeff)
p_2 <- ggplot(Exp_coeff, aes(x = k1, y = Log_JIF_calc))
p_2 <- p_2 + xlab("log(Journal impact factor)")
p_2 <- p_2 + ylab("k1")
p_2 <- p_2 + geom_point(aes(x = k1, y = Log_JIF_calc))
p_2 <- p_2 + geom_smooth(method = "lm", se = FALSE)
# Save png of plot for blog
suppressMessages(ggsave("k1_vs_log(JIF)_plot.png", dpi = 600))
print(p_2)
```
Similarly, k~2~ is also approximately related to the log of the JIF value although there is somewhat more scatter.
```{r Plot k2 versus log(JIF_calc), echo = FALSE, cache = FALSE}
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
fit_k2_log_JIF <- lm(k2 ~ Log_JIF_calc, data = Exp_coeff)
p_3 <- ggplot(Exp_coeff, aes(x = k2, y = Log_JIF_calc))
p_3 <- p_3 + xlab("log(Journal impact factor)")
p_3 <- p_3 + ylab("k2")
p_3 <- p_3 + geom_point(aes(x = k2, y = Log_JIF_calc))
p_3 <- p_3 + geom_smooth(method = "lm", se = FALSE)
# Save png of plot for blog
suppressMessages(ggsave("k2_vs_log(JIF)_plot.png", dpi = 600))
print(p_3)
```
Using these two linear fits, we can estimate the values of k~1~ and k~2~ given a value for the JIF.
```{r Calculate k1 and k2 and adjust them, echo = FALSE, warning = FALSE, cache = FALSE}
suppressMessages(library(dplyr))
# A function to calculate k1 and k2 from fits
Calculate_k1_k2 <- function(JIF_in) {
log_JIF_in <- log(JIF_in)
k1_calc <- coefficients(fit_k1_log_JIF)[1] + coefficients(fit_k1_log_JIF)[2] * log_JIF_in
k2_calc <- coefficients(fit_k2_log_JIF)[1] + coefficients(fit_k2_log_JIF)[2] * log_JIF_in
# Save coefficients for later use
Linear_fit_coefficients <- numeric()
Linear_fit_coefficients[1] <- coefficients(fit_k1_log_JIF)[1]
Linear_fit_coefficients[2] <- coefficients(fit_k1_log_JIF)[2]
Linear_fit_coefficients[3] <- coefficients(fit_k2_log_JIF)[1]
Linear_fit_coefficients[4] <- coefficients(fit_k2_log_JIF)[2]
write.csv(Linear_fit_coefficients, file = "Linear_fit_coefficients.csv")
# k1 and k2 will be adjusted by a constant (small) amount to ensure that the JIF calculated from them matches the input JIF
# Calculate correction (from quadratic equation)
a_1 <- JIF_in
b_1 <- ((k1_calc + k2_calc) * JIF_in - 2)
c_1 <- k1_calc * k2_calc * JIF_in - k1_calc - k2_calc
r_1 <- (-b_1 + sqrt(b_1 ^ 2 - 4 * a_1 * c_1)) / (2 * a_1)
# Adjust k1 and k2 and output result
k12_calc <- c(k1_calc, k2_calc, r_1, k1_calc + r_1, k2_calc + r_1)
}
k12_calc_out <- Calculate_k1_k2(2.0)
k12_calc_out <- rbind(k12_calc_out, Calculate_k1_k2(3.0))
k12_calc_out <- rbind(k12_calc_out, Calculate_k1_k2(4.0))
k12_calc_out <- rbind(k12_calc_out, Calculate_k1_k2(5.0))
k12_calc_out <- rbind(k12_calc_out, Calculate_k1_k2(6.0))
k12_calc_out <- rbind(k12_calc_out, Calculate_k1_k2(7.0))
k12_calc_out <- rbind(k12_calc_out, Calculate_k1_k2(8.0))
k12_calc_out <- rbind(k12_calc_out, Calculate_k1_k2(9.0))
k12_calc_out <- rbind(k12_calc_out, Calculate_k1_k2(10.0))
k12_calc_out_df <- data.frame(k12_calc_out)
colnames(k12_calc_out_df) <- c("k1_calc", "k2_calc", "r_1", "k1_calc + r_1", "k2_calc + r_1")
k12_calc_out_df <- mutate(k12_calc_out_df, k1_change <- abs(r_1/k1_calc))
k12_calc_out_df <- mutate(k12_calc_out_df, k2_change <- abs(r_1/k2_calc))
```
To ensure that the deduced values of k~1~ and k~2~ will generate the same value of the JIF using the formula from the [previous post](http://blogs.sciencemag.org/sciencehound/2016/08/04/journal-impact-factors-fitting-citation-distribution-curves/), we can adjust the values of k~1~ and k~2~ slightly by a constant value. Thus, we need to find delta(k) such that k~1'~ = k~1~ + delta(k) and k~2'~ = k~2~ + delta(k) such that JIF = (k~1'~ + k~2'~)/k~1'~k~2'~. The derivation for the optimal value of delta(k) is shown in the mathematical appendix. For JIF values from 2 to 10, this correction averages `r round(mean(k12_calc_out_df$k1_change) * 100, 0)` percent for k~1~ and `r round(mean(k12_calc_out_df$k2_change) * 100, 0)` percent for k~2~.
###Estimating a citation distribution from a JIF value
Consider the case of *EMBO Journal* which has a reported JIF of 9.6 and a JIF calculated from the fit values of k~1~ and k~2~ of 9.0. The observed distribution is compared with the distribution predicted from the calculated JIF value below:
```{r Analyze EMBO J. fits, echo = FALSE, warning = FALSE, cache = FALSE}
IF_df <- read.csv("Impact_Factor_Data_Lariviere.csv", stringsAsFactors = FALSE)
# A function to build data frame for a selected journal
Build_df <- function(Journal_name) {
Citations <- "Citations"
IF_df_select <- IF_df[0:101, c(Citations, Journal_name)]
IF_df_select[ , Citations] <- as.numeric(IF_df[ , Citations])
IF_df_select[ , Journal_name] <- as.numeric(IF_df[ , Journal_name])
IF_df_select
}
# A function to normalize the observed citation histogram
Normalize_citations <- function(select_df) {
Paper_total <- sum(select_df[ , 2])
Fraction_of_Total_Papers = select_df[ , 2] / Paper_total
select_df <- cbind(select_df, Fraction_of_Total_Papers)
}
# A function to graph the fit between observed and calculated number of paper distributions}
Num_dist_graph <- function(k1, k2, select_df){
suppressMessages(library(ggplot2))
suppressMessages(library(reshape2))
citations <- 0:200
Num_dist <- data.frame(citations)
Norm_constant <- (k1 * k2) / (k2 - k1)
Fraction_papers_calc <- Norm_constant * (exp(-k1 * citations) - exp(-k2 * citations))
Num_dist <- cbind(Num_dist, Fraction_papers_calc)
high_cites <- sum(Fraction_papers_calc[102:201])
Num_dist[101, 2] <- high_cites
Num_dist <- Num_dist[1:101, ]
Fraction_papers_obs <- select_df[ , 3]
Num_dist <- cbind(Num_dist, Fraction_papers_obs)
Num_dist_melt <- melt(Num_dist, id="citations")
p_comp <- ggplot(Num_dist_melt, aes(x = citations, y = value))
p_comp <- p_comp + xlab("Number of Citations")
p_comp <- p_comp + ylab("Fraction of Total Papers")
p_comp <- p_comp + geom_point(aes(x = citations, y = value, color = variable))
p_comp <- p_comp + geom_line(aes(x = citations, y = value, color = variable))
p_comp <- p_comp + scale_color_manual(values=c("black","red"))
}
IF_EMBO_J_df <- Build_df("EMBO_J")
IF_EMBO_J_df <- Normalize_citations(IF_EMBO_J_df)
JIF_EMBO_J <- 9.0
k_12_EMBO_J <- Calculate_k1_k2(JIF_EMBO_J)
p_EMBO_J <- Num_dist_graph(k_12_EMBO_J[4], k_12_EMBO_J[5], IF_EMBO_J_df)
# Save png of plot for blog
suppressMessages(ggsave("EMBO_J_JIF_plot.png", dpi = 600))
print(p_EMBO_J)
```
The agreement appears reasonable. Similar results are observed for other journals (not shown). Thus, we have developed a tool with which can estimate the distribution of citations given only a journal impact factor.
###Next post
In my next post, I will use this tool to address a key question:
####If you select one paper randomly from a distribution associated with a journal impact factor JIF_1 and another paper randomly from a distribution associated with a journal impact factor JIF_2, what is the probability that the first paper has more citations than the second paper?
###Available code and documents
The [R Markdown](http://rmarkdown.rstudio.com) file that generates this post including the R code is available. The data from LariviÃ¨re *et al.* is provided as a .csv file. A mathematical appendix showing the derivation of a key formula is also available.