The “gender” method is based on the frequency of genders associated with particular first names in large databases. The “gender” package yields a default gender assignment based on a cutoff of 0.50 for the name frequencies within the database but also returns the actual frequencies. The default method produced reasonable accuracy for a test set of 4789 individuals with user-provided gender information. For 3403 of these individuals, gender could be inferred using the “gender” package. The accuracy of this inference differed between males and females: 98.1% for males but only 84.0% for females. This suggests that potential *Science* authors are more likely to be male than would be expected from the database used by the “gender” package, and that inference accuracy could possibly be improved by adjusting the frequency cutoff used for gender inference.

A histogram of the name frequency data for the 3403 individuals is shown below.

This histogram shows clear separation between male and female individuals in terms of male frequency in the database. Almost all of the individuals on the right of the graph with male frequencies greater than 0.50 are male (blue), whereas almost all of the individuals on the left with male frequencies less than 0.50 are female (red).

The cutoff parameter of 0.50 can be adjusted. If a higher cutoff is used, the gender of more individuals will be inferred to be female. However, many of these individuals will actually be male. This will increase the accuracy of inference for males, but will decrease it for females. Conversely, if a lower cutoff is used, the gender of more individuals will be inferred to be male, with the opposite effect on accuracy.

The trade-off can be depicted with a receiver operating characteristic (ROC) curve, a plot of the sensitivity (the true-positive rate) versus 1 minus the specificity (the false-positive rate). The ROC curve for inferring female gender is shown below.
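A minimal sketch of how such a curve can be constructed from (male-frequency, true-gender) pairs, treating “female” as the positive class (the sample data here are hypothetical, not the real name set):

```python
# Sketch of computing ROC points for inferring "female" gender: a name is
# called female when its male frequency falls below the cutoff, and the
# sensitivity and false-positive rate are tallied at each cutoff.
def roc_points(samples, cutoffs):
    points = []
    for c in cutoffs:
        tp = sum(1 for freq, g in samples if freq < c and g == "F")
        fn = sum(1 for freq, g in samples if freq >= c and g == "F")
        fp = sum(1 for freq, g in samples if freq < c and g == "M")
        tn = sum(1 for freq, g in samples if freq >= c and g == "M")
        sensitivity = tp / (tp + fn)   # true-positive rate
        fpr = fp / (fp + tn)           # 1 - specificity
        points.append((fpr, sensitivity))
    return points
```

Sweeping the cutoff from 0 to 1 traces the curve from (0, 0) to (1, 1).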

For a perfectly accurate test, the ROC curve would form a perfect right angle at the upper left-hand corner.

We can directly examine the variation in the accuracy of inferred genders as a function of the cutoff parameter.

As could be anticipated from the relatively clean separation shown in the first figure, the accuracy curve is relatively constant over a wide range of cutoff values. However, the maximum accuracy is found at a cutoff of 0.11 (shown with the red arrow) rather than 0.50. The accuracy for this cutoff is 95.2%, somewhat higher than for the earlier results with a cutoff of 0.50 (93.7%). Although this may seem like a small change, it represents a reduction in the gender inference error rate from 6.3 to 4.8%, a reduction in relative terms of 24%. With a cutoff of 0.11, the accuracy for males was 96.7% and for females, 91.0%. Thus, with the new cutoff, the accuracy for males decreased slightly (by 1.4%), but the accuracy for females increased by 7.0%. Using a cutoff of 0.11 both improved the overall accuracy and decreased the discrepancy in accuracy for females compared to males, suggesting that it may be more suitable than the default cutoff for use in further analyses.
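The cutoff scan just described can be sketched as follows (the sample of (male-frequency, true-gender) pairs and the candidate cutoffs are hypothetical, not the real 3403-name test set):

```python
# Sketch of scanning candidate cutoffs for the value that maximizes
# overall accuracy: a name is inferred male when its male frequency is
# at or above the cutoff, and female otherwise.
def best_cutoff(samples, cutoffs):
    def accuracy(c):
        correct = sum(1 for freq, g in samples
                      if (g == "M") == (freq >= c))
        return correct / len(samples)
    return max(cutoffs, key=accuracy)
```

In practice the cutoffs would be a fine grid over (0, 1), and the per-gender accuracies would be tracked alongside the overall accuracy.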

We can apply this new cutoff to the dataset of 2182 (out of 3568) names from papers published and rejected in 2015 for which genders were determined from web searches and could be inferred using the “gender”-based tool. For this dataset, the accuracy with a cutoff of 0.11 is 95.7%, somewhat higher than for the earlier results with a cutoff of 0.50 (94.7%). With a cutoff of 0.11, the accuracy for males was 97.2% and for females, 90.6%. Again, with the new cutoff, the accuracy for males decreased slightly and the accuracy for females increased.

Treating the 2015 dataset as an independent test set reveals that the maximum accuracy is found at a cutoff of 0.19 rather than 0.50. The accuracy for this cutoff is 95.8%, only very slightly improved over that obtained with a cutoff of 0.11. For future calculations, we will use the average of these cutoff values, 0.15.

With the gender inference tool optimized, we now turn to questions regarding correlations between the genders of the corresponding authors of Reports in *Science* and the genders of the first authors. More specifically, we will examine Reports submitted to *Science* from 2010 to 2017 that have a single corresponding author and for which the genders of both the corresponding author and the first author could be inferred using the “gender”-based method. Of the 71,275 Reports with first and corresponding authors submitted over this period, 66,057 have a single corresponding author. Note that the percentage of Reports with a single corresponding author dropped from 100 to 76% from 2010 to 2017 because of changes in journal policy, data capture, and levels of collaboration in the scientific community.

Within this set of Reports, we first excluded Reports where the corresponding author and the first author were the same individual. Reports with different corresponding and first authors accounted for 59% of total submissions. This varied considerably between fields (Reports were divided among fields based on the editors who handled the submissions) from 77% in the life sciences, to 69% in the physical sciences, to 24% in other fields (including ecology, evolution, social sciences, and others), highlighting differences in both scientific processes and author-order practices in different disciplines.

For the submitted Reports with different corresponding and first authors (where both genders could be inferred by the “gender”-based tool), we calculated the number of submissions with each of the four possible combinations of male and female first and corresponding authors and divided each by the number expected if the genders of the first and corresponding authors were independent of one another. More explicitly, these paired-author ratios are defined as follows:

Ratio_FF = (Number of submissions with female corresponding author and female first author) / (Total number of submissions × fraction of females among corresponding authors × fraction of females among first authors)

Ratio_FM = (Number of submissions with female corresponding author and male first author) / (Total number of submissions × fraction of females among corresponding authors × fraction of males among first authors)

Ratio_MF = (Number of submissions with male corresponding author and female first author) / (Total number of submissions × fraction of males among corresponding authors × fraction of females among first authors)

Ratio_MM = (Number of submissions with male corresponding author and male first author) / (Total number of submissions × fraction of males among corresponding authors × fraction of males among first authors)
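These definitions can be sketched in code, given counts of the four gender combinations (the counts in the usage example are made up for illustration):

```python
# Sketch of the paired-author ratios defined above: each observed count
# is divided by the count expected if corresponding-author gender and
# first-author gender were independent.
def paired_ratios(n_ff, n_fm, n_mf, n_mm):
    total = n_ff + n_fm + n_mf + n_mm
    f_corr = (n_ff + n_fm) / total   # fraction of females among corresponding authors
    f_first = (n_ff + n_mf) / total  # fraction of females among first authors
    expected = {
        "FF": total * f_corr * f_first,
        "FM": total * f_corr * (1 - f_first),
        "MF": total * (1 - f_corr) * f_first,
        "MM": total * (1 - f_corr) * (1 - f_first),
    }
    observed = {"FF": n_ff, "FM": n_fm, "MF": n_mf, "MM": n_mm}
    return {k: observed[k] / expected[k] for k in observed}
```

If the two genders are independent, all four ratios come out to 1.0 by construction; for example, `paired_ratios(10, 10, 10, 10)` returns 1.0 for every combination.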

To estimate the uncertainties due to the error rates for gender inference (approximately 3% for males and 9% for females), we performed simulations in which inferred genders were randomly flipped according to these error probabilities, and the paired-author ratios were recalculated.
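A minimal sketch of this kind of error-propagation simulation, recomputing Ratio_FF on each trial (the author pairs, error rates, and trial count are illustrative assumptions, not the actual dataset or analysis code):

```python
import random

# Sketch of the error-propagation simulation described above: each
# inferred gender is randomly flipped with its estimated error rate,
# and Ratio_FF is recomputed for each trial. The spread of the results
# across trials estimates the uncertainty in the ratio.
def simulate_ratio_ff(pairs, p_err_m=0.03, p_err_f=0.09,
                      n_trials=1000, seed=0):
    rng = random.Random(seed)

    def flip(g):
        p_err = p_err_m if g == "M" else p_err_f
        return ("F" if g == "M" else "M") if rng.random() < p_err else g

    results = []
    for _ in range(n_trials):
        flipped = [(flip(c), flip(f)) for c, f in pairs]
        total = len(flipped)
        f_corr = sum(1 for c, _ in flipped if c == "F") / total
        f_first = sum(1 for _, f in flipped if f == "F") / total
        n_ff = sum(1 for c, f in flipped if c == f == "F")
        expected = total * f_corr * f_first
        results.append(n_ff / expected if expected else float("nan"))
    return results
```

Percentiles of the returned list (for example, the 2.5th and 97.5th) would then give a 95% confidence interval of the kind reported below.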

For the full dataset, the paired-author ratios were as follows:

Ratio_FF = 1.197 ± 0.028

Ratio_FM = 0.912 ± 0.014

Ratio_MF = 0.958 ± 0.006

Ratio_MM = 1.018 ± 0.003

where the uncertainties are reported as 95% confidence intervals based on the error rates for gender inference.

The paired-author ratio for female corresponding authors and female first authors is by far the largest of the four ratios. This indicates that there are 20% more Reports submitted by pairs of a female corresponding author and a female first author than would be expected if corresponding author gender and first author gender were uncorrelated. By contrast, the paired-author ratio for male corresponding authors and male first authors is barely different from 1.0. The paired-author ratios for female corresponding author and male first author pairs and male corresponding author and female first author pairs are less than 1.0, as must be the case because the weighted average of the four ratios must be 1.0.

The paired-author ratios also vary across fields. These ratios minus 1.0 (to better visualize the differences) are plotted for the different fields below.

In all fields, the ratio_FF is substantially larger than the ratio_MM. These ratios also vary substantially from field to field, being largest in the “other” category (fields other than life and physical sciences) and smallest in the physical sciences. Indeed, in the physical sciences, none of the ratios are substantially different from 1.0 when the errors associated with gender inference are taken into account. There are a variety of possible explanations for these variations, but these cannot be evaluated from these data alone.

These results can also be examined over time (although the uncertainties due to errors in gender inference become larger owing to smaller sample sizes). The results for ratio_FF and ratio_MM are plotted below for the entire dataset and by field below.

In general, the paired-author ratios are relatively stable from year to year, with no notable trends (given the remaining gender-inference uncertainty). This uncertainty is quite dominant for ratio_FF in the physical sciences and the other category, owing to the relatively low numbers of females among these authors.

As a final comparison, we looked at these ratios for published Reports compared to overall submissions. These values are shown below.

Overall submission paired-author ratios:

Ratio_FF = 1.197 ± 0.028

Ratio_FM = 0.912 ± 0.014

Ratio_MF = 0.958 ± 0.006

Ratio_MM = 1.018 ± 0.003

Published Report paired-author ratios:

Ratio_FF = 1.236 ± 0.091

Ratio_FM = 0.913 ± 0.033

Ratio_MF = 0.952 ± 0.019

Ratio_MM = 1.018 ± 0.007

The values for Ratio_FM, Ratio_MF, and Ratio_MM are essentially identical between the overall submissions and published Report datasets. The value of Ratio_FF appears to be very slightly larger in the published Report dataset, but the difference is well within the overlap of 95% confidence intervals, which are largest for this category because of the relatively smaller numbers of female authors.

We have refined the gender inference tool based on the “gender” package by adjusting the cutoff parameter. This increases the overall accuracy of gender inference from 93.7 to 95.2%, with a substantial increase in the accuracy of inferring female gender. Using this tool, we examined a set of approximately 39,000 Reports submitted from 2010 to 2017 that have both corresponding and first authors for whom gender could be inferred. We found that Reports with a female corresponding author and a female first author are present at a level 20% higher than would be expected if the genders of the corresponding and first authors were uncorrelated, although this was not true to nearly the same extent for male authors. This phenomenon was most pronounced in fields other than the life sciences or physical sciences and was of very low magnitude in the physical sciences. It has been relatively stable over the period studied.

Since publishing is one of the most important measures of scientific accomplishment and a key parameter in professional advancement, the gender distribution of authors of scientific papers is a topic of considerable importance, potentially revealing information about gender distributions across the communities of science as well as highlighting possible points of gender bias. Early in my time at *Science*, we reported, in an editorial and a *Sciencehound* post, an initial analysis of the gender distributions for papers published in *Science* in 2015, together with a randomly selected set of a similar number of papers that had been submitted but rejected for publication in *Science* over the same period. The genders of the first and corresponding authors of these papers were inferred manually through web searches. The major conclusions from this study were that the fractions of female authors were relatively low, at 16% for corresponding authors and 27% for first authors, and that there was no strong evidence of significant differences in acceptance rates between female and male corresponding or first authors. We were interested in pursuing this and related analyses further, but manual gender inference does not scale well for larger analyses such as examining trends over time (see the related editorial).

To address this data deficiency, we began to collect gender information voluntarily from authors and reviewers of papers submitted to *Science* or other journals in the *Science* family. This approach has been somewhat successful, and we have accumulated a dataset that includes 4789 individuals with genders provided as either male or female: 3530 males and 1259 females. Of course, this dataset is still much smaller than the full complement of *Science* authors. However, there are automated computational methods based on matching first names against large databases with known genders, such as those from the United States Social Security Administration.

We used the R package “gender” to infer genders and compared these inferred assignments with the author-provided genders. For the initial analysis, we used the default behavior of “gender”, which is to assign the gender “male” whenever more than 50% of the entries for a name in the database are male and the gender “female” whenever more than 50% of the entries are female. For names for which no data are available, the gender is assigned as “unknown.” We applied this method to the dataset of names with user-provided genders.
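The default decision rule just described can be sketched as follows (the name-to-frequency table is a tiny hypothetical stand-in for the package’s database):

```python
# Sketch of the default cutoff rule: assign "male" when the male
# frequency for the name exceeds the cutoff, "female" otherwise, and
# "unknown" when the name is absent from the database. The frequencies
# below are invented for illustration.
MALE_FREQUENCY = {"james": 0.99, "mary": 0.01, "taylor": 0.45}

def infer_gender(name, cutoff=0.50):
    freq = MALE_FREQUENCY.get(name.lower())
    if freq is None:
        return "unknown"   # name not in the database
    # Names exactly at the cutoff are called female in this sketch.
    return "male" if freq > cutoff else "female"
```

Lowering the cutoff (as explored below) shifts ambiguous names such as “taylor” from the female to the male assignment.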

Of the 4789 individuals, genders could be inferred for 3403, or 71.1%, with 2491 inferred to be male and 912 inferred to be female. Of the individuals with inferred genders, 2372 (69.7%) were actually male and 1031 (30.3%) were actually female. Of the individuals for whom gender could not be inferred, 1158 (83.5%) were male and 228 (16.5%) were female. Note that this pool is somewhat enriched in males compared with the population for which genders could be inferred.

Let us consider the accuracy of the inferred genders. Of those males for whom gender could be inferred, the accuracy was 98.1%. Of those females for whom gender could be inferred, the accuracy was 84.0%. Overall, the accuracy was 93.8%.

The above values represent accuracy at the individual level. However, for many analyses of interest, data at the population level are sufficient, and these results may be more accurate. To check this, we applied these methods to the 2015 dataset used in our initial study. Of the 3568 individuals in this dataset, 2277 were male and 587 were female. We can compare these results with a calculation based on the user-provided gender data above in the following manner:

- Apply the gender inference tool to the overall dataset.
- Correct the number of individuals inferred to be male by subtracting the estimated number of actual females inferred to be male (a fraction 0.019 of those inferred male) and adding the estimated number of actual males inferred to be female (a fraction 0.160 of those inferred female).
- Correct the number of individuals inferred to be female by subtracting the estimated number of actual males inferred to be female (0.160 of those inferred female) and adding the estimated number of actual females inferred to be male (0.019 of those inferred male).
- Estimate the number of males in the pool of individuals for whom no gender could be inferred by multiplying the size of this pool by the estimated fraction of males among individuals with inferred genders, plus a correction factor of 0.14 to account for the observation that the pool with no inferred gender tends to contain a larger fraction of males than does the pool with inferred genders.

Applying this approach to the 2015 data we obtain these results:

Number of individuals inferred to be males: 1628.

Number of individuals inferred to be females: 554.

Number of individuals with no inferred gender: 682.

Fraction of males in pool with inferred genders: 0.746.

Corrected number of males: 1628 – 0.019 × 1628 + 0.160 × 554 + 0.886 × 682 = 1628 – 32 + 89 + 604 = 2289.

Corrected number of females: 554 – 0.160 × 554 + 0.019 × 1628 + 0.114 × 682 = 554 – 89 + 32 + 78 = 575.

These results can be compared with the actual values (based on web searches) of 2277 males and 587 females.

Thus, the fraction of males in this dataset is estimated to be 0.799, compared with the actual value of 0.795. The value calculated with the corrections based on the user-provided gender dataset is more accurate than the value of 0.746 obtained using only the individuals for whom genders could be inferred.
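The correction procedure above can be reproduced in a few lines (the error rates 0.019 and 0.160 and the 0.14 enrichment correction are those estimated in the text; carrying full precision rather than rounding intermediate terms gives counts within a unit or two of the figures shown above):

```python
# Sketch of the population-level correction described above: adjust the
# inferred male/female counts for the known error rates, then allocate
# the no-inference pool using the male fraction plus the enrichment
# correction.
def corrected_counts(n_male_inf, n_female_inf, n_unknown,
                     err_m=0.019, err_f=0.160, enrichment=0.14):
    frac_male = n_male_inf / (n_male_inf + n_female_inf)
    males = (n_male_inf
             - err_m * n_male_inf            # actual females inferred male
             + err_f * n_female_inf          # actual males inferred female
             + (frac_male + enrichment) * n_unknown)
    females = (n_female_inf
               - err_f * n_female_inf
               + err_m * n_male_inf
               + (1 - frac_male - enrichment) * n_unknown)
    return round(males), round(females)
```

For the 2015 inputs (1628 inferred male, 554 inferred female, 682 unknown), this recovers corrected counts in agreement with the 2289 males and 575 females computed above.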

Because these calculations depend on parameters estimated from the user-provided gender dataset that are somewhat uncertain, we can refine them through simulations that apply small random variations to these parameters, drawn from normal distributions. This allows estimation of the uncertainties introduced by the incomplete gender information.

With these tools in hand, we can now estimate the fractions of males and females in the pool of authors who submitted papers to *Science*. We first consider the pool of first authors for all Reports submitted to *Science* by year from 2010 to 2017. The gender population distributions are plotted below. The bands show 95% confidence ranges based on the uncertainties in inferring author gender.

The fractions of male and female first authors have been quite constant over this period of time with 24.2% of the first authors being female, reasonably consistent with the results of the 2015 manual study.

We can extend this analysis by examining the pools of first authors for Reports published in *Science* versus Reports rejected from *Science* over the same time period. These results can be combined to yield the acceptance rates for Reports with male versus female first authors, defined as the number of published Reports with first authors of a given gender divided by the total number of Reports (both published and rejected) with first authors of the same gender.

This plot reveals that there were differences in the acceptance rates for Reports with female versus male first authors for 2011 to 2015, favoring male first authors. There was essentially no difference for Reports submitted in 2016 and 2017. This was not due to any explicit policy change at *Science* other than the continuation of discussions regarding gender equity issues.

*Science* publishes papers across a wide range of disciplines with communities with different gender makeups. As a first step toward characterizing these differences, we can separate Reports according to the editor who was primarily responsible for handling the submission, dividing the editors into three groups: Life Sciences (molecular biology, biochemistry, cell biology, neuroscience, biomedicine, etc.), Physical Sciences (chemistry, physics, astronomy, etc.), and Other (ecology, evolution, earth sciences, social sciences, etc.). The Life Sciences, Physical Sciences, and Other categories account for 48, 22, and 30% of submissions, respectively.

The fractions of first authors by gender for these groups are shown below.

As might have been anticipated, the fraction of female first authors in the Life Science submissions, 0.295, is higher than the overall average. The fraction of female first authors in the Physical Sciences, 0.153, is lower by approximately a factor of 2. For the remaining fields (Other), the fraction of female first authors is 0.220.

We can also examine the acceptance rates as a function of the gender of the first author by field as shown below.

The plot for each field follows approximately the same trend as that seen for the overall population. The acceptance rate for female first authors in the Life Sciences appears to be higher than that for male first authors for papers submitted in 2016 and 2017.

We can now perform the same analyses for corresponding authors (as opposed to first authors) on Reports.

The fractions of female and male corresponding authors overall and by field are shown below.

Again, these fractions are relatively constant over this time period. Overall, the fraction of female corresponding authors, 0.172, is lower than that for female first authors.

The fraction of female corresponding authors in the Life Science submissions, 0.181, is higher than the overall average. The fraction of female corresponding authors in the Physical Sciences is 0.124, while for the remaining fields (Other) it is 0.194.

The acceptance rates for corresponding authors by gender are shown below.

These acceptance rate plots are relatively similar to those for first authors. Differences in the acceptance rates that occurred in early years are no longer evident for Reports submitted in 2017.

This is just an initial analysis of gender differences across the *Science* family of journals. The gender inference tools and datasets that we have developed will enable additional analyses of gender effects for other article types as well as for other characteristics of papers. These will be reported in subsequent posts.

Elections for the House of Representatives are particularly useful because they involve a large number of separate but somewhat correlated elections in 435 congressional districts across the United States. FiveThirtyEight produced three forecasts for the recent congressional elections: one based on polls only (Lite); one based on polls plus fundraising, district history, and historical trends (Classic); and one that added ratings from experts to the Classic forecast (Deluxe). I begin by examining how these forecasts compare to one another using the final forecasts before results from the elections were available.

The fundamental basis for these forecasts is the estimation of the probabilities for the percentages of votes for each candidate. For the polls only (Lite) forecast, available polls are used as primary data, combining various polls using corrections and weighting factors based on the polling methodology used and the historical accuracy of different pollsters. The distribution for the percentages of votes from the Lite forecast for Democratic candidates is shown in Figure 1. Note that some races had candidates running unopposed, one congressional race in California paired two Republican candidates, one race in Washington paired two Democratic candidates, races in Louisiana had numerous candidates from various parties, and many races included third-party candidates.

In almost exactly half of the 435 races, the Democratic candidate was forecast to receive 50% or more of the vote, with an additional 106 forecast to receive between 40 and 50%.

I now consider the more elaborate forecasts from FiveThirtyEight. The predicted percentages for Democratic candidates for the Lite and Classic forecasts are compared in Figure 2.

The average difference between the Lite and Classic forecasts is 0.52% (with the Lite forecast on average higher) with a standard deviation of 3.16%, and the correlation coefficient between the two forecasts is 0.9901.

The Deluxe forecast is only slightly different from the Classic forecast, with an average difference between the Classic and Deluxe forecasts of 0.16% (with the Classic forecast on average higher) with a standard deviation of 0.59%. The correlation coefficient between these two forecasts is 0.9996.

I now compare the percentages from the FiveThirtyEight forecasts with the results from the election. These results were obtained from politico.com on 15 November. Although these results were not certified at the time of this writing, the percentages are very unlikely to change enough to affect the analysis below. I will focus on the FiveThirtyEight Deluxe forecast.

The actual percentages from the election are compared with those from the Deluxe forecast in Figure 3.

Overall, the correlation coefficient between the actual results and those from the forecast is 0.9874.

The differences between the actual percentages and those from the Deluxe forecast are shown in Figure 4, and a histogram of these differences is shown in Figure 5.

The average difference between the actual percentage and that from the Deluxe forecast is -0.63% (Deluxe forecast higher) with a standard deviation of 3.16%.

How do these results compare with those for the other two forecasts? For the Lite forecast, the correlation coefficient with the election results is 0.9788, and the average difference is -1.31% with a standard deviation of 4.48%. Similarly, for the Classic forecast, the correlation coefficient with the election results is 0.9873, and the average difference is -0.79% with a standard deviation of 3.14%. Thus, the Lite forecast performed substantially worse than the Deluxe and Classic forecasts. The Deluxe forecast performed very slightly better than the Classic forecast.

Although the success in estimating voting percentages is impressive, elections are decided by which candidate receives the most votes. Correctly predicting that one candidate is likely to receive 75% of the vote compared with 65% is of no importance because this candidate will win the election in either case. Thus, the accuracy and precision of predictions in the vicinity of 50% (in a two-person race) are of critical importance. If these predictions were highly accurate and precise, then predicting elections would be straightforward by simply determining which candidate was predicted to get a higher percentage of votes.

Winners have been declared in 429 out of 435 congressional races as of this writing, with the remaining races too close to call. For the purpose of this analysis, I am assuming that the current vote leader will eventually be declared the winner in the remaining races. Overall, the candidate predicted in the Deluxe forecast to receive the most votes won in 425 races, corresponding to 97.7%.

However, predictions from polls and other data are imprecise, with uncertainties of several percentage points or more. Forecasts such as those performed by FiveThirtyEight deal with these uncertainties by performing thousands of election simulations in which each candidate’s percentage is allowed to vary from its predicted value. These variations can have multiple components: independent variations that affect only one race; broader variations that can affect multiple races (reflecting, for example, overall national trends at the time of the election) or regional effects; and variations that reflect other factors, such as incumbency, that can influence polling accuracy. Once thousands of such simulations are performed, the probability that any given candidate will win can be estimated as the fraction of simulations in which she or he received a higher percentage of votes than her or his opponents. For example, if a candidate is predicted in the baseline prediction to receive 70% of the votes, then the accumulated uncertainties will cause this candidate to lose in no or very few simulations, and this candidate can be forecast to win with high probability. On the other hand, if a candidate is predicted to receive 50% of the vote in a two-person race, then this candidate might win in half of the simulations and lose in the others, yielding a probability of winning of 50%. Given the uncertainty, a forecast may predict that each of the candidates in a certain number of races has a probability of winning of 50%. The forecast is deemed to be accurate if half of these candidates win their races, even if the forecast is silent about which half.
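A minimal sketch of this style of simulation for a single two-person race (the error spreads, the shared national swing, and the 50% winning threshold are illustrative assumptions, not FiveThirtyEight’s actual model):

```python
import random

# Sketch of estimating a win probability by simulation: the baseline
# vote share is perturbed by a shared "national swing" component plus an
# independent race-level error, and the win fraction is tallied. The
# standard deviations are illustrative assumptions.
def win_probability(baseline_pct, race_sd=3.0, national_sd=2.0,
                    n_sims=10000, seed=42):
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        swing = rng.gauss(0, national_sd)          # shared across races
        error = rng.gauss(0, race_sd)              # race-specific noise
        if baseline_pct + swing + error > 50.0:    # two-person race
            wins += 1
    return wins / n_sims
```

A candidate at a 70% baseline wins in essentially every simulation, while one at 50% wins in roughly half, matching the behavior described above.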

First, consider elections for which one candidate was strongly favored to win in the Deluxe forecast. There were 192 races for which the probability of the Democratic candidate winning was between 0 and 25%. The average probability across this pool was 3.6%. Among these, there were three races where the Democratic candidate won, corresponding to 1.6%, lower than but in reasonable agreement with the expectation. Similarly, there were 208 races for which the probability of the Democratic candidate winning was between 75 and 100%. The average probability across this pool was 98.9%. Among these, the Democratic candidate won in all races. Thus, elections produced somewhat fewer major upsets than were predicted by the forecast, although this observation would have been affected by changes in only a few races.

Now, let us consider the 35 races for which the probability of the Democratic candidate winning was between 25 and 75%. Of these, there were 15 races for which the probability was between 25 and 50%. The average probability across this window was 36.7%. Among these, the Democratic candidate won in five, or 33.3%, of them. Similarly, there were 20 races for which the probability was between 50 and 75%. The average probability across this window was 61.4%. Among these, the Democratic candidate won in 16, or 80%, of them.

An alternative way of displaying the results is as follows. Races are sorted based on the Deluxe forecast probability of a Democratic winner, from lowest to highest. Starting with a window from race 1 to race *n* (with *n* empirically set at 15), two parameters are calculated. The first is the average probability over all races in the window. The second is the fraction of Democratic winners in the window, calculated by dividing the number of wins by the window size. The window is then moved to races 2 to *n*+1, and the calculations are repeated. This continues as the window is moved across the entire set of races. These results are shown in Figure 6.
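The sliding-window calculation can be sketched as follows (the input is a list of (forecast probability, won) pairs; any data fed to it here would be hypothetical):

```python
# Sketch of the sliding-window calibration check described above: races
# are sorted by forecast probability, and for each window of consecutive
# races the mean forecast probability is paired with the observed
# fraction of Democratic winners.
def sliding_window(races, window=15):
    races = sorted(races)   # sort by forecast probability (first element)
    points = []
    for i in range(len(races) - window + 1):
        chunk = races[i:i + window]
        mean_prob = sum(p for p, _ in chunk) / window
        win_frac = sum(won for _, won in chunk) / window
        points.append((mean_prob, win_frac))
    return points
```

Plotting win fraction against mean probability for each window yields a curve like Figure 6; perfect calibration would put every point on the diagonal.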

If the results of the election perfectly matched the forecast probabilities, this plot would be a straight line, although some variation is anticipated because of the probabilistic nature of the forecast. The curve does approximately pass through the center of the plot, reflecting that the forecast did quite well in predicting races with probabilities near 50%. The slight S-shape of the curve is due to the lower number of major upsets that occurred than would have been expected from those probabilities.

The FiveThirtyEight forecasts of the 2018 congressional elections were quite accurate in predicting the percentages of votes received by the candidates and in estimating the probabilities for particular election outcomes. The inclusion of data in addition to weighted and corrected polling data improved the accuracy of the predictions. As with all scientific analyses, it is important to keep in mind both the core predictions and the associated uncertainties in these predictions.

In September 2017, the second major TOP guidelines workshop hosted by the Center for Open Science led to a position paper suggesting a standardized approach for reporting, provisionally entitled the TOP Statement.

Based on discussions at that meeting and at the 2017 Peer Review Congress, in December 2017 we convened a working group of journal editors and experts to support this overall effort by developing a minimal set of reporting standards for research in the life sciences. This framework could both inform the TOP statement and serve in other contexts where better reporting can improve reproducibility.

In this “minimal standards” working group, we aim to draw from the collective experience of journals implementing a range of approaches designed to enhance reporting and reproducibility (e.g. STAR Methods), existing life science checklists (e.g. the Nature Research reporting summary), and the results of recent meta-research on the efficacy of such interventions (e.g. Macleod et al. 2017; Han et al. 2017) to devise a set of minimal expectations that journals could agree to ask their authors to meet.

An advantage of aligning on minimal standards is consistency in policies and expectations across journals, which is beneficial for authors as they prepare papers for publication and for reviewers as they assess them. We also hope that other major stakeholders engaged in the research cycle, including institutional review bodies and funders, will see the value of agreeing on this type of reporting standard as a minimal expectation, as broad-based endorsement from an early stage in the research life cycle would provide important support for overall adoption and implementation.

The working group will provide three key deliverables:

- A “minimal standards” framework setting out minimal expectations across four core areas of materials (including data and code), design, analysis, and reporting (MDAR)
- A “minimal standards” checklist intended to operationalize the framework by serving as an implementation tool to aid authors in complying with journal policies, and editors and reviewers in assessing reporting and compliance with policies
- An “elaboration” document or user guide providing context for the “minimal standards” framework and checklist

While all three outputs are intended to provide tools to help journals, researchers and other stakeholders with adoption of the minimal standards framework, we do not intend to be prescriptive about the precise mechanism of implementation and we anticipate that in many cases they will be used as a yardstick within the context of an existing reporting system. Nevertheless, we hope these tools will provide a consolidated view to help raise reporting standards across the life sciences.

We anticipate completing draft versions of these tools by spring 2019. We also hope to work with a wider group of journals, as well as funders, institutions, and researchers to gather feedback and seek consensus towards defining and applying these minimal standards. As part of this feedback stage, we will conduct a “community pilot” involving interested journals to test application of the tools we provide within the context of their procedures and community. Editors or publishers who are interested in participating are encouraged to contact Veronique Kiermer and Sowmya Swaminathan for more information.

In the current working group, we have focused our efforts on life science papers because of extensive previous activity in this field in devising reporting standards for research and publication. However, once the life science guidelines are in place we hope that we and others will be able to extend this effort to other areas of science and devise similar tools for other fields. Ultimately, we believe that a shared understanding of expectations and clear information about experimental and analytical procedures have the potential to benefit many different areas of research as we all work towards greater transparency and the support that it provides for the progress of science.

We are posting this notification across multiple venues to maximize communication and outreach, to give as many people as possible an opportunity to influence our thinking. We welcome comments and suggestions within the context of any of these posts or in other venues. If you have additional questions about our work, would like to be informed of progress, or would like to volunteer to provide input, please contact Veronique Kiermer and Sowmya Swaminathan.

On behalf of the “minimal standards” working group:

Karen Chambers (Wiley)

Andy Collings (eLife)

Chris Graf (Wiley)

Veronique Kiermer (Public Library of Science; vkiermer@plos.org)

David Mellor (Center for Open Science)

Malcolm Macleod (University of Edinburgh)

Sowmya Swaminathan (Nature Research/Springer Nature; s.swaminathan@us.nature.com)

Deborah Sweet (Cell Press/Elsevier)

Valda Vinson (Science/AAAS)

Although our analysis focused on PIs from one institute funded in a single year, NIH Deputy Director for Extramural Research Mike Lauer and his colleagues have extended the analysis across NIH over a much longer period of time. This group recently posted their analysis, including all of the underlying data, on bioRxiv. Publicly sharing this data set is a very good practice that allows others to examine the results more thoroughly and to extend the analysis.

A key graph from this analysis, and one that has attracted much attention, is shown below:

The graph shows a curve fit to data for research productivity versus grant support using recently developed measures for these parameters. Annual grant support is measured by the Grant Support Index (GSI). This measure was developed as an alternative to funding level in dollars in an attempt to show that some types of research are more expensive than others. The GSI assigns point values to each grant type, with 7 points for an R01 grant with a single PI, 5 points for a more limited R21 grant with a single PI, and so on. Research productivity is measured with the Relative Citation Ratio (RCR), a metric based on citations, developed to correct for differences in citation behavior between fields. Both metrics are presented on logarithmic scales in this graph.

The most noteworthy aspect of this curve is that it rises with a steeper slope at lower values of GSI than it does at higher levels. This suggests that, on average, the increase in productivity associated with funding an additional grant to an already well-funded investigator would be less than that for providing a grant to an investigator with no funding or providing a second grant to an investigator with only a modest amount of funding. The separation between this observed curve and a hypothetical straight line (with productivity strictly proportional to research support) has been referred to as “unrealized productivity.”

Before delving further into this point, let us take advantage of the data that were made available to plot the relationship, with two changes. First, rather than just plotting the curve fit to the data, we show the data points themselves (for all 71,936 investigators used in the analysis). Second, we plot the data with linear rather than logarithmic scales to avoid any distortion associated with this transformation. The results are shown below, with the top graph showing all of the data points and the bottom graph enlarging the region that includes almost all investigators and showing a “spline” curve fit to these data along with a linear fit for comparison.

These plots reveal that the underlying data show a large amount of scatter, consistent with my earlier observations with the NIGMS-only data set as well as with the intuitive sense that laboratories with similar amounts of funding can vary substantially in their output. The curve fit to these data again reveals that the slope of the productivity versus grant support relationship decreases somewhat at higher levels of grant support.

With these observations in hand, we can now examine some expected results of proposed NIH policies. Suppose an investigator with an annual GSI of 28 (corresponding to four R01 grants) is reduced to an annual GSI of 21 (corresponding to three R01 grants) and that these resources are used to fund a previously unfunded investigator (to move to GSI = 7). According to the fit curve, the expected annual weighted RCR values are 9.0 for GSI = 28, 7.1 for GSI = 21, and 2.6 for GSI = 7. The anticipated change in annual weighted RCR is (–9.0 + 7.1 – 0 + 2.6) = 0.7. Thus, the transfer of funding is predicted to increase productivity (measured by weighted RCR). This appears to be one of the primary foundations for the proposed NIH policy.
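The arithmetic behind this prediction can be checked directly. A minimal sketch (in Python for illustration; the values are simply read off the fitted curve quoted above):

```python
# Expected annual weighted RCR values read off the fitted curve (from the
# text): GSI 28 -> 9.0, GSI 21 -> 7.1, GSI 7 -> 2.6, unfunded -> 0.
rcr = {28: 9.0, 21: 7.1, 7: 2.6, 0: 0.0}

# One PI drops from four R01s (GSI 28) to three (GSI 21); the freed funds
# move a previously unfunded PI to GSI 7.
change = (rcr[21] - rcr[28]) + (rcr[7] - rcr[0])
print(round(change, 1))  # 0.7
```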

This approach depends on the accuracy of the fitted curve in representing the behavior of the population which, as noted, shows considerable scatter. An alternative method involves directly simulating the effects of the proposed policy on the population. For example, one can take the 968 investigators with annual GSI values over 21 and reduce them to annual GSI values of 21, scaling each investigator’s weighted RCR output by the reduction in annual GSI. The total number of annual GSI points over the threshold of 21 for these investigators is 4709. This corresponds to the ability to fund an additional 672 R01 grants. If these grants are distributed to previously unfunded investigators, the anticipated weighted RCR output can be estimated by choosing a random set of 672 investigators with annual GSI values near 7 (say 6 to 8). Because of this random element, this simulation can be repeated many times to generate a population of anticipated outcomes. This results in the distribution shown below, with an average increase in weighted RCR of 0.3.
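This resampling procedure can be sketched as follows. The actual analysis used the full 71,936-investigator data set (in R); the sketch below runs on a synthetic population, and the function name and example numbers are illustrative assumptions only:

```python
import random

def simulate_cap(pis, cap=21, new_gsi=7, band=(6, 8), n_sims=1000, seed=1):
    """Sketch of the capping simulation described above, on a synthetic
    population. `pis` is a list of (annual GSI, weighted RCR) pairs.
    Returns the distribution of net changes in total weighted RCR."""
    rng = random.Random(seed)
    over = [(g, r) for g, r in pis if g > cap]
    freed = sum(g - cap for g, _ in over)           # GSI points freed by the cap
    lost = sum(r * (g - cap) / g for g, r in over)  # output scaled down pro rata
    n_new = int(freed // new_gsi)                   # additional R01-equivalents
    pool = [r for g, r in pis if band[0] <= g <= band[1]]
    return [sum(rng.choices(pool, k=n_new)) - lost for _ in range(n_sims)]

# Tiny synthetic example (illustrative numbers only):
pis = [(28, 9.0)] * 5 + [(7, 2.6)] * 50 + [(14, 5.0)] * 20
outcomes = simulate_cap(pis)
print(sum(outcomes) / len(outcomes) > 0)  # True
```

Because the new grantees are drawn at random, repeating the draw many times yields the distribution of anticipated outcomes described in the text.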

For most simulations, there is an increase in average weighted RCR, with the average somewhat less than that anticipated from the analysis based on the fit curve alone (0.3 versus 0.7). There are several possible explanations for this difference, including limitations in the ability of the fit curve to capture the features of the highly scattered distribution and the approach used to model the reduction in the anticipated output from the well-funded investigators.

The same simulation method can be applied to funding an additional 672 PIs with one R01 so that they each have two R01s by selecting 672 random PIs with an annual GSI of ~7 (6 to 8), removing them from the population, and adding 672 chosen from the population with an annual GSI of ~14 (13 to 15). The results are shown below, with an average increase in weighted RCR of 0.4.

These simulations appear to confirm that, on average, transferring funding from very well-funded PIs to less well-funded PIs may result in a small increase in weighted RCR output.

**Conclusions**

I strongly favor examination of such appropriate data to guide policy development. Understanding the relationships between grant support and research output is, of course, one of the most fundamental questions for any funding agency. The attempts by NIH to tackle this issue are laudable. However, as I discussed above, the presentation of simple curves fit to the data masks the considerable variation in output for PIs at all levels of funding. The development of policies based on a hard cap at a particular level of GSI seems to me to be problematic. Well-funded investigators always have substantial histories of research accomplishments. NIH program officers and advisory councils should have access to data about previous research accomplishments and productivity when making recommendations about potentially funding additional grants and should be encouraged to examine such data critically, even when the application under consideration has an outstanding peer review score. The opportunity costs for providing additional funding to an already well-funded PI at the expense of an early or mid-career PI with less funding are considerable. In addition, the use of a hard cap amplifies the importance of the details of how the GSI is calculated, with the selection of particular parameters potentially discouraging collaboration, training, and other desirable outputs, as has been the topic of ongoing discussions. It seems unwise to convert the highly nuanced information contained in lists of grant support and publications and other outputs into points on a graph rather than empowering the trained scientists who serve as program officials and advisory council members to use their judgment to help fulfill the mission of the NIH.

Note that this is a preprint that has not been peer reviewed. Nonetheless, the result is straightforward, and the remarkably good fit of this potentially complicated data set to a simple function has several important implications. First, the fit enables a simple forecasting approach. In this case, the forecast is that there will be over 300,000 opioid overdose deaths in the 5-year period from 2016 to 2020 across the United States. Such a high number highlights the importance of developing effective strategies for addressing the epidemic and slowing its exponential growth. Second, the observation of deviations from exponential behavior such as the acceleration from 2002 to 2006 provides clues about the changes that might be driving the growth of the epidemic.

This new result once again illustrates the importance of data in public health, as discussed in my recent editorial.

The project focused on Research Reports, Research Articles, and Reviews published in 2015. The primary goal was to compare the percentages of women among individuals whose papers were published compared with those from papers that were submitted during the same period but were not selected for publication. A major challenge in performing these analyses is determining the gender of the authors in question because such data are not collected as part of the submission process. The genders of authors were assigned through individual Internet searches.

The project focused on the authors in the first position and the last position in each author list. The individuals who were listed first and last in the author list were classified (again, based on Internet searches) as to whether they appeared to be in established positions (faculty or similar positions), hereafter referred to as “senior authors,” or if they were graduate students, postdoctoral fellows, or in similar temporary roles, hereafter referred to as “junior authors.” For the purposes of this analysis, senior authors in the first or last position were included, as were junior authors in the first position. Using this approach, 862 senior authors and 471 junior authors were identified and used in the subsequent analysis.

For comparison, a group of manuscripts from 2015 were randomly chosen from those that had not been selected for publication, to match the balance of Research Reports, Research Articles, and Reviews in the published set. The genders and levels of seniority were determined through Internet searches as described, resulting in 883 senior authors and 434 junior authors for use in subsequent analysis.

Among the published papers, 24.8% (117 out of 471) of the junior authors were women. In the comparison group of manuscripts that were not selected for publication, 30.0% (130 out of 434) of the junior authors were women. Although this suggests a trend disfavoring women authors, the difference is not statistically significant (p = 0.086, above the conventional threshold of p = 0.05).

For the published papers, 16.8% (145 out of 862) of the senior authors were women, while in the comparison group, the proportion was 14.7% (130 out of 883). If anything, this trend favors women authors, although the difference is quite modest (p = 0.237).

These data can be divided into three components corresponding to Research Reports (a total of 987 authors in the published paper group and 963 authors in the comparison group), Research Articles (228 authors in the published paper group and 206 authors in the comparison group), and Reviews (118 authors in the published paper group and 148 authors in the control group). In each of these sets, the same trends are observed as for the authors overall. The results are summarized below:

This preliminary study reveals two major sets of findings. First, as discussed above, the data do not reveal that the review and editorial processes at *Science* introduce substantial gender disparities. Second, the percentages of women among authors submitting to and published in *Science* are relatively low, ~27% for junior authors and 16% for senior authors. To put these values in context, I examined data from the United States National Science Foundation regarding the percentages of women in faculty positions and enrolled in graduate school. In 2010, the percentages of women in all faculty positions were 21% in the physical sciences, 42% in the life sciences, and 39% in the social sciences, whereas the percentages of women in senior faculty positions (associate professor and above) were 16% in the physical sciences, 34% in the life sciences, and 33% in the social sciences. In 2011, the percentages of women enrolled in graduate school were 33% in the physical sciences, 57% in the biological sciences, and 60% in the social sciences.

For papers submitted to *Science*, we estimate that ~40% are in the physical sciences, 55% are in the biological sciences, and 5% are in the social sciences. Using these as weighting factors, the anticipated percentage of women in all faculty positions submitting to *Science* would be (0.40)(21%) + (0.55)(42%) + (0.05)(39%) = 33%. Similarly, for women in senior faculty positions, the anticipated percentage of women is 27% and the anticipated percentage of women among graduate students is 47%. In all cases, the percentages of women who submitted to *Science* are lower than these estimates. The estimates certainly could be inaccurate given that they are based on many assumptions that could influence the results, including assumptions about career stage, the use of data from the United States only despite the international authorship of *Science*, differences in the institutions in the general and authorship pools, and so on. Refinement of these estimates may reveal the sources of some aspects of the gender disparity, which could help guide additional analyses and, eventually, policy suggestions.
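These weighted estimates are simple to reproduce; a minimal check (in Python for illustration, using the field weights and NSF percentages quoted above):

```python
# Assumed field weights for Science submissions (from the text) and NSF
# percentages of women in each field.
w = {"physical": 0.40, "biological": 0.55, "social": 0.05}

def expected_pct(pcts):
    """Weighted average of field-specific percentages."""
    return sum(w[f] * pcts[f] for f in w)

faculty_all = expected_pct({"physical": 21, "biological": 42, "social": 39})
faculty_senior = expected_pct({"physical": 16, "biological": 34, "social": 33})
# grad_students comes out ~47.6, which the text rounds to 47.
grad_students = expected_pct({"physical": 33, "biological": 57, "social": 60})
print(round(faculty_all), round(faculty_senior))  # 33 27
```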

**Constructing the initial corpus of abstracts from Science**

To build the corpus, abstracts for more than 2200 research papers from *Science* from 2013 through 2015 were assembled. The first task is to read in the data set and convert the data into a corpus for analysis.

**Calculating similarities between all pairs of abstracts**

The next step is to calculate the similarity matrix. This can be done using TF-IDF (term frequency–inverse document frequency) weighting, which weights terms that occur rarely in the corpus more highly than common terms. The similarity index is the so-called “cosine similarity,” which will need to be converted to a distance metric subsequently.

With this similarity matrix in hand, we calculate distances using the formula distance = 2*arccos(similarity)/pi.
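The original analysis was carried out in R (see the R Markdown file noted below); as an illustration, the weighting, similarity, and distance computations can be sketched from first principles in Python. The tokenization here is deliberately naive:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF: term frequency weighted by inverse document
    frequency, so rare terms count more than common ones."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def distance(sim):
    # Convert cosine similarity to a distance, as in the text:
    # distance = 2*arccos(similarity)/pi, which maps [0, 1] -> [1, 0].
    return 2 * math.acos(max(-1.0, min(1.0, sim))) / math.pi
```

Two abstracts sharing no terms end up at the maximum distance of 1, while identical abstracts are at distance 0.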

**Representing the relative distances in two dimensions**

Finally, these distances are projected into two dimensions using classical multidimensional scaling, also known as principal coordinate analysis.
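Classical multidimensional scaling amounts to double-centering the squared distance matrix and taking the top eigenvectors. A sketch (the original used R; this illustration assumes NumPy is available):

```python
import numpy as np

def pcoa(dist, k=2):
    """Classical MDS / principal coordinate analysis: double-center the
    squared distance matrix and project onto the top-k eigenvectors."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    b = -0.5 * j @ d2 @ j                 # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:k]    # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.clip(vals, 0, None))
```

For distances that are exactly Euclidean, the projected coordinates reproduce the pairwise distances; for the abstract distances, the 2D layout is an approximation, and (as noted in the text) its orientation is arbitrary.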

The results can then be plotted.

This plot reveals an interesting three-pointed structure. Note that only the shape of this figure is meaningful; the orientation is arbitrary. Examination of the abstracts at the three points reveals that they correspond to biomedical sciences, physical sciences, and Earth sciences.

**Extending the analysis to the other Science family journals**

With this framework in place, we can now expand the corpus to include papers published in the other *Science* family journals. For this purpose, we will use most of the papers published in *Science Advances*, *Science Signaling*, *Science Translational Medicine*, *Science Immunology*, and *Science Robotics* in 2016.

We now plot each journal separately.

The orientations of these figures are slightly different from that produced by the initial *Science*-only corpus but, as noted above, this orientation is arbitrary. Several points emerge from examining these plots. First, the breadth of disciplines covered by *Science Advances* is essentially the same as that covered by *Science*. Comparison with more papers from *Science Advances* may reveal differences in emphasis between these two broad journals. Second, the papers from *Science Signaling*, *Science Translational Medicine*, and *Science Immunology* lie in the same general region in the biomedical arm of the plot. More detailed analysis should reveal more nuanced differences between the content of these journals.

This analysis represents a first step toward using these tools for unbiased analysis of the contents of the *Science* family of journals. More refined analysis is in progress.

**Additional documents and code**

The abstracts used in this analysis are available in six .csv files. The R Markdown file that generates this post including the analysis is also available.

We now turn to the second component, a model for the number of grant applications submitted and reviewed. The number of NIH research project grants reviewed each year from 1990 to 2015 is plotted below:

This curve is somewhat reminiscent of the curve for the NIH appropriation as a function of time shown in an earlier post. The drop in the number of applications that occurs in 2008–2009 is an artifact due to the effects of the American Recovery and Reinvestment Act (ARRA). The funding associated with the ARRA was not included in the appropriations data, and applications that were considered for ARRA funding were also removed.

The grant application number and appropriation curves are compared directly below, plotted as fractional changes since 1990 to facilitate comparison.

The curves are similar in shape, although the increase in the NIH appropriation is approximately a factor of 2 larger than the increase in the number of grant applications. The curves, normalized so that they have the same overall height, are compared below.

Examination of the curves reveals that the grant application number curve is shifted to later years by ~2 years compared with the NIH appropriation curve. This makes mechanistic sense in that a relatively large increase in the NIH appropriation might cause institutions to hire more faculty who then apply for grants and might cause individual investigators to submit more applications. However, these responses do not take place instantaneously but require a year or more for the applications to be written and submitted.

A linear model can now be fit to predict the grant application number curve as a linear combination of the appropriation curves shifted by 1 and 2 years, including a constant term.

The number of grant applications can be calculated from the appropriation curves as m_{1}(appropriation, 1-year offset) + m_{2}(appropriation, 2-year offset) + b, where m_{1} = –0.18, m_{2} = 0.61, and b = 0.57.
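The fit itself is an ordinary least-squares problem over the lagged appropriation series. A sketch (in Python with NumPy for illustration; the original analysis was in R):

```python
import numpy as np

def fit_lagged_model(apps, approp):
    """Least-squares fit of apps[t] = m1*approp[t-1] + m2*approp[t-2] + b,
    the form of the linear model described above. Both series are the
    fractional changes since the first year."""
    apps, approp = np.asarray(apps, float), np.asarray(approp, float)
    y = apps[2:]                            # need two years of history
    X = np.column_stack([approp[1:-1],      # 1-year offset
                         approp[:-2],       # 2-year offset
                         np.ones_like(y)])  # constant term
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # (m1, m2, b)
```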

The agreement is reasonable. The major differences occur in years 2008–2009 due to the impact of ARRA noted above. The overall Pearson correlation coefficient is 0.983.

A model has been developed that allows the prediction of the number of NIH grant applications from the appropriations history. This model can be used in conjunction with the previously described model for the number of grants awarded to predict grant success rates, for actual appropriation histories or for hypothetical ones.

The grant application number model was developed empirically, based on observed similarities between the grant application number curve and the appropriation curve. Although it is not truly mechanism-based, the model is consistent with a simple mechanistic interpretation as noted. It is interesting that grant application numbers increased more or less monotonically. Thus, it would have been difficult to develop a model from the inflation-corrected appropriation curve because this peaked in 2003 and has been falling almost every year since. This raises an interesting point. Application numbers have gone up both when the appropriation increased by more than inflation and when it increased by less than inflation. This could be interpreted in terms of two dynamic drivers. When the appropriation increases by more than inflation, institutions and investigators sense opportunity and submit more applications; when the appropriation increases by less than inflation, institutions and investigators sense tough times with lower success rates and submit more applications to increase their chances of competing successfully for funding.

It will be interesting to compare how well this empirical model does in predicting grant application numbers in future years.

An R Markdown file that generates this post, including the R code, is available.

To begin, I examine the appropriations history for NIH from 1990 to the present. NIH appropriations from 1990 to 2015 are shown below:

For comparison, success rates for grants (RPGs) are shown below:

These two parameters are negatively correlated with a correlation coefficient of -0.66. In other words, as the size of the appropriation increased, the success rate tended to decrease.
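For reference, the correlation coefficient quoted here and elsewhere in these posts is the ordinary Pearson statistic, which can be computed directly (a Python sketch with made-up numbers; the actual values came from the appropriation and success-rate series):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Illustrative only: a rising series paired with a falling one is
# negatively correlated, as with appropriations and success rates.
print(round(pearson([1, 2, 3, 4], [10, 8, 7, 3]), 2))  # -0.96
```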

One possible adjustment that might improve the correlation involves correcting the appropriation data for inflation. Inflation is best measured in terms of the Biomedical Research and Development Price Index (BRDPI), a parameter calculated annually by the Department of Commerce on behalf of the NIH.

The NIH appropriation curves in nominal terms and in constant 1990 dollars are plotted below:

The constant dollar appropriations and success rate data are still negatively correlated with a correlation coefficient of -0.381.

Thus, the simple notion that the success rate should increase with increases in the NIH appropriation is empirically false over time.

There are two reasons why this is true. The first involves the manner in which NIH grants are funded. Grants average 4 years in duration and are almost always paid out in 4 consecutive fiscal years. Thus, if a 4-year grant is funded in a given fiscal year, NIH is committed to paying the “out-years” for this grant over the next 3 fiscal years. Because of this, ~75% (in practice, more than 80%, owing to other commitments) of the NIH appropriation for a given year is already committed to ongoing projects, leaving less than 20% available for new and competing projects. This makes the size of the pool for new and competing projects very sensitive to the year-to-year change in the appropriation level.
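The arithmetic behind this sensitivity can be made concrete with a stylized steady-state example (an illustration only, not NIH's actual accounting):

```python
# With 4-year grants and a flat budget, three of every four active cohorts
# were funded in earlier years, so ~75% of the appropriation is committed.
grant_years = 4
committed = (grant_years - 1) / grant_years
print(committed)  # 0.75

# The leverage this creates: a 5% increase in the total appropriation
# raises the new-grant pool from 0.25 to 0.30 of baseline, a 20% jump.
increase = 0.05
pool_change = (1 + increase - committed) / (1 - committed) - 1
print(round(pool_change, 2))  # 0.2
```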

The observed numbers of new and competing grants are plotted below:

To put these effects in quantitative terms, a model has been developed to estimate the number of grants funded each year, given NIH appropriation and BRDPI data over time.

The assumptions used in building the model are:

- NIH funds grants with an average length of 4.0 years.

For the purposes of this model, we will assume 1/4 of the grants have a duration of 3 years, 1/2 have a duration of 4 years, and 1/4 have a duration of 5 years. Using a single pool of grants, all with 4-year durations, would be contrary to fact and would likely lead to artifacts, but the model is unlikely to depend significantly on the details of the distribution. When a grant completes its last year, the funds are freed up to fund new and competing grants in the next year.

- The average grant size increases according to the BRDPI on a year-to-year basis.

This assumption has been less true in recent years owing to highly constrained NIH budgets, but it is a reasonable approximation (and still represents good practice).

- Fifty percent of the overall NIH appropriation each year is invested in RPGs. This is consistent with the average percentage of RPG investments over time.
- The system begins with an equal distribution of grants at each stage (first, second, … year of a multiyear grant) ~10 years before the portion used for analysis.
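Under these assumptions, the model can be sketched as a simple cohort simulation. This is a Python illustration of the logic (the actual model was implemented in R, and the grant-cost units and function name here are arbitrary):

```python
def grants_funded(approp, infl, rpg_share=0.5,
                  durations=((3, 0.25), (4, 0.5), (5, 0.25)), cost0=1.0):
    """Sketch of the grant-number model under the assumptions above.
    `approp` is the annual appropriation, `infl` the annual BRDPI rate.
    Returns the number of new and competing grants fundable each year."""
    cost = cost0
    # Start from an equal distribution of grants at each stage, sized as if
    # a steady state preceded year 1 (average duration = 4 years).
    n0 = rpg_share * approp[0] / cost0 / 4.0
    active = []  # (years of payments remaining, annual cost) per cohort slice
    for age in range(1, 5):
        for d, frac in durations:
            if d > age:
                active.append((d - age, frac * n0 * cost0))
    new_counts = []
    for a, f in zip(approp, infl):
        committed = sum(c for _, c in active)  # out-year obligations
        pool = rpg_share * a - committed       # left for new awards
        n_new = max(pool / cost, 0.0)
        new_counts.append(n_new)
        for d, frac in durations:
            active.append((d, frac * n_new * cost))
        # Advance one year: each active grant pays this year's installment.
        active = [(y - 1, c) for y, c in active if y > 1]
        cost *= 1 + f                          # grant size grows with BRDPI
    return new_counts

# With a flat appropriation and no inflation, the model settles at a
# constant number of new grants per year.
print(grants_funded([8.0] * 5, [0.0] * 5))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

Running the same sketch with a flat appropriation but positive BRDPI inflation shows the number of new awards shrinking year over year, the squeeze described above.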

We will start in 1990. The comparison between the actual numbers of grants funded and those predicted by the model is shown below:

The agreement between the observed and predicted curves is remarkable. The correlation coefficient is 0.894.

The largest difference between the curves occurs at the beginning of the doubling period (1998-2003) where the model predicts a large increase in the number of grants that was not observed. This is due to the fact that NIH initiated a number of larger non–RPG-based programs when substantial new funding was available rather than simply funding more RPGs (although they did this to some extent). For example, in 1998, NIH invested $17 million through the Specialized Center–Cooperative Agreements (U54) mechanism. This grew to $146 million in 1999, $188 million in 2000, $298 million in 2001, $336 million in 2002, and $396 million in 2003. Note that the change each year matters for the number of new and competing grants that can be made because, for a given year, it does not matter whether funds have been previously committed to RPGs or to other mechanisms.

The second substantial difference occurs in 2013 when the budget sequestration led to a substantial drop in the NIH appropriation. To avoid having the number of RPGs that could be funded drop too precipitously, NIH cut noncompeting grants substantially. Noncompeting grants are grants for which commitments have been made and the awarding of a grant depends only on the submission of an acceptable progress report. The average size [in terms of total costs, that is, direct costs as well as indirect (facilities and administration) costs] of a noncompeting R01 grant was $393,000 in 2011, grew to $405,000 in 2012, a 2.9% increase, and then dropped to $392,000 in 2013, a 3.3% drop. Given that there are approximately three times as many noncompeting grants as there are new and competing grants, this change from a 2.9% increase to a 3.3% decrease for noncompeting grants increased the pool of funds for new and competing grants by ~3(2.9 + 3.3) = 18.6%. However, cutting noncompeting grants means that existing programs with research underway and staff in place had to find ways to deal with unexpected budget cuts.
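The leverage calculation can be checked directly (using the percentages quoted above):

```python
# Swing in noncompeting R01 costs (from the text): +2.9% in 2012 to
# -3.3% in 2013, amplified by the ~3:1 ratio of noncompeting to
# new/competing grants.
swing = 2.9 + 3.3  # percentage points freed per noncompeting grant
ratio = 3          # ~3 noncompeting grants per new/competing grant
print(round(ratio * swing, 1))  # 18.6
```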

At this point, I have developed a reasonable model for estimating the number of new and competing awards that can be made given annual appropriation and BRDPI data. In the next post, I will examine a model for the number of grant applications submitted each year.

The R Markdown file that generates this post, including the R code, is available.