New tools for gender analysis

Initial gender study at Science

Because publishing is one of the most important measures of scientific accomplishment and a key parameter in professional advancement, the gender distribution of authors of scientific papers is a topic of considerable importance. It can reveal information about gender distributions across the communities of science and highlight possible points of gender bias. Early in my time at Science, we reported, in an editorial and a Sciencehound post, an initial analysis of the gender distributions for papers published in Science in 2015, together with a randomly selected set of a similar number of papers that had been submitted to Science but rejected over the same time period. The genders of the first and corresponding authors of these papers were inferred manually through web searches. The major conclusions from this study were that the fractions of female authors were relatively low, at 16% for corresponding authors and 27% for first authors, and that there was no strong evidence of significant differences in acceptance rates between female and male corresponding or first authors. We were interested in pursuing this and related analyses further, but manual gender inference does not scale well to larger analyses such as looking at trends over time (see the related editorial).

Collecting user-provided gender information

To address this data deficiency, we started to collect gender information, provided voluntarily, from authors and reviewers who worked on papers submitted to Science or other members of the Science family of journals. This approach has been somewhat successful, and we have accumulated a dataset that includes 4789 individuals with genders provided as either male or female: 3530 males and 1259 females. Of course, this dataset is still much smaller than the full complement of Science authors. However, there are automated computational methods that infer gender from first names, using names with known genders from large databases such as those from the United States Social Security Administration.

Automated gender inference

We used the R package “gender” to infer genders and to compare these inferred gender assignments with the author-provided genders. For the initial analysis, we used the default for “gender”, which is to assign the gender “male” whenever more than 50% of the entries for a name in the database used are male and the gender “female” whenever more than 50% of the entries are female. For names for which no data are available, the gender is assigned as “unknown.” We applied this method to the dataset of names with user-provided genders.
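The thresholding rule described above can be illustrated with a minimal Python sketch. This is not the “gender” package itself or its API; the table of name counts and the function name `infer_gender` are hypothetical, and the handling of an exact 50/50 tie is an assumption.

```python
from typing import Dict, Tuple

# Hypothetical (male_count, female_count) entries per first name.
# These numbers are illustrative only and are not real database values.
NAME_COUNTS: Dict[str, Tuple[int, int]] = {
    "james": (9800, 200),
    "mary": (150, 9850),
    "jordan": (7000, 3000),
}


def infer_gender(first_name: str,
                 counts: Dict[str, Tuple[int, int]] = NAME_COUNTS) -> str:
    """Assign a gender when more than 50% of the entries for a name agree."""
    entry = counts.get(first_name.strip().lower())
    if entry is None:
        return "unknown"  # name not present in the reference table
    male, female = entry
    total = male + female
    if total == 0:
        return "unknown"
    if male / total > 0.5:
        return "male"
    if female / total > 0.5:
        return "female"
    return "unknown"  # exact tie; treating this as unknown is an assumption


for name in ("James", "Mary", "Jordan", "Xiaoming"):
    print(name, infer_gender(name))
```

Names absent from the reference table fall into the “unknown” pool, which is why the coverage of the reference database matters as much as the threshold.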

Of the 4789 individuals, genders could be inferred for 3403 (71.1%), with 2491 inferred to be male and 912 inferred to be female. Of the individuals with inferred genders, 2372 or 69.7% were actually male and 1031 or 30.3% were actually female. Of the individuals for whom gender could not be inferred, 1158 or 83.5% were male and 228 or 16.5% were female. Note that this pool is somewhat enriched in males compared to the population for which genders could be inferred.

Let us consider the accuracy of the inferred genders. Of those males for whom gender could be inferred, the accuracy was 98.1%. Of those females for whom gender could be inferred, the accuracy was 84.0%. Overall, the accuracy was 93.8%.
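As a quick check, the overall accuracy follows from the two per-gender accuracies weighted by the numbers of actual males (2372) and actual females (1031) among the individuals with inferred genders. The function name below is hypothetical; the numbers are those quoted above.

```python
def overall_accuracy(n_male: int, acc_male: float,
                     n_female: int, acc_female: float) -> float:
    """Weighted average of per-gender accuracies."""
    correct = acc_male * n_male + acc_female * n_female
    return correct / (n_male + n_female)


accuracy = overall_accuracy(2372, 0.981, 1031, 0.840)
print(f"{accuracy:.3f}")  # prints 0.938
```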

Validation and accuracy at the population level

The above values represent accuracy at the individual level. However, for many of the analyses in which we are most interested, data at the population level are sufficient, and population-level estimates may be more accurate than individual assignments. To check this, we applied these methods to the 2015 dataset used in our initial study. Of the 3568 individuals in this dataset, 2277 were identified as male and 587 as female. We can compare these results with a calculation based on the user-provided gender data above in the following manner:

  • Apply the gender inference tool to the overall dataset.
  • Correct the number of individuals inferred to be males by subtracting the estimated number of actual females inferred to be male (0.160) and adding the estimated number of actual males who were inferred to be female (0.019).
  • Correct the number of individuals inferred to be females by subtracting the estimated number of actual males inferred to be female (0.019) and adding the estimated number of actual females who were inferred to be male (0.160).
  • Estimate the numbers of males in the pool of individuals for whom no gender could be inferred by multiplying the size of this pool by the estimated fraction of males in the pool of individuals with inferred genders with a correction factor of 0.14 added to account for the observation that the pool with no inferred gender tends to have a larger fraction of males than does the pool with inferred genders.

Applying this approach to the 2015 data we obtain these results:

Number of individuals inferred to be males: 1628.

Number of individuals inferred to be females: 554.

Number of individuals with no inferred gender: 682.

Fraction of males in pool with inferred genders: 0.746.

Corrected number of males: 1628 – 0.019 × 1628 + 0.160 × 554 + 0.886 × 682 = 1628 – 32 + 89 + 604 = 2289.

Corrected number of females: 554 – 0.160 × 554 + 0.019 × 1628 + 0.114 × 682 = 554 – 89 + 32 + 78 = 575.

These results can be compared with the actual values (based on web searches) of 2277 males and 587 females.

Thus, the fraction of males in this dataset is estimated to be 0.799 compared with the actual value of 0.795. The value calculated with the corrections based on the user-provided gender dataset is more accurate than the value of 0.746 obtained using only the individuals for whom genders could be inferred.
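The arithmetic above can be reproduced with a short Python sketch. It mirrors the corrected-count expressions exactly as written in this post. The function and parameter names are hypothetical. Because it uses the rounded parameters (0.019, 0.160, 0.14), the counts can differ by about one individual from the values quoted above.

```python
def corrected_counts(inferred_male: float, inferred_female: float,
                     n_unknown: float, male_error: float = 0.019,
                     female_error: float = 0.160,
                     unknown_male_shift: float = 0.14):
    """Population-level correction following the arithmetic shown in the post."""
    frac_male_inferred = inferred_male / (inferred_male + inferred_female)
    # The pool with no inferred gender is assumed to be more male-rich.
    frac_male_unknown = min(1.0, frac_male_inferred + unknown_male_shift)
    males = (inferred_male
             - male_error * inferred_male
             + female_error * inferred_female
             + frac_male_unknown * n_unknown)
    females = (inferred_female
               - female_error * inferred_female
               + male_error * inferred_male
               + (1.0 - frac_male_unknown) * n_unknown)
    return males, females


males, females = corrected_counts(1628, 554, 682)
fraction_male = males / (males + females)
print(round(males), round(females), round(fraction_male, 3))
```

The correction only moves individuals between the two categories, so the total (2864 here) is conserved.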

Because these calculations depend on parameters estimated from the user-provided gender dataset that are somewhat uncertain, we can refine these calculations through simulations that use small variations from these parameters selected at random from normal distributions. This will allow estimation of uncertainties introduced from the incomplete gender information.
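One way such a simulation might look is sketched below in Python. The parameter means come from this post, but the standard deviations (0.003, 0.011, 0.015), the number of draws, and all function names are hypothetical choices made for illustration. The post does not state the values actually used.

```python
import random
import statistics


def male_fraction(inferred_male, inferred_female, n_unknown,
                  male_error, female_error, unknown_male_shift):
    """Corrected fraction of males for one set of parameter values."""
    frac_inferred = inferred_male / (inferred_male + inferred_female)
    frac_unknown = min(1.0, max(0.0, frac_inferred + unknown_male_shift))
    males = (inferred_male
             - male_error * inferred_male
             + female_error * inferred_female
             + frac_unknown * n_unknown)
    return males / (inferred_male + inferred_female + n_unknown)


def simulate(n_draws=5000, seed=1):
    """Draw parameters from normal distributions; return mean and 95% range."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        draws.append(male_fraction(
            1628, 554, 682,
            # Standard deviations below are assumed, not taken from the post.
            male_error=max(0.0, rng.gauss(0.019, 0.003)),
            female_error=max(0.0, rng.gauss(0.160, 0.011)),
            unknown_male_shift=rng.gauss(0.14, 0.015),
        ))
    draws.sort()
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws) - 1]
    return statistics.mean(draws), lower, upper


mean_fraction, lower, upper = simulate()
print(f"{mean_fraction:.3f} ({lower:.3f} to {upper:.3f})")
```

The width of the resulting range depends entirely on the assumed standard deviations, so it should be read as an illustration of the procedure rather than as a reported uncertainty.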

Application to first authors of Reports from Science from 2010 to 2017

With these tools in hand, we can now estimate the fractions of males and females in the pool of authors who submitted papers to Science. We first consider the pool of first authors for all Reports submitted to Science by year from 2010 to 2017. The gender population distributions are plotted below. The bands show 95% confidence ranges based on the uncertainties in inferring author gender.

The fractions of male and female first authors have been quite constant over this period of time with 24.2% of the first authors being female, reasonably consistent with the results of the 2015 manual study.

We can extend this analysis by examining the pool of first authors for Reports that were published in Science versus the pool for Reports that were rejected by Science over the same time period. These results can be combined to yield the acceptance rates for Reports with male versus female first authors, defined as the number of published Reports with first authors of a given gender divided by the total number of Reports (both published and rejected) with first authors of the same gender.
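The acceptance rate definition can be written as a small Python function. The counts in the example are hypothetical and are not data from Science. In practice the inputs would be the estimated (possibly non-integer) numbers of published and rejected Reports for each gender.

```python
def acceptance_rate(published: float, rejected: float) -> float:
    """Published Reports divided by all Reports (published plus rejected)."""
    total = published + rejected
    if total <= 0:
        raise ValueError("no Reports in this group")
    return published / total


# Hypothetical counts, for illustration only: (published, rejected).
example_counts = {"female": (60, 940), "male": (210, 2790)}
rates = {gender: acceptance_rate(pub, rej)
         for gender, (pub, rej) in example_counts.items()}
print(rates)
```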

This plot reveals that there were differences in the acceptance rates for Reports with female versus male first authors for 2011 to 2015, favoring male first authors. There was essentially no difference for Reports submitted in 2016 and 2017. This was not due to any explicit policy change at Science other than the continuation of discussions regarding gender equity issues.

Science publishes papers across a wide range of disciplines with communities with different gender makeups. As a first step toward characterizing these differences, we can separate Reports according to the editor who was primarily responsible for handling the submission, dividing the editors into three groups: Life Sciences (molecular biology, biochemistry, cell biology, neuroscience, biomedicine, etc.), Physical Sciences (chemistry, physics, astronomy, etc.), and Other (ecology, evolution, earth sciences, social sciences, etc.). The Life Sciences, Physical Sciences, and Other categories account for 48, 22, and 30% of submissions, respectively.

The fractions of first authors by gender for these groups are shown below.

As might have been anticipated, the fraction of female first authors in the Life Sciences submissions, 0.295, is higher than the overall average. The fraction of female first authors in the Physical Sciences, 0.153, is lower by approximately a factor of 2. For the remaining fields (Other), the fraction of female first authors is 0.220.
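As a consistency check, the per-field fractions can be combined into an overall fraction by weighting each field by its share of submissions. This sketch assumes that the 48/22/30% submission shares apply directly to the pool of first authors, which is an approximation. The variable names are hypothetical.

```python
# Share of submissions and fraction of female first authors, from the text.
submission_share = {"Life Sciences": 0.48, "Physical Sciences": 0.22, "Other": 0.30}
female_fraction = {"Life Sciences": 0.295, "Physical Sciences": 0.153, "Other": 0.220}

weighted_sum = sum(submission_share[field] * female_fraction[field]
                   for field in submission_share)
overall_female = weighted_sum / sum(submission_share.values())
print(f"{overall_female:.3f}")
```

The weighted value is close to the overall fraction of 24.2% female first authors reported above.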

We can also examine the acceptance rates as a function of the gender of the first author by field as shown below.

The plots for the different fields each follow approximately the same trend as that seen for the overall population. The acceptance rate for female first authors in the Life Sciences appears to be higher than that for male first authors for papers submitted in 2016 and 2017.

Gender analysis of corresponding authors on Reports

We can now perform the same analyses for corresponding authors (as opposed to first authors) on Reports.

The fractions of female and male corresponding authors overall and by field are shown below.

Again, these fractions are relatively constant over this time period. Overall, the fraction of female corresponding authors, 0.172, is lower than that for female first authors.

The fraction of female corresponding authors in the Life Sciences submissions, 0.181, is higher than the overall average. The fraction of female corresponding authors in the Physical Sciences is 0.124, while for the remaining fields (Other), the fraction of female corresponding authors is 0.194.

The acceptance rates for corresponding authors by gender are shown below.

These acceptance rate plots are relatively similar to those for first authors. Differences in the acceptance rates that occurred in early years are no longer evident for Reports submitted in 2017.

Future directions

This is just an initial analysis of gender differences across the Science family of journals. The gender inference tools and datasets that we have developed will enable additional analyses of gender effects for other article types as well as for other characteristics of papers. These will be reported in subsequent posts.