Establishing the prevalence of the gender dimension in research
September 16, 2016

About 7% of social science research and about 4% of medicine and humanities integrate a gender dimension. In agricultural and natural sciences as well as engineering and technology there is a vanishingly small amount of research involving gender. Sweden’s research has the strongest commitment to researching gender issues, with almost 9% of social sciences and 7.6% of humanities addressing gender issues and even showing activity in the natural sciences. Croatia and Finland are close followers, with a similar broad profile.

These insights are reported in She Figures 2015, released this year by the European Commission. The data were compiled by Science-Metrix. Intriguing as these results are, my first question was: how did they do that? Papers aren’t tagged with a metadata field signaling the presence of gender concerns. No single keyword would capture it. Topic modeling the corpus might produce some clusters with words related to gender, but nothing precise enough to support, with any confidence, a table by country, field and year. So I dug into the methodology to find out how the numbers were produced.

Unsurprisingly, the method starts by delineating the scope, that is, with definitions. The EU definition is:

Gender dimension in research is a concept regrouping the various elements concerning biological characteristics and social/cultural factors of both women and men into the development of research policy, programmes and projects. (p. 55, Methodology)

The study understood this definition as follows:

The gender dimension in research content includes both the concepts of sex and gender as well as the concept of sex/gender analysis in humans. As such, in addition to research outputs focused on a well-defined gender topic (e.g. feminism, gender pay gap, gender equality, LGBT), research content in which a distinction or a comparison is made between men and women either in the title, abstract, or author keywords of scientific publications were deemed relevant. (p. 55, Methodology).

Excluded were studies of the animal kingdom or of plants. Also excluded were medical studies of conditions specific to one gender, such as menopause or erectile dysfunction.

With this definition in hand, the Science-Metrix team built a core set of papers. For this, they used the Web of Science (WoS) database. First they pulled the papers in fields related to gender research, such as the social science field of Gender Studies, which yielded 6,023 papers. The team confirmed that 100 randomly selected papers in this set did indeed contain a gender dimension. Next they pulled papers in journals whose names contained the word “gender.” This added about 2,150 papers to the set. Next they pulled all papers from journals that published articles classified in the subfield of Gender Studies, adding 3,700 to the set. Finally, Medline MeSH terms were searched. A search for MeSH terms containing every variant of gender, femin*, women and men found 18 possibly relevant MeSH terms. After eliminating feminization and pregnant women, the team used the remaining terms to pull Medline papers, and kept those that were also indexed in the WoS. The resulting set contained 17,900 papers and was called the “seed” dataset.
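To make the MeSH step and the assembly of the seed set concrete, here is a minimal Python sketch. It is not the Science-Metrix implementation: the term list, the wildcard patterns, and the function names `candidate_terms` and `build_seed` are all assumptions made up for illustration, and the real vocabulary and paper identifiers would come from Medline and the WoS.

```python
from fnmatch import fnmatchcase

# Hypothetical, tiny stand-in for the MeSH vocabulary (not the real list).
MESH_TERMS = ["Feminism", "Feminization", "Pregnant Women", "Women's Health",
              "Gender Identity", "Men", "Neoplasms"]

# Wildcard patterns modelled on the variants named in the text (assumed forms).
PATTERNS = ["*gender*", "femin*", "*women*", "men"]

# Terms dropped after manual review, as described in the text.
EXCLUDED = {"feminization", "pregnant women"}


def candidate_terms(terms, patterns, excluded):
    """Return terms matching any pattern, minus the manually excluded ones."""
    kept = []
    for term in terms:
        low = term.lower()
        if low in excluded:
            continue
        if any(fnmatchcase(low, pattern) for pattern in patterns):
            kept.append(term)
    return kept


def build_seed(*paper_id_sets, indexed_in_wos):
    """Union the per-source ID sets, then keep only IDs indexed in the WoS."""
    seed = set().union(*paper_id_sets)
    return seed & indexed_in_wos
```

Taking the union of sets means a paper retrieved by more than one route is counted once. The intersection with the WoS index only matters for the Medline-derived papers, since the other sources are drawn from the WoS to begin with.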

The seed dataset was expanded by searching for gender-related terminology in title, abstract and author keywords in WoS indexed papers. Highly relevant terms were identified using the term frequency-inverse document frequency (TF-IDF) metric. Specifically, the number of times a term or phrase appeared in the seed dataset was divided by the number of times the term appeared in the whole WoS. Terms scoring high on this measure will appear often in the seed dataset but infrequently in the full database. Two lists of terms were analyzed: first, 10 million noun phrases extracted from titles, abstracts and keywords in the WoS, and second, a draft thesaurus of gender equality terms of the European Institute for Gender Equality (EIGE). A TF-IDF weight was calculated for 150,000 WoS expressions and 650 EIGE keywords. The large number of terms tested made it possible to detect unsuspected search terms. The top scoring WoS noun phrases were sex guilt, gender identity disorder, gid, ipv, gender nonconformity, gender identity, woman movement, hegemonic masculinity, abuse woman and batter woman. The top scoring EIGE keywords were violence against women, masculinities, hegemonic masculinity, feminism, femininities, gender role, intimate partner violence, gender equality, heteronormativity, and women empowerment.
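The ratio described above can be sketched in a few lines of Python. This follows only the simple description given here (occurrences in the seed divided by occurrences in the whole database); the exact weighting used in the report may differ, and the function names, the `min_seed_count` parameter and the toy documents are assumptions for illustration. Each document is represented as a list of already-extracted terms or noun phrases, and the seed is assumed to be a subset of the full corpus.

```python
from collections import Counter


def term_counts(documents):
    """Count how many times each term occurs across all documents."""
    counts = Counter()
    for terms in documents:
        counts.update(terms)
    return counts


def seed_specificity(seed_docs, all_docs, min_seed_count=1):
    """Score each term as (occurrences in seed) / (occurrences in full corpus).

    Returns (term, score) pairs sorted from highest to lowest score,
    with ties broken alphabetically.
    """
    seed = term_counts(seed_docs)
    full = term_counts(all_docs)
    scores = {
        term: n / full[term]
        for term, n in seed.items()
        if n >= min_seed_count and full[term] > 0
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data: two seed papers inside a four-paper "database".
seed_docs = [["gender identity", "survey"],
             ["gender identity", "hegemonic masculinity"]]
all_docs = seed_docs + [["survey", "protein folding"], ["survey"]]
ranking = seed_specificity(seed_docs, all_docs)
```

In this toy example "gender identity" and "hegemonic masculinity" occur only in the seed and score 1.0, while the generic term "survey" scores one third. Raising `min_seed_count` would be one way to keep rare terms from ranking highly on the strength of a single occurrence.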

The full WoS was searched for papers containing the top scoring terms in their title, abstract or author keywords. For each term, a random sample of records was manually checked to tune the search. For example, the keyword gender role retrieved articles studying male and female animals in the biology field. As a result, biology papers were filtered out from the results of the search for this term. The final dataset built in this fashion contained 212,600 papers including a gender dimension in their research content.
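The tuning step, where a term is kept but restricted to certain fields, can be pictured as a rule table mapping each search term to the fields excluded for it. The Python sketch below is an assumption-laden illustration, not the report's method: the `Paper` class, the `SEARCH_RULES` table and plain substring matching are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    paper_id: str
    field: str
    text: str  # title, abstract and author keywords, concatenated


# Hypothetical rule table: search term -> fields excluded after manual checks.
SEARCH_RULES = {
    "gender role": {"biology"},
    "hegemonic masculinity": set(),
}


def search(papers, rules):
    """Return IDs of papers matching any term outside that term's excluded fields."""
    hits = set()
    for paper in papers:
        text = paper.text.lower()
        for term, excluded_fields in rules.items():
            if term in text and paper.field.lower() not in excluded_fields:
                hits.add(paper.paper_id)
                break  # one qualifying term is enough
    return hits


papers = [
    Paper("p1", "Biology", "Gender role in zebra finch song"),
    Paper("p2", "Sociology", "Gender role attitudes among teachers"),
    Paper("p3", "Biology", "Hegemonic masculinity in lab culture"),
]
matched = search(papers, SEARCH_RULES)
```

Here the biology paper on zebra finches is dropped by the "gender role" rule, while a biology paper matching a different, unrestricted term is still retrieved. The exclusion is attached to the term, not applied to the field as a whole.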

And so, by clever use of database metadata, information science metrics, and manual labor, Science-Metrix managed to devise a reasonable analysis of the frequency with which research in a given country and field addresses gender-related issues. The terms are listed in the appendix to the report for anybody who would like to continue this line of work.

European Commission & Directorate-General for Research and Innovation, 2016, She Figures 2015, doi:10.2777/744106.

Science-Metrix & ICF International, 2015, She Figures 2015: Comprehensive Methodology – New Research & Innovation Output Indicators, report submitted to European Commission, Directorate-General for Research and Innovation.


All views expressed are those of the individual author and are not necessarily those of Science-Metrix, 1science or Georgia Tech.


About the author

Diana Hicks

Professor Diana Hicks specializes in science and technology policy as well as in the innovative use of large databases of patents and papers to address questions of broad interest at the intersection of science and technology. Her recent work focuses on the challenges of bibliometric analysis in the social sciences and humanities and on developing broad understanding of national performance-based research funding systems and their consequences. Professor Hicks’s work has appeared in such journals as Policy Sciences, Social Studies of Science, Nature, Research Policy, Science and Public Policy, Research Evaluation, Research Technology Management, R&D Management, Scientometrics, Revue Economique Industrielle, Science Technology and Human Values, Industrial and Corporate Change, and Japan Journal for Science, Technology and Society. She was also lead author of the Leiden Manifesto (2015), which presented 10 principles for guiding research evaluation and has been translated into 11 languages. Hicks is a Professor in the Georgia Tech School of Public Policy, and chaired the School from 2003 to 2013. Prior to this, Professor Hicks was the Senior Policy Analyst at CHI Research, Inc. She was also on the faculty of SPRU, University of Sussex (UK), for almost 10 years, taught at the Haas School of Business at the University of California, Berkeley, and worked at the National Institute of Science and Technology Policy (NISTEP) in Tokyo.
