Corpus Design & Selection Criteria
This document outlines the main principles adopted to ensure the integrity and continued relevance of the Oslo Medical Corpus to a wide range of researchers in the broad field of medicine and healthcare, and to those interested in examining the intersection of health and sustainability.
- Corpus Design
- Representativeness and Balance
- Selection Criteria
- Guiding Principles for Purposeful Sampling
- Transparency and Accountability
Corpus Design
A corpus is an electronic collection of texts built according to specific design criteria and for a specific purpose.
Corpus design and selection criteria vary, depending on the type of corpus being compiled. The Oslo Medical Corpus is an open-ended corpus that is intended to grow dynamically and organically as new priorities arise and as the textual universe it aims to capture continues to expand and change. It is a freely accessible resource that is designed to support a wide range of studies on health and health-related topics, including the intersection of health and sustainability, and to facilitate the analysis of very large collections of texts and millions of running words. The difficulty of drawing clear boundaries around either of the overlapping fields of medicine and healthcare aside, the overall topic of health is broadly conceived, and the project strives to maintain an optimal balance between different types of medical discourse on an ongoing basis.
The medical field is no stranger to textual analysis. The highly regulated domain of systematic reviews in particular has propelled a variant of text analysis into a vital methodological tool in medical research. There is therefore room to explore complementary and alternative methods of text analysis. In this respect, the OMC is intended to serve as a connective and complementary tool: to allow scholars to study health and health-related discourses from diverse perspectives, while also facilitating a variety of interdisciplinary encounters.
Unlike most corpora, the OMC is not designed to support linguistic analysis per se but rather to enable researchers and students in the field of health to analyse the evolution and contestation of key concepts in their specific domain. Examples of such concepts include evidence, equity, viability, sustainable development, degrowth and preparedness, among others.
Open-ended corpora such as the OMC generally give more priority to size and currency than to applying stringent criteria to achieve representativeness and balance.
Representativeness and Balance
Representativeness and balance are key considerations in corpus design. They concern the number of texts (or tokens) we decide to include and the proportions in which we include them. As far as balance is concerned, the corpus builder has to decide between two options. The balance may be internal to the corpus, meaning that the proportions of different variables (document types, range of sources, etc.) should be roughly the same irrespective of their level of influence or the proportion they represent of the relevant domain. Alternatively, it may reflect what we estimate to be the proportions of these variables in the textual universe to be represented. This is not an exact science: conceiving the process of building an open-ended corpus such as the OMC as “fluid, organic and cyclical” is therefore considered “the bottom line in corpus design” (Biber 1993:256).
In practice, then, representativeness and balance are ideals we strive for but can never fully achieve in an open-ended, dynamic corpus such as the OMC. There are several reasons for this.
First, the extent to which a corpus can claim to be representative depends on a clear definition of the population under study. But the size of the population to be represented – in this case all texts about health and health-related topics – can never be delineated. No one knows precisely how many texts are available on these topics at any one time, nor can anyone produce a full list of all the sources they may be drawn from.
Second, for an open-ended corpus like the OMC, the textual universe we are attempting to capture is not fixed. It is constantly changing as more texts are produced and new priorities and topics arise. Consider, for example, the extent to which that entire textual universe has changed following the outbreak of Covid-19 in 2020. We believe that an open-ended corpus such as the OMC must be allowed to grow and change parameters if it is to continue to reflect a constantly changing universe and remain responsive to the needs of the research community.
Third, balance is usually defined as a measure of the internal consistency of a corpus in terms of the proportions that are contributed by each variable. This is often (but not always) understood to require the corpus builder to approximate to the actual proportions of the different types of text that exist in the domain they wish to represent. How many texts are produced by policy makers such as WHO and ECDC, for instance, as opposed to texts produced by journals such as The Lancet or by grassroots organisations such as Doctors for Extinction Rebellion or Health Poverty Action? What percentage of these texts are reports, (draft) resolutions, blogs, journal articles, books, or other formats? No one has accurate statistics on these variables at any one time, a situation that is further complicated by the fact that these proportions are not fixed given the fluidity of the entire textual universe. As Sinclair (2004) asserts, “there are no such things as ‘correct proportions’ of components of an unlimited population”.
Fourth, attempts to improve the representativeness and balance of a corpus are further complicated by more pragmatic considerations. Chief among these are copyright restrictions, which weigh heavily on a corpus such as the OMC that is designed to be freely accessible to the research community. Other pragmatic considerations include the relative difficulty of acquiring and preparing particular types of text for inclusion in a corpus. Including spoken encounters such as clinical interactions, for instance, requires addressing issues of privacy and confidentiality and involves far more investment in time and effort than including written documents. Even the heavily funded 100-million-word British National Corpus consisted of 90% written and only 10% spoken language (Weisser 2022:90; Rees 2022:394), and the Corpus of Contemporary American English is 80% written and 20% spoken language (Weisser 2022:90). This is despite the fact that we are all exposed to and engage in producing far more spoken than written discourse.
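The two notions of balance discussed in this section can be illustrated with a small calculation. The following is a minimal sketch only: the document-type labels, the toy corpus of ten texts and the "estimated universe" shares are all invented for illustration and do not describe the OMC. It compares the observed share of each document type against an equal-shares target (internal balance) and against an estimated distribution of the textual universe.

```python
from collections import Counter


def proportions(labels):
    """Return each label's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def balance_gap(observed, target):
    """Return observed share minus target share for every category in either mapping."""
    categories = set(observed) | set(target)
    return {c: observed.get(c, 0.0) - target.get(c, 0.0) for c in sorted(categories)}


# Hypothetical document-type labels for a toy corpus of ten texts.
doc_types = ["report"] * 5 + ["journal article"] * 3 + ["blog"] * 2
observed = proportions(doc_types)

# Internal balance: every category is given the same share.
equal_target = {c: 1 / len(observed) for c in observed}

# Balance against the textual universe: shares follow an estimate (invented numbers).
estimated_target = {"report": 0.2, "journal article": 0.6, "blog": 0.2}

for name, target in [("equal shares", equal_target), ("estimated universe", estimated_target)]:
    gaps = balance_gap(observed, target)
    print(name, {c: round(g, 2) for c, g in gaps.items()})
```

A positive gap means a category is over-represented relative to the chosen target, a negative gap that it is under-represented. As the discussion above makes clear, the estimated shares can only ever be approximations, so such figures are a monitoring aid and not a measure of correctness.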
Selection Criteria
Beyond the general issues of representativeness and balance, corpus builders have to select individual sources and texts on the basis of clear, transparent criteria. These may be divided into external and internal criteria.
External criteria are based on evidence external to the body of the text proper and are less dependent on subjective judgement than internal criteria. They guide the initial selection of sources and of individual texts to be included in the corpus. In the case of the OMC, external criteria include the following parameters. Details under each heading are indicative only (they do not constitute exhaustive lists).
1. Source
Specific sources are selected on the basis of their relevance to the field of health in general and/or to priority topic areas (see item 1 under internal criteria). Examples include the following:
WHO, UNAIDS, ECDC, CDC, Wellcome Trust
The Lancet, BMJ, New England Journal of Medicine, BJGP Open
Amnesty International, Oxfam
Abortion Rights Campaign, Doctors for Extinction Rebellion, Advocates for Youth
The Conversation, OpenDemocracy, The Nation
Jason Hickel blog, Science-based Medicine
2. Document format
Reports, (draft) resolutions, journal articles, articles in online magazines, blogs, books
3. Time span
All other things being equal, priority is generally given to more recent publications. But because the selection of individual items is guided by the topic areas identified as priorities for the SHE community (see below), the time span for selecting texts varies depending on the nature of each topic and the needs of a particular project. In the case of MEDRA, for instance, the starting point is 1973 for the US, 1983 for Ireland and 1985 for Argentina.
4. Region
This relates to both the origin of a document and the geographical region on which the text focuses. In this sense, region is both an external and an internal criterion.
Many of the documents in the corpus are sourced from international or pan-national organisations such as WHO or ECDC and mostly focus on the global context. Others are produced by groups and institutions located in and addressing issues relevant to a particular region. Examples include Abortion Rights Campaign (Ireland), Doctors in Unite (UK), and Asociación Médica Argentina. Different regions are prioritised for different topic areas and different projects.
5. Copyright status
Material included in the corpus must be in the public domain, published under a CC licence that allows for inclusion in an electronic corpus, or covered by explicit permission granted to the OMC by the copyright holder.
Internal criteria are drawn from closer examination of individual texts to determine their relevance and fit within the overall design of the corpus.
1. Topic Area
For the OMC, priority is given to specific topic areas considered of particular interest to the SHE community of scholars and students. These currently include: pandemics/epidemics; health and environmental sustainability; reproductive & sexual health & rights; and adolescent & young people’s health. Priority sub-topics are identified under each area and guide the search for and selection of individual texts. For example, abortion is a key subtopic under reproductive & sexual health & rights; HIV and polio are among a number of key subtopics under pandemics/epidemics.
2. Region
As mentioned under external criteria above, region is both an external and an internal criterion. The region of the world on which the content of a document focuses is taken into consideration, and selection is guided primarily by its relevance to priority topic areas or to a specific project such as MEDRA or Erasmus+.
3. Text and Graphics
Unlike the vast majority of corpora, the OMC is designed to capture both textual and visual material and to provide separate or complementary access to both through the software interface (continually under development). Nevertheless, corpora are designed primarily for the analysis of running text. For a text to be included in the OMC the balance must be largely in favour of running text.
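The external and internal criteria above lend themselves to a simple pre-screening check. The sketch below is illustrative only and is not the OMC's actual workflow: the `Candidate` record, the licence labels in `ALLOWED_LICENCES`, the `check` function and the 0.5 threshold for the share of running text are all assumptions introduced here. The source states only that the balance must be "largely in favour of running text", without giving a figure.

```python
from dataclasses import dataclass

# Hypothetical licence labels treated as cleared for inclusion; not the OMC's actual list.
ALLOWED_LICENCES = {"public-domain", "cc-by", "explicit-permission"}


@dataclass
class Candidate:
    title: str
    source: str
    doc_format: str
    year: int
    region: str
    licence: str
    topic: str
    running_text_share: float  # fraction of the document that is running text (0 to 1)


def check(candidate, priority_topics, min_text_share=0.5):
    """Return a list of reasons for exclusion; an empty list means the text passes."""
    reasons = []
    if candidate.licence not in ALLOWED_LICENCES:
        reasons.append("copyright status not cleared")
    if candidate.topic not in priority_topics:
        reasons.append("outside priority topic areas")
    if candidate.running_text_share < min_text_share:
        reasons.append("too little running text")
    return reasons


# Invented example records, for illustration only.
topics = {"pandemics/epidemics", "health and environmental sustainability"}
report = Candidate("Example report", "Example Agency", "report", 2021,
                   "global", "cc-by", "pandemics/epidemics", 0.9)
leaflet = Candidate("Example leaflet", "Example Group", "blog", 2019,
                    "UK", "all-rights-reserved", "pandemics/epidemics", 0.3)

print(check(report, topics))
print(check(leaflet, topics))
```

Returning the reasons for exclusion, and not a bare yes or no, keeps each decision traceable, which fits the emphasis on clear and transparent criteria. A check of this kind could only support, not replace, the closer examination of individual texts that the internal criteria require.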
Guiding Principles for Purposeful Sampling
Purposeful sampling, a strategy extensively used in qualitative research, involves identifying and choosing cases that are rich in information, thereby maximizing the effectiveness of limited resources (Patton 2002).
The OMC guiding principles for implementing purposeful sampling are as follows:
1. Expert Judgement
Selection of documents within a specified genre or from a specific source relies on expert judgement. Initial pre-screening is conducted, followed by consultation with experts in both corpus analysis and health science. This is to ensure that what is included is relevant to the subject matter and to identify any key documents that warrant prioritization.
2. Ongoing Monitoring of Content
The selection process is open-ended and ongoing. While it is impossible to cover every document within a domain, we continue to add material until saturation is reached within a particular category.
3. Maximum Variation
The OMC is not a corpus of the medical canon. It represents a variety of voices and is medical in content, not in terms of expertise. We deliberately attempt to represent both mainstream and non-mainstream sources and to select documents that exhibit the greatest possible diversity of voices and opinions. For instance, when exploring the topic of abortion, our aim is to ensure that our sample encompasses a wide spectrum of perspectives, including ‘pro-life’ and ‘pro-choice’ voices and various positions in between.
4. User Feedback
Given the open-endedness and ongoing expansion of the OMC, users of the corpus are encouraged to suggest additional content within any of the topic areas we prioritise. Please write to Mona Baker, Gabriela Saldanha or Kyung Hye Kim.
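The maximum variation principle (item 3 above) can be approximated computationally once candidate documents carry descriptive metadata. The following is a minimal sketch under stated assumptions: the function `max_variation_sample`, the metadata fields `source_type` and `stance`, and the four example records are hypothetical, and a greedy strategy is only one of several possible ways to operationalise the principle. At each step it picks the document that contributes the most attribute values not yet represented in the sample.

```python
def max_variation_sample(docs, keys, k):
    """Greedily pick up to k documents, each adding the most unseen attribute values."""
    chosen, seen = [], set()
    remaining = list(docs)
    while remaining and len(chosen) < k:
        # Score each remaining document by how many new (field, value) pairs it adds.
        best = max(remaining, key=lambda d: len({(key, d[key]) for key in keys} - seen))
        chosen.append(best)
        seen |= {(key, best[key]) for key in keys}
        remaining.remove(best)
    return chosen


# Invented metadata for four candidate documents on one topic.
docs = [
    {"id": "a", "source_type": "policy maker", "stance": "pro-choice"},
    {"id": "b", "source_type": "policy maker", "stance": "pro-choice"},
    {"id": "c", "source_type": "grassroots", "stance": "pro-life"},
    {"id": "d", "source_type": "journal", "stance": "intermediate"},
]

sample = max_variation_sample(docs, keys=["source_type", "stance"], k=3)
print([d["id"] for d in sample])
```

In this toy example document "b" is passed over because it duplicates the profile of "a". In practice the metadata labels themselves depend on the expert judgement described under item 1, so a procedure like this would serve as an aid to human selection and not as a substitute for it.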
Transparency and Accountability
Details of the full list of texts included in the OMC at any one time are readily available to the research community through the website. Click on Contents of the Oslo Medical Corpus to access the full database.
References
Biber, D. (1993) ‘Representativeness in Corpus Design’, Literary and Linguistic Computing 8(4): 243-257.
Patton, M.Q. (2002) Qualitative Research and Evaluation Methods, third edition, Thousand Oaks, CA: Sage Publications.
Rees, G. (2022) ‘Using Corpora to Write Dictionaries’, in A. O’Keeffe and M.J. McCarthy (eds) The Routledge Handbook of Corpus Linguistics, second edition, Abingdon: Routledge, 387-404.
Sinclair, J. McH. (2004) ‘Corpus and Text: Basic Principles’, in M. Wynne (ed.) Developing Linguistic Corpora: A guide to good practice. Available at https://users.ox.ac.uk/~martinw/dlc/chapter1.htm#section4.
Weisser, M. (2022) ‘What Corpora Are Available?’, in A. O’Keeffe and M.J. McCarthy (eds) The Routledge Handbook of Corpus Linguistics, second edition, Abingdon: Routledge, 89-102.