Historian Paul Edwards has described climate science as A Vast Machine. Climate models incorporate knowledge from a wide range of disciplines, including atmospheric physics, chemistry, ecology, and economics. Additionally, they rely on measurements collected by a wide range of instruments and by a wide range of people with disparate priorities. These models are vital for organizations such as the Intergovernmental Panel on Climate Change (IPCC) to make predictions about the climate and to form policy recommendations.
Through citation analysis, my project aims to understand how climate scientists collectively produce knowledge and, by extension, to learn about science in general as a collective enterprise.
Motivation and Background
Do scientists self-organize in order to best produce significant knowledge? In their classic debate over the organization of science, chemist-turned-philosopher Michael Polanyi and physicist-turned-sociologist John Desmond Bernal clashed over whether research in Britain should be coordinated by a centralized body or left to the individual initiatives of scientists to produce spontaneous organization. Bernal argued:
Scientific activities have in detail grown in a spontaneous way, while the organizations to co-ordinate these activities have not been planned beforehand, but have grown up with the development of science itself, always at a slower rate than the activities they organize….
Informal methods of co-operation, though moderately successful inside any branch of science, almost completely break down between the sciences…. The result of this is an enormous lag in the appreciation of the relevance of one field of science to another. (Bernal 1939)
In contrast, Polanyi claimed:
Self-coordination of independent initiatives leads to a joint result which is unpremeditated by any of those who bring it about. Their coordination is guided as by ‘an invisible hand’ towards the joint discovery of a hidden system of things….
Any attempt to organise the group of helpers under a single authority would eliminate their independent initiatives and thus reduce their joint effectiveness to that of the single person directing them from the centre. (Polanyi 2000)
Although their debate occurred almost a century ago, it is still just as relevant today. How do different methods of funding and promotion affect scientists’ research strategies? How should large funding bodies such as the National Science Foundation evaluate grant proposals to best serve the needs of society? How can scientific publication be designed to best facilitate scientific communication?
These are some of the questions asked by “systems-oriented social epistemology”, the branch of philosophy of science devoted to evaluating and making recommendations about the organizational structure of science (A. I. Goldman 2011). Recently, building upon the early work of Goldman (A. I. Goldman and Shaked 1991) and Philip Kitcher (Kitcher 1990), social epistemologists have begun constructing formal models of scientific research. These models purport to reveal surprising and significant truths about the performance of science as an epistemic enterprise. Kitcher’s model, for instance, is meant to show that scientists with pecuniary motivations can collectively outperform scientists who are solely concerned with the truth. A model by Kevin Zollman claims to demonstrate that increased communication between scientists can reduce their chances of arriving at the correct consensus (Zollman 2007), and a model from Weisberg and Muldoon purports to demonstrate that communities with diverse epistemic strategies outperform those employing a single strategy (Weisberg and Muldoon 2009). If accurate, the results of these studies would be important not only for understanding science as a collective enterprise, but also for informing science policy.
However, all of these models are highly idealized. “Scientists” are simple agents following mechanical rules or maximizing utility functions. They exist in worlds that seem far removed from actual scientific practice, and they pursue aims that only vaguely resemble those of real scientists. While these models might plausibly capture significant features of scientific research, no effort has been made to evaluate them empirically or to apply them to concrete cases. Given the current state of these models, it is difficult to claim that any of their conclusions are genuinely informative about scientific research. Citation analysis has the potential to allow formal models of science to be confronted with data. For example, citation analysis could evaluate whether and how scientists can optimize citation counts in pursuit of further recognition and career advancement. In turn, this could inform how scientists are represented in abstract models. It could also be used to explore how scientists self-organize to pursue competing or complementary research strategies.
As Karin Knorr-Cetina has documented, scientific disciplines have come to different answers as to how to best organize themselves to pursue knowledge (Knorr-Cetina 1999). High energy physics, in her account, follows a corporatist model, with massive numbers of scientists working on a common project within a single, centrally-managed, organization. Molecular biology, by contrast, is organized into much smaller laboratory groups with no centralized coordinating body. Climate science is interesting partly because it lies between these two poles. Like molecular biologists, climate scientists are organized into relatively small, autonomous groups. But like high energy physicists, climate scientists have a common epistemic goal: to predict the possible future behavior of the climate.
Sociologist Mikaela Sundberg has argued that:
Because of their dominant position, climate models have the potential to unite several areas of environmental science through a process in which the interests of other scientists become translated into those of climate modellers.
Although experimentalists construct their problematizations on basis of climate models, and sometimes parameterizations, it is not necessarily the case that interessement [translation] takes place in practice. (Sundberg 2007)
If Sundberg’s observations are correct—that climate models have a dominant position in climate science and yet other specialties sometimes fail to produce results that can be easily used by modelers—this suggests that the spontaneous organization suggested by Polanyi might not operate as well as he supposed, and that climate science could benefit from the sort of centralized direction proposed by Bernal. However, Sundberg’s conclusions are based on interviews with a small group of climate scientists. Citation analysis has the potential both to corroborate Sundberg’s claims about the organizational structure of climate science (that climate models are dominant) and to evaluate her claim that there is sometimes a failure to translate results into a form that is useful to modelers. Further, citation analysis should reveal whether parameterizations (numerical approximations) occupy an intermediary position between climate models and experimental results.
In general, citation analysis offers a way to obtain a global view of the organization of climate science, one that can be used to corroborate or question conclusions drawn through other methods, and it offers an arena in which formal models of scientific research can be developed and tested.
Data and Methods
Citation analysis studies the relationships between objects (papers, journals, authors, disciplines, statements) through citation. Many have argued that citations are a kind of currency in science. As David Hull put it, scientists exchange recognition for use (Hull 1988), and citations are markers of that recognition. While scientists cite for many reasons, citation analysis assumes that, in aggregate, citations indicate an epistemic relationship between objects. When modeling the behavior of scientists, it also assumes that scientists are motivated by the accrual of citations, as citations are a means of recognition and scientists are assumed to be motivated by, among other things, a desire for recognition.
Although this project is still in its early stages, in this section I will discuss some of my early results and methods. This project uses citation data from several sources, primarily Web of Science, Crossref, and the Intergovernmental Panel on Climate Change (IPCC). Metadata on papers and journals are collected from these sources, including titles, authors, abstracts, publication dates, journals, keywords, citations, and sometimes full text. I am currently working with two databases: one based exclusively on Web of Science data, covering approximately 700,000 papers, and one drawn from a variety of sources, covering approximately 500,000 papers.
One of the most common uses of citation analysis is to identify communities of papers, journals, or authors and to study the relationships between these communities (Leydesdorff and Rafols 2009). Here is one such map generated from my data, showing papers cited by the IPCC 5th Assessment Report, Physical Science Basis:
Here, nodes represent papers (with size proportional to citation count), and edges represent common bibliography items. For ease of interpretation, only the most frequently cited papers and most significant bibliographic connections are depicted, though the map is arranged using all available information. Bibliographic coupling frequency, the technique used to construct this figure, considers two papers to be related to the extent that they have overlapping bibliographies. Communities of papers (indicated by differing colors as well as spatial arrangement) are delineated using the Louvain community detection algorithm, which iteratively groups papers so as to maximize modularity, that is, the density of links within communities relative to what would be expected if links were placed at random.
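To make the idea of bibliographic coupling concrete, here is a minimal sketch, assuming each paper's bibliography is available as a list of reference identifiers. The function name and the toy data are hypothetical and are not taken from the project's actual pipeline.

```python
from itertools import combinations

def bibliographic_coupling(bibliographies):
    """Return {(paper_a, paper_b): number of shared references}.

    `bibliographies` maps a paper identifier to the list of references it
    cites. Pairs with no shared references are omitted.
    """
    coupling = {}
    for a, b in combinations(sorted(bibliographies), 2):
        shared = len(set(bibliographies[a]) & set(bibliographies[b]))
        if shared:
            coupling[(a, b)] = shared
    return coupling

# Hypothetical toy data: paper -> references it cites
toy_bibliographies = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3", "R4"],
    "P3": ["R5"],
}
coupling = bibliographic_coupling(toy_bibliographies)
```

The resulting pair weights could serve as edge weights in a graph handed to a community detection routine such as a Louvain implementation. Comparing every pair of papers is quadratic in the number of papers, so a database of this size would likely need an inverted index from references to citing papers instead.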
From this figure, communities are identified using a combination of word frequency analysis of the text of papers’ abstracts and manual inspection of paper titles and abstracts:
Here, nodes represent communities of papers and edges indicate citations from one community to another. The thickness of arrows indicates the significance of the flow of citations: thickness is proportional to the number of standard deviations above the mean number of citations to randomly selected groups of papers from the database. This effectively normalizes edge weight. Again, only the most significant connections are shown. Without filtering, most nodes would show some flow of citations between them.
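The normalization described above can be sketched as a z-score against randomly drawn groups. This is an illustration under stated assumptions, not the exact procedure used for the figure: the baseline groups are assumed to be the same size as the target community, and the function name, sample count, and toy data are hypothetical.

```python
import random
import statistics

def citation_flow_zscore(citations, source, target, all_papers,
                         n_samples=1000, seed=0):
    """Standard deviations by which the citation flow from `source` to
    `target` exceeds the flow from `source` to random groups of papers."""
    rng = random.Random(seed)

    def flow(group):
        # Count citations from papers in `source` to papers in `group`.
        group = set(group)
        return sum(1 for citing in source
                   for cited in citations.get(citing, ())
                   if cited in group)

    observed = flow(target)
    # Assumption: baseline groups have the same size as the target group.
    baseline = [flow(rng.sample(all_papers, len(target)))
                for _ in range(n_samples)]
    spread = statistics.pstdev(baseline)
    if spread == 0:
        return 0.0
    return (observed - statistics.mean(baseline)) / spread

# Hypothetical toy data: two source papers that both cite the target group
toy_citations = {"A1": ["T1", "T2"], "A2": ["T1", "T2"]}
toy_papers = ["T1", "T2"] + [f"X{i}" for i in range(8)]
z = citation_flow_zscore(toy_citations, ["A1", "A2"], ["T1", "T2"], toy_papers)
```

Edges would then be drawn only when the score exceeds some chosen threshold, with thickness proportional to the score.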
This figure corroborates Sundberg’s claim that models occupy a dominant position in climate science, at least from the perspective of the IPCC. While not the largest set of papers, the “models-evaluation, methodology” group is the most central, with most other groups citing papers from it. Further, a high proportion of papers in the other groups are also related to climate modeling—of ocean currents, precipitation, regional weather, and so on.
These first two figures depict papers directly cited by the IPCC 5th Assessment Report. Moving one level of citation lower, to papers cited by papers cited by the IPCC, expands and complexifies the picture:
The communities in these figures have been constructed similarly to those above, but edges on the left figure indicate that the connected papers have both been cited by a paper in the first group. This method, co-citation analysis, takes papers to be related when they commonly appear together in other papers’ bibliographies. The communities identified here are similar to the ones above (and contain many of the same papers), but there are notable differences. Models play a much less significant role, for instance, though as before there are modeling papers in all of the communities. In these communities, historical studies, including reconstructions, and studies of ocean dynamics play a much more dominant role. Spatially, communities at the top of the diagram (aerosols, ozone, and clouds) focus on atmospheric dynamics, those at the bottom (glaciers, arctic, southern hemisphere, and the two oceans communities) focus on oceans, and the carbon cycle community is concerned with terrestrial phenomena: forests, agriculture, and pollution, among other topics. The historical community has significant connections with nearly all the other communities, as does the main ocean community.
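Co-citation counts can be computed in much the same way as bibliographic coupling, but with the roles reversed: two papers are linked each time they appear together in a citing paper's bibliography. The following is a minimal sketch with a hypothetical function name and toy data.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(bibliographies):
    """Count how often each pair of references is cited together.

    `bibliographies` maps a citing paper to the list of references it cites.
    """
    counts = Counter()
    for refs in bibliographies.values():
        # Each unordered pair in one bibliography is one co-citation.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts

# Hypothetical toy data: citing paper -> references it cites
toy_citing = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3"],
    "P3": ["R1"],
}
cocited = cocitation_counts(toy_citing)
```

In the setting described above, the citing papers would be those cited directly by the IPCC report, and the references would be the second-level papers being mapped.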
These maps reduce thousands of papers across dozens of journals to a comprehensible picture of some of the most important areas of climate research, and also inform us about the relationships between these specialties. Further, they show how different types of research become more or less prominent as research becomes more removed from the specific requirements of the IPCC reports.
While these figures are informative, they don’t have much potential for making evaluative judgments. One evaluative question we might ask is what proportion of climate research has some influence on the IPCC reports. To address this question, I recursively searched the database for papers cited by papers already connected to the IPCC report through citation. The diagrams above show papers that are one or two degrees removed from the IPCC report. What happens when we go further? Here is a chart showing the number of papers reached at each degree of separation from the IPCC report:
As we move further down the citation chain from the IPCC report, the number of papers linked to it through citation expands rapidly, with about 80,000 papers reached after just three degrees. What proportion of papers in the database are ultimately linked to the IPCC report in this way?
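The recursive search can be sketched as a breadth-first traversal of the citation graph, recording which papers are first reached at each degree of separation. The function name and the toy graph are hypothetical. The report itself is treated as degree zero, so the papers it cites directly form degree one.

```python
def papers_by_degree(citations, seeds, max_degree):
    """Return a list of sets: papers first reached at degree 1, 2, ...

    `citations` maps a document to the documents it cites.
    `seeds` are the starting documents (degree 0).
    """
    seen = set(seeds)
    frontier = set(seeds)
    layers = []
    for _ in range(max_degree):
        next_frontier = set()
        for paper in frontier:
            for cited in citations.get(paper, ()):
                if cited not in seen:
                    seen.add(cited)
                    next_frontier.add(cited)
        if not next_frontier:
            break  # nothing new is reachable
        layers.append(next_frontier)
        frontier = next_frontier
    return layers

# Hypothetical toy citation graph
toy_graph = {
    "IPCC": ["A", "B"],
    "A": ["C"],
    "B": ["C", "D"],
    "C": ["E"],
    "D": ["A"],
}
layers = papers_by_degree(toy_graph, ["IPCC"], max_degree=3)
papers_per_degree = [len(layer) for layer in layers]
```

Tracking the `seen` set means each paper is counted only at the smallest degree at which it is reached, which keeps citation cycles from inflating the counts.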
The chart on the left shows the fraction of papers connected to the IPCC through citation, by year from 1990-2012, for the entire database. The mean fraction of connected papers is 0.4. This includes, however, papers that were published in the years immediately preceding the 5th Assessment Report, which was published in 2013. Once papers reach maturity, about 10 years out from the report in the early 2000s, the average hovers around 0.5. The left figure includes journals such as Science and Nature, where many articles are never intended to contribute to climate science. The right includes only papers published in journals explicitly concerned with climatology. Here the mean fraction connected is 0.76, and is over 0.8 for mature papers. While I have not compared these results to review articles in other disciplines, this fraction seems very high. It suggests that climate scientists can truly be considered to be working together on a common project, and supports Polanyi’s contention that scientists spontaneously organize themselves to produce research that will be most useful to their peers. This fraction is even more impressive when you consider that my database does not achieve full coverage of climate science. It is possible that with a more comprehensive database, the fraction of papers that can trace some link to the IPCC would be even higher. It will also surely be higher once citations connecting to the other IPCC reports are considered.
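Given the set of papers reachable from the report, the yearly fractions reduce to a simple grouped ratio. The following is a minimal sketch, assuming publication years are known for every paper in the database. The function name and toy values are hypothetical and do not reproduce the figures reported above.

```python
from collections import defaultdict

def connected_fraction_by_year(pub_year, connected):
    """Fraction of papers published each year that are in `connected`.

    `pub_year` maps a paper to its publication year.
    `connected` is the set of papers linked to the report through citation.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for paper, year in pub_year.items():
        totals[year] += 1
        if paper in connected:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Hypothetical toy data
toy_years = {"A": 2000, "B": 2000, "C": 2005, "D": 2005, "E": 2005, "F": 2005}
toy_connected = {"A", "C", "D", "E"}
fractions = connected_fraction_by_year(toy_years, toy_connected)
```

Restricting `pub_year` to papers from climatology journals before calling the function would give the journal-specific comparison.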
These last figures demonstrate one way in which citation analysis can be used to make evaluative judgments about the organization of science. How might citation analysis inform formal models of science, such as those from Kitcher, Zollman, Muldoon, and Weisberg?
One way that citation analysis can inform models is by suggesting realistic ways to parameterize models: for example, setting realistic community sizes, publication frequencies, and degrees of connectedness between communities. Another way is to construct models that predict citation patterns and use citation analysis to evaluate the results of those models. In the first case, by making formal models more realistic, citation analysis can increase the credibility of their results. In the second case, by evaluating the predictions of models, citation analysis can increase our confidence that the causal structure of those models reflects that of science. Here is a simple example of the former:
This chart shows simulated citation counts for three strategies scientists might employ when targeting their papers. Sundberg described a possible disconnect between the needs of climate modelers and the data provided to them by experimentalists. If this disconnect is real, perhaps it is due to insufficient incentives for experimentalists to produce results that are accessible outside of their narrow specialty. Here, scientists can follow one of three strategies: produce papers that will be equally useful to those inside and outside their specialty (“Uniform Q”), papers that are specialized to a random degree (“Random A”), or papers that are extremely specialized (“Extreme A”). Then citation counts are generated according to a Pareto (long-tailed) distribution. Here is the formal specification of the model:
Citation analysis informs this model by providing realistic values for the 𝜆 and 𝜅 parameters. Alternatively, one could posit degrees of specialization and test to see whether the results of the model match actual citation counts.
This is a simple model, but one can imagine much more elaborate models including agent-based simulations in which papers accrue citations through time, papers take time to produce, authors have varying utility functions, authors have varying talent, authors discover papers through previous citation, there is an adjustable reward structure, and so on. Ultimately, models need to be constructed in a way that is amenable to the type of data available from citation analysis, but there are a wide range of models of this sort that could conceivably be constructed.
So far, the results of my research, summarized above, are preliminary and suggestive. In the immediate future I aim to test my preliminary conclusions more rigorously and to further develop my methods of identifying and classifying communities. In particular, I am investigating ways to identify papers that specifically report the work of climate modelers and experimentalists, to further test Sundberg’s conclusions. I will also expand the range of papers examined, starting with the other IPCC 5th Assessment Reports (Impacts and Mitigation). After analyzing these data, I should have a more complete picture of the epistemic structure of climate science, on which to base more normative explorations.
Ultimately the aim of this project is to use this structure to create and evaluate models of scientific activity and collaboration. The simple model discussed above is an initial experiment. Hopefully these models and other evaluative measures will produce results that are useful to both social epistemologists who are interested in the cognitive division of labor in science and policymakers and scientists who are interested in the operation of climate science.