Working with ISI data:
Beware of Categorisation Problems
© Copyright 2010 Anne-Wil Harzing. All rights reserved.
First version, 20 January 2010.
An updated and much extended version of this white paper was published as:
Harzing, A.W. (2013) Document categories in the ISI Web of Knowledge: Misunderstanding the Social Sciences?, Scientometrics, vol. 93, no. 1, pp. 23-34. Available online...
In this brief white paper I discuss a categorisation problem in Thomson Reuters' ISI Web of Knowledge. ISI appears to regularly misclassify journal articles containing original research into the "review" or "proceedings paper" category, rather than assigning them to the appropriate "article" category. As many research benchmarking exercises and research projects only consider original research papers (those categorised in the article category), this categorisation problem results in an underestimation of research impact.
Keywords: bibliometrics, ISI document types, research impact, research quality, categorisation
Despite the availability of alternatives such as Scopus and Google Scholar, Thomson Reuters’ ISI Web of Knowledge (or ISI for short) is still used in the majority of research output benchmarking analyses and bibliometric research projects. Therefore, it is important to be aware of the limitations of the data provided by ISI. Harzing & van der Wal (2008) discuss a large number of ISI limitations relating mainly to the lack of comprehensive coverage, especially in the Social Sciences and Humanities.
This paper deals with another limitation that might impact all fields of study: ISI’s misclassification of journal articles containing original research into the review or proceedings paper category. In the ISI Web of Knowledge each item is categorised into a particular "document type" category. Overall, there are nearly 40 different document types, but the most frequently used are: "article", "review", and "proceedings paper".
The ISI Web of Knowledge does not provide a definition of any document type in their helpfile, but in various documents (e.g. Journal Citation Report Quick Reference Card), Thomson contrasts "review articles" with “original research articles”. There is no commonly agreed definition of review articles and different disciplines might value them differently. However, in general parlance review articles are defined as articles that do not contain original data and simply collect, review and synthesise earlier research.
Thomson does not define proceedings papers either, but one can only assume them to be papers published in conference proceedings. Conference proceedings are a very common and respected outlet in some disciplines, such as computer science. However, in Business & Economics they are normally seen as mere stepping stones to future publication in a peer reviewed journal. The more prestigious conferences (such as the Academy of Management and the Academy of International Business) either do not publish proceedings or publish only short abstracted papers.
In general, in most of the Social Sciences neither review articles nor proceedings papers would be considered worthy of the quality stamp reserved for an original piece of research published in a peer reviewed journal.
In collecting 2009 data for a research project on editors and editorial board members (see my research program Quality and Impact of Academic Research), we noticed that some authors seemed to have a rather large number of papers categorised in the "review" and "proceedings paper" document type. This had not been the case for the same authors in our earlier data collection rounds.
A detailed investigation could find no indication that any papers in this category were in fact review papers or conference papers. Without exception they were full-length journal articles published in high level journals that only publish original research such as The Academy of Management Review, Journal of International Business Studies, Strategic Management Journal and Journal of Applied Psychology.
I will give two specific examples to illustrate the extent of this specific categorisation problem. First, Michael Lounsbury, co-editor of Organization Studies, has published seventeen articles in ISI listed journals. However, no less than ten of these seventeen articles are categorised as reviews (6) or proceedings papers (4), leaving him with a much less impressive seven pieces of original research.
Two of Lounsbury's six "review papers" were published in The Academy of Management Journal, a journal that is well-known for only accepting papers that make a very strong original theoretical and empirical contribution. The other four papers were published in Strategic Management Journal, Organization, Organization Studies and Social Forces, all journals that would definitely not publish any articles that simply synthesised previous research. So why were these articles categorised as review papers?
Two of Lounsbury’s four "proceedings papers" were published in the American Behavioral Scientist, whilst the other two appeared in Accounting, Organizations and Society and Journal of Management Studies. Clearly none of these journals would be categorised as collections of conference proceedings. So why were these articles categorised as proceedings papers?
Our second example concerns Jacqueline Coyle-Shapiro, Senior editor of Journal of Organizational Behavior. She has published 12 articles in ISI listed journals. However, half of them are categorised as proceedings papers. These papers were published in the following journals: Journal of Vocational Behavior (twice), Journal of Applied Psychology, Journal of Organizational Behavior, and Journal of Management Studies (twice). As anyone in the field knows, none of these journals are collections of conference papers. So why were these articles categorised as proceedings papers?
The answer to this question presented itself in an FAQ (Why has the number of articles in the Web of Science gone down and the number of proceedings papers gone up) provided by Thomson Reuters. According to Thomson Reuters a ‘Proceedings Paper’ is:
a document in a journal or book that notes the work was presented - in whole or in part - at a conference. This is a statement of the association of a work with a conference. Prior to October 2008, these items displayed as "Article" in the Web of Science product.
Indeed, when verifying the “proceedings papers” by Lounsbury and Coyle-Shapiro, we found that the acknowledgements in their articles carried innocent notes such as “A portion of this paper was presented at the annual meeting of the Academy of Management, San Diego, 1998” or “An earlier version of this paper was presented at the Annual Meeting of the Academy of Management, Chicago, 1999” or "This paper builds on and extends remarks and arguments made as part of a 2006 Keynote Address at the Interdisciplinary Perspectives on Accounting Conference held in Cardiff, UK". Most of these papers were published before 2008. Hence ISI seems to have changed these classifications retroactively.
So wait a minute: simply presenting an early version of your ideas in a 10-15 minute (or shorter) slot at a conference or workshop (some of the acknowledgments even referred to small workshops), perhaps attended by fewer than a dozen people, appears to mean that your paper is downgraded by ISI to be a “conference proceedings paper” even though the conference in question doesn’t even publish proceedings?
Perhaps more disturbingly, such categorisation also seems to show a rather limited understanding of the research process in the field of Management (and many other fields). Any research paper worth its salt will have been presented in at least one conference or workshop. In fact, I would consider it very unlikely that more than an incidental paper would be accepted for publication in a top journal in our field without ever having been presented publicly to receive feedback.
Does that mean that from 2008 onwards all of the papers published in our top journals are categorised as conference papers? No, this appears to happen only to those papers whose authors were honest enough to acknowledge that early versions of the paper had been presented at a conference, or to papers whose authors were kind enough to thank participants of a particular workshop for their input. A nice reward for being professional and collegial!
This categorisation process also appears to show a rather limited understanding of the review process in top journals. Yes, early versions of a paper might have been presented at conferences. However, the paper that is subsequently submitted to a journal will normally be vastly different from the paper that was earlier presented at a conference. Conferences and workshops are often used as a means to test and polish ideas. Even if authors submit fairly polished papers to conferences, these papers will still generally need to go through two to four rounds of revisions before they are accepted for the journal.
A longer and more extensive process of revision is likely for the many papers that are not accepted by the first journal approached. As acceptance rates of top journals in our field are well below 10%, the reality is that papers are often submitted to several journals before they get their first revise & resubmit. Maturation of the author(s)’ ideas, reorientation toward different journals, as well as the review process itself means that virtually every paper published has been substantially revised. Hence, the end-product published by a journal often bears very little resemblance to the paper that was originally presented at a conference, years before publication.
With the conference proceedings problem "resolved", this leaves us with the puzzling review category. Why are papers that clearly present original research, published in the top journals in our field, categorised as derivative work that synthesises work of other academics? According to Thomson: simply because they have more than 100 references! No, I am not joking. Thomson says:
In the JCR system any article containing more than 100 references is coded as a review. Articles in "review" sections of research or clinical journals are also coded as reviews, as are articles whose titles contain the word "review" or "overview."
When verifying this criterion for the articles published by Michael Lounsbury, I found Thomson to have applied their criteria absolutely as described. Lounsbury's 2001 Administrative Science Quarterly article has 95 references and is categorised in the "article" document type, thus acknowledging it is original research. His 2004 article in Social Forces with 101 references is categorised in the "review" document type, even though the paper has sections titled “Theory and Hypotheses” and “Data and Methods”. In addition, the abstract and even the title clearly refer to empirical work. If this scholar wanted Thomson to recognise his work as original, maybe he should have been a bit less conscientious in identifying the contributions of other authors in his literature review?
This particular discovery also solved a query that had puzzled me for some time: why was one of my two articles in the 2007 special issue on International HRM in Human Resource Management categorised as an “original” research article and the other as “derivative” review, even though the latter was based on time consuming data collection amongst some 850 subsidiaries of multinational companies in three countries? The subtitle of the paper was “An Empirical Investigation of HRM practices in Foreign Subsidiaries” and the paper in question won the best paper award for the journal that year, not exactly an honour one would expect to be bestowed on derivative work. The answer turned out to be very simple. My co-author, Markus Pudelko, had displayed just a little too much of his wide reading and understanding of comparative HRM (at least according to Thomson), as the article included 103 references.
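The rule quoted above is simple enough to write down. The following is a minimal sketch of it in Python, based only on Thomson's wording as quoted; it is not Thomson's actual implementation, and the function name, parameter names and the title-matching details are my own assumptions for illustration.

```python
import re

# "more than 100 references" in the quoted Thomson text
REFERENCE_THRESHOLD = 100


def jcr_document_type(reference_count, title="", in_review_section=False):
    """Return "review" or "article" under the quoted rule (illustrative only).

    The three conditions mirror the quoted text: reference count above the
    threshold, placement in a "review" section, or a title containing the
    word "review" or "overview". How Thomson matches titles is not stated;
    whole-word, case-insensitive matching is an assumption.
    """
    if reference_count > REFERENCE_THRESHOLD:
        return "review"
    if in_review_section:
        return "review"
    if re.search(r"\b(review|overview)\b", title, flags=re.IGNORECASE):
        return "review"
    return "article"


# Reference counts taken from the examples discussed in the text.
examples = {
    "2001 Administrative Science Quarterly article": 95,
    "2004 Social Forces article": 101,
    "2007 Human Resource Management article": 103,
}

for label, count in examples.items():
    print(f"{label}: {count} references -> {jcr_document_type(count)}")
```

Note that nothing in this rule looks at whether the paper reports original data: a paper with 100 references remains an article, whilst the same paper with one additional reference becomes a review.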
Thomson does not list any particular rationale for why papers with more than 100 references should be considered to be review articles that do not contain original research. It is true that a “real” review article providing, for instance, a literature review of 30 years of publications in a particular field will tend to have many references. However, the reverse certainly does not hold true: there are many papers with more than 100 references that are not review articles. One cannot presume that there is a direct relationship between the number of references contained in a paper and its level of originality. Thomson also does not provide any rationale for the seemingly arbitrary cut-off point. Perhaps Thomson simply saw 100 as a nicely convenient round figure?
This then brings us to our next question. Why has the number of papers categorised as proceedings papers and review articles increased over time? For proceedings papers, the answer to this question is very simple. In 2008 ISI integrated their separate conference proceedings database into the Web of Science. At that point in time many journal articles were retrospectively categorised as conference papers.
As we had collected our 1989-2004 data well before 2008, we had not experienced this problem before. However, any bibliometric research study conducted these days will need to comb through this document categorisation with a fine toothcomb in order to separate the real conference proceedings papers from regular articles whose authors happened to mention a conference presentation in their acknowledgements.
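As a starting point for such a check, one could flag every record that is typed as a proceedings paper but appears in a known journal, and then inspect those records by hand. The sketch below assumes the records have already been exported into a list of dictionaries; the field names "source" and "document_type", the function name, and the sample records are hypothetical and do not reflect any actual ISI export format.

```python
def flag_for_manual_check(records, known_journals):
    """Return records typed "proceedings paper" whose source is a known journal.

    These are the candidates that may be regular journal articles whose
    authors merely acknowledged a conference presentation.
    """
    journals = {name.casefold() for name in known_journals}
    return [
        record
        for record in records
        if record["document_type"].casefold() == "proceedings paper"
        and record["source"].casefold() in journals
    ]


# Hypothetical sample records, for illustration only.
records = [
    {"source": "Journal of Management Studies", "document_type": "Proceedings Paper"},
    {"source": "Journal of Applied Psychology", "document_type": "Article"},
    {"source": "Hypothetical Conference Proceedings", "document_type": "Proceedings Paper"},
]
known_journals = ["Journal of Management Studies", "Journal of Applied Psychology"]

flagged = flag_for_manual_check(records, known_journals)
for record in flagged:
    print(record["source"])
```

The flagged records still need to be verified manually, as the document type alone does not reveal whether the journal issue was a genuine collection of conference papers.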
The reason for the increase of review papers is a bit less straightforward. On one level the answer is very simple: the number of papers with more than 100 references has increased. If we look at one of the very top journals in Management, the Academy of Management Review, we find that whilst in 1990 only 1 out of the 31 published articles was categorised as a review, in 2008 no less than half of the 42 published articles were categorised as reviews (because they had more than 100 references).
This escalation in the number of citations is not surprising given the current reward structure in the profession focused on counting citations: journals get ranked by the number of citations per article, professors get evaluated by their ability to publish in journals that have highly cited articles, libraries include journals that are more highly cited in their collections, universities, in part, achieve higher rankings when their faculty members publish in journals that are cited more often (and thus are higher ranked due to their citations), etc (see Adler & Harzing, 2009).
With an average of 109 references per article (84 for papers categorised as articles and 134 for papers categorised as reviews) in 2008, it appears to be only a matter of time before ISI will consider none of the research published in AMR to be "original research work". That would be rather a shame as some of the best conceptual work in the field is published in AMR. It will also be quite devastating for all those academics who are devoting years of their research career to “getting a hit in AMR” in the hope of increasing their chances for promotion, tenure, and other academic rewards.
Of course there could be many reasons for the increasing number of references in articles. One of them could simply be the increasing availability of relevant literature online, making it easier to cite a larger number of relevant studies. Another reason could be the increase of multi-disciplinary research, which would necessitate the coverage of a broader literature base. Further, the ever increasing rigour of the reviewing process is likely to make reviewers suggest that additional bodies of literature should be covered.
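As a quick consistency check on the 2008 AMR figures, assume that exactly half of the 42 articles (21) were categorised as reviews and the other 21 as articles; the exact split is my reading of "half" and is not stated separately. The two group averages then combine into the overall average as follows:

```latex
\[
\bar{r}
  = \frac{21 \times 84 + 21 \times 134}{42}
  = \frac{1764 + 2814}{42}
  = \frac{4578}{42}
  = 109
\]
```

Under this assumption the overall average of 109 references per article already lies above the 100-reference cut-off.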
Finally, a less salubrious reason might be the increasing tendency for both reviewers and journal editors to chase the goal of achieving a "higher impact" by “suggesting” to authors, often indirectly, that they include additional references to their own work or that of the journal they are reviewer/editor for. Whatever the reason, it is clear that classifying articles as review articles simply because they reached an arbitrary number of references does an injustice to the collective research efforts in our field and seriously undermines the field's ability to assess its own progress.
What can we do to resolve these problems that undeniably bias evidence of academic impact? I have written to ISI to highlight the nature and seriousness of the two problems as outlined above and am waiting for their reply. It might help if more academics also write to ISI expressing their concern about the same issues.
There must be a better way to resolve the impact of these practices on individual scholars and on the field than simply being less honest in acknowledging the presentation of our work at conferences (so as to not have the subsequent publication categorised as a proceedings paper) and less rigorous in acknowledging previous literature in the field (so as to not be categorised as publishing a review paper).
Adler, N.; Harzing, A.W. (2009) When Knowledge Wins: Transcending the sense and nonsense of academic rankings, The Academy of Management Learning & Education, vol. 8, no. 1, pp. 72-95.
Harzing, A.W.; Wal, R. van der (2008) Google Scholar as a new source for citation analysis?, Ethics in Science and Environmental Politics, vol. 8, no. 1, pp. 62-71.