Reflections on Google Scholar

Prof. Anne-Wil Harzing, University of Melbourne
Web: www.harzing.com
Email: pop@harzing.com

© Copyright 2007 Anne-Wil Harzing. All rights reserved.

Document link: http://www.harzing.com/pop_gs.htm

First version, 17 January 2007

Introduction

Instead of the Thomson ISI Web of Science (WoS), Publish or Perish uses Google Scholar (GS) data to calculate its various statistics. An important practical reason for this is that GS is freely available to anyone with an Internet connection. The WoS is only available to those academics whose institutions are able and willing to bear the (quite substantial) subscription costs of the WoS and other databases in Thomson ISI’s Web of Knowledge. As Pauly & Stergiou (2005:34) indicate, “free access to […] data provided by GS provides an avenue for more transparency in tenure reviews, funding and other science policy issues, as it allows citation counts, and analyses based thereon, to be performed and duplicated by anyone”. There are also several other good reasons to use GS to perform citation analyses, which will be covered in this note.

General caveat

The output of Publish or Perish is only as good as its input. Whilst I do believe that GS presents a more complete picture of an academic’s impact than the Thomson ISI WoS, all databases have their limitations. I would like to suggest the following general rule of thumb. If an academic shows good citation metrics, it is very likely that he or she has made a significant impact on the field. However, the reverse is not necessarily true. If an academic shows weak citation metrics, this may be caused by a lack of impact on the field. However, it may also be caused by working in a small field, publishing in a language other than English (LOTE), or publishing mainly (in) books. Although GS performs better than the WoS in this respect, it is still not very good at capturing LOTE articles and citations, or citations in books or book chapters. As a result, citation metrics in the Social Sciences and even more so in the Humanities will always be underestimated, as in these disciplines publications in LOTE and books/book chapters are more likely than in the Sciences.

The disadvantage of using Thomson ISI Web of Science for citation analyses

The major disadvantage of the WoS is that it may provide a substantial underestimation of an individual academic’s actual citation impact. This is true for both the “general search” function and the WoS “cited reference” function, the two functions most generally used to perform citation analyses, although the “general search” function performs more poorly in this respect than the “cited reference” function. For example, the current (January 2007) number of citations to my own work is 97 with the “general search” function, 287 with the “cited reference” function and 658 with GS. My h-index is 7 with the “general search” function, 10 with the “cited reference” function and 13 with GS.
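
The h-index quoted above is the largest number h such that h of an academic's publications have each received at least h citations. As a minimal sketch of that calculation (not the actual Publish or Perish implementation; the function name and the sample counts are hypothetical and purely illustrative), it could be computed in Python as follows:

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the first `rank` papers all have >= rank citations
        else:
            break      # later papers have even fewer citations
    return h


# Hypothetical citation counts per paper, for illustration only
example_counts = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example_counts))
```

Because the index depends on the whole ranked list rather than on the total, a few extra citations to a single paper seldom change it. This may help explain why the h-index differs less between data sources than the raw citation count does.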

Differences will not be as dramatic for all scholars, but virtually all academics show a substantially higher number of citations in GS than in the WoS. For instance, Nisonger (2004) found that (excluding self-citations) WoS captured only 28.8% of his total citations, 42.2% of his print citations, 20.3% of his citations from outside the United States, and a mere 2.3% of his non-English citations. He suggests that librarians and faculty should not rely solely on WoS author citation counts, especially when demonstration of international impact is important. Nisonger also summarises several other studies that found WoS citation data to be incomplete.

Meho & Yang (2006) conducted a large-scale comparison between WoS, Scopus (Elsevier’s alternative to Thomson ISI’s WoS) and GS covering citations of over 1,000 scholarly works of all 15 faculty members of the School of Library and Information Science at Indiana University Bloomington between 1996 and 2005. They found the overlap in citations between the three databases to be rather small. The overlap between WoS and Scopus was 58.2%. The overlap between GS and the union of WoS and Scopus was only 30.8%. This small overlap is largely caused by the fact that GS produced more than twice as many citations as WoS and nearly twice as many citations as Scopus. Many of those additional citations came from conference papers, doctoral dissertations, master’s theses and books and book chapters.
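
To make the notion of overlap concrete, the sketch below computes it for sets of citations. It assumes, since this note does not spell out the definition used by Meho & Yang, that overlap means the citations found by both sources as a share of the citations found by either. The function names and the citation identifiers are hypothetical.

```python
def overlap_share(a, b):
    """Share of all citations found by either source that both sources found."""
    either = a | b
    if not either:
        return 0.0
    return len(a & b) / len(either)


def missed_share(source, reference):
    """Share of the citations in `reference` that `source` fails to find."""
    if not reference:
        return 0.0
    return len(reference - source) / len(reference)


# Hypothetical citation identifiers, for illustration only
wos = {"c1", "c2", "c3", "c4"}
scopus = {"c2", "c3", "c4", "c5"}
gs = {"c3", "c4", "c5", "c6", "c7", "c8"}

print(overlap_share(wos, scopus))       # 3 shared out of 5 in total
print(overlap_share(gs, wos | scopus))  # 3 shared out of 8 in total
print(missed_share(gs, wos | scopus))   # 2 of 5 not found by GS
```

In this toy example GS has the lowest overlap with the other sources simply because it contains many citations that they lack, which mirrors the explanation given above.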

At the same time, both sources (WoS and GS) have been shown to rank specific groups of scholars in a relatively similar way. Saad (2006) found that for his subset of 55 scientists in consumer research, the correlation between the two h-indices was 0.82. Please note that this does not invalidate the earlier argument, as it simply means most academics’ h-indices are underestimated by a similar magnitude by WoS. Meho & Yang (2006) also found that adding GS results to those of WoS and Scopus separately did not significantly change the ranking of the 15 academics in their survey. The correlation between GS and WoS was 0.874, between GS and the union of WoS and Scopus 0.976.

Meho & Yang (2006) conclude that GS can help identify a significant number of unique citations. These unique citations might not significantly alter one’s citation ranking in comparison to other academics in the same field and might not all be of the same quality as those found in the WoS or Scopus. However, they can be very useful in showing evidence of broader intellectual and international impact than is possible with WoS and Scopus. Hence they conclude GS could be particularly helpful for academics seeking promotion, tenure, faculty positions, research grants, etc.

Why Thomson ISI Web of Science underestimates true citation impact

There are a number of reasons for the underestimation of citation impact by Thomson ISI WoS.

  1. In the General Search function WoS only includes citations to journal articles published in ISI-listed journals (Roediger, 2006). Citations to books, book chapters, dissertations, theses, working papers, reports, conference papers, and journal articles published in non-ISI journals are not included. Whilst in the Natural Sciences and Engineering (NSE) this may give a fairly comprehensive picture of an academic’s total output, in the Social Sciences and Humanities (SSH) only a limited number of journals are ISI-listed. Also, in both the Social Sciences and the Humanities books and book chapters are very important publication outlets. GS includes citations to all academic publications regardless of whether they appeared in ISI-listed journals (Belew, 2005; Meho & Yang, 2006).
  2. In the Cited Reference function WoS does include citations to non-ISI publications. However, it only includes citations from journals that are ISI-listed (Meho & Yang, 2006). As indicated before, in SSH only a limited number of journals are ISI-listed. Butler (2006) analysed the distribution of publication output by field for Australian universities between 1999 and 2001. She found that whereas for the Chemical, Biological, Physical and Medical/Health sciences between 69.3% and 84.6% of the publications are in ISI-listed journals, for Social Sciences such as Management, History, Education and Arts only 4.4%-18.7% of the publications are published in ISI-listed journals. ISI estimates that of the 2000 new journals reviewed annually only 10-12% are selected to be included in the WoS (Testa, 2004). Archambault & Gagné (2004) found that US and UK-based journals are both significantly over-represented in the WoS in comparison to Ulrich’s journal database. This over-representation was stronger for the Social Sciences and Humanities than for the Natural Sciences and Engineering. In contrast to the WoS, GS includes citations from all academic publications regardless of where they appeared. However, it must be acknowledged that although GS captures more citations in books and book chapters than the WoS (which captures none), it is by no means comprehensive in this respect. Google Book Search may provide a better alternative for book searches.
  3. In the General Search function WoS does not include citations to the same work that have small mistakes in their referencing (which occur very frequently, especially for books and book chapters). In the Cited Reference function WoS does include these citations, but they are not aggregated with the other citations. GS appears to have a better aggregation mechanism than WoS. Even though duplicate publications that are referenced in a (slightly) different way still occur, GS has a grouping function that resolves the worst ambiguities. For instance, my 1996 publication with Geert Hofstede in the research annual Research in the Sociology of Organizations draws 15 WoS citations, but these are spread over 7 different appearances. GS shows 23 citations and has only one appearance for the publication. Belew (2005) confirms that GS has lower citation noise than WoS. In the WoS only 60% of the articles were listed as unique entries (i.e. no citation variations), while for GS this was 85%. None of the articles in his sample had more than five separate listings within GS, while 13% had five or more entries in the WoS.
  4. Whilst the Cited Reference function of WoS does include citations to non-ISI journals, it only includes these publications for the first author. Hence any publications in non-ISI journals where the academic in question is the second or further author are not included. GS includes these publications for all listed authors. For instance, my 2003 publication with Alan Feely in Cross Cultural Management shows no citations in the WoS for my name, whilst it shows 10 citations in GS.
  5. The WoS includes only a very limited number of journals in languages other than English (LOTE) and hence citations in non-English journals are generally not included in any WoS citation analysis. Whilst GS’s LOTE coverage is far from comprehensive, it does include a larger number of publications in other languages and indexes documents in French, German, Spanish, Italian and Portuguese (Noruzi, 2005). Meho & Yang (2006) found that 6.94% of GS citations were from LOTE, while this was true for only 1.14% of WoS citations and 0.70% of Scopus citations. Archambault & Gagné (2004) found that Thomson ISI’s journal selection favours English, a situation attributable to ISI’s inability to analyse the content of journals in LOTE.
  6. The WoS only includes citations in published journal articles. GS also includes citations in conference papers, working papers, and pre-prints of articles to appear in journals (Meho & Yang, 2006). As a result, GS provides a more comprehensive picture of recent impact, especially for the Social Sciences and Humanities, where more than five years can elapse between research appearing as a working or conference paper and research being published in a journal. This also means that GS usually gives a more accurate picture of impact for junior academics.

The disadvantages of using Google Scholar for citation analyses

There are, however, some disadvantages to the use of GS that are not shared by Thomson ISI WoS:

  1. GS sometimes includes non-scholarly citations, such as student handbooks, library guides or editorial notes. However, incidental problems in this regard are unlikely to distort citation metrics, especially robust ones such as the h-index. A casual inspection of my most cited paper (The persistent myth of high expatriate failure) shows that more than 75% of the citations are in academic journals, with the remainder appearing in books, conference papers, working papers and student theses. No non-scholarly citations were found. Moreover, I would argue that even a citation in a student handbook, library guide or editorial note shows that the academic has an impact on the field, even if the field is not narrowly defined as the academic’s scholarly colleagues.
  2. Not all scholarly journals are indexed in GS. Unfortunately, GS is not very open about its coverage and hence it is unclear what its sources are. It is generally believed that Elsevier journals are not included (Meho & Yang, 2006), because Elsevier has a competing commercial product in Scopus. However, I was able to find all Elsevier journals I have published in. On the other hand, Meho & Yang (2006) did find that GS missed 40.4% of the citations found by the union of WoS and Scopus, suggesting that GS does miss some important refereed citations. It must also be said though that the union of WoS and Scopus misses 61.04% of the citations in GS. Further, Meho & Yang (2006) found that most of the citations uniquely found by GS are from refereed sources.
  3. GS does not perform as well for older publications, as these publications and the publications that cite them have not (yet) been posted on the web. Pauly & Stergiou (2005) found that GS had less than half of the citations of the WoS for a specific set of papers published in a variety of disciplines (mostly in the Sciences) between 1925 and 1989. However, for papers published in the 1990-2004 period both sources gave similar citation counts. The authors expect GS’s performance to improve for old articles as journals’ back issues are posted on the web. Meho & Yang (2006) found the majority of the citations from journals and conference papers in GS to be from after 1993. Belew (2005) found GS to be competitive in terms of coverage for references published in the last 20 years, but the WoS superior before then. This means that GS might underestimate the impact of scholars who have mainly published before 1990.
  4. GS’s processing is done automatically without manual cleaning and hence sometimes provides nonsensical results. For instance, one of the citations to my Managing the Multinationals book lists as its title “K., 1999”. The author of the citing paper listed my initials with a comma after the first two initials and hence GS interpreted the third initial and year as the title. However, incidental mistakes like this are unlikely to have a major impact on citation metrics, especially those as robust as the h-index. Moreover, GS is committed to fixing mistakes (GS help function).
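
The two missed shares quoted in point 2 above can be checked against the 30.8% overlap between GS and the union of WoS and Scopus reported earlier in this note. The sketch below assumes, since the definition is not spelled out here, that overlap means citations found by both sources divided by citations found by either. The function name is hypothetical.

```python
def implied_overlap(missed_by_a, missed_by_b):
    """Overlap (shared / found by either) implied by two missed shares.

    missed_by_a: share of source B's citations that source A does not find
    missed_by_b: share of source A's citations that source B does not find
    """
    shared_of_b = 1.0 - missed_by_a  # shared citations as a share of B
    shared_of_a = 1.0 - missed_by_b  # shared citations as a share of A
    if shared_of_a <= 0 or shared_of_b <= 0:
        return 0.0  # the two sources have nothing in common
    # |A or B| = |A| + |B| - shared, expressed in units of the shared count
    return 1.0 / (1.0 / shared_of_a + 1.0 / shared_of_b - 1.0)


# Figures quoted from Meho & Yang (2006) in point 2 above
gs_missed = 0.404      # GS misses 40.4% of the WoS/Scopus union
union_missed = 0.6104  # the WoS/Scopus union misses 61.04% of GS

print(round(implied_overlap(gs_missed, union_missed), 3))
```

Under that assumption the result is 0.308, so the missed shares and the reported 30.8% overlap are mutually consistent.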

References

Archambault, E.; Gagné, E.V. (2004) The Use of Bibliometrics in Social Sciences and Humanities, Montreal: Social Sciences and Humanities Research Council of Canada (SSHRCC), August 2004.

Belew, R.K. (2005) Scientific impact quantity and quality: Analysis of two sources of bibliographic data, arXiv:cs.IR/0504036 v1, 11 April 2005.

Butler, L. (2006) RQF Pilot Study Project – History and Political Science Methodology for Citation Analysis, November 2006, accessed from: http://www.chass.org.au/papers/bibliometrics/CHASS_Methodology.pdf, 15 Jan 2007.

Meho, L.I.; Yang, K. (2006) A New Era in Citation and Bibliometric Analyses: Web of Science, Scopus, and Google Scholar, under review at Journal of the American Society for Information Science and Technology, accessed from: http://dlist.sir.arizona.edu/1695/, 15 Jan 2007.

Nisonger, T.E. (2004) Citation autobiography: An investigation of ISI database coverage in determining author citedness, College & Research Libraries, vol. 65, no. 2, pp. 152-163.

Noruzi, A. (2005) Google Scholar: The New Generation of Citation Indexes, LIBRI, vol. 55, no. 4, pp. 170-180.

Pauly, D.; Stergiou, K.I. (2005) Equivalence of results from two citation analyses: Thomson ISI’s Citation Index and Google Scholar’s service, Ethics in Science and Environmental Politics, December, pp. 33-35.

Roediger III, H.L. (2006) The h index in Science: A New Measure of Scholarly Contribution, APS Observer: The Academic Observer, vol. 19, no. 4.

Saad, G. (2006) Exploring the h-index at the author and journal levels using bibliometric data of productive consumer scholars and business-related journals respectively, Scientometrics, vol. 69, no. 1, pp. 117-120.

Testa, J. (2004) The Thomson Scientific Journal Selection Process, http://scientific.thomson.com/free/essays/selectionofmaterial/journalselection/, accessed 15 Jan 2007.