Google Scholar - a new data source for citation analysis

Prof. Anne-Wil Harzing, University of Melbourne
Web: www.harzing.com
Email: pop@harzing.com

© Copyright 2007-2008 Anne-Wil Harzing. All rights reserved.

Eighth version, 20 December 2008.

Introduction

Instead of the Thomson ISI Web of Science, Publish or Perish uses Google Scholar data to calculate its various statistics. An important practical reason for this is that Google Scholar is freely available to anyone with an Internet connection and is generally praised for its speed (Bosman et al. 2006). The Web of Science is only available to those academics whose institutions are able and willing to bear the (quite substantial) subscription costs of the Web of Science and other databases in Thomson ISI’s Web of Knowledge.

As Pauly & Stergiou (2005:34) indicate “free access to […] data provided by Google Scholar provides an avenue for more transparency in tenure reviews, funding and other science policy issues, as it allows citation counts, and analyses based thereon, to be performed and duplicated by anyone”.

Alastair Smith (2008) compared citation counts from Google Scholar to the research output from universities under New Zealand's PBRF (Performance Based Research Funding) research assessment exercise and found a very high (0.94) correlation between the PBRF output (defined as PBRF quality score times the FTE staff size) and the total number of citations returned by Google Scholar.

However, there are several other good reasons to use Google Scholar to perform citation analyses, which will be covered in this note.

General caveat

The output of Publish or Perish is only as good as its input. Whilst I do believe that in most cases Google Scholar presents a more complete picture of an academic’s impact than the Thomson ISI Web of Science, all databases have their own limitations, most of which are discussed in detail below.

More generally, citations are subject to many forms of error, from typographical errors in the source paper, to errors in Google Scholar's parsing of the reference, to errors due to nonstandard reference formats. Publications such as books or conference proceedings are treated inconsistently, both in the literature and in Google Scholar. Thus citations to these works can be complete, completely missing, or anywhere in between.

Several academics have been very critical of Google Scholar. Péter Jacsó in particular has published some highly critical papers in Online Information Review (Jacsó, 2005, 2006a/b) discussing a limited number of Google Scholar failures in great detail.

Although some of his critique is no doubt justified, I was unable to reproduce most of the Google Scholar failures detailed in his papers, suggesting either that they resulted from faulty searches or that Google Scholar has since rectified them. Jacsó's claim that Google Scholar reports higher citation counts for certain disciplines, but not for the Social Sciences and Humanities, is certainly inaccurate, as much larger-scale studies (Bosman et al. 2006; Kousha & Thelwall, 2007) find the opposite result.

Most importantly, the bulk of Jacsó's critique is leveled at inconsistent results for keyword searches, which are not relevant for the author and journal impact searches conducted with Publish or Perish. In addition, the summary metrics in Publish or Perish (e.g. h-index, g-index) are fairly robust and insensitive to occasional errors.
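
To illustrate why such summary metrics are insensitive to occasional errors, the following is a minimal sketch of the two indices, not the Publish or Perish implementation. It assumes the usual definitions (h: the largest number h such that h papers have at least h citations each; g: the largest number g such that the g most cited papers together have at least g squared citations, capped here at the number of papers). The citation counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites < rank:
            break
        h = rank
    return h


def g_index(citations):
    """Largest g such that the g most cited papers together have >= g*g citations."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total < rank * rank:
            break
        g = rank
    return g


# Hypothetical citation counts per paper (not taken from any real record).
clean = [10, 8, 5, 4, 3, 2, 1, 0]
# The same record with one double-counted citation (5 -> 6)
# and one stray duplicate entry carrying a single citation.
noisy = [10, 8, 6, 4, 3, 2, 1, 0, 1]

print(h_index(clean), g_index(clean))  # 4 5
print(h_index(noisy), g_index(noisy))  # 4 5
```

In this example the total citation count changes from 33 to 35, but both indices stay the same, because a single misparsed or duplicated record rarely moves a paper across the rank threshold.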

When using Publish or Perish for citation analyses, I would like to suggest the following general rule of thumb.

  • If an academic shows good citation metrics, it is very likely that he or she has made a significant impact on the field.

However, the reverse is not necessarily true. If an academic shows weak citation metrics, this may be caused by a lack of impact on the field. However, it may also be caused by:

  • working in a small field,
  • publishing in a language other than English (LOTE),
  • or publishing mainly (in) books.

Although Google Scholar performs better than the Web of Science in this respect, it is still not very good at capturing LOTE articles and citations, or citations in books or book chapters. As a result, citation metrics in the Social Sciences and even more so in the Humanities will always be underestimated, as in these disciplines publications in LOTE and books/book chapters are more common than in the Sciences.

Disadvantages of using Thomson ISI Web of Science for citation analyses

The major disadvantage of the Web of Science is that it may provide a substantial underestimation of an individual academic’s actual citation impact.

This is true equally for the “general search” function and for the Web of Science “cited reference” function, the two functions most generally used to perform citation analyses. However, the Web of Science “general search” function performs more poorly in this respect than the “cited reference” function.

For example, the current (August 2007) number of citations to my own work is around 120 with the “general search” function, around 310 with the “cited reference” function and 803 with Google Scholar. My h-index is 7 with the “general search” function, 12 with the “cited reference” function and 15 with Google Scholar.

Differences will not be as dramatic for all scholars, but many academics show a substantially higher number of citations in Google Scholar than in the Web of Science. For instance Nisonger (2004) found that (excluding self-citations) Web of Science captured only 28.8% of his total citations, 42.2% of his print citations, 20.3% of his citations from outside the United States, and a mere 2.3% of his non-English citations. He suggests that librarians and faculty should not rely solely on Web of Science author citation counts, especially when demonstration of international impact is important. Nisonger also summarises several other studies that found Web of Science citation data to be incomplete.

Meho & Yang (2007) conducted a large-scale comparison between Web of Science, Scopus (Elsevier’s alternative to Thomson ISI’s Web of Science) and Google Scholar covering citations of over 1,000 scholarly works of all 15 faculty members of the School of Library and Information Science at Indiana University Bloomington between 1996 and 2005. They found the overlap in citations between the three databases to be rather small. The overlap between Web of Science and Scopus was 58.2%. The overlap between Google Scholar and the union of Web of Science and Scopus was only 30.8%. This small overlap is largely caused by the fact that Google Scholar produced more than twice as many citations as Web of Science and nearly twice as many citations as Scopus. Many of those additional citations came from conference papers, doctoral dissertations, master’s theses and books and book chapters.
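
The overlap figures above can be made concrete with a small sketch. It assumes that overlap is measured as the number of citations found by both sources divided by the number found by either source; that definition, the function names and the citation identifiers are illustrative assumptions, not taken from Meho & Yang's data.

```python
def overlap_share(a, b):
    """Share of the combined citation set that both sources found."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def missed_share(reference, other):
    """Share of the citations in `reference` that `other` did not find."""
    reference, other = set(reference), set(other)
    if not reference:
        return 0.0
    return len(reference - other) / len(reference)


# Hypothetical citing-document identifiers for one scholar.
wos = {"c1", "c2", "c3", "c4"}
scopus = {"c3", "c4", "c5"}
scholar = {"c1", "c4", "c6", "c7", "c8", "c9"}

wos_or_scopus = wos | scopus

print(overlap_share(wos, scopus))                      # 0.4
print(round(overlap_share(scholar, wos_or_scopus), 3)) # 0.222
print(missed_share(wos_or_scopus, scholar))            # 0.6
print(round(missed_share(scholar, wos_or_scopus), 3))  # 0.667
```

The example shows how a source that returns many additional citations lowers the overlap share even when it finds a fair part of what the other sources contain.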

At the same time both sources (Web of Science and Google Scholar) have been shown to rank specific groups of scholars in a relatively similar way. Saad (2006) found that for his subset of 55 scientists in consumer research, the correlation between the two h-indices was 0.82. Please note that this does not invalidate the earlier argument as it simply means most academics’ h-indices are underestimated by a similar magnitude by Web of Science.

Meho & Yang (2007) also found that when Google Scholar results were added to those of Web of Science and Scopus separately its results did not significantly change the ranking of the 15 academics in their survey. The correlation between Google Scholar and Web of Science was 0.874, between Google Scholar and the union of Web of Science and Scopus 0.976.

Meho & Yang (2007) conclude that Google Scholar can help identify a significant number of unique citations. These unique citations might not significantly alter one’s citation ranking in comparison to other academics in the same field and might not all be of the same quality as those found in the Web of Science or Scopus. However, they can be very useful in showing evidence of broader intellectual and international impact than is possible with Web of Science and Scopus. Hence they conclude Google Scholar could be particularly helpful for academics seeking promotion, tenure, faculty positions, research grants, etc.

Why Thomson ISI Web of Science underestimates true citation impact

There are a number of reasons for the underestimation of citation impact by Thomson ISI Web of Science.

Web of Science General Search is limited to ISI-listed journals

In the General Search function Web of Science only includes citations to journal articles published in ISI listed journals (Roediger, 2006). Citations to books, book chapters, dissertations, theses, working papers, reports, conference papers, and journal articles published in non-ISI journals are not included.

Whilst in the Natural Sciences this may give a fairly comprehensive picture of an academic’s total output, in the Social Sciences and Humanities (SSH) only a limited number of journals are ISI listed. Also, in both the Social Sciences and the Humanities books and book chapters are very important publication outlets. Google Scholar includes citations to all academic publications regardless of whether they appeared in ISI-listed journals (Belew, 2005; Meho & Yang, 2007).

Web of Science Cited Reference is limited to citations from ISI-listed journals

In the Cited Reference function Web of Science does include citations to non-ISI publications. However, it only includes citations from journals that are ISI-listed (Meho & Yang, 2007). As indicated before in SSH only a limited number of journals are ISI-listed.

Butler (2006) analysed the distribution of publication output by field for Australian universities between 1999 and 2001. She finds that whereas for the Chemical, Biological, Physical and Medical/Health sciences between 69.3% and 84.6% of the publications are in ISI-listed journals, for Social Sciences such as Management, History, Education and Arts only 4.4%-18.7% of the publications are published in ISI-listed journals. ISI estimates that of the 2000 new journals reviewed annually only 10-12% are selected to be included in the Web of Science (Testa, 2004).

Archambault & Gagné (2004) found that US and UK-based journals are both significantly over-represented in the Web of Science in comparison to Ulrich’s journal database. This overrepresentation was stronger for the Social Sciences and Humanities than for the Natural Sciences. Further, in many areas of engineering, conference proceedings are very important publication outlets. For example, one of the most cited computer scientists (Hector Garcia-Molina) gathers more than 20,000 citations in Google Scholar, with most of his papers being published and cited in conference proceedings. In Web of Science he has a mere 240 citations to his name!

In contrast to the Web of Science, Google Scholar includes citations from all academic publications regardless of where they appeared. As a result, Google Scholar provides a more comprehensive picture of recent impact, especially for the Social Sciences and Humanities where more than five years can elapse between research appearing as a working or conference paper and research being published in a journal.

This also means that Google Scholar usually gives a more accurate picture of impact for junior academics. However, it must be acknowledged that although Google Scholar captures more citations in books and book chapters than the Web of Science (which captures none), it is by no means comprehensive in this respect. Google Book Search may provide a better alternative for book searches.

Web of Science Cited Reference counts citations to non-ISI journals only towards first author

Whilst the Cited Reference function of Web of Science does include citations to non-ISI journals, it only includes these publications for the first author. Hence any publications in non-ISI journals where the academic in question is the second or further author are not included.

Google Scholar includes these publications for all listed authors. For instance, my 2003 publication with Alan Feely in Cross Cultural Management shows no citations in the Web of Science for my name, whilst it shows 16 citations in Google Scholar.

Web of Science has poor aggregation of minor variations of the same title

In the General Search function Web of Science does not include citations to the same work that have small mistakes in their referencing (which especially for books and book chapters occurs very frequently). In the Cited Reference function Web of Science does include these citations, but they are not aggregated with the other citations.

Google Scholar appears to have a better aggregation mechanism than Web of Science. Even though duplicate publications that are referenced in a (slightly) different way still occur, Google Scholar has a grouping function that resolves the worst ambiguities. For instance, my 1996 publication with Geert Hofstede in the research annual Research in the Sociology of Organizations draws 15 Web of Science citations but these are spread over 7 different appearances. Google Scholar shows 29 citations and has only one appearance for the publication.

Belew (2005) confirms that Google Scholar has lower citation noise than Web of Science. In the Web of Science only 60% of the articles were listed as unique entries (i.e. no citation variations), while for Google Scholar this was 85%. None of the articles in his sample had more than five separate listings within Google Scholar, while 13% had five or more entries in the Web of Science.
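
The idea of aggregating citation variants can be sketched as follows. This is an illustration only: it is not Google Scholar's grouping mechanism (which is not documented here), it only merges variants that differ in case, punctuation or spacing, and the titles and counts are hypothetical.

```python
import re
from collections import defaultdict


def normalise(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return cleaned.strip()


def aggregate(records):
    """Sum citation counts over records whose normalised titles match."""
    totals = defaultdict(int)
    for title, cites in records:
        totals[normalise(title)] += cites
    return dict(totals)


# Hypothetical cited-reference entries: (title as cited, citation count).
records = [
    ("Response Rates in Surveys", 6),
    ("Response rates in surveys.", 5),
    ("RESPONSE RATES: IN SURVEYS", 4),
    ("An Unrelated Chapter", 2),
]

print(aggregate(records))
# {'response rates in surveys': 15, 'an unrelated chapter': 2}
```

Variants caused by misspellings, abbreviated titles or wrong years would need fuzzier matching than this sketch provides.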

Web of Science has very limited coverage of non-English sources

The Web of Science includes only a very limited number of journals in languages other than English (LOTE) and hence citations in non-English journals are generally not included in any Web of Science citation analysis. Whilst Google Scholar’s LOTE coverage is far from comprehensive, it does include a larger number of publications in other languages and indexes documents in French, German, Spanish, Italian and Portuguese (Noruzi, 2005).

Meho & Yang (2007) found that 6.94% of Google Scholar citations were from LOTE, while this was true for only 1.14% of Web of Science citations and 0.70% of Scopus citations. Archambault & Gagné (2004) found that Thomson ISI’s journal selection favours English, a situation attributable to ISI’s inability to analyse the content of journals in LOTE.

Disadvantages of using Google Scholar for citation analyses

There are, however, some disadvantages to the use of Google Scholar that are not shared by Thomson ISI Web of Science.

Google Scholar includes some non-scholarly citations

Google Scholar sometimes includes non-scholarly citations such as student handbooks, library guides or editorial notes. However, incidental problems in this regard are unlikely to distort citation metrics, especially robust ones such as the h-index.

An inspection of my own papers shows that in general more than 75% of the citations are in academic journals, with the remainder appearing in books, conference papers, working papers and student theses. Few non-scholarly citations were found. Moreover, I would argue that even a citation in a student handbook, library guide or editorial note shows that the academic has an impact on the field, even if the field is not narrowly defined as the academic’s scholarly colleagues.

In a similar vein, Vaughan and Shaw (2008) argue that 92% of the citations identified by Google Scholar in the field of library and information science represented intellectual impact, primarily citations from journal articles.

Not all scholarly journals are indexed in Google Scholar

Not all scholarly journals are indexed in Google Scholar. Unfortunately, Google Scholar is not very open about its coverage and hence it is unclear what its sources are.

It is generally believed that Elsevier journals are not included (Meho & Yang, 2007), because Elsevier has a competing commercial product in Scopus. However, I was able to find all Elsevier journals I have published in.

On the other hand, Meho & Yang (2007) did find that Google Scholar missed 40.4% of the citations found by the union of Web of Science and Scopus, suggesting that Google Scholar does miss some important refereed citations. It must also be said though that the union of Web of Science and Scopus misses 61.04% of the citations in Google Scholar. Further, Meho & Yang (2007) found that most of the citations uniquely found by Google Scholar are from refereed sources.

Google Scholar coverage might be uneven across different fields of study

Although for reasons discussed above Google Scholar generally provides a higher citation count than ISI, this might not be true for all fields of study.

  • The Social Sciences, Arts and Humanities, and Engineering in particular seem to benefit from Google Scholar's better coverage of (citations in) books, conference proceedings and a wider range of journals.
  • The Natural and Health Sciences are generally well covered in ISI and hence Google Scholar might not provide higher citation counts. In addition, for some disciplines in the Natural and Health Sciences Google Scholar's journal coverage seems to be very patchy. This leads to citation counts in these areas that might actually be much lower than those in ISI.

In a systematic comparison of 64 articles in different disciplines, Bosman et al. (2006) found overall coverage of Google Scholar to be comparable with both Web of Science and Scopus and slightly better for articles published in 2000 than in 1995. However, huge variations were apparent between disciplines, with Chemistry and Physics in particular showing very low Google Scholar coverage and Science and Medicine also showing lower coverage than in Web of Science.

Based on a sample of 1650 articles, Kousha & Thelwall (2007, 2008) found Google Scholar coverage to be less comprehensive than ISI in the three Science disciplines included in their study (Biology, Chemistry and Physics), with Google Scholar showing particularly low coverage for Chemistry. Google Scholar coverage for the four Social Sciences included in their study (Education, Economics, Sociology and Psychology) as well as Computing was significantly higher than ISI coverage. Similarly, Bar-Ilan (2008) finds the number of Google Scholar citations substantially higher than in WoS and Scopus for mathematicians and computer scientists, but lower for high-energy physicists.

More detailed comparisons by academics working in the respective areas would be necessary before we can draw general conclusions. However, as a general rule of thumb, I would suggest that using Google Scholar might be most beneficial for three of the Google Scholar categories: Business, Administration, Finance & Economics; Engineering, Computer Science & Mathematics; Social Sciences, Arts & Humanities. Although broad comparative searches can be done for other disciplines, I would not encourage heavy reliance on Google Scholar for individual academics working in other areas without verifying results with either Scopus or Web of Science.

Google Scholar does not perform as well for older publications

Google Scholar does not perform as well for older publications as these publications and the publications that cite them have not (yet) been posted on the web.

Pauly & Stergiou (2005) found that Google Scholar had less than half of the citations of the Web of Science for a specific set of papers published in a variety of disciplines (mostly in the Sciences) between 1925 and 1989. However, for papers published in the 1990-2004 period both sources gave similar citation counts. The authors expect Google Scholar’s performance to improve for old articles as journals’ back issues are posted on the web.

Meho & Yang (2007) found the majority of the citations from journals and conference papers in Google Scholar to be from after 1993.

Belew (2005) found Google Scholar to be competitive in terms of coverage for references published in the last 20 years, but the Web of Science superior before then. This means that Google Scholar might underestimate the impact of scholars who have mainly published before 1990.

Google Scholar automatic processing creates occasional nonsensical results

Google Scholar’s processing is done automatically without manual cleaning and hence sometimes provides nonsensical results. For instance one of the citations to my Managing the Multinationals book lists as its title “K., 1999”. The author of the citing paper listed my initials with a comma after the first two initials and hence Google Scholar interpreted the third initial and year as the title.

Automatic processing can also result in double counting citations when two or three versions of the same paper are found online. However, incidental mistakes like this are unlikely to have a major impact on citation metrics, especially those as robust as the h-index. Moreover, Google Scholar is committed to fixing mistakes.

Problems shared by Google Scholar and Thomson ISI Web of Science

Names with diacritics or apostrophes are problematic

Both Google Scholar and Thomson ISI Web of Science have problems with academics whose names include either diacritics (e.g. Özbilgin or Olivas-Luján) or apostrophes (e.g. O'Rourke).

  • In Thomson ISI Web of Science a search with diacritics provides an error message and no results.
  • In Google Scholar a search for the name with diacritics will generally not provide any results either.
  • For both databases doing a search without the diacritic will generally provide the best result.

A search for "O'Rourke K*" in Web of Science results in only one citation to the work of the economic historian Kevin H O'Rourke, whereas a search for "ORourke K*" results in more than 350 citations. Google Scholar performs much better. Originally, a search for "K O'Rourke" in Google Scholar provided very few results as Google Scholar treated both K and O as initials and hence searched for KO Rourke. Adding an additional blank space before O'Rourke solved this problem. More recently, however, searches for "K O'Rourke" (without the additional blank space) result in more than 1850 citations.
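
The workaround of searching without the diacritic or apostrophe can be sketched as follows. This is a hypothetical helper for preparing search strings, not a feature of Publish or Perish, Google Scholar or Web of Science. It relies on Unicode decomposition, so letters that have no decomposed form (for example "ø" or "ł") are left unchanged.

```python
import unicodedata


def strip_diacritics(name):
    """Remove combining accents, e.g. 'Özbilgin' -> 'Ozbilgin'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_variants(surname):
    """Return the surname plus variants without diacritics and apostrophes."""
    plain = strip_diacritics(surname)
    no_apostrophe = plain.replace("'", "").replace("\u2019", "")
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys([surname, plain, no_apostrophe]))


print(search_variants("Özbilgin"))      # ['Özbilgin', 'Ozbilgin']
print(search_variants("Olivas-Luján"))  # ['Olivas-Luján', 'Olivas-Lujan']
print(search_variants("O'Rourke"))      # ["O'Rourke", 'ORourke']
```

Running each variant as a separate search and comparing the results is one way to check which spelling a database has actually indexed.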

Names with ligatures are problematic

If an academic's name includes a sequence of characters that is ligatured in traditional typesetting ("fi", "ff", "fl", and others in other languages) and he/she prepares papers with LaTeX (as do most in mathematics and computer science), then Google Scholar does not find the publications.

For example, to find most of the publications of J* Bradfield, you need to search for J* Bradeld (omitting the "fi" ligature created by LaTeX). In Google Scholar "J* Bradfield" only results in some 190 cites for computer scientist Julian Bradfield, whereas "J* Bradeld" results in nearly 400 cites for the same person. It should be mentioned that Web of Science does not find the publications showing up under Bradeld either, as they usually concern books or conference proceedings. "Bradfield J*" results in only about 50 cites for Julian Bradfield in Web of Science, while "Bradeld J*" results in none.
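
A ligature-free search variant such as "Bradeld" can be generated mechanically. The sketch below is a hypothetical helper that assumes the ligature is dropped entirely when the paper is indexed, as in the example above. It covers the standard LaTeX f-ligatures (ffi, ffl, ff, fi, fl) and tries the longer ones first.

```python
import re

# Longer ligatures are listed first so that "ffi" is not treated as "ff" + "i".
LIGATURE_PATTERN = re.compile(r"ffi|ffl|ff|fi|fl")


def ligature_variants(name):
    """Return the name and, if it contains a ligature, the name without it."""
    stripped = LIGATURE_PATTERN.sub("", name)
    if stripped == name:
        return [name]
    return [name, stripped]


print(ligature_variants("Bradfield"))  # ['Bradfield', 'Bradeld']
print(ligature_variants("Harzing"))    # ['Harzing']
```

Citation counts found under both variants would then need to be combined by hand, checking that the results refer to the same person.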

References

  1. Archambault, E.; Gagné, E.V. (2004) The Use of Bibliometrics in Social Sciences and Humanities, Montreal: Social Sciences and Humanities Research Council of Canada (SSHRCC), August 2004.
  2. Bar-Ilan, J. (2008) Which h-index? - A comparison of WoS, Scopus and Google Scholar, Scientometrics, vol. 74, no. 2, pp. 257-271.
  3. Belew, R.K. (2005) Scientific impact quantity and quality: Analysis of two sources of bibliographic data, arXiv:cs.IR/0504036 v1, 11 April 2005.
  4. Bosman, J.; Mourik, I. van; Rasch, M.; Sieverts, E.; Verhoeff, H. (2006) Scopus reviewed and compared. The coverage and functionality of the citation database Scopus, including comparisons with Web of Science and Google Scholar, Utrecht: Utrecht University Library, http://igitur-archive.library.uu.nl/DARLIN/2006-1220-200432/Scopus doorgelicht & vergeleken - translated.pdf.
  5. Butler, L. (2006) RQF Pilot Study Project – History and Political Science Methodology for Citation Analysis, November 2006, accessed from: http://www.chass.org.au/papers/bibliometrics/CHASS_Methodology.pdf, 15 Jan 2007.
  6. Jacsó, P. (2005) Google Scholar: the pros and the cons, Online Information Review, vol. 29, no. 2, pp. 208-214.
  7. Jacsó, P. (2006a) Dubious hit counts and cuckoo's eggs, Online Information Review, vol. 30, no. 2, pp. 188-193.
  8. Jacsó, P. (2006b) Deflated, inflated and phantom citation counts, Online Information Review, vol. 30, no. 3, pp. 297-309.
  9. Kousha, K.; Thelwall, M. (2007) Google Scholar Citations and Google Web/URL Citations: A Multi-Discipline Exploratory Analysis, Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1055-1065.
  10. Kousha, K.; Thelwall, M. (2008) Sources of Google Scholar citations outside the Science Citation Index: A comparison between four science disciplines, Scientometrics, vol. 74, no. 2, pp. 273-294.
  11. Meho, L.I.; Yang, K. (2007) A New Era in Citation and Bibliometric Analyses: Web of Science, Scopus, and Google Scholar, Journal of the American Society for Information Science and Technology, vol. 58, no. 13, pp. 2105–2125.
  12. Nisonger, T.E. (2004) Citation autobiography: An investigation of ISI database coverage in determining author citedness, College & Research Libraries, vol. 65, no. 2, pp. 152-163.
  13. Noruzi, A. (2005) Google Scholar: The New Generation of Citation Indexes, LIBRI, vol. 55, no. 4, pp. 170-180.
  14. Pauly, D.; Stergiou, K.I. (2005) Equivalence of results from two citation analyses: Thomson ISI’s Citation Index and Google Scholar’s service, Ethics in Science and Environmental Politics, December, pp. 33-35.
  15. Roediger III, H.L. (2006) The h index in Science: A New Measure of Scholarly Contribution, APS Observer: The Academic Observer, vol. 19, no. 4.
  16. Saad, G. (2006) Exploring the h-index at the author and journal levels using bibliometric data of productive consumer scholars and business-related journals respectively, Scientometrics, vol. 69, no. 1, pp. 117-120.
  17. Smith, A.G. (2008) Benchmarking Google Scholar with the New Zealand PBRF research assessment exercise, Scientometrics, vol. 74, no. 2, pp. 309-316.
  18. Testa, J. (2004) The Thomson Scientific Journal Selection Process, http://scientific.thomson.com/free/essays/selectionofmaterial/journalselection/, accessed 15 Jan 2007.
  19. Vaughan, L.; Shaw, D. (2008) A new look at evidence of scholarly citations in citation indexes and from web sources, Scientometrics, vol. 74, no. 2, pp. 317-330.
