
Computer Sleuth: Identification by Text Analysis


Abstract

Here's a project where you can try your hand at being a detective with your computer. In this project you'll write a program to do some basic analysis of features of written text (for example, counting the length of each word in the text, or the number of words in each sentence). Then you'll see if you can use the information from your text analysis program to find measurements that can distinguish one author from another. After analyzing known samples of several authors' writings, can your method match up unidentified writing samples with their correct authors?

Summary

Areas of Science
Difficulty
 
Time Required
Long (2-4 weeks)
Prerequisites
An understanding of the material covered in "Paragraph Stats: Writing a JavaScript Program to 'Measure' Text"
Material Availability
Readily available
Cost
Very Low (under $20)
Safety
No issues
Credits
Andrew Olson, Ph.D., Science Buddies

Objective

The goal of this project is to write a computer program to make some simple measurements on a block of text, and then to see if this information can be used to identify the author of the text.

Introduction

Your English teacher has probably told you that every author has an individual writing style—their own unique 'voice' on the page. Is it possible to find ways to identify that voice through computer analysis of written text?

A familiar case from history argues that it is indeed possible. When our forefathers, newly independent from Great Britain, were debating whether to do away with the Articles of Confederation and adopt the new Constitution written by a convention in Philadelphia, a series of essays was written to argue in favor of adopting the new government. These essays, now called The Federalist Papers, were signed "Publius," but are now attributed to Alexander Hamilton, James Madison, and John Jay. The authorship of 12 of the essays was claimed by both Hamilton and Madison, and researchers have used statistical analysis of writing style to investigate who really wrote them. As Julie Rehmeyer writes in a Science News article (Rehmeyer, 2007): "Altogether, researchers have considered more than 1,000 features of writing style. Nearly all the analyses have vindicated Madison."

Relax, you won't need to analyze 1,000 different features for your science fair project. The Science Buddies project, Paragraph Stats: Writing a JavaScript Program to 'Measure' Text, shows you how to write a simple program to measure:
  1. the number of sentences contained in the text,
  2. the number of words in each sentence,
  3. the number of letters in each word,
  4. the average number of words per sentence, and
  5. the average word length.
With some simple modifications to the program, you can count the frequency of each word length and each sentence length in the text. Is this enough information to identify authorship? Try it and find out!
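
As a rough illustration of the five measurements listed above, plus the length-frequency counts, here is a minimal JavaScript sketch. It is not the Paragraph Stats program itself. The function names (`splitSentences`, `splitWords`, `countFrequencies`, `measureText`) and the simple rules used to find sentence and word boundaries are assumptions made for this example.

```javascript
// Assumption: a sentence ends with '.', '!' or '?'.
// Abbreviations such as "Dr." will be miscounted by this simple rule.
function splitSentences(text) {
  return text
    .split(/[.!?]+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Assumption: a word is a run of letters that may contain
// internal apostrophes (e.g. "didn't").
function splitWords(sentence) {
  return sentence.match(/[A-Za-z]+(?:'[A-Za-z]+)*/g) || [];
}

// Count how many times each value occurs, e.g. [3, 3, 2] gives {2: 1, 3: 2}.
function countFrequencies(values) {
  const counts = {};
  values.forEach((v) => {
    counts[v] = (counts[v] || 0) + 1;
  });
  return counts;
}

function measureText(text) {
  const wordsPerSentence = [];
  const lettersPerWord = [];
  splitSentences(text).forEach((sentence) => {
    const words = splitWords(sentence);
    if (words.length === 0) return; // skip fragments that contain no words
    wordsPerSentence.push(words.length);
    words.forEach((word) => {
      // Count letters only, not apostrophes.
      lettersPerWord.push(word.replace(/'/g, "").length);
    });
  });
  const sentenceCount = wordsPerSentence.length;
  const totalWords = lettersPerWord.length;
  const totalLetters = lettersPerWord.reduce((a, b) => a + b, 0);
  return {
    sentenceCount: sentenceCount,
    wordsPerSentence: wordsPerSentence,
    lettersPerWord: lettersPerWord,
    averageWordsPerSentence: sentenceCount ? totalWords / sentenceCount : 0,
    averageWordLength: totalWords ? totalLetters / totalWords : 0,
    wordLengthFrequency: countFrequencies(lettersPerWord),
    sentenceLengthFrequency: countFrequencies(wordsPerSentence)
  };
}

const stats = measureText("The cat sat. It didn't move!");
console.log(stats.sentenceCount);       // 2
console.log(stats.wordLengthFrequency); // counts of 2-, 3-, 4- and 5-letter words
```

The two frequency objects are the raw material for the graphs described in the Experimental Procedure; the boundary rules are deliberately simple and are a natural place to make your own improvements.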

Terms and Concepts

To do this project, you should do research that enables you to understand the following terms and concepts:

Questions

Bibliography

  • For two articles on using text to identify authors see:
    • Klarreich, E. (2003). "Bookish Math: Statistical Tests Are Unraveling Knotty Literary Mysteries," Science News 164 (December 20): 392. Retrieved February 16, 2007.
    • Rehmeyer, J. (2007). "Digital Fingerprints: Tiny Behavioral Differences Can Reveal Your Identity Online," Science News 171 (January 13): 26–28. Retrieved February 14, 2007.
  • You can find a step-by-step JavaScript tutorial at the link below.
    Webteacher Software. (2007). JavaScript Tutorial for the Total Non-Programmer, Webteacher.com. Retrieved February 14, 2007.
  • For information on HTML FORMS, here is the official reference:
    Raggett, D., A. Le Hors, I. Jacobs, (eds.) (1999). HTML 4.01 Specification: 17. Forms, World Wide Web Consortium (W3C). Retrieved February 14, 2007.
  • This is a list of reserved words in JavaScript (you cannot use these words for function or variable names in your program, because they are reserved for the programming language itself):
    JavaScript Kit. (2004). Reserved Words, JavaScriptKit.com. Retrieved February 14, 2007.
  • If you get interested and start doing a lot of programming, you may want to try using a text editor that is a little more sophisticated than Notepad. An editor designed for programming can help with formatting, so that your code is more readable, but still produce plain text files. This type of editor can also do "syntax highlighting" (e.g., automatic color-coding of HTML) which can help you to find errors. Here is a free programmer's editor that you can try.
  • This 19th century article used a plot of word length vs. frequency to distinguish texts by different authors:
    Mendenhall, T.C. (1887). "The Characteristic Curves of Composition," Science 9 (11 March): 237–246.
  • For an online version of The Federalist Papers, see:
    Whitten, C. (2004). "The Federalist Papers," Founding Fathers Info. Retrieved February 14, 2007.

Materials and Equipment

To do this experiment you will need the following materials and equipment:

Experimental Procedure

  1. Write the program to analyze text.
    1. For help on writing the JavaScript program to analyze blocks of text, see the Science Buddies project Paragraph Stats: Writing a JavaScript Program to 'Measure' Text.
    2. You may decide that you want to improve the program so that you can make additional measurements. The Variations section has some suggestions for additional measurements, and you will probably come up with others on your own.
  2. Choose three or more authors and select representative samples of text by each (it's best to use at least 1,000 words).
  3. Analyze each text sample with your program.
  4. Experiment with methods of graphing the results to create your own 'writeprint' (Rehmeyer, 2007) for each author.
    1. So that you can make fair comparisons between samples, all of your graphs should share the same scales (i.e., the x- and y-axes should cover the same range on every graph). Think carefully when you design your 'writeprint' and make sure that your x- and y-axes can accommodate the full range of possible measurements.
    2. The key is to identify measurements that consistently reveal a difference between authors.
    3. For starters, you may want to try plotting the word length vs. frequency for each author (Mendenhall, 1887).
  5. Have your helper select additional paragraphs from each author. Your helper should also run the analysis on each additional sample, and give you the results, without identifying the authors. Can you determine the author of each unknown sample?
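
One way to make steps 4 and 5 concrete is to convert raw word-length counts into relative frequencies, so that samples of different sizes can be compared on the same scale, and then check which known author's profile lies closest to the unknown sample. The sketch below is only one possible approach, not a method prescribed by this project. The function names, the maximum word length of 15, and the choice of summed absolute difference as the distance measure are all assumptions for illustration, and the tiny text samples are made up.

```javascript
// Assumption: words longer than MAX_LENGTH letters are grouped into the
// last bin, so every profile covers the same x-axis range.
const MAX_LENGTH = 15;

// Build a word-length profile: profile[i] is the fraction of words
// in the text that have i + 1 letters.
function wordLengthProfile(text) {
  const words = text.match(/[A-Za-z]+/g) || [];
  const profile = new Array(MAX_LENGTH).fill(0);
  words.forEach((word) => {
    const bin = Math.min(word.length, MAX_LENGTH) - 1;
    profile[bin] += 1;
  });
  return profile.map((count) => (words.length ? count / words.length : 0));
}

// Distance between two profiles: the sum of absolute differences.
// 0 means the profiles are identical; larger means less alike.
function profileDistance(a, b) {
  return a.reduce((sum, value, i) => sum + Math.abs(value - b[i]), 0);
}

// Return the name of the known author whose profile is closest
// to the profile of the unknown sample.
function closestAuthor(unknownText, knownSamples) {
  const unknown = wordLengthProfile(unknownText);
  let bestName = null;
  let bestDistance = Infinity;
  Object.keys(knownSamples).forEach((name) => {
    const d = profileDistance(unknown, wordLengthProfile(knownSamples[name]));
    if (d < bestDistance) {
      bestDistance = d;
      bestName = name;
    }
  });
  return bestName;
}

// Made-up samples, far too short for a real test.
const known = {
  "Author A": "It is a cat. He is in a hat. We go up to it.",
  "Author B": "Considerable deliberation preceded constitutional ratification everywhere."
};
console.log(closestAuthor("I am on a mat so he is up", known)); // Author A
```

A single nearest match says nothing about how confident the identification is. Comparing the distances for all authors, and repeating the test with several unknown samples, gives a better sense of whether a measurement consistently separates the authors.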


Variations

  • Here are some ideas for functions that you might want to add to your text measurement program:
    • count the frequency of different sentence lengths,
    • count the frequency of function words, such as prepositions (e.g., of, from, in) and conjunctions (e.g., and, but, or).
  • Let's say that one of your authors was J.K. Rowling, and all of your text samples came from the first Harry Potter book (Harry Potter and the Sorcerer's Stone). What happens if you use a text sample from a later book in the series, like Harry Potter and the Order of the Phoenix? Do your measurements still point to J.K. Rowling as the author?
  • How much text do you need to get an accurate 'writeprint' for an author? Design an experiment to find out.
  • Super-advanced students could explore representing the text analysis data as multidimensional vectors and using principal components analysis to differentiate between authors.
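
For the function-word idea above, one possible sketch counts how often each word from a short list appears per 1,000 words of text. The list below contains only the examples named in this section. The function name `functionWordRates` and the per-1,000-words scaling are assumptions for illustration, and a real analysis would likely need a longer word list.

```javascript
// Function words taken from the examples in this section
// (prepositions and conjunctions).
const FUNCTION_WORDS = ["of", "from", "in", "and", "but", "or"];

// Return the number of occurrences of each function word
// per 1,000 words of text.
function functionWordRates(text) {
  // Lowercase first so that "And" and "and" are counted together.
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  const counts = {};
  FUNCTION_WORDS.forEach((w) => {
    counts[w] = 0;
  });
  words.forEach((w) => {
    if (Object.prototype.hasOwnProperty.call(counts, w)) {
      counts[w] += 1;
    }
  });
  const rates = {};
  FUNCTION_WORDS.forEach((w) => {
    rates[w] = words.length ? (counts[w] * 1000) / words.length : 0;
  });
  return rates;
}

const rates = functionWordRates("The king of the hill and the queen of the valley");
console.log(rates);

// The rates in a fixed order form a list of numbers describing the sample.
const featureVector = FUNCTION_WORDS.map((w) => rates[w]);
console.log(featureVector);
```

Scaling to a rate per 1,000 words keeps samples of different lengths comparable. A list of numbers like `featureVector` is also one simple form of the multidimensional vector mentioned in the last variation.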

Careers

If you like this project, you might enjoy exploring these related careers:

Career Profile
Are you interested in developing cool video game software for computers? Would you like to learn how to make software run faster and more reliably on different kinds of computers and operating systems? Do you like to apply your computer science skills to solve problems? If so, then you might be interested in the career of a computer software engineer.
Career Profile
Have you ever seen a story on the news about how a company or government agency was "hacked" and people's personal information, like names, addresses, or credit card numbers, was stolen? It is an information security analyst's job to prevent that from happening. Organizations hire information security analysts to analyze possible threats against their computer systems, which can range from malicious hackers trying to steal data to careless employees who accidentally forget to log out of a…
Career Profile
Have you ever tried to read a scientific or technical article in a professional journal? They can be hard to decipher because they are full of technical terminology. But have you ever read a science article in a magazine that was geared for your age or for the general public? These tend to be a lot easier to read and more interesting because they have been written by a science writer. A science writer can take a complex subject and write a concise article in language that is easy for…


Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Science Buddies Staff. "Computer Sleuth: Identification by Text Analysis." Science Buddies, 20 Nov. 2020, https://www.sciencebuddies.org/science-fair-projects/project-ideas/CompSci_p022/computer-science/computer-sleuth-identification-by-text-analysis. Accessed 19 Mar. 2024.

APA Style

Science Buddies Staff. (2020, November 20). Computer Sleuth: Identification by Text Analysis. Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/CompSci_p022/computer-science/computer-sleuth-identification-by-text-analysis


Last edit date: 2020-11-20