Computer Sleuth: Identification by Text Analysis
Time Required: Long (2-4 weeks)
Material Availability: Readily available
Cost: Very Low (under $20)
Abstract
Here's a project where you can try your hand at being a detective with your computer. In this project you'll write a program to do some basic analysis of features of written text (for example, counting the length of each word in the text, or the number of words in each sentence). Then you'll see if you can use the information from your text analysis program to find measurements that can distinguish one author from another. After analyzing known samples of several authors' writings, can your method match up unidentified writing samples with their correct authors?
Objective
The goal of this project is to write a computer program to make some simple measurements on a block of text, and then to see if this information can be used to identify the author of the text.
Last edit date: 2017-07-28
Your English teacher has probably told you that every author has an individual writing style, their own unique 'voice' on the page. Is it possible to find ways to identify that voice through computer analysis of written text? A good place to start is with simple measurements that a program can make on a block of text, such as:
- the number of sentences contained in the text,
- the number of words in each sentence,
- the number of letters in each word,
- the average number of words per sentence, and
- the average word length.
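The measurements listed above can be sketched in a few lines of code. The example below is only a minimal illustration written in Python, which the project does not prescribe; the same logic can be ported to whatever language you choose. The function names (`split_sentences`, `split_words`, `analyze_text`) are made up for this sketch, and the rules for what counts as a sentence or a word are simplifying assumptions that you may want to refine.

```python
import re

# Assumption: a word is a run of unaccented English letters, optionally
# joined by apostrophes (so "didn't" counts as one word).
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def split_sentences(text):
    """Split text into sentences.

    Assumption: a sentence ends at '.', '!' or '?'. Abbreviations such as
    "Dr." will be miscounted by this simple rule.
    """
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def split_words(sentence):
    """Return the words in one sentence, using WORD_RE."""
    return WORD_RE.findall(sentence)


def analyze_text(text):
    """Return the basic measurements for a block of text as a dictionary."""
    words_by_sentence = [split_words(s) for s in split_sentences(text)]
    # Ignore fragments that contain no words at all (e.g., only digits).
    words_by_sentence = [words for words in words_by_sentence if words]

    words_per_sentence = [len(words) for words in words_by_sentence]
    # Count letters only, so the apostrophe in "didn't" is not counted.
    letters_per_word = [
        sum(1 for ch in word if ch.isalpha())
        for words in words_by_sentence
        for word in words
    ]

    sentence_count = len(words_per_sentence)
    word_count = len(letters_per_word)
    return {
        "sentence_count": sentence_count,
        "words_per_sentence": words_per_sentence,
        "letters_per_word": letters_per_word,
        "avg_words_per_sentence": word_count / sentence_count if sentence_count else 0.0,
        "avg_word_length": sum(letters_per_word) / word_count if word_count else 0.0,
    }
```

For example, `analyze_text("The cat sat. It didn't move! Why not?")` reports 3 sentences with 3, 3, and 2 words, and an average word length of 3.25 letters.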
Terms and Concepts
To do this project, you should do research that enables you to understand the following terms and concepts:
- frequency histogram.
- How would you calculate the frequency of five-letter words in a given block of text?
- For two articles on using text to identify authors see:
- Klarreich, E., 2003. "Bookish Math: Statistical Tests Are Unraveling Knotty Literary Mysteries," Science News 164 (December 20): 392, available online at [accessed February 16, 2007] http://web.mit.edu/allanmc/www/stylometrics.pdf.
- Rehmeyer, J., 2007. "Digital Fingerprints: Tiny Behavioral Differences Can Reveal Your Identity Online," Science News 171 (January 13): 26–28, available online at [accessed February 14, 2007] https://ai.arizona.edu/sites/ai/files/MIS596/sciencenews.pdf.
- For information on HTML FORMS, here is the official reference:
Raggett, D., A. Le Hors, I. Jacobs (eds.), 1999. "HTML 4.01 Specification: 17. Forms," World Wide Web Consortium (W3C) [accessed February 14, 2007] http://www.w3.org/TR/REC-html40/interact/forms.html.
- If you get interested and start doing a lot of programming, you may want to try using a text editor that is a little more sophisticated than Notepad. An editor designed for programming can help with formatting, so that your code is more readable, but still produce plain text files. This type of editor can also do "syntax highlighting" (e.g., automatic color-coding of HTML) which can help you to find errors. Here is a free programmer's editor that you can try:
- This 19th-century article used a plot of word length vs. frequency to distinguish texts by different authors:
Mendenhall, T.C., 1887. "The Characteristic Curves of Composition," Science 9 (11 March): 237–246.
- For an online version of The Federalist Papers, see:
Whitten, C., 2004. "The Federalist Papers," Founding Fathers Info [accessed February 14, 2007] http://www.foundingfathers.info/federalistpapers/.
Materials and Equipment
To do this experiment you will need the following materials and equipment:
- computer with web browser (e.g., Internet Explorer, Firefox),
- text editing program (e.g., Notepad),
- several samples of text by each of three (or more) authors, for example:
- sample paragraphs from books by different authors,
- e-mail or instant messages from friends.
- spreadsheet program (e.g., Excel or QuattroPro),
- graph paper or graphing software,
- a helper.
- Write the program to analyze text.
- You may decide that you want to improve the program so that you can make additional measurements. The Variations section has some suggestions for additional measurements, and you will probably come up with others on your own.
- Choose three or more authors and select representative samples of text by each (it's best to use at least 1000 words).
- Analyze each text sample with your program.
- Experiment with methods of graphing the results to create your own 'writeprint' (Rehmeyer, 2007) for each author.
- So that you can make fair comparisons between samples, all of your graphs should share the same scales (i.e., the x- and y-axes should cover the same range on every graph). Think carefully when you design your 'writeprint', and make sure that your axes can accommodate the full range of possible measurements.
- The key is to identify measurements that consistently reveal a difference between authors.
- For starters, you may want to try plotting the word length vs. frequency for each author (Mendenhall, 1887).
- Have your helper select additional paragraphs from each author. Your helper should also run the analysis on each additional sample, and give you the results, without identifying the authors. Can you determine the author of each unknown sample?
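One way to turn the word length vs. frequency plot into something you can compare numerically is sketched below in Python. This is an illustration under stated assumptions, not the method used by Mendenhall or Rehmeyer: the function names, the 15-letter cutoff (`MAX_LEN`), and the choice of "sum of absolute differences" as the comparison are all made up for this example. You should test whether a measure like this really separates your authors before relying on it.

```python
import re
from collections import Counter

# Assumption: same simple definition of a word as a run of English letters,
# optionally joined by apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")
MAX_LEN = 15  # assumption: words longer than this all share the last bin


def word_length_profile(text, max_len=MAX_LEN):
    """Return the fraction of words having 1, 2, ..., max_len letters."""
    lengths = [
        min(sum(1 for ch in word if ch.isalpha()), max_len)
        for word in WORD_RE.findall(text)
    ]
    counts = Counter(lengths)
    total = len(lengths)
    return [counts[n] / total if total else 0.0 for n in range(1, max_len + 1)]


def profile_distance(p, q):
    """Sum of absolute differences between two profiles (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_author(unknown_text, known_samples):
    """Return the author label whose known sample is nearest to the unknown text.

    known_samples is a dictionary mapping an author label to a text sample.
    """
    unknown = word_length_profile(unknown_text)
    return min(
        known_samples,
        key=lambda author: profile_distance(
            unknown, word_length_profile(known_samples[author])
        ),
    )
```

Because the profile stores fractions rather than raw counts, samples of different lengths can be compared on the same scale. The frequency of five-letter words, for instance, is `word_length_profile(text)[4]`: the number of five-letter words divided by the total number of words.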
If you like this project, you might enjoy exploring these related careers:
Computer Programmer
Computers are essential tools in the modern world, handling everything from traffic control, car welding, movie animation, shipping, aircraft design, and social networking to book publishing, business management, music mixing, health care, agriculture, and online shopping. Computer programmers are the people who write the instructions that tell computers what to do.
Software Quality Assurance Engineer & Tester
Software quality assurance engineers and testers oversee the quality of a piece of software's development over its entire life cycle. Their goal is to see to it that the final product meets the customer's requirements and expectations in both performance and value. During the software life cycle, they verify (officially state) that it is possible for the software to accomplish certain tasks. They detect problems that exist in the process of developing the software, or in the product itself. They try and make things not work (try to "break" the software) by creating errors or combinations of errors that a user might make. For example, if a user enters a period or a pound sign for a password, will that break the software? They seek to anticipate potential issues with the software before they become visible. At the end of the life cycle, they reflect upon how problems or bugs arose, and figure out ways to make the software development process better in the future.
Computer Hardware Engineer
Whether you are playing video games, surfing the Internet, or writing a term paper, computers are an integral part of our daily lives. Computer hardware engineers work to make computers faster, more robust, and more cost-effective. They design the microprocessor chips that make your computer function, along with the equipment that makes computing easy and fun to do.
- Here are some ideas for functions that you might want to add to your text measurement program:
- count the frequency of different sentence lengths,
- count the frequency of function words, such as prepositions (e.g., of, from, in) and conjunctions (e.g., and, but, or).
- Let's say that one of your authors was J.K. Rowling, and all of your text samples came from the first Harry Potter book (Harry Potter and the Sorcerer's Stone). What happens if you use a text sample from a later book in the series, like Harry Potter and the Order of the Phoenix? Do your measurements still point to J.K. Rowling as the author?
- How much text do you need to get an accurate 'writeprint' for an author? Design an experiment to find out.
- Super-advanced students could explore representing the text analysis data as multidimensional vectors and using principal components analysis to differentiate between authors.
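The first two ideas in the list above (sentence-length frequencies and function-word frequencies) can be sketched as follows. This is a minimal Python illustration, not a prescribed implementation: the function names are invented for the example, and `FUNCTION_WORDS` contains only the six example words mentioned above, so a real study would likely need a longer list.

```python
import re
from collections import Counter

# Assumption: a word is a run of English letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

# Assumption: a tiny illustrative list taken from the examples in the text.
FUNCTION_WORDS = ("of", "from", "in", "and", "but", "or")


def function_word_rates(text, function_words=FUNCTION_WORDS):
    """Return occurrences of each function word per 1000 words of text."""
    words = [word.lower() for word in WORD_RE.findall(text)]
    counts = Counter(words)
    total = len(words)
    return {
        fw: (1000 * counts[fw] / total if total else 0.0)
        for fw in function_words
    }


def sentence_length_counts(text):
    """Map each sentence length (in words) to the number of such sentences.

    Assumption: sentences end at '.', '!' or '?'.
    """
    lengths = []
    for sentence in re.split(r"[.!?]+", text):
        n_words = len(WORD_RE.findall(sentence))
        if n_words:  # skip empty fragments left over from the split
            lengths.append(n_words)
    return Counter(lengths)
```

Reporting function words as a rate per 1000 words, rather than as a raw count, keeps samples of different lengths comparable. The rates for several function words, taken together, also form the kind of list of numbers (a vector) that the last variation suggests exploring.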