Data Analysis for Advanced Science Projects
Sandra Slutz, PhD, Staff Scientist, Science Buddies
Kenneth L. Hess, Founder and President, Science Buddies
Whether your goal is to present your findings to the public or publish your research in a scientific journal, it is imperative that data from advanced science projects be rigorously analyzed. Without careful data analysis to back up your conclusions, the results of your scientific research won't be taken seriously by other scientists. The sections below discuss techniques, tips, and resources for thorough scientific data analysis. Although this guide will mention various data-analysis principles and statistical tests, it is not meant to be an exhaustive textbook. Instead, you're encouraged to use this guide as a means of familiarizing yourself with the general principles of data analysis. Once you're familiar with the concepts, we encourage you to continue your exploration of the topics most relevant to your science project using the references listed in the Bibliography, as well as personal resources, such as your mentor and other science and math professionals, including your teachers. We also encourage you to read our two accompanying articles, Experimental Design for Advanced Science Projects and Increasing the Ability of an Experiment to Measure an Effect. When used collectively, the information in these three articles will put you on the path towards a well-thought-out, top-quality research project.
Common Incorrect Assumptions About Data Analysis
Three common mistakes among young scientists are assuming that:
- Data analysis occurs only after you are done collecting all your data.
- Data analysis is quick—you pick your analysis methods, apply them in a "plug-in" fashion, and then you are done.
- Data can stand alone without additional context.
All three assumptions are wrong. Data analysis is an ongoing process in a research project. Planning what kinds of analyses you're going to perform with your data is a critical part of designing your experiments. If you skip this step, you might find yourself with insufficient data to draw a meaningful conclusion. For more details on how successful data analysis and good experimental design depend on each other, see the Science Buddies guide to Experimental Design for Advanced Science Projects.
Once you have designed your experiments and are carrying them out, it can be wise to do some data analysis, even while you are collecting your data, to ensure that the observations are within expected parameters. For example, you might calculate the yield of a DNA extraction in the midst of an experiment to make sure the procedure worked well before moving on to the next step. This kind of analysis prevents you from wasting valuable experimental time if something is wrong with your experimental procedure, and can eliminate confusion later over aberrant data. Data should also be analyzed between independent replicates in case the trends or observations from one experimental repeat offer insights on how to better design additional repeats.
Although it might be tempting to quickly plug your data into a spreadsheet, create a graph, print out the basic corresponding statistics, and celebrate your project as "finished," this methodology might lead you to miss relevant information. Instead, you should plan to spend a good chunk of time "playing" with your data. The more variables you test, the longer this "playing" takes. By looking at the data from various perspectives, trying different ways of organizing the data and representing it visually and mathematically, you might stumble upon connections or trends of which you were unaware when starting the project.
Lastly, it is important not to let your data stand alone, but to put it into context. Simply put, expressing your data relative to other data is much more enlightening. For example, the data in a study on the height of Japanese male professional basketball players might show that the average player height is 6 feet 5 inches (77 inches). This data becomes more informative if you compare it to the average height of Japanese males, 5 feet 7 inches (67 inches), thus allowing you to conclude that in Japan, professional basketball players are, on average, about 15 percent taller than the average male. Similarly, if your research is a replication of previous work or a methodological improvement on a process, it is critical to analyze your data in direct comparison with the previously published data.
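The height comparison above is simple arithmetic, and writing it out makes it easy to check. The following is a minimal sketch in Python. The function names `to_inches` and `percent_difference` are illustrative choices, not part of any library, and the heights are the ones quoted in the paragraph above.

```python
def to_inches(feet, inches):
    """Convert a height given in feet and inches to inches only."""
    return feet * 12 + inches


def percent_difference(value, reference):
    """Return how much larger `value` is than `reference`, in percent."""
    return (value - reference) / reference * 100


player_avg = to_inches(6, 5)      # 77 inches
population_avg = to_inches(5, 7)  # 67 inches

diff = percent_difference(player_avg, population_avg)
print(f"Players are {diff:.1f}% taller than the average male")  # 14.9%
```

Dividing by the reference value (here, the general population) is what turns two separate measurements into a relative comparison.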
Determining Standards in Your Field for Data Analysis
Every field has standards and norms for how to analyze data. Researchers, and others in the field who are reviewing your research, will expect you to be aware of and to emulate those standards where appropriate. That isn't to say they disapprove of new innovations or techniques—just be sure you're able to explain the advantages of your analytical methods over methods that are traditional to the field.
How do you determine what the standard analytical techniques are in your field? The best way is to take a careful look at a wide range of papers in your field. Pay special attention to papers that collect the same types of data as you do. Make note of things like:
- How they organize their data,
- What types of trends they are seeing and how they are detecting those trends,
- Which statistical tests they use to evaluate the data, and
- What p values and/or confidence intervals are considered acceptable.
Once you're familiar with the types of analyses common to your field, you can pick and choose the ones that make the most sense in the context of your research project.
Three Different Ways to Examine Data
Generally speaking, scientific data analysis involves one or more of the following three tasks:
- Generating tables,
- Converting data into graphs or other visual displays, and/or
- Using statistical tests.
Tables are used to organize data in one place. Relevant column and row headings facilitate finding information quickly. One of the greatest advantages of tables is that when data is organized, it can be easier to spot trends and anomalies. Another advantage is their versatility. Tables can be used to encapsulate either quantitative or qualitative data, or even a combination of the two. Data can be displayed in its raw form, or organized into data summaries with corresponding statistics.
Graphs are a visual means of representing data. They allow complex data to be represented in a way that makes trends easier to spot by eye. There are many different types of graphs, the most common of which can be reviewed in this basic guide to graphs: Data Analysis & Graphs.
You might think of graphs as the primary way to present your data to others; although graphs are excellent ways of doing that (see the Science Buddies guide about Data Presentation Tips for Advanced Science Competitions for more details), they're also a good analytical mechanism. The process of manipulating the data into different visual forms often draws your attention to different aspects of the data and expands your thinking about it. In the process, you may stumble upon a pattern or trend that suggests something new about your science project that you hadn't thought of before. Seeing your data in different graphical formats might highlight new conclusions, new questions, or the need to go and gather additional data. It can also help you to identify outliers, which are data points that appear to be inconsistent with the other data points. Outliers can result from experimental error (such as a malfunctioning measurement tool), data-entry errors, or rare events that actually happened (like a 70°F day in January in Montana) but don't reflect what is normal. When statistically analyzing your data, it is important to identify outliers and deal with them (see the Bibliography, below, for articles discussing how to deal with outliers) so that they don't disproportionately affect your conclusions. Identifying outliers also allows you to go back and assess whether they reflect rare events and whether such events are informative to your overall scientific conclusions.
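Outliers can also be flagged numerically. One widely used rule of thumb is Tukey's interquartile-range (IQR) rule, which flags any point more than 1.5 IQRs below the first quartile or above the third quartile. This guide does not prescribe a particular method, so the sketch below is only one option: the function name `find_outliers` is illustrative, the multiplier 1.5 is a convention rather than a law, and the temperature readings are made up to echo the Montana example.

```python
import statistics


def find_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical January high temperatures in degrees Fahrenheit
temps_f = [28, 31, 25, 33, 30, 27, 70, 29, 32, 26]

outliers = find_outliers(temps_f)
print(outliers)  # [70]
```

A flagged point is a candidate for closer inspection, not an automatic deletion. Whether it reflects an error or a real but rare event is still a judgment you have to make and justify.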
If you are unsure of what kinds of graphs might best encapsulate your data, go back to published scientific articles with similar types of data. Observe how the authors graph and represent their data. Try analyzing your data using the same methods.
Statistics are the third general way of examining data. Often, statistical tests are used in some combination with tables and/or graphs. There are two broad categories of statistics: descriptive statistics and inferential statistics. Descriptive statistics are used to summarize the data and include things like average, range, standard deviation, and frequency. For a review of several basic descriptive statistical calculations, consult the general guides to Summarizing Your Data and evaluating Variance & Standard Deviation. Inferential statistics rely on samples (the data you collect) to make inferences about a population. They're used to determine whether it is possible to draw general conclusions about a population, or to make predictions about the future, based on your experimental data. Inferential statistics cover a wide variety of statistical concepts, such as hypothesis testing, correlation, estimation, and modeling.
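The descriptive statistics named above can be computed in a few lines. The sketch below uses Python's standard `statistics` module and `collections.Counter`. The function name `describe` and the germination counts are hypothetical, chosen only for illustration.

```python
import statistics
from collections import Counter


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        # Sample standard deviation (divides by n - 1)
        "stdev": statistics.stdev(values),
        # How many times each distinct value occurs
        "frequency": Counter(values),
    }


# Hypothetical data: seeds germinated in each of eight petri dishes
germinated = [8, 9, 7, 9, 10, 9, 8, 6]

summary = describe(germinated)
print(f"mean = {summary['mean']}")                      # mean = 8.25
print(f"range = {summary['range']}")                    # range = 4
print(f"standard deviation = {summary['stdev']:.2f}")   # 1.28
print(summary["frequency"].most_common(1))              # [(9, 3)]
```

Note that `statistics.stdev` treats the data as a sample drawn from a larger population. If your data covers the entire population, `statistics.pstdev` is the corresponding function.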
Beyond the basic descriptive statistics like mean, median, and mode, you might not have had much exposure to statistics. So how do you know what statistical tests to apply to your data? A good starting place is to refer back to published scientific articles in your field. The "Methods" sections of papers with similar types of data sets will discuss the statistical tests the authors used. Other tests might be referred to within data tables or figures. Try evaluating your data using similar tests. You might also find it useful to consult statistics textbooks, math teachers, your science project mentor, and other science or engineering professionals. The Bibliography, below, also contains a list of resources for learning more about statistics and their applications.
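To make the idea of hypothesis testing more concrete, here is a minimal sketch of one simple inferential test, an exact two-sample permutation test on the difference between group means. It is offered only as an illustration and is not necessarily the standard test in your field. The function name `permutation_p_value` and the measurements are hypothetical, and the exhaustive approach is practical only for small samples.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of means.

    Returns the fraction of all ways of splitting the pooled data into
    groups of the original sizes whose absolute mean difference is at
    least as large as the one actually observed.
    """
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical plant heights (cm) for control and treated groups
control = [10, 12, 11, 13]
treated = [15, 17, 16, 18]

p = permutation_p_value(control, treated)
print(f"p = {p:.3f}")  # p = 0.029
```

Here only 2 of the 70 possible splits are as extreme as the observed one, so the p value is 2/70. Whether that counts as significant depends on the threshold that is considered acceptable in your field.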
Bibliography
This paper provides a discussion of how to choose the correct statistical test:
- Windish, D.M. and Diener-West, M. (2006). A Clinician-Educator's Roadmap to Choosing and Interpreting Statistical Tests. Journal of General Internal Medicine 21 (6): 656-660. Retrieved August 25, 2009 from http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=16808753
These resources provide additional information on what to do with outliers:
- Fallon, A. and Spada, C. (1997). Detection and Accommodation of Outliers in Normally Distributed Data Sets. Retrieved August 25, 2009, from http://www.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html
- High, R. (2000). Dealing with 'Outliers': How to Maintain Your Data's Integrity. Retrieved January 13, 2011, from http://rfd.uoregon.edu/files/rfd/StatisticalResources/outl.txt
Additional information about statistics is available in these online statistics textbooks:
- McDonald, J.H. (2008). Handbook of Biological Statistics. Retrieved August 25, 2009, from http://udel.edu/~mcdonald/statintro.html
- NIST/SEMATECH. (2006). NIST/SEMATECH e-Handbook of Statistical Methods. Retrieved August 25, 2009, from http://www.itl.nist.gov/div898/handbook/