Picture This: How Visualizing Data Can Lead to the Right (or Wrong) Conclusion
Abstract
Big data is one of the most promising and hyped trends in technology and business today. Big data refers to data-analysis jobs that are too large and complex to be handled by the applications traditionally used. Big data sets hold valuable information. Many publicly available data sets have the potential to improve our everyday lives by giving us insight into the things we care about. How well-equipped are we to extract information from the data? Visualizations and aggregations (categorizing the data into groups) are frequently used tools to represent data in a manageable way. Unfortunately, when they are not used carefully, they can lead to false conclusions. In this science project, you will create and execute a survey to find out whether participants are prone to drawing dubious conclusions from data presented as an aggregation.
- Microsoft® is a registered trademark of Microsoft Corporation.
- Excel® is a registered trademark of Microsoft Corporation.
- Access® is a registered trademark of Microsoft Corporation.
Construct and execute a survey to evaluate how likely participants are to draw incorrect conclusions from a visualization representing big data.
Recently, storing data has become cheaper and cheaper, leading governments, corporations, and other organizations to create and store enormous amounts of data. This data is significant because it helps identify trends and can be used to find connections, solve problems, or answer questions we might have. In 2010 alone, the German Climate Computing Center generated 10 petabytes of climate data (a petabyte is 10^15 bytes) in one single year.
Some data sets are publicly available. Are we equipped to extract information we care about from these large data sets, or is specific training and practice needed? In this science project, you will look into what big data is, learn one way to extract information, and explore one common pitfall. It is a good opportunity for you to test your own skill for manipulating and extracting information from data as you explore data sets and prepare a survey.
Big data refers to data sets that are too large and complex to be processed and analyzed by the applications traditionally used for that purpose. A historical context might help explain the concept of big data. Up until about a century ago, scientists mainly used pen and paper to record data and make graphs. Can you imagine how limiting that would be when faced with studies like global climate change? In the 1980s, the first spreadsheet programs appeared. Spreadsheet programs like Microsoft® Excel® increased rapidly in capacity to handle data, and are currently widely used. They allowed us to record data electronically and to make graphs automatically. This made the storage and processing of data significantly easier and more efficient. With the turn of the century, the cost to store data plummeted, leading companies, governments, and scientists to store massive amounts of data, data that would otherwise have been forgotten (like which advertisement you clicked on your computer, on a specific day and time) or never would have been recorded to begin with (like videos made by surveillance cameras). Note that data is not restricted to numbers; it can include text, audio, pictures, and video. As spreadsheets became insufficient to handle these massive data sets, other tools were created. Supercomputers, parallel computing (where calculations are executed simultaneously on different servers), and grid computing (where calculations are distributed over machines located at different locations) were invented to help handle the data.
Can you see how big data is a moving target? What is considered to be "big" today will not be so years from now. What is considered to be "big" for you as you do your science project will likely not be so for a scientist analyzing climate data. To read more about big data, check out the Science Buddies page What is Big Data?
Extracting valuable information from big data is not an easy task. The original data set—this is the set that has not been processed or manipulated—is often referred to as raw data or primary data. Visual representations and aggregations are tools that help transform the data into graphical representations of information—also called infographics—from which we can extract information. Charts, diagrams, bar graphs, and maps are all examples of visual representations, which can be created using computer software. Aggregations are created when you group the data into specific collections or categories based on one or a number of characteristics. An example of an aggregation is grouping incidents according to their location (country, state, police district) or time of the occurrence (month, day, hour of the day). Aggregations can drastically reduce the amount of data lines to store, making them easier to handle. On the other hand, they also reduce the amount of detail available. Most data tables available on the internet contain aggregated data. The World Bank website is just one place where you can find these. For this science project, we advise starting from non-aggregated data, like a list of crime incidents or a list of wild fires.
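To make the idea of an aggregation concrete, here is a minimal Python sketch. The incident rows, the column names (`district`, `hour`), and the helper `aggregate` are all made up for illustration; a real incident list would have many more rows and columns.

```python
from collections import Counter

# Hypothetical raw data: one row per incident (values are invented).
incidents = [
    {"district": "North", "hour": 23},
    {"district": "North", "hour": 2},
    {"district": "South", "hour": 14},
    {"district": "East", "hour": 23},
    {"district": "North", "hour": 14},
]


def aggregate(rows, key):
    """Count how many rows share each value of the given column."""
    return Counter(row[key] for row in rows)


by_district = aggregate(incidents, "district")  # aggregation over location
by_hour = aggregate(incidents, "hour")          # aggregation over time

print(dict(by_district))
print(dict(by_hour))
```

Notice that five rows collapse into three counts per district. The counts are easier to handle, but the hour of each incident can no longer be recovered from them, which is the loss of detail described above.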
It is very important to keep the question you would like to get answered in mind when manipulating big data. Say you would like to know the safest districts of the city in which you live, where safety is measured by the likelihood of encountering a crime incident. Starting from a list of crime incidents (a big data set) found on the internet, you can create a visualization showing the number of crime incidents per district. Figure 1 shows an infographic for the crime incidents list for San Francisco in 2014.
There are 10 police districts represented on the graph. The district with the highest number of reported crimes is the Southern district, with just over 28,000 crimes reported. The district with the fewest reported crimes is Richmond, where just over 7,700 crimes were reported.
Figure 1. San Francisco crime incidents reported in 2014, aggregated by police district, created with Tableau Public from the San Francisco 2014 incidents list.
Which districts would you identify as the place where you are least likely to encounter a crime incident? Or would you conclude you cannot answer the question from the data presented and look for a different or more detailed visualization?
Would your answer be different if you were also presented with a geographic map of the police districts, as shown in Figure 2 for the San Francisco area?
Police districts in San Francisco segment the city into many oddly shaped sections. The largest district appears to be Taraval or Ingleside, and the smallest is Tenderloin.
Figure 2. Map displaying the San Francisco police districts.
Exploring the same data in more detail might lead you to visualize the incident locations on a map, as shown in Figure 3. Would this infographic be better suited to answer the question at hand?
Figure 3. San Francisco crime incidents displayed on a map, where color indicates the different police districts. Each dot represents a crime incident at that location. Darker dots indicate more incidents at that location. Visualization created with Tableau Public from the San Francisco 2014 incidents list.
Looking only at Figure 1, people might be misled and identify a district with a lower total number of incidents as safer. Because some police districts are smaller than others, this infographic does not provide adequate information to identify safer districts. Adding a geographic map, like the one shown in Figure 2, might increase awareness of how different in size the districts are. Figure 3, on the other hand, provides information on the density of incidents, which is what is needed to identify the safer districts. Take a moment to look carefully at Figures 1 and 3, and identify which misconceptions Figure 1 can introduce. Take the Tenderloin district as an example. Figure 1 shows it has a fairly low total number of incidents, which might lead to the idea that this is a fairly safe area. Figure 2 can place this low number in context, as Tenderloin is a very small district. In Figure 3, you see that Tenderloin has quite a high density of incidents (a relatively small total number of incidents spread over a very small area gives a relatively high density of incidents, and therefore a higher likelihood of encountering one). Figure 3 will lead you to correctly conclude that Tenderloin is not so safe after all.
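The difference between a count and a density can also be seen with a little arithmetic. The sketch below uses three hypothetical districts; the incident counts and areas are invented for illustration and are not the San Francisco figures. It ranks the districts twice, once by total incidents and once by incidents per square kilometer.

```python
# Hypothetical districts: counts and areas are invented for illustration
# and are NOT taken from the San Francisco data set.
districts = {
    "Large": {"incidents": 20000, "area_km2": 40.0},
    "Medium": {"incidents": 12000, "area_km2": 15.0},
    "Small": {"incidents": 9000, "area_km2": 1.5},
}


def density(record):
    """Incidents per square kilometer."""
    return record["incidents"] / record["area_km2"]


# Rank districts from "safest" to "least safe" under each measure.
by_count = sorted(districts, key=lambda d: districts[d]["incidents"])
by_density = sorted(districts, key=lambda d: density(districts[d]))

print("Fewest incidents first:", by_count)
print("Lowest density first:  ", by_density)
```

With these made-up numbers the two rankings are exactly reversed: the district with the fewest incidents has by far the highest density, much like the Tenderloin example.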
Available data tables often contain aggregated data (like a count per country) or are presented in a format that allows for easy aggregation, making people prone to create these visualizations first. On the other hand, there are a lot of questions that can only be answered by infographics representing a density instead of a count. To investigate the likelihood of contracting Ebola when visiting a country, you need information on the density of cases per country, not the total number of cases in a country. As another example, the total waste produced by a school will not identify schools that successfully practice waste-reduction principles. In this case, an infographic showing the waste per student would be more informative. Similarly, the number of gun owners by country will not tell you how likely people in a country are to own one or more guns. In this case, an infographic of gun ownership per capita (which means "per person", or sometimes "per 1,000 people") would be more informative.
In this science project, you will study how prone people are to make quick, invalid conclusions from infographics showing aggregated data over location, where density is actually needed to answer the question at hand. You will identify a data set containing incidents with their location, create visualizations of the data, and construct a question that can be answered by looking at only the density. For your survey, you will select three groups of volunteers. One group will be given a visualization displaying the aggregated data (like Figure 1) to answer the question. A second group will be provided the same visualization, together with a geographic map (like Figure 2), and a last group will be given an x-y scatter plot of incidents by geographic location (like Figure 3) to answer the question. Analyzing the responses will provide information about how easily people are misled by infographics showing an aggregation over location.
Terms and Concepts
- Big data
- Raw (or primary) data
- Visual representation
- Sample size
Questions
- What has sparked the existence of big data sets?
- You can find many data sets on the internet. Why are most of them the results of manipulating big data, and rarely the raw data itself?
- What could be the advantages and disadvantages of having big data sets publicly available?
Bibliography
- Wikipedia contributors. (2015, January 29). Infographic. Wikipedia: The Free Encyclopedia. Retrieved February 11, 2015.
- Wikipedia contributors. (2015, January 28). Misleading graph: poor construction. Wikipedia: The Free Encyclopedia. Retrieved January 28, 2015.
- Statistics Canada. (2013, July 23). Statistics: Power from Data. Retrieved January 28, 2015.
You can download free data-visualization software from:
- Tableau Public. (n.d.). Download Tableau Public. Retrieved January 28, 2015.
Materials and Equipment
- Computer on which you have permission to download the free data-visualization tool Tableau Public. Note: You can use a different data-analysis tool if there is another one you prefer. However, you will need to adapt the procedure accordingly if you decide to use a different program.
- Three groups of volunteers to take a short test, at least 10 in each group
- Lab notebook
Working with Human Test Subjects
There are special considerations when designing an experiment involving human subjects. Fairs affiliated with Intel International Science and Engineering Fair (ISEF) often require an Informed Consent Form (permission sheet) for every participant who is questioned. Consult the rules and regulations of the science fair that you are entering, prior to performing experiments or surveys. Please refer to the Science Buddies documents Projects Involving Human Subjects and Scientific Review Committee for additional important requirements. If you are working with minors, you must get advance permission from the children's parents or guardians (and teachers if you are performing the test while they are in school) to make sure that it is all right for the children to participate in the science fair project. Here are suggested guidelines for obtaining permission for working with minors:
- Write a clear description of your science fair project, what you are studying, and what you hope to learn. Include how the child will be tested. Include a paragraph where you get a parent's or guardian's and/or teacher's signature.
- Print out as many copies as you need for each child you will be surveying.
- Pass out the permission sheet to the children or to the teachers of the children to give to the parents. You must have permission for all the children in order to be able to use them as test subjects.
Creating the Survey
The following procedure assumes you are using Tableau Public, a free data-visualization tool that you can download. Note: You can use a different data-analysis tool if there is another one you prefer. However, you will need to adapt the procedure accordingly if you decide to use a different program.
- Look for raw data containing incidents or events listing their geographic locations.
- Look for data sets that include the following information:
- Each row (or entry) represents an incident or event.
- Each row contains information on the geographic location of the incident or event. This could be latitude and longitude, or X and Y coordinates. Some data-analysis tools may allow you to display a street address.
- Each row contains a geographic grouping parameter, like district, city, county, or country. Although you could create this aggregation yourself, having it already available will make it much easier to use the data set. Note: This project works best if the geographic aggregates used (for instance, the countries included) are quite different in size, as this will create a bigger difference between a total number bar graph and a density graph.
- Look for data that can be downloaded in a format supported by your data-analysis tool. Tableau Public supports Microsoft Excel, Microsoft Access®, and text files.
Here are some options to get you started:
- Many United States cities provide a downloadable list of incidents of crimes or traffic violations reported, with their locations and districts. Here are some examples:
- The San Francisco 2014 crime incidents used to create Figures 1, 2, 3, 5, and 6. This data was downloaded from the San Francisco police department web page. Other cities release similar data. Look for links to raw data added to graphs or tables of crime statistics.
- Other cities that currently share raw data include Seattle, Washington; Chicago, Illinois; and Portland, Oregon.
- A list of United States wildfires, with their locations and districts, can be retrieved from Fire and Weather Zip Files.
- An internet search for "raw data"— adding the city, county, or area of your choice— can lead to other data sets.
- Download the chosen data to your computer.
- Prepare data to be loaded to Tableau Public.
- Tableau Public, as with many other data-analysis tools, expects the first row of your data file to provide the titles of the columns, followed by rows of data. Check your data file and make adjustments to accommodate this format where needed.
- Download and open Tableau Public. Their website also provides a good introductory tutorial, showing you how to load and use the data. Take a moment to watch the introduction video, and any others you think will help you.
- Load data to the analysis tool. Tableau Public tutorials can walk you through this step.
- Create Infographic A— An aggregation of incidents by geographic location.
- Identify which aggregation over location is present in your data set. Look for a district, county, state, or country indication, available as a variable in the data. We will refer to this variable as the region.
- Create a bar graph showing the number of incidents per region. In Tableau Public, you will select the variable representing the region as "Column" and the number of records as "Row." Figure 4 shows you how this can be done.
- Save your infographic.
The Tableau Public program has navigation bars at the top of the window and, on the left-hand side of the screen, areas listing the dimensions and measures that can be used. These dimensions and measures can be dragged and dropped into various fields, such as pages, filters, marks, columns, and rows, to customize a graph.
Figure 4. To create a bar graph displaying the number of incidents per district in Tableau Public, drag the "Number of Records" (highlighted in green) to the "Rows" option after selecting the "District" (highlighted in blue) as "Columns."
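If you would like to check your data file before loading it, the following Python sketch shows one way to confirm that the first row holds the column titles and to count the incidents per region, which is the same count the bar graph displays. It is not part of Tableau Public. The column titles (`District`, `Latitude`, `Longitude`), the sample rows, and the helper `count_by_region` are assumptions made for this example; your data set will likely use different names.

```python
import csv
import io
from collections import Counter

# Hypothetical file contents; real data sets will use other column titles.
SAMPLE_CSV = """District,Latitude,Longitude
North,37.80,-122.41
South,37.72,-122.40
North,37.79,-122.42
"""

# Column titles this sketch expects to find in the first row (an assumption).
REQUIRED_COLUMNS = {"District", "Latitude", "Longitude"}


def count_by_region(text, region_column="District"):
    """Check the header row, then count the rows (incidents) per region."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column titles: {sorted(missing)}")
    return Counter(row[region_column] for row in reader)


counts = count_by_region(SAMPLE_CSV)
print(counts.most_common())
```

To try this on a downloaded file, you could read the file into a string with `open(path, newline="").read()` and pass that string in, after adjusting the column titles to match your data.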
- Create Infographic B— An x-y scatter plot of incidents by geographic location.
- Locate the variables in your data set that represent the exact location. This is the latitude and longitude of the incident. Note that different data sets might use different words to label the columns holding these variables.
- Create a scatter plot showing the incidents by geographic latitude and longitude. This can be done by selecting the variable representing the longitude as "Column" and the variable representing the latitude as "Row."
- Adjust the color and size of your marks. You can do this using the "Color" and "Size" buttons located in the Marks screen. A smaller-sized mark allows for a better resolution. Figure 5 shows an example where the marks were set too big. Allowing the marks to be partially transparent allows you to identify overlapping marks (several incidents at the same location). Figure 6 displays an example of zero transparency. Notice how in Figure 3 in the Introduction color intensity and density of marks allow you to identify areas with a high number of reported incidents.
- Save your infographic.
The example visualization uses dots to represent where a crime was reported, but the dots are too big. When looking at the graph, it is hard to distinguish between individual dots.
Figure 5. In this scatter plot, the size of each mark is too big to create a good resolution. Visualization created with Tableau Public from the San Francisco 2014 incidents list.
The example visualization uses dots to represent where a crime was reported, but the dots are all opaque. When looking at the graph, it is hard to tell whether a single spot has multiple dots on top of each other.
Figure 6. In this scatter plot, the color of the marks has no transparency, which does not allow you to differentiate spots with multiple marks, one on top of the other. Visualization created with Tableau Public from the San Francisco 2014 incidents list.
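Why does partial transparency reveal overlapping marks? The sketch below illustrates the idea numerically, under two assumptions: that marks drawn at the same spot are stacked with ordinary alpha blending, and that every mark has the same opacity (0.3 here, an arbitrary choice). The coordinates and the helpers `overlap_counts` and `stacked_opacity` are hypothetical, and this is not a description of how Tableau Public renders marks internally.

```python
from collections import Counter

# Hypothetical (latitude, longitude) pairs; several repeat on purpose.
points = [
    (37.7810, -122.4120),
    (37.7810, -122.4120),
    (37.7810, -122.4120),
    (37.7650, -122.4200),
    (37.7650, -122.4200),
    (37.7300, -122.3900),
]


def overlap_counts(coords, decimals=3):
    """Count incidents that fall on the same (rounded) location."""
    return Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in coords
    )


def stacked_opacity(n_marks, alpha=0.3):
    """Combined opacity of n identical marks drawn on top of each other."""
    return 1 - (1 - alpha) ** n_marks


for location, n in overlap_counts(points).items():
    print(location, n, round(stacked_opacity(n), 2))
```

With fully opaque marks (`alpha=1.0`), one incident and three incidents at the same spot both give an opacity of 1, so they look identical. With partial transparency, the spot with three incidents comes out visibly darker.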
- Find a geographic map that shows the regions used for the aggregation in Infographic A. Figure 2 in the Introduction is an example where police districts are shown on a map. Note: An internet search will likely provide a good map.
- Create a good question. Find a question that can be answered using Infographic B, which is an x-y scatter plot of the incidents by geographic location.
- The Science Buddies designing a survey page can help you create a good question.
- You could use a partially structured question for this survey, like
"Which district(s) would you identify as being the safest, based on the year 2014 reported incidents data provided?"
( ) District(s): ...
( ) This visualization does not provide adequate information.
- An example of a structured question, using the example found in the Introduction, could be:
"Which district(s) in San Francisco would you identify as being the safest, based on the year 2014 reported incidents data provided?"
( ) Bayview
( ) Central
( ) Not enough information to answer the question.
- Note that the way you word the question is very important. Here are some example questions that would not work for the data found in the Introduction.
- A question like "Which district has the highest number of crime incidents?" would not be a good question for your survey. This question can be answered by using the bar graph, as it asks for a total number of incidents per districts. The density is irrelevant in this case.
- A question like "Which district has the most crime incidents per unit area?" makes it more obvious that the bar graph alone is lacking the area information. The goal of your survey is to see if people will think of this themselves, so you should be careful not to "give it away" with the question.
- Prepare a sample survey.
- Decide how you will run your survey. Will you email people you know (electronic format), or will you ask people in person (written format)?
- Once you know how you will approach your participants, you can prepare a survey, such as creating an email or a print version of your survey.
- Run a sample test to evaluate your survey:
- Did the participants understand what they were asked to do?
- Can you extract the information you need from the answers received? In other words, can you identify from the answer whether the participant felt he or she was given adequate information to answer the question?
Conducting the Survey
- Determine how many participants you need.
- The Science Buddies How many participants do I need? resource page can help determine your sample size or the number of randomly selected participants you will need for each group in your survey to be able to draw meaningful conclusions.
- Tips for conducting the survey:
- The survey results should be anonymous, so do not ask for or include any identifying information.
- To conduct the study, you could look for volunteers at a busy shopping center or other public gathering place (adult supervision highly recommended).
- Alternatively, you could seek permission to conduct the survey at school.
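One practical question is how to decide which volunteer sees which infographic. A simple option, offered here as a suggestion rather than a requirement of the project, is to assign volunteers to the three groups at random so the groups end up the same size. The sketch below assumes you number your volunteers instead of recording names, which keeps the survey anonymous. The group labels and the helper `assign_groups` are made up for this example.

```python
import random


def assign_groups(participant_ids, group_names, seed=None):
    """Shuffle the participants, then deal them out to the groups in turn."""
    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    groups = {name: [] for name in group_names}
    for i, pid in enumerate(shuffled):
        groups[group_names[i % len(group_names)]].append(pid)
    return groups


GROUPS = ["A only", "A plus map", "B only"]

# 30 anonymous volunteers, numbered 1 to 30 (a hypothetical sample size).
assignment = assign_groups(range(1, 31), GROUPS, seed=42)
for name, members in assignment.items():
    print(name, len(members), sorted(members))
```

If you survey people as you meet them, you could prepare the shuffled order in advance and hand each new volunteer the next survey version on the list.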
Tabulating and Analyzing Results
As you are only interested in knowing whether the participants felt they were given adequate information to answer the question, you can group the answers into two main categories: those who indicated they could not answer the question because the data provided was inadequate, and those who selected one or several specific districts. If needed, add an "ambiguous" category for any answer you feel does not fall under either of these two groups.
- Copy the following data table in your lab notebook. It will be used to tabulate your results.
|Answer Selected|Infographic A Only|Infographic A and Geographic Map|Infographic B Only|
|One or several districts identified| | | |
|Not enough information| | | |
|Ambiguous| | | |
|Total number of participants who responded| | | |
- Analyze the results of the group given only Infographic A:
- Count the number of participants who identified one or several districts.
- Count the number of participants who indicated they did not have enough information.
- Count the number of answers that were ambiguous, meaning that you were not able to decide from the answer whether the person thought the data provided was enough to answer the question.
- Record your results in a table like Table 1.
- Repeat the counts from the previous step for the group that was given Infographic A and a geographic map.
- Repeat the same counts for the group given only Infographic B.
- Convert each count into a percentage of the participants in its group (that is, divide the count by the number of completed surveys for that group and multiply by 100).
- Calculate the margin of error for your survey, both for your counts and for your percentage values. How many participants do I need? provides information on how to calculate the margin of error. Record your margin in the table in the following format: "number ±margin of error".
- Graph your data with the groups on the x-axis (horizontal axis) and the counts on the y-axis (vertical axis). More-advanced students can indicate the margin of error on the graph by adding a lower and upper limit for each count.
- Repeat the previous graphing step, now graphing the percentages with their margins of error.
- Draw conclusions from your data and the graphs. Here are some questions that might get you started:
- According to your data, were your participants prone to draw false conclusions when given an infographic showing data aggregated by location?
- Were your participants less prone to draw false conclusions when they were presented with a geographic map in addition to the aggregated data infographic?
- Were your participants confident in drawing conclusions from the data displayed as incidents on a map?
- Thinking about your survey, what information or considerations could you add? Here are some questions that might get you started:
- Could you use other infographics, or change your infographics so people could extract information and answer the question at hand faster and/or more accurately?
- Do you feel your survey circumstances mimic the environment in which people would use infographics?
- What other factors might lead people to make dubious conclusions from aggregated data?
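The percentage and margin-of-error steps above can be sketched in a few lines of Python. This sketch uses the common normal-approximation formula for a 95% margin of error on a proportion, which is an assumption on our part and may differ from the method described on the How many participants do I need? page. The approximation is also rough for groups as small as 10, so treat the output as a guide. The counts in `results` are made up and are not real survey results.

```python
import math


def percentage(count, total):
    """Share of a group, in percent."""
    return 100.0 * count / total


def margin_of_error(count, total, z=1.96):
    """Approximate 95% margin of error, in percentage points.

    Uses the normal approximation z * sqrt(p * (1 - p) / n).
    """
    p = count / total
    return 100.0 * z * math.sqrt(p * (1 - p) / total)


# Made-up counts for illustration only (not real survey results).
results = {
    "Infographic A only": {"districts": 7, "not_enough": 3},
    "Infographic A and map": {"districts": 5, "not_enough": 5},
    "Infographic B only": {"districts": 9, "not_enough": 1},
}

for group, counts in results.items():
    total = sum(counts.values())
    pct = percentage(counts["districts"], total)
    moe = margin_of_error(counts["districts"], total)
    print(f"{group}: {pct:.0f}% ±{moe:.0f} points chose districts (n={total})")
```

Notice how wide the margins are with only 10 participants per group. This is one reason to recruit more volunteers if you can.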
Variations
- In this science project, you studied how prone people are to draw invalid conclusions when presented with an infographic showing data aggregated by location. You could add a study to see if people recover from their mistake when provided with the more-detailed representation of the data afterwards. To do this, first ask your participants to answer the question as described in this project, then provide them with the other visualization and ask them to answer the question again, now using both infographics A and B.
- In this science project, you only looked at the answers provided. You could add a speed component by measuring the time taken to answer the questions. Do people take more time to understand a bar graph over a distribution of incidents by geographic location? Is there a correlation between coming to false conclusions and the time spent answering?
- The Introduction mentions other questions that can only be answered using a density of incidents. You could change this science project to accommodate these types of data. As an example, in trying to assess the risk of coming in contact with Ebola, you could present participants of one group with an infographic showing the number of cases per region (such as country) and a second group an infographic showing the number of cases per capita for the region (such as countries). The number of cases per capita can be found by dividing the total number of cases in an area by the population of that same area.
- If you are interested in the effect of specific courses like statistics, critical thinking, or reasoning on our vulnerability to draw dubious conclusions from infographics, you could compare the results of a group of students that have taken a particular class to students not taking the class. Do they perform any better? Make sure you add your margin of error, as your sample size of students having taken a class might be relatively small. Be on the lookout for fast generalizations; not all statistics or reasoning classes are the same!
- This science project only examines one factor that could lead to misinterpretation of an infographic. The article from the Bibliography, misleading graphs, suggests many more. You could study the vulnerability of the public to any of these misleading factors.
- Mean values are often depicted in infographics. The article Bar graphs depicting averages are perceptually misinterpreted: the within-the-bar bias claims that people are quickly biased by these representations. Can you measure this factor in a survey?
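For the per-capita variation described above, the calculation is a single division. The sketch below uses two hypothetical regions whose case counts and populations are invented for illustration. The `per` parameter of the helper `per_capita` (also hypothetical) lets you express the result per 1,000 people rather than per person.

```python
def per_capita(count, population, per=1):
    """Number of cases per `per` people in the region."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * per / population


# Hypothetical regions: case counts and populations are invented.
regions = {
    "Region X": {"cases": 500, "population": 10_000_000},
    "Region Y": {"cases": 200, "population": 400_000},
}

for name, data in regions.items():
    rate = per_capita(data["cases"], data["population"], per=1000)
    print(f"{name}: {data['cases']} cases, {rate:.2f} per 1,000 people")
```

Region X has more cases in total, but Region Y has ten times as many cases per 1,000 people. A bar graph of totals and a bar graph of per-capita values would therefore suggest opposite answers to the question of where the risk is higher.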