What is Big Data?

You might have read or heard the phrase big data used on the internet or in a television commercial. But what is big data? How is it defined, and how is it different from "regular" data? Can you use it for a science project? This reference page will help answer some of these questions and get you started exploring the world of big data.

Data is information, and there are many different types of information. If you have done a science project before, you probably collected information by writing down numbers in a data table, like a distance that you measured with a ruler, or a weight you measured with a scale. You might have used that data to make a chart or a graph, but data does not just mean numbers; it can also be words like names and addresses, or computer files like pictures and videos. For example, your school might store a list of names, ages, addresses, and an identification photo for each student. That list contains several different types of data. Can you think of other types of data that you might use in a science project or in everyday life?
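If you are curious how a computer might hold the school list described above, here is a minimal sketch in Python. The class name `StudentRecord`, the field names, and the sample values are all made up for illustration; a real school system would be organized differently. The point is only that one record can mix several types of data: text, numbers, and file contents.

```python
from dataclasses import dataclass


@dataclass
class StudentRecord:
    """One row of a hypothetical school list."""
    name: str      # text
    age: int       # a whole number
    address: str   # text
    photo: bytes   # raw contents of an image file


# Made-up sample values; the photo bytes are a placeholder, not a real image.
record = StudentRecord(
    name="Alex Example",
    age=12,
    address="123 Main Street",
    photo=b"placeholder image bytes",
)


def field_types(rec):
    """Return the name of the data type stored in each field."""
    return {field: type(value).__name__ for field, value in vars(rec).items()}


print(field_types(record))
```

Running this prints the type stored in each field, which shows that a single record can already contain a variety of data.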

In the past, most data was recorded, stored, and analyzed by hand. The invention and widespread use of computers in the 20th century, followed by the internet, dramatically changed the way we collect and store data. Computers allow us to store data electronically, and the internet allows us to rapidly transfer data from one place to another. Look at Table 1 and Figure 1, below, for some examples of how computers and the internet have changed our ability to collect, store, and analyze data.

| Before Computers and the Internet | After Computers and the Internet |
| --- | --- |
| Scientists (or students doing science projects!) would record data using a pencil and paper, and draw graphs by hand. | Students and scientists can use spreadsheet programs like Microsoft® Excel® to record large amounts of data and automatically make graphs. |
| People had to rely on huge phone books (with hundreds, or even thousands, of pages) to look up names, phone numbers, and addresses. Phone books do still exist; despite being so large, each one only contains information for one city. | You can use the internet to quickly look up an address anywhere in the world with a service like Google Maps. A single smartphone can store contact information for hundreds or thousands of people, so you can reach them without needing a phone book. |
| To keep track of your purchases, you would have to keep paper copies of receipts for everything you bought. | You can make purchases with a credit or debit card. While you still have the option to get a paper receipt, these purchases are also recorded electronically, and you can log in to your bank account online to view a history of all the purchases you have made. |
| A large photo album could hold a couple hundred printouts of 4×6 pictures. To share a picture with someone, you would have to get an additional copy printed and mail it to them. | The memory card for a smartphone or digital camera can hold thousands of images. You can instantaneously share them with many people at once using text messages, email, or social media. |
| To look up directions from one place to another, you would have to use a paper map and write down the directions. | You can use a smartphone or GPS device to automatically find directions from one place to another. |
Table 1. Some examples of how computers and the internet have changed the way we collect, store, and analyze data.

Example image shows that the contents of a phone book and a photo album can now be stored on an SD card
Figure 1. Computers allow us to take entire sets of data that used to be stored separately—like in photo albums and phone books—and store all of that information on one tiny memory card.
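To get a feel for the numbers behind Figure 1, here is a rough calculation in Python. The card size (32 GB) and the photo size (4 MB) are assumed values chosen for illustration; real cards and photos vary a lot. The album capacity of 200 prints comes from the "couple hundred" in Table 1.

```python
# Assumed values for illustration only; real cards and photos vary.
CARD_SIZE_GB = 32      # hypothetical memory card size
PHOTO_SIZE_MB = 4      # hypothetical size of one digital photo
MB_PER_GB = 1000       # metric prefixes: 1 gigabyte = 1000 megabytes
ALBUM_CAPACITY = 200   # "a couple hundred" prints per album (Table 1)


def photos_per_card(card_size_gb, photo_size_mb):
    """Whole number of photos that fit on a card of the given size."""
    return (card_size_gb * MB_PER_GB) // photo_size_mb


photos = photos_per_card(CARD_SIZE_GB, PHOTO_SIZE_MB)
albums = photos / ALBUM_CAPACITY

print(f"{photos} photos, or about {albums:.0f} photo albums")
```

With these assumed numbers, one card holds 8000 photos, which is as many as about 40 large photo albums. Try changing the values at the top to match your own camera or phone.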

However, just because the data is collected and stored electronically does not mean it is "big." Big data refers to the dramatic increase in our ability to collect, store, and process data, thanks to huge increases in computing power. Computing power can include the availability of storage space (like on hard drives, flash drives, and memory cards), the speed of computer processors, and the speed of internet connections, all of which have increased at an incredible rate through the beginning of the 21st century (see the Technical Note, below, for more details). This increase in computing power goes along with a huge increase in the number of devices that are connected to the internet, ranging from smartphones and computers to weather stations and traffic monitors. As a result, we are collecting more types of data more quickly than ever before. This leads to the "three V's" that are used to define big data: volume, velocity, and variety. A "big data" problem can have one or more of the following characteristics:

  • Volume: a very large amount of data is collected or stored.
  • Velocity: data is collected, and often must be processed, very quickly.
  • Variety: many different types of data (such as numbers, text, pictures, and videos) are collected.

While these concepts can be used to help define big data, there is no strict definition of "big." What a small company with a few employees considers big might not be considered big by a large internet company with thousands of employees. Furthermore, what is generally considered big today might not be considered big 5 years from now, as computing power continues to increase. So, big data is a moving target, but as more and more devices are connected to the internet and we continue to collect more and more data, it will continue to be a challenge to deal with that data and do something useful with it instead of letting it go to waste. It will be up to future generations of data scientists, computer programmers, and statisticians to help us analyze big data.

To help you think more about big data, here are some examples of things you might experience in everyday life, and how they can become "big data" problems when viewed on a much bigger scale. Can you identify how the situations described below have a big volume, velocity, and/or variety of data? Remember that any specific "big data" situation must have one or more of the "three V's", but does not have to have all three.

An image from Google Maps shows different levels of traffic for each street in New York City

An overhead view of New York focuses on Manhattan island, with each street highlighted in green, yellow, or red. These colors indicate vehicle traffic on that street: green for light traffic, yellow for moderate traffic, and red for heavy traffic. The map from Google shows that most visible freeways are clear of traffic, while many side streets and avenues are heavily congested. Streets to the east of Central Park have the highest concentration of heavy traffic.

Figure 2. This screenshot from Google Maps shows traffic data in New York City (green indicates light traffic, darker colors indicate heavy traffic). How do you think Google manages to record and display traffic information for that many roads all at once, and update it constantly? This is definitely a "big data" problem!

Do you now have a better grasp on what big data means? Are you ready to try and tackle big data for your own science project? If you need help getting started, here is a list of existing Science Buddies projects that utilize big data.

Technical Note:

Computing power can refer generally to several things, such as:

  • The availability of storage space, measured in gigabytes, terabytes, or even petabytes (if you are not familiar with metric prefixes, see this Wikipedia page for reference).
  • Increased speed of computer processors, measured in gigahertz (GHz), or operations per second.
  • Internet connection speeds, measured in megabits per second (Mbps) or even gigabits per second (Gbps). Note that a bit is not the same as a byte: one byte equals eight bits.

Note: If you want to learn more about these topics, research what a typical hard drive size, processor speed, and internet connection speed were like in the 1990s compared to today. You might be amazed by the results!
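
As a starting point for that comparison, here is a small Python sketch that estimates how long it takes to transfer a file at a given connection speed. File sizes are usually given in bytes and connection speeds in bits per second, so the calculation multiplies by eight to convert. The 1 GB file size and the two speeds (a 56 kbps dial-up modem and a 100 Mbps connection) are example values assumed for illustration, and the estimate ignores real-world slowdowns.

```python
BITS_PER_BYTE = 8
MEGABYTES_PER_GIGABYTE = 1000  # metric prefixes


def transfer_seconds(file_size_gb, speed_mbps):
    """Ideal time in seconds to move a file of file_size_gb gigabytes
    over a connection of speed_mbps megabits per second."""
    file_size_megabits = file_size_gb * MEGABYTES_PER_GIGABYTE * BITS_PER_BYTE
    return file_size_megabits / speed_mbps


# Assumed example values, not measurements.
FILE_SIZE_GB = 1
DIAL_UP_MBPS = 0.056   # a 56 kbps dial-up modem
MODERN_MBPS = 100      # a hypothetical modern home connection

dial_up_hours = transfer_seconds(FILE_SIZE_GB, DIAL_UP_MBPS) / 3600
modern_seconds = transfer_seconds(FILE_SIZE_GB, MODERN_MBPS)

print(f"Dial-up: about {dial_up_hours:.1f} hours")
print(f"Modern connection: about {modern_seconds:.0f} seconds")
```

With these example values, the same 1 GB file that takes about 80 seconds on the faster connection would take well over a day on dial-up. Replace the speeds with numbers you find in your own research.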