There was no singular moment of Big Data Bang, but we are living in and heading towards a time of seemingly endless and exponential data explosion, and the race to create solutions and strategies to help tame, store, organize, and make sense of big data is on. But even before there was a moniker, there were large data sets being created and used in fields like the life sciences. Today, those data sets continue to grow and evolve. Working with publicly available big data sets offers students a chance to get in on the forefront of the big data movement and start tackling big data questions and helping create big data solutions.
The phrase "big data" has made a correspondingly big splash in circles ranging from business to sports to school admissions boards. A few years ago, talk was about data mining. Today, no matter where you turn, you may run into discussions about big data.
With increasing amounts of data available and coming in from a variety of input points and multiplying over time, figuring out ways to best manage, organize, and manipulate data so that it can be analyzed to extract meaningful and useful information is a challenge that developers and researchers are racing to solve.
A Problem of Scale
The term big data highlights the fact that today's data is already vast, and it is ever-climbing upwards in size. The growth of data doesn't necessarily have an end point. Discussions of big data often talk about data sets that are in the terabytes, petabytes, or exabytes (a quintillion bytes). These are data sets that are too large to manage with ordinary tools, traditional spreadsheets, and manual number crunching, and the data sets are growing by the nanosecond.
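To get a feel for how far apart these sizes are, the units can be compared with a few lines of Python. This is only a rough sketch: it assumes decimal (SI) units, which matches the "quintillion bytes" figure above, and the names `BYTES_PER_UNIT`, `to_bytes`, and `ratio` are made up for this illustration.

```python
# Decimal (SI) byte units. Assumption: "a quintillion bytes" implies
# powers of ten rather than binary units such as TiB or PiB.
BYTES_PER_UNIT = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,  # one quintillion bytes
}


def to_bytes(count: int, unit: str) -> int:
    """Convert a whole number of units into bytes."""
    return count * BYTES_PER_UNIT[unit]


def ratio(larger: str, smaller: str) -> int:
    """Return how many of the smaller unit fit in one of the larger unit."""
    return BYTES_PER_UNIT[larger] // BYTES_PER_UNIT[smaller]


if __name__ == "__main__":
    print(to_bytes(1, "exabyte"))        # bytes in one exabyte
    print(ratio("exabyte", "terabyte"))  # terabytes in one exabyte
```

Run as a script, this shows that a single exabyte holds a million terabytes, which is one way to see why ordinary tools run out of room.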
Dealing with big data is sometimes labeled a "business" or "tech" problem, a reality for businesses that need, for example, to better understand their customers in order to best serve (and keep) them as loyal, satisfied customers. Interpreting and acting upon gathered data can shine a light on problems, successes, and trends which may help a company make changes and decisions about future offerings, services, and products. The problem is that the data is, in many cases, so large that companies are struggling to figure out how to best use and analyze it.
Big Data is a business challenge—but it is bigger than that.
Bio Big Data
Science is no stranger to large data sets. Science, in fact, has been creating, gathering, and analyzing "large" data for years. What has changed and continues to change is that the amount of data continues to increase, and the tools available for extracting information from that data (or for processing the data) continue to evolve.
You can find examples of big data sets in many fields of science, including astronomy and meteorology, but for researchers in genetics, genomics, and biotechnology, dealing with big data is an important part of day-to-day projects. Luckily, publicly available tools exist that allow researchers (including students) to plug in query information and extract results from big data sets. These tools make the data accessible. As the data continues to grow, scientists continue to develop new tools and projects that aim to better understand the data and that seek to find new answers and new approaches to analyzing big data.
The Human Genome Project
In the life sciences, the importance of big data and collective work toward central repositories of information can be seen in the Human Genome Project (HGP). A draft sequence was first published in 2000, and the completed sequence was released in 2003. According to the National Human Genome Research Institute, the completed HGP "gave us the ability, for the first time, to read nature's complete genetic blueprint for building a human being."
The HGP was groundbreaking for life sciences, but more than a decade later, the data is still being mined and combined with additional data to draw new results and conclusions. The "What's Next? Turning Genomics Vision Into Reality" page on the NHGRI site describes several projects that build upon the HGP. For example, the ENCyclopedia Of DNA Elements (ENCODE) project seeks to use HGP data to identify "all of the protein-coding genes, non-protein-coding genes and other sequence-based, functional elements contained in the human DNA sequence."
Though life scientists are increasingly immersed in big data, the future will hold much more big data, data of (literally) mind-boggling proportions. For example, both the Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative and the Human Brain Project are bioinformatics-driven big data projects focused on mapping the brain, an organ which has an estimated 86 billion neurons, each of which makes multiple connections.
With the creation of data sets like these underway, now is a great time for students to start digging into how to work with data sets, learning how to use existing tools, and gaining an understanding of the problems, issues, and possibilities of working with bioinformatics tools and big data.
Students interested in life sciences, genetics, genomics, or the challenges of Big Data can get in on the action with bioinformatics science projects like these:
- From Genes to Genetic Diseases: What Kinds of Mutations Matter?: a single DNA mutation can cause a genetic disease, and yet sometimes, a DNA mutation causes no harm. In this science project, students use the Genetics Home Reference and the National Center for Biotechnology Information (NCBI) Gene database to investigate the genetic mutations that cause cystic fibrosis. With an understanding of the bioinformatics tools and processes, students can study other genetic diseases. [See the Make It Your Own tab for suggestions.]
- Drugs & Genetics: Why Do Some People Respond to Drugs Differently than Others?: not all prescription drugs work the same for different people. In this science project students use the PharmGKB Pharmacogenomics Knowledge Base and focus on a chosen medication for a specific condition to explore how mutations in a DNA nucleotide (a Single Nucleotide Polymorphism or SNP) may be related to whether or not a medication will be effective for a patient.
- Hitting the Target: The Importance of Making Sure a Drug's Aim Is True: during testing of drugs, pharmaceutical companies have to ensure that a drug targets the intended problem but doesn't also create other problems. In this science project, students use the National Center for Biotechnology Information (NCBI) Gene database to explore how Ebola virus drugs could bind non-target proteins, an effect which could interfere with normal cellular and bodily functions.
- BLAST into the Past to Identify T. Rex's Closest Living Relative: what living species is most closely related to the T. Rex dinosaur? In this science fair project, students use the National Center for Biotechnology Information (NCBI) BLAST bioinformatics tool and the SwissProt protein database to compare genetic sequences of both extinct and living species.
- Bioinformatics—The Perfect Marriage of Computer Science & Medicine: many genetic diseases are caused by SNP mutations. In this science project, students use online databases to investigate the relationship between specific SNPs and specific diseases.
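Several of the projects above come down to comparing sequences and spotting single-letter differences. As a rough illustration of that idea, here is a minimal Python sketch that finds SNP-style mismatches between two DNA strings. It assumes the sequences are already aligned and the same length, which is a big simplification of what tools like BLAST actually do. The function names and the toy sequences are invented for this example and are not real gene data.

```python
from typing import List, Tuple


def find_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """List (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Assumption: the two sequences are already
    aligned and have equal length; real tools handle gaps and alignment.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and the same length")
    pairs = zip(reference.upper(), sample.upper())
    return [(i, r, s) for i, (r, s) in enumerate(pairs, start=1) if r != s]


def percent_identity(reference: str, sample: str) -> float:
    """Percentage of positions where the two aligned sequences agree."""
    if not reference:
        raise ValueError("sequences must not be empty")
    mismatches = len(find_snps(reference, sample))
    return 100.0 * (len(reference) - mismatches) / len(reference)


if __name__ == "__main__":
    # Toy sequences made up for illustration only.
    reference = "ATGGCCTTAGC"
    sample = "ATGGCATTAGC"
    print(find_snps(reference, sample))  # → [(6, 'C', 'A')]
    print(round(percent_identity(reference, sample), 1))
```

The databases used in these projects do this kind of comparison across millions of sequences, which is where the "big" in big data comes in.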
Talking Big Data with Students at the USASEF X-STEM
On April 24, the X-STEM Extreme STEM Symposium, sponsored by Northrop Grumman Foundation and MedImmune, will take place as part of this year's USA Science & Engineering Festival (USASEF) in Washington, D.C. With dozens of speakers presenting on science, technology, engineering, and math (STEM) careers, X-STEM promises to be an "extreme" and exciting science career-themed symposium for students.
At X-STEM, Melissa Rhoads, a biotechnology strategist at Lockheed Martin, will be talking to students about the "big" challenges and "big" opportunities of big data—and related career paths.
As a preview of her X-STEM talk, Science Buddies talked with Melissa about big data in general and, more specifically, in the life sciences. Her answers highlight the ways in which students are already immersed in big data and how the sciences will be more and more reliant on big data strategies in the future.
Melissa: Public data sets of life sciences are incredible resources for students and researchers alike. I liken these data sets to the open source software (OSS) that has been available for decades. OSS 'democratized' software development, providing the opportunity for students and small companies to develop code for personal use or sale. In recent years, for example, app development for smartphones has allowed students and professionals to develop capabilities for personal use or for sale.
Tool sets like BLAST and the NCBI databases are similarly democratizing life sciences. By making genetic data available to the public, independent researchers can build on previous work and, as you've mentioned, students can conduct their own research. With so much data available online, there are a host of opportunities for software development to help understand the data and for researchers to build on, rather than repeat, previous work.
Science Buddies: Today's students are coming onto the scene at a point when data sets are "already" big. What do you think is most important for students to understand about how big data is continuing to change—and what the challenges (or opportunities) are for life sciences moving forward?
Melissa: There is already an enormous amount of data, but not all data is created equal. I think students already have a reasonable grasp of the challenges and opportunities of big data, even if they don't realize it. From the flurry of updates, posts, pokes, and shares from social media to the thousands of results generated from online research, students must cull through masses of information and identify what is most important to them.
The amount of data will only continue to grow as more people contribute more information. The challenge and opportunity is first identifying truth, and then pulling the data you need and presenting this information in a way that suits your needs. There will be so many opportunities for students and professionals in a wide range of fields to help meet these needs of search, organization and presentation of data.
Science Buddies: What kinds of questions do you think scientists can continue to explore and better understand with the development of better bioinformatics tools for managing and visualizing big data?
Melissa: Better bioinformatics will help us answer questions from how to provide better healthcare to how to preserve species and understand climate change. Taking a very big picture view, our planet is full of living creatures that are complex in and of themselves and make up an even more complex network of relationships. By learning, capturing, and presenting information on the flora and fauna of the planet, we will be better able to feed, hydrate and care for humanity while maintaining or improving the environment.
Science Buddies: In the context of understanding biology, what are the limitations of big data?
Melissa: Right now the computing power is really an issue when attempting to model biological processes. To really understand the mechanics of how our body works, there are so many parallel processes that even the fastest and most powerful computers are struggling, and as we learn more information, the computing power will need to increase exponentially.
Science Buddies: What kind of student will love big data and the world of bioinformatics?
Melissa: I believe the most exciting part of bioinformatics is that so many different fields are contributing to the field. Students interested in biology, chemistry, physics, computers, math, problem solving, or just being challenged will be able to contribute to the field. There are so many data sets that need analysis, and so much more that needs to be learned, that there are endless possibilities.
Science Career Profiles
Many STEM careers will involve working with big data, but students can learn more about a few specific careers that offer opportunities with big data in the following career profiles: