"Data lineage includes the data origin, what happens to it and where it moves over time (essentially the full journey of a piece of data)". This page explains the concept of data lineage and its utility in tracing errors back to their root cause in the data process. Data lineage is a way of debugging Big Data pipelines, but the process is not simple. Many challenges exist, such as scalability, fault tolerance, anomaly detection, and more. For each of the challenges listed, write your own definition.

Rationale

Distributed systems such as Google MapReduce, Microsoft Dryad, Apache Hadoop (an open-source project), and Google Pregel provide platforms for businesses and users to process data at massive scale. However, even with these systems, big data analytics can take several hours, days, or weeks to run, simply due to the data volumes involved. For example, a ratings prediction algorithm for the Netflix Prize challenge took nearly 20 hours to execute on 50 cores, and a large-scale image processing task to estimate geographic information took 3 days to complete using 400 cores. "The Large Synoptic Survey Telescope is expected to generate terabytes of data every night and eventually store more than 50 petabytes, while in the bioinformatics sector, the largest genome sequencing houses in the world now store petabytes of data apiece." At these scales, it is very difficult for a data scientist to trace an unknown or unanticipated result.


Big data debugging

Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information. Analysts apply machine learning and other algorithms that transform the data. Because of the enormous size of the data, it may contain unknown features, and possibly even outliers, which makes it quite difficult for a data scientist to debug an unexpected result.

The massive scale and unstructured nature of data, the complexity of these analytics pipelines, and long runtimes pose significant manageability and debugging challenges. Even a single error in these analytics can be extremely difficult to identify and remove. While one may debug by re-running the entire analytics through a debugger for step-wise debugging, this can be expensive due to the amount of time and resources needed. Auditing and data validation are other major problems, due to the growing ease of access to relevant data sources for use in experiments, the sharing of data between scientific communities, and the use of third-party data in business enterprises. These problems will only become larger and more acute as these systems and data continue to grow. As such, more cost-efficient ways of analyzing data-intensive scalable computing (DISC) are crucial to its continued effective use.
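
To make the debugging cost concrete, the sketch below shows one common mitigation: checkpointing each stage of a pipeline so that an unexpected result can be traced to a single stage's input instead of re-running the whole analytics. This is a minimal single-machine sketch in Python; the stage functions, checkpoint layout, and `run_pipeline` helper are illustrative assumptions, not part of any DISC system.

```python
import json
import os

def run_pipeline(stages, data, checkpoint_dir="checkpoints"):
    """Run each stage, saving intermediate results so a failure can be
    inspected (and the pipeline resumed) without re-running everything."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    for i, stage in enumerate(stages):
        try:
            data = stage(data)
        except Exception as err:
            # The last saved checkpoint localizes the fault to this stage.
            print(f"stage {i} ({stage.__name__}) failed: {err}")
            raise
        with open(os.path.join(checkpoint_dir, f"stage_{i}.json"), "w") as f:
            json.dump(data, f)
    return data

# Hypothetical stages for a small ratings pipeline.
def clean(records):
    return [r for r in records if r.get("rating") is not None]

def to_float(records):
    return [{**r, "rating": float(r["rating"])} for r in records]

def aggregate(records):
    return {"mean_rating": sum(r["rating"] for r in records) / len(records)}

result = run_pipeline([clean, to_float, aggregate],
                      [{"user": 1, "rating": "4"}, {"user": 2, "rating": None}])
print(result)  # {'mean_rating': 4.0}
```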


Challenges in big data debugging


Massive scale

According to an EMC/IDC study:

  • 2.8 ZB of data were created and replicated in 2012,
  • the digital universe will double every two years between now and 2020, and
  • there will be approximately 5.2 TB of data for every person in 2020.

Working with this scale of data has become very challenging.


Unstructured data

Unstructured data usually refers to information that doesn't reside in a traditional row-column database. Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages, and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly in a database. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing. "Big data can include both structured and unstructured data, but IDC estimates that 90 percent of big data is unstructured data."

The fundamental challenge of unstructured data sources is that they are difficult for non-technical business users and data analysts alike to unbox, understand, and prepare for analytic use. Beyond issues of structure is the sheer volume of this type of data. Because of this, current data mining techniques often leave out valuable information and make analyzing unstructured data laborious and expensive.
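
As a small illustration of what "preparing unstructured data for analytic use" can involve, the following Python sketch pulls a structured record out of a free-text e-mail message. The field names and regular expressions are assumptions made for this example; real extraction pipelines are far more involved.

```python
import re

# A raw e-mail message: it has some internal structure, but it doesn't
# fit a fixed row-column schema, so it counts as "unstructured" data.
raw_email = """From: alice@example.com
Subject: Q3 order update
Order #10482 shipped on 2021-09-14 to Berlin."""

# One common preparation step: extract a structured record from free text.
record = {
    "sender": re.search(r"^From: (\S+)", raw_email, re.M).group(1),
    "order_id": re.search(r"Order #(\d+)", raw_email).group(1),
    "ship_date": re.search(r"shipped on (\d{4}-\d{2}-\d{2})", raw_email).group(1),
}
print(record)
# {'sender': 'alice@example.com', 'order_id': '10482', 'ship_date': '2021-09-14'}
```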


Long runtime

In today's competitive business environment, companies have to find and analyze the relevant data they need quickly. The challenge is going through the volumes of data and accessing the level of detail needed, all at high speed. The challenge only grows as the degree of granularity increases. One possible solution is hardware: some vendors are using increased memory and parallel processing to crunch large volumes of data quickly. Another method is putting data in-memory but using a grid computing approach, in which many machines are used to solve a problem. Both approaches allow organizations to explore huge data volumes. Yet even with this level of sophisticated hardware and software, some large-scale image processing tasks take a few days to a few weeks. Debugging the data processing is extremely hard due to such long runtimes.
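
The divide-and-combine idea behind the grid approach can be sketched on a single machine with Python's multiprocessing module: the data is split into chunks, each chunk is processed by a separate worker, and the partial results are combined at the end. A real deployment would distribute the chunks across many machines; the chunk size and the per-chunk computation here are illustrative.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Stand-in for an expensive per-record computation."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Fan chunks of the data out across worker processes, then combine,
    # mirroring what a grid of machines does at much larger scale.
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    with Pool(processes=4) as pool:
        partials = pool.map(process_chunk, chunks)
    print(sum(partials))
```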

A third approach, advanced data discovery solutions, combines self-service data preparation with visual data discovery, enabling analysts to simultaneously prepare and visualize data side by side in an interactive analysis environment, as offered by newer companies such as Trifacta, Alteryx, and others.

Another way to track data lineage is through spreadsheet programs such as Excel, which offer users cell-level lineage (the ability to see which cells depend on which others), although the structure of the transformation is lost. Similarly, ETL or mapping software provides transform-level lineage, yet this view typically doesn't display the data and is too coarse-grained to distinguish between transforms that are logically independent (e.g., transforms that operate on distinct columns) and those that are dependent.
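
To see what cell-level lineage means in miniature, here is a toy Python sketch in which every derived "cell" records the cells it was computed from, so its value can be traced back to raw inputs, much like a spreadsheet's dependency view. The `Cell` class is an illustrative assumption, not a real library.

```python
class Cell:
    """A toy spreadsheet cell that remembers its upstream dependencies."""
    def __init__(self, name, value, parents=()):
        self.name = name
        self.value = value
        self.parents = parents  # cells this one was derived from

    def lineage(self):
        """Walk the dependency graph back to the original input cells."""
        found = []
        for parent in self.parents:
            found.extend(parent.lineage())
        return found or [self.name]

a = Cell("A1", 10)
b = Cell("B1", 32)
c = Cell("C1", a.value + b.value, parents=(a, b))  # C1 = A1 + B1
d = Cell("D1", c.value * 2, parents=(c,))          # D1 = C1 * 2

print(d.value)      # 84
print(d.lineage())  # ['A1', 'B1'] -- the raw inputs behind D1
```

Note that, as the text observes, this view records which cells feed which, but not what each transformation did; that structural information is exactly what richer lineage systems try to keep.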


Complex platform

Big data platforms have a very complicated structure, with data distributed among several machines. Typically, jobs are mapped onto several machines, and results are later combined by reduce operations. Debugging a big data pipeline becomes very challenging because of the very nature of the system: it is no easy task for a data scientist to figure out which machine's data contains the outliers and unknown features that cause a particular algorithm to give unexpected results.
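
The map-and-reduce structure described above can be shown in a minimal word-count sketch. Here everything runs in one process for illustration; on a real platform the map tasks run in parallel on many machines, a shuffle groups intermediate pairs by key, and reduce tasks combine each group.

```python
from collections import defaultdict

def map_task(line):
    # Emit (key, value) pairs; on a cluster, many of these run in parallel.
    return [(word, 1) for word in line.split()]

def reduce_task(key, values):
    # Combine all values observed for one key.
    return key, sum(values)

lines = ["big data big pipelines", "data lineage for big data"]

# Map phase.
pairs = [pair for line in lines for pair in map_task(line)]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce phase.
print(dict(reduce_task(k, v) for k, v in groups.items()))
# {'big': 3, 'data': 3, 'pipelines': 1, 'lineage': 1, 'for': 1}
```

An unexpected count in the final output could originate in any map task on any machine, which is precisely why debugging such pipelines is hard without lineage information.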


Proposed solution

Data provenance, or data lineage, can be used to make debugging a big data pipeline easier. This necessitates collecting data about data transformations. The section below explains data provenance in more detail.
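
As a first taste of what "collecting data about data transformations" might look like, the sketch below wraps each transformation so that its name, input size, and output size are appended to a provenance log that can be searched when debugging. The `traced` decorator and log format are illustrative assumptions; real provenance systems capture far richer records.

```python
provenance_log = []

def traced(fn):
    """Wrap a transform so each run appends a provenance record."""
    def wrapper(data):
        result = fn(data)
        provenance_log.append({
            "transform": fn.__name__,
            "input_size": len(data),
            "output_size": len(result),
        })
        return result
    return wrapper

@traced
def drop_negatives(values):
    return [v for v in values if v >= 0]

@traced
def double(values):
    return [v * 2 for v in values]

print(double(drop_negatives([3, -1, 4, -5])))  # [6, 8]
for entry in provenance_log:
    print(entry)  # one record per transformation, oldest first
```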