loader image
Skip to main content
If you continue browsing this website, you agree to our policies:
x
Completion requirements
This notes the importance of correcting raw/unstructured data to create clean/structured data that can be used for research. Data is considered big data when traditional tools and techniques, including capture, storage, visualization analysis, and transfer, cannot adequately handle it. This article provides a roundup of definitions with industry-specific examples of how big data is utilized.

Science

The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% of these streams, there are 100 collisions of interest per second.

  • As a result, only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.
  • If all sensor data were recorded in LHC, the data flow would be extremely hard to work with. The data flow would exceed 150 million petabytes annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×1020) bytes per day, almost 200 times more than all the other sources combined in the world.

The Square Kilometre Array is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day. It is considered one of the most ambitious scientific projects ever undertaken.