How do business intelligence teams get the information they need to support management teams' decisions? The human brain cannot process even a fraction of the vast amount of information available to it. Technology has evolved to let us access, organize, and filter massive datasets. What exactly are data and text mining? The academic definition is "a multi-disciplinary field based on information retrieval, data mining, machine learning, statistics, and computational linguistics". Essentially, data mining is the process of analyzing a large data set to identify relevant patterns. Text mining is the process of analyzing unstructured text data and mapping it into a structured format to derive relevant insights. This unit looks at some common uses and techniques for data and text mining.
Completing this unit should take you approximately 12 hours.
In the most basic terms, big data refers to larger, more complex data sets, especially from new data sources. The data sets are so large that "traditional" processing software cannot manage them. These data sets are valuable because they can be used to address problems that previously could not be tackled.
This review of current literature explores text mining techniques and industry-specific applications. Selecting and using the right techniques and tools for the domain makes the text-mining process easier and more efficient. As you read this article, note that text mining includes applying specific sequences and patterns to extract useful information, removing irrelevant details so the results can support predictive analysis. Major issues that may arise during the text mining process include domain knowledge integration, varying concepts of granularity, multilingual text refinement, and natural language processing ambiguity. Figure 3 shows the inter-relationships among text mining techniques and their core functionalities. Using this as a blueprint, apply one example from your industry to each part of the Venn diagram.
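To make the idea of extracting structure from raw text concrete, here is a minimal sketch of a common first step in text mining: tokenizing a document, removing irrelevant (stop) words, and counting term frequencies. The sample text and the tiny stop-word list are purely illustrative; a real project would typically use a dedicated NLP library and a much larger stop-word set.

```python
import re
from collections import Counter

# Illustrative unstructured input; in practice this would come from documents,
# emails, reviews, or other raw text sources.
raw_text = """Customer reported the delivery was late. The late delivery
was caused by a warehouse error, and the customer requested a refund."""

# A tiny, illustrative stop-word list of words that carry little meaning.
stop_words = {"the", "was", "a", "and", "by", "of"}

# Tokenize: lowercase the text and keep only alphabetic word tokens.
tokens = re.findall(r"[a-z]+", raw_text.lower())

# Remove irrelevant (stop) words, keeping the terms that carry signal.
filtered = [t for t in tokens if t not in stop_words]

# Map the unstructured text into a simple structured form: term frequencies.
term_frequencies = Counter(filtered)

print(term_frequencies.most_common(5))
# e.g. [('customer', 2), ('delivery', 2), ('late', 2), ...]
```

Even this toy pipeline illustrates the core move described in the article: unstructured text goes in, a structured representation (here, a frequency table) comes out, and that structure is what downstream analysis works on.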
According to the conclusion of this article, "data mining is a young discipline with wide and diverse applications, there is still a significant gap between general principles of data mining and domain specific, effective data mining tools for particular applications". Are there some areas where you have seen improvements? Are there others where there could be more?
Data mining will occupy an increasingly important position as the world moves from solving issues related to collecting data to generating information from the large masses of data that are now easily gathered. This paper emphasizes that many industries depend on insights gathered from data, and thus, naturally, data mining will become a central focus. We are now moving into an era where pattern recognition and prediction are common. What patterns do you recognize? Are you able to glean some insights into how you are learning?
Data has intrinsic value, but nothing can be gleaned from it until the volume of data arriving at high velocity from a variety of sources has been preprocessed. Once that value is ascertained, the data must also hold veracity.
This simple video tells the story of the growth of big data from the 1960s to today's cloud architecture. What might come after the cloud? This is not easy to imagine; when we are just beginning to adopt a new technology, it can feel like the end of change. However, data scientists are already looking for something to manage big data that is even more interconnected and accessible than the cloud, even before most of humanity has adopted it. This is why analysis, its inputs, and its technical tools are a constantly moving target. How exciting to be part of a field that is so dynamic! How scary that, one day, machine learning could even take our jobs!
Watch this short video for a succinct, simple explanation of big data. Does this mesh with your understanding?
Big data is defined by the techniques and tools needed to process the dataset - multiple physical/virtual machines working together to process all the data within a reasonable time. This article highlights tools for analyzing big data, including Apache Hadoop, Apache Spark, and TensorFlow. Keep this list of tools handy, as it will prove beneficial in the future as we explore data architecture.
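To give a flavor of what these tools look like in practice, below is a minimal word-count sketch using Apache Spark's Python API (PySpark). It is only a local, illustrative example: it assumes PySpark is installed, and the input file name documents.txt is hypothetical; on a real cluster, the session and resources would be configured by the cluster manager.

```python
from pyspark.sql import SparkSession

# Start a local Spark session for experimentation.
spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()

# Read the (hypothetical) text file as an RDD of lines.
lines = spark.sparkContext.textFile("documents.txt")

# Classic word count: split lines into words, pair each word with 1,
# then sum the counts per word across all partitions.
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# Bring a small sample of the results back to the driver and print it.
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```

The point of the example is not the word count itself but the programming model: the same few lines run unchanged whether the data fits on a laptop or is spread across many machines, which is exactly what makes such tools suited to big data.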
This article describes text mining as the "retrieval, extraction and analysis of unstructured information in digital text". It allows scientists and other researchers to gather massive amounts of written material and add automation for analyzing it efficiently. This will revolutionize literature review capabilities and allow people in specialized fields to quickly understand the current state of knowledge. How might data and text mining differ regarding generation, storage, standardization, and exploitation?
Review these key points on the internal and external data available when constructing a business report to better understand the concepts covered in this piece. These data can also be split into qualitative (descriptive, non-numeric, usually requiring context) and quantitative (numeric, measurable); both are useful. Primary research is research you have conducted yourself; secondary research builds on other people's primary research. Be sure to do the practice questions to solidify your understanding.
When evaluating sources, we look at quality, accuracy, relevance, bias, reputation, currency, and credibility factors in a specific work. This article breaks down the questions to ask yourself when evaluating a source – who, what, where, when, and why (sometimes we also need to add "how") – and summarizes these as the 5Ws. What are your 5Ws?
Traditional techniques can no longer handle complex optimization problems, especially as datasets grow larger and more disparate. The research world is moving toward reducing the computational resources required in various ways, including through artificial intelligence (AI), which essentially teaches machines (computers) to "think" like humans. While the efficiency benefits of AI are obvious, numerous ethical and soundness issues will be debated as new technologies are created, tested, and deployed for various purposes in industry and even in consumer products. Do you want your freezer to decide for you how much and what kind of ice to make, for instance? Maybe you do. Others may find this intrusive and "creepy".
This page explains the concept of data lineage and its utility in tracing errors back to their root cause in the data process. Data lineage is a way of debugging Big Data pipelines, but the process is not simple. Many challenges exist, such as scalability, fault tolerance, anomaly detection, and more. For each of the challenges listed, write your own definition.
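As a concrete (and deliberately simplified) illustration, the sketch below records lineage metadata as a toy pipeline runs, so a suspicious output can be traced back to the step and inputs that produced it. The record structure, step names, and data are all invented for illustration; production systems rely on dedicated lineage tooling rather than hand-rolled logs like this.

```python
from datetime import datetime, timezone

lineage_log = []  # each entry records where a step's output came from


def run_step(name, func, inputs):
    """Run one pipeline step and record its lineage (inputs -> output)."""
    output = func(inputs)
    lineage_log.append({
        "step": name,
        "inputs": inputs,
        "output": output,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return output


# A toy two-step pipeline: clean raw readings, then average them.
raw = [10.0, 12.5, None, 11.0]
cleaned = run_step("drop_missing", lambda xs: [x for x in xs if x is not None], raw)
average = run_step("mean", lambda xs: sum(xs) / len(xs), cleaned)

# If 'average' looks wrong, walk the log backwards to find the step
# and the exact inputs that produced it.
for entry in reversed(lineage_log):
    print(entry["step"], "<-", entry["inputs"], "->", entry["output"])
```

Scaling this idea to distributed pipelines is where the challenges named on the page (scalability, fault tolerance, anomaly detection) come in: the bookkeeping itself becomes big data.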
There are some common issues when dealing with big data. Two critical ones are data quality and data variety (such as multiple formats within the same dataset); deep learning techniques, such as dimension reduction, can be used to address these problems. Traditional data models and machine learning methods struggle with such data, which further supports the case for deep learning, since they cannot handle complex data within the framework of big data.
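The article frames dimension reduction as a deep learning technique (for example, autoencoders); as a simpler stand-in to show what reducing dimensions looks like, the sketch below uses classical PCA from scikit-learn to compress a wide feature matrix into a handful of components. It assumes NumPy and scikit-learn are available, and the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic "wide" data set: 500 records described by 50 noisy features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 50))

# Reduce the 50 original features to 5 principal components, a compact
# representation that simplifies downstream modeling and eases issues
# of data variety and quality.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (500, 5)
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```

A deep learning approach would replace PCA with a trained autoencoder, but the goal is the same: map many messy input features onto a smaller set of informative ones.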
Using the 7Vs characteristics of big data, assign an issue you recognize in your industry to each one and think of possible solutions. Additionally, write down the positives and consider whether some of these key points could be applied elsewhere in your industry in a different manner.

The visuals in this article highlight the importance of the 'Big Picture of Statistics' and summarize the general steps of a statistical study: coming up with the research question, determining what to measure and collecting data, conducting exploratory analysis on the data, and using inference to draw a conclusion about the population in question. Using the example showcased on this page, pick a topic or issue important to your industry and make a visual representation of these concepts.
We recommend reviewing this Study Guide before taking the Unit 3 Assessment.
Take this assessment to see how well you understood this unit.