
Topic outline

  • Unit 3: Data Mining and Text Mining

    How do business intelligence teams get the information they need to support management teams' decisions? The human brain cannot process even a fraction of the vast amount of information available to it, so technology has evolved to let us access, organize, and filter massive datasets. What exactly are data and text mining? One academic definition describes the field as "a multi-disciplinary field based on information retrieval, data mining, machine learning, statistics, and computational linguistics". Essentially, data mining is the process of analyzing a large dataset to identify relevant patterns, while text mining analyzes unstructured text data and maps it into a structured format to derive relevant insights. This unit looks at some common uses and techniques for data and text mining.

    Completing this unit should take you approximately 12 hours.

    • Upon successful completion of this unit, you will be able to:

      • choose appropriate datasets to meet the requirement;
      • describe the four stages of the data mining process: data generation, data acquisition, data storage, and data analytics;
      • standardize and exploit text and develop a taxonomy;
      • evaluate data quality based on source reliability, accuracy, timeliness, and application to the requirement; and
      • identify methods for optimization, filtering, or "cleaning" data for standardization and effective comparison.

    • 3.1: Understanding Big Data

      In the most basic terms, big data refers to datasets so large and complex, especially those drawn from new data sources, that "traditional" processing software cannot manage them. These datasets are valuable because they let you address problems that were previously intractable.

      • Combining text mining techniques with bibliometric analysis can help uncover hidden information in scientific publications and reveal unseen patterns and trends in research fields. Text mining may help researchers gain a more comprehensive understanding of the knowledge hidden in a large body of scientific literature, clustering can provide a more detailed structural overview of a field, and social network analysis (SNA) explores core themes and allows researchers to better understand how a field has developed. How do you think SNA enables companies to understand your purchasing decisions? What are some text mining techniques companies might use to find connections among customer demographic characteristics? Using one of the free tools listed here, map your own interactions with friends and the mutual brands advertised to you (the short sketch after these readings shows the same idea programmatically). What similarities do you see?
      • This review of current literature explores text mining techniques and industry-specific applications. Selecting and using the right techniques and tools for the domain makes the text-mining process easier and more efficient. As you read this article, understand that this includes applying specific sequences and patterns to extract useful information, removing irrelevant details for predictive analysis. Major issues that may arise during the text mining process include domain knowledge integration, varying concepts of granularity, multilingual text refinement, and natural language processing ambiguity. Figure 3 shows the inter-relationships among text mining techniques and their core functionalities. Using this as a blueprint, apply one example from your industry to each part of the Venn diagram.

      • According to the conclusion of this article, "data mining is a young discipline with wide and diverse applications, there is still a significant gap between general principles of data mining and domain specific, effective data mining tools for particular applications". Are there some areas where you have seen improvements? Are there others where there could be more?

      • Data mining will occupy an increasingly important position as the world moves from solving issues related to collecting data to generating information from large masses of data that are now easily gathered. This paper emphasizes that many industries depend on insights gathered from data, and thus naturally, data mining will become a central focus. We are now moving into an era where pattern recognition and prediction are common. What patterns do you recognize? Are you able to glean some insights into how you are learning?
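
      Here is the short sketch mentioned above: a minimal social network analysis example using the networkx library. The people, brands, and connections are invented for illustration; they are not drawn from any of the readings.

      ```python
      # Minimal social network analysis (SNA) sketch with networkx.
      # The people, brands, and connections below are invented for illustration.
      import networkx as nx

      G = nx.Graph()

      # Edges between friends (who interacts with whom).
      G.add_edges_from([("ana", "ben"), ("ana", "chen"), ("ben", "chen"), ("chen", "dee")])

      # Edges between people and the brands advertised to them.
      G.add_edges_from([("ana", "BrandX"), ("ben", "BrandX"), ("chen", "BrandY"), ("dee", "BrandY")])

      # Degree centrality highlights the most connected nodes -- a rough proxy
      # for who (or which brand) sits at the core of the network.
      for node, score in sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1]):
          print(f"{node}: {score:.2f}")

      # Shared neighbors hint at why two friends are shown the same ads.
      print(list(nx.common_neighbors(G, "ana", "ben")))
      ```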

      • 3.1.1: What is Big Data?

        Data has intrinsic value, but nothing can be gleaned from the volume of data arriving at high velocity from a variety of sources until it is preprocessed. Once that value is ascertained, the data must also hold veracity.

        • This simple video tells the story of the growth of big data from the 1960s to today's cloud architecture. What might come after the cloud? This isn't easy to imagine; whenever we are just beginning to adopt a new technology, it can feel like the end of change. Yet data scientists are already looking for something to manage big data that is even more interconnected and accessible than the cloud, even before most of humanity has adopted it. This is why analysis, its inputs, and its technical tools are a constantly moving target. How exciting to be part of a field that is so dynamic! How scary that, one day, machine learning could even take our jobs!

        • Watch this short video for a succinct, simple explanation of big data. Does this mesh with your understanding?

        • Big data is defined by the techniques and tools needed to process the dataset: multiple physical or virtual machines working together to process all the data within a reasonable time. This article highlights tools for analyzing big data, including Apache Hadoop, Apache Spark, and TensorFlow. Keep this list of tools handy, as it will prove beneficial later when we explore data architecture.
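
        To make the "multiple machines working together" idea concrete, below is a minimal Apache Spark (PySpark) sketch. The file name is a placeholder; the point is that the same few lines run unchanged whether Spark sits on a laptop or is distributed across a cluster.

        ```python
        # Minimal PySpark sketch: count words in a (potentially huge) text file.
        # "server_logs.txt" is a placeholder path; on a real cluster Spark splits
        # the same job across many machines without any code changes.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()

        lines = spark.read.text("server_logs.txt")   # DataFrame with a single "value" column
        words = lines.rdd.flatMap(lambda row: row.value.split())
        counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

        # Print the ten most frequent words.
        for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
            print(word, count)

        spark.stop()
        ```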

      • 3.1.2: Where Does Big Data Live?

        Consider how much data is produced and how it is used as you prepare to understand how much "space" is needed to store it and make it available for processing.
        • In 2019, the World Economic Forum published this infographic detailing how much data is generated daily. Given the COVID-19 pandemic, this number is likely much larger than projected. The first article offers insights into the pre-pandemic healthcare industry.
        • Collective big data analysis of electronic health records, medical records, and other medical data is continuously helping build a better prognosis framework in "traditional" medicine outside of the COVID-19 pandemic (which is itself a case study of another kind and is pushing the amount of data to a whole new level). The challenges of big data analysis in healthcare range from federal law concerning how private data is stored to practical concerns such as how to computationally manage and leverage it; even so, the magnitude of data being collected and stored remains the same. This paper asserts that new techniques and strategies should be created to better understand the nature (un-/semi-/structured data), complexity (dimensions and attributes), and volume of data to derive meaningful information. Given the COVID-19 pandemic, do a little brainstorming and write down some ideas where you think improvements could be made. How should private data be extracted, and security and privacy maintained, while retaining relevant information for research? You may be well served to occasionally review your ideas and compare them with current research for patterns you may identify.
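
        As one small illustration of the privacy question above, the toy sketch below pseudonymizes a hypothetical patient identifier with a salted hash so records can still be linked for research without exposing the raw ID. This is only an illustration of the idea, not a compliant de-identification procedure.

        ```python
        # Toy pseudonymization sketch: replace a direct identifier with a salted hash
        # so records can still be linked across datasets without exposing the raw ID.
        # Illustrative only; NOT a complete or compliant de-identification procedure
        # (dates, zip codes, and other quasi-identifiers also need attention).
        import hashlib
        import hmac

        SECRET_SALT = b"keep-this-out-of-the-dataset"   # hypothetical secret, stored separately

        def pseudonymize(patient_id: str) -> str:
            """Return a stable, irreversible token for the given identifier."""
            return hmac.new(SECRET_SALT, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()

        record = {"patient_id": "MRN-00042", "age": 57, "diagnosis_code": "E11.9"}
        record["patient_id"] = pseudonymize(record["patient_id"])
        print(record)
        ```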
    • 3.2: Data and Text Mining

      Text mining analyzes text data in an unstructured format and maps it into a structured format to derive relevant insights. Data mining relies mostly on statistical techniques and algorithms; text mining also depends on statistical analysis but adds linguistic analysis techniques.
      • This article provides a nice overview of data mining, its foundations, how it works, and a basic presentation of data mining architecture.
      • This article describes text mining as the "retrieval, extraction and analysis of unstructured information in digital text". It allows scientists and other researchers to gather massive amounts of written material and add automation for analyzing it efficiently. This will revolutionize literature review capabilities and allow people in specialized fields to quickly understand the current state of knowledge. How might data and text mining differ regarding generation, storage, standardization, and exploitation?

      • 3.2.1: Data Mining Techniques

        Data mining is a process that is automated in various ways to allow analysts to exploit large datasets. The data's initial comparability and "cleanliness" will determine how complex the process needs to be. The process will vary with the type, level of existing structure, size, and complexity of your datasets.
        • Watch this video for an in-depth exploration of many data mining applications. The video emphasizes the importance of context around your ML scores: "decision is more than prediction".
        • Publishers, who legally own the information in publications, can define how individuals consume knowledge. In the context of data mining, this means possibly charging additional fees to mine data or restricting the number of pages that can be targeted with algorithms per day. What do you think of this circular transit of funding? Most authors and their universities would prefer to give everyone access to their data.
        • This article notes the importance of correcting raw, unstructured data to create clean, structured data that can be used for research. Data is considered big data when traditional tools and techniques for capture, storage, visualization, analysis, and transfer cannot adequately handle it. The article provides a roundup of definitions with industry-specific examples of how big data is utilized.
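
        For a concrete taste of turning raw data into the clean, structured data described above, here is a minimal pandas sketch. The columns and cleaning rules are invented for illustration.

        ```python
        # Minimal data-cleaning sketch with pandas; columns and rules are invented.
        import pandas as pd

        raw = pd.DataFrame({
            "customer_id": [101, 101, 102, 103, None],
            "signup_date": ["2023-01-05", "2023-01-05", "2023-02-05", "2023-03-11", "2023-04-01"],
            "spend": ["19.99", "19.99", "42", None, "7.50"],
        })

        clean = (
            raw.drop_duplicates()                      # remove exact duplicate rows
               .dropna(subset=["customer_id"])         # drop rows missing the key field
               .assign(
                   signup_date=lambda df: pd.to_datetime(df["signup_date"]),   # text -> dates
                   spend=lambda df: pd.to_numeric(df["spend"]).fillna(0.0),    # text -> numbers
               )
        )
        print(clean)
        ```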
      • 3.2.2: Text Mining and the Complications of Language

        • This video gives a concise overview of text mining and its use in turning unstructured data into structured data that can be easily analyzed. It breaks the process into digestible steps (retrieval, processing, extraction, and analysis); a small sketch after these videos walks through the same four steps in code. Can you think of a topic in your industry? How would you assign relevant information to each step?
        • This video provides great detail about the precise techniques used in text mining.
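
        Here is the small sketch promised above: a toy text-mining pipeline in plain Python that walks through retrieval, processing, extraction, and analysis. The documents and stopword list are made up; a real pipeline would use proper NLP libraries, but the shape of the process is the same.

        ```python
        # Toy text-mining pipeline: retrieval -> processing -> extraction -> analysis.
        # The documents and stopword list are made up for illustration.
        import re
        from collections import Counter

        STOPWORDS = {"the", "and", "was", "but", "too", "far", "still"}

        def retrieve():
            """Retrieval: gather raw, unstructured documents (hard-coded here)."""
            return [
                "The delivery was fast and the packaging was great.",
                "Great product, but the delivery took far too long!",
                "Packaging damaged; product still works great.",
            ]

        def process(doc):
            """Processing: lowercase, strip punctuation, tokenize, drop stopwords."""
            return [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOPWORDS]

        def extract(token_lists):
            """Extraction: map unstructured text into a structured term-frequency table."""
            return Counter(token for tokens in token_lists for token in tokens)

        def analyze(term_counts, top_n=5):
            """Analysis: surface the most frequent terms as a starting insight."""
            return term_counts.most_common(top_n)

        counts = extract(process(doc) for doc in retrieve())
        print(analyze(counts))
        ```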
    • 3.3: Evaluating Source Data

      Using credible sources lends credibility to your research. High-quality resources are more likely to translate into better results; conversely, poor quality is likely to adversely affect your results. It is always best to remember these universally accepted criteria when sourcing: accuracy, authority, objectivity, currency, and coverage. Using poor-quality data that yields not-so-valuable findings is commonly called "garbage in, garbage out".
      • This piece defines data as "units of information observed, collected, or created in the course of research". There are two main ways to obtain data: using data that has already been collated or collecting it yourself. This source touches on the first way and on how to use such data properly (by ensuring its relevance to your research and using visualizations and accurate citations). Data is not copyrightable, but the expression of data is – hence the importance of appropriate citations. The reinterpretation video provides memorable examples; watch it and think about how a similar method could be applied to areas of your industry.
      • Review these key points on the internal and external data available when constructing a business report to better understand the concepts covered in this piece. These data can also be split into qualitative (descriptive, non-numeric, usually requiring context) and quantitative (numeric, measurable); both are useful. Primary research is research you conduct yourself; secondary research builds on other people's primary research. Be sure to do the practice questions to solidify your understanding.

      • When evaluating sources, we look at quality, accuracy, relevance, bias, reputation, currency, and credibility factors in a specific work. This article breaks down the questions to ask yourself when evaluating a source – who, what, where, when, and why (sometimes we also need to add "how") – and then summarizes these as the 5Ws. What are your 5Ws?

      • 3.3.1: Identifying Data Sources

        Knowing where your data originated is vital. What data was entered, where, and how it was converted into a machine-readable format are among the most hotly discussed aspects of error tracing. The comprehensive article below provides a deep dive into the issues, such as providing an audit trail. Tracking source origination ensures you know which topics your records relate to.
        • "Data lineage includes the data origin, what happens to it and where it moves over time (essentially the full journey of a piece of data)". This page explains the concept of data lineage and its utility in tracing errors back to their root cause in the data process. Data lineage is a way of debugging Big Data pipelines, but the process is not simple. Many challenges exist, such as scalability, fault tolerance, anomaly detection, and more. For each of the challenges listed, write your own definition.
      • 3.3.2: Source Evaluation Trust Matrix

        These examples describe various types of trust models. To standardize how sources are validated and evaluated in your organization, you should rely on an existing trust model that is already in widespread use, or develop one if none exists, so that every team member dealing with data knows whether to trust it. These articles describe two types of trust evaluation models for specific processes. Yours may be similar or much different, depending on your field and the source requirements of your discipline and your organization.
        • An effective dynamic trust evaluation model (DTEM) for wireless sensor networks can complement traditional security mechanisms in addressing security issues. In detection rate, the DTEM outperformed both RSFN and BTMS (two existing trust models). Traditional security mechanisms (cryptography, authentication, etc.) are widely used to deal with external attacks, and a "trust model is a useful complement to the traditional security mechanism, which can solve insider or node misbehavior attacks". In this research paper, the authors highlight the relationship between four modules. Use Figure 2 to better understand the progression of the implementation process. In your own words, write or redraw your own understanding of the trust process outlined in Figure 1.
        • Another trust model, based on D-S evidence theory and sliding windows, can bolster system security in cloud computing by enhancing the detection of malicious entities and improving how entities' credibility is evaluated in general.
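
        The models in these papers are considerably more involved, but their core intuition – update a node's trust from its recent behavior over a sliding window – can be sketched in a few lines. The sketch below uses a simple beta-reputation-style score as a stand-in; it is not the authors' exact DTEM or D-S formulation.

        ```python
        # Toy sliding-window trust score: a node's trust is estimated from its recent
        # cooperative (1) vs. misbehaving (0) interactions. This is a simple
        # beta-reputation-style illustration, not the exact DTEM or D-S model above.
        from collections import deque

        class SlidingWindowTrust:
            def __init__(self, window_size=10):
                self.observations = deque(maxlen=window_size)  # only recent interactions count

            def record(self, cooperated: bool):
                self.observations.append(1 if cooperated else 0)

            def trust(self) -> float:
                """Expected value of a Beta(successes + 1, failures + 1) distribution."""
                successes = sum(self.observations)
                failures = len(self.observations) - successes
                return (successes + 1) / (successes + failures + 2)

        node = SlidingWindowTrust(window_size=5)
        for outcome in [True, True, False, True, False, False]:   # oldest outcome slides out
            node.record(outcome)
        print(f"trust score: {node.trust():.2f}")   # a low score might flag a suspicious node
        ```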
    • 3.4: Data Optimization

      Traditional techniques can no longer handle complex optimization problems, especially as datasets get larger and more disparate. The research world is moving toward understanding how to reduce computational resources in various ways, including through artificial intelligence (AI), which essentially teaches machines (computers) to "think" like humans. While the efficiency benefits of AI are obvious, numerous ethical and soundness issues will be debated as new technologies are created, tested, and deployed for various purposes in industry and even in consumer products. Do you want your freezer to decide for you how much and what kind of ice to make, for instance? Maybe you do. Others may find this intrusive and "creepy".

      • This summary review showcases nine research articles on varying topics and provides key takeaways from each. These are recent, highly relevant publications, so you would be well served to keep abreast of such advances as your career progresses. Pay particular attention to the second article synopsis, entitled "Leveraging Image Visual Features in Content-Based Recommender System". This recommendation model, which combines user-item rating data with hybrid item features based on image visual features, can be particularly useful in sparse data scenarios, where it has achieved better results than other conventional approaches. Also of note is the seventh article, "Classification algorithms based on Polyhedral Conic Functions Analysis", which provides promising results compared with traditional supervised algorithms where the goal was to classify literature into predefined classes. Write down your interpretation of what these summaries mean for your understanding of data optimization across various types of datasets and industries.
      • 3.4.1: Preparing Data

        Now that you have all that data, how do you make it useful? It must be cleaned and enriched to provide relevant insights. These articles highlight how that can be achieved through confirmatory and exploratory approaches.
        • This paper notes that previous work exploring how data mining methods can be used to analyze process data in the log files of technology-enhanced assessments is limited in that it only examines the efficacy of one data mining technique under one specific scenario. The paper also demonstrates four often-used supervised learning techniques and two unsupervised methods fitted to a single assessment dataset and discusses the pros and cons of each. For example, the authors note that regression trees may deal with noise well but are easily influenced by small changes. Can you differentiate between a confirmatory approach and an exploratory approach?
        • This article provides a concise summary of the data preparation process (gather → discover → cleanse → transform → enrich → store). It touches upon the tasks involved in prepping data (aggregation, formatting, normalization, and labeling), the concept of data quality (as a measure of the success of the preparation pipeline), and challenges you may run into when preparing data (diversity of data sources, time required, lack of confidence in quality); a short sketch after this list walks through the pipeline step by step. What are the most common languages/libraries used for data preparation? What can be used for fast in-memory data processing in a distributed architecture?
        • Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. The data mining step aims to extract information from a dataset and transform it into an understandable structure for further use; data mining is just one step (the core step) of the knowledge discovery process. The steps that follow include pattern evaluation (interpreting the mined patterns and relationships), which is akin to your analytic process, and knowledge consolidation, which is similar to reporting your findings (although your reporting should be more robust than simply consolidating knowledge if it is to respond responsibly to your requirements). Like analysis, KDD is an iterative process: if the patterns evaluated after the data mining step are not useful, the process can begin again from an earlier step.
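
        As the small companion sketch promised above, here is the gather → discover → cleanse → transform → enrich → store pipeline expressed in pandas. All columns and the region lookup table are invented for illustration.

        ```python
        # Minimal sketch of the preparation pipeline: gather -> discover -> cleanse ->
        # transform -> enrich -> store. All columns and the lookup table are invented.
        import pandas as pd

        # Gather: pull raw records (hard-coded here instead of reading from a source).
        raw = pd.DataFrame({
            "store": ["north", "north", "SOUTH", None],
            "units": [12, 12, 7, 3],
            "unit_price": [2.5, 2.5, 4.0, 1.0],
        })

        # Discover: inspect shape, types, and missing values before touching anything.
        raw.info()

        # Cleanse: drop duplicates and rows missing a key field.
        prepared = raw.drop_duplicates().dropna(subset=["store"])

        # Transform: standardize formats (normalization) and derive a revenue column.
        prepared = prepared.assign(
            store=prepared["store"].str.lower(),
            revenue=prepared["units"] * prepared["unit_price"],
        )

        # Enrich: join in context from another (hypothetical) source, e.g. store regions.
        regions = pd.DataFrame({"store": ["north", "south"], "region": ["R1", "R2"]})
        prepared = prepared.merge(regions, on="store", how="left")

        # Store: persist the prepared data for analysis.
        prepared.to_csv("prepared_sales.csv", index=False)
        print(prepared)
        ```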
      • 3.4.2: Standardization

        • There are some common issues when dealing with big data. Two critical ones are data quality and data variety (such as multiple formats within the same dataset) – deep learning techniques, such as learned dimension reduction, can be used to address these problems. Traditional data models and machine learning methods struggle with these issues, further supporting the case for deep learning, since the former cannot handle such complex data within a big data framework. A brief dimension-reduction sketch follows this list.

          Using the 7Vs characteristics of big data, assign an issue you recognize in your industry to each, and think of possible solutions. Additionally, write down the positives and consider whether some of these key points could be applied elsewhere in your industry in a different manner.

        • This article provides a quick explanation of the two types of statistical studies: observational studies (which observe individuals and measure variables of interest) and experiments (which intentionally manipulate one variable to see its effect on another). Write your own interpretive definitions of observational studies and experiments.
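
        Here is the dimension-reduction sketch mentioned above. The reading frames the technique in deep learning terms; classical PCA from scikit-learn is used here only because it shows the same idea – compressing many correlated columns into a few informative ones – in a handful of lines, on randomly generated data.

        ```python
        # Dimension-reduction sketch: compress 50 noisy, correlated features into 3
        # components with PCA. Deep-learning approaches (e.g. autoencoders) pursue the
        # same goal; PCA is used here only because it fits in a few lines.
        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(seed=0)
        latent = rng.normal(size=(200, 3))                        # 3 "true" underlying factors
        mixing = rng.normal(size=(3, 50))
        X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))    # 50 observed, noisy columns

        pca = PCA(n_components=3)
        X_reduced = pca.fit_transform(X)

        print(X.shape, "->", X_reduced.shape)                     # (200, 50) -> (200, 3)
        print("variance explained:", pca.explained_variance_ratio_.round(3))
        ```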
      • 3.4.3: Combining Data from Different Sources

        Your data must be rigorous and contain a highly representative sample to achieve the most relevant, reliable, and reflective insights. Collecting data from only one subset of a large population is pointless when you wish to market to the whole. A short sketch after the readings below shows one simple way to combine two sources.
        • Enterprises can capture value from big data to gain immediate social/monetary value or strategic competitive advantage. Firms can capture value in various ways, such as data-driven discovery and innovation of new and existing products and services. Can you think of five examples that can be ascribed to each method?
        • The visuals in this article highlight the importance of the 'Big Picture of Statistics' and summarize the general steps of a statistical study: from coming up with the research question, determining what to measure and collecting data, to conducting exploratory analysis on this data and inference to draw up a conclusion on the population in question. Using the example showcased on this page, pick a topic or issue important to your industry and make a visual representation of the concepts.
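
        Here is the short sketch mentioned above: a minimal pandas example combining two hypothetical sources (an internal CRM export and an external survey) on a shared key. All column names and values are invented.

        ```python
        # Minimal sketch of combining two hypothetical sources on a shared key.
        # All column names and values are invented for illustration.
        import pandas as pd

        # Internal source: CRM export.
        crm = pd.DataFrame({
            "customer_id": [1, 2, 3, 4],
            "segment": ["retail", "retail", "wholesale", "retail"],
        })

        # External source: survey responses (note: not every customer responded).
        survey = pd.DataFrame({
            "customer_id": [2, 3, 5],
            "satisfaction": [4, 2, 5],
        })

        # An outer join keeps customers that appear in only one source, making the
        # gaps (and any sampling bias) visible instead of silently dropping them.
        combined = crm.merge(survey, on="customer_id", how="outer")
        print(combined)
        ```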

    • Study Guide: Unit 3

      We recommend reviewing this Study Guide before taking the Unit 3 Assessment.

    • Unit 3 Assessment

      • Take this assessment to see how well you understood this unit.

        • This assessment does not count towards your grade. It is just for practice!
        • You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
        • You can take this assessment as many times as you want, whenever you want.