
Topic outline

  • Unit 3: Data Mining and Text Mining

    How do business intelligence teams get the information they need to support management teams' decisions? The human brain cannot process even a fraction of the vast amount of information available to it, so technology has evolved to let us access, organize, and filter massive datasets. What exactly are data and text mining? One academic definition describes the field as "a multi-disciplinary field based on information retrieval, data mining, machine learning, statistics, and computational linguistics". Essentially, data mining is the process of analyzing a large dataset to identify relevant patterns, while text mining analyzes unstructured text data and maps it into a structured format to derive relevant insights. This unit looks at some common uses and techniques for data and text mining.

    Completing this unit should take you approximately 12 hours.

    • Upon successful completion of this unit, you will be able to:

      • choose appropriate datasets to meet the requirement;
      • describe the four stages of the data mining process: data generation, data acquisition, data storage, and data analytics;
      • standardize and exploit text and develop a taxonomy;
      • evaluate data quality based on source reliability, accuracy, timeliness, and application to the requirement; and
      • identify methods for optimization, filtering, or "cleaning" data for standardization and effective comparison.

    • 3.1: Understanding Big Data

      In the most basic terms, big data refers to datasets so large and complex, especially those drawn from new data sources, that "traditional" processing software cannot manage them. These datasets are valuable because they let you address problems that were previously intractable.

      • Combining text mining techniques with bibliometric analysis can help uncover hidden information in scientific publications and reveal unseen patterns and trends in research fields. Text mining may help researchers gain a more comprehensive understanding of the knowledge hidden in a large body of scientific literature, clustering can provide a more detailed structural overview of a field, and social network analysis (SNA) explores core themes and allows researchers to better understand how a field has developed. How do you think SNA enables companies to understand your purchasing decisions? What are some text mining techniques companies might use to find connections among customer demographic characteristics? Using one of the free tools listed here, map your own interactions with friends and the mutual brands advertised to you (the short sketch after these readings shows the same idea programmatically). What similarities do you see?
      • This review of current literature explores text mining techniques and industry-specific applications. Selecting and using the right techniques and tools for the domain makes the text-mining process easier and more efficient. As you read this article, understand that this includes applying specific sequences and patterns to extract useful information, removing irrelevant details for predictive analysis. Major issues that may arise during the text mining process include domain knowledge integration, varying concepts of granularity, multilingual text refinement, and natural language processing ambiguity. Figure 3 shows the inter-relationships among text mining techniques and their core functionalities. Using this as a blueprint, apply one example from your industry to each part of the Venn diagram.

      • According to the conclusion of this article, "data mining is a young discipline with wide and diverse applications, there is still a significant gap between general principles of data mining and domain specific, effective data mining tools for particular applications". Are there some areas where you have seen improvements? Are there others where there could be more?

      • Data mining will occupy an increasingly important position as the world moves from solving issues related to collecting data to generating information from large masses of data that are now easily gathered. This paper emphasizes that many industries depend on insights gathered from data, and thus naturally, data mining will become a central focus. We are now moving into an era where pattern recognition and prediction are common. What patterns do you recognize? Are you able to glean some insights into how you are learning?
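
      Here is the short sketch mentioned above: a minimal social network analysis example using the networkx library. The people, brands, and connections are invented for illustration; they are not drawn from any of the readings.

      ```python
      # Minimal social network analysis (SNA) sketch with networkx.
      # The people, brands, and connections below are invented for illustration.
      import networkx as nx

      G = nx.Graph()

      # Edges between friends (who interacts with whom).
      G.add_edges_from([("ana", "ben"), ("ana", "chen"), ("ben", "chen"), ("chen", "dee")])

      # Edges between people and the brands advertised to them.
      G.add_edges_from([("ana", "BrandX"), ("ben", "BrandX"), ("chen", "BrandY"), ("dee", "BrandY")])

      # Degree centrality highlights the most connected nodes -- a rough proxy
      # for who (or which brand) sits at the core of the network.
      for node, score in sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1]):
          print(f"{node}: {score:.2f}")

      # Shared neighbors hint at why two friends are shown the same ads.
      print(list(nx.common_neighbors(G, "ana", "ben")))
      ```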

      • 3.1.1: What is Big Data?

        Data has intrinsic value, but nothing can be gleaned from the volume of data arriving at high velocity from a variety of sources until it is preprocessed. Once that value is ascertained, the data must also hold veracity.

        • This simple video tells the story of the growth of big data from the 1960s to today's cloud architecture. What might come after the cloud? This isn't easy to imagine; whenever we are just beginning to adopt a new technology, it can feel like the end of change. Yet data scientists are already looking for something to manage big data that is even more interconnected and accessible than the cloud, even before most of humanity has adopted it. This is why analysis, its inputs, and its technical tools are a constantly moving target. How exciting to be part of a field that is so dynamic! How scary that, one day, machine learning could even take our jobs!

        • Watch this short video for a succinct, simple explanation of big data. Does this mesh with your understanding?

        • Big data is defined by the techniques and tools needed to process the dataset: multiple physical or virtual machines working together to process all the data within a reasonable time. This article highlights tools for analyzing big data, including Apache Hadoop, Apache Spark, and TensorFlow. Keep this list of tools handy, as it will prove beneficial later when we explore data architecture.
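
        To make the "multiple machines working together" idea concrete, below is a minimal Apache Spark (PySpark) sketch. The file name is a placeholder; the point is that the same few lines run unchanged whether Spark sits on a laptop or is distributed across a cluster.

        ```python
        # Minimal PySpark sketch: count words in a (potentially huge) text file.
        # "server_logs.txt" is a placeholder path; on a real cluster Spark splits
        # the same job across many machines without any code changes.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()

        lines = spark.read.text("server_logs.txt")   # DataFrame with a single "value" column
        words = lines.rdd.flatMap(lambda row: row.value.split())
        counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

        # Print the ten most frequent words.
        for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
            print(word, count)

        spark.stop()
        ```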

      • 3.1.2: Where Does Big Data Live?

        Consider how much data is produced and how it is used as you prepare to understand how much "space" is needed to store it and make it available for processing.
        • In 2019, the World Economic Forum published this infographic detailing how much data is generated daily. Given the COVID-19 pandemic, this number is likely much larger than projected. The first article offers insights into the pre-pandemic healthcare industry.
        • Collective big data analysis of electronic health records, medical records, and other medical data is continuously helping build a better prognosis framework in "traditional" medicine outside of the COVID-19 pandemic (which is itself a case study of another kind and is pushing the amount of data to a whole new level). The challenges of big data analysis in healthcare range from federal law concerning how private data is stored to practical concerns such as how to computationally manage and leverage it; even so, the magnitude of data being collected and stored remains the same. This paper asserts that new techniques and strategies should be created to better understand the nature (un-/semi-/structured data), complexity (dimensions and attributes), and volume of data to derive meaningful information. Given the COVID-19 pandemic, do a little brainstorming and write down some ideas where you think improvements could be made. How should private data be extracted, and security and privacy maintained, while retaining relevant information for research? You may be well served to occasionally review your ideas and compare them with current research for patterns you may identify.
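
        As one small illustration of the privacy question above, the toy sketch below pseudonymizes a hypothetical patient identifier with a salted hash so records can still be linked for research without exposing the raw ID. This is only an illustration of the idea, not a compliant de-identification procedure.

        ```python
        # Toy pseudonymization sketch: replace a direct identifier with a salted hash
        # so records can still be linked across datasets without exposing the raw ID.
        # Illustrative only; NOT a complete or compliant de-identification procedure
        # (dates, zip codes, and other quasi-identifiers also need attention).
        import hashlib
        import hmac

        SECRET_SALT = b"keep-this-out-of-the-dataset"   # hypothetical secret, stored separately

        def pseudonymize(patient_id: str) -> str:
            """Return a stable, irreversible token for the given identifier."""
            return hmac.new(SECRET_SALT, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()

        record = {"patient_id": "MRN-00042", "age": 57, "diagnosis_code": "E11.9"}
        record["patient_id"] = pseudonymize(record["patient_id"])
        print(record)
        ```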
    • 3.2: Data and Text Mining

      Text mining analyzes text data in an unstructured format and maps it into a structured format to derive relevant insights. Data mining relies mostly on statistical techniques and algorithms; text mining also depends on statistical analysis but adds linguistic analysis techniques.
      • This article provides a nice overview of data mining, its foundations, how it works, and a basic presentation of data mining architecture.
      • This article describes text mining as the "retrieval, extraction and analysis of unstructured information in digital text". It allows scientists and other researchers to gather massive amounts of written material and add automation for analyzing it efficiently. This will revolutionize literature review capabilities and allow people in specialized fields to quickly understand the current state of knowledge. How might data and text mining differ regarding generation, storage, standardization, and exploitation?

      • 3.2.1: Data Mining Techniques

        Data mining is a process that is automated in various ways to allow analysts to exploit large datasets. The data's initial comparability and "cleanliness" will determine how complex the process needs to be. The process will vary with the type, level of existing structure, size, and complexity of your datasets.
        • Watch this video for an in-depth exploration of many data mining applications. The video emphasizes the importance of context around your ML scores: "decision is more than prediction".
        • Publishers, who legally own the information in publications, can define how individuals consume knowledge. In the context of data mining, this means possibly charging additional fees to mine data or restricting the number of pages that can be targeted with algorithms per day. What do you think of this circular transit of funding? Most authors and their universities would prefer to give everyone access to their data.
        • This article notes the importance of correcting raw, unstructured data to create clean, structured data that can be used for research. Data is considered big data when traditional tools and techniques for capture, storage, visualization, analysis, and transfer cannot adequately handle it. The article provides a roundup of definitions with industry-specific examples of how big data is utilized.
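
        For a concrete taste of turning raw data into the clean, structured data described above, here is a minimal pandas sketch. The columns and cleaning rules are invented for illustration.

        ```python
        # Minimal data-cleaning sketch with pandas; columns and rules are invented.
        import pandas as pd

        raw = pd.DataFrame({
            "customer_id": [101, 101, 102, 103, None],
            "signup_date": ["2023-01-05", "2023-01-05", "2023-02-05", "2023-03-11", "2023-04-01"],
            "spend": ["19.99", "19.99", "42", None, "7.50"],
        })

        clean = (
            raw.drop_duplicates()                      # remove exact duplicate rows
               .dropna(subset=["customer_id"])         # drop rows missing the key field
               .assign(
                   signup_date=lambda df: pd.to_datetime(df["signup_date"]),   # text -> dates
                   spend=lambda df: pd.to_numeric(df["spend"]).fillna(0.0),    # text -> numbers
               )
        )
        print(clean)
        ```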
      • 3.2.2: Text Mining and the Complications of Language

        • This video gives a concise overview of text mining and its use in turning unstructured data into structured data that can be easily analyzed. It breaks the process into digestible steps (retrieval, processing, extraction, and analysis); a small sketch after these videos walks through the same four steps in code. Can you think of a topic in your industry? How would you assign relevant information to each step?
        • This video provides great detail about the precise techniques used in text mining.
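
        Here is the small sketch promised above: a toy text-mining pipeline in plain Python that walks through retrieval, processing, extraction, and analysis. The documents and stopword list are made up; a real pipeline would use proper NLP libraries, but the shape of the process is the same.

        ```python
        # Toy text-mining pipeline: retrieval -> processing -> extraction -> analysis.
        # The documents and stopword list are made up for illustration.
        import re
        from collections import Counter

        STOPWORDS = {"the", "and", "was", "but", "too", "far", "still"}

        def retrieve():
            """Retrieval: gather raw, unstructured documents (hard-coded here)."""
            return [
                "The delivery was fast and the packaging was great.",
                "Great product, but the delivery took far too long!",
                "Packaging damaged; product still works great.",
            ]

        def process(doc):
            """Processing: lowercase, strip punctuation, tokenize, drop stopwords."""
            return [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in STOPWORDS]

        def extract(token_lists):
            """Extraction: map unstructured text into a structured term-frequency table."""
            return Counter(token for tokens in token_lists for token in tokens)

        def analyze(term_counts, top_n=5):
            """Analysis: surface the most frequent terms as a starting insight."""
            return term_counts.most_common(top_n)

        counts = extract(process(doc) for doc in retrieve())
        print(analyze(counts))
        ```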
    • 3.3: Evaluating Source Data

      Using credible sources lends credibility to your research. High-quality resources are more likely to translate into better results; conversely, poor quality is likely to adversely affect your results. It is always best to remember these universally accepted criteria when sourcing: accuracy, authority, objectivity, currency, and coverage. Using poor-quality data that yields not-so-valuable findings is commonly called "garbage in, garbage out".
      • This piece defines data as "units of information observed, collected, or created in the course of research". There are two main ways to obtain data: using data that has already been collated or collecting it yourself. This source touches on the first way and on how to use such data properly (by ensuring its relevance to your research and using visualizations and accurate citations). Data is not copyrightable, but the expression of data is – hence the importance of appropriate citations. The reinterpretation video provides memorable examples; watch it and think about how a similar method could be applied to areas of your industry.
      • Review these key points on the internal and external data available when constructing a business report to better understand the concepts covered in this piece. These data can also be split into qualitative (descriptive, non-numeric, usually requiring context) and quantitative (numeric, measurable); both are useful. Primary research is research you conduct yourself; secondary research builds on other people's primary research. Be sure to do the practice questions to solidify your understanding.

      • When evaluating sources, we look at quality, accuracy, relevance, bias, reputation, currency, and credibility factors in a specific work. This article breaks down the questions to ask yourself when evaluating a source – who, what, where, when, and why (sometimes we also need to add "how") – and then summarizes these as the 5Ws. What are your 5Ws?

      • 3.3.1: Identifying Data Sources

        Knowing where your data originated is vital. What data was entered, where, and how it was converted into a machine-readable format are among the most hotly discussed aspects of error tracing. The comprehensive article below provides a deep dive into the issues, such as providing an audit trail. Tracking source origination ensures you know which topics your records relate to.
        • "Data lineage includes the data origin, what happens to it and where it moves over time (essentially the full journey of a piece of data)". This page explains the concept of data lineage and its utility in tracing errors back to their root cause in the data process. Data lineage is a way of debugging Big Data pipelines, but the process is not simple. Many challenges exist, such as scalability, fault tolerance, anomaly detection, and more. For each of the challenges listed, write your own definition.
      • 3.3.2: Source Evaluation Trust Matrix

        These examples describe various types of trust models. To standardize how sources are validated and evaluated in your organization, you should rely on an existing trust model that is already in widespread use, or develop one if none exists, so that every team member dealing with data knows whether to trust it. These articles describe two types of trust evaluation models for specific processes. Yours may be similar or much different, depending on your field and the source requirements of your discipline and your organization.
        • An effective dynamic trust evaluation model (DTEM) for wireless sensor networks can complement traditional security mechanisms in addressing security issues. In detection rate, the DTEM outperformed both RSFN and BTMS (two existing trust models). Traditional security mechanisms (cryptography, authentication, etc.) are widely used to deal with external attacks, and a "trust model is a useful complement to the traditional security mechanism, which can solve insider or node misbehavior attacks". In this research paper, the authors highlight the relationship between four modules. Use Figure 2 to better understand the progression of the implementation process. In your own words, write or redraw your own understanding of the trust process outlined in Figure 1.
        • Another trust model, based on D-S evidence theory and sliding windows, can bolster system security in cloud computing by enhancing the detection of malicious entities and improving how entities' credibility is evaluated in general.
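
        The models in these papers are considerably more involved, but their core intuition – update a node's trust from its recent behavior over a sliding window – can be sketched in a few lines. The sketch below uses a simple beta-reputation-style score as a stand-in; it is not the authors' exact DTEM or D-S formulation.

        ```python
        # Toy sliding-window trust score: a node's trust is estimated from its recent
        # cooperative (1) vs. misbehaving (0) interactions. This is a simple
        # beta-reputation-style illustration, not the exact DTEM or D-S model above.
        from collections import deque

        class SlidingWindowTrust:
            def __init__(self, window_size=10):
                self.observations = deque(maxlen=window_size)  # only recent interactions count

            def record(self, cooperated: bool):
                self.observations.append(1 if cooperated else 0)

            def trust(self) -> float:
                """Expected value of a Beta(successes + 1, failures + 1) distribution."""
                successes = sum(self.observations)
                failures = len(self.observations) - successes
                return (successes + 1) / (successes + failures + 2)

        node = SlidingWindowTrust(window_size=5)
        for outcome in [True, True, False, True, False, False]:   # oldest outcome slides out
            node.record(outcome)
        print(f"trust score: {node.trust():.2f}")   # a low score might flag a suspicious node
        ```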
    • 3.4: Data Optimization

      Traditional techniques can no longer handle complex optimization problems, especially as datasets get larger and more disparate. The research world is moving toward understanding how to reduce computational resources in various ways, including through artificial intelligence (AI), which essentially teaches machines (computers) to "think" like humans. While the efficiency benefits of AI are obvious, numerous ethical and soundness issues will be debated as new technologies are created, tested, and deployed for various purposes in industry and even in consumer products. Do you want your freezer to decide for you how much and what kind of ice to make, for instance? Maybe you do. Others may find this intrusive and "creepy".

      • This summary review showcases nine research articles on varying topics and provides key takeaways from each. These are recent, highly relevant publications, so you would be well served to keep abreast of such advances as your career progresses. Pay particular attention to the second article synopsis, entitled "Leveraging Image Visual Features in Content-Based Recommender System". This recommendation model, which combines user-item rating data with hybrid item features based on image visual features, can be particularly useful in sparse data scenarios, where it has achieved better results than other conventional approaches. Also of note is the seventh article, "Classification algorithms based on Polyhedral Conic Functions Analysis", which provides promising results compared with traditional supervised algorithms where the goal was to classify literature into predefined classes. Write down your interpretation of what these summaries mean for your understanding of data optimization across various types of datasets and industries.
      • 3.4.1: Preparing Data

        Now that you have all that data, how do you make it useful? It must be cleaned and enriched to provide relevant insights. These articles highlight how that can be achieved through confirmatory and exploratory approaches.
        • This paper notes that previous work exploring how data mining methods can be used to analyze process data in the log files of technology-enhanced assessments is limited in that it only examines the efficacy of one data mining technique under one specific scenario. The paper also demonstrates four often-used supervised learning techniques and two unsupervised methods fitted to a single assessment dataset and discusses the pros and cons of each. For example, the authors note that regression trees may deal with noise well but are easily influenced by small changes. Can you differentiate between a confirmatory approach and an exploratory approach?
        • This article provides a concise summary of the data preparation process (gather → discover → cleanse → transform → enrich → store). It touches upon the tasks involved in prepping data (aggregation, formatting, normalization, and labeling), the concept of data quality (as a measure of the success of the preparation pipeline), and challenges you may run into when preparing data (diversity of data sources, time required, lack of confidence in quality); a short sketch after this list walks through the pipeline step by step. What are the most common languages/libraries used for data preparation? What can be used for fast in-memory data processing in a distributed architecture?
        • Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. The data mining step aims to extract information from a dataset and transform it into an understandable structure for further use; data mining is just one step (the core step) of the knowledge discovery process. The steps that follow include pattern evaluation (interpreting the mined patterns and relationships), which is akin to your analytic process, and knowledge consolidation, which is similar to reporting your findings (although your reporting should be more robust than simply consolidating knowledge if it is to respond responsibly to your requirements). Like analysis, KDD is an iterative process: if the patterns evaluated after the data mining step are not useful, the process can begin again from an earlier step.
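
        As the small companion sketch promised above, here is the gather → discover → cleanse → transform → enrich → store pipeline expressed in pandas. All columns and the region lookup table are invented for illustration.

        ```python
        # Minimal sketch of the preparation pipeline: gather -> discover -> cleanse ->
        # transform -> enrich -> store. All columns and the lookup table are invented.
        import pandas as pd

        # Gather: pull raw records (hard-coded here instead of reading from a source).
        raw = pd.DataFrame({
            "store": ["north", "north", "SOUTH", None],
            "units": [12, 12, 7, 3],
            "unit_price": [2.5, 2.5, 4.0, 1.0],
        })

        # Discover: inspect shape, types, and missing values before touching anything.
        raw.info()

        # Cleanse: drop duplicates and rows missing a key field.
        prepared = raw.drop_duplicates().dropna(subset=["store"])

        # Transform: standardize formats (normalization) and derive a revenue column.
        prepared = prepared.assign(
            store=prepared["store"].str.lower(),
            revenue=prepared["units"] * prepared["unit_price"],
        )

        # Enrich: join in context from another (hypothetical) source, e.g. store regions.
        regions = pd.DataFrame({"store": ["north", "south"], "region": ["R1", "R2"]})
        prepared = prepared.merge(regions, on="store", how="left")

        # Store: persist the prepared data for analysis.
        prepared.to_csv("prepared_sales.csv", index=False)
        print(prepared)
        ```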
      • 3.4.2: Standardization

        • There are some common issues when dealing with big data. Two critical ones are data quality and data variety (such as multiple formats within the same dataset) – deep learning techniques, such as learned dimension reduction, can be used to address these problems. Traditional data models and machine learning methods struggle with these issues, further supporting the case for deep learning, since the former cannot handle such complex data within a big data framework. A brief dimension-reduction sketch follows this list.

          Using the 7Vs characteristics of big data, assign an issue you recognize in your industry to each, and think of possible solutions. Additionally, write down the positives and consider whether some of these key points could be applied elsewhere in your industry in a different manner.

        • This article provides a quick explanation of the two types of statistical studies: observational studies (which observe individuals and measure variables of interest) and experiments (which intentionally manipulate one variable to see its effect on another). Write your own interpretive definitions of observational studies and experiments.
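
        Here is the dimension-reduction sketch mentioned above. The reading frames the technique in deep learning terms; classical PCA from scikit-learn is used here only because it shows the same idea – compressing many correlated columns into a few informative ones – in a handful of lines, on randomly generated data.

        ```python
        # Dimension-reduction sketch: compress 50 noisy, correlated features into 3
        # components with PCA. Deep-learning approaches (e.g. autoencoders) pursue the
        # same goal; PCA is used here only because it fits in a few lines.
        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(seed=0)
        latent = rng.normal(size=(200, 3))                        # 3 "true" underlying factors
        mixing = rng.normal(size=(3, 50))
        X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))    # 50 observed, noisy columns

        pca = PCA(n_components=3)
        X_reduced = pca.fit_transform(X)

        print(X.shape, "->", X_reduced.shape)                     # (200, 50) -> (200, 3)
        print("variance explained:", pca.explained_variance_ratio_.round(3))
        ```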
      • 3.4.3: Combining Data from Different Sources

        Your data must be rigorous and contain a highly representative sample to achieve the most relevant, reliable, and reflective insights. Collecting data from only one subset of a large population is pointless when you wish to market to the whole. A short sketch after the readings below shows one simple way to combine two sources.
        • Enterprises can capture value from big data to gain immediate social/monetary value or strategic competitive advantage. Firms can capture value in various ways, such as data-driven discovery and innovation of new and existing products and services. Can you think of five examples that can be ascribed to each method?
        • The visuals in this article highlight the importance of the 'Big Picture of Statistics' and summarize the general steps of a statistical study: from coming up with the research question, determining what to measure and collecting data, to conducting exploratory analysis on this data and inference to draw up a conclusion on the population in question. Using the example showcased on this page, pick a topic or issue important to your industry and make a visual representation of the concepts.
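
        Here is the short sketch mentioned above: a minimal pandas example combining two hypothetical sources (an internal CRM export and an external survey) on a shared key. All column names and values are invented.

        ```python
        # Minimal sketch of combining two hypothetical sources on a shared key.
        # All column names and values are invented for illustration.
        import pandas as pd

        # Internal source: CRM export.
        crm = pd.DataFrame({
            "customer_id": [1, 2, 3, 4],
            "segment": ["retail", "retail", "wholesale", "retail"],
        })

        # External source: survey responses (note: not every customer responded).
        survey = pd.DataFrame({
            "customer_id": [2, 3, 5],
            "satisfaction": [4, 2, 5],
        })

        # An outer join keeps customers that appear in only one source, making the
        # gaps (and any sampling bias) visible instead of silently dropping them.
        combined = crm.merge(survey, on="customer_id", how="outer")
        print(combined)
        ```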

    • Study Guide: Unit 3

      We recommend reviewing this Study Guide before taking the Unit 3 Assessment.

    • Unit 3 Assessment

      • Take this assessment to see how well you understood this unit.

        • This assessment does not count towards your grade. It is just for practice!
        • You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
        • You can take this assessment as many times as you want, whenever you want.