Developing Insights from Social Media: Discovering social trends in a target audience

Methodology

We present the details of how to discover a target audience of Twitter users and their collective voice from raw Twitter data. First, in order to identify candidate users that meet certain criteria, we explore available Twitter resources for data collection and existing approaches to user profiling. Next, we discuss enriching user profiles by utilizing hashtags in the tweets posted by the target users. Lastly, we show how to develop topical and social insights from the collective voice of the target users.

Before we go into details, we first present a formal model of the data space that we analyze in this paper. Our Twitter data space can be denoted as ${\mathcal {U}} \times {\mathcal {T}} \times {\mathcal {H}}$, where ${\mathcal {U}}$ is a set of users on Twitter, ${\mathcal {T}}$ is a set of tweets created by the users, and ${\mathcal {H}}$ is a set of hashtags used in the tweets by the users. This implies that a user $u \in {\mathcal {U}}$ creates a tweet $t \in {\mathcal {T}}$ using a set of hashtags ${\mathcal {H}}_{u,t} \subset {\mathcal {H}}$.

User profiling is an essential component of our approach: it defines the user attributes needed for a study and populates the attribute values for each user. We define the profile of a Twitter user $u \in {\mathcal {U}}$ as a set of tuples, each consisting of an attribute and its value, where the value $p(u,a)$ of an attribute $a \in A$ for user $u$ is computed by a user profiling function $p$, as in Eq. (1):

$\begin{aligned} P_{u}=\{(a,p(u,a)) \mid a\in A, u\in {\mathcal {U}}\}, \qquad (1) \end{aligned}$

where $A$ is a set of user attributes. Determining the user profiling function $p$ for each user attribute is the goal of the user profiling phase.
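Eq. (1) can be read as a mapping from attributes to values. The following is a minimal Python sketch of that reading, not the authors' implementation: the attribute names in `PROFILERS`, the helper `build_profile`, and the sample user are all hypothetical, and a user is assumed to be a plain dictionary decoded from a JSON User object.

```python
from typing import Any, Callable, Dict

# A Twitter User object is treated here as a plain dict decoded from JSON.
User = Dict[str, Any]
# One profiling function per attribute: it maps a user to that attribute's value.
ProfilingFn = Callable[[User], Any]


def build_profile(user: User, profilers: Dict[str, ProfilingFn]) -> Dict[str, Any]:
    """Return P_u as a mapping {a: p(u, a)} over every attribute a in A."""
    return {attribute: fn(user) for attribute, fn in profilers.items()}


# Hypothetical attribute set A with deliberately simple profiling functions.
PROFILERS: Dict[str, ProfilingFn] = {
    "location": lambda u: u.get("location") or None,   # empty string -> missing
    "popularity": lambda u: u.get("followers_count", 0),
    "sociability": lambda u: u.get("friends_count", 0),
}

# Made-up user record for illustration only.
example_user = {"name": "Example", "location": "", "followers_count": 120, "friends_count": 80}
profile = build_profile(example_user, PROFILERS)
```

In this sketch a missing value is represented as `None`, which matches the idea that some attributes cannot be populated from the raw data and may need a customized profiling function later.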

Fig. 1 The flow map of our unified scheme for developing social insights from the collective voice of target users

Figure 1 illustrates the flow of our unified scheme for developing social insights from the collective voice of target users. First, attributes of Twitter users, such as demographic attributes and other personal attributes, are identified in the user profiling stage. When some user attributes are missing due to data availability, researchers can consider developing their own customized solution to a specific user profiling task, for example a supervised machine learning model that uses hashtags as the features for prediction. Second, once this user profiling phase is completed, researchers select only the users of interest based on the identified user attributes. Finally, researchers proceed to develop topical and social insights from the collective voice of these target users.

User profiling

In general, sampling of Twitter users is less common than sampling of tweets due to the limited functionality of the Twitter API for collecting users. For this reason, we begin with a large pool of random tweets, which are much easier to collect via the Twitter API, as mentioned earlier in the "Introduction" section. Each collected tweet contains author information describing the user who created it. Some user attributes for the users in the pool are already known or can be easily acquired, while other attributes need to be inferred and may be difficult or even impossible to identify. It is worth noting that raw user data collected via the Twitter API already provides useful information about users. Table 2 lists native Twitter objects and their fields along with user attributes that can be derived from the fields. The Twitter API provides several types of objects encoded in JavaScript Object Notation (JSON), of which User and Tweet objects are the most useful in user profiling.

Table 2 Summary of the user attributes derivable from native Twitter objects

Object	Field	Description	Derivable user attributes
User	name	Name of the user	Name, gender, age, race/ethnicity
	location	User-defined location for the account's profile	Location
	url	URL provided by the user in association with their profile	Web site, blog, or other social media accounts
	description	User-defined description of their account	Demographics, expertise, hobbies, interests, personality traits, political orientation
	verified	Whether Twitter has verified that the account of public interest is authentic	Popularity
	followers_count	Number of users following the account	Popularity
	friends_count	Number of users the account is following	Sociability
	listed_count	Number of public lists that the user is a member of	Popularity
	favourites_count	Number of tweets the user has liked in the account's lifetime	Posting activeness
	statuses_count	Number of tweets (including retweets) issued by the user	Posting activeness
	created_at	UTC datetime that the user account was created on Twitter	Account age
	profile_image_url_https	HTTPS-based URL pointing to the user's profile image	Gender, age, race/ethnicity
	followers*	List of users following the account	Network
	friends*	List of users the account is following	Network
Tweet	created_at	UTC time when the tweet was created	Behavior
	text	Actual text of the status update	Demographics, expertise, interests, personality traits, political orientation
	coordinates	Geographic location of the tweet as longitude and latitude coordinates	Location, behavior
	place	Known place as city, state, or country	Location, behavior
	reply_count	Number of times the tweet has been replied to	Popularity
	retweet_count	Number of times the tweet has been retweeted by other users	Popularity
	favorite_count	Number of times the tweet has been liked by other users	Popularity
	lang	Machine-detected language of the tweet	Language
	retweeted_status	Original tweet object if the tweet is a retweet	Typical tweet or retweet

A User object, which describes an individual user on Twitter, has several fields that can be directly used as user attributes, such as name, location, and url, while the other fields can be analyzed to infer new attributes. For example, from the description field that has a user-defined description or bio of an account, one can infer many different types of user attributes, such as demographic attributes (e.g., age, education, gender, location, marital status, language, occupation, and race/ethnicity) and other personal attributes (e.g., expertise, hobbies, interests, personality traits, and political orientation), depending on the information included in the text of the field. A wide range of natural language processing (NLP) and text mining techniques can be applied to this field. The other fields in a User object can be good indicators of the account's popularity, sociability, or activeness. For example, the followers_count and the listed_count fields indicate how popular the account is, while the friends_count field indicates how sociable the account is. One may want to compare the followers_count to the friends_count, to see if there is a large or small gap between the two fields. For example, celebrities tend to have a very large number of followers but a smaller number of friends, whereas spam accounts or bot accounts tend to have many friends but few followers.

The favourites_count and the statuses_count fields can be used to measure how active the account is in terms of posting tweets. The created_at field can be used to calculate the account age in days, months, or years, which can be combined with other fields for normalization. For example, users who have been using Twitter for ten years would probably have more followers or have posted more tweets than those who just began to use Twitter. In this case, one may need to divide the number of followers or number of statuses by the account age, so that the indicators can be normalized for each user.
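The normalization described above, together with the follower-to-friend comparison, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the `created_at` string uses the classic Twitter API layout (for example `Fri Jan 01 00:00:00 +0000 2010`), and the function names and output keys are hypothetical.

```python
from datetime import datetime, timezone

# Assumed layout of the created_at string, e.g. "Fri Jan 01 00:00:00 +0000 2010".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def account_age_days(created_at: str, now: datetime) -> float:
    """Account age in days, floored at one day to avoid division by zero."""
    created = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    return max((now - created).total_seconds() / 86400.0, 1.0)


def normalized_indicators(user: dict, now: datetime) -> dict:
    """Divide raw counts by account age and compare followers to friends."""
    age = account_age_days(user["created_at"], now)
    followers = user.get("followers_count", 0)
    friends = user.get("friends_count", 0)
    return {
        "account_age_days": age,
        "followers_per_day": followers / age,
        "statuses_per_day": user.get("statuses_count", 0) / age,
        # A large ratio suggests a celebrity-like account, a small one a
        # possible spam or bot account, as discussed in the text.
        "follower_friend_ratio": followers / max(friends, 1),
    }


# Made-up user record for illustration only.
sample_user = {
    "created_at": "Fri Jan 01 00:00:00 +0000 2010",
    "followers_count": 1000,
    "friends_count": 200,
    "statuses_count": 50,
}
indicators = normalized_indicators(sample_user, datetime(2010, 1, 11, tzinfo=timezone.utc))
```

Passing `now` explicitly, rather than reading the system clock inside the function, keeps the indicators reproducible when the same data pool is re-analyzed later.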

A profile image from the profile_image_url_https field can be used to infer the gender, age, or race/ethnicity of the user by applying state-of-the-art image analysis techniques. The followers field contains the list of users following the account, while the friends field contains the list of users the account is following; together they represent the relationship network of the user. Note that these two fields, each marked with an asterisk in Table 2, are not actually fields of the User object: the Twitter API provides them separately from the User object. We nevertheless list them as fields of the User object, as we believe they should also be treated as user attributes. The two fields provide direct information about who the followers and friends of a user are. The verified field is a unique feature of Twitter, which indicates whether Twitter has verified that the account of public interest is authentic. A verified account has a blue verified badge on Twitter. This can serve as another indicator of the user's popularity or authority.

A Tweet object describes an individual tweet posted by a user. An individual tweet cannot be directly used as an attribute of a user due to its limited information. When aggregated, however, tweets can be a powerful source for a researcher to understand the user. While a Tweet object has a number of fields, the bottom half of Table 2 lists a few of those that can be used to infer user attributes. The text field is the most important of these, as it provides the raw tweet text written by the user. It is worth noting that tweet text is limited to 280 characters (the limit was increased from 140 to 280 in 2017), which is why Twitter is called a micro-blogging service. The short text has its own pros and cons. In some cases, tweet text might be too short to convey meaningful information from an analysis perspective, while in other cases a single short tweet can carry enough information to understand the user. At the same time, the short format is what has encouraged people to use Twitter freely. From a Big Data perspective, the more tweet text we have for a user, the better we can understand the user. The text field can be used to infer most of the demographic attributes and personal attributes mentioned earlier. As with the description field of a User object, this field can benefit from text analysis techniques.

The created_at, coordinates, and place fields can bring a temporal or a geo-spatial aspect to the study. While every tweet has a value in its created_at field, not all tweets have values in the coordinates and place fields. This depends on whether the user had activated location sharing in their applications. As discussed earlier, only a small fraction of tweets are geo-tagged or geo-referenced. The three fields reply_count, retweet_count, and favorite_count are considered to be good indicators of the popularity of the tweet, which can also translate into the popularity of the user. The lang field indicates which language the user is primarily using or able to use. It is also worth noting that users can retweet other users' tweets, and those retweets are considered to be the user's tweets, although they were originally created by others (users can also add their own comments to the original tweet when retweeting). If we analyze tweets to understand the user, however, those retweets could be of no help, because they were not originally created by the user. In this case, by referring to the retweeted_status field, those retweets can be excluded from the analysis, so that only the normal tweets created by the user are considered.
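Filtering out retweets and checking how many tweets carry location information are both simple operations on decoded Tweet objects. The sketch below assumes each tweet is a dictionary in which `retweeted_status` is either absent or empty for a normal tweet; the function names and the sample tweets are hypothetical.

```python
from typing import Dict, List


def is_retweet(tweet: Dict) -> bool:
    """A tweet is treated as a retweet when it carries a retweeted_status object."""
    return tweet.get("retweeted_status") is not None


def original_tweets(tweets: List[Dict]) -> List[Dict]:
    """Keep only the tweets that the user wrote, dropping retweets."""
    return [t for t in tweets if not is_retweet(t)]


def geotagged_share(tweets: List[Dict]) -> float:
    """Fraction of tweets that have a value in coordinates or place."""
    if not tweets:
        return 0.0
    tagged = sum(1 for t in tweets if t.get("coordinates") or t.get("place"))
    return tagged / len(tweets)


# Made-up timeline for illustration only.
timeline = [
    {"text": "Morning run done", "coordinates": None, "place": None},
    {"text": "RT @someone: big news", "retweeted_status": {"text": "big news"}},
    {"text": "Lunch downtown", "place": {"name": "Stockholm"}},
    {"text": "Reading tonight"},
]
own = original_tweets(timeline)
share = geotagged_share(own)
```

Computing the geo-tagged share over the user's own tweets only is a design choice made here; a study interested in where retweets were issued could just as well compute it over the full timeline.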

The Twitter objects and their associated fields listed in Table 2 provide insight into some heuristics for user profiling before attempting to apply advanced methodologies. In particular, the description field of a User object can be directly used to extract various user attributes like gender, location, occupation, and so on. The following description from a Twitter user account, which is open to the public, is a good example:

Senior Narrative Designer @UbiMassive - cats, books, games and scones - Brit in Sweden - opinions all mine - She/her.

This short bio tells much about the user, such as gender, occupation, hobbies, nationality, and location. The phrase “She/her” suggests that the user is female; she is a narrative designer at a game company; she likes cats, books, games, and scones; she is British; she lives in Sweden. While not all Twitter users describe themselves in such detail, it is apparent that the description field can serve as a primary source for understanding users. In order to extract the right information from the description text, string pattern matching with regular expressions can be employed.
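As an illustration of the regular-expression heuristic, the sketch below pulls two hints out of the example bio: stated pronouns and mentioned accounts. The patterns and the function name `extract_bio_hints` are assumptions made for this example, and stated pronouns are only a hint about gender, not a definitive label.

```python
import re
from typing import Dict

# Hypothetical patterns: a short list of pronoun pairs and Twitter-style mentions.
PRONOUN_PATTERN = re.compile(r"\b(she/her|he/him|they/them)\b", re.IGNORECASE)
MENTION_PATTERN = re.compile(r"@(\w+)")


def extract_bio_hints(description: str) -> Dict[str, object]:
    """Extract simple profile hints from the description field of a User object."""
    pronouns = PRONOUN_PATTERN.search(description)
    return {
        "pronouns": pronouns.group(1).lower() if pronouns else None,
        # Mentioned accounts often point to an employer or affiliation.
        "mentions": MENTION_PATTERN.findall(description),
    }


bio = (
    "Senior Narrative Designer @UbiMassive - cats, books, games and scones - "
    "Brit in Sweden - opinions all mine - She/her."
)
hints = extract_bio_hints(bio)
```

Attributes such as occupation or nationality would need richer patterns or the NLP techniques mentioned earlier; a handful of regular expressions mainly serves as a cheap first pass over a large user pool.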

If the approaches relying on raw user attributes provided by Twitter are too simple for a research study, one should consider employing the advanced user profiling techniques listed in Table 1. As described in the "Related literature" section, previous works have explored different ways of profiling Twitter users. When applying these methodologies, note that different methodologies use different data for user profiling. For example, to identify the location of a user, some methodologies consider only tweet text, whereas others also use the follow relationships of users or tweet context. Note also that methodologies targeted at the same user attribute do not always yield exactly the same outcome, as each methodology has its own research questions to address. Depending on the objectives of the study, a subset of the user attributes listed in Table 2 can be considered in user profiling. For the market research project example mentioned in the "Introduction" section, the researchers should focus only on attributes such as age, gender, and interest, and examine which methodologies fit the data they currently have. Once this user profiling task has been performed over all users in the data pool, the researchers can select only the users that meet the criteria they have set for the study. This initial set of selected users can be analyzed further to arrive at the final set of target users.

Customized user profiling

If the user profiling task has properly populated all the user attributes needed, we can move on to the selection of target users based on those attributes. In many cases, however, no resources are available for some user attributes, leaving their values missing. This can happen when (1) there are no available resources at all, (2) the existing resources do not fit the data we have, or (3) the performance of the available resources is not satisfactory.

To resolve this issue, we propose to consider developing a customized solution to a specific user profiling task, especially if it is a supervised machine learning problem. For example, suppose we want to classify each Twitter user by their political orientation, i.e., conservative or liberal. While there are some available resources for political orientation classification, as listed in Table 1, one might find that those existing resources do not work well with the recent Twitter data. This leads us to consider developing our own political orientation classifier as long as we can make labeled data that can be used for training and testing machine learning models. Inspired by the observation that some Twitter users explicitly share their political orientation in their bio, we can collect a set of those users and label them as conservative or liberal. We then can use the labeled data as training data and test data for machine learning by selecting a set of features for prediction. Specifically, we propose to utilize hashtags as the features for political orientation prediction, based on the idea that conservatives and liberals are believed to be interested in different topics to some extent, thereby using somewhat different hashtags. Once a machine learning model is built, one can apply the model to populate the values in the target user attribute. While we cannot say that this approach would work for all user profiling tasks, we believe that it can work for supervised machine learning tasks, such as classification and regression, and that it can be a good complement to the existing user profiling solutions. We call this phase text-based customized user profiling, as opposed to the primary user profiling performed in the first phase, as this customized user profiling task can complement what is missing from the primary user profiling task.
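The labeling step, in which users who explicitly state their orientation in their bio become training examples, could be approximated as follows. This is a rough sketch under stated assumptions: the keyword lists in `LABEL_PATTERNS` are hypothetical and far from exhaustive, and simple keyword matching cannot handle negation or irony, so labels produced this way would still need manual review.

```python
import re
from typing import Optional

# Hypothetical keyword lists; a real study would curate and validate these.
LABEL_PATTERNS = {
    "conservative": re.compile(r"\b(conservative|republican)\b", re.IGNORECASE),
    "liberal": re.compile(r"\b(liberal|democrat|progressive)\b", re.IGNORECASE),
}


def label_from_bio(description: str) -> Optional[str]:
    """Return a label only when exactly one orientation is stated in the bio."""
    matches = [label for label, pattern in LABEL_PATTERNS.items()
               if pattern.search(description)]
    # Bios that match both lists, or neither, are left unlabeled.
    return matches[0] if len(matches) == 1 else None


# Made-up bios for illustration only.
labels = [
    label_from_bio("Proud conservative, father of three"),
    label_from_bio("Liberal. Coffee lover."),
    label_from_bio("Cats and books"),
    label_from_bio("Ex-liberal turned conservative"),
]
```

Leaving ambiguous bios unlabeled trades coverage for label quality, which matters because these labels become the ground truth for both training and testing.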

In order to utilize hashtags as features for prediction, we first need to collect the tweets posted by users and mine hashtags from the tweets. The Twitter API allows researchers to retrieve up to the 3200 most recent tweets of a user account, as long as the account is set to public. Alternatively, one can consider web scraping to retrieve more than 3200 tweets from an account, although, unlike an API, this option does not provide easy access to the data in a structured form. While all words in tweets are meaningful in one way or another, we particularly focus on hashtags in tweets. A hashtag is a word prefixed with a hash (#) symbol, such as #metoo, #nowplaying, and #earthday. Hashtags were originally introduced on Twitter and have been used to index keywords or topics on social media, which allows users to easily follow topics of interest. The goal of a hashtag is to facilitate search and aggregation of messages related to the same topic. With the wide adoption of hashtags on Twitter, a number of studies have investigated them. Tsur et al. attempt to predict the spread of thoughts and ideas, called memes, using hashtags. Ferragina et al. address hashtag relatedness and classification.

One of the reasons why we focus on hashtags, instead of all words or phrases in tweets, is that they are easy to handle. As users explicitly create a hashtag with the hash symbol and a hashtag allows no space in it, hashtags are easy to extract from text and aggregate. In fact, the Twitter API provides a list of the hashtags identified in a tweet as a Hashtag object, so API users do not have to extract hashtags themselves, which would otherwise require a text analysis technique such as regular expressions. The main drawback of using hashtags is their sparsity; as pointed out by Godin et al., not all tweets have hashtags and not all users use hashtags. Nevertheless, this sparsity can be overcome when a large number of hashtags are aggregated, mainly because a hashtag tends to be adopted by a significant number of users who want to join a virtual community interested in a certain topic.
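Both extraction routes mentioned above can be sketched in a single helper: prefer the hashtag list supplied with the tweet, and fall back to a regular expression over the text. The sketch assumes a tweet dictionary whose hashtags, when present, sit under `entities` → `hashtags` with a `text` key per entry; the function name `hashtags_of` and the lowercasing are choices made for this example.

```python
import re
from typing import Dict, List

# Fallback pattern: a hash symbol followed by word characters, no spaces.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def hashtags_of(tweet: Dict) -> List[str]:
    """Return the lowercased hashtags of a tweet, without the hash symbol."""
    entities = tweet.get("entities") or {}
    tags = entities.get("hashtags")
    if tags is not None:
        # Hashtags already identified by the API (assumed layout).
        return [h["text"].lower() for h in tags]
    # No entity list available: extract from the raw text instead.
    return [m.lower() for m in HASHTAG_PATTERN.findall(tweet.get("text", ""))]


# Made-up tweets for illustration only.
from_api = hashtags_of({"text": "Time to speak up #MeToo",
                        "entities": {"hashtags": [{"text": "MeToo"}]}})
from_text = hashtags_of({"text": "Happy #EarthDay and #nowplaying!"})
```

Lowercasing merges variants such as #EarthDay and #earthday into one feature, which also helps a little with the sparsity issue noted above.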

Once all hashtags are extracted from tweets, they are aggregated such that the total frequency of each hashtag is calculated. Based on the hashtag frequency, one can build a hashtag popularity ranking sorted by frequency in descending order. This ranking can be a basis for researchers to manually select the top-k popular hashtags that will be used as features for prediction, where k can be determined empirically. When the top-k hashtags are selected as features, their per-user frequencies are the values that are fed into the machine learning model. This way, one can build a model that is able to predict the value of a user attribute for a user. Building a machine learning model should always be followed by evaluating its performance with commonly used metrics, such as accuracy, precision, and recall for a classification task.

Discovering social trends

Once the user profiling is completed and all values of the user attributes needed for the study are properly populated, one can select the target users of interest using the user attributes. For the market research example mentioned earlier, the researchers can simply select the users in their pool who are young, female, and interested in fashion. Given that the target users have been identified, researchers can proceed with an in-depth analysis of the collective voice of these target users. While this final phase depends entirely on the objectives of the study, i.e., what the researchers want to know about their target audience, we focus on hashtags from a topical perspective to discover popular or rising topics among people, and on relationship networks from a social perspective to identify influencers.
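Selecting target users amounts to filtering the populated profiles with one condition per attribute. The sketch below is illustrative only: the attribute names, the age bucket `"18-24"`, and the sample profiles are hypothetical stand-ins for whatever the profiling phase actually produced.

```python
from typing import Any, Callable, Dict, List

Profile = Dict[str, Any]
Criteria = Dict[str, Callable[[Any], bool]]


def select_target_users(profiles: Dict[str, Profile], criteria: Criteria) -> List[str]:
    """Return the ids of users whose profile satisfies every criterion."""
    return [
        user_id
        for user_id, profile in profiles.items()
        if all(check(profile.get(attr)) for attr, check in criteria.items())
    ]


# Hypothetical criteria for the market research example: young, female, fashion.
criteria: Criteria = {
    "age_group": lambda v: v == "18-24",
    "gender": lambda v: v == "female",
    # A missing value (None) never satisfies the criterion.
    "interests": lambda v: v is not None and "fashion" in v,
}

# Made-up profiles for illustration only.
profiles = {
    "u1": {"age_group": "18-24", "gender": "female", "interests": ["fashion", "music"]},
    "u2": {"age_group": "35-44", "gender": "female", "interests": ["fashion"]},
    "u3": {"age_group": "18-24", "gender": "female", "interests": None},
}
targets = select_target_users(profiles, criteria)
```

Users with missing attribute values are excluded here, which is the conservative choice; a study could instead route them to the customized profiling step before selection.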

Popular hashtags among the target users can be captured in the same way that we used earlier to identify popular hashtags for the customized user profiling. A simple frequency ranking over tweets works for popular hashtags, while one may want to consider more advanced techniques to detect hashtag trends over time. Influencers in a social network can be identified as well, based on the network structure among the target users. A variety of centrality measures, such as degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality, can be applied, as previously mentioned in the "Related literature" section.
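Of the centrality measures listed, degree centrality is the simplest to compute by hand. The sketch below computes in-degree centrality (the share of other target users who follow an account) over a follow network restricted to the target users. The edge format `(follower, followed)` and the function name are assumptions made for this example; the other measures are better taken from a graph library.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def in_degree_centrality(
    follow_edges: Iterable[Tuple[str, str]], users: Iterable[str]
) -> Dict[str, float]:
    """In-degree centrality of each target user within the target network.

    Each edge is (follower, followed). Edges touching users outside the
    target set are ignored, as are self-loops and duplicate edges.
    """
    target: Set[str] = set(users)
    followers = defaultdict(set)
    for follower, followed in follow_edges:
        if follower in target and followed in target and follower != followed:
            followers[followed].add(follower)
    denominator = max(len(target) - 1, 1)
    return {u: len(followers[u]) / denominator for u in target}


# Made-up follow network for illustration only; "x" is not a target user.
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("a", "x")]
centrality = in_degree_centrality(edges, ["a", "b", "c", "d"])
top_influencer = max(centrality, key=centrality.get)
```

In-degree is used rather than total degree because being followed, not following, is what the text associates with popularity.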
