  • 【337】Text Mining Using Twitter Streaming API and Python

    Reference: An Introduction to Text Mining using Twitter Streaming API and Python

    Reference: How to Register a Twitter App in 8 Easy Steps

    • Getting Data from Twitter Streaming API
    • Reading and Understanding the data
    • Mining the tweets

    Key Methods:

    • map()
    • lambda
    • set()
    • pandas.DataFrame()
    • matplotlib

    1. Getting Data from Twitter Streaming API

    The script twitter_streaming.py below is used to extract tweets from Twitter.

    #Import the necessary methods from tweepy library
    from tweepy.streaming import StreamListener
    from tweepy import OAuthHandler
    from tweepy import Stream
    
    #Variables that contain the user credentials to access the Twitter API
    access_token = "ENTER YOUR ACCESS TOKEN"
    access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
    consumer_key = "ENTER YOUR API KEY"
    consumer_secret = "ENTER YOUR API SECRET"
    
    
    #This is a basic listener that just prints received tweets to stdout.
    class StdOutListener(StreamListener):
    
        def on_data(self, data):
            print(data)
            return True
    
        def on_error(self, status):
            print(status)
    
    
    if __name__ == '__main__':
    
        #This handles Twitter authentication and the connection to the Twitter Streaming API
        l = StdOutListener()
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)
        stream = Stream(auth, l)
    
        #This line filters the Twitter stream to capture data containing the keywords: 'python', 'javascript', 'ruby'
        stream.filter(track=['python', 'javascript', 'ruby'])
    

    You can use the following command to redirect the script's output into a file (from the command line):

    python twitter_streaming.py > twitter_data.txt
    

    Next, we read the text file line by line, parse each line as JSON, and collect the parsed tweets in a list.

    import json
    tweets_data_path = r"..\twitter_data.txt"
    tweets_data = []
    with open(tweets_data_path, "r") as tweets_file:
        for line in tweets_file:
            try:
                tweet = json.loads(line)
                tweets_data.append(tweet)
            except ValueError:
                # skip blank or malformed lines
                continue
    

    The parsed tweets are stored in tweets_data, and we can pull out specific fields with the following script.

    Reference: python JSON only get keys in first level

    # print the text content and language of the first 10 tweets
    for num, tweet in enumerate(tweets_data[:10], start=1):
        print(num)
        print(tweet["lang"])
        print(tweet["text"])
        print()
    
    # get all the keys from json
    tweets_data[0].keys()
    
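    Note that records from the Streaming API are not guaranteed to carry every key (delete notices, for example, have no "text" field). A hedged sketch of safer field access with dict.get(), using a made-up record:

```python
# Hypothetical parsed tweet; real Streaming API records may omit keys.
tweet = {"text": "hello", "lang": "en"}

# dict.get() returns a default instead of raising KeyError.
text = tweet.get("text", "")
place = tweet.get("place")          # None when the key is absent
country = place["country"] if place else None

print(text, country)  # hello None
```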

    2. Reading and Understanding the data

    We can also extract a specific field with list(), map(), and a lambda expression:

    Reference: Using map together with lambda in Python

    >>> a = list(map(lambda tweet: tweet['text'], tweets_data))
    >>> len(a)
    1633
    >>> a[0]
    'RT @neet_se: 案件数って点だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'
    

    Alternatively, we can use set() to get the unique values in the list.

    Reference: The Python set() function

    Reference: Counting occurrences of duplicate items in a Python list

    >>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))
    >>> len(langs)
    1633
    >>> set(langs)
    {'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}
    
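    The same tallying can be done in plain Python with collections.Counter, which is what the counting-duplicates reference above describes; a small sketch with made-up language codes:

```python
from collections import Counter

# Hypothetical language codes; Counter tallies duplicates much like
# pandas value_counts() does in the next step.
langs = ["en", "ja", "en", "es", "en", "ja"]
counts = Counter(langs)

print(counts.most_common(2))  # [('en', 3), ('ja', 2)]
print(sorted(set(langs)))     # ['en', 'es', 'ja']
```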

    Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.

    >>> import pandas as pd
    >>> tweets = pd.DataFrame()
    >>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
    >>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))
    >>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] != None else None, tweets_data))
    >>> tweets['lang'].value_counts()
    en     1119
    ja      278
    es      113
    pt       36
    und      26
    ...
    

    Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.

    >>> tweets_by_lang = tweets['lang'].value_counts()
    
    >>> import matplotlib.pyplot as plt
    >>> fig, ax = plt.subplots()
    >>> ax.tick_params(axis='x', labelsize=15)
    >>> ax.tick_params(axis='y', labelsize=10)
    >>> ax.set_xlabel('Languages', fontsize=15)
    Text(0.5, 0, 'Languages')
    >>> ax.set_ylabel('Number of tweets' , fontsize=15)
    Text(0, 0.5, 'Number of tweets')
    >>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
    Text(0.5, 1.0, 'Top 5 languages')
    >>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
    <matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>
    >>> plt.show()
    

    Next, we will create a chart describing the Top 5 countries from which the tweets were sent.

    >>> tweets_by_country = tweets['country'].value_counts()
    
    >>> fig, ax = plt.subplots()
    >>> ax.tick_params(axis='x', labelsize=15)
    >>> ax.tick_params(axis='y', labelsize=10)
    >>> ax.set_xlabel('Countries', fontsize=15)
    Text(0.5, 0, 'Countries')
    >>> ax.set_ylabel('Number of tweets' , fontsize=15)
    Text(0, 0.5, 'Number of tweets')
    >>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
    Text(0.5, 1.0, 'Top 5 countries')
    >>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
    <matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>
    >>> plt.show()
    

    3. Mining the tweets

    Our main goals in these text-mining tasks are to compare the popularity of the Python, Ruby, and JavaScript programming languages and to retrieve programming tutorial links. We will do this in 3 steps:

    • We will add tags to our tweets DataFrame in order to be able to manipulate the data easily.
    • Target tweets that contain the "programming" or "tutorial" keywords.
    • Extract links from the relevant tweets.

    Adding Python, Ruby, and Javascript tags

    First, we will create a function that checks whether a specific keyword is present in a text. We will do this using regular expressions.

    Python provides a library for regular expression called re. We will start by importing this library.

    Next, we will create a function called word_in_text(word, text). This function returns True if word is found in text, otherwise it returns False.

    >>> import re
    >>> def word_in_text(word, text):
    	word = word.lower()
    	text = text.lower()
    	match = re.search(word, text)
    	if match:
    		return True
    	return False
    
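    One caveat: word_in_text() passes word straight to re.search(), so it also matches inside longer words ('ruby' matches 'rubygate'). A stricter variant, offered here as a sketch rather than the original author's code, escapes the word and anchors it with \b:

```python
import re

# re.escape() neutralises regex metacharacters in word, and \b anchors
# the match to whole words, so 'ruby' no longer matches inside 'rubygate'.
def word_in_text_strict(word, text):
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None

print(word_in_text_strict("ruby", "I love Ruby tutorials"))  # True
print(word_in_text_strict("ruby", "the rubygate scandal"))   # False
```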

    Next, we will add 3 columns to our tweets DataFrame by pandas.DataFrame.apply().

    >>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
    >>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))
    >>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))
    

    We can calculate the number of tweets for each programming language by pandas.Series.value_counts as follows:

    >>> print(tweets['python'].value_counts()[True])	       
    447
    >>> print(tweets['ruby'].value_counts()[True])	       
    529
    >>> print(tweets['javascript'].value_counts()[True])	       
    275
    

    We can make a simple comparison chart by executing the following:

    >>> prg_langs = ['python', 'ruby', 'javascript']  
    >>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]     
    >>> x_pos = list(range(len(prg_langs)))
    >>> width = 0.8       
    >>> fig, ax = plt.subplots()  
    >>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')	       
    <BarContainer object of 3 artists>
    >>> # Setting axis labels and ticks       
    >>> ax.set_ylabel('Number of tweets', fontsize=15)       
    Text(0, 0.5, 'Number of tweets')
    >>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')       
    Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')
    >>> ax.set_xticks([p + 0.4 * width for p in x_pos])      
    [<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]
    >>> ax.set_xticklabels(prg_langs)       
    [Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
    >>> plt.grid()       
    >>> plt.show()
    

    This shows that the keyword ruby is the most popular, followed by python and then javascript. However, the tweets DataFrame contains every tweet that mentions one of the 3 keywords, not just those about the programming languages. For example, many tweets containing the keyword ruby are related to the political scandal Rubygate. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.

    Targeting relevant tweets

    We are interested in targeting tweets that are related to programming languages. Such tweets often contain one of 2 keywords: "programming" or "tutorial". We will add 2 columns to our tweets DataFrame to hold this information.

    >>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))
    >>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))
    

    We will add an additional column called relevant that takes the value True if the tweet contains either the "programming" or "tutorial" keyword, and False otherwise.

    >>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))
    
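    Since the programming and tutorial columns already exist, the same relevant column can also be computed with an element-wise | on those columns instead of re-running word_in_text(); a small sketch on a toy DataFrame:

```python
import pandas as pd

# Toy frame mirroring the two keyword columns built above.
tweets = pd.DataFrame({
    "programming": [True, False, False],
    "tutorial":    [False, True, False],
})

# Element-wise OR of the existing boolean columns; equivalent to
# testing each keyword per row, but vectorised.
tweets["relevant"] = tweets["programming"] | tweets["tutorial"]

print(tweets["relevant"].tolist())  # [True, True, False]
```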

    We can print the counts of relevant tweets by executing the commands below.

    >>> print(tweets['programming'].value_counts()[True])       
    55
    >>> print(tweets['tutorial'].value_counts()[True])       
    22
    >>> print(tweets['relevant'].value_counts()[True])  
    74
    

    We can compare now the popularity of the programming languages by executing the commands below.

    tweets[tweets['relevant'] == True]['python'] # select the 'python' column for the rows where 'relevant' is True
    >>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])       
    31
    >>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])
    8
    >>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])   
    11
    

    Python is the most popular with a count of 31, followed by javascript with a count of 11, and ruby with a count of 8. We can draw a comparison chart by executing the following:

    >>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True],
    			  tweets[tweets['relevant'] == True]['ruby'].value_counts()[True],
    			  tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]] 
    >>> x_pos = list(range(len(prg_langs)))
    >>> width = 0.8
    >>> fig, ax = plt.subplots()
    >>> plt.bar(x_pos, tweets_by_prg_lang, width,alpha=1,color='g')
    <BarContainer object of 3 artists>
    >>> ax.set_ylabel('Number of tweets', fontsize=15)
    Text(0, 0.5, 'Number of tweets')
    >>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')
    Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')
    >>> ax.set_xticks([p + 0.4 * width for p in x_pos])
    [<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]
    >>> ax.set_xticklabels(prg_langs) 
    [Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
    >>> plt.grid()
    >>> plt.show()
    

    Extracting links from the relevant tweets

    Now that we have extracted the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses a regular expression to retrieve a link starting with "http://" or "https://" from a text. The function returns the URL if one is found, otherwise it returns an empty string.

    >>> def extract_link(text):
    	regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    	match = re.search(regex, text)
    	if match:
    		return match.group()
    	return ''
    
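    re.search() returns only the first link, but a tweet may contain several. A possible variant (a sketch, not from the original article) uses re.findall() to return them all; note the \s in the character class and the escaped dot in www\.:

```python
import re

# Variant returning every link in the text rather than only the first.
def extract_links(text):
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    return re.findall(regex, text)

print(extract_links('see https://t.co/abc and https://t.co/xyz'))
# ['https://t.co/abc', 'https://t.co/xyz']
```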

    Next, we will add a column called link to our tweets DataFrame. This column will contain the extracted URL.

    >>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))
    

    Next, we will create a new DataFrame called tweets_relevant_with_link. It is a subset of the tweets DataFrame containing all relevant tweets that have a link.

    Here we take a subset of the original DataFrame.

    >>> tweets_relevant = tweets[tweets['relevant'] == True]       
    >>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']
    
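    The same two-step subset can also be written as a single boolean mask; a small sketch on a toy DataFrame:

```python
import pandas as pd

# Toy frame standing in for the tweets DataFrame (made-up links).
tweets = pd.DataFrame({
    "relevant": [True, True, False],
    "link": ["https://t.co/a", "", "https://t.co/b"],
})

# Single boolean mask: relevant AND a non-empty link.
subset = tweets[(tweets["relevant"]) & (tweets["link"] != "")]

print(len(subset))  # 1
```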

    We can now print out all links for python, ruby, and javascript by executing the commands below:

    >>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])       
    40      https://t.co/zoAgyQuMAZ
    105     https://t.co/ogaPbuIbEW
    274     https://t.co/y4sUmovFOn
    329     https://t.co/A030fqWeWA
    339     https://t.co/LaaVc5T2rQ
    391     https://t.co/8bYvlziCZb
    413     https://t.co/8bYvlziCZb
    436     https://t.co/EByqxT1qyN
    444     https://t.co/8bYvlziCZb
    445     https://t.co/5Jujg6h31B
    462     https://t.co/UrFHlOaJYf
    476     https://t.co/5Jujg6h31B
    477     https://t.co/EByqxT1qyN
    589     https://t.co/UrFHlOaJYf
    603     https://t.co/5Jujg6h31B
    822     https://t.co/Oc21FrzQc5
    1060    https://t.co/qOAIuKfyD0
    1097    https://t.co/qOAIuKfyD0
    1248    https://t.co/V3ZNKuYsK7
    1278    https://t.co/qOAIuKfyD0
    1411    https://t.co/szHRHavQKy
    1594    https://t.co/X6KWMlzlv6
    Name: link, dtype: object
    >>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])	       
    782     https://t.co/JgY40r2NSo
    833     https://t.co/JgY40r2NSo
    1177    https://t.co/xycOG3ndi9
    1254    https://t.co/xycOG3ndi9
    1293    https://t.co/LMHW050TGs
    1328    https://t.co/SS4DzEnSBZ
    1393    https://t.co/NZlUce5Ne8
    1619    https://t.co/e4nwrn3N2j
    Name: link, dtype: object
    >>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])     
    130     https://t.co/AbJFaSI0B8
    286     https://t.co/7dNBIsQ5Gq
    467     https://t.co/3YIK588j8t
    471     https://t.co/vjBJWWzvfv
    830     https://t.co/T4mUjwUcgL
    1093    https://t.co/wvLZLjuVKF
    1180    https://t.co/luxL2qbxte
    1526    https://t.co/G3ZTFL0RKv
    Name: link, dtype: object
    
  • Source: https://www.cnblogs.com/alex-bn-lee/p/9946375.html