
Using Github on Windows

Published by chengjun on July 16th, 2013

It used to be fairly painful to use GitHub on Windows. However, that pain has recently become history: GitHub officially released a client for Windows users, which makes it pretty easy to pull files from the GitHub website and push your local modifications back to the web.


1. Download and install the software

Install the software and log in with your GitHub account and password.

2. Clone with a click

For example, I have been working on a book on computational social science with my collaborators. One of them created a repository on GitHub named css. I want to clone this repository to my local computer, add files, and push them back to the GitHub website.

There are two “places” that I can manipulate: “local” and “github”.


The first step is to clone this repository. Click “github” first to find the “css” repository, then click “clone the repository” to pull the files to your local computer.

3. Add files directly to your local github directory

You can manually add files and make changes freely.

For example, I add a markdown file to a sub-directory of the “css” repository on my local computer.

By the way, I suggest that beginners use MarkdownPad to write markdown files. Using MarkdownPad, you can instantly see what your Markdown documents look like in HTML. While you type, LivePreview automatically scrolls to the location you are editing.

4. Sync the changes to github


After that, the software detects the change, as shown in the figure above. Enter the commit message (which is required) and click the commit button.

Finally, click the sync button at the top to manually push the changes to the GitHub website.

See, it’s that easy. You no longer need to program just to pull and push your GitHub files. I guess you have no excuse to avoid GitHub now, so start using it.

Demise of Bursts

Published by chengjun on June 2nd, 2012

The origin of Burst

In the famous paper titled The origin of bursts and heavy tails in human dynamics, Barabási shows that the waiting times of human communication behavior follow a power-law distribution rather than the exponential distribution expected from a Poisson process, and he argues that the bursty phenomena originate from the queuing process of decision making.

In his book Bursts, Barabási tells the story of how he arrived at the idea, and specifically of the secret of Poisson's success. As a great scientist, Poisson made great contributions in many fields. He had a habit of writing down each good research question he encountered and then returning to his ongoing work; only after finishing the work at hand would he select the most interesting question from his list.

Heavy-tailed processes allow for very long periods of inactivity that separate bursts of intensive activity. I am interested in this model because, as Barabási notes:

Although I have illustrated the queuing process for e-mails, in general the model is better suited to capture the competition between different kinds of activities an individual is engaged in; that is, the switching between various work, entertainment and communication events. Indeed, most data sets displaying heavytailed inter-event times in a specific activity reflect the outcome of the competition between tasks of different nature.
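Barabási's queuing model is easy to simulate. Below is a minimal sketch of my own (an illustrative implementation, not the paper's exact protocol): a fixed-length task list where, at each step, the highest-priority task is executed with probability p and a random task otherwise. Values of p near 1 produce heavy-tailed waiting times.

```python
import random

def simulate_queue(p=0.9999, list_len=2, steps=20000, seed=42):
    """Barabasi-style priority queue: each task has a random priority;
    with probability p the highest-priority task is executed, otherwise
    a random one. Returns the waiting time (in steps) of each executed task."""
    rng = random.Random(seed)
    # each task is a (priority, arrival_step) pair
    tasks = [(rng.random(), 0) for _ in range(list_len)]
    waits = []
    for step in range(1, steps + 1):
        if rng.random() < p:
            idx = max(range(list_len), key=lambda i: tasks[i][0])  # highest priority
        else:
            idx = rng.randrange(list_len)  # random task
        waits.append(step - tasks[idx][1])   # waiting time of the executed task
        tasks[idx] = (rng.random(), step)    # replace it with a fresh task
    return waits

waits = simulate_queue()
```

For p close to 1, a few low-priority tasks languish at the bottom of the list for a very long time, giving the heavy tail; p = 0 (random execution) recovers the short, exponential-like waits of a Poisson process.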

Bursts of video viewing activity

The metaphor of bursts is pretty apt, and it is even more obvious in video viewing activity.

In the paper Robust dynamic classes revealed by measuring the response function of a social system, Riley Crane and Didier Sornette argue that bursts of activity originate from endogenous and exogenous causes, with an epidemic cascade of actions becoming the cause of future actions.

To fit Barabási's theory, we can understand an individual's viewing behavior as the following process.

The action is for the individual to view the video in question after a time t since she was first subjected to the cause, without any other influences between 0 and t, corresponding to a direct (or first-generation) effect.

However, there is a big problem: most audiences view a YouTube video immediately after they are exposed to it.

They describe an epidemic branching process that captures the cascade of influences on the social network. The model integrates both exogenous sources and the interpersonal effects of the social network.

As discussed above, “by definition, the memory kernel φ(t) describes the distribution of waiting times between ‘cause’ and ‘action’ for an individual”.

μi is the number of potential viewers who will be influenced directly, over all future times after ti, by person i, who viewed a video at time ti. Thus, the existence of well-connected individuals can be accounted for with large values of μi. Lastly, V(t) is the exogenous source, which captures all spontaneous views that are not triggered by epidemic effects on the network.
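To make the structure of the model concrete, here is a toy discrete-time version (my own simplification with made-up parameter values; the authors work with a continuous-time epidemic branching process fitted to data): the expected views on day t are the exogenous source V(t) plus the influence of every past viewer, weighted by the memory kernel φ.

```python
import math
import random

def simulate_views(days=60, v_exo=5.0, mu=0.4, theta=0.3, seed=1):
    """Discrete-time sketch of the epidemic branching model:
    expected views on day t = V(t) + sum over past days s of
    views[s] * mu * phi(t - s), with memory kernel phi(t) ~ 1/t^(1+theta)."""
    rng = random.Random(seed)
    # normalize the kernel over the simulated horizon
    raw = [1.0 / (dt ** (1.0 + theta)) for dt in range(1, days + 1)]
    z = sum(raw)
    phi = [r / z for r in raw]

    views = []
    for t in range(days):
        lam = v_exo  # exogenous source V(t), taken constant here
        for s in range(t):
            lam += views[s] * mu * phi[t - s - 1]  # endogenous cascade term
        # draw the day's views as a Poisson(lam) variate (simple inversion)
        k, p = 0, math.exp(-lam)
        target, cum = rng.random(), math.exp(-lam)
        while cum < target:
            k += 1
            p *= lam / k
            cum += p
        views.append(k)
    return views

views = simulate_views()
```

With the branching ratio μ below 1 the process is subcritical and activity stays near the exogenous level; pushing μ toward 1 lets cascades feed on themselves and produces long endogenous relaxations.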

Based on this model, they categorized the videos into four kinds: Endogenous-subcritical, Endogenous-critical, Exogenous-subcritical, and Exogenous-critical.

According to our model, the aggregated dynamics can be classified by a combination of the type of disturbance (endo/exo) and the ability of individuals to influence others to action (critical/subcritical).

Peak Fraction

Peak Fraction (F) is the fraction of views observed on the peak day relative to the total cumulative views. They calculate F and sort the time series into three classes:

  1. Class 1 is defined by 80% ≤ F ≤ 100%. ↔ Exogenous subcritical ↔ Spam videos ↔ 1 + θ

  2. Class 2 is defined by 20% < F < 80%. ↔ Exogenous critical ↔ Quality videos ↔ 1 − θ

  3. Class 3 is defined by 0% ≤ F ≤ 20%. ↔ Endogenous critical ↔ Viral videos ↔ 1 − 2θ
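This classification rule is straightforward to express in code. Below is a small sketch (the function and class labels are my own naming) that computes F from a daily view series and applies the thresholds above:

```python
def peak_fraction(daily_views):
    """Fraction of total cumulative views that fall on the peak day."""
    total = sum(daily_views)
    if total == 0:
        return 0.0
    return max(daily_views) / total

def classify_series(daily_views):
    """Sort a view time series into the three classes by peak fraction F."""
    f = peak_fraction(daily_views)
    if f >= 0.8:
        return 'Class 1 (spam)'       # 80% <= F <= 100%
    elif f > 0.2:
        return 'Class 2 (quality)'    # 20% < F < 80%
    else:
        return 'Class 3 (viral)'      # 0% <= F <= 20%

print(classify_series([2, 96, 2]))   # → Class 1 (spam)
```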

Demise of Bursts

I am working on analyzing time series of video views and found an interesting demise of bursts over time.

Interestingly, I find there is no burst for the most popular video, Charlie bit my finger – again! The total views are 456,651,832, and it has never stopped growing since it was uploaded to YouTube in 2007. Enjoy it.

Charlie bit my finger – again

The green line is the cumulative growth curve, and the red line shows the normalized daily views. You can see that the growth of the red line is steady; moreover, the video outlives most other videos.

Scraping New York Times & The Guardian using Python

Published by admin on April 23rd, 2012

I have read the blog post about Scraping New York Times Articles with R. It's great, and I want to reproduce the work in Python.
First, we should learn about the NYTimes Article Search API.

Second, we need to register and get an API key, which will be used in the Python script.
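The request URLs in the scripts below are just the API endpoint followed by &-joined query parameters. A small helper (a hypothetical name of my own; 'ENDPOINT' is a placeholder, since the real API URLs were stripped from this post) makes the pattern explicit:

```python
def build_request_url(api_url, *params):
    """Join an API endpoint with '&'-separated query parameters."""
    return '&'.join([api_url] + list(params))

# hypothetical example with a placeholder endpoint and key
url = build_request_url('ENDPOINT', 'query=occupy+wall+street', 'api-key=YOUR_KEY')
# → 'ENDPOINT&query=occupy+wall+street&api-key=YOUR_KEY'
```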

#!/usr/bin/env python
# -*- coding: UTF-8  -*-
# Scraping New York Times using python
# 20120421@ Canberra
# chengjun wang

import json
import urllib2

About the API and the key, see the links above.

'''step 1: input query information'''
apiUrl=''   # set the API endpoint here (the URL was stripped from the original post)
query='query=occupy+wall+street'                            # set the query word here
apiDate='begin_date=20110901&end_date=20120214'             # set the date here
fields=''   # set the returned fields here (value omitted in the original post)
key='api-key=c2c5b91680.......2811165'  # input your key here

'''step 2: get the number of offset/pages'''
offset='offset=0'
link=[apiUrl, query, apiDate, fields, offset, key]
ReqUrl='&'.join(link)                  # assemble the request url
jstr = urllib2.urlopen(ReqUrl).read()  # t = jstr.strip('()')
ts = json.loads( jstr )
number=ts['total'] #  the number of queries  # query=ts['tokens'] # result=ts['results']
print number
seq=range(number/10+1)  # one offset per page of 10 results; not an elegant way
print seq

'''step 3: crawl the data and dump into csv'''
import csv
addressForSavingData= "D:/Research/Dropbox/tweets/wapor_assessing online opinion/News coverage of ows/nyt.csv"
file = open(addressForSavingData,'wb') # save to csv file
w = csv.writer(file,delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL)
for i in seq:
    offsets=''.join(['offset=', str(i)]) # I made an error here before; print is a good way to test
    links=[apiUrl, query, apiDate, fields, offsets, key]
    ReqUrls='&'.join(links)              # assemble the request url for this page
    print "*_____________*", ReqUrls
    jstrs = urllib2.urlopen(ReqUrls).read()
    t = jstrs.strip('()')
    tss= json.loads( t )  # if this raises "no JSON object could be decoded", print t to inspect
    result = tss['results']
    for ob in result:
        title=ob['title']  # body=ob['body']   # available fields: body,url,title,date,des_facet,desk_facet,byline
        print title
        date=ob['date'] # desk_facet=ob['desk_facet']  # byline=ob['byline'] # some author names don't exist
        url=ob['url']
        w.writerow((date, title, url)) # write it out
file.close()

See the results in the saved nyt.csv file.

Similarly, you can crawl the article data from The Guardian. See the link below.

After you have registered your app and got the key, we can work on the Python script.

#!/usr/bin/env python
# -*- coding: UTF-8  -*-
# Scraping The Guardian using Python
# 20120421@ Canberra
# chengjun wang

import json
import urllib2


'''step 1: input query information'''
apiUrl=''   # set the API endpoint and query word here (the URL was stripped from the original post)
apiDate='from-date=2011-09-01&to-date=2011-10-14'                     # set the date here
apiPage='page=1'      # set the page
apiNum=10             # set the number of articles in one page
apiPageSize='page-size=%s' % apiNum
fields=''   # set the returned fields here (value omitted in the original post)
key='api-key=mudfuj...g33gzq'  # input your key here

'''step 2: get the number of offset/pages'''
link=[apiUrl, apiDate, apiPage, apiPageSize, fields, key]
ReqUrl='&'.join(link)                  # assemble the request url
jstr = urllib2.urlopen(ReqUrl).read()  # t = jstr.strip('()')
ts = json.loads( jstr )
number=ts['response']['total'] #  the number of queries  # query=ts['tokens'] # result=ts['results']
print number
seq=range(number/apiNum+1)  # one entry per page; not an elegant way
print seq

'''step 3: crawl the data and dump into csv'''
import csv
addressForSavingData= "D:/Research/Dropbox/tweets/wapor_assessing online opinion/News coverage of ows/guardian.csv"
file = open(addressForSavingData,'wb') # save to csv file
w = csv.writer(file,delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL)
for i in seq:
    apiPages=''.join(['page=', str(i+1)]) # pages start from 1; print is a good way to test
    links= [apiUrl, apiDate, apiPages, apiPageSize, fields, key]
    ReqUrls='&'.join(links)              # assemble the request url for this page
    print "*_____________*", ReqUrls
    jstrs = urllib2.urlopen(ReqUrls).read()
    t = jstrs.strip('()')
    tss= json.loads( t )
    result = tss['response']['results']
    for ob in result:
        title=ob['webTitle'].encode('utf-8')  # body=ob['body']   # body,url,title,date,des_facet,desk_facet,byline
        print title
        date=ob['fields']['newspaperEditionDate'] # date=ob['webPublicationDate']  # byline=ob['fields']['byline']
        section=ob['sectionName']
        url=ob['webUrl']
        w.writerow((date, title, section, url)) # write it out
file.close()

Learning To Do Sentiment Analysis Using Python & NLTK

Published by admin on March 18th, 2012

This is my first try at sentiment analysis using Python. I was glad to learn that NLTK can distinguish 'like' from 'not like'. It's great, and I wonder how it compares with R.

The method below follows the procedure demonstrated in the figure below.

original author: Laurent Luce

The model uses a naive Bayes classifier.
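Before walking through the NLTK version, it may help to see what a naive Bayes classifier does in miniature. The sketch below is my own Python 3 code (the post's scripts are Python 2), using add-one smoothing where NLTK's default is the ELE estimator: it picks the label that maximizes log P(label) plus the summed log P(word | label).

```python
import math
from collections import Counter

def train_nb(examples):
    """examples: list of (words, label). Returns the counts needed for prediction."""
    label_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in label_counts}
    vocab = set()
    for words, label in examples:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify_nb(model, words):
    """Return the label maximizing log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, float('-inf')
    for label, lc in label_counts.items():
        lp = math.log(lc / total)  # log prior P(label)
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for w in words:
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb([(['love', 'this', 'car'], 'positive'),
                  (['not', 'like', 'this', 'car'], 'negative')])
print(classify_nb(model, ['love', 'car']))   # → positive
```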

# Twitter sentiment analysis using Python and NLTK
# original author: Laurent Luce
# Reproduced by chengjun wang to test the validity
# 20120319@Canberra

# find the original post by Laurent Luce following the link below:

import nltk

pos_tweets = [('I love this car', 'positive'),
	('This view is amazing', 'positive'),
	('I feel great this morning', 'positive'),
	('I am so excited about the concert', 'positive'),
	('He is my best friend', 'positive')]

neg_tweets = [('I do not like this car', 'negative'),
	('This view is horrible', 'negative'),
	('I feel tired this morning', 'negative'),
	('I am not looking forward to the concert', 'negative'),
	('He is my enemy', 'negative')]

tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
	words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
	tweets.append((words_filtered, sentiment))

# print tweets
# print to see the result

test_tweets = [
	(['feel', 'happy', 'this', 'morning'], 'positive'),
	(['larry', 'friend'], 'positive'),
	(['not', 'like', 'that', 'man'], 'negative'),
	(['house', 'not', 'great'], 'negative'),
	(['your', 'song', 'annoying'], 'negative')]

# print test_tweets

# The list of word features needs to be extracted from the tweets. It is a list of every distinct word,
# ordered by frequency of appearance. We use the following function to get the list, plus the two helper
# functions.

def get_words_in_tweets(tweets):
	all_words = []
	for (words, sentiment) in tweets:
		all_words.extend(words)
	return all_words
def get_word_features(wordlist):
	wordlist = nltk.FreqDist(wordlist)
	word_features = wordlist.keys()
	return word_features
# what does word_features do?
word_features = get_word_features(get_words_in_tweets(tweets))
# print word_features

# To create a classifier, we need to decide what features are relevant. To do that, we first need a
# feature extractor. The one we are going to use returns a dictionary indicating what words are
# contained in the input passed. Here, the input is the tweet. We use the word features list defined
# above along with the input to create the dictionary.

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
      features['contains(%s)' % word] = (word in document_words)
    return features

# call the feature extractor with the document ['love', 'this', 'car']
# document=['love', 'this', 'car']
# features = extract_features(document)
# print features

training_set = nltk.classify.util.apply_features(extract_features, tweets)
# print training_set
# be careful here, it should be nltk.classify.util.apply_features rather than nltk.classify.apply_features
# apply the features to our classifier using the method apply_features.
# We pass the feature extractor along with the tweets list defined above.

# The variable training_set contains the labeled feature sets. It is a list of tuples, with each tuple
# containing the feature dictionary and the sentiment string for each tweet. The sentiment string is
# also called the label.

classifier = nltk.NaiveBayesClassifier.train(training_set)
# look inside the classifier train method in the source code of the NLTK library

def train(labeled_featuresets, estimator=nltk.probability.ELEProbDist):
    # ... (abridged: count label_freqdist and feature_freqdist from labeled_featuresets)
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    # ... (abridged: fill feature_probdist with one estimated distribution per (label, fname))
    return NaiveBayesClassifier(label_probdist, feature_probdist)

# print label_probdist.prob('positive')
# print label_probdist.prob('negative')

# print feature_probdist
# print feature_probdist[('negative', 'contains(best)')].prob(True)

# print classifier.show_most_informative_features(32)
# show_most_informative_features

tweet = 'Larry is not my friend'
# print classifier.classify(extract_features(tweet.split()))

# take a look at how the classify method works internally in the NLTK library. What we pass to the classify method is the feature set of
# the tweet we want to analyze. The feature set dictionary indicates that the tweet contains the word "friend".
print extract_features(tweet.split()), '\n'

# def classify(self, featureset):
    # Discard any feature names that we've never seen before.
    # Find the log probability of each label, given the features.
	# {'positive': -1.0, 'negative': -1.0}
	# Then add in the log probability of features given labels.
	# {'positive': -5.4785441837188511, 'negative': -14.784261334886439}
    # Generate a probability distribution dictionary using the dict logprod
	# DictionaryProbDist(logprob, normalize=True, log=True)
    # Return the sample with the greatest probability from the probability
    # distribution dictionary

'''Take the following test tweet: 'Your song is annoying'. The classifier thinks it is positive.
The reason is that we don't have any information on the feature 'annoying'.
The larger the training sample of tweets, the better the classifier will be.'''

# tweet = 'Your song is annoying'
print classifier.classify(extract_features(tweet.split()))
