Archive for the ‘Data’ Category

Randomly sampling tweets with stream API

Published by chengjun on January 20th, 2013

I want to randomly sample Twitter streams, so I turned to Twitter’s streaming API.
With the help of the tweepy package for Python, I tried the following script. So far it works pretty well.

# Twitter API Crawler
# -*- coding: utf-8 -*-

# Author: chengjun wang
# Hong Kong, 2013/01/20
import sys
import tweepy
import codecs
from time import clock

'''OAuth Authentication'''

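# consumer_key, consumer_secret, access_token and access_token_secret
# must be defined beforehand with your own Twitter application credentials.
auth1 = tweepy.OAuthHandler(consumer_key, consumer_secret)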
auth1.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth1)

# Note: Had you wanted to perform the full OAuth dance instead of using
# an access key and access secret, you could have used the following
# lines of code instead of the previous line that manually set the
# access token via auth1.set_access_token(access_token, access_token_secret).
# auth_url = auth1.get_authorization_url(signin_with_twitter=True)
# verifier = raw_input('PIN: ').strip()
# auth1.get_access_token(verifier)

file = open("C:/Python27/twitter/mydata6.csv",'wb') # save to csv file

# api.update_status('Updating using OAuth authentication via Tweepy!')

start = clock()
print start

'''Specify the stream'''
class StreamListenerChengjun(tweepy.StreamListener):
	def on_status(self, status):
		try:
			tweet = status.text.encode('utf-8')
			tweet = tweet.replace('\n', '\\n')
			# screen name and ids are read from tweepy's Status object
			user = status.author.screen_name.encode('utf-8')
			userid = status.author.id
			time = status.created_at
			source = status.source
			tweetid = status.id
			timePass = clock()-start
			# clock() returns a float, so this exact-multiple check rarely fires
			if timePass%60==0:
				print "I have been working for", timePass, "seconds."
			if not ('RT @' in tweet) :	# Exclude re-tweets
				print >>file, "%s,%s,%s,%s,|%s|,%s" % (userid, user, time, tweetid, tweet, source)

		except Exception, e:
			print >> sys.stderr, 'Encountered Exception:', e
	def on_error(self, status_code):
		print 'Error: ' + repr(status_code)
		return True # False to stop
	def on_delete(self, status_id, user_id):
		"""Called when a delete notice arrives for a status"""
		print "Delete notice for %s. %s" % (status_id, user_id)
	def on_limit(self, track):
		"""Called when a limitation notice arrives"""
		print "!!! Limitation notice received: %s" % str(track)
	def on_timeout(self):
		print >> sys.stderr, 'Timeout...'
		return True

'''Link the tube with tweet stream'''
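streamTube = tweepy.Stream(auth=auth1, listener=StreamListenerChengjun(), timeout=300)

# To track specific terms instead of taking a random sample:
# setTerms = ['good', 'goodbye', 'goodnight', 'good morning']
# streamTube.filter(track=setTerms)
streamTube.sample()  # random sample of public statuses

file.close()
timePass = clock()-start
print timePass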
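
The script above builds each CSV line by hand and wraps the tweet text in `|` characters, so a comma or pipe inside a tweet can still shift the columns. As a minimal sketch (written for Python 3, not part of the original script), the standard `csv` module can do the quoting instead. The helper name `write_tweet_row` and the sample values are hypothetical and only illustrate the idea.

```python
import csv
import io


def write_tweet_row(writer, userid, user, created_at, tweetid, text, source):
    """Write one tweet as a CSV row, replacing line breaks in the text with a literal \\n."""
    writer.writerow([userid, user, created_at, tweetid, text.replace("\n", "\\n"), source])


# An in-memory buffer stands in for the output file in this example.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
write_tweet_row(writer, 1, "alice", "2013-01-20 10:00:00", 42, 'Hello, "world"\nbye', "web")

# Reading the buffer back shows that the comma and quotes stay inside one field.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[0])
```

With this approach the file can be read back by any CSV reader without custom handling of the `|` delimiters.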

Understanding dyad in ERGM

Published by chengjun on October 20th, 2012

The terms used by ERGM are a bit difficult to understand.

I am working on predicting how friendships influence information diffusion, using Weibo landscape data with ERGM.

According to the statnet library:

With the “dyadcov” term, we add three statistics to the model, each equal to the sum of the covariate values for all dyads occupying one of the three possible non-empty dyad states (mutual, upper-triangular asymmetric, and lower-triangular asymmetric dyads, respectively).

So there are three kinds of non-empty dyads.

Checking Wasserman and Faust’s book Social Network Analysis, I found the figure above.

However, most of us don’t know the difference between upper-triangular asymmetric and lower-triangular asymmetric dyads.

Based on my understanding, it is related to the direction of ties; see the figure below (am I right?):
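
To make this reading concrete, here is a minimal Python sketch of the three sums. It is not statnet's implementation. It assumes that, for a pair i < j, "upper-triangular asymmetric" means only the tie i→j exists (the entry above the diagonal of the adjacency matrix) and "lower-triangular asymmetric" means only j→i exists, and that the covariate matrix is symmetric. The function name `dyadcov_stats` and the toy matrices are hypothetical.

```python
def dyadcov_stats(adj, cov):
    """Sum covariate values over mutual, upper-asymmetric and lower-asymmetric dyads."""
    n = len(adj)
    stats = {"mutual": 0.0, "upper_asym": 0.0, "lower_asym": 0.0}
    for i in range(n):
        for j in range(i + 1, n):  # visit each unordered pair once, with i < j
            upper_tie, lower_tie = adj[i][j], adj[j][i]
            if upper_tie and lower_tie:
                stats["mutual"] += cov[i][j]
            elif upper_tie:  # only i -> j: entry in the upper triangle
                stats["upper_asym"] += cov[i][j]
            elif lower_tie:  # only j -> i: entry in the lower triangle
                stats["lower_asym"] += cov[i][j]
    return stats


# Toy directed network: 0<->1 is mutual, 0->2 is upper, 2->1 is lower.
adj = [[0, 1, 1],
       [1, 0, 0],
       [0, 1, 0]]
cov = [[0, 2, 3],
       [2, 0, 5],
       [3, 5, 0]]
result = dyadcov_stats(adj, cov)
print(result)
```

Under these assumptions, which of the two asymmetric statistics a dyad contributes to depends only on the direction of its single tie relative to the node ordering.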

Estimating Threshold of Time Series Using R

Published by chengjun on March 20th, 2012

Please note that I am NOT an expert in time series analysis. Therefore, I am not the ideal person to answer technical questions on this topic. Please consider (1) raising your question on Stack Overflow, (2) emailing the developers of the related R packages, (3) joining the related mailing lists, etc.

Methods for estimating thresholds in time series data have been implemented in R. This post shows how to use them with two packages, urca and tsDyn. First, I would like to highlight Bruce Hansen’s work in this field.

Bruce E. Hansen

Programs — Threshold Models

“Inference when a nuisance parameter is not identified under the null hypothesis.” Econometrica, (1996).

“Inference in TAR models.” Studies in Nonlinear Dynamics and Econometrics, (1997).

“Threshold effects in non-dynamic panels: Estimation, testing and inference.” Journal of Econometrics, (1999).

“Testing for Linearity.” Journal of Economic Surveys, (1999).

“Sample splitting and threshold estimation.” Econometrica, (2000).

“Threshold Autoregression with a Unit Root.” Econometrica (2001), with Mehmet Caner.

“How responsive are private transfers to income? Evidence from a laissez-faire economy.” with Donald Cox and Emmanuel Jimenez, Journal of Public Economics, (2004), 88, 2193-2219.

“Testing for two-regime threshold cointegration in vector error correction models,” with Byeongseon Seo, Journal of Econometrics (2002).

“Instrumental Variable Estimation of a Threshold Model”, with Mehmet Caner, Econometric Theory, (2004), 20, 813-843.

# Chengjun WANG
# @anu, 20120320
# The first step is unit root and cointegration analysis (urca)
# install.packages("urca")
# load the package
library(urca)
# see the data which is used as an example
#~~~~~~~~~~~~~~~threshold model using tsDyn~~~~~~~~~~~~~~~~~~~~~~~#
# install.packages("tsDyn")
# load the package of tsDyn
library(tsDyn)

# models in this package
availableModels()
 [1] "linear"  "nnetTs"  "setar"   "lstar"   "star"    "aar"     "lineVar"
 [8] "VECM"    "TVAR"    "TVECM"

#fit an AAR model:
# mod <- ...
#Summary information:
#Diagnostic plots:

# STAR model fitting with automatic selection of the number
# of regimes based on LM tests.


# Estimate a multivariate Threshold VAR
?TVAR  # open the R help page for TVAR


TVAR(data, lag=2, nthresh=2, thDelay=1, trim=0.1, mTh=1, plot=TRUE)
TVAR.LRtest(data, lag=2, mTh=1,thDelay=1:2, nboot=3, plot=FALSE, trim=0.1, test="1vs")

# The one threshold (two regimes) search gives a value of 10.698 for the
# threshold and 1 for the delay. Conditional on these values, the search
# for a second threshold (three regimes) gives 8.129. Starting from these
# values, a full grid search finds the same values and confirms the first
# step estimation.

##simulate VAR as in Enders 2004, p 268
# B1 <- ...
# var1 <- ...
ts.plot(var1, type="l", col=c(1,2))

# B2 <- ...
# varcov <- ...
# var2 <- ...
ts.plot(var2, type="l", col=c(1,2))

##Simulation of a TVAR with 1 threshold
#estimate the new series
TVAR(sim, lag=1, dummyToBothRegimes=TRUE)

##Bootstrap a TVAR with two thresholds (three regimes)
# serie <- ...
TVAR.sim(data=serie, nthresh=2, type="boot", mTh=1, Thresh=c(7,9))

##Check the bootstrap
cbind(TVAR.sim(data=serie,nthresh=2, type="check",mTh=1, Thresh=c(7,9)),serie)

# Estimate a Threshold Vector Error Correction model (VECM)
# Hansen, B. and Seo, B. (2002), Testing for two-regime threshold cointegration in vector error-correction models, Journal of Econometrics, 110, pages 293 - 318
##Estimate a TVECM (we use here minimal grid, it should be usually much bigger!)


#Obtain diverse information:
#export the equations as LaTeX:
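
The comments in the listing above describe the threshold being found by a grid search. As a rough illustration of that idea only (this is not tsDyn's algorithm), the Python sketch below fits a two-regime, intercept-only model: each observation is assigned to a regime by comparing its lagged value with a candidate threshold, and the candidate with the smallest total sum of squared errors wins. The function names, the `trim` handling, and the toy series are all assumptions made for the example.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def estimate_threshold(y, delay=1, trim=0.1):
    """Grid-search one threshold on the lagged series; returns (threshold, total SSE)."""
    # Pair each observation with the lagged value that decides its regime.
    pairs = [(y[t - delay], y[t]) for t in range(delay, len(y))]
    candidates = sorted(set(z for z, _ in pairs))
    min_obs = max(1, int(trim * len(pairs)))  # keep a minimum share in each regime
    best = None
    for th in candidates:
        low = [v for z, v in pairs if z <= th]
        high = [v for z, v in pairs if z > th]
        if len(low) < min_obs or len(high) < min_obs:
            continue
        total = sse(low) + sse(high)
        if best is None or total < best[1]:
            best = (th, total)
    return best


# Toy series: values are high after a low observation and low after a high one.
series = [2, 8, 3, 9, 4, 7, 5, 8, 1, 9]
best = estimate_threshold(series)
print(best)
```

A real threshold autoregression would fit a regression within each regime rather than a mean, but the search over candidate thresholds follows the same pattern.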

Top 200 Twitter Users of Occupying Wall Street

Published by chengjun on December 17th, 2011
The centrality of users in the Occupy Wall Street discussion network is always changing. However, there is something in common: the people who are spoken to tend to be movement leaders, media, and politicians.