
Estimating Threshold of Time Series Using R

Published by chengjun on March 20th, 2012

Please note that I am NOT an expert in time series analysis. Therefore, I am not the ideal person to answer the technical questions on this topic. Please consider (1) raising your question on stackoverflow, (2) sending emails to the developer of related R packages, (3) joining related email groups, etc.

Methods for estimating thresholds in time series data have been implemented in R. This post shows how to use them via two packages, urca and tsDyn. First, I would like to highlight Bruce Hansen’s work in this field.

Bruce E. Hansen

Programs — Threshold Models

“Inference when a nuisance parameter is not identified under the null hypothesis.” Econometrica (1996).

“Inference in TAR models.” Studies in Nonlinear Dynamics and Econometrics (1997).

“Threshold effects in non-dynamic panels: Estimation, testing and inference.” Journal of Econometrics (1999).

“Testing for Linearity.” Journal of Economic Surveys (1999).

“Sample splitting and threshold estimation.” Econometrica (2000).

“Threshold Autoregression with a Unit Root.” with Mehmet Caner, Econometrica (2001).

“How responsive are private transfers to income? Evidence from a laissez-faire economy.” with Donald Cox and Emmanuel Jimenez, Journal of Public Economics (2004), 88, 2193-2219.

“Testing for two-regime threshold cointegration in vector error correction models.” with Byeongseon Seo, Journal of Econometrics (2002).

“Instrumental Variable Estimation of a Threshold Model.” with Mehmet Caner, Econometric Theory (2004), 20, 813-843.

# Chengjun WANG
# @anu, 20120320
# The first step is unit root and cointegration analysis (urca)
# install.packages("urca")
library(urca)    # load the package
data(denmark)    # the data set used as an example
head(denmark)    # see the data
#~~~~~~~~~~~~~~~threshold model using tsDyn~~~~~~~~~~~~~~~~~~~~~~~#
# install.packages("tsDyn")
library(tsDyn)   # load the package tsDyn

# models in this package
availableModels()
# [1] "linear"  "nnetTs"  "setar"   "lstar"   "star"    "aar"     "lineVar"
# [8] "VECM"    "TVAR"    "TVECM"

# fit an AAR model (the classic lynx series is used here as an example):
mod <- aar(log10(lynx), m = 3)
# summary data information:
summary(mod)
# diagnostic plots:
plot(mod)

# STAR model fitting with automatic selection of the number
# of regimes based on LM tests:
mod.star <- star(log10(lynx), mTh = c(0, 1), control = list(maxit = 3000))


# Estimate a multivariate Threshold VAR
?TVAR  # ask R for the help page of TVAR

data(zeroyld)   # bivariate term-structure data shipped with tsDyn
data <- zeroyld
TVAR(data, lag=2, nthresh=2, thDelay=1, trim=0.1, mTh=1, plot=TRUE)
TVAR.LRtest(data, lag=2, mTh=1, thDelay=1:2, nboot=3, plot=FALSE, trim=0.1, test="1vs")

# One threshold (two regimes) gives a value of 10.698 for the
# threshold and 1 for the delay. Conditional on these values, the search
# for a second threshold (three regimes) gives 8.129. Starting from these
# values, a full grid search finds the same values and confirms the first
# step estimation.

## simulate a VAR as in Enders 2004, p. 268
## (coefficient values below are illustrative)
B1 <- matrix(c(0.7, 0.2, 0.2, 0.7), 2)
var1 <- TVAR.sim(B=B1, nthresh=0, n=100, include="none")
ts.plot(var1, type="l", col=c(1,2))

B2 <- rbind(c(0.5, 0.5, 0.5), c(0, 0.5, 0.5))
varcov <- matrix(c(0.1, 0.2, 0.3, 0.1), 2)
var2 <- TVAR.sim(B=B2, nthresh=0, n=100, include="const", varcov=varcov)
ts.plot(var2, type="l", col=c(1,2))

## Simulation of a TVAR with 1 threshold
## (B now stacks the coefficients of both regimes)
sim <- TVAR.sim(B=cbind(B2, B2), nthresh=1, n=500, mTh=1, Thresh=2.5)
# estimate the new series
TVAR(sim, lag=1, dummyToBothRegimes=TRUE)

## Bootstrap a TVAR with two thresholds (three regimes)
data(zeroyld)
serie <- zeroyld
TVAR.sim(data=serie, nthresh=2, type="boot", mTh=1, Thresh=c(7,9))

## Check the bootstrap
cbind(TVAR.sim(data=serie, nthresh=2, type="check", mTh=1, Thresh=c(7,9)), serie)

# Estimate a Threshold Vector Error Correction Model (TVECM)
# Hansen, B. and Seo, B. (2002), "Testing for two-regime threshold cointegration
# in vector error-correction models", Journal of Econometrics, 110, 293-318.
## Estimate a TVECM (a minimal grid is used here; it should usually be much bigger!)
data(zeroyld)
tvecm <- TVECM(zeroyld, nthresh=2, lag=1, ngridBeta=20, ngridTh=30, plot=TRUE, trim=0.05)

# obtain diverse information:
summary(tvecm)
# export the equations as LaTeX:
toLatex(tvecm)

Learning To Do Sentiment Analysis Using Python & NLTK

Published by admin on March 18th, 2012

This is my first try at sentiment analysis using Python. Glad to know NLTK can distinguish ‘like’ and ‘not like’. It’s great. I wonder how it compares with R.

The method below follows the procedure demonstrated in Laurent Luce’s original post.

original author: Laurent Luce

The model uses a naive Bayes classifier.

# Twitter sentiment analysis using Python and NLTK
# original author: Laurent Luce
# Reproduced by chengjun wang to test the validity
# 20120319@Canberra

# find the original post by Laurent Luce following the link below:

import nltk

pos_tweets = [('I love this car', 'positive'),
	('This view is amazing', 'positive'),
	('I feel great this morning', 'positive'),
	('I am so excited about the concert', 'positive'),
	('He is my best friend', 'positive')]

neg_tweets = [('I do not like this car', 'negative'),
	('This view is horrible', 'negative'),
	('I feel tired this morning', 'negative'),
	('I am not looking forward to the concert', 'negative'),
	('He is my enemy', 'negative')]

tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
	words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
	tweets.append((words_filtered, sentiment))

# print tweets
# print to see the result

test_tweets = [
	(['feel', 'happy', 'this', 'morning'], 'positive'),
	(['larry', 'friend'], 'positive'),
	(['not', 'like', 'that', 'man'], 'negative'),
	(['house', 'not', 'great'], 'negative'),
	(['your', 'song', 'annoying'], 'negative')]

# print test_tweets

# The list of word features needs to be extracted from the tweets. It is a list of every distinct
# word ordered by frequency of appearance. We use the following function to get the list, plus the
# two helper functions.

def get_words_in_tweets(tweets):
	all_words = []
	for (words, sentiment) in tweets:
		all_words.extend(words)
	return all_words

def get_word_features(wordlist):
	wordlist = nltk.FreqDist(wordlist)
	word_features = wordlist.keys()
	return word_features
# what does word_features do?
word_features = get_word_features(get_words_in_tweets(tweets))
# print word_features
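The frequency ordering done by nltk.FreqDist can be sketched in plain Python with collections.Counter; this is a stand-in for the helper above, not the NLTK implementation itself:

```python
from collections import Counter

# Plain-Python sketch of what get_word_features does: Counter.most_common()
# returns the distinct words ordered by frequency of appearance.
def get_word_features_sketch(wordlist):
    return [word for word, count in Counter(wordlist).most_common()]

print(get_word_features_sketch(['this', 'car', 'this', 'view']))
# 'this' comes first because it appears twice
```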

# To create a classifier, we need to decide what features are relevant. To do that, we first need a
# feature extractor. The one we are going to use returns a dictionary indicating what words are
# contained in the input passed. Here, the input is the tweet. We use the word features list defined
# above along with the input to create the dictionary.

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
      features['contains(%s)' % word] = (word in document_words)
    return features

# call the feature extractor with the document ['love', 'this', 'car']
# document=['love', 'this', 'car']
# features = extract_features(document)
# print features
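To see concretely what the extractor returns, here is a self-contained sketch with a toy word-feature list (an illustrative subset, not the full list built from the training tweets):

```python
# Toy word-feature list; the real one is built from the training tweets above.
toy_word_features = ['love', 'this', 'car', 'view', 'amazing']

def extract_features_demo(document):
    document_words = set(document)
    features = {}
    for word in toy_word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print(extract_features_demo(['love', 'this', 'car']))
# the three words present map to True, the other two to False
```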

training_set = nltk.classify.util.apply_features(extract_features, tweets)
# print training_set
# be careful here, it should be nltk.classify.util.apply_features rather than nltk.classify.apply_features
# apply the features to our classifier using the method apply_features.
# We pass the feature extractor along with the tweets list defined above.

# The variable training_set contains the labeled feature sets. It is a list of tuples which each tuple
# containing the feature dictionary and the sentiment string for each tweet. The sentiment string is
# also called label.
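apply_features builds its list lazily, but its effect can be sketched eagerly in plain Python (a sketch of the behavior, not the NLTK implementation):

```python
# Eager sketch of apply_features: map the feature extractor over each
# (words, label) pair, keeping the label alongside the feature dictionary.
def apply_features_sketch(extractor, labeled_docs):
    return [(extractor(words), label) for (words, label) in labeled_docs]

docs = [(['love', 'this', 'car'], 'positive'),
        (['not', 'like', 'this', 'car'], 'negative')]
labeled = apply_features_sketch(lambda ws: {'contains(love)': 'love' in ws}, docs)
print(labeled)
# [({'contains(love)': True}, 'positive'), ({'contains(love)': False}, 'negative')]
```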

classifier = nltk.NaiveBayesClassifier.train(training_set)
# look inside the classifier train method in the source code of the NLTK library

def train(labeled_featuresets, estimator=nltk.probability.ELEProbDist):
    # (abridged excerpt from the NLTK source; intermediate steps omitted)
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    return NaiveBayesClassifier(label_probdist, feature_probdist)

# print label_probdist.prob('positive')
# print label_probdist.prob('negative')

# print feature_probdist
# print feature_probdist[('negative', 'contains(best)')].prob(True)

# print classifier.show_most_informative_features(32)
# show_most_informative_features

tweet = 'Larry is not my friend'
# print classifier.classify(extract_features(tweet.split()))

# take a look at how the classify method works internally in the NLTK library. What we pass to the classify method is the feature set of
# the tweet we want to analyze. The feature set dictionary indicates that the tweet contains the word "friend".
print extract_features(tweet.split()), '\n'

# def classify(self, featureset):
    # Discard any feature names that we've never seen before.
    # Find the log probability of each label, given the features.
	# {'positive': -1.0, 'negative': -1.0}
	# Then add in the log probability of features given labels.
	# {'positive': -5.4785441837188511, 'negative': -14.784261334886439}
    # Generate a probability distribution dictionary using the dict logprod
	# DictionaryProbDist(logprob, normalize=True, log=True)
    # Return the sample with the greatest probability from the probability
    # distribution dictionary
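The steps in those comments can be sketched as a tiny stand-alone decision rule (the probabilities below are illustrative, not taken from the trained model):

```python
import math

# Start from the log prior of each label, add the log likelihood of each
# feature value given that label, and return the label with the highest total.
log_prior = {'positive': math.log(0.5), 'negative': math.log(0.5)}
log_likelihood = {
    ('positive', 'contains(friend)', True): math.log(0.4),
    ('negative', 'contains(friend)', True): math.log(0.1),
}

def classify_sketch(featureset):
    scores = {}
    for label, prior in log_prior.items():
        total = prior
        for fname, fval in featureset.items():
            # unseen (label, feature, value) triples get a neutral log(0.5)
            total += log_likelihood.get((label, fname, fval), math.log(0.5))
        scores[label] = total
    return max(scores, key=scores.get)

print(classify_sketch({'contains(friend)': True}))  # positive
```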

'''Take the following test tweet: 'Your song is annoying'. The classifier thinks it is positive.
The reason is that we don't have any information on the feature name "annoying".
The larger the training sample of tweets, the better the classifier will be.'''

tweet = 'Your song is annoying'
print classifier.classify(extract_features(tweet.split()))


How to Get Standard Regression Coefficient Using R

Published by admin on March 18th, 2012

Last week, Lexing showed us a picture of what different programming languages look like. Very interesting. However, let’s dive into the question raised in the title of this post: how to get standardized regression coefficients using R? (In addition, I want to practice using SyntaxHighlighter here.)

# e.g. I set up one regression
# (the original regression used my own data, with a predictor named feo;
# the built-in LifeCycleSavings data set is used here so the code runs as-is)
lm.out <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
summary(lm.out)
# model diagnosis
par(mfrow = c(2, 2)); plot(lm.out)

# according to the formula, the standardized coefficient is b * sd(x) / sd(y)
# calculate the standardized regression coefficients one by one
# for the beta of pop15:
coef(lm.out)["pop15"] * sd(LifeCycleSavings$pop15) / sd(LifeCycleSavings$sr)
# it's dull to do it this way!

# output standardized coefficients using the QuantPsyc library
library(QuantPsyc)  # install.packages("QuantPsyc")
lm.beta(lm.out)

Apparently, R is not well designed in this respect: we can get what we want, but it is not as efficient and convenient as some commercial software. I hope this will improve in the future.
