Posts Tagged ‘OWS’

Scraping New York Times & The Guardian using Python

Published by admin on April 23rd, 2012

I recently read the blog post about scraping New York Times articles with R, and it's great; I wanted to reproduce the work in Python.
First, read the documentation for the New York Times Article Search API:

http://developer.nytimes.com/docs/article_search_api/

Second, register an app and get the API key that will be used in the Python script:

http://developer.nytimes.com/apps/register
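
For reference, the parsed v1 response is a dictionary with 'total', 'tokens', and 'results' keys, which the script below relies on. A trimmed, illustrative shape (the field names come from the code below; the values are invented):

# Illustrative shape of the parsed v1 response (trimmed; values invented):
ts = {
    'total': 1234,                              # number of matching articles
    'tokens': ['occupy', 'wall', 'street'],     # the parsed query terms
    'results': [                                # up to 10 articles per offset page
        {'title': '...', 'url': '...', 'date': '20110918', 'body': '...'},
    ],
}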

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# Scraping New York Times using Python
# 20120421 @ Canberra
# chengjun wang

import json
import urllib2

'''
About the api and the key, see the links above.
'''

'''step 1: input query information'''
apiUrl='http://api.nytimes.com/svc/search/v1/article?format=json'
query='query=occupy+wall+street'                            # set the query word here
apiDate='begin_date=20110901&end_date=20120214'             # set the date here
fields='fields=body%2Curl%2Ctitle%2Cdate%2Cdes_facet%2Cdesk_facet%2Cbyline'  # %2C is a URL-encoded comma
offset='offset=0'
key='api-key=c2c5b91680.......2811165'  # input your key here

'''step 2: get the number of offset/pages'''
link=[apiUrl, query, apiDate, fields, offset, key]
ReqUrl='&'.join(link)
jstr = urllib2.urlopen(ReqUrl).read()
ts = json.loads(jstr)
number = ts['total']  # total number of matching articles
print number
seq = range((number + 9)/10)  # ceiling division: the v1 API returns 10 results per offset page
print seq

'''step 3: crawl the data and dump into csv'''
import csv
addressForSavingData = "D:/Research/Dropbox/tweets/wapor_assessing online opinion/News coverage of ows/nyt.csv"
f = open(addressForSavingData, 'wb')  # save to a csv file
w = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
for i in seq:
    offsets = 'offset=' + str(i)  # printing the request URL is a handy way to debug
    links = [apiUrl, query, apiDate, fields, offsets, key]
    ReqUrls = '&'.join(links)
    print "*_____________*", ReqUrls
    jstrs = urllib2.urlopen(ReqUrls).read()
    t = jstrs.strip('()')  # strip stray parentheses, or json.loads raises "No JSON object could be decoded"
    tss = json.loads(t)
    result = tss['results']
    for ob in result:
        title = ob['title']  # available fields: body, url, title, date, des_facet, desk_facet, byline
        print title
        url = ob['url']
        date = ob['date']  # byline is skipped here: some articles have no author name
        w.writerow((date, title, url))  # write one row per article
f.close()

See the result: the script writes a CSV with one row per article (date, title, URL).
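
The original comments admit the paging arithmetic was shaky, and the loop has no error handling at all. A slightly more defensive variant of the fetch loop, sketched below and reusing the variables defined in the script above, retries transient HTTP errors and pauses between requests:

import json
import math
import time
import urllib2

def fetch_page(req_url, retries=3, pause=1.0):
    """Fetch one API page as parsed JSON, retrying transient HTTP errors."""
    for attempt in range(retries):
        try:
            return json.loads(urllib2.urlopen(req_url).read().strip('()'))
        except urllib2.HTTPError, e:
            print 'HTTP error', e.code, '- retrying'
            time.sleep(pause * (attempt + 1))  # back off a little more each time
    return None  # give up on this page after all retries fail

all_results = []
pages = int(math.ceil(number / 10.0))  # 10 results per offset page
for i in range(pages):
    data = fetch_page('&'.join([apiUrl, query, apiDate, fields, 'offset=%d' % i, key]))
    if data is not None:
        all_results.extend(data['results'])
    time.sleep(0.5)  # be polite to the API between requests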

Similarly, you can crawl article data from The Guardian Content API. See the link below.

http://explorer.content.guardianapis.com/#/?format=json&order-by=newest

After you have registered your app and obtained the key, you can work on the Python script.
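
For reference, the parsed Guardian response nests everything under a 'response' key. A trimmed, illustrative shape (the field names come from the code below; the values are invented):

# Illustrative shape of the parsed Guardian response (trimmed; values invented):
ts = {
    'response': {
        'total': 567,  # number of matching articles
        'results': [   # page-size articles per page
            {
                'webTitle': 'Occupy Wall Street protests continue',
                'sectionName': 'World news',
                'webUrl': 'http://www.guardian.co.uk/world/...',
                'fields': {'newspaperEditionDate': '2011-09-25'},
            },
        ],
    },
}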


#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# Scraping The Guardian using Python
# 20120421 @ Canberra
# chengjun wang

import json
import urllib2

'''
http://content.guardianapis.com/search?q=occupy+wall+street&from-date=2011-09-01&to-date=2012-02-14&page=2
&page-size=3&format=json&show-fields=all&use-date=newspaper-edition&api-key=m....g33gzq
'''

'''step 1: input query information'''
apiUrl='http://content.guardianapis.com/search?q=occupy+wall+street' # set the query word here
apiDate='from-date=2011-09-01&to-date=2011-10-14'                     # set the date here
apiPage='page=2'      # any valid page works here; this request is only used to get the total
apiNum=10             # set the number of articles in one page
apiPageSize=''.join(['page-size=',str(apiNum)])
fields='format=json&show-fields=all&use-date=newspaper-edition'
key='api-key=mudfuj...g33gzq'  # input your key here

'''step 2: get the number of offset/pages'''
link=[apiUrl, apiDate, apiPage, apiPageSize, fields, key]
ReqUrl='&'.join(link)
jstr = urllib2.urlopen(ReqUrl).read()
ts = json.loads(jstr)
number = ts['response']['total']  # total number of matching articles
print number
seq = range((number + apiNum - 1)/apiNum)  # ceiling division: apiNum results per page
print seq

'''step 3: crawl the data and dump into csv'''
import csv
addressForSavingData = "D:/Research/Dropbox/tweets/wapor_assessing online opinion/News coverage of ows/guardian.csv"
f = open(addressForSavingData, 'wb')  # save to a csv file
w = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
for i in seq:
    apiPages = 'page=' + str(i + 1)  # Guardian pages are 1-based; printing the URL helps debugging
    links = [apiUrl, apiDate, apiPages, apiPageSize, fields, key]
    ReqUrls = '&'.join(links)
    print "*_____________*", ReqUrls
    jstrs = urllib2.urlopen(ReqUrls).read()
    t = jstrs.strip('()')  # strip stray parentheses before parsing the JSON
    tss = json.loads(t)
    result = tss['response']['results']
    for ob in result:
        title = ob['webTitle'].encode('utf-8')
        print title
        section = ob["sectionName"].encode('utf-8')
        url = ob['webUrl']
        date = ob['fields']['newspaperEditionDate']  # alternatively: ob['webPublicationDate']
        w.writerow((date, title, section, url))  # write one row per article
f.close()
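
The two scripts differ only in the URL parameters, the JSON nesting, and the fields written out. If you run them often, it may be worth factoring the shared loop into a helper. A sketch under the same Python 2 assumptions; crawl and its arguments are my own invention, not part of either API:

import csv
import json
import urllib2

def crawl(base_parts, page_param, pages, extract, csv_path):
    """Fetch each page, extract rows with the supplied function, write a CSV."""
    f = open(csv_path, 'wb')
    w = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for i in pages:
        req_url = '&'.join(base_parts + ['%s=%d' % (page_param, i)])
        data = json.loads(urllib2.urlopen(req_url).read().strip('()'))
        for row in extract(data):
            w.writerow(row)
    f.close()

# Guardian: pages are 1-based and results sit under response/results
crawl([apiUrl, apiDate, apiPageSize, fields, key], 'page', range(1, len(seq) + 1),
      lambda d: [(ob['fields']['newspaperEditionDate'],
                  ob['webTitle'].encode('utf-8'),
                  ob['sectionName'].encode('utf-8'),
                  ob['webUrl']) for ob in d['response']['results']],
      addressForSavingData)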


Top 200 Twitter Users of Occupying Wall Street

Published by chengjun on December 17th, 2011
The centrality of the Occupy Wall Street discussion network is always changing. However, some things stay constant: the people who are spoken to tend to be movement leaders, media outlets, and politicians.
_____________________________________________________________

Occupying Wall Street Powered by Sentiment on Twitter

Published by chengjun on December 10th, 2011

Chengjun WANG  @ City University of Hong Kong

Key Words: Online Social Movement, Mass Media, Sentiment, Twitter

Introduction

“Occupy Wall Street (OWS) is an ongoing series of demonstrations initiated by the Canadian activist group Adbusters which began September 17, 2011 in Zuccotti Park, located in New York City’s Wall Street financial district. The protests are against social and economic inequality, high unemployment, greed, as well as corruption, and the undue influence of corporations—particularly from the financial services sector—on government. The protesters’ slogan We are the 99% refers to the growing difference in wealth in the U.S. between the wealthiest 1% and the rest of the population.” (cited from Wikipedia)

______________________________________________________________

Online Social Movement

I study it as an online social movement and aim to uncover the structural pattern of the interactions.

Data

The data were collected by R-shief and opened to the world upon registration. The dataset contains 1,353,413 tweets posted between September 24 and October 10, 2011, of which 88,601 are interactions.

I have been working on this dataset and have found interesting temporal patterns in tweeting and retweeting behavior, in terms of both action and interaction.

______________________________________________________________

3,923 Twitter users spoke to others on September 24. Apparently, there is a clear core-periphery structure.

The power-law distribution above confirms this proposition.
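
For readers who want to check such a claim themselves, a common quick look is the in-degree distribution plotted on log-log axes, where a rough power law shows up as an approximately straight line. A sketch with matplotlib, assuming a list of (speaker, addressee) pairs called edges, which is my own placeholder for the mention data:

from collections import Counter
import matplotlib.pyplot as plt

# edges: hypothetical list of (speaker, addressee) pairs from the mention network
in_degree = Counter(addressee for speaker, addressee in edges)
degree_counts = Counter(in_degree.values())  # how many users have each in-degree

degrees = sorted(degree_counts)
counts = [degree_counts[d] for d in degrees]

plt.loglog(degrees, counts, 'o')  # a straight line here suggests a power law
plt.xlabel('in-degree (times spoken to)')
plt.ylabel('number of users')
plt.title('In-degree distribution of the OWS mention network')
plt.show()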