Archive for the ‘Data analysis’ Category

Analyzing the dynamics of cities using nighttime satellite images

Saturday, December 22nd, 2012


The book Fractal Urban System triggered my interest in analyzing urban systems. Lacking access to the kind of data sets used in such studies, I decided to use the satellite images released by NASA. In this study, I found that human activities (indicated by the lightness of the studied area) increase faster than city size.


1. The central place model of Shanghai

W. Christaller (1933) suggested that cities generally conform to the “central place model”, i.e., most variables concerning human activities are distributed more sparsely in the remote zones than at the center of cities. This pattern can be quantified by a negative relationship between a population-related variable l and the distance from the center r. Alternatively, by integrating both variables over r, we get the relationship between L and A, total activity and area.

Figure 1

I downloaded a night satellite photo of Shanghai from NASA, cropped the photo to exclude Suzhou (a city close to Shanghai), and then changed the color scheme of the photo from RGB to gray scale. The former scheme expresses every pixel with three numbers varying from 0 to 255, whereas the latter uses only one number between 0 and 1 (the brighter the pixel, the greater the value). This transformation leads to a loss of information, but it is necessary for our analysis, because we are only concerned with brightness, which indicates the strength of human activities.
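The RGB-to-gray conversion can be sketched in a few lines of plain Python. This is only an illustration: the post does not say which conversion formula was used, so the channel weights below (the common ITU-R BT.601 luma weights) and the function names are assumptions.

```python
def rgb_to_gray(pixel):
    """Map an (R, G, B) pixel with channels in 0..255 to a brightness in 0..1."""
    r, g, b = pixel
    # Assumed weights (ITU-R BT.601 luma); the original post does not name a formula.
    return (0.299 * r + 0.587 * g + 0.114 * b) / 255.0


def image_to_gray(rgb_image):
    """Convert a nested list of RGB pixels into a nested list of brightness values."""
    return [[rgb_to_gray(p) for p in row] for row in rgb_image]


# Toy 2x2 "photo": black, white, pure red, mid-gray.
toy = [[(0, 0, 0), (255, 255, 255)],
       [(255, 0, 0), (128, 128, 128)]]
gray = image_to_gray(toy)
```

Each pixel collapses from three numbers to one, which is the loss of information mentioned above.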

With brightness data at hand, we now want to find the “center” of Shanghai to examine the aforementioned model. The CBD (yellow point in Figure 1) is apparently a good candidate. For comparison I also calculated the “center of mass” of Shanghai (red point in Figure 1). In doing this I assume that the city is a multi-particle system, with the mass of each point denoted by the lightness of the pixel.
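A minimal sketch of the “center of mass” calculation, assuming the gray-scale image is stored as a nested list of brightness values. The function name and the toy image are hypothetical.

```python
def center_of_mass(gray):
    """Return the brightness-weighted (row, col) centroid of a grayscale image."""
    total = 0.0
    row_moment = 0.0
    col_moment = 0.0
    for i, row in enumerate(gray):
        for j, value in enumerate(row):
            # Each pixel is a "particle" whose mass is its lightness.
            total += value
            row_moment += i * value
            col_moment += j * value
    if total == 0:
        raise ValueError("image is completely dark; centroid is undefined")
    return row_moment / total, col_moment / total


# Toy image with two equally bright pixels side by side.
toy = [[0.0, 0.0, 0.0],
       [0.0, 1.0, 1.0],
       [0.0, 0.0, 0.0]]
center = center_of_mass(toy)
```

For the toy image the centroid falls halfway between the two lit pixels.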

Figure 2

Now we can examine the relation between L and A. Specifically, for a light point i, I calculated ri, its Euclidean distance from the center. Ai is then the square of ri, while Li is the sum of the lightness of the points closer to the center. In Figure 2, (d) shows the negative relationship between r and l, so the central place model is confirmed. (a) and (b) show the sub-linear increase of A with L measured from the two versions of the city center, respectively; both are plotted on log-log axes. (c) combines (a) and (b) and displays the two data sets together on linear axes.
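The construction of the (A, L) pairs could look roughly like the sketch below. It assumes a nested-list gray-scale image and a (row, col) center. Pixels at exactly the same distance are accumulated in sort order, which is a simplification, and all names are hypothetical.

```python
import math


def cumulative_profile(gray, center):
    """Return a list of (A_i, L_i) pairs for the lit pixels, ordered by distance.

    A_i is r_i squared; L_i is the running sum of lightness over the lit
    pixels sorted by increasing distance from the center, up to and
    including pixel i.
    """
    ci, cj = center
    lit = []
    for i, row in enumerate(gray):
        for j, value in enumerate(row):
            if value > 0:
                r = math.hypot(i - ci, j - cj)  # Euclidean distance to the center
                lit.append((r, value))
    lit.sort()
    profile = []
    running = 0.0
    for r, value in lit:
        running += value
        profile.append((r ** 2, running))
    return profile


# One-row toy image whose brightness decays away from the center at (0, 0).
toy = [[1.0, 0.5, 0.25]]
profile = cumulative_profile(toy, (0, 0))
```

Plotting the second element of each pair against the first on log-log axes gives the kind of curve shown in Figure 2.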


2. The allometric growth of U.S. cities

Bettencourt et al. (2007) investigated the allometric growth pattern of cities. Allometric means non-linear and usually refers to power-law relationships. They suggested that most variables related to human activities scale with population across cities of different sizes. Here I examine this assumption using the relationship between S and L.

Figure 3

First of all, we need a photo that includes many cities. I downloaded a photo of the U.S. (Figure 3a) and converted it to gray scale (Figure 3b). Next, I introduced a parameter k between 0 and 1: when the brightness of a pixel is less than k, I “turned it off” by setting its gray-scale value to 0. Figure 3c demonstrates the case k = 0.5. Then I defined S as the number of pixels within each disjoint cluster and L as the sum of lightness over all pixels within that cluster. Figure 3d shows that when k = 0.5 there are 1322 clusters in total.
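The thresholding and cluster-counting step can be sketched as a flood fill. The post does not say how neighbouring pixels were joined into clusters, so the 4-connected neighbourhood used here is an assumption, as are the function name and toy image.

```python
from collections import deque


def find_clusters(gray, k):
    """Threshold at k, then return a list of (S, L) pairs, one per cluster."""
    rows, cols = len(gray), len(gray[0])
    # Pixels dimmer than k are "turned off" (set to 0).
    lit = [[v if v >= k else 0.0 for v in row] for row in gray]
    seen = [[False] * cols for _ in range(rows)]
    clusters = []
    for i in range(rows):
        for j in range(cols):
            if lit[i][j] == 0.0 or seen[i][j]:
                continue
            # Breadth-first flood fill over 4-connected neighbours (an assumption).
            size, light = 0, 0.0
            queue = deque([(i, j)])
            seen[i][j] = True
            while queue:
                a, b = queue.popleft()
                size += 1
                light += lit[a][b]
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    x, y = a + da, b + db
                    if (0 <= x < rows and 0 <= y < cols
                            and lit[x][y] > 0 and not seen[x][y]):
                        seen[x][y] = True
                        queue.append((x, y))
            clusters.append((size, light))
    return clusters


toy = [[0.9, 0.8, 0.1, 0.0],
       [0.2, 0.1, 0.0, 0.7],
       [0.0, 0.0, 0.6, 0.3]]
clusters = find_clusters(toy, 0.5)
```

With k = 0.5 the toy image yields three clusters; the two pixels that touch only diagonally are counted separately under the 4-connected assumption.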

Figure 4

I found that when k = 0.5, there is a scaling relationship between S and L, as shown in Figure 4b. The slope of the OLS regression line fitted on log-log axes is 1.07, with an r-squared of 0.99. This is a strong pattern across almost all U.S. cities (or urban clusters). It means that (1) human activity grows faster than city size, and (2) large cities are scaled versions of small cities. In a previous study on the accelerating growth of online tagging communities, I proposed to explain this super-linear scaling pattern by a power-law distribution of human activities with a scaling exponent smaller than 2. Here I found that my theory also fits the city data, as shown in Figure 4a.
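For readers who want to reproduce this kind of fit, here is a minimal sketch of an ordinary least-squares regression on log-log axes. It regresses log L on log S, which is my reading of the text; the function name and the synthetic data are hypothetical and are not the satellite measurements.

```python
import math


def fit_power_law(sizes, lights):
    """Least-squares fit of log(L) = gamma * log(S) + c; returns (gamma, r_squared)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in lights]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                     # slope of the regression line
    r_squared = sxy ** 2 / (sxx * syy)    # squared correlation coefficient
    return gamma, r_squared


# Synthetic clusters that follow L = 2 * S^1.07 exactly.
sizes = [1, 10, 100, 1000]
lights = [2 * s ** 1.07 for s in sizes]
gamma, r_squared = fit_power_law(sizes, lights)
```

A slope above 1 on these axes is what the text calls super-linear scaling.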


3. The universal scaling pattern across cities worldwide

Figure 5

Obviously, the value of the allometric exponent might be affected by k: if k changes, the scaling relationship is likely to change as well. Therefore, I investigated the relationship between k and the exponent gamma across three systems: the world, the United States, and Shanghai. Figure 5 shows that the number of clusters (shown in the lower-left corner of each panel) generally decreases as k increases. The values of k are 0.1, 0.3, 0.6, and 0.9 in the successive columns of the figure, respectively.

Figure 6

I found that, as shown in Figure 6b, (1) the value of gamma is always greater than 1 (all data points have an r-squared greater than 0.98), and (2) the value of gamma decreases as k increases. I also investigated the distribution of pixel lightness across the three systems and found that the pixels in both the worldwide and U.S. photos conform to a power-law distribution.


4. A simple network model of the growth of cities

Figure 7

In the field of systems science, there is a well-known saying that if one truly “understands” a system, one should be able to replicate it. It turns out that the observed patterns can be generated by a very simple network model. I used NetLogo scripts to set up an agent-based model that leads to scaling relationships both in space (Figure 7) and in time (Figure 8).

Figure 8

Clickstreams: the flow of collective attention on the Web

Sunday, March 11th, 2012

Real-world complex systems are often maintained by “flow”. Food webs are supplied by energy flow, and the worldwide trading network grows at the expense of cash flow. What, then, does the Web live on? Attention flow.

The fundamental behavior of users in the virtual world is clicking. An individual clickstream is composed of a series of clicked URLs, denoting a flow of attention from one information resource to another. Every second, millions of users are clicking in the virtual world, leading to an attention-flow network with information resources as nodes and flows of attention as edges.
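One way to picture the step from individual clickstreams to a collective network is the small sketch below. It treats each clickstream as an ordered list of resources and counts how often each consecutive pair occurs; the function name and the example domains are hypothetical.

```python
from collections import Counter


def build_attention_network(clickstreams):
    """Aggregate individual clickstreams into weighted directed edges."""
    edges = Counter()
    for stream in clickstreams:
        # Each consecutive pair of clicks is one unit of attention flow.
        for source, target in zip(stream, stream[1:]):
            edges[(source, target)] += 1
    return edges


streams = [
    ["a.example", "b.example", "c.example"],
    ["a.example", "b.example"],
    ["c.example", "a.example"],
]
network = build_attention_network(streams)
```

The keys of the resulting counter are the directed edges and the values are the amount of attention flowing along them.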


Figure 1. An individual clickstream (left) and a collective clickstream network on Digg.

The Web as an attention engine

Sunday, March 11th, 2012

We can view the Web, or any of its sub-systems, as an “attention engine” that performs the conversion of human attention to information. Inspired by the ideal heat engine proposed by S. Carnot [1], I am working on a theoretical model of “attention engine”. This proposal describes the progress made so far and the work left to be done.

This research plan has three parts. Firstly, data describing the flow of attention online is collected. Then an ideal model is proposed to explain how attention is converted to information by the Web. Finally, the theoretical maximum efficiency of the conversion is calculated.


1. Describing the flow of attention on the Web


Figure 1. Attention flow on the Web. Websites are represented by nodes, whose colors correspond to the flow hierarchies and sizes correspond to the traffic. The clickstreams are shown in directed edges.


The Web converts attention flow to information, just as heat engines convert heat flow to mechanical work. Clickstreams arising from successive, collective clicks show the navigation of users from one website to another and thus indicate the flow of attention on the Web.

In a recent study [2], I collected data and constructed a clickstream network comprising 980 websites (accounting for 97% of global Internet traffic) and 12,008 clickstreams. A hierarchical structure of websites was found, illustrating the flow of attention from large, user-generated content (UGC) websites (e.g., Facebook and YouTube) to small, non-UGC websites (Figure 1).


2. Understanding the conversion from human attention to information

Figure 2. The Carnot cycles in human brain and on the Web.


Two types of “attention engines” are involved in user navigation. One uses the human brain as its “working material” and the other uses artificial intelligence (Figure 2). In the former, a user’s exposure to peer attention enhances his or her willingness to express, leading to the production of information. In the latter, the Web, as a collection of artificial intelligence programs, leverages the information in users’ navigation to solve users’ problems. These two models explain the success of UGC websites, in which I found that peer production facilitated by an intelligent system leads to an accelerating growth of information [3] [4].

In the next step, the two types of “attention engine” will be investigated empirically using the datasets I collected. The datasets include log files from Wikipedia, Flickr, and Delicious, each of which contains millions of user activities spanning several years.


3. Calculating the maximum efficiency of the Web


The maximum efficiency of the Web as an “attention engine” can be given as

g =   Iin / Iout                         (1)

in which Iout is the information output and Iin the information input. To compute the information input/output generated by clicks, we consider the uncertainty they reduce. Every click is a choice made from n information resources and thus generates log2(n) bits of information. In this way, we can compare the theoretical and empirical efficiency of the Web, or any of its subsystems.
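The log2(n) rule can be illustrated with a short sketch. It assumes every one of the n resources is equally likely to be clicked, which is the condition under which a single choice carries exactly log2(n) bits; the function names and the example numbers are hypothetical.

```python
import math


def click_information(n_choices):
    """Bits of information produced by one click among n equally likely resources."""
    if n_choices < 1:
        raise ValueError("a click needs at least one resource to choose from")
    return math.log2(n_choices)


def stream_information(choice_counts):
    """Total bits for a sequence of clicks, given the number of options at each step."""
    return sum(click_information(n) for n in choice_counts)


# A user picks among 8 links, then 4, then 2: 3 + 2 + 1 bits.
bits = stream_information([8, 4, 2])
```

Summing such per-click values over input and output clicks would give the two quantities that enter equation (1).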


[1]  S. Carnot, E. Mendoza, B. P. É. Clapeyron, and R. Clausius, Reflections on the motive power of fire. Springer, 1960.

[2]  L. Wu and J. Zhang, “The Flow Structure on the WWW”, arXiv:1110.6097, Oct 2011.

[3]  L. Wu and J. Zhang, “Accelerating growth and size-dependent distribution of human online activities”, Physical Review E, vol. 84, no. 2, p. 026113, 2011.

[4]  L. Wu, “The accelerating growth of online tagging systems”, Eur. Phys. J. B, vol. 83, pp. 283–287, 2011.