Friday, April 18, 2008

A simple study of the "Wang Qian Yuan Incident" at Duke

This Wednesday, at an informal meeting, an American told me about an incident that happened at Duke last Wednesday. It involves both Tibet and a disobedient Chinese girl, Wang Qian Yuan (王千源). The pro-China protests organized by CSSA and endorsed by the Chinese government are usually boring and ineffective, so they do not attract much of my attention. This time is a little different, however, because we Chinese have once again succeeded in transforming a conflict between Chinese and foreigners into a civil war among Chinese. This Friday's econometrics seminar was canceled, so I decided to study the incident and try to piece together the whole story.

From several videos on YouTube and several articles from the Duke Chinese Students and Scholars Association (DCSSA) and The Chronicle, the independent daily newspaper at Duke University, the whole sequence of events seems to emerge.

The event began with a candlelight vigil supporting freedom in Tibet, organized by the Duke Human Rights Coalition on Wednesday evening. One leader of the coalition, Daniel Cordero, said they had reserved the Chapel Quad in advance, planning to advocate for Tibet's freedom from the People's Republic of China. However, "crowds of upset protesters flooded the Chapel Quadrangle". They carried signs and Chinese flags, expressing patriotism and criticizing Western media through chants and song.

During the protest, a Chinese woman in yellow came to the center of the stage and presented her views on Tibet. In the video I attached, several Chinese can be heard yelling "Are you Chinese?" at her. Later she was surrounded and questioned by many Chinese. Because the background singing is so loud, I cannot hear their conversation clearly.

In one video, several of her "crimes" are listed in Chinese. My attempt at an English translation follows:

(1) Writing the slogan "Free Tibet" on the back of a "separatist";

(2) Making counter-active gestures together with "separatists" (gestures based on the one for "one world, one dream" but distorted by the "separatists");

(3) Comparing the Tibetan flag with the Hong Kong flag.

What happened next is more astonishing. Her name, her phone number, and her Chinese identity information were posted on the DCSSA website. This private information was later reposted on several popular Chinese-language forums, such as Tianya and Kaidi. Even worse, her parents' contact information was also posted.

After receiving many harassing phone calls and e-mails, she filed a report with the Duke University Police Department on Friday and accused DCSSA of releasing her private information. The president of DCSSA subsequently denied the accusation.

The story is not over, since there may be more fights between Ms. Wang and DCSSA, but I would like to comment on what Duke CSSA has done.

(1) The pro-China protest was not organized effectively at all.
They did not reserve a place to protest. Protesting by chanting and yelling is not persuasive. They threatened other people's right to free speech. Basically, DCSSA does not seem to know the rules in the US at all. What they have done will eventually hurt the image of Chinese people.

(2) Releasing an individual's private information is illegal, if DCSSA did that.

(3) Answering such a serious accusation, that DCSSA released private information, with a mere denial is not sufficient at all. DCSSA needs to express concern for the victim and provide all necessary help in finding the person who released the information, since it was released through DCSSA's e-mail system.

It seems that DCSSA abuses the freedom it enjoys in the US. Its actions are not constructive. It does not seem to know what its interests and purposes are, or how to protect its rights and deliver its opinions in the US.

PS:

http://www.dukechronicle.com/media/storage/paper884/news/2008/04/14/News/Student.Gets.Threats.After.China.Protest-3322848.shtml

http://media.www.dukechronicle.com/media/storage/paper884/news/2008/04/10/News/ProTibet.ProChina.Protesters.Clash.On.Quad-3316313.shtml

http://www.youtube.com/watch?v=I4J6nfyb-3k





An Open Letter to Duke Community
Apr 14th, 2008

After last Wednesday’s high profile protest on Duke campus, a few subscribers to the mailing list China@duke.edu anonymously sent out messages verbally attacking one student using language we found troubling and heinous, as well as releasing this student’s private information. This mailing list was set up mainly for the purpose of helping students exchange information such as second-hand car or apartment sublease. It is open to the public, not limited only to Chinese students and scholars at Duke, for subscription and currently has more than 900 registered users, and like many other mailing list of this kind, we do not have a dedicated member to monitor it closely on a daily basis. However, we removed all the relevant messages once they were brought to our attention. And starting on Saturday, April 12, 2008, we have imposed stricter filter rules for messages sent through the mailing list. Duke Chinese Students and Scholars Association (DCSSA) hereby declares our unequivocal position that we strongly disagree and condemn the behavior of these few anonymous subscribers.

However, we are very disappointed by the story “Student gets threats after China protest” appearing on today’s Chronicle (Apr 14th, 2008). We feel regretful that this student considered it was DCSSA’s fault to release “all kinds of information” about her, and several other student organizations on campus blamed DCSSA for actions taken by certain subscribers to our mailing list, which, for the reasons stated above, we have to disagree with. We are sympathetic to this student’s situation, and as the representatives of DCSSA, we will try to contact this student to resolve any misunderstandings.

As one of the largest student groups on campus, DCSSA is an organization dedicated to promoting diversity on Duke Campus. We are always proud to bring the culture from China—our home country which has a glorious history of more than 5,000 years, to the Gothic Wonderland which we also call home. We hope that by learning from each other, we can work towards an even brighter future. We appreciate the increasing attention on China recently received from the Duke Community. In light of the recent events on and off campus, we welcome your constructive comments and healthy reflections on a wide range of topics, including the impartiality of media, freedom of speech, and effectiveness of cross-cultural communication. Please feel free to send us your email to:
dcssa2008@gmail.com

Thank you!

Zhizhong Li, DCSSA President
Weina Wang, DCSSA Vice President
Weining Bian, DCSSA Vice President

Sunday, April 13, 2008

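Stata commands I use often, part 5

(continued)
Combining datasets can mean combining observations or combining variables. To combine observations, use append. When two datasets have exactly the same format but different observations, append using (filename) tacks the second onto the end of the first. It is simple and clear, and hard to get wrong. merge needs much more care. If two datasets contain the same observations but different variables, and you want to pull some variables from one dataset into the other, use merge. The complete procedure is as follows: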

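use (filename) [open the auxiliary dataset]
sort (varname) [sort by a variable; this is the identifier shared by both datasets]
save (filename), replace [save the auxiliary dataset]
use (filename) [open the master dataset]
sort (varname) [sort by the same variable]
merge (varname) using (filename), keep((varname))
[the first varname is the variable used in sort above, the filename is the name of the auxiliary dataset, and the last varname is the variable you want to extract]
ta _merge [show the values of _merge. Observations with _merge equal to 1 appear only in the master dataset, 2 only in the auxiliary dataset, and 3 in both.]
drop if _merge==2 [drop observations that come only from the auxiliary dataset]
drop _merge [drop _merge]
save (filename), replace [save the merged file, usually under a new name]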
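The steps above use the older merge syntax. As a rough sketch of the same workflow in Stata 11 or later, where merge takes a match type (1:1, m:1, 1:m) and no longer needs the data sorted beforehand, the example below builds two tiny made-up datasets. The variable names hhid, line, age, and region are hypothetical and are not taken from the data discussed in this post.

```stata
* Sketch only: needs Stata 11 or later; all data and names are made up.
clear
input hhid region
1 10
2 20
3 30
end
tempfile aux
* auxiliary dataset: one row per household
save `aux'

clear
input hhid line age
1 1 34
1 2 8
2 1 51
4 1 29
end

* m:1 = many persons per household in the master data,
* one row per household in the auxiliary file
merge m:1 hhid using `aux', keepusing(region)
* 1 = master only, 2 = using only, 3 = matched
tabulate _merge
* drop households that appear only in the auxiliary file
drop if _merge == 2
drop _merge
list
```

In this sketch household 3 exists only in the auxiliary file and is dropped, while household 4 exists only in the master data and keeps a missing region. The options keep(master match) and nogenerate on merge should give the same result without the two drop lines.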

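At this point I should probably stop talking about creating and processing data and move on to more interesting things such as estimation and testing. I will close with one last example to show how hard the preparation work can be. The trouble is that there are always special requirements that cannot be met by simply applying a command. There are two roads to Rome: find a more advanced command that does the job in one step, or take several detours with the simple commands you already know.

Here is a painful lesson, the most complicated dataset construction I have run into so far. The raw data is a household roster. It contains each person's individual information and his or her relationship to the household head, and the goal is to identify parent-child relationships. The plan is for the new dataset to use children as the unit of observation, find their parents, and then attach the parents' variables to each observation. This is what I did: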

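use a1,clear [open the full-sample dataset]
keep if gender==2&agemos>=96&a8~=1&line<10
[keep married women above a certain age]
replace a5=1 if a5==0
[a5 records the relationship to the household head: 0 is the head and 1 is the head's spouse. The head and the spouse are pooled here without distinction.]
keep if a5==1|a5==3|a5==7
[keep those who are the head or the head's spouse (=1), a child of the head (=3), or a daughter-in-law of the head (=7)]

ren h hf [add the suffix f, for female, to the variables needed]
ren line lf [add the suffix f, for female, to the variables needed]
[the steps below also refer to a5f, a8f, schf, a12f, agemosf and c8f, so those variables need the same suffix; their renames are not shown]
sort wave hhid
save b1,replace [sort and save]

keep if a5f==1 [keep the heads and heads' spouses]
save b2,replace [save]

use b1,clear
keep if a5f==3|a5f==7
save b3,replace [keep the heads' daughters and daughters-in-law, and save]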

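use a3,clear [open the dataset of children whose relationship to the head is child of the head]
sort wave hhid
merge wave hhid using b2, keep(hf lf)
ta _merge
drop if _merge==2
sort hhid line wave [two-generation households: merge the file of female heads and spouses with the children file]

by hhid line wave: egen x=count(id)
drop x _merge [count the matches for each household in each year; x only takes the value 1, which shows that the two-generation match succeeded]
save b4,replace [save]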

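use a4,clear [open the dataset of children whose relationship to the head is grandchild of the head]
sort wave hhid
merge wave hhid using b3, keep(a5f a8f schf a12f hf agemosf c8f lf)
ta _merge
drop if _merge==2 [three-generation households: merge the file of heads' daughters and daughters-in-law with the grandchildren file]

sort hhid line wave
by hhid line wave: egen x=count(id)
gen a=agemosf-agemos
drop if a<216&x==3 [count the matches for each household in each year; x does not only take the value 1, so the three-generation match is not fully successful. Drop implausible cases; the criteria are the age gap and households with three possible mothers.]

gen xx=x[_n+1]
gen xxx=x[_n-1]
gen y=lf if x==1
replace y=lf[_n+1] if x==2&xx==1
replace y=lf[_n-1] if x==2&xxx==1
keep if x==1|(lf==y&x==2)
[for children with two possible mothers, a woman with the same code can appear twice. The steps above make sure those cases are not dropped.]

drop a x xx xxx y _merge
save b5,replace [save the merged dataset]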

[Merging the male data is entirely analogous, so I omit it.]

log close
exit,clear

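My approach reaches the goal by going around in circles with simple commands, so I would very much welcome a simpler method. Still, one often cannot insist on elegant code, so I make do. Someone once asked me for the procedure above and I never replied. Now that it is public, I hope it helps those who need it, and it saves me the trouble of replying.
(to be continued)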

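Stata commands I use often, part 4

(continued)
egen is another command for creating variables; its distinguishing feature is its powerful set of functions. gen supports simple expressions such as basic arithmetic. egen supports more complex functions, for example the mean of a variable:

egen ave = mean((varname))

There are many more functions, which you can look up in help, so I will not list them all. So far I have used functions such as the mean and the sum.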
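As a small illustration of egen, the sketch below uses the auto dataset that ships with Stata, which is my choice for the example and not data from this post. It computes an overall mean, a mean within groups, and a total; the new variable names are made up.

```stata
* Sketch using Stata's bundled example data; new variable names are made up.
sysuse auto, clear
* overall mean of price, repeated on every observation
egen ave_price = mean(price)
* mean of price within each value of foreign
bysort foreign: egen grp_price = mean(price)
* sum of price over all observations
egen tot_price = total(price)
list make foreign price ave_price grp_price in 1/5
```

Combining egen with bysort is often what makes it more useful than gen: the statistic is computed separately for each group and attached to every observation in that group.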

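After all this talk, an example. In some raw data a variable date records dates in the following format: December 11, 1980 is recorded as 19801211. How do you extract the year and the month and create dummy variables from them? Here is what I do:

gen yr=int(date/10000)
gen mo=int((date-yr*10000)/100)
ta yr, gen( yd)
ta mo, gen( md)

Here int() is the integer function, which drops the fractional part of a number.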
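To make the arithmetic concrete, here is a minimal sketch with three made-up dates. It assumes date is stored as long (or double): an eight-digit number such as 19801211 does not fit exactly in Stata's default float type, so a float date could silently be off by one day. The day variable dy is my addition and is not part of the original example.

```stata
* Sketch with made-up dates; date must be long or double, not float.
clear
input long date
19801211
19990305
20080418
end
* 19801211/10000 = 1980.1211, int() leaves 1980
gen yr = int(date/10000)
* (19801211 - 19800000)/100 = 12.11, int() leaves 12
gen mo = int((date - yr*10000)/100)
* remainder after dividing by 100 is the day
gen dy = mod(date, 100)
list
* one dummy per distinct year and per distinct month
tabulate yr, generate(yd)
tabulate mo, generate(md)
```

For the first row this gives yr = 1980, mo = 12, and dy = 11.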

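Once the variables you need are ready, you can save the result as a new dataset. The command is save (filename), replace. As mentioned earlier, the replace option overwrites the existing file with your changes, so use it with great care. Save under a new name instead: if you overwrite the raw data and cannot get it back, nobody will be able to help you.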

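Besides simple operations on a single dataset, you sometimes need to change the structure of the data or combine information drawn from different datasets. Of the commands in this category I have used reshape, which switches data between the wide and long layouts; collapse, which creates a collapsed dataset of summary statistics; and append and merge, which combine datasets.

reshape changes the structure of longitudinal data. Longitudinal data is what is usually called panel data: it records the values of the same variable for the same agent at different points in time. There are two formats for recording it, wide and long. In the wide format each agent is the unit of record, and the same variable in different periods is stored within the same observation. For example, suppose the agents are firms, the periods are 2000 and 2001, and the variables are number of employees and city, where employment differs across periods and the city does not change. In the wide format each firm is one observation, there is no period variable, employment has two variables recording the 2000 and 2001 figures, and city has just one variable. In the long format, agent and time jointly define an observation. In the example above each firm has two observations, distinguished by a year variable, and employment and city each have a single variable recording the value in the given year. reshape converts wide data to long and long data to wide.

In the example above, the command to convert from wide to long is:

reshape long (common prefix of the employment variables), i((variable identifying the firm)) j((variable identifying the period))

Because the city does not vary over time, it does not need to be listed after reshape long, and the conversion leaves it unchanged. Conversely, to convert from long to wide, just replace long with wide.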
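The firm example can be sketched with two made-up firms. The names firm, emp2000, emp2001, city, and year are hypothetical; the point is only that reshape long takes the common prefix emp and creates the period variable named in j().

```stata
* Sketch with made-up data: two firms, two years, in wide format.
clear
input firm emp2000 emp2001 str10 city
1 120 135 "Beijing"
2  80  75 "Shanghai"
end
list

* wide -> long: emp2000 and emp2001 become emp, with a new variable year
reshape long emp, i(firm) j(year)
list

* long -> wide: back to one observation per firm
reshape wide emp, i(firm) j(year)
list
```

After reshape long there are four observations (two per firm), with city simply repeated within each firm.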

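collapse computes some statistics of a dataset and saves them as a new dataset. I use it somewhat reluctantly, because I could not find a command that directly reports the median and the 1st through 99th percentiles. If anyone knows of one, please let me know; thanks in advance. The command for the median is as follows.

collapse (median) ((varname)), by((varname))

The new dataset records the median of the variable (or variables) in the first parentheses. The by option on the right computes the median within groups defined by some variable; without it, the median of the whole sample is computed.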
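On the question of reporting medians and percentiles without collapsing the data, a few standard Stata commands may be what is wanted. The sketch below uses the bundled auto dataset, which is my choice for illustration; whether these cover the exact need described above is an assumption on my part.

```stata
* Sketch using Stata's bundled example data.
sysuse auto, clear
* detail adds percentiles 1, 5, 10, 25, 50, 75, 90, 95 and 99
summarize price, detail
* the median is stored in r(p50) after summarize, detail
display r(p50)
* arbitrary percentiles, here the 1st, 50th and 99th
centile price, centile(1 50 99)
* selected percentiles by group, without changing the data
tabstat price, statistics(p25 p50 p75) by(foreign)
* collapse can store several statistics under new names
collapse (median) med_price=price (p90) p90_price=price, by(foreign)
list
```

Unlike collapse, the first three commands only report results and leave the dataset in memory unchanged.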
(to be continued)

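Stata commands I use often, part 3

(continued)
While building a new dataset you often need to derive new variables from the original ones. The commands for creating variables are gen, egen, and replace. The basic syntax of gen and replace is

gen (or replace) (varname)=(expression)

The difference between the two is that gen creates a new variable, while replace redefines an existing one.

A dummy variable takes the value 0 or 1 and marks some attribute of the agents in the sample. Dummies are used widely in empirical work, so I will briefly describe how to create them. There are two basic methods. The first is more concise:

gen (varname)=((condition))

Here the outer parentheses around the condition are typed as part of the command (Stata does not strictly require them, but they make the logical expression easier to read), and the inner parentheses only mark the content as a placeholder. If an observation satisfies the condition, the dummy takes the value 1 there, otherwise 0. The other method is

gen (varname)=1 if (condition for the value 1)
replace (same varname)=0 if (condition for the value 0)

There is a small difference between the two. If none of the values used in the condition are missing, the two methods give the same result. If some are missing, the first method still assigns 0 or 1 to every observation, including those with missing values, because a logical expression is never missing. Which of the two they get depends on the condition: Stata treats a missing value as larger than any number, so x>5 is true for a missing x while x<5 is false. The second method allows three outcomes, 1, 0, or missing, provided both if conditions exclude the observations with missing values. This avoids mistakenly putting observations whose information is unknown into the regression.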
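The effect of missing values on the two methods can be seen in a three-row sketch. The variable x and the cutoff 5 are made up. Note that Stata treats a missing value as larger than any number, so the logical expression x > 5 evaluates to 1, not 0, when x is missing.

```stata
* Sketch with made-up data: one small value, one large value, one missing.
clear
input x
3
8
.
end
* method 1: the logical expression is never missing
gen d1 = (x > 5)
* method 2: assign 1 and 0 explicitly and leave the rest missing
gen d2 = 1 if x > 5 & !missing(x)
replace d2 = 0 if x <= 5
list
```

Here d1 is 0, 1, 1 and d2 is 0, 1, missing, so only the second method keeps the observation with unknown x out of a later regression.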

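When only a few dummies are needed, define them one at a time. When you need a large number of similar dummies, the basic methods become very tedious. For example, to create community dummies in data covering many communities, there may be hundreds or thousands of communities, which is far too much work. If each community has an identifying code, the following command creates the dummies in bulk.

ta (varname), gen((varname))

The variable in the first parentheses is an existing variable, the community code in the example above. The name in the second parentheses is the common prefix of the new dummies, followed by a number that distinguishes them. If I enter d here, the command creates d1, d2, and so on, until every community has its own dummy.

One more remark. Typing the names of so many community dummies one by one in a regression is tiring. You can abbreviate with a wildcard, where d* means all variables whose names begin with d (so make sure no other variable starts with d), or with a hyphen, where d1-d150 means the first through the 150th community dummies.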
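A minimal sketch of the bulk approach, using the bundled auto dataset with rep78 standing in for the community code. The prefix repd is chosen on purpose: in this dataset a prefix of d would make d* also match displacement, and a prefix of rep would make rep* also match rep78.

```stata
* Sketch using Stata's bundled example data; rep78 plays the community code.
sysuse auto, clear
* creates repd1 ... repd5, one dummy per distinct value of rep78
tabulate rep78, generate(repd)
* wildcard: every variable whose name starts with repd
describe repd*
* hyphen range: repd2 through repd5, leaving repd1 as the base category
regress price mpg repd2-repd5
```

Observations with a missing rep78 get missing values in all five dummies, so they drop out of the regression.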

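There is also a way to control for the dummies directly in a regression without actually creating them, as follows.

areg (dependent variable) (explanatory variables), absorb(varname)

The variable in the absorb option is the same as the first variable in the command discussed above, the community code in the example. The coefficient estimates are the same as those from reg with the corresponding dummies included directly.
(to be continued)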
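The equivalence can be checked with a short sketch on the bundled auto dataset, again with rep78 standing in for the community code. The i. prefix in the second command is factor-variable notation and assumes Stata 11 or later.

```stata
* Sketch using Stata's bundled example data.
sysuse auto, clear
* dummies for rep78 are absorbed, not reported
areg price mpg weight, absorb(rep78)
* the same model with the dummies included explicitly
regress price mpg weight i.rep78
```

The coefficients on mpg and weight should match across the two commands; areg simply does not display the group effects.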

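Stata commands I use often, part 2

(continued)
The first step is to clean the raw data. Raw data that has not been cleaned contains errors, gaps, and inconsistencies. For example, missing values of some variables are sometimes recorded as a dot and sometimes as numbers such as -9 or -99. Run a regression without adjusting for this and the results will naturally be absurd. The same variable may also have different names in different datasets, which makes merging troublesome. My personal view: extract the information you need from the raw data, build a new dataset from it, and use only this new dataset for the later analysis. If you need to add new information, modify the new dataset as well rather than working directly on the raw data. This work is not hard, but it is very basic. If you are not careful enough here, what you do later will often be wasted.

Now inspect the data. Commonly used commands include codebook, su, ta, des, and list. codebook gives the most complete information. Its drawbacks are that I could not restrict its range with an if condition (the current Stata documentation does list if and in in its syntax, so this may depend on the version) and that it reports every kind of information at once, which is inconvenient when building tables, so other commands are needed as well. su (summarize) is followed by a space and variable names, and reports the number of non-missing observations, the mean, the standard deviation, the minimum, and the maximum of each variable. ta (tabulate) is followed by a space and one (or two) variable names, and reports the frequencies of the variable's values excluding missing values (a two-way table when there are two variables); for a single variable it also reports percentages and cumulative percentages in sorted order. des (describe) can also be followed by any number of variable names, as long as they exist in the data. It reports each variable's storage type, display format, and label. (Generally the label records the variable's definition and units.) list is also followed by variable names and reports the observed values; you can use if or in to restrict the range of observations. Apart from ta, which needs at least one variable, all of these commands can be run without any variable names, in which case they report the corresponding information for all variables in the dataset in use. Describing this is rather dry, so open Stata and try it yourself.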
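For a quick hands-on tour of these inspection commands, the sketch below runs each of them on the auto dataset that ships with Stata; the choice of dataset and variables is mine.

```stata
* Sketch using Stata's bundled example data.
sysuse auto, clear
* range, missing values, and a tabulation for one variable
codebook rep78
* observations, mean, sd, min, max
summarize price mpg
* one-way table with percent and cumulative percent
tabulate rep78
* two-way table of counts
tabulate rep78 foreign
* storage type, display format, and label
describe make price
* list with an if condition, then with an in range
list make price if price > 10000
list make price in 1/5
```

summarize, describe, and list can also be run with no variable names, in which case they cover every variable in memory.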

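An aside. Apart from codebook, the descriptive commands above are all r-class commands (also called general commands). After running one, you can use return list to report the results stored in r(). The most typical r-class command is summarize, which stores the sample size, mean, standard deviation, variance, minimum, maximum, sum, and so on. After running su, just type return list to see all of them. Besides descriptive commands there are estimation commands, such as regress. These estimation commands (also called e-class commands) store a lot of related information as well, and as with the descriptive commands, ereturn list displays it. These features are very useful in more complex applications, such as decomposing a regression or computing statistics that no built-in command reports directly.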
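A short sketch of how stored results can be inspected and reused, again on the bundled auto dataset. The derived variable price_dm is a made-up name for illustration.

```stata
* Sketch using Stata's bundled example data.
sysuse auto, clear
summarize price
* everything summarize left behind in r()
return list
display r(mean)
* reuse a stored result: price in deviations from its mean
gen price_dm = price - r(mean)

regress price mpg
* everything regress left behind in e()
ereturn list
* R-squared, and the coefficient vector
display e(r2)
matrix list e(b)
```

Stored r() results are overwritten by the next r-class command, so copy anything you need into a variable, scalar, or macro right away.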

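codebook shows a variable's range and units. If values such as -9 or -99 appear, suspect that they code missing values. Check how the questionnaire records missing values and, once confirmed, recode them as a dot. The command is replace (varname)=. if (varname)==-9. Too large a share of missing values is a problem: first, the sample becomes small and the results tend to be insignificant; second, there may be selection bias if the people with missing values differ substantially from the population. This is one criterion for choosing variables.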
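When several variables share the same missing-value codes, the replace command can be wrapped in a loop. The sketch below uses made-up data; the variable names income and age and the codes -9 and -99 are assumptions for illustration.

```stata
* Sketch with made-up data: -9 and -99 stand for missing.
clear
input income age
3000 35
-9   42
4500 -99
-99  -9
end
* recode both codes to Stata's missing value in every listed variable
foreach v of varlist income age {
    replace `v' = . if inlist(`v', -9, -99)
}
* check how many values are now missing
count if missing(income)
summarize income age
```

After the loop, income and age each have two missing values, and summarize reports two non-missing observations for each. Stata's mvdecode command is built for this kind of recoding and may be worth a look in help.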

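Standardize naming: either unify the labels or unify the naming convention for variables. The command to rename a variable is

ren (old varname) (new varname)

The command to define a label is

label var (varname) "(label text)"

Consistent variable names are easier to remember, and concise labels make the data easier to analyze.
(to be continued)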
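A two-line illustration of renaming and labeling on the bundled auto dataset; the new name repair78 and the label text are made up.

```stata
* Sketch using Stata's bundled example data.
sysuse auto, clear
rename rep78 repair78
label variable repair78 "Repair record in 1978"
* confirm the new name and label
describe repair78
```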