import stata_setup
stata_setup.config('C:/Program Files/Stata18', 'mp', splash=False)
Simon Zhou
May 3, 2025
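codebook produces a data dictionary: a description of the dataset's basic information, including variable names, variable types, missing values, and variable labels.
Syntax: codebook [varlist] [if] [in] [, options]
- [] marks an optional element.
- varlist is the list of variables; it may name one or more variables. If it is omitted, every variable in the dataset is described.
- if and in are qualifiers that restrict the command to a subset of observations.
- options are additional settings that customize the report.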
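Because this notebook drives Stata from Python, the syntax diagram can be read as a recipe for assembling a command string. The sketch below is only an illustration: `build_command` is a hypothetical helper, not part of Stata or `stata_setup`, and it does not validate the pieces it is given.

```python
def build_command(name, varlist=(), if_cond=None, in_range=None, options=()):
    """Assemble `name [varlist] [if] [in] [, options]` as one string.

    Hypothetical helper for illustration. Every part except the command
    name is optional, mirroring the square brackets in the syntax diagram.
    """
    parts = [name]
    if varlist:
        parts.append(" ".join(varlist))
    if if_cond:
        parts.append(f"if {if_cond}")
    if in_range:
        parts.append(f"in {in_range}")
    command = " ".join(parts)
    if options:
        # Options follow a single comma and are separated by spaces.
        command += ", " + " ".join(options)
    return command


print(build_command("codebook"))
print(build_command("codebook", ["price"], if_cond="foreign == 1"))
print(build_command("codebook", ["price", "mpg"], in_range="1/10", options=["compact"]))
```

Each resulting string could then be handed to Stata, for example with `pystata.stata.run(...)` once `stata_setup.config` has been called.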
. // Load Stata's built-in example dataset auto.dta
. // It records attributes of 1978 automobiles, such as price, weight, and mileage (mpg)
. sysuse auto, clear
(1978 automobile data)
. codebook price
-------------------------------------------------------------------------------
price Price
-------------------------------------------------------------------------------
Type: Numeric (int)
Range: [3291,15906] Units: 1
Unique values: 74 Missing .: 0/74
Mean: 6165.26
Std. dev.: 2949.5
Percentiles: 10% 25% 50% 75% 90%
3895 4195 5006.5 6342 11385
.
-------------------------------------------------------------------------------
price Price
-------------------------------------------------------------------------------
Type: Numeric (int)
Range: [5079,15906] Units: 1
Unique values: 37 Missing .: 0/37
Mean: 8086.95
Std. dev.: 3142.58
Percentiles: 10% 25% 50% 75% 90%
5189 5788 6342 10371 13466
-------------------------------------------------------------------------------
price Price
-------------------------------------------------------------------------------
Type: Numeric (int)
Range: [3299,15906] Units: 1
Unique values: 11 Missing .: 0/11
Mean: 6917.36
Std. dev.: 4668.09
Percentiles: 10% 25% 50% 75% 90%
3667 3955 4504 11385 14500
[D] codebook -- Describe data contents
(View complete PDF manual entry)
Syntax
------
codebook [varlist] [if] [in] [, options]
options Description
-------------------------------------------------------------------------
Options
all print complete report without missing values
header print dataset name and last saved date
notes print any notes attached to variables
mv report pattern of missing values
tabulate(#) set tables/summary statistics threshold; default
is tabulate(9)
problems report potential problems in dataset
detail display detailed report on the variables; only
with problems
compact display compact report on the variables
dots display a dot for each variable processed; only
with compact
Languages
languages[(namelist)] use with multilingual datasets; see [D] label
language for details
-------------------------------------------------------------------------
collect is allowed; see prefix.
Menu
----
Data > Describe data > Describe data contents (codebook)
Description
-----------
codebook examines the variable names, labels, and data to produce a
codebook describing the dataset.
Links to PDF documentation
--------------------------
Quick start
Remarks and examples
The above sections are not included in this help file.
Options
-------
+---------+
----+ Options +----------------------------------------------------------
all is equivalent to specifying the header and notes options. It
provides a complete report, which excludes only performing mv.
header adds to the top of the output a header that lists the dataset
name, the date that the dataset was last saved, etc.
notes lists any notes attached to the variables; see [D] notes.
mv specifies that codebook search the data to determine the pattern of
missing values. This is a CPU-intensive task.
tabulate(#) specifies the number of unique values of the variables to use
to determine whether a variable is categorical or continuous.
Missing values are not included in this count. The default is 9;
when there are more than nine unique values, the variable is
classified as continuous. Extended missing values will be included
in the tabulation.
problems specifies that a summary report is produced describing potential
problems that have been diagnosed:
- Variables that are labeled with an undefined value label
- Incompletely value-labeled variables
- Variables that are constant, including always missing
- Leading, trailing, and embedded spaces in string variables
- Embedded binary 0 (\0) in string variables
- Noninteger-valued date variables
See codebook problems for a discussion of these problems and advice
on overcoming them.
detail may be specified only with the problems option. It specifies that
the detailed report on the variables not be suppressed.
compact specifies that a compact report on the variables be displayed.
compact may not be specified with any options other than dots.
dots specifies that a dot be displayed for every variable processed.
dots may be specified only with compact.
+-----------+
----+ Languages +--------------------------------------------------------
languages[(namelist)] is for use with multilingual datasets; see [D]
label language. It indicates that the codebook pertains to the
languages in namelist or to all defined languages if no such list is
specified as an argument to languages(). The output of codebook
lists the data label and variable labels in these languages and which
value labels are attached to variables in these languages.
Problems are diagnosed in all of these languages, as well. The
problem report does not provide details in which language problems
occur. We advise you to rerun codebook for problematic variables;
specify detail to produce the problem report again.
If you have a multilingual dataset but do not specify languages(),
all output, including the problem report, is shown in the "active"
language.
Examples
--------
With standard (monolingual) datasets,
-----------------------------------------------------------------------
Setup
. sysuse auto
. note rep78: "investigate missing values"
. label values rep78 repairlbl
Display codebook for all variables in dataset
. codebook
Same as above command
. codebook _all
Same as above command, but print dataset name, date last saved,
dataset label, number of variables and of observations, and dataset
size
. codebook, header
Display codebook for rep78 variable
. codebook rep78
Display codebook for rep78 variable, including notes attached to
rep78
. codebook rep78, notes
Report potential problems with dataset
. codebook, problems
Display compact report for all variables in dataset
. codebook, compact
-----------------------------------------------------------------------
Setup
. webuse citytemp, clear
Display codebook for cooldd, heatdd, tempjan, and tempjuly, and
report pattern of missing values
. codebook cooldd heatdd tempjan tempjuly, mv
-----------------------------------------------------------------------
With multilingual datasets, with languages en and es, and with active
language en,
Setup
. webuse autom
Display codebook for foreign in language en
. codebook foreign
Display codebook for foreign in language es
. codebook foreign, language(es)
Display codebook for foreign in both en and es
. codebook foreign, languages
Stored results
--------------
codebook stores the following lists of variables with potential problems
in r():
Macros
r(cons) constant (or missing)
r(labelnotfound) undefined value labeled
r(notlabeled) value labeled but with unlabeled categories
r(str_type) compressible
r(str_leading) leading blanks
r(str_trailing) trailing blanks
r(str_embedded) embedded blanks
r(str_embedded0) embedded binary 0 (\0)
r(realdate) noninteger dates
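summarize prints basic descriptive statistics for the dataset, including the mean, standard deviation, minimum, and maximum.
Syntax: summarize [varlist] [if] [in] [, options]
summarize can be abbreviated to sum or summ.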
. sum price
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 74 6165.257 2949.496 3291 15906
. summ price
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 74 6165.257 2949.496 3291 15906
.
What sets summarize apart is its detail option, which prints additional statistics such as percentiles (including the quartiles), skewness, and kurtosis.
[R] summarize -- Summary statistics
(View complete PDF manual entry)
Syntax
------
summarize [varlist] [if] [in] [weight] [, options]
options Description
-------------------------------------------------------------------------
Main
detail display additional statistics
meanonly suppress the display; calculate only the mean;
programmer's option
format use variable's display format
separator(#) draw separator line after every # variables; default is
separator(5)
display_options control spacing, line width, and base and empty cells
-------------------------------------------------------------------------
varlist may contain factor variables; see fvvarlist.
varlist may contain time-series operators; see tsvarlist.
by, collect, rolling, and statsby are allowed; see prefix.
aweights, fweights, and iweights are allowed. However, iweights may not
be used with the detail option; see weight.
Menu
----
Statistics > Summaries, tables, and tests > Summary and descriptive
statistics > Summary statistics
Description
-----------
summarize calculates and displays a variety of univariate summary
statistics. If no varlist is specified, summary statistics are
calculated for all the variables in the dataset.
Links to PDF documentation
--------------------------
Quick start
Remarks and examples
Methods and formulas
The above sections are not included in this help file.
Options
-------
+------+
----+ Main +-------------------------------------------------------------
detail produces additional statistics, including skewness, kurtosis, the
four smallest and four largest values, and various percentiles.
meanonly, which is allowed only when detail is not specified, suppresses
the display of results and calculation of the variance. Ado-file
writers will find this useful for fast calls.
format requests that the summary statistics be displayed using the
display formats associated with the variables rather than the default
g display format; see [D] format.
separator(#) specifies how often to insert separation lines into the
output. The default is separator(5), meaning that a line is drawn
after every five variables. separator(10) would draw a line after
every 10 variables. separator(0) suppresses the separation line.
display_options: vsquish, noemptycells, baselevels, allbaselevels,
nofvlabel, fvwrap(#), and fvwrapon(style); see [R] Estimation
options.
Examples
--------
. sysuse auto
. summarize
. summarize mpg weight
. summarize mpg weight if foreign
. summarize mpg weight if foreign, detail
. summarize i.rep78
Video example
-------------
Descriptive statistics in Stata
Stored results
--------------
summarize stores the following in r():
Scalars
r(N) number of observations
r(mean) mean
r(skewness) skewness (detail only)
r(min) minimum
r(max) maximum
r(sum_w) sum of the weights
r(p1) 1st percentile (detail only)
r(p5) 5th percentile (detail only)
r(p10) 10th percentile (detail only)
r(p25) 25th percentile (detail only)
r(p50) 50th percentile (detail only)
r(p75) 75th percentile (detail only)
r(p90) 90th percentile (detail only)
r(p95) 95th percentile (detail only)
r(p99) 99th percentile (detail only)
r(Var) variance
r(kurtosis) kurtosis (detail only)
r(sum) sum of variable
r(sd) standard deviation
Price
-------------------------------------------------------------
Percentiles Smallest
1% 3291 3291
5% 3748 3299
10% 3895 3667 Obs 74
25% 4195 3748 Sum of wgt. 74
50% 5006.5 Mean 6165.257
Largest Std. dev. 2949.496
75% 6342 13466
90% 11385 13594 Variance 8699526
95% 13466 14500 Skewness 1.653434
99% 15906 15906 Kurtosis 4.819188
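To make the entries of the detail table concrete, here is a minimal Python sketch of how such statistics can be computed. It rests on stated assumptions rather than on Stata's source code: the variance divides by n - 1, skewness and kurtosis are ratios of central moments that divide by n, and a percentile averages two neighboring observations when n*p/100 is a whole number. The function names and the small data list are made up for illustration.

```python
import math


def percentile(sorted_x, p):
    """p-th percentile under the assumed rule: with P = n*p/100, average
    the P-th and (P+1)-th observations when P is a whole number,
    otherwise take the first observation above position P."""
    n = len(sorted_x)
    pos = n * p / 100
    k = math.floor(pos)
    if pos == k and 0 < k < n:
        return (sorted_x[k - 1] + sorted_x[k]) / 2
    return sorted_x[min(k, n - 1)]


def detail_stats(x):
    n = len(x)
    mean = sum(x) / n
    # Central moments m_r = (1/n) * sum((x - mean)^r) for r = 2, 3, 4
    m2, m3, m4 = (sum((v - mean) ** r for v in x) / n for r in (2, 3, 4))
    variance = m2 * n / (n - 1)  # sample variance divides by n - 1
    return {
        "N": n,
        "mean": mean,
        "Var": variance,
        "sd": math.sqrt(variance),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
        "p50": percentile(sorted(x), 50),
    }


print(detail_stats([1, 2, 3, 4, 10]))
```

The percentile rule agrees with the table above in at least one checkable place: with 74 observations, the 5th percentile sits at position 3.7, so it is the fourth smallest value, 3748, which is also the last entry in the Smallest column.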
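histogram draws a histogram, which shows how the values of a variable are distributed.
Syntax: histogram varname [if] [in] [weight] [, [continuous_opts | discrete_opts] options]
- varname is the name of the variable to plot; exactly one variable is specified.
- if and in are qualifiers that restrict the plot to a subset of observations.
- options are additional settings, such as the bar color, outline, and titles.
- bin() sets the number of bins; bin(20) divides the data into 20 bins.
- normal overlays a normal density curve, which helps judge visually whether the data look normally distributed.
- freq (short for frequency) scales the bar heights to the number of observations in each bin.
- percent scales the bar heights to the percentage of observations in each bin.
- density scales the bar heights so that the total bar area is 1; this is the default.
- start() sets the lower limit of the first bin; start(0) starts the bins at 0.
- width() sets the bin width; width(1) makes each bin 1 unit wide.
- gap() leaves space between the bars by narrowing each bar by the given percentage; gap(0), the default, leaves no gap.
- barwidth() sets the drawn width of the bars in units of varname; barwidth(0.5) draws bars 0.5 units wide.
- color() sets the bar color; color(red) draws red bars.
- addlabels labels each bar with its height.
- addlabopts() controls how those labels look; it takes marker label options, for example addlabopts(mlabcolor(blue)) for blue labels, addlabopts(mlabsize(small)) for small labels, and addlabopts(mlabposition(12)) to place each label at the 12 o'clock position above its bar.
histogram can be abbreviated to histo or hist.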
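How the number of bins, the starting value, the bin width, and the height scales relate to one another can be sketched in a few lines of Python. This is an illustration under assumptions, not Stata's implementation: it assumes k equal-width bins running from the starting value to the maximum, with the maximum counted in the last bin. `bin_heights` and the sample values are hypothetical.

```python
def bin_heights(x, k, start=None):
    """Split x into k equal-width bins and return the bin width plus
    the bar heights on the frequency, percent, and density scales."""
    lo = min(x) if start is None else start
    if lo > min(x):
        raise ValueError("start must not exceed the smallest value")
    width = (max(x) - lo) / k  # assumes max(x) > lo, so width > 0
    counts = [0] * k
    for v in x:
        # Index of the bin containing v; the maximum goes in the last bin.
        j = min(int((v - lo) / width), k - 1)
        counts[j] += 1
    n = len(x)
    return {
        "width": width,
        "frequency": counts,
        "percent": [100 * c / n for c in counts],
        # Density heights make the total bar area equal to 1.
        "density": [c / (n * width) for c in counts],
    }


h = bin_heights([1, 2, 2, 3, 5, 8, 9, 9], k=4, start=1)
print(h["width"])      # 2.0
print(h["frequency"])  # [3, 1, 1, 3]
print(h["percent"])    # [37.5, 12.5, 12.5, 37.5]
```

By the same arithmetic, splitting the range of price in auto.dta, [3291, 15906], into eight bins gives a bin width of (15906 - 3291) / 8 = 1576.875.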
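Two curves can be overlaid on the histogram. normal draws a normal density curve, which helps judge whether the data look normally distributed; normopts(line_options) sets its line properties, for example normopts(lcolor(red)) draws it in red. kdensity draws a kernel density estimate, a smoothed view of the distribution; kdenopts(line_options) sets its line properties, for example kdenopts(lcolor(blue)) draws it in blue.
A box plot shows the distribution of the data and its outliers.
It displays the median, the quartiles, the whisker ends, and the outliers.
The median is the line inside the box, and the lower and upper quartiles are the edges of the box. The whiskers extend to the lower and upper adjacent values, that is, the most extreme observations lying within 1.5 times the interquartile range of the box edges; a whisker reaches the minimum or maximum only when no observation lies beyond it. Outliers are the points plotted beyond the whiskers.
The strength of a box plot is that it shows the distribution and the outliers at a glance; its weakness is that it does not show the individual data values.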
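The parts of a box plot can be worked through numerically. The following Python sketch assumes the common convention in which the whiskers end at the most extreme observations within 1.5 times the interquartile range of the box, together with a simple percentile rule for the quartiles. `quantile`, `box_stats`, and the data are hypothetical, and Stata's own computation may differ in details.

```python
import math


def quantile(sorted_x, p):
    """Assumed percentile rule: with P = n*p/100, average two neighbors
    when P is a whole number, otherwise take the next observation."""
    n = len(sorted_x)
    pos = n * p / 100
    k = math.floor(pos)
    if pos == k and 0 < k < n:
        return (sorted_x[k - 1] + sorted_x[k]) / 2
    return sorted_x[min(k, n - 1)]


def box_stats(x):
    xs = sorted(x)
    q1, med, q3 = (quantile(xs, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Observations inside the fences; the whiskers end at the extremes of these.
    inside = [v for v in xs if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "lower_whisker": inside[0],
        "upper_whisker": inside[-1],
        "outliers": [v for v in xs if v < lower_fence or v > upper_fence],
    }


print(box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```

In this example the box spans 3 to 8 with the median at 5.5, the upper whisker stops at 9 rather than at the maximum, and 30 is drawn as an outlier.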
Syntax: graph box yvars [if] [in] [weight] [, options]
graph hbox yvars [if] [in] [weight] [, options]
The difference between the two is that graph box draws vertical box plots, while graph hbox draws horizontal ones.
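- [] marks an optional element.
- yvars is the list of variables to plot; it may name one or more variables.
- if and in are qualifiers that restrict the plot to a subset of observations.
- options are additional settings that customize the plot.
A violin plot is similar to a box plot and likewise shows the distribution of the data and its outliers.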
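Syntax: vioplot varlist [if] [in] [, options]
- [] marks an optional element.
- varlist is the list of variables to plot; it may name one or more variables.
- if and in are qualifiers that restrict the plot to a subset of observations.
- options are additional settings that customize the plot.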
. // vioplot is not a built-in command; it has to be installed from SSC first
. ssc install vioplot
checking vioplot consistency and verifying not already installed...
installing into C:\Users\asus\ado\plus\...
installation complete.
.
Syntax: