A very first introduction to Stata
Introduction to Stata
Mr Aidan Horn
University of Cape Town
aidan@econometrics.co.za
Note: Please use the Stata User's Guide. The latest version can be found at https://www.stata.com/manuals/u.pdf . This page is structured around the Stata User's Guide (for v17). You can use the .do file version of this page.
Advice for studying economics at UCT: http://toolkit.uctecossoc.co.za/
Connect to a UCT computer remotely: http://commerceit.uct.ac.za/remoteaccess , using the VPN, once you have a student account.
Contents
0. Code to use at the beginning of a script
15. Saving and printing output — log files
Stata basics
1. Read this — it will help
2. A brief description of Stata
3. Resources for learning and using Stata
4. Stata's help and search facilities
5. Editions of Stata
6. Managing memory
7. more conditions
8. Error messages and return codes
9. The Break key
10. Keyboard use
28. Commands everyone should know
Elements of Stata
11. Language syntax
12. Data
13. Functions and expressions
16. Do files
18. Programming Stata
20. Estimation and postestimation commands
21. Creating reports
Advice
22. Entering and importing data
23. Combining datasets
0. Code to use at the beginning of a script
Click on View > Wrap lines, so that long lines get displayed as a paragraph.
* Comments can be made by starting the line with an asterisk, or by ending a line with // and writing the comment on the same line after the code, or by surrounding a block of text with
/* multi-line
text */
You can include comments so that others understand your code better, and so that you remember what you were trying to do when you read your code again later on. It is best practice to comment above or on the same line as the corresponding code. For example:
* When the .do file is run, the results window will come forward.
window manage forward results
* Clear datasets in memory, so that analysis can start afresh.
clear all
pause on // switch off for no pauses
From here onwards, I will mostly keep comments in the regular text of this web page, rather than in the code, for better presentation.
My suggestion for how to set up a project directory:
Project
├─ DataIN
├─ DataOUT
├─ Scripts
│ ├─ Logs
│ └─ Graphs
├─ Writing
└─ Info
For this tutorial, go to the DataFirst website (go to the Open Data Portal), and search for and download PALMS. Log in with your account, and state that you are exploring PALMS for educational purposes (when they ask for a reason why you are downloading the dataset).
Copy the folder path from the file explorer. Use forward-slashes in folder paths, as that works on Mac, and eliminates the chance of characters being 'escaped'. When collaborating with others, you can both run the same .do file and use different directories if you ask the script to check the username.
if c(username)=="hrnaid001" {
* Data source
global PALMS "C:/Users/hrnaid001/Dropbox/Economics/Survey data/PALMS"
* Project folder
global USER "C:/Users/hrnaid001/Dropbox/Economics/Tutoring/SoE/2023ECO5011F/Stata"
}
cd "$USER/Scripts"
It is easiest if you create the appropriate folders on your computer manually.
15. Saving and printing output — log files
cap log close _all
cap (short for capture) means that the script won't stop if the line produces an error. log close closes the running log if one is open (so that we can restart the log from the top). The _all specification closes all open log files (e.g. if multiple logs are being kept).
A log file saves the results into a file, so that we can check the results later if needed, without having to run the script again (which could take time for large, complex analyses).
The Stata Markup and Control Language (smcl format) log file is responsive to the screen size, and has colours.
log using "$USER/Scripts/Logs/Practice", smcl replace name(smcl1)
Supervisors or clients sometimes don't use Stata themselves, so it is easier for them to open a text version of the log file.
log using "$USER/Scripts/Logs/Practice", text replace name(text1)
You can also convert .smcl log files to their text-based version with translate filename.smcl filename.log, replace, which saves filename.log (the text version). You can also convert an SMCL file to PDF, to share it more easily with others: translate filename.smcl filename.pdf
Loading data into memory
Press Ctrl + Shift + Esc, and look at the available RAM (random access memory) in the computer. When a statistical package uses a dataset, it loads the data from the hard drive to the RAM, because reading data on RAM is quicker. The following command loads data into memory. Watch how the space used on RAM goes up, as the data is loaded in.
use "$PALMS/palmsv3.3.dta", clear
Data analysts need more RAM than regular computer users, because of this.
Stata basics
1. Read this — it will help
If you are confused about a command, type help <command> in the console. It is very important that you use Stata's help files to learn how to use commands, and the correct syntax to use. You will need to continue using the help command to remind yourself of different commands' usage, even after you have become accomplished at using Stata. Run the following line:
help
2. A brief description of Stata
Stata is a statistical package for managing, analyzing, and graphing data.
Throughout this tutorial, we will introduce commands.
Create a variable:
generate x = 5
This creates a variable named "x", which equals 5 in every observation.
The base month of realearnings in the raw data is December 2017.
Go to https://www.statssa.gov.za/publications/P0141/CPIHistory.pdf to see the indices, and https://www.aidanhorn.co.za/inflation/app for tidy inflation data. Only run the following command once (or generate a new variable instead), because each run rescales the variable again.
replace realearnings = realearnings/84.3*104.2
Now the base period is 2022. A "global macro" saves a small piece of information (a "local macro" does not carry on saving it after the .do file has finished running).
global BASEyr "2022"
We can abbreviate commands, and they will still run normally (see the underlined part of a command in a help file). "lab var" stands for "label variable".
lab var realearnings "Real gross monthly earnings, in $BASEyr rands"
You need to use two equals signs when doing boolean logic. You should only use one equals sign when defining a variable.
count if realearnings==0
gen logrealearnings = log(realearnings)
The summarize command quickly computes the mean. The detail option quickly shows the distribution, and moments. Do you know what the moments of a distribution are?
summarize realearnings
summarize realearnings, detail
Note that . (i.e. "missing") is treated as larger than any number, so when an if condition requires a variable to be greater than some amount, you also need to add "and the variable is less than missing". There are also extended missing values (.a, .b, .c, etc.), which are all larger than the standard . value.
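As a minimal sketch of this behaviour, the following toy dataset (the variable name earnings and its three values are made up for illustration) shows how a greater-than condition picks up missing values unless they are excluded:

```stata
* Toy data (hypothetical values): three earnings, one of them missing.
clear
input earnings
500
2000000
.
end
count if earnings > 1000000                 // 2: the missing value is counted as well
count if earnings > 1000000 & earnings < .  // 1: missing values are excluded
count if !missing(earnings)                 // 2: missing() also catches .a, .b, etc.
```

The missing() function is an alternative to writing "< ." and reads a little more clearly in long conditions.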
Inspect the data for outliers
Sort the dataset in descending order.
gen minusrealearnings = -realearnings
sort minusrealearnings
list realearnings if realearnings > 10^7 & realearnings <.
* Note that in Python, the power symbol must be two asterisks: **
format realearnings %12.0fc
list realearnings if realearnings > 10^7 & realearnings <. // These are monthly earnings values for individuals, from the labour force surveys
count if realearnings > 10^6 & realearnings <. // In the raw data, 177 people have earnings above R 1 million per month, over the years 1993-2017.
The != means 'not equal to'
levelsof year if realearnings !=. // Note that years 2008 and 2009 do not have earnings data.
DataFirst has imputed earnings values for outliers, which we show at the bottom of this tutorial.
You can view the actual dataset by typing
browse
or browse <varlist>
3. Resources for learning and using Stata
Before posting a question to Statalist, you should read the Statalist FAQ, which can be found at https://www.statalist.org/forums/help/
4. Stata's help and search facilities
Use the help function whenever you are unsure about how to write code.
5. Editions of Stata
There are three editions/sizes for Stata. In order from most expensive to cheapest: Stata/MP, Stata/SE and Stata/BE. It costs in the region of ZAR 4700 to purchase a Stata licence.
Stata/MP supports parallel processing, i.e. processing on multiple cores of the computer at the same time. Go to the task manager now to find out how many cores your computer has. Modern laptops can have between 2 and 6, and servers can have many cores. The server at the National Treasury had 32 cores in 2022, which researchers share while working at the same time, although they ask people to limit the number of cores they use on individual instances of Stata. Stata/MP is used by institutions for big datasets, and it supports billions of observations and up to 120 000 variables.
Stata/SE is more common for individuals, as it is cheaper.
Stata/BE allows only a limited size dataset to be used.
6. Managing memory
Number the observations from 1 to the end:
gen n = _n
Preserve… restore
If you type preserve in your .do file, then you can change your dataset, save it, and restore it. The original data that you were working with will not have changed afterwards. For example, I often use collapse or reshape within this environment.
preserve
help pause
help collapse
pause // type "end"
keep if year >= 2000
* Median real earnings by gender and main occupational category, over the entire time period.
collapse (median) realearnings (count) n, by(gender jobocccode)
lab var realearnings "Median gross real earnings ($BASEyr rands)"
format realearnings %12.0fc
browse
pause
save "collapse_medianrearn_jobocc.dta", replace
help import excel
export excel using "Median_realearnings_occupation.xlsx", sheet("Gender") sheetreplace keepcellfmt firstrow(varlabels)
restore
7. more conditions
The results will run at full speed by default.
8. Error messages and return codes
If there is a (small) mistake in your code, then Stata will stop running the script where the error occurs. Make sure to read the error message carefully, in order to debug what has gone wrong with your code. This way, you can fix problems yourself, without necessarily having to ask others for help.
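A small sketch of how return codes can be inspected, assuming the auto dataset that ships with Stata; the variable name earnings is deliberately one that does not exist in that dataset:

```stata
* Sketch: trap an error with capture, then inspect the return code in _rc.
sysuse auto, clear
capture confirm variable earnings   // earnings does not exist in auto.dta
display "Return code: " _rc         // 111 means that the variable was not found
if _rc != 0 {
    display as error "earnings is not in this dataset, so this step is skipped."
}
```

Typing search rc 111 in the console gives a description of return code 111.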
9. The Break key
Click on the red cross to stop the execution (for example, in case you realise your code wasn't adequate). You can try this with the above preserve section, as I noticed that that section is slow.
10. Keyboard use
Make sure to save your .do file regularly while typing (Ctrl + S), in case Stata crashes after you have developed code. I do this as frequently as every 20–40 seconds, while typing, or after a few lines (because losing work that took mental effort and creativity is frustrating). On a side note, you should sync your files to a cloud, to avoid losing work. See the "Software > Cloud storage" section in http://toolkit.uctecossoc.co.za/
PgUp (the page up key) cycles through your previous commands (in the console).
PgDn (the page down key) goes forwards through your previous commands, in the console.
10.6. Tab expansion of variable names.
A quick way to enter a variable name is to take advantage of Stata’s tab-completion feature. Simply type the first few letters of the variable name in the Command Window and press the Tab key. Stata will automatically type the rest of the variable name for you. If more than one variable name matches the letters you have typed, Stata will complete as much as it can and beep at you to let you know that you have typed a nonunique variable abbreviation. The tab-completion feature also applies to typing filenames.
28. Commands everyone should know
To make sure that you have fre installed, run
capture which fre // which returns an error code if fre cannot be found
if _rc != 0 {
ssc install fre
}
fre helps with quickly inspecting the values of a variable (similar to tab, an abbreviation for tabulate).
Here is a list of commands that "everyone" should know (go through this list, with the help files):
Getting help
help, net search, search [U] 4 Stata's help and search facilities
Operating system interface
pwd, cd
Using and saving data from disk
save
use
compress
Inputting data into Stata
import
edit
Basic data reporting
describe
codebook
list
browse
count
inspect
table
tabulate [R] tabulate oneway and tabulate twoway
fre Similar to tabulate, but includes missing values
summarize
Data manipulation
append, merge [U] 23 Combining datasets
generate, replace
egen
rename
clear
drop, keep
sort
encode, decode
order
by [U] 11.5 by varlist: construct
reshape
frames [D] frames
Graphing data
graph
Keeping track of your work
log [U] 15 Saving and printing output—log files
notes [D] notes
Convenience
display
Elements of Stata
11. Language syntax
NB: Oxford Languages defines "syntax" as (2): "The structure of statements in a computer language."
It is very important that you know what "syntax" means! This is what you're looking for when you read the help files.
11.1. Overview
With few exceptions, the basic Stata language syntax is:
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]
Square brackets indicate parts that are optional. Take note of how weights are included in estimation, from the line above. A command can be customised with options that come after the comma. There are often multiple options (settings) available, which can be found in the help file. When an option takes an argument, the argument is enclosed in parentheses.
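To make the pieces of the syntax concrete, here is a short sketch using the auto dataset that ships with Stata. The variable price_thousands is a made-up name, and using weight (the car's weight) as an analytic weight is only for illustration:

```stata
sysuse auto, clear
* by varlist: command varlist if exp, options
bysort foreign: summarize price mpg if rep78 >= 3 & rep78 < ., detail
* command varlist in range
list make price in 1/5
* command newvar = exp
generate price_thousands = price/1000
* command varlist [weight], options
summarize price [aweight=weight], detail
```

Note that by and in cannot be combined in the same command.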
11.1.8. numlist
A numlist is a list of numbers. Stata allows certain shorthands to indicate ranges. Practice editing and running the following loop, with (some of) the various examples listed below.
foreach v of numlist 1/3 { // replace 1/3 with a numlist from the table below (forvalues only accepts ranges)
display `v'
}
Numlist               Meaning
2                     just one number
1 2 3                 three numbers
3 2 1                 three numbers in reversed order
.5 1 1.5              three different numbers
1 3 -2.17 5.12        four numbers in jumbled order
1/3                   three numbers: 1, 2, 3
3/1                   the same three numbers in reverse order
5/8                   four numbers: 5, 6, 7, 8
-8/-5                 four numbers: −8, −7, −6, −5
-5/-8                 four numbers: −5, −6, −7, −8
-1/2                  four numbers: −1, 0, 1, 2
1 2 to 4              four numbers: 1, 2, 3, 4
4 3 to 1              four numbers: 4, 3, 2, 1
10 15 to 30           five numbers: 10, 15, 20, 25, 30
1 2:4                 same as 1 2 to 4
4 3:1                 same as 4 3 to 1
10 15:30              same as 10 15 to 30
1(1)3                 three numbers: 1, 2, 3
1(2)9                 five numbers: 1, 3, 5, 7, 9
1(2)10                the same five numbers: 1, 3, 5, 7, 9
9(-2)1                five numbers: 9, 7, 5, 3, 1
-1(.5)2.5             the numbers −1, −.5, 0, .5, 1, 1.5, 2, 2.5
1[1]3                 same as 1(1)3
1[2]9                 same as 1(2)9
1[2]10                same as 1(2)10
9[-2]1                same as 9(−2)1
-1[.5]2.5             same as −1(.5)2.5
1 2 3/5 8(2)12        eight numbers: 1, 2, 3, 4, 5, 8, 10, 12
1,2,3/5,8(2)12        the same eight numbers
1 2 3/5 8 10 to 12    the same eight numbers
1,2,3/5,8,10 to 12    the same eight numbers
1 2 3/5 8 10:12       the same eight numbers
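One way to check how Stata expands a numlist is to loop over it and count the elements. This sketch uses foreach ... of numlist, which accepts any numlist; the local macro name k is arbitrary:

```stata
* Show and count the elements of a numlist (edit the numlist to try other rows of the table).
local k = 0
foreach v of numlist 1 2 3/5 8(2)12 {
    display `v'
    local ++k
}
display "Number of elements: `k'"   // 8
```

Run the whole block at once, since local macros do not survive between separately run selections of a .do file.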
11.1.10. Prefix commands
The quietly prefix suppresses output in the results window (and log file). It is usually used when you merely want to collect an estimate or perform an operation, without making the screen too busy. I have used this when running many loops in a program.
11.2. Abbreviation rules
As mentioned in Section 2, you can run commands even if they are abbreviated. The minimum abbreviation is shown by underlining in the help files. For example:
summ hrslstwk, detail
Here, summ is an abbreviation for summarize. Other commonly-used abbreviations include gen for generate, and lab for label.
11.2.3. Variable-name abbreviation
Variable names may be abbreviated to the shortest string of characters that uniquely identifies them given the data currently loaded in memory. For example:
summarize hrs // Shows count, mean, standard deviation, min and max of hrslstwk.
11.2.4. This can be unexpected in long, complicated scripts, when you think that you've typed the full variable name. In such cases, you can turn off this feature with set varabbrev off, or for a single command with the novarabbrev prefix.
11.4.1. You may use * to indicate that "zero or more characters go here". For instance, if you suffix * to a partial variable name (for example, educ*), you are referring to all variable names that start with that letter combination. If you prefix * to a letter combination (for example, *_derived), you are referring to all variables that end in that letter combination. If you put * in the middle (for example, self*emp), you are referring to all variables that begin and end with the specified letters.
You may use ? to specify that one character goes here.
You may place a dash - between two variable names to specify all the variables stored between the two listed variables (in the order saved in the dataset), inclusive. You can determine storage order by using describe, which lists variables in the order in which they are stored.
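The following sketch tries these shorthands on the auto dataset that ships with Stata (its variables are stored in the order make, price, mpg, rep78, headroom, trunk, weight, length, turn, displacement, gear_ratio, foreign). The local macro name tvars is arbitrary:

```stata
sysuse auto, clear
describe t*              // trunk and turn
describe *t              // weight and displacement
describe ?pg             // mpg
summarize price-rep78    // price, mpg and rep78, which are stored consecutively
unab tvars : t*          // expand the abbreviation into a local macro
display "`tvars'"
```

unab is handy inside scripts when you want to see exactly which variables a wildcard refers to before using it.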
11.4.3. Factor variables
You do not need to create categorical dummies for variables: when doing estimation, i.varname separates the variable into categorical dummies. # gives an interaction term, and ## gives the interaction together with the main effects, as in i.varname##c.varname, where c. indicates a continuous variable. For example:
fre province
reg realearnings i.province##c.year c.age##c.age
11.4.3.2. When we typed i.province in the regression above, the smallest level of province became the base level, which is what happens when we do not specify otherwise. You can specify the base level of a factor variable by using the ib. operator.
reg realearnings ib4.province##c.year c.age##c.age // 4. Free State is set as the base.
11.4.4. Time-series variables
You need to specify the time variable with tsset (e.g. tsset time). Then L., F., D. and S. are the lag, lead, difference and seasonal-difference operators respectively. For panel data, xtset unit time sets up the dataset for panel data analysis.
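As an illustration of the time-series operators, here is a sketch on a made-up quarterly series; the variable names and the values of y are hypothetical:

```stata
clear
set obs 8
generate quarter = tq(2020q1) + _n - 1   // hypothetical quarterly time index
format quarter %tq
generate y = _n^2                        // made-up series: 1, 4, 9, ...
tsset quarter
generate y_lag    = L.y     // value one quarter earlier
generate y_lead   = F.y     // value one quarter later
generate y_diff   = D.y     // y - L.y
generate y_sdiff4 = S4.y    // y - L4.y (change from the same quarter a year earlier)
list quarter y y_lag y_lead y_diff y_sdiff4
```

The operators return missing where the required earlier or later observation does not exist, for example y_lag in the first quarter.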
11.5. by varlist: construct
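by varlist: or by varlist, sort: (or bysort varlist: for short) runs the command separately within each group defined by the combinations of values of the variables in varlist. Without the sort option, the data must already be sorted by varlist. Within each by-group, the system variables _n and _N are defined relative to the group: _n is the observation number within the group, and _N is the number of observations in the group. (Outside of by, _n is the row of the observation in the dataset, and _N is the total number of observations.)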
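A short sketch of _n and _N inside by-groups, using the auto dataset that ships with Stata; the new variable names are made up for illustration:

```stata
sysuse auto, clear
* Within each group of foreign, sort by price, then number the cars and count the group.
bysort foreign (price): generate rank_in_group = _n
by foreign: generate group_size = _N
by foreign: generate cheapest = price[1]   // [1] is the first observation of the by-group
list foreign make price rank_in_group group_size cheapest if rank_in_group <= 2
```

Putting price in parentheses sorts by it within each group without making it part of the grouping.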
11.6. You can load a data file from the Internet. For example (covid-19 deaths by country):
import delimited "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv", clear
rename v4 longitude
rename lat latitude
help reshape
reshape long v, i(countryregion provincestate latitude longitude) j(time)
rename v deaths
lab var deaths "Covid-19 deaths"
browse
replace time = time - 4
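gen datetime = date("2020-01-22", "YMD") if time==1
* Sort by time within each group, so that datetime[1] is always the first date.
bysort countryregion provincestate (time): replace datetime = datetime[1] + time - 1 if time>1
format datetime %td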
Let's go back to using the PALMS data.
use "$PALMS/palmsv3.3.dta", clear
replace realearnings = realearnings/84.3*104.2
* Note that the global macro is still in memory, if this is the same session.
lab var realearnings "Real gross monthly earnings, in $BASEyr rands"
11.6.1. The characters .. refer to the folder containing the current folder. Thus ../myfile refers to myfile in the folder containing the current folder, and "../nextdoor/myfile" refers to "myfile" in the folder "nextdoor" in the folder containing the current folder. This can be useful when saving files in a project.
11.6.2. Stata understands ~ to mean your home directory.
12. Data
12.1. A dataset is data plus labelings, formats, notes, and characteristics.
12.2.1. Some data collectors use "extended" missing values to indicate why a certain value is unknown: the question was not asked, the person refused to answer, etc. The ordering of extended missing values is:
all numbers < . < .a < .b < ··· < .z
Thus,
count if age>50
may not return the wanted result, as it includes missing ages. You should remember to include "and less than missing" when using a greater-than sign, or a not equals sign:
count if age>50 & age<.
Compare
count if age!=50
to
count if age!=50 & age<.
In a regression, observations with a missing value in any of the variables are left out of the sample. That is why it is wise to understand how much data is missing in each variable before doing analysis, and how the nonmissing subsets intersect, as the intersection determines the sample size of your analysis. It makes sense to draw a Venn diagram as part of your report.
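One possible way to check the amount of missing data and the resulting sample size, sketched on the auto dataset that ships with Stata (n_missing and complete are made-up names):

```stata
sysuse auto, clear
misstable summarize                        // in auto.dta, only rep78 has missing values
egen n_missing = rowmiss(price mpg rep78)  // number of missing values per observation
count if n_missing == 0                    // observations with all three variables nonmissing
local complete = r(N)
regress price mpg rep78
display "Complete cases: `complete', regression sample: " e(N)
```

The count of complete cases and e(N) agree, because regress drops every observation with a missing value in any of its variables.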
12.2.2. Numeric storage types
Numbers can be stored in one of five variable types: byte, int, long, float (the default), or double. The number storage type can be set when generating a variable, for example:
gen byte age50 = age>=50 & age<.
replace age50 = . if missing(age) // missing() also catches the extended missing values .a, .b, etc.
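Storage types will be shrunk in size, where possible, when the compress command is used. bytes are stored in 1 byte, ints in 2 bytes, longs and floats in 4 bytes (float is short for 'floating point'), and doubles in 8 bytes. The table below shows the minimum and maximum values for each storage type.
Storage type   Minimum                    Maximum                   Bytes
byte           -127                       100                       1
int            -32,767                    32,740                    2
long           -2,147,483,647             2,147,483,620             4
float          -1.70141173319 × 10^38     1.70141173319 × 10^38     4
double         -8.9884656743 × 10^307     8.9884656743 × 10^307     8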
12.4. Strings
A "string" is a sequence of characters. Strings are enclosed in double quotation marks (Python can also use single quotation marks). If you want quotation marks inside your string, enclose the string in compound double quotes: `" to open and "' to close. For example:
label define reason 17 `"Their mobile phone was "lost"."', modify
label list reason
12.4.4. String data in Stata is usually encoded: stored as numbers, but labelled with the string interpretation. This makes programming and analysis more efficient. Encoding can be done with the encode (and undone with the decode) command, but then the programmer will not have control over the order of the values, if order matters (for example, from survey responses).
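One way to keep control of the order, sketched here on a made-up survey variable, is to define the value label first and then pass it to encode with the label() option. The names answer, agreement, answer_code and answer_again are hypothetical:

```stata
clear
input str10 answer
"Agree"
"Disagree"
"Neutral"
"Agree"
end
* Define the order first (hypothetical survey scale), then encode using that label.
label define agreement 1 "Disagree" 2 "Neutral" 3 "Agree"
encode answer, generate(answer_code) label(agreement)
tabulate answer_code
tabulate answer_code, nolabel   // the underlying numbers
decode answer_code, generate(answer_again)
```

Without the label() option, encode numbers the categories in alphabetical order of the strings.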
12.5.1. Numeric formats
I suggest that you use %15.0fc as the numeric format, to look at large values. For example, compare
quietly summ realearnings if year==2017
di r(sum)
di %15.0fc r(sum)
table year, contents(sum realearnings) // Stata 16 and earlier. In Stata 17 and later, use: table year, statistic(sum realearnings)
br realearnings // View the data before and after we change the format!
format realearnings %15.0fc
help table
* Note that we still have outliers in these data.
table year, contents(sum realearnings) format(%15.0fc) // In Stata 17 and later, use: table year, statistic(sum realearnings) nformat(%15.0fc)
The number to the left of the decimal point in the format is the total width in characters, which includes the decimal point (if you specify a positive number of decimal places) and any commas.
Example 2:
cap drop y
gen double y = 234567890.34 in 1
replace y = 654321098765.5432 in 2
list y in 1/2
format y %20.1fc
list y in 1/2
12.6. Dataset, variable, and value labels
Labels are strings used to label elements in Stata, such as labels for datasets, variables, and values.
12.6.2. Variable labels
You can label variables, as we have already shown, with
label variable y "My variable"
This is useful for users of your dataset to understand in more detail what the variables mean, as variable names can only be a short string.
12.6.3. Value labels
Variables usually take on numerical values, and these values are labelled, with 'value labels'. This labelling is important, and is often the main activity of cleaning, once variables are created. For example:
cap drop age10cats
cap label drop bin10
gen age10cats = int(age/10)*10
lab var age10cats "Age categories (bins of 10)"
tab age10cats if age<130
// Run both the following lines at the same time. The /// can be used to break a line.
label define bin10 0 "0-9" 10 "10-19" 20 "20-29" 30 "30-39" 40 "40-49" 50 "50-59" ///
60 "60-69" 70 "70-79" 80 "80-89" 90 "90-99" 100 "100-109" 110 "110-119" 120 "120-129"
You can list the contents of a value label with label list.
The bin10 value label can now be used on multiple variables. It must be attached to the variable, to put it to work:
label values age10cats bin10
tab age10cats if age<130
Practice highlighting the table in the results window, right click > Copy table, and paste the table into Excel.
12.7. Notes attached to data
You can attach notes to the dataset, with
note: realearnings inflated to $BASEyr
* Display notes
notes
* You can attach notes to variables as well:
note age10cats: There may be outlier ages above 129.
notes age10cats
This enables you to save information longer than just a variable label. See help notes for more guidance!
12.10. Data frames
Similar to R, Stata can now hold multiple datasets in memory.
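A brief sketch of working with frames, which needs Stata 16 or later. It uses the auto and census datasets that ship with Stata, and scratch is a made-up frame name:

```stata
sysuse auto, clear                  // loaded into the default frame
frame create scratch                // "scratch" is a hypothetical frame name
frame scratch: sysuse census, clear // load a second dataset into the new frame
frame scratch: count                // run one command in the other frame
frame dir                           // list the frames in memory
frame change scratch                // make scratch the current frame
describe, short
frame change default                // go back to the auto data
frame drop scratch
```

Frames can replace some uses of preserve and restore, because the original data stay untouched in their own frame.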
13. Functions and expressions
13.2.4. Logical operators
Note that & means 'and', | means 'or', and ! means 'not'. For example:
summarize pweight if (age<15 | (age>=65 & age<130)) & year==2016 // Non-working-age population in 2016 (working age is 15 to 64)
di %-15.0fc r(sum)/4 // the QLFS is conducted every quarter, so we divide the weight by four.
scalar nonworking2016 = r(sum) // saves the number, taking up a small amount of space.
summ pweight if year==2016 // comparison
di %-15.0fc r(sum)/4
di nonworking2016/r(sum) // proportion of non-working age population, out of total population
13.6. Accessing results from commands
Note that you can view what results are saved in memory, by running return list or ereturn list (for estimation results).
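For example, using the auto dataset that ships with Stata (the local macro name mean_price is arbitrary):

```stata
sysuse auto, clear
quietly summarize price
return list                      // r(N), r(mean), r(sd), r(min), r(max), ...
display "Mean price: " %9.2f r(mean)
local mean_price = r(mean)       // copy the result before another command overwrites r()
quietly regress price mpg
ereturn list                     // e(N), e(r2), e(b), e(V), ...
display "R-squared: " %5.3f e(r2)
```

Results in r() are replaced by the next r-class command, so it is safest to copy what you need into a macro or scalar straight away.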
13.7. Explicit subscripting
You can access the value of a variable, by suffixing the variable with square brackets, and putting the observation number in the brackets. E.g.:
di age[1000]
* The last observation:
di age[_N]
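13.12. When comparing a float variable with a number that has decimals, it is often safer to wrap the number in the float() function, because Stata performs calculations in double precision, and a value stored as a float may not exactly equal the same number in double precision.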
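A minimal sketch of the problem, on a one-observation toy dataset:

```stata
clear
set obs 1
generate x = 1.1            // stored as a float by default
count if x == 1.1           // 0: the float 1.1 differs from the double-precision 1.1
count if x == float(1.1)    // 1: both sides are rounded to float precision
```

Storing the variable as a double (generate double x = 1.1) is another way to avoid the mismatch, at the cost of more memory.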
16. Do files
16.1.3. In order to break lines, when writing a long command, write your code in-between
#delimit ;
; #delimit cr
This changes the command delimiter that Stata recognizes from the end of the line to a semi-colon, so you must put a semi-colon at the end of every command! Writing #delimit cr changes the delimiter back to a "carriage return" (the end of the line). Note that inside this environment, a comment written with * or // is not read as a separate "line" from the rest of the code, so you need to put comments within /* and */. It is good to use this environment when writing code for graphs. #delimit only works in .do files, not in the console.
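As a sketch, a graph command spread over several lines could look as follows in a .do file. It uses the auto dataset that ships with Stata, and the legend text and title are made up:

```stata
sysuse auto, clear
#delimit ;
twoway (scatter price mpg if foreign == 0)
       (scatter price mpg if foreign == 1),
       legend(order(1 "Domestic" 2 "Foreign"))  /* comments go between these markers */
       title("Price against mileage") ;
#delimit cr
display "Back to normal line endings"
```

An alternative for a single long command is to end each line with ///, which leaves the delimiter unchanged.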
18. Programming Stata
You can save a small piece of information (e.g. a number or a string) in a global or local "macro". The global macro will continue to be saved after the (section of the) .do file has run, but the local macro will be discarded once the .do file is not running. For example:
global Intro "Hello world!"
display "$Intro"
display `"I told the computer to say, "$Intro""'
local Short 4.567
display `Short'*4
Note that when running the last line by itself, `Short' does not exist.
You can save entire sections of code, in case you want to re-run it multiple times, in a "program". E.g.:
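cap program drop earnings_tenths
program earnings_tenths
* `1' is the first argument given to the program: the main industry code.
forvalues y=2011(1)2017 {
preserve // preserve cannot be nested, so it is only used here, inside the program
keep if jobindcode==`1' & year==`y'
xtile earningstenth = realearnings [pw=bracketweight], nquantiles(10)
collapse (mean) earnings_mean = realearnings [pw=bracketweight], by(earningstenth)
save "earnings_means_tenths_`1'_`y'.dta", replace
restore // go back to the full dataset before the next year
}
end
forvalues j=1/10 {
earnings_tenths `j'
}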
Note that the first argument of the program earnings_tenths is turned into a local macro `1' in the program. In this case, we save the collapsed dataset with the main industry code in the file name. Global macros can be used with ${global}restoffile in file names, and forward-slashes should be used for folder paths to avoid escaping the dollar sign or back-tick.
You can delete the files you just created, with
forvalues d=1/10 {
forvalues y=2011/2017 {
erase "earnings_means_tenths_`d'_`y'.dta"
}
}
20. Estimation and postestimation commands
20.11. Obtaining predicted values
After running a regression, you can type predict dephat to calculate the predicted values for each observation, based on the estimated coefficients, where dephat is the name of the variable you want to create. To restrict the variable to the sample, state predict dephat if e(sample)
20.12. Accessing estimated coefficients
After estimation, you can access the coefficients with _b[varname]
See 20.13. for performing tests after estimation (such as F-tests).
20.14. Obtaining linear combinations of coefficients
lincom computes point estimates, standard errors, t or z statistics, p-values, and confidence intervals for a linear combination of coefficients after estimation.
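The postestimation tools in this section can be sketched together on the auto dataset that ships with Stata. The variable name price_hat and the values 20 and 3000 are chosen only for illustration:

```stata
sysuse auto, clear
regress price mpg weight
predict price_hat if e(sample)                      // fitted values, for the estimation sample only
display _b[mpg]                                     // coefficient on mpg
display _b[_cons] + _b[mpg]*20 + _b[weight]*3000    // prediction by hand
lincom _cons + 20*mpg + 3000*weight                 // same point estimate, with a standard error
test mpg weight                                     // F-test that both slopes are zero
```

lincom saves more typing than the hand calculation, and it also reports a confidence interval.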
Please make sure that you have covered the material in Wooldridge, J. 2013. Introduction to econometrics. (EMEA edition).
21. Creating reports
See putexcel and putdocx for creating files programmatically.
Advice
22. Entering and importing data
A dataset can easily be opened in Stata by clicking File > Import.
23. Combining datasets
append combines datasets vertically.
merge combines datasets horizontally. This can be done 1:1, 1:m, m:1, or m:m, i.e. merging the observations:
one-to-one,
one (group) to many (individuals),
many (individuals on the left) to one (group on the right, such as households), or
many-to-many (this is rarely what you want; joinby is usually the better choice).
joinby forms all pairwise combinations of observations within groups (a product within groups).
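A small sketch of a many-to-one merge, using two made-up datasets (all variable names and values are hypothetical): a person-level file in memory, and a household-level file saved in a temporary file.

```stata
* Hypothetical household-level file: one row per household.
clear
input hhid hhsize
1 3
2 1
end
tempfile households
save `households'

* Hypothetical person-level file: many rows per household.
clear
input hhid personid age
1 1 34
1 2 36
1 3 5
2 1 61
3 1 28
end
merge m:1 hhid using `households'
tabulate _merge          // household 3 has no match in the household file
list, sepby(hhid)
```

Always inspect _merge after merging, to see how many observations matched and how many came from only one of the two files.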