Repeat Tasks with Loops
Overview
Teaching: 30 min
Exercises: 10 min
Questions
How can I minimize bugs in my code?
Objectives
Automate repetitive tasks using foreach and forvalues.
For loops
Sometimes you will need to do repetitive tasks in the process of data manipulation and analysis. For example, you might want to summarize gdp_per_capita for each year in your dataset. You may be tempted to write something like this:
. use "data/derived/gdp_per_capita.dta", clear
. summarize gdp_per_capita if year == 2010
. summarize gdp_per_capita if year == 2011
. summarize gdp_per_capita if year == 2012
. summarize gdp_per_capita if year == 2013
. summarize gdp_per_capita if year == 2014
. summarize gdp_per_capita if year == 2015
. summarize gdp_per_capita if year == 2016
. summarize gdp_per_capita if year == 2017
This approach does not work well if you would like to look at all years in the dataset or change the time period of interest. Loops are a way to avoid repeating the same code multiple times.
. forvalues i = 1/5 {
2. display `i'
3. }
1
2
3
4
5
You should always use curly braces to open and close the loop:
i). the open brace must appear on the same line as the forvalues or foreach statement;
ii). nothing may follow the open brace except comments;
iii). the first command to be executed must appear on a new line;
iv). the close brace must appear on a line by itself.
In Stata, indentation is optional, but it makes your code easier to read, especially with nested loops. The braces are not optional: if the open brace is missing from the forvalues line, Stata stops with an error.
. forvalues i = 1/5
{ required
r(100);
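As a minimal sketch of how indentation helps with nested loops, the do-file fragment below places every brace according to the rules above. The loop variable names row and col are only illustrative.

```stata
* Nested loops: each level is indented one step further.
forvalues row = 1/2 {
    forvalues col = 1/3 {
        * the inner loop runs in full for every value of row
        display "row `row', column `col'"
    }
}
```

The inner loop runs three times for each of the two values of row, so the display command runs six times in total.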
We can use multiple commands inside the loop.
. forvalues i = 1/5 {
2. display `i'
3. display 6 - `i'
4. }
1
5
2
4
3
3
4
2
5
1
You can use the loop variable in any command, in any place.
. forvalues t = 2010/2017 {
2. summarize gdp_per_capita if year == `t'
3. }
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        239       17122     18892.45   660.211   125140.8
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        242    17372.16     19354.81  682.4322   129349.9
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        239    17645.64     19386.14   706.798   125302.1
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        239    17833.08     19529.57   593.056   135318.8
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        238    17885.82     19371.79  597.1352   130755.2
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        237    18017.96      18936.6  621.5698   119872.6
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        237    18227.89     18957.92  642.8735   118222.4
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        235    18567.77     19230.34    661.24     116932
The loop variable is not displayed, so we may not know which iteration the loop is in unless we explicitly ask Stata to display it.
. forvalues t = 2010/2017 {
2. display `t'
3. summarize gdp_per_capita if year == `t'
4. }
2010
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        239       17122     18892.45   660.211   125140.8
2011
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        242    17372.16     19354.81  682.4322   129349.9
2012
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        239    17645.64     19386.14   706.798   125302.1
2013
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        239    17833.08     19529.57   593.056   135318.8
2014
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        238    17885.82     19371.79  597.1352   130755.2
2015
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        237    18017.96      18936.6  621.5698   119872.6
2016
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        237    18227.89     18957.92  642.8735   118222.4
2017
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
gdp_per_ca~a |        235    18567.77     19230.34    661.24     116932
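If the full summarize table is more than you need, one possible variation is to suppress it with quietly and print a single labelled line per year from the stored result r(mean). The sketch below builds a small toy dataset (an assumption, not the lesson's data) so that it runs on its own.

```stata
* Toy data (an assumption, so the sketch runs on its own): one row per year.
clear
set obs 8
generate year = 2009 + _n
generate gdp_per_capita = 1000 * _n

forvalues t = 2010/2017 {
    * quietly hides the table; summarize still stores the mean in r(mean)
    quietly summarize gdp_per_capita if year == `t'
    display "Year `t': mean GDP per capita = " r(mean)
}
```

With the lesson's dataset already in memory, only the loop itself would be needed.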
Note that the loop variable is a macro, not a scalar. This helps us write code where the loop variable is part of a variable name or is on the left-hand side.
. forvalues i = 1/5 {
2. generate gdp_`i' = gdp_per_capita^`i'
3. }
(9,381 missing values generated)
(9,381 missing values generated)
(9,381 missing values generated)
(9,381 missing values generated)
(9,381 missing values generated)
. summarize gdp_?
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       gdp_1 |      6,459    15209.41     17872.38  354.2845   135318.8
       gdp_2 |      6,459    5.51e+08     1.42e+09  125517.5   1.83e+10
       gdp_3 |      6,459    3.13e+13     1.38e+14  4.45e+07   2.48e+15
       gdp_4 |      6,459    2.33e+18     1.49e+19  1.58e+10   3.35e+20
       gdp_5 |      6,459    2.06e+23     1.71e+24  5.58e+12   4.54e+25
Challenge
Write a loop that displays the first five square numbers.
Solution
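One possible solution, offered as a sketch rather than the only answer, squares the loop variable inside display:

```stata
* The macro `i' is substituted first, so Stata evaluates 1^2, 2^2, ..., 5^2.
forvalues i = 1/5 {
    display `i'^2
}
```

This displays 1, 4, 9, 16 and 25, each on its own line.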
Challenge (optional)
Write a do-file named “append-gdp-all.do” that appends all datasets whose names match the pattern gdp`year'.dta. Generate a variable called year that records the GDP year indicated in the file name. Label the variables accordingly. Save the final dataset as “gdp1990-2017.dta”.
Solution
You can set the step size of the loop. In this case, we have a start, a step size in round brackets, and an end number.
. forvalues i = 1(1)5 {
2. display `i'
3. }
1
2
3
4
5
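The step does not have to be 1, and it may be negative. As a small sketch of the same start(step)end syntax, the loop below counts down from 10 to 2 in steps of 2:

```stata
* Start at 10, add -2 each time, stop once the value would fall below 2.
forvalues i = 10(-2)2 {
    display `i'
}
```

This displays 10, 8, 6, 4 and 2.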
Challenge
What would be the output of
forvalues i = 0(2)4 {
    display `i', 5*`i'
}
Solution
Create an indicator variable for each decade.
forvalues decade = 1960(10)2010 {
    generate decade`decade' = (int(year / 10) * 10 == `decade')
}
The loop variable increases in steps of 10. Note the use of a Boolean expression: whenever it evaluates to true, its value is 1, otherwise 0.
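To see how such an expression evaluates, you can try it with display on a single number. The year 1987 below is just an illustrative value.

```stata
* A true comparison evaluates to 1, a false one to 0.
display (3 > 2)
* int(1987 / 10) * 10 rounds 1987 down to its decade, 1980.
display (int(1987 / 10) * 10 == 1980)
display (int(1987 / 10) * 10 == 1990)
```

The three commands display 1, 1 and 0.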
You can also loop over a list of arbitrary strings, but note the different syntax.
. foreach fruit in apple banana carrot {
2. display "`fruit'"
3. }
apple
banana
carrot
Note that the loop variable is given the name fruit. We can choose any name we want for loop variables. We might have named it unicorn and the loop would still work, as long as we refer to it correctly inside the loop. The loop variable is still a macro and is substituted into the command before the command runs.
. foreach fruit in apple banana carrot {
2. display `fruit'
3. }
apple not found
r(111);
The error arises because in the first iteration `fruit' evaluates to apple, and Stata tries to run display apple. There is no variable or scalar named apple, so we receive an error.
Note that the error breaks the loop.
The separator in the list is the space. If one of your list elements has spaces, use double quotes.
. foreach fruit in apple banana carrot "dragon fruit" {
2. display "`fruit'"
3. }
apple
banana
carrot
dragon fruit
Challenge
What would be the output of
foreach element in apple banana carrot dragon fruit {
    display "`element'"
}
Solution
Challenge
What would be the output of
foreach fruit in apple banana carrot {
    display "`fruit's with `fruit's"
}
Solution
Repeat the creation of an index variable, this time for population as well.
local base_year 1991
egen gdp_per_capita_`base_year' = mean(cond(year == `base_year', gdp_per_capita, .)), by(countrycode)
generate gdp_per_capita_index = gdp_per_capita / gdp_per_capita_`base_year' * 100
egen population_`base_year' = mean(cond(year == `base_year', population, .)), by(countrycode)
generate population_index = population / gdp_per_capita_`base_year' * 100
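Copying and pasting is prone to errors, and not all of them are easy to spot and fix. In the last line above, population is still divided by gdp_per_capita_`base_year' rather than population_`base_year'. A loop avoids this kind of mistake.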
local base_year 1991
foreach var in gdp_per_capita population {
    egen `var'_`base_year' = mean(cond(year == `base_year', `var', .)), by(countrycode)
    generate `var'_index = `var' / `var'_`base_year' * 100
}
We can also loop over variables rather than arbitrary words.
foreach var of varlist gdp_per_capita population {
    egen `var'_`base_year' = mean(cond(year == `base_year', `var', .)), by(countrycode)
    generate `var'_index = `var' / `var'_`base_year' * 100
}
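The difference between the two forms shows up once wildcards are involved. The sketch below uses auto.dta, the example dataset shipped with Stata, rather than the lesson's data: with in, the list is taken literally, while of varlist asks Stata to expand it into existing variable names.

```stata
sysuse auto, clear

* "in": the element is used exactly as typed, so this displays m*
foreach v in m* {
    display "`v'"
}

* "of varlist": m* is expanded to the matching variables, make and mpg
foreach v of varlist m* {
    display "`v'"
}
```

A further consequence is that of varlist stops with an error if a name in the list does not match any variable, which catches typos early.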
Use for loops to ensure consistency and to minimize the risk of errors, not to save typing. Note that `var' appears on both sides of the equals sign. It is a macro that is evaluated before the command is run, so it can become part of a variable name.
foreach var of varlist *_index {
    generate log_`var' = log(`var' / 100)
}
Consistent variable names are friends of loops
If you build a consistent system of variable names, it is much easier to automate repetitive tasks with for loops. For example, you might have saved the mean of each variable with the naming pattern *_mean.
foreach var in gdp_per_capita population {
    egen `var'_mean = mean(`var')
}
* some more code
foreach var in gdp_per_capita population {
    generate `var'_demeaned = `var' - `var'_mean
}
It is much easier to reuse these variable names than if you had called them mean_gdp and avg_population.
You can reuse a loop variable name in later loops, and you can nest one loop inside another. Note the use of variable name wildcards.
foreach var of varlist population* {
    forvalues i = 1/5 {
        generate `var'_`i' = `var'^`i'
        label variable `var'_`i' "`var', polynomial of order `i'"
    }
}
Key Points
Do not copy-paste your own code.
Use for loops to automate anything that happens more than twice.