How To Log Transform Data In Stata

Introduction

In this guide, we introduce some commonly used methods for transforming and standardising variables for use in analysis. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have already opened the data file in Stata.

Contents

  • Transforming Variables

  • An Example in Stata: Mental Health Using the GSS 2004–2016

    • 2.1 The Stata Procedure
    • 2.2 Exploring the Stata Output
  • Your Turn

1 Transforming Variables

It is sometimes the case that the variables we want to use in our statistical analyses do not fulfil the assumptions of the method we wish to use. For example, a t-test requires that the dependent variable follows an approximately normal distribution unless the sample size is large. If our dependent variable is not normally distributed, it can be helpful to transform it before running the test so that its distribution is closer to normal than it is in its raw form. An example of this is in measures of income, where the distribution is typically positively skewed.

In other cases, we might want to transform a variable so that its distribution can meaningfully be compared with the distribution of another variable. This transformation procedure is also known as standardisation. The idea here is that we transform the variables onto a common scale or metric. We might do this for the independent variables in a regression analysis so that the strength of the effects of the independent variables on the dependent variable can be compared.

An additional example where transformations might be used is where we want to reverse code a variable so that low scores become high scores and so on. There is an unlimited number of potential transformations that could be applied to data. In this guide, we will focus only on some commonly encountered types.

2 An Example in Stata: Mental Health Using the GSS 2004–2016

This guide provides an introduction to transforming and standardising variables for analysis in survey research. The data come from the General Social Survey (GSS) 2004–2016.

This example uses four variables from the GSS 2004–2016:

  • The number of days of reported poor mental health across the last 30 days (MNTLHLTH)
  • Father's highest school year completed (PAEDUC)
  • Respondent age (AGE)
  • Respondent sex (FEMALE)

2.1 The Stata Procedure

We start by looking at the variable recording the number of days of reported poor mental health across the last 30 days.

First, we create a histogram of the variable of interest. We do so in Stata by entering the following command in the Command window:

histogram MNTLHLTH

Press Enter to produce a histogram.

Alternatively, you can create a histogram by selecting options from the menu as follows:

Graphics → Histogram

In the histogram dialog box that opens, you will see a text box labelled "Variable" in the upper left-hand corner. Use the drop-down menu to select MNTLHLTH from the list of variables as shown in Figure 1. To the right of the "Variable" box, you will see two buttons asking you to specify whether the data are discrete or continuous. Ensure that the "Data are continuous" option has been selected. In the lower right-hand corner under "Y axis," select "Frequency." Click Submit to perform the analysis.

Figure 1: Creating a Histogram From the Graphics Menu in Stata.

Figure

To explore the skew of the distribution, we can compile summary statistics using the summarize command, followed by the variables of interest. Enter the following command in the Stata Command window:

summarize MNTLHLTH, detail

Press Enter to produce summary statistics detailing the number of observations, mean, standard deviation, skewness, and other information for the variable.
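If only one or two of these figures are needed, they can be read back from the results that summarize stores after it runs. The following is a minimal sketch, assuming the data file is open and the variable is stored under the name MNTLHLTH (the spelling used in the generate and regress commands in this guide):

```stata
* summarize with the detail option stores its results in r()
quietly summarize MNTLHLTH, detail

* Display selected stored results
display "Skewness: " r(skewness)
display "Mean: " r(mean) "   SD: " r(sd)
```

The quietly prefix only suppresses the full table; the stored results are the same as those printed by the command above.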

Alternatively, you can achieve the same results by selecting the following options from the menu:

Statistics → Summaries, tables and tests → Summary and descriptive statistics → Summary statistics

In the dialog box that opens, you will see a text box labelled "Variables: (leave empty for all variables)" at the top. Use the drop-down menu to select MNTLHLTH from the list of variables as shown in Figure 2.

Below it, in the "Options" section, check "Display additional statistics."

Figure 2: Producing Descriptive Statistics From the Statistics Menu in Stata.

Figure

Click OK to produce the summary statistics.

To address the right skew evident in the days of poor mental health variable, we log transform the variable using the following command:

generate logMNTLHLTH = log(MNTLHLTH)

Press Enter to produce a new variable logMNTLHLTH.
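One point worth checking before relying on the new variable: Stata's log() function returns a missing value for zero or negative inputs, so any respondent recorded with 0 days would be dropped from later analyses of the logged variable. The sketch below only counts such cases. Whether the example dataset contains any is not stated in this guide, so treat this as an optional check rather than part of the original procedure.

```stata
* Count cases for which the log is undefined (0 or negative days)
count if MNTLHLTH <= 0

* Count cases that are missing on the logged variable but not on the original
count if missing(logMNTLHLTH) & !missing(MNTLHLTH)
```

The two counts should match, because the only way a non-missing value can become missing here is by being zero or negative.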

Alternatively, we can transform the variable using the menu options. Select the following:

Data → Create or change data → Create new variable

In the "generate - Create a new variable" dialog box that opens, enter the name of the variable we are creating (logMNTLHLTH) in the "Variable name" text box at the top right as shown in Figure 3.

Figure 3: Creating a New Log-Transformed Variable From the Data Menu in Stata.

Figure

Click on Create. This opens an expression builder. From the options in the "Category" box, expand "Functions." Click on "Mathematical" and a list of options will appear in the box to the right. Select "log()" and double-click on it. "log(x)" will now appear in the box above. Replace x with the variable name. Figure 4 shows what this looks like in Stata.

Figure 4: Specifying Values for a New Log-Transformed Variable From the Data Menu in Stata.

Figure

Click OK to return to the previous dialog box where you should see details of the variable you are creating in the text box labelled "Specify a value or an expression."

Press Submit to create the new variable.

Produce a histogram of the new log-transformed variable logMNTLHLTH and summary statistics, following the same procedures as before.
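In command form, this step might look as follows. This is a sketch that reuses the histogram and summarize commands from earlier, with the frequency option added to match the "Frequency" choice made in the dialog box:

```stata
* Histogram of the log-transformed variable, with counts on the y axis
histogram logMNTLHLTH, frequency

* Detailed summary statistics, including skewness
summarize logMNTLHLTH, detail
```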

Next, we inspect the variable for highest year of father's education. Once again, create a histogram and summary statistics for this variable, following the instructions given before.

This time, to address the left skew of the father's highest school year, we create a new variable that is the square of the education variable. Return to the "generate - Create a new variable" dialog box, which should still be open. Clear the contents of the text boxes.

Call the new squared variable squPAEDUC in the "Variable name:" box. In the "Specify a value or an expression" text box, write "PAEDUC*PAEDUC" which is the variable multiplied by itself as shown in Figure 5.

Figure 5: Creating a New Squared Variable From the Data Menu in Stata.

Figure

Press Submit to create the new variable.

Alternatively, you can enter the following command in the Stata Command window:

generate squPAEDUC = PAEDUC*PAEDUC

Press Enter to produce the new variable.

Use the previous instructions to inspect the distribution of the new squared variable.
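As a sketch of this inspection step, the commands below first confirm that the power operator gives the same values as multiplying the variable by itself, then produce the histogram and summary statistics. The name squPAEDUC_check is a hypothetical temporary variable introduced only for this illustration.

```stata
* The ^ operator is an equivalent way to square a variable
generate squPAEDUC_check = PAEDUC^2
assert squPAEDUC_check == squPAEDUC
drop squPAEDUC_check

* Inspect the distribution of the squared variable
histogram squPAEDUC, frequency
summarize squPAEDUC, detail
```

If the assert line produces no output, the two versions are identical for every observation.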

Finally, we look at how to standardise the age variable in two different ways.

Again, you should first explore the distribution of the original age variable by producing a histogram and detailed summary statistics. We will use this information to convert the age variable to a z-score with a mean of 0 and a standard deviation of 1. For a z-score, the mean (in this case, 41.89) is subtracted from each value and the result divided by the standard deviation (12.87).

To transform the variable into a z-score, we return again to the "generate - Create a new variable" dialog box and clear the values. Call the new variable zAge. In the "Specify a value or an expression" text box, write "(AGE-41.89)/12.87" as shown in Figure 6.

Figure 6: Creating a New z-Score Variable From the Data Menu in Stata.

Figure

You can also create the variable by writing the following command in the Command window:

generate zAge = (AGE-41.89)/12.87

Inspect a histogram of the standardised age score using the now familiar procedure.
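Typing the mean and standard deviation by hand works, but the rounded values mean the result is only approximately standardised. As an alternative sketch (not the procedure used in this guide), the stored results from summarize or the std() function of egen can do the same job. The variable names zAge_r and zAge_egen are hypothetical and chosen only for this illustration.

```stata
* Use the stored mean and SD rather than typing 41.89 and 12.87
quietly summarize AGE
generate zAge_r = (AGE - r(mean)) / r(sd)

* egen's std() function performs the same standardisation in one step
egen zAge_egen = std(AGE)

* Both versions should show a mean of about 0 and an SD of about 1
summarize zAge_r zAge_egen
```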

We now turn to a different way of standardising a variable, transforming age again so that the values run from 0 to 1. We do so by subtracting the minimum age from each value of age and dividing that by the difference between the oldest and youngest ages (the range). Call the new variable age01 and enter the following in the value box: "(AGE-18)/(81-18)" as shown in Figure 7.

Figure 7: Creating a Standardised Variable From the Data Menu in Stata.

Figure

Alternatively, you can write the command directly into the Command window as follows:

generate age01 = (AGE-18)/(81-18)

Again, you should inspect a histogram of the new age score.
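The minimum and maximum can also be taken from the stored results of summarize instead of being typed in. A minimal sketch, assuming the age variable is stored as AGE; the name age01_r is hypothetical and used here only to avoid overwriting age01:

```stata
* r(min) and r(max) hold the smallest and largest observed ages
quietly summarize AGE
generate age01_r = (AGE - r(min)) / (r(max) - r(min))

* The new variable should run from 0 to 1
summarize age01_r
```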

Having transformed and standardised our variables for analysis, we move on to estimating a multiple regression of number of days of poor mental health on age and sex. This can be done in Stata by entering the regress command in the Command window, followed by the dependent variable MNTLHLTH, then the independent variables FEMALE and AGE. The command is as follows:

regress MNTLHLTH FEMALE AGE

Press Enter to run the analysis.

The model can also be estimated by using the menu options as follows:

Statistics → Linear models and related → Linear regression

In the "regress - Linear regression" dialog box that opens, two text boxes are provided for you to specify the dependent and independent variables to be included in the model. In the "Dependent variable" box, select MNTLHLTH from the drop-down menu. In the "Independent variables" text box, select FEMALE and AGE.

Once you are done, click OK to perform the analysis.

Figure 8 shows what the dialog box looks like in Stata.

Figure 8: Selecting Multiple Regression From the Statistics Menu in Stata.

Figure

To inspect the residuals, we create a new variable of the residuals of the model, which we are calling resid1. We can then plot a histogram of the residuals. Enter the following commands in the Stata Command window:

predict resid1, residuals

histogram resid1, frequency normal

To do the same using the menu options, select the following from the menu:

Statistics → Postestimation

From the "Postestimation Selector" options, expand "Predictions," highlight "Predictions and their SEs, leverage statistics, distance statistics, etc." and press Launch. In the "predict - Prediction after estimation" dialog box that opens, write "resid1" in the "New variable name:" text box at the top left. Check "Residuals" below it, as shown in Figure 9, and press Submit to create the new variable.

Figure 9: Saving Model Residuals From the Postestimation Options in Stata.

Figure

Next, use the "histogram - Histograms for continuous and categorical variables" dialog box as before to create the histogram, but this time select the "Density plots" tab along the top. Tick "Add normal-density plot" as shown in Figure 10 and then press Submit to create the histogram with an added distribution curve.

Figure 10: Adding a Distribution Curve to a Histogram in Stata.

Figure

Noting that the residuals are skewed, we run the regression again, using the log-transformed mental health variable and the standardised age variable. The command to produce a second model is:

regress logMNTLHLTH FEMALE age01

Alternatively, you can use the menu options as before, replacing MNTLHLTH with logMNTLHLTH as the dependent variable and AGE with age01 as the independent variable as shown in Figure 11.

Figure 11: Selecting Multiple Regression Using Transformed Variables, From the Statistics Menu in Stata.

Figure

Save the residuals of the second model, calling them resid2. (Note that Stata will save residuals from the most recent model you have run.) Plot a histogram of the residuals of the second model, adding a curve as before.
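Put together, the second-model steps might be entered as follows. This is a sketch that mirrors the commands used for the first model; it assumes resid1 already exists and that the sex variable is stored as FEMALE. The final summarize line is an optional extra, not part of the original procedure.

```stata
* Fit the second model and save its residuals
regress logMNTLHLTH FEMALE age01
predict resid2, residuals

* Histogram of the residuals with a normal curve overlaid
histogram resid2, frequency normal

* Optional numeric check: compare the skewness of the two sets of residuals
summarize resid1 resid2, detail
```

Because predict works from the most recently fitted model, the regress command must be run immediately before predict for resid2 to belong to the second model.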

2.2 Exploring the Stata Output

Figures 12 and 13 show the distribution and descriptive statistics of a variable which asks how many days the respondent has experienced poor mental health out of the last 30.

Figure 12: Histogram of Untransformed Mental Health Report Variable.

Figure

Figure 13: Descriptive Statistics for Untransformed Mental Health Report Variable.

Figure

The histogram looks right skewed, with a long tail of small numbers of respondents who report many days of poor health, albeit with heaping at 10, 15, 20, and 30 days. The skewness is 1.7.

Figure 14 shows the same variable after it has been log transformed. The skewness is now .23 (as shown in Figure 15), and the distribution, accordingly, looks a little closer to normal, although it is not by any means perfect.

Figure 14: Histogram of Log-Transformed Mental Health Report Variable.

Figure

Figure 15: Descriptive Statistics for Log-Transformed Mental Health Report Variable.

Figure

Figure 16 shows the distribution of another variable that records the highest school year completed by the respondent's father, and Figure 17 shows descriptive statistics for the same variable.

Figure 16: Histogram of Highest Year of School Completed Variable.

Figure

Figure 17: Descriptives of Highest Year of School Completed Variable.

Figure

This time the histogram looks somewhat left skewed in that there are relatively few fathers who completed a low number of school years and more who cluster around the 12–20 range. The skewness is −.60.

Following a squared transformation of the variable, Figure 18 shows the transformed distribution, which now has a skewness of .49 (as seen in Figure 19).

Figure 18: Histogram of Squared Transformed Father's Highest Year of School Completed Variable.

Figure

Figure 19: Descriptives of Squared Transformed Father's Highest Year of School Completed Variable.

Figure

The distribution looks more symmetrical than the untransformed variable, although the gains in this example are probably marginal, as the negative skew in the original variable is not particularly extreme.

Figure 20 shows the distribution of age in years for respondents in the GSS sample.

Figure 20: Histogram of Age Variable.

Figure

Figure 21 shows the same variable, transformed into a z-score. Unlike the previous transformations shown, the shape of the transformed distribution is the same as in the original variable, but the range is expressed in standard deviation units and the mean is zero. So, it is possible to see that, for example, someone aged 80 is about three standard deviations above the mean, which implies that they are in approximately the oldest 1% of the population, assuming population age follows a roughly normal distribution.
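The figure for an 80-year-old can be checked by hand with the display command. The sketch below uses the mean (41.89) and standard deviation (12.87) reported earlier, and Stata's normal() function, which returns the cumulative standard normal probability:

```stata
* z-score for an 80-year-old, using the mean and SD reported above
display (80 - 41.89) / 12.87

* Share of a standard normal distribution lying above that z-score
display 1 - normal((80 - 41.89) / 12.87)
```

The first line gives a z-score of about 2.96. Under the normality assumption, the second line gives an upper-tail share of well under 1%.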

Figure 21: Histogram of Age Variable as a z-Score.

Figure

Figure 22 shows the age variable transformed so that the maximum value is 1 and the minimum is 0. Once again, the shape of the distribution is unchanged, but the mean and range are different.

Figure 22: Histogram of Age Variable on a Scale of 0–1.

Figure

Figure 23 shows parameter estimates for the first, untransformed regression model.

Figure 23: Regression Output – Untransformed Variables.

Figure

Older individuals are less likely to report mental health problems, as are women. For each year older a respondent is, they report poor mental health on .03 of a day less on average. Women report poor health on .88 fewer days than men. However, neither of the predictors is statistically significant at the .05 level. Figure 24 shows the standardised residuals resulting from fitting the model. It can be seen that they are not normally distributed, with marked right skew, and therefore do not meet the assumptions necessary for ordinary least squares (OLS) regression.

Figure 24: Standardised Residuals From Model 1.

Figure

In the second model, we replaced the dependent variable with the log-transformed version and the age variable with the 0–1 transformed version, shown in Figure 25. The coefficient for age is now statistically significant, while that for female remains insignificant. The interpretation of coefficients now is in terms of proportional change in days of poor health, rather than absolute level, for a one-unit change in the independent variables. A one-unit change in age now represents going from the minimum age in the sample (18) to the maximum (81). Therefore, the interpretation of the age coefficient −.55 is that an 81-year-old would be expected to report poor health on 55% fewer days in a month than an 18-year-old. A 0–1 change in the variable denoting female suggests that women report 7% fewer days in poor mental health compared to men, but this is not statistically significant.
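Reading a coefficient on a logged outcome directly as a percentage is an approximation that works best when the coefficient is close to zero. A commonly used exact conversion is 100 × (exp(b) − 1), where b is the coefficient. As a sketch, using the age coefficient quoted above:

```stata
* Exact percentage change implied by a coefficient of -.55 on a logged outcome
display 100 * (exp(-0.55) - 1)
```

This returns roughly −42, so the 55% figure in the text is best read as an approximate rather than an exact percentage difference.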

Figure 25: Regression Output – Transformed Variables.

Figure

The residuals from this model, shown in Figure 26, are much closer to a standard normal distribution than in the previous model and are therefore more closely approximating the necessary assumptions for OLS regression.

Figure 26: Standardised Residuals From Model 2.

Figure

Your Turn

You can download this sample dataset along with a guide showing how to carry out the procedures using statistical software. The sample dataset also includes two additional variables, WORDSUM and EMAILHR. They measure vocabulary ability and number of hours per week spent doing emails, respectively. See whether you can reproduce the transformations in this guide, matching the distributions to the appropriate transformations, and then carry out the same OLS regression, substituting WORDSUM for AGE and EMAILHR for MNTLHLTH.

Source: https://methods.sagepub.com/dataset/howtoguide/rescaling-transforming-in-gss-2016-stata

Posted by: wisegion1993.blogspot.com
