You're reading from Data Analysis with STATA Explore the big data field and learn how to perform data analytics and predictive modelling in STATA

Product type Paperback

Published in Oct 2015

Publisher Packt

ISBN-13 9781782173175

Length 176 pages

Edition 1st Edition

Reading data in Stata

When Stata reads a dataset, it copies the data into the computer's memory (RAM). Any changes you make apply only to this in-memory copy; unless you save them, they are lost when you close the Stata session. You can load data into Stata in several ways. One of the most effective ways is as follows:

use "E:\Stata1\t1 less India pwt 80-2010.dta", clear

The clear option at the end of the command tells Stata to discard the dataset currently in memory before it opens the new data file.

To load only some of the variables in the dataset, list them before using, as follows:

use country year using "t1 less India pwt 80-2010.dta", clear
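
Because changes live only in memory, it can help to save a copy once the data is loaded. The following is a minimal sketch that assumes the working directory contains the dataset; the output file name t1_subset.dta is hypothetical.

```stata
* Load two variables, inspect them, and write a permanent copy to disk.
use country year using "t1 less India pwt 80-2010.dta", clear
describe

* t1_subset.dta is a hypothetical name; replace overwrites an existing file.
save "t1_subset.dta", replace
```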

Insheet

The insheet command reads delimited text files, not native Excel workbooks. Data held in a spreadsheet such as Excel must therefore first be saved in one of the following formats:

CSV (comma-separated values)
Text (where the delimiter is a tab or comma)

You need to follow certain rules when preparing the file for Stata:

If the first row of the Excel file contains the variable names or headers (series/code/names), the data must begin in the second row. Any title rows above the header row must be removed before saving the file.
Stata reads every cell in the file; therefore, any additional lines below or to the right of the data, for example, footnotes or endnotes, should be deleted before saving. If necessary, delete the entire rows below the data or the columns to its right.
Variable names must not begin with a number. A problem can therefore occur when the file is arranged with years (1980, 1985) in the top row. In such cases, place an underscore before each number, which can be done by selecting the row in the spreadsheet package and using its find-and-replace tool; for example, 1980 becomes _1980, and so on.
Most importantly, remove commas from the data values, because in a comma-delimited file Stata cannot tell where one column ends and the next begins. You can do this with the spreadsheet's find-and-replace tool.
Notations such as double dots (..) or hyphens (-) can cause problems: Stata reads a single dot (.) as a missing value, but it reads double dots or hyphens as text, so the whole column is imported as a string variable.
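
If a column has already been imported as text because of such notations, it can also be repaired inside Stata. The following is a minimal sketch on a toy dataset; the variable name gdp and its values are hypothetical and only illustrate the technique.

```stata
* Build a small toy dataset with a string column (hypothetical values).
clear
input str10 gdp
"1,234"
".."
"567"
"-"
end

* Turn the non-numeric placeholders into empty strings.
replace gdp = "" if inlist(gdp, "..", "-")

* Convert to numeric, dropping the comma used as a thousands separator.
destring gdp, replace ignore(",")
list
```

After destring, gdp is numeric, and the rows that held .. or - contain Stata's missing value.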

After saving the data in the CSV or text format, it can be read into Stata, as shown in the following code snippet:

insheet using "E:\Stata1\t1 less India pwt 80-2010.txt", clear

If the working directory has already been changed to that folder with the cd command, the file can be read by name alone, as follows:

insheet using "t1 less India pwt 80-2010.txt", clear
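
For completeness, here is a short sketch of the whole sequence, assuming the text file sits in the E:\Stata1 folder used earlier.

```stata
* Change the working directory, confirm it, then read the file by name.
cd "E:\Stata1"
pwd
insheet using "t1 less India pwt 80-2010.txt", clear
```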

The insheet command accepts several options. Options modify the behavior of a standard command; they are added at the end of the command, after a comma. The following are some of the options available for insheet:

The clear option: This lets insheet load a new file even when another dataset is already in memory: insheet using "E:\Stata1\t1 less India pwt 80-2010.txt", clear
The names option: This tells Stata that the first row of the file contains the variable names. Stata usually detects this automatically; in cases where the detection fails, specify names explicitly so that Stata uses the first row as variable names. An example is as follows:
```
insheet using "E:\Stata1 classes\t1 less India pwt 80-2010.txt", names clear
```
The delimiter option: This tells insheet which character separates the values in the file. Stata recognizes tab-delimited and comma-delimited data automatically, but datasets sometimes use other delimiters, such as a semicolon (;). Here is an example:
```
insheet using "E:\Ind-samp.txt", delimiter(";")
```
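
In Stata 13 and later, insheet still works but has been superseded by import delimited. As a hedged sketch, the semicolon example might be written as follows with the newer command; the varnames(1) option assumes that the first row of this file holds the variable names, which the source does not state.

```stata
* Newer equivalent of insheet (Stata 13+).
* varnames(1) is an assumption: it treats row 1 as the variable names.
import delimited using "E:\Ind-samp.txt", delimiters(";") varnames(1) clear
describe
```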

Infix

Besides insheet, you can use the infix command, which reads fixed-format text data.

Most datasets today are distributed in the CSV or tab-delimited format, but older data is often still stored as fixed-width ASCII text. Let's take the example of a survey conducted by the government. The following are two lines from the 2010 file:

      10862226023331    06 022  3  02220155500666600777000003331
      10001222228332    06 022  3  02555553006666000000000044441

A codebook or data dictionary, which usually comes as a PDF or text file, explains how to interpret the data. It tells us that the first two digits are the row ID, the next two digits are the survey year (2010 in the previously mentioned dataset), and the fifth digit is the quarter of the interview (the first quarter in this case), among other things. The infix command is required to read this type of data; you pass it the column positions given in the codebook. The following is an example:

infix rowtype 1-2 yr 3-4 quart 5 […] using "E:\Stata1\Survey2010.dat", clear

To avoid retyping the column positions in every command, a dictionary file can be used; it stores the codebook information in a separate file. The file looks as follows:

infix dictionary using Survey2010.dat {
  rowtype  1-2
  yr  3-4
  quart  5
  […]
}

After saving this file as Survey2010.dct, run the infix command on it. As the dictionary file refers to the raw data (Survey2010.dat) by a relative path, the raw data is assumed to be in the same folder as the dictionary file. This being the case, the command does not need to name the data file. The command will look like this:

infix using "E:\Stata1\Survey2010.dct", clear
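
After reading fixed-format data, it is worth checking the result against the codebook. The following is a minimal sketch that assumes the working directory contains both Survey2010.dct and the raw data file, and that the dictionary defines the variables rowtype, yr, and quart as described above.

```stata
* Read the data through the dictionary, then inspect the first records.
infix using "Survey2010.dct", clear
describe
list rowtype yr quart in 1/2
```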

Writing a dictionary file correctly is a tedious job. However, NHIS supplies a dictionary that can be read by the SAS program; the resulting data can then be converted to the Stata format using the Stat/Transfer program.

The Stat/Transfer program

This program converts datasets between the formats used by common statistical and data packages, such as SAS, R, SPSS, Excel, and so on. Before converting, the data should be examined thoroughly. As it is an extremely user-friendly tool, it can be used to move data between various packages as well as formats.

Manual typing or copy and paste

Typing or copying and pasting works as in other programs, but here it is done through the Stata Data Editor. Just select the required data columns in Excel and paste them into the editor. However, this approach has some drawbacks: there is no fixed procedure for handling inaccurate entries or missing values, and in certain cases, regional settings cause problems. For example, in some countries, a comma is used as the decimal separator instead of a point.

Typing data by hand is laborious, but it is unavoidable when no electronic version of the data exists. Stata makes this job easier through the edit command, which opens a spreadsheet-like editor where new data can be entered and existing data can be edited.
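
As a brief sketch, the editor can also be opened on selected variables and observations only. The example assumes the auto dataset that ships with Stata; browse opens the same view in read-only mode.

```stata
* Load an example dataset shipped with Stata.
sysuse auto, clear

* Open the Data Editor for two variables and the first ten observations.
edit make price in 1/10

* Open the same data read-only, so nothing can be changed by accident.
browse make price in 1/10
```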
