Reading data in Stata
When you load data into Stata, it is copied into the computer's memory (RAM). Any changes you make affect only that in-memory copy and are not permanent unless you save them explicitly, so unsaved changes are lost when you close the Stata session. You can read data into Stata in several ways. One of the most effective is the use command:
use "E:\Stata1\t1 less India pwt 80-2010.dta", clear
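The clear option at the end of the command tells Stata to discard the dataset currently in memory before it opens another data file. Without it, Stata refuses to load the new file if the data in memory has unsaved changes.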
To load only some of the variables in the dataset, list them before using:
use country year using "t1 less India pwt 80-2010.dta", clear
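The following is a minimal sketch of this behavior. It assumes the auto example dataset that ships with Stata rather than the file above; the file name auto_thousands.dta and the variable price_k are made up for illustration.

```stata
* Load a built-in example dataset, discarding whatever is in memory
sysuse auto, clear

* This new variable exists only in the in-memory copy ...
generate price_k = price / 1000

* ... until the dataset is written to disk explicitly
save "auto_thousands.dta", replace

* Reload only two variables from the saved file
use make price_k using "auto_thousands.dta", clear
describe
```

If the save line is skipped, price_k is gone the next time the original dataset is opened.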
Insheet
The insheet command reads plain-text files rather than native Excel workbooks, so data held in Excel has to be converted first. Save the sheet in one of the following formats:
- CSV (comma separated values)
- Text (where the delimiter is a tab or comma)
Keep the following rules in mind when preparing a file for Stata:
- The first row of the file should contain the variable names or headers (series, codes, or names), and the data must start in the second row. Remove any title rows above the variable names before saving the file.
- Stata reads everything in the file, so delete any additional lines below or to the right of the data, for example, footnotes or endnotes, before saving it. If necessary, delete the entire rows at the bottom or the columns on the right-hand side.
- Variable names must not begin with a number. This causes a problem when the file is arranged with years (1980, 1985) in the top row. In such cases, place an underscore before each number by selecting the row and using the find and replace tool of the spreadsheet package; for example, 1980 becomes _1980, and so on.
- Most importantly, delete commas from the data values, because in a comma-delimited file Stata cannot tell where one column ends and the next begins. You can also do this with find and replace.
- Notations such as double dots (..) or hyphens (-) cause problems: Stata reads a single dot (.) as a missing value, but it reads double dots or hyphens as text, which turns the whole column into a string variable.
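If a file has already been imported with some of these problems, a similar cleanup can be done inside Stata. The sketch below is one possible approach, not the only one; the country names and values are made up, and the data is typed in with input so that the example is self-contained.

```stata
* A made-up column that arrived as text because of "..", "-" and a comma
clear
input str10 country str8 gdp
"India"  "1,234.5"
"Brazil" ".."
"Chile"  "-"
end

* Turn the placeholder notations into empty strings, which become missing
replace gdp = "" if inlist(gdp, "..", "-")

* Strip the thousands separator and convert the column to numeric
destring gdp, replace ignore(",")
list
```

Cleaning the spreadsheet before saving it, as described in the list above, avoids the need for this step.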
After saving the data as a CSV or text file, you can read it into Stata, as shown in the following code snippet:
insheet using "E:\Stata1\t1 less India pwt 80-2010.txt", clear
If you have already changed the working directory to that folder with the cd command, you can omit the path:
insheet using "t1 less India pwt 80-2010.txt", clear
Several options are available for the insheet command. Options modify the behavior of a standard command; they are added at the end of the command, after a comma. The following are some of the options used with insheet:
- The clear option: This lets insheet load a new file regardless of the data currently in memory:
insheet using "E:\Stata1\t1 less India pwt 80-2010.txt", clear
- The names option: This tells Stata that the first row of the file contains the variable names. Stata usually detects this automatically; if the detection fails, specify the option explicitly, as in this example:
insheet using "E:\Stata1 classes\t1 less India pwt 80-2010.txt", names clear
- The delimiter option: This tells insheet which character separates the values. Stata recognizes tab-delimited and comma-delimited data automatically, but datasets often use other delimiters, such as a semicolon (;). Here is an example:
insheet using "E:\Ind-samp.txt", delimiter(";")
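These options can be tried on a small file without touching real data. The sketch below assumes Stata's bundled auto dataset and a hypothetical scratch file name, auto_sample.txt. The last command uses import delimited, which superseded insheet from Stata 13 onward; insheet still works in later versions.

```stata
* Build a tiny semicolon-delimited file to experiment with
sysuse auto, clear
keep make price mpg
keep in 1/5
outsheet using "auto_sample.txt", delimiter(";") replace

* Read it back, naming the delimiter and the header row explicitly
insheet using "auto_sample.txt", delimiter(";") names clear
list

* Roughly equivalent call in Stata 13 or later
import delimited using "auto_sample.txt", delimiters(";") varnames(1) clear
list
```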
Infix
In addition to insheet, you can use the infix command, which reads fixed-format data, as shown in this section.
Most datasets today are distributed as CSV or tab-delimited files, but older data is often still stored in fixed-width ASCII format. Take the example of a survey carried out by the government. The following are two lines from 2010:
10862226023331 06 022 3 02220155500666600777000003331
10001222228332 06 022 3 02555553006666000000000044441
A codebook or data dictionary, usually supplied as a PDF or text file, explains the layout of the data. Here, it tells us that the first two digits identify the row type, the next two digits are the survey year (2010 for the previously mentioned dataset), and the fifth digit is the quarter of the interview (the first quarter in this case), among other things. The infix command reads this type of data and takes the column positions of each variable from the codebook. The following is an example:
infix rowtype 1-2 yr 3-4 quart 5 […] using "E:\Stata1\Survey2010.dat", clear
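To see how the column ranges map onto variables, the following sketch writes two made-up five-character records to a scratch file and reads them back. The file name toy_survey.dat and the record values are assumptions for illustration only; they are not taken from the survey.

```stata
* Write two made-up fixed-width records to a scratch file
file open raw using "toy_survey.dat", write text replace
file write raw "10101" _n
file write raw "10103" _n
file close raw

* Columns 1-2, 3-4 and 5 become three numeric variables
infix rowtype 1-2 yr 3-4 quart 5 using "toy_survey.dat", clear
list
```

Each record is split purely by character position, so no delimiter is needed between the fields.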
When there are many files to read, it is more convenient to use a dictionary file, which stores the codebook information in a separate file. The file looks as follows:
infix dictionary using Survey2010.dat {
    rowtype 1-2
    yr 3-4
    quart 5
    […]
}
After saving this file as Survey2010.dct, run the infix command on it. Because the dictionary file refers to the raw data (Survey2010.dat) by a relative path, it is assumed that the raw data is in the same folder as the dictionary file. In that case, the infix command does not need to refer to the data file at all. The command looks like this:
infix using "E:\Stata1\Survey2010.dct", clear
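The dictionary workflow can be tried end to end with two scratch files. This is a sketch under stated assumptions: toy_survey.dat and toy_survey.dct are hypothetical names, the records are made up, and both files are written to the current working directory.

```stata
* Write two made-up fixed-width records to a scratch data file
file open raw using "toy_survey.dat", write text replace
file write raw "10101" _n "10103" _n
file close raw

* Write a matching dictionary file with the same layout as shown earlier
file open dct using "toy_survey.dct", write text replace
file write dct "infix dictionary using toy_survey.dat {" _n
file write dct "    rowtype 1-2" _n
file write dct "    yr 3-4" _n
file write dct "    quart 5" _n
file write dct "}" _n
file close dct

* The data file is named inside the dictionary, so infix needs only the .dct
infix using "toy_survey.dct", clear
list
```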
Writing a dictionary file by hand is a tedious job. However, NHIS provides a dictionary that can be read by SAS; the resulting data can then be converted to Stata format using the Stat/Transfer program.
The Stat/Transfer program
This program converts datasets between the formats of widely used packages, such as SAS, R, SPSS, Excel, and so on. Examine the data thoroughly before converting it. Because it is an extremely user-friendly tool, it can be used to move data between various packages as well as formats.
Manual typing or copy and paste
Typing or copying and pasting works the same way as in other programs, but here it is done through the Stata editor. Just select the required data columns in Excel and paste them into the Stata editor. However, this approach has drawbacks: inaccurate entries and missing values are not handled by any fixed procedure, and in certain cases regional settings cause problems. For example, some countries use a comma instead of a decimal point.
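When pasted numbers use a comma as the decimal mark, Stata typically stores the column as text. One possible fix is the dpcomma option of destring, sketched here with made-up values typed in through input:

```stata
* Made-up values with a decimal comma, stored as strings
clear
input str8 price
"12,5"
"7,25"
end

* Convert to numeric, treating the comma as the decimal point
destring price, replace dpcomma
list
```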
Typing data by hand is tedious, but it is unavoidable when the data is not available electronically. Stata makes this job easier through the edit command, which opens a spreadsheet-like editor where new data can be entered and old data can be edited.