Video
Transcript
If you have RStudio open on your computer, and if you've opened the script that I sent you over e-mail, you should have something that looks like this. So reminder: anything in green text with that hashtag or pound symbol at the beginning, that is what we call a “comment” in RStudio; so RStudio does not actually do anything with that, it's there for you as a human to read. The computer pretty much ignores it. So there's lots of green text, you'll notice it looks like a longer file than last time, we're at about 100 lines, but most of it is green text because it's just going to be me speaking to you and giving you some teaching stuff. I've written it down in the script for you so if you wanted to come back to it at your own time, you've got all the information. RStudio will read the black text; so if we scroll down a little bit, our first black text is actually on line 23 so we've got a little ways to go.
So welcome everyone to Lesson 2; this is our introduction to statistics. Today, we will be talking a little bit about descriptive statistics and a little bit about inferential statistics, because most of the rest of the workshop series is about inferential statistics. The first thing we generally need to talk about when we talk about doing a statistic is your data type, because – this is really important so I've added some stars – the type of data you have generally determines the appropriate statistic or statistics. So the important thing we're going to review first is data types! What are the different data types that we have in R and that we might be using throughout the rest of this workshop series? So some common data types…we have on line 15, we talk about categorical data; this generally has distinct groups (for example, your favorite colour might be blue, or it might be purple). That's categorical. Continuous data often has a range of values (for example, if you talk about the temperature in degrees Celsius, this has a large range of values). That's what we normally call continuous data. It's important to ensure you have the right data type assigned in R for each of your variables, because if you have the wrong data type assigned, R might let you do things you shouldn't be allowed to do, or it might not let you do things that you should be allowed to do, based on the data type that you have. So your analysis might not work correctly if you don't have your data type set properly. The first thing we're going to do today is check our data type.
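As a rough illustration of why the data type matters, here is a minimal sketch in base R. The variable names and values are made up for this example and are not taken from the workshop file.

```r
# Hypothetical example values, not taken from the workshop file.
favourite_colour <- c("blue", "purple", "blue")  # categorical: distinct groups
temperature_c    <- c(21.5, 18.2, 25.0)          # continuous: a range of values

class(favourite_colour)  # "character"
class(temperature_c)     # "numeric"

# A category stored as numeric codes: R will happily average it,
# even though an "average colour" means nothing.
colour_code <- c(1, 2, 1)
mean(colour_code)

# Once the codes are stored as text, mean() gives NA (with a warning)
# instead of a misleading number.
colour_text <- as.character(colour_code)
suppressWarnings(mean(colour_text))
```

This is the "R might let you do things you shouldn't be allowed to do" situation: the numeric codes can be averaged, while the text version cannot.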
You might notice our global environment is empty; if you've just opened RStudio for the day, let's say you came to the workshop two weeks ago, you've closed your RStudio since then, your global environment is empty. We don't actually have any data to work with right now because it's a new R session. So what we have to do is first set the working directory, and second open the file. We're going to start with those steps first. So if you were here two weeks ago, you remember that we set our working directory by clicking the “Session” button at the top of the screen, clicking the “Set Working Directory” button, and then clicking the “Choose Directory” button. So we're going to do the point-and-click method so we can copy-and-paste the code for our working directory. So we click Session > Set Working Directory > Choose Directory. When we do that, it will open your File Explorer. You need to navigate through your File Explorer the way you normally would to find your file. So if you downloaded both the R script and the fake data file from me from e-mail this morning, and you've put them somewhere on your computer, you are telling the computer “where is that file”. So in my computer system, I'm in Documents > Classes and Workshops > Lindsay's workshops > micro workshops > RStudio, I don't see any files actually here in my File Explorer, but I know this is the correct location. So even if it doesn't show you anything, if you know you're in the right spot, you can click “Open”. What should happen is in your bottom right-hand corner, in your Files window, you should have a list of everything that's in that folder that you just selected. So I have micro workshop 1, micro workshop 2, micro workshop 3, our fake data files. You should also get some blue text in the bottom left, in your Console window, and what you need to do is copy this blue text. It should start with setwd() or set working directory. You want to copy this blue text and paste it on line 23.
This will tell the computer where you are working out of on your computer, which is different from my computer; we have different file structures. So if that worked, you should now have on line 23: setwd, and then left bracket to say it's a function, and then quotation marks, and the exact file path of where your files are stored [setwd("YOURFILEPATH")]. If this didn't work for you, feel free to toss something in the chat and I'll help you out.
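The working-directory step can be sketched as follows. This is only an illustration: tempdir() stands in for your own folder so the code runs on any computer, and old_dir is a name introduced here, not something from the workshop script.

```r
# Remember where R is currently looking so we can go back afterwards.
old_dir <- getwd()

# tempdir() is only a stand-in so this sketch runs anywhere;
# in the workshop you would paste your own path, e.g. setwd("YOURFILEPATH").
setwd(tempdir())

getwd()       # prints the folder R is now working out of
list.files()  # lists the files R can see in that folder

# Return to the original folder.
setwd(old_dir)
```

Running getwd() right after setwd() is a quick way to confirm the path was accepted before you try to open a file.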
So I said two steps: step one is setting the working directory, telling the computer where your file is located; step two is opening the file. So I sent you Fake_Data.csv over e-mail. So in the location in your bottom-right Files window, you should see Fake_Data.csv. And to open that, on line 25 – if you want to call it the same thing I called mine – I have called mine Fake_Data. And I have set that equal to the function read.csv, left bracket denotes that it's a function and then in quotations the exact name of the file with the file extension. So Fake_Data.csv, end quotation mark, end parenthesis [read.csv("Fake_Data.csv")].
This function read.csv("Fake_Data.csv") will actually help us open the file in R so we can do work with it, so we can look at it and do things. So if we highlight line 25, we can click the “Run” button. And what this will do is, if we have typed the name exactly correctly, and if we have set our working directory to the right location, in the global environment in the top-right, we should now have data that is called Fake_Data. It has 30 observations of 7 variables.
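The open-a-file step follows the same pattern with any CSV. The sketch below writes a tiny stand-in file first so it can run without the workshop's Fake_Data.csv; the file name Demo_Data.csv and its values are invented for illustration.

```r
# Stand-in file so the sketch runs without the workshop's Fake_Data.csv.
# The column values here are made up for illustration.
demo_path <- file.path(tempdir(), "Demo_Data.csv")
write.csv(
  data.frame(Gender = c(0, 1, 1), Fake_Data1 = c(1.5, 2.25, 3.75)),
  demo_path,
  row.names = FALSE
)

# Confirm the file exists before trying to open it.
file.exists(demo_path)

# Same pattern as the lesson: a name, then read.csv() with the file name in quotes.
Demo_Data <- read.csv(demo_path)

dim(Demo_Data)    # number of rows, then number of columns
names(Demo_Data)  # the column names R found in the header row
```

If read.csv() complains that it cannot open the file, checking file.exists() and getwd() usually shows whether the name or the working directory is the problem.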
So we've done step one, we set our working directory. We've done step two, we opened our file. If you were here two weeks ago, that took us about 30 minutes. Today, it took us maybe five? You'll get faster at this as you keep going.
If this has worked and you have Fake_Data in your global environment, you can double-click this, and it will open kind of like it's an Excel file that you can look at. You should have a Gender column, Fake_Data1, Fake_Data2, Fake_Data3, Fake_Data4, Colour, and Group. If we look at this, we can see Fake_Data1 – 4; it has a range of values, it has decimals; this should be what we call continuous data. If we look at Gender, Colour, and Group, these range from 0 – 1 to 1 – 4. It's got buckets, it's got different categories; these are what we call categorical variables. They LOOK like numbers right now, but the numbers actually mean something else. And we'll get to that in a minute. If you're having problems and you need some help, feel free to throw it in the chat, but if you've set your working directory and opened your file and you were able to double-click on it, you're already caught up. You're exactly where you need to be.
All right. So far, we've set the working directory, we've opened the file. The next thing we need to do is look at the data; we need to review the data. We already looked at it visually, but we're going to check using R to see what data types it thinks each of those groups are, each of those variables. If you have never installed this package before, you will need to run line 28, and it might take a minute or two, so feel free to highlight it and click the run button. Now line 28 says: install.packages("tidyverse").
Tidyverse is a very common package. It's got multiple different sub-packages within it, but it's a very common package for manipulating data and looking at data and figuring out what's going on with your data. So we need to first install the package from the Internet and put it on our computer, and then once that is done, if you get the blue pac-man (>) and you no longer have a stop sign, once that is done, you're ready to run line 29. If you already have the package installed (which I do so I didn't run line 28)…if you already have the package installed, you can highlight and run line 29. Line 29 is: library(tidyverse) and what this does is it loads our library. It's loading tidyverse right now in R, so that we can use the functions within tidyverse right now.
You'll get this output showing that we've loaded a whole bunch of different stuff, and then we'll get our blue pac-man (>). And there's no stop sign, meaning it has completed running that library line.
We've taken tidyverse from the Internet and put it on our computer. We've installed tidyverse and we've run the library tidyverse; the library lets us use that package right now. So what we're going to do is use tidyverse to take a glimpse at our data. We're going to look at our data and see what's going on. We can use the glimpse() function, glimpse, left bracket to denote it's a function, and then you can give it the name of your data set, which is in your global environment. And you have to spell it exactly the same way. So if you've used Fake_Data, name it the exact same thing. And we can highlight this line and we can run it. And it gives us some information down in our Console. It says we have run this line, glimpse(Fake_Data). It has 30 rows and 7 columns, which matches the information that we have in the global environment, which is good. And it's got dollar sign “$” to denote which column are you looking at. It's indexing. What column are we in right now? The Gender column is listed as integer <int>, and it's got some 0s and some 1s. Dollar sign “$” Fake_Data1 is <dbl>, which means double, which is RStudio's way of saying it's continuous data, it's got some decimals. And we've got a bunch of different numbers here with decimals. Fake_Data1, 2, 3, 4 are all listed as this double <dbl> format. Dollar sign “$” Colour is listed as integer <int> or number, and dollar sign “$” Group is listed as integer <int> or number; and we've got 1 – 4 and we've got 1 – 3.
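If tidyverse is not available, base R's str() gives a similar overview. This is a swapped-in alternative to glimpse(), not the function used in the lesson, and the demo data frame below is a made-up stand-in for Fake_Data.

```r
# Made-up stand-in with the same kinds of columns as the lesson's file.
demo <- data.frame(
  Gender     = c(0L, 1L, 1L),
  Fake_Data1 = c(1.5, 2.25, 3.75),
  Colour     = c(1L, 4L, 2L)
)

# Base R's str() lists one line per column with its type,
# much like glimpse() does.
str(demo)

# sapply() with class() pulls out just the type of each column.
sapply(demo, class)
```

Note that str() labels the columns "int" and "num" rather than <int> and <dbl>, but the information is the same: whole-number codes versus decimal values.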
I've left you some code for if your file has opened slightly weirdly, which sometimes happens. So if I double-click on Fake_Data, sometimes in my computer the Gender column, or other columns, will download incorrectly, and it will have this weird format where it's got like an “i” with a double omicron dot dot and then the variable name. So I've left you a little bit of code here for if your stuff has downloaded incorrectly, especially if it's the Gender column because we're about to work with the Gender column. If yours looks like mine right now, where it's got Gender, Colour, Group, everything is fine and you don't have to worry about the next few lines. But if it downloaded weird, I left you some code just in case.
So we want to make sure we have renamed the Gender variable. Here, if we run glimpse(), it will look exactly the same because we didn't actually have to rename it because on my computer it worked this time. So that's okay as long as it says Gender, you're fine.
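The renaming code from the script is not shown in this transcript, so the following is only one possible way to do it in base R, under the assumption that the damaged name looks like "ï..Gender". The demo data frame is invented for illustration.

```r
# Made-up data frame whose first column name was mangled on import.
# "\u00ef..Gender" is an assumed example of how the damaged name can look.
demo <- data.frame(x = c(0, 1), Colour = c(2, 3))
names(demo)[1] <- "\u00ef..Gender"

# Rename only if the mangled name is present,
# so these lines are safe to run either way.
bad <- names(demo) == "\u00ef..Gender"
names(demo)[bad] <- "Gender"

names(demo)
```

Because the rename only happens when the damaged name is found, running these lines on a file that opened correctly changes nothing. Another option that is sometimes suggested is passing fileEncoding = "UTF-8-BOM" to read.csv() when opening the file.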
So we see Fake_Data1,2,3,4 double <dbl>, that's fine; it's got decimals, we can leave that. But Gender, Colour, and Group being <int> or integer, is probably not what we want; we want to set that as categorical data. There are a couple of different ways we can do that. So categorical data, I've left us an example for treating it as character <chr>, so treat it like it's text. So double <dbl> is continuous, so these ones are fine. Gender, Colour, and Group are categorical, let's change those. I've left you some code on lines 45, 46, and 47. Let me explain what they do. We're using the function as.character(). We're going to set it to character <chr> or text, so as.character, left bracket to say it's a function; we're going to set something to a character <chr>. What are we setting? We have to tell it which data set are we working out of, in case we had multiple open, let's say we had four open: Which one? So we say we want to change something in Fake_Data. And we've got these dollar signs “$”. Dollar sign is how we're going to tell it which column. So we say Fake_Data$ … , and then the exact name of the column you would like to change. So “Gender” [as.character(Fake_Data$Gender)]. So we're going to treat the Gender column as a character <chr> value. And we have to set that equal to something; if we run this by itself, it might not do exactly what we expect. So we have to say that is equal to Fake_Data$Gender. We're going to overwrite the values in that column to be treated as character <chr>. And if you didn't understand that, that's okay, all you need to know is we can highlight all three of these lines at the same time, if we want. We're going to change Gender, Colour, and Group; we're going to change the data types so that RStudio is treating them properly. So we can highlight those and click “Run”. We should get some blue text down in our Console to show that it worked. If we had red text, it would be an error code saying something went wrong.
Maybe you forgot to install tidyverse. Maybe you called the column name incorrectly. Maybe you forgot to give it which data frame you're working out of. But if it's worked properly, you'll get some blue text in the Console. You want to make sure that it did actually work. It looks like it worked, but we're going to rerun glimpse(Fake_Data) to double-check that all of the columns in this data frame that we're using have been set correctly. So if we highlight line 49, we can click “Run”. And we see glimpse(Fake_Data): we have 30 rows and 7 columns; that didn't change. But what did change is Gender, Colour, and Group are now listed as <chr>, or character data type. So we have changed the data type from integer <int> or number to character <chr> which means text, so it's treating them as categorical data now, which is correct. We also still have Fake_Data1,2,3,4 listed as double <dbl>, which is correct because those are continuous variables. So that's our little intro based on data types.
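The conversion and the follow-up check can be sketched like this. The demo data frame is a made-up stand-in for Fake_Data, and sapply() with class() is used here for the check instead of glimpse() so the example needs only base R.

```r
# Made-up stand-in for Fake_Data, with the three categorical
# columns stored as integers.
demo <- data.frame(
  Gender     = c(0L, 1L, 1L),
  Fake_Data1 = c(1.5, 2.25, 3.75),
  Colour     = c(1L, 4L, 2L),
  Group      = c(1L, 3L, 2L)
)

# Overwrite each column with a text version of itself.
demo$Gender <- as.character(demo$Gender)
demo$Colour <- as.character(demo$Colour)
demo$Group  <- as.character(demo$Group)

# Check the result: the three converted columns should now be "character",
# while Fake_Data1 stays "numeric".
sapply(demo, class)
```

Running as.character() without assigning the result back would only print the converted values; the column in the data frame would stay as it was, which is why each line overwrites the original column.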
Time commitment
5 - 10 minutes
Description
Reviewing data types in RStudio. Also includes setting a working directory and opening a file.
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.