Time commitment
Less than 2 minutes
Description
Using RStudio to calculate descriptive statistics for continuous variables.
Video
Transcript
The next thing we're going to talk about is descriptive statistics. So what is a descriptive statistic? You use a descriptive statistic when you wish to describe the data or generate some kind of summary. Generally, when we're calculating a descriptive statistic, we're interested in the sample, not the wider population. If you've ever taken a stats class, you'll know that we generally care about our population, and we pull samples to answer questions about populations. But when we're doing descriptive statistics, we normally care about the sample itself. We care about things like the mean, the median, the mode, and the standard deviation; those are our descriptive statistics. There are many ways to calculate descriptive statistics, and I've left a few examples for us to run through together. We can start with a summary; we can literally use the function summary(). So we write summary(Fake_Data) to say: give me a summary of all of the columns in my fake data set. If I click Run, we get the summary of Fake_Data. The Gender column has a length of 30, meaning it has 30 observations. It's character class, which means it's text; we're going to treat it like text that sorts observations into different buckets, different groups of information. It's categorical, so we don't get things like the mean and the median, because those wouldn't make sense. For example, how do you take the mean of someone's favorite colour?
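As a rough sketch of the summary() step described above, the following can be run in RStudio. The data frame here is a hypothetical stand-in, not the Fake_Data file from the video: the column names follow the transcript, but the values are invented and generated at random.

```r
# Hypothetical stand-in for the Fake_Data set used in the video:
# 30 rows, one character column (Gender) and four numeric columns.
# All values are invented for illustration.
set.seed(1)  # makes the random values reproducible
Fake_Data <- data.frame(
  Gender     = rep(c("Female", "Male"), each = 15),
  Fake_Data1 = rnorm(30, mean = 10, sd = 1.2),
  Fake_Data2 = rnorm(30, mean = 20, sd = 3),
  Fake_Data3 = runif(30, min = 0, max = 100),
  Fake_Data4 = rnorm(30, mean = 50, sd = 10),
  stringsAsFactors = FALSE
)

# Summarize every column at once. The character column reports its
# length and class; each numeric column reports the minimum, first
# quartile, median, mean, third quartile, and maximum.
summary(Fake_Data)

# Summarize a single continuous column.
summary(Fake_Data$Fake_Data1)
```

Running summary() on one column at a time can be handy when only one variable is of interest.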
But what we see is that for Fake_Data1, Fake_Data2, Fake_Data3, and Fake_Data4, our continuous variables, summary() is great. It gives us the minimum value, the first quartile, the median, the mean, the third quartile, and the maximum value, and it gives us those for each of the variables in our data set. So we get a lot of useful descriptive statistics for our continuous variables using the summary() function. But let's say you wanted something else: you needed to write a paper, and you needed the mean and the standard deviation. Well, summary() didn't give us the standard deviation. There are many other functions you can use to get what you need. So we can use a different function: sapply(). We write sapply, then a left bracket to say it's a function; we give it the name of the data frame, so Fake_Data; and we tell it what we would like, which is the standard deviation, sd. There's another argument here that you probably haven't seen before: na.rm = TRUE (in all capitals). “na.rm” is an argument for this function that says whether to remove missing values (NA values). “TRUE” means yes; if we have missing values, remove them. We actually don't have any missing values in our data set, but if you were working with a different data set, it's important to know what you want to do with those missing values. So we can run line 68 by highlighting it and clicking Run. And what it will do is give you information for each of your columns. You might remember the example I just gave: does it make sense to have a mean of your favorite colour? Not really. It did provide us a standard deviation for our colour value, because this function is treating it like it's numeric. So you want to be careful, because sometimes functions will give you things that don't necessarily make sense. So for this function, we only want to be looking at the standard deviation for our continuous variables.
Those are Fake_Data1, Fake_Data2, Fake_Data3, and Fake_Data4. We get a standard deviation for each of them, so for Fake_Data1 our SD is 1.19.
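One possible way to follow the advice above is to restrict sapply() to the numeric columns, so that the text column is skipped. This is a minimal sketch under assumptions: the data frame is a hypothetical stand-in with invented values, and the helper names numeric_columns, column_sd, column_mean, and values_with_gap are introduced only for illustration. It will not reproduce the 1.19 reported in the video.

```r
# Hypothetical stand-in for Fake_Data; all values are invented.
set.seed(1)
Fake_Data <- data.frame(
  Gender     = rep(c("Female", "Male"), each = 15),
  Fake_Data1 = rnorm(30, mean = 10, sd = 1.2),
  Fake_Data2 = rnorm(30, mean = 20, sd = 3),
  Fake_Data3 = runif(30, min = 0, max = 100),
  Fake_Data4 = rnorm(30, mean = 50, sd = 10),
  stringsAsFactors = FALSE
)

# Keep only the numeric (continuous) columns, so Gender is left out.
numeric_columns <- Fake_Data[sapply(Fake_Data, is.numeric)]

# Standard deviation and mean of each numeric column.
# na.rm = TRUE drops missing values before calculating.
column_sd   <- sapply(numeric_columns, sd, na.rm = TRUE)
column_mean <- sapply(numeric_columns, mean, na.rm = TRUE)

# Combine into one table, rounded to two decimals for reporting.
round(cbind(Mean = column_mean, SD = column_sd), 2)

# What na.rm changes when a value is missing:
values_with_gap <- c(2, 4, NA, 6)
sd(values_with_gap)                # NA, because of the missing value
sd(values_with_gap, na.rm = TRUE)  # 2, calculated from 2, 4, and 6
```

Selecting the numeric columns first avoids having to ignore a result that does not make sense for a categorical variable.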
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.