Video
Transcript
Now that we have set our working directory, the next thing we have to do is actually open a file. You can create data frames within RStudio itself, like I could make a tiny data frame like a two by two or something, but generally what we do, is we already have a computer code somewhere, or a script somewhere, or an Excel file, or a text file (there's lots of different types of files you could be using), but we normally already have a file on our computer, and we want to open that in RStudio. So if you want to open a file that's already on your computer, step one: set the working directory. Step two: we have to use code to tell R to open the file that we want.
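As a small aside, the "tiny two by two data frame" mentioned above could look something like this. It is only a sketch: the object name `tiny` and the column names `x` and `y` are made up for illustration and are not from the session script.

```r
# A hypothetical 2 x 2 data frame typed directly into R
# (names and values are invented for illustration).
tiny = data.frame(
  x = c(1, 2),
  y = c("a", "b")
)

tiny        # show the data frame in the console
dim(tiny)   # number of rows, then number of columns
```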
How you open the file is determined by what type of file it is. Certain types of files can be opened in what we call “base R”, which means we don't have to do anything else; we already have everything we need in the software. For example, a .csv file (a comma-delimited file) can be opened with base R. So if I go to my script, we've got line 12. Oh, I should have covered this as well: we've got a hashtag or a pound sign at the beginning with some green text. So on line 12, it says “# Open a file”. The green text means it's for you as a human to read; the computer skips it.
So anytime – there's lots of green text in this one – anytime there's this green text, the computer's not doing anything. The only lines it runs are the ones that start with black text. So setwd() is a function, the computer's reading this, and we know it's reading it because it's in black. Line 12 “# Open a file”, the computer skips that. It's a comment. Line 13 “# Some files use base R, no additional steps needed”. On line 14, we're going to use the function. You could copy it piece-by-piece on your own computer if you're working with a blank document, or you can use the document I've given you. But let me walk you through it. We've got the word “read”, a period “.”, the word “csv”, and then we've got our left bracket “(”. Anytime it's some text followed by round brackets, it's a function. So this is the “read.csv()” function to help us open a .csv file. We've already set our working directory, so what we have to do is, inside those round brackets and inside quotation marks, give it the exact name of the file we are trying to open: with the same uppercase and lowercase, the same spaces or underscores, and we also need to make sure it has the file extension at the end (so it has to say “.csv” because this is a .csv file). So here, we are working with our fake data set that I sent over e-mail. So it's called “Fake_Data.csv”.
We could run just this portion and if I run just this portion, just the function itself, what happens is: it opens my fake data set in the bottom left in our console.
We cannot work with this if it is just in the bottom left in the console, because if we look up in the top right, our global environment, there's nothing there. So it's opened in the console just for us to look at it, but we can't do anything with it. So not only do we have to run the function to open the file, we have to save it as something. So earlier we said “a = 2 + 2”, we can do the same sort of thing. We can give it a name. I've called mine – very exciting – “Fake_Data”. So I can say “Fake_Data = read.csv(“Fake_Data.csv”)”. If I run this whole line, all of line 14. I click our “Run” button: in our global environment, we now have a data set that says Fake_Data. And if we double-click on this from the global environment up top, it opens it in our source window so we can see it. Just like if we open the .csv file. Because if I go over here, it's this file here Fake_Data.csv, this is my Excel file which I have opened in RStudio. So we've got: a Gender column, Fake_Data1, Fake_Data2, Fake_Data3, Fake_Data4, Colour, and Group, and we've got 30 observations total.
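To make the difference between "just running the function" and "saving it as something" concrete, here is a minimal sketch. The real Fake_Data.csv is not reproduced here, so the block first writes a small stand-in file into a temporary folder; the three rows and their values are invented for illustration, and only the file name and the read.csv() calls come from the session.

```r
# Stand-in setup (an assumption for this sketch): write a tiny, made-up
# Fake_Data.csv into a temporary folder and work from there.
demo_dir = tempdir()
old_wd = setwd(demo_dir)   # step one: set the working directory

stand_in = data.frame(
  Gender = c("F", "M", "F"),
  Fake_Data1 = c(1.5, 2.5, 3.5),
  Colour = c("Red", "Blue", "Green")
)
write.csv(stand_in, "Fake_Data.csv", row.names = FALSE)

# Running the function by itself only shows the data in the console;
# nothing is stored in the global environment.
read.csv("Fake_Data.csv")

# Giving it a name stores the data so we can work with it later.
Fake_Data = read.csv("Fake_Data.csv")

names(Fake_Data)   # the column names
nrow(Fake_Data)    # the number of rows (observations)

setwd(old_wd)      # go back to the previous working directory
```

With your own file, only the last few steps matter: set the working directory to the folder that holds the file, then run the named read.csv() line.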
Someone asks: “Do we need to upload this Excel file first?” So right now we are opening a .csv file, we haven't got to the Excel file yet. If you got my e-mail this morning, I had a .csv and an Excel file; they're the exact same file, just different file extensions to show you the difference in RStudio. If you download those from your e-mail, you can put them somewhere on your computer. They will need to be in the working directory you are working out of. If you try to open that file and you haven't set the working directory to the location where the file is, it won't work, and I'll show you that in a second. So make sure in your bottom right in your files window you can see I've got my two files: Fake_Data.csv and Fake_Data.xlsx. The .xlsx is Excel, and the .csv is comma delimited. Hopefully that answers your question.
So I mentioned when we open our file, it has to be spelled exactly correctly. If even one piece of that is spelled wrong, it won't work. So I have an example here on line 16 where I'm going to save our file as “Fake_Data2”. And we're going to use our read.csv() function, so we've got some text, we've got our round brackets. But I forgot the capitalization and I forgot our underscore, so it's just called “fakedata.csv”. If I try to run this line, we get our first red text! Welcome to RStudio, this is an error message. We are getting red text in our console saying something went wrong. And you can read the message; generally it's pretty clear and it tells you what's wrong. Not always. You can always Google it if you don't know what's [not] working. But here it says:
“Error in file(file, “rt”) : cannot open the connection. In addition: Warning message: In file(file, “rt”) : cannot open file ‘fakedata.csv’: No such file or directory.”
So the important part here is that last piece: “No such file or directory”. That is saying either the file does not exist as it is named (i.e., there is no file called fakedata.csv), or the directory is wrong.
So if you get this error later, you might get “No such file or directory”, and you're like “I got - I got it exactly right! Lindsay said it needed the uppercase, it needed the underscore, it needed the .csv. Why is this not working?!” It might be you forgot to set the working directory. So if you're working in the wrong spot, or if you don't have the file name exactly right, you'll get this specific error message. And I think I put that in your Word document as an example too, if you wanted to come back to it. So if we scroll, scroll, scroll…yeah, so if you get this “No such file or directory” error, that means exactly what we just saw: either you've named the file wrong when you're trying to open it in RStudio, or you haven't set the working directory. So that's a very common error someone might get if they're working in RStudio for the first time. And I’ve left you the example of what that looks like.
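If you do run into “No such file or directory”, a few base R checks can help narrow down which of the two causes it is. This is a sketch rather than part of the session script: getwd(), file.exists(), list.files(), and tryCatch() are standard base R functions, but using them this way is a suggestion, and the file name is the misspelled one from the example.

```r
file_name = "fakedata.csv"   # deliberately misspelled, as in the example

getwd()                           # which folder is R currently looking in?
file.exists(file_name)            # can R see a file with exactly this name?
list.files(pattern = "\\.csv$")   # which .csv files are in this folder?

# tryCatch() lets us capture the error instead of stopping the script.
result = tryCatch(
  read.csv(file_name),
  error = function(e) {
    message("Could not open '", file_name, "': ", conditionMessage(e))
    NULL
  }
)

is.null(result)   # TRUE here means the file could not be opened
```

If the file you expect is missing from the list.files() output, the working directory is probably the problem; if it is listed but spelled differently, the file name is.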
So I said the kind of file determines how you open it: .csv is base R, all we had to do is use the read.csv() function. The read.csv() function helps us open the file.
If we were instead working with an Excel file, for example (Excel is a really common file type someone might be using), we cannot open it in base R, which is just a fancy way of saying R by itself does not have the capability to open this file. Same with RStudio: it can't open Excel files on its own. You need to take something from the Internet that someone else has created to help you open this file. So we want to open an .xlsx file; it's a common Excel file type. This function (to open the Excel file) does not exist in base R. You could create it yourself; if you're a beginner, you probably don't want to. I don't even like creating my own functions, and I've been using R for a while.
Or! Someone smart has probably already created a function that can help you, so you can download packages from the Internet that someone else has created to help you do certain things you're trying to do. So we have a package that we can use that we can take from the Internet and put it on our computer, and then use on our computer to help us actually open that Excel file. For example, we can use the read_excel() function; super common. This is located in the readxl package.
What this means is if we go to RStudio: line 18 starts with a hashtag, so this is a comment, it's just for us. We would like to “# Open a file”. Line 19 also starts with a hashtag, so it's a comment, just for us: “# Some files do not use base R. You will need to install and load packages.” Line 20: “# To open a .xlsx (Excel) file, install the readxl package:”. If this is your first time using R (or you have never installed readxl), you will need to run line 21. So you can highlight it like this and click our “Run” button. It's install.packages and then round brackets, so we know it's a function, and inside our brackets and inside our quotation marks we have “readxl” [install.packages(“readxl”)]. This means we would like to take the readxl package from the Internet and put it on our computer. I have already run this on my computer so I don't need to run it again. It does take, depending how old or slow your computer is, a couple of minutes to run, so feel free to run it and just wait. You're taking this from the Internet because someone else has created it and putting it on your computer so you can use it today. It will spew out a whole bunch of stuff in the console for you, so it's going to, like, just keep going and keep going and keep going and keep going. Once you get your blue Pac-Man or your greater-than sign in the bottom left-hand corner here, you know you're done. The other way to know you're done is while it's thinking and thinking and thinking and thinking, you'll have a red stop sign up here; when the red stop sign goes away, it's finished. So for line 21, if you've never done this before, you have to do this one time per computer. I've already done this, so I don't need to do this. Once you've done it, if you wanted, you could put the hashtag in front so that you know you don't have to run it again. Because otherwise, if you come in and run this entire script top to bottom, it will try to re-run that for you. So that's up to you whether you want to comment it out so it doesn't read it the next time.
After you install the package, so you've taken this package that a smart person on the Internet has created for you, you've taken it from the Internet, you've put it on your computer so you can use it. The next thing is you have to tell the computer “I would like to use this package right now”. It doesn't load everything every single time, that would take a long time if you have a lot of packages like me. So what we have to do on line 23 is we have to load our library. So we can use the library() function and say library round brackets because it's a function readxl [library(readxl)]. You need to do this each session you want to use the package. So if I run this today and then close my computer and then tomorrow I open my computer and I want to open another Excel file, I have to run this line again: library(readxl). If I click “Run”, I get red text which normally is an error, but it says: “Warning message: package ‘readxl’ was built under R version 4.3.3”. That's okay, I'm not concerned about that text; it's a warning saying this might be a problem, but for me it's not.
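The "install once per computer, load once per session" routine could be written out roughly like this. It is a sketch, not the session script: the requireNamespace() check is an addition (a base R function that reports whether a package is installed), and the variable name `readxl_loaded` is made up for illustration.

```r
# One time per computer (needs an internet connection). It is commented
# out here so that running the whole script does not reinstall it.
# install.packages("readxl")

# Every session: load the package, but only if it is already installed.
if (requireNamespace("readxl", quietly = TRUE)) {
  library(readxl)
  readxl_loaded = TRUE
} else {
  message('readxl is not installed yet; run install.packages("readxl") once.')
  readxl_loaded = FALSE
}

readxl_loaded   # TRUE if the package is ready to use in this session
```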
If you try to load a library that you have not already installed…because step one is take it from the Internet, put it on your computer, step 2 is run the library line saying this is already on my computer, I would like to run it…line 24 is an example of a package you probably don't have on your computer, so if I try to run line 24, I get some red text. Another error code, very common, and it says:
“Error in library(pacman) : there is no package called ‘pacman’”.
So let's say today you came to my session, you tried to run line 23, and you had not run line 21. You might have gotten this same text saying there is no package called readxl. This is just an error telling you that you forgot step one; you forgot to install the package. You first install the package, then you load the library. So two steps because, of course, it's RStudio, it's got to be that little bit confusing. We’re not actually using “pacman” today, that's just an example. But if you've run library(readxl), you might have got a warning or you might have just gotten some blue text saying it worked. You're good to go, you're ready to open your Excel file.
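If you want to see the "there is no package called" error without it stopping your script, something like the following would work. This is only an illustration: the package name `notARealPackage123` is invented and assumed not to exist on your computer, and wrapping library() in tryCatch() is an addition to what was shown in the session.

```r
missing_pkg = "notARealPackage123"   # hypothetical name, assumed not installed

# character.only = TRUE tells library() that the package name is stored
# in a variable rather than typed directly.
msg = tryCatch(
  {
    library(missing_pkg, character.only = TRUE)
    "loaded without error"
  },
  error = function(e) conditionMessage(e)
)

print(msg)   # the text of the error R produced
```

The fix is the same as described above: install the package first, then load the library.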
So if you're ready to open your file using the read_excel() function, it's pretty easy. It's read_excel(“”) and then in those quotation marks and inside the brackets, the exact name of the file (with capitals, if they exist; with spaces, if they exist; with underscores, if they exist), and then the file extension. So mine is “Fake_Data.xlsx”. And again, I could run just this right half of the function, just the function itself. If I run this, it'll think, and then it'll put that file in my console; it will open it so I can look at it. But if you were interested in actually doing something with that file, you will want to give it a name. So I've called mine “Fake_Data_excel”. And if I run this whole line saying: Fake_Data_excel = read_excel(“Fake_Data.xlsx”), I can run it and in my global environment, I get “Fake_Data_excel”, and if I double-click on it, I can look at my file. We've got: Gender, Fake_Data1, Fake_Data2, Fake_Data3, Fake_Data4, Colour, and Group, and it's got 30 rows.
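Here is a sketch of the read_excel() step that can run even without the session's Fake_Data.xlsx. It assumes readxl is installed and uses `datasets.xlsx`, an example workbook that ships with the readxl package (found with readxl_example()); the object names `example_path` and `example_data` are made up for illustration. The Fake_Data.xlsx line only runs if that file is in your working directory.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  library(readxl)

  # An example workbook that comes with the readxl package itself.
  example_path = readxl_example("datasets.xlsx")
  example_data = read_excel(example_path)
  print(head(example_data))   # first few rows, shown in the console

  # The session's own file, if it is in the working directory.
  if (file.exists("Fake_Data.xlsx")) {
    Fake_Data_excel = read_excel("Fake_Data.xlsx")
    print(dim(Fake_Data_excel))   # rows, then columns
  }
} else {
  message('readxl is not installed; run install.packages("readxl") first.')
}
```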
So we've covered how to download the software if you've never done that before. We've covered how to set a working directory: how do you actually tell the computer where your files are located on your own computer? And then we've covered two different file types to open: .csv files and Excel files.
Time commitment
5 - 10 minutes
Description
Using RStudio to complete step 1 of importing data: set the working directory.
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.