OpenRefine Tutorial Series: Video 2

Transcript

OpenRefine video 2. In this video, we’re going to start working hands-on in OpenRefine. We’ll create our first project, import the dataset, and walk through some basic data-cleaning and discovery tools. Before we begin, let’s make sure you have the dataset ready to go.
To get started, you’ll need a copy of the dataset we’ll be working with. You can find a link to download the file in the video’s description on YouTube. Look for Ontario_immigration_records_1865-1883_dataset_0, and save it somewhere easy to access.
Now let’s open OpenRefine. If it’s pinned to your taskbar, you can launch it from there. If you’re on Windows and don’t see it pinned, click the Search icon in the bottom-left corner of your screen and type “OpenRefine.”
When the program starts, it may take a moment to load. On Windows, you’ll see a small black command window appear - this is normal, and it needs to stay open in the background while you’re using OpenRefine, so just leave it running.
After a few seconds, your default web browser will open automatically, and OpenRefine will be ready to use on both Windows and Mac.
Let’s take a moment to look at the interface. Along the far-left edge of the screen, there’s a vertical menu arranged from top to bottom. The first three options are the ones you’ll use most often, because they deal with creating and accessing projects.
Create Project is what you select when starting a new project — and that’s what we’ll be doing today.
Open Project lets you open any projects that already exist on the computer you’re working on.
Import Project is for bringing in project files created on another machine.
Below these, you’ll find options for language settings and for adding extensions. If you’re interested in extensions, you can click Discover Extensions in the top-right corner of the screen to explore what’s available.
Now let’s return to Create Project, since that’s what we’ll use to start our work. OpenRefine can handle many file formats, including TSV, CSV, Excel, JSON, and XML. You can import data from your computer, from a URL, from the clipboard, or even from a database.
For this video, we’ll import the dataset from your computer. “This computer” is selected by default, so we don’t need to change anything - though you could choose to import data from a web address, a database, or other sources if needed.
Next, click the Choose Files button, located to the right of the list of data sources. A file browser window will appear, allowing you to navigate to the spreadsheet you downloaded: Ontario_immigration_records_1865-1883_dataset_0.csv.
Select the file and click Open at the bottom-right of the file explorer window. Once you’re back in OpenRefine, click the “Next” button, located just below the Choose Files area, to move to the import preview page.
On the import preview page, there are two main areas. The first is at the top of the screen, labelled “Configure Parsing Options”. This section includes a preview of the first 100 rows of your dataset, along with all the columns that OpenRefine has detected.
At the bottom of the screen, you’ll find the settings that control how OpenRefine reads and processes your data. Here you can:
Choose which row(s) to use as column headers,
Skip a number of rows at the top, or...
Limit the dataset by discarding certain rows and setting a maximum number of rows to load.
OpenRefine automatically tries to determine the best way to parse your data based on the file type. For example, .xml files are treated as XML, while files with unrecognized extensions - or data pasted from the clipboard - are treated as tab-separated or comma-separated. OpenRefine checks for tab characters first; if it finds any, it assumes that the data is tab-delimited.
If OpenRefine isn’t fully sure about the file format, it will show a list of possible data options in the “Parse Data As” pane, located in the bottom-left corner of the workspace. From here, you can choose a different format, set a custom separator, or simply continue and split columns later during the data-cleaning steps.
For spreadsheets with multiple worksheets, OpenRefine lists each sheet by name along with its row count. This list appears in the lower section of the workspace, under the heading Worksheets to Import and to the right of the parsing options. Only one worksheet can be previewed at a time.
Keep in mind that OpenRefine does not preserve formatting from the original file, such as cell shading, font styles, or text color. Hyperlinked text will be imported as plain text, but the links themselves will remain clickable within the OpenRefine interface.
If certain characters don’t display correctly in the preview, you can manually choose an encoding, such as UTF-8, UTF-16, or ASCII. Later, after the project is created, you can adjust the encoding for individual columns using the reinterpret() function.
At this stage, let’s name our project. In the top-right corner of the window, click the text field labelled 'Project Name' and replace the current text with: ON_immigration_1865_1883.
We’ll also check the option labelled Trim leading & trailing whitespace at the bottom left of the screen, which removes extra spaces that sometimes appear when data is transcribed from analogue records.
Once the preview looks correct, click the Create Project button in the top-right corner of the screen to continue.
Now that our project is open, let’s take a quick look at the interface. The window is divided into two main sections. On the right, the main area displays a portion of the dataset. Our dataset contains 29,064 rows in total, but the screen shows only the first 10 rows.
Although OpenRefine may look similar to Excel, it works differently: it shows only a subset of records at a time, because you work with data in batches rather than row by row.
On the left side of the workspace, there are two tabs. The area below them is empty for now, but as we work through tasks, items will appear here.
The left tab, labelled Facet / Filter, is where facets, filters, and other tools will appear to help you explore and clean your data.
The right tab, labelled Undo / Redo, shows a history of changes made to the dataset. All changes are reversible, and you can click any entry in this tab to return the dataset to that point in time.
At the top-right of the interface, there are three buttons. Open lets you launch another instance of OpenRefine. Export is used to save your cleaned dataset or project files to share with others. And Help opens the OpenRefine manual for reference.
Below these buttons, there is a drop-down menu for an extension called Wikibase, used by organizations working with Linked Open Data. You can learn more about it in the OpenRefine manual if you’re interested.
Just above the dataset area, on the left side, there are two options labelled Rows and Records. These control how OpenRefine displays your data. In Rows mode, each row is treated as a single record. In Records mode, multiple rows can be linked together if they belong to the same record. For example, in a dataset of journal articles, each article would be a record, and multiple authors could appear on separate rows linked to that same article.
For this workshop, we’ll be working in Rows mode.
Before making any changes, there are several actions you can take to get a better understanding of your data.
Let’s start with Sorting. Sorting helps you see the range of values in a column and spot missing entries.
For example, let’s look at the range in size of Party. Click the arrow within the “Total Number in Party” column header to open a dropdown menu. From that menu, select Sort.
A new window will appear. Choose the option Sort by number and select smallest. Click OK to apply the sort and close the window. Now we can see that the smallest number in a party is 0.
Looking at the metadata, we can see that zeros were entered during transcription whenever a blank appeared in the original records.
Next, let’s find the largest party. You can repeat the previous steps, or click the Sort button at the top-middle of the screen and choose by total number in Party and select reverse. The largest party in this dataset is 151.
To return the column to its original order, click the Sort button next to the column title again and select Remove sort option from the menu that appears.
The next feature we’ll explore is Faceting. Faceting summarizes the contents of a column, helping you see categories, ranges, and patterns. There are several types of facets you can use:
Text facets list all text values along with their counts.
Numeric or Timeline facets display graphs with sliders.
Custom facets include options like word facets, duplicates, text length, or blank cells.
And scatterplot facets show relationships visually between values.
Let’s explore the Nationality column to better understand where our immigrants were coming from. Click the arrow within the column header, hover over Facet in the menu that appears, and then select Text facet from the submenu.
On the left side of the workspace, a pane titled Nationality will open in the Facet / Filter tab. This provides a summary of all values in the column and their counts. We can identify inconsistencies in how nationalities were recorded. For example, there are three separate entries for “American” and three for “Belgian.”
To switch between views, select the count option at the top of the Nationality pane, just below the title. This lists the most common values in the column, with the top three being English, Irish, and Scotch. Keep in mind, this order may change once the data is cleaned so that each nationality is recorded consistently.
It’s a good idea to repeat this process for each text column to continue familiarizing yourself with the dataset. For this workshop, however, we’ll move on to the next feature: Text Filter.
Text filters allow you to quickly focus on specific subsets of data within a column. For example, let’s filter the Trade column for the word “farm” to see if there are any farm-related trade groups.
To do this, click the arrow within the “Trade” column header and select Text Filter from the menu that appears. A filter pane will appear in the Facet / Filter pane on the left side of the screen. Type farm into the field. Doing this shows that there are 246 records containing the word farm. In the first 10 records, the Trade value is listed as Farmer.
To check for other trades containing the root word “farm,” we can open a Text Facet for the Trade column. Click the arrow within the column header, hover over Facet, and then select Text facet from the submenu. The Trade facet pane will appear in the Facet / Filter tab, showing three variations of the word.
To remove the filter and return to all 29,064 records, click the X at the top left of the Trade / Filter pane.
It’s very important when working in OpenRefine to always be aware of how your data is being filtered or selected. Any actions you perform - whether cleaning, transforming, or exporting - will only affect the currently selected subset of data. Working with a filter active without realizing it can lead to incomplete changes or unexpected results, so double-check which rows are included before making edits.
In this video, we explored some of the key discovery tools in OpenRefine, including sorting, faceting, and filtering. These features help you understand your data more deeply before you begin cleaning or transforming it.
When you’re ready, move on to the next video, where we’ll begin working with OpenRefine’s data-cleaning tools.

Time commitment

5 - 10 minutes

Downloads

Ontario_immigration_records_1865-1883_dataset_0_0 [CSV]

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The library is committed to ensuring that members of our user community with disabilities have equal access to our services and resources and that their dignity and independence is always respected. If you encounter a barrier and/or need an alternate format, please fill out our Library Print and Multimedia Alternate-Format Request Form. Contact us if you’d like to provide feedback: lib.a11y@uoguelph.ca

chat loading...