Time commitment
2–5 minutes
Transcript
OpenRefine video 1. OpenRefine is a powerful open-source tool for cleaning, reshaping, and editing large batches of messy or unstructured data. It’s as easy to use as a spreadsheet, but as powerful as a database.
It runs right on your computer - you simply download and install it. Even though it opens in your web browser, everything stays local. Your data never leaves your machine or gets uploaded to the internet.
In this series, we’ll explore how to use OpenRefine to work with messy datasets - from discovering and structuring data, to cleaning, enriching, validating, and finally publishing it.
In this first video, we'll introduce the main data cleaning and preparation tasks, as well as the historical dataset we'll be using throughout the series.
Now let’s take a quick look at these cleaning and preparation tasks — most of which you’ll engage in when working with any dataset in OpenRefine.
First, discovering. This is about getting to know your data - looking for patterns, spotting inconsistencies, and thinking about what corrections might be needed.
Next, structuring. Data comes in all shapes and sizes, so you may need to merge, reorder, or reshape it to make it ready for analysis.
Third is cleaning. Data is often full of errors and inconsistencies, and this “dirty data” can affect the accuracy of your results.
Fourth, enriching. You might find opportunities to add extra information to your existing data to make it more useful.
Fifth, validating. This is where you check the quality and consistency of your data - especially after making changes.
And finally, publishing. This step is about planning and delivering the output of your cleaned and prepared dataset.
In later videos, we’ll try out many of these tasks hands-on, so you can see how they work in practice.
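To give a feel for what two of these tasks look like in practice, here is a minimal sketch in plain Python. It mimics OpenRefine's text facet (discovering) and its trim-whitespace and case transforms (cleaning) on a hypothetical, made-up nationality column; OpenRefine itself does all of this through its point-and-click interface.

```python
from collections import Counter

# Toy values standing in for a messy "Nationality" column (hypothetical data).
rows = ["Ireland", "ireland ", "IRELAND", "Scotland"]

# Discovering: a text facet is essentially a frequency count of distinct
# values, which makes inconsistent spellings easy to spot.
facet = Counter(rows)
print(facet)  # three different spellings of "Ireland" appear separately

# Cleaning: normalize whitespace and case, the same idea as OpenRefine's
# "trim whitespace" and "to titlecase" transforms.
def normalize(value):
    return " ".join(value.split()).title()

cleaned = Counter(normalize(v) for v in rows)
print(cleaned)  # the "Ireland" variants now collapse into one value
```

After normalizing, the facet counts collapse the variant spellings together, which is exactly the effect you'd look for when cleaning a column in OpenRefine.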
Now let’s take a look at the dataset we’ll be using in this series. We’ll be working with the Immigration Records dataset, covering the years 1865 to 1883, sourced from the Government of Ontario Data Catalogue.
The dataset consists of four volumes of assisted immigration registers, created by the Toronto Emigrant Office between 1865 and 1883. These registers provide a chronological record of new immigrants who received government assistance to travel to destinations across southern Ontario. In total, over twenty-nine thousand entries have been transcribed from the original registers.
The dataset includes several variables. Let’s walk through the most important ones.
Names: Usually the immigrant or the head of a family. Sometimes it could be an agent, an interpreter, or someone traveling with children. Not all names were recorded.
Date of application: This is when the immigrant applied at the Toronto Office for an assisted fare. If the ticket wasn’t used, zeros were entered in all the numerical columns.
Nationality
Trade: Only appears in volume four.
Ship: The ship the immigrant arrived on at the port of entry.
Landed: The actual port of entry.
Destination: Either the immigrant’s final destination or the nearest railway station to where they were headed.
Railways: The railway used to reach the destination.
Male adults, female adults, and children: Children are separated by gender, and in volume four, infants are also recorded.
Total number of fares: Usually one person, but sometimes it represents a family or even a larger group traveling under one agent.
Reference code: The Archives of Ontario reference code for the material.
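One detail above is worth illustrating: when an assisted fare was never used, zeros were entered in all the numerical columns. A quick way to find such entries is to flag rows where every numeric column is zero. The sketch below uses a hypothetical CSV excerpt; the column names are assumptions based on the variables described here, not the dataset's actual headers.

```python
import csv
import io

# Hypothetical excerpt; real headers may differ (check the codebook).
data = """Name,Male Adults,Female Adults,Children,Total Fares
John Smith,1,1,2,4
Mary Doe,0,0,0,0
"""

numeric_cols = ["Male Adults", "Female Adults", "Children", "Total Fares"]

# Validating: flag rows where every numeric column is zero, the convention
# the registers used for an assisted fare that was never used.
unused = [
    row["Name"]
    for row in csv.DictReader(io.StringIO(data))
    if all(int(row[c]) == 0 for c in numeric_cols)
]
print(unused)  # ['Mary Doe']
```

In OpenRefine itself you'd reach the same result with numeric facets on those columns rather than code.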
To fully understand the data structure and the content, I recommend checking the codebook that comes with the dataset. In the next video, we’ll launch OpenRefine, import some data, explore the interface, and use faceting and sorting to start understanding our dataset.
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.