OpenRefine Tutorial Series: Video 4

Transcript

OpenRefine video 4. In this video, we’ll explore how to structure and reshape your dataset in OpenRefine. You’ll learn how to remove unwanted records, hide, remove, or add columns, reorder or split them, restructure date columns, and consolidate similar values using facets, filters, and transformations.
When working with data, we often need to reshape it to suit our needs. This can involve removing unnecessary records, reordering or adjusting columns, splitting or combining data, and changing formats to make the dataset easier to work with.
As always, all changes are reversible using the Undo / Redo tab on the left.
At the moment, our dataset contains several columns that don’t add value to our research.
To hide a column temporarily, click the arrow next to its header, hover over View, and then select Collapse this column.
To delete a single column, click the arrow within the column header, hover over Edit column, and select Remove this column.
To remove multiple columns at once, click the arrow next to All at the top left of the sheet, hover over Edit columns, and choose Re-order / Remove columns. In the window that appears, drag the columns you want to remove to the pane on the right labelled Drop columns here to remove, then click OK.
Columns can also be reordered to suit your needs:
To move a single column, click the arrow next to its header, hover over Edit column, then select Move column to beginning or another position that fits your needs.
To reorder multiple columns, click the arrow next to All at the top left of the sheet, hover over Edit columns, then select Re-order / remove columns. In the window that appears, drag the columns on the left into your preferred order. When you’re finished, click OK in the bottom-right corner.
Often, datasets contain incomplete records. In our example, we’ll remove any rows without a Ship value.
To do this, click the arrow within the “Ship” column header.
Then hover over Facet, then Customized facets, and finally select Facet by blank (null or empty string) from the bottom of the submenu that appears.
A new facet pane titled Ship will appear on the left side of your screen showing two values: false (27,529) and true (1,535). Because we’re using the “null or empty string” facet, true represents the records that match that condition. In other words, 1,535 records have no Ship value, while the remaining 27,529 records do.
Click true in the Ship facet pane. The word true will turn red to show it’s been selected, and the main table will update to display only the records without a Ship value. To remove these records from your dataset, click the arrow next to All at the top left of the sheet, hover over Edit rows - not Edit all rows - and then select Remove matching rows.
In the Ship facet pane, true will now show 0 records. To return to the remaining 27,529 records that do have a Ship value, click false to switch back to the full view. When you’re finished, you can remove the Ship facet by clicking the X in the top-left corner of the facet pane.
Let’s repeat the same process for the Trade column, removing rows marked true and then removing the facet.
Click the arrow next to Trade, hover over Facet, then Customized facets, and finally select Facet by blank (null or empty string).
A new facet pane will appear on the left side of your screen with two values: false 2,499 and true 25,030. This indicates that 25,030 records do not have a Trade value.
Click true in the Trade facet pane. The main table will update to show only the records without a Trade value. Next, click the arrow next to All at the top left of the sheet, hover over Edit rows, and select Remove matching rows to delete these records.
The rows without a Trade value are now removed. To return to the remaining 2,499 records, click false, or simply close the Trade facet by clicking the X in the top-left corner of the facet pane.
At this point, your dataset should contain 2,499 rows, which is shown in the top-left corner of the sheet.
Next, we’ll restructure the Date of Application column to make it easier to work with. This involves splitting the original date into separate month, day, and year columns, while also keeping a copy of the original data for reference.
Start by creating a duplicate of the Date of Application column. Click the arrow within the column header, hover over Edit Columns, then select Add column based on this column. A new window will appear, with a text entry field at the top to assign the new column a name. Type Date of Application Original into this field and click the OK button at the bottom right of the window.
From here, we can examine the original column using a text facet to see how the month, day, and year are separated. Click the arrow within the Date of Application column header, then hover over Facet, and choose Text facet from the submenu that appears.
We’ll split the column into multiple columns based on blank spaces and rename them as Application Month, Application Day, and Application Year. Click the arrow within the Date of Application column header. In the menu that appears, hover over Edit column, then select Split into several columns from the submenu.
A new window will appear showing the column-split options. By default, “by separator” is selected, with a comma entered in as the separator. Keep “by separator” selected, but delete the comma and replace it with a single space. Other settings can remain as they are, including the option to remove the original column after splitting. Click OK to apply the split. OpenRefine will create three new columns.
Now rename each of these columns. Click the arrow in a column header, hover over Edit column, and select Rename this column. In the window that appears, type the new name - Application Month, Application Day, or Application Year - then click OK. Repeat this for all three columns.
Next, we’ll convert the original date column back into proper date format so that OpenRefine recognizes it as a date. Within the Date of Application Original column header, click the arrow, hover over Edit Cells, then hover over Common Transforms, and select To date from the submenu that appears.
After converting the column to a date, OpenRefine adds a time component by default, which isn’t necessary for our purposes. To simplify the display, we’ll convert the Date of Application Original column back to text. Within the column header, click the arrow, hover over Edit Cells, then hover over Common Transforms, and select To text.
Once it’s back in text format, we can clean up the values by removing the extra time string. Click the arrow within the column header again, hover over Edit Cells, then “Transform...”
A new window will appear. At the top, there’s a text entry box labelled Expression, where you type the transformation formula. Below that is a preview of your data showing how the changes will look. In the Expression box, type:
value.replace("T00:00:00Z","")
This tells OpenRefine to replace all instances of T00:00:00Z with nothing. Click OK, and the dates will now display in a simpler, easier-to-read format.
Next, we’ll look at consolidating trade types to make our dataset easier to work with. Sometimes similar trades are listed in multiple ways, and we can use a combination of filters and transformations to standardize them.
First, add a new column based on Trade. Click the arrow within the Trade column header, hover over Edit column, and then select Add column based on this column. A new window will appear, prompting you to name the column. Type Trade Categories into the text entry field labelled New column name, then click OK in the bottom right corner of the window.
Next, we’ll filter this new column to focus on records containing the word farm. Click the arrow within the Trade Categories column header, then select Text Filter. A filter pane will appear on the far-left side of the screen, within the Facet / Filter section of the workspace. Within this filter pane, type farm into the text entry field, and the filter will automatically apply, showing only records that contain the word “farm”.
Once again, click the arrow within the Trade Categories column header. From the menu that appears, hover over Edit Cells, and then select Transform…. A new window will open containing a text entry field labelled Expression. In this Expression box, type:
“Farm labour”
Click OK at the bottom right of the window. This will update all filtered records in the Trade Categories column to say Farm labour.
Once you’ve applied the transformation, return to the full dataset by clearing the text in the entry field of the Trade Categories filter pane, or by clicking the X at the top left of the pane.
Even after cleaning, you may notice that some values in a column are slightly different but really represent the same thing. For example, in the Trade Categories column, you might see “Farm labour” with a lowercase “l” in labour, “Farm Labour” with an uppercase “L” and the American spelling of labor, and “farm labor” where both words start with lowercase letters. Clustering helps you identify and merge these similar entries, making your dataset consistent and easier to analyze.
First, open the Trade Categories column facet. Click the arrow within the Trade Categories column header, hover over Facet, and select Text facet. In the facet pane that appears on the left side of the screen, make sure the option at the top labelled “name” is selected to sort the values alphabetically.
Next, click the Cluster button at the top right of the facet pane and a Clustering window will appear. At the top of the window, you’ll see dropdown menus to set the Method and Keying function. Check that Method is set to Key Collision and set Keying function to Daitch-Mokotoff.
“Below these options is a blank box where clusters will appear. To run the clustering process, click the Cluster button within this box. After clicking the button, the window populates with groups that OpenRefine has identified as similar values. The layout includes several key sections:
First, there’s a column called Merge, which has checkboxes. You use these to select which values you want to combine.
Next is the Values column, which lists all the words or entries that OpenRefine has grouped together as similar.
The New Cell Value column contains text entry fields. Here you can type the name that you want the merged values to have.
Then there’s Cluster Size, which tells you how many original values were grouped together in that cluster.
Finally, the Row Count column shows how many times the clustered values appear in the dataset.
To the right of these columns are filter sliders with coloured backgrounds. You can use these sliders to narrow down which clusters are shown based on various criteria, for example, to display only clusters with more than 100 rows.
Review the suggested clusters. For any items you want to combine, check the boxes in the Merge column. If needed, adjust the text in the New Cell Value column to define the consolidated name. Then, click the Merge Selected & Re-Cluster button at the bottom right of the window. Repeat this process until you’re satisfied with the results.
When you’re finished, click the Close button at the bottom right of the window to exit the clustering tool. Your similar values in Trade Categories are now consolidated.
Next, we’ll clean up any remaining variations using filters and GREL transformations.
First, apply a text filter to the Trade Categories column. Click the arrow within the column header, then select Text filter from the menu that appears. A filter pane will appear on the left side of the screen within the Facet / Filter panel. In the entry field, type agent. The table will now show only rows that contain the word agent.
Next, click the arrow within the same Trade Categories column header, hover over Edit cells, and select Transform. In the Expression box of the window that appears, type "Agent" with quotation marks. This will standardize all variations of agent to the same value. Click OK in the bottom right corner of the window to apply the change.
Repeat this process for other variations, such as filtering Eng and transforming it to "Engineer".
You can also make direct edits within a facet window. Hover over a row in the facet list. Small blue text labelled Edit will appear on the right side of the line. Click Edit to open a small window with a text entry field. Make your change, then click Apply in the bottom left of the window to update the value across the dataset.
Finally, let’s make one last adjustment to the Destination column. Since this column covers immigration to Ontario, we want to clearly indicate the province in each entry. This step ensures that all destinations consistently include “ON” for Ontario.
To add characters to the Destination column, click the arrow within the column header, hover over Edit cells in the menu that appears, then select Transform from the submenu.
In the dialog that opens, you will see the word value already typed in the expression box. Leave that as it is. After it, type a plus sign, then a quotation mark followed by a comma, then a space, the letters O and N in uppercase, and then a closing quotation mark.
Click the OK button in the bottom right of the dialog to apply the change to every row. This will add a comma, a space, and the letters O-N to the end of all values in the Destination column.
All changes to your dataset are automatically saved within OpenRefine, so you can continue working on your project at any time without doing anything extra. Exporting the project as an archived file is only needed if you want to create a backup, share it, or open it on another computer with OpenRefine.
To export your work, click the Export button at the top right of the screen. To save the entire project, including the data, metadata, and all recorded operations, select OpenRefine project archived to file from the menu that appears. This file can be opened on any computer that has OpenRefine installed.
To export only the cleaned data itself, select Comma-separated value or CSV from the Export menu. This allows you to open your data in other programs like Excel.
Finally, close the OpenRefine browser tab. You’ll still have the command window open - the one that launched OpenRefine. If you click the X to close that window, the OpenRefine server may continue running in the background.
To fully stop it and avoid leaving an extra process running, return to that command window and press Control + C. Doing this cleanly shuts down the OpenRefine server.
This concludes the series on the Ontario assisted immigration dataset, covering data cleaning, structuring, and refinement to make the historical records easier to work with and analyze.

Time commitment

5 - 10 minutes

Downloads

Ontario_immigration_records_1865-1883_dataset_0_2 [CSV]

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The library is committed to ensuring that members of our user community with disabilities have equal access to our services and resources and that their dignity and independence is always respected. If you encounter a barrier and/or need an alternate format, please fill out our Library Print and Multimedia Alternate-Format Request Form. Contact us if you’d like to provide feedback: lib.a11y@uoguelph.ca

chat loading...