Aims and objectives
This module will:
- explain what data is
- examine how data is used
- explore ways to analyse data.
After completing this module, you will be able to:
- find, clean and use data
- evaluate and select tools to analyse and display data.
4. Clean data
Data needs to be in a usable format to conduct an analysis. This often requires some preparation and “cleaning” of the data.Cleaning refers to the removal or modification of incorrect, incomplete or irrelevant data. Even open datasets may require some cleaning.
Structured or unstructured
Structured data is data that is well-organised. It follows a specific order, in a specific format. The data is categorised by different characteristics. Unstructured data has a range of information that is not organised into separate categories. It needs to be structured into a machine-readable format so that a computer can “read” the data.
Preparing data
You may have to sort or standardise the data. Dates, names and addresses are common data that require preparation.
Survey respondents may have provided a range of answers to a question:
e.g. "Where do you live?"
- Brisbane, QLD, Australia
- Alice Springs, NT
- Hobart, Aus
You would need to sort the responses into separate fields and standard formats e.g.:
City | State | Country |
---|---|---|
Brisbane | Queensland | Australia |
Alice Springs | Northern Territory | Australia |
Hobart | Tasmania | Australia |
Data formats
Data that you analyse by computer must be in a format that software can open and read. The delimited format, where each variable is separated by a comma, tab or space, can be imported into software for analysis.
The *.csv (comma-separated value) extension is useful because it maintains field separations across software. If you have created a table in Microsoft Excel, and want to export it to another type of software, it is better to save the file as a csv.
Save an Excel file as a csv:
- Click File
- Select Save As
- Then select the CSV (Comma delimited) *.csv option from the list
- Choose the location to save the file
- Click Save.
A csv file may not import correctly into other software if there are:
- blanks lines
- special (non-standard) characters e.g. # & @
- missing commas between fields.
Data in a "clean" csv file should appear like this example of female Oscar winners:
Index, Year, Age, Name, Movie
1, 2016, 26, Brie Larson, Room
2, 2017, 28, Emma Stone, La La Land
3, 2018, 60, Frances McDormand, Three Billboards Outside Ebbing Missouri
4, 2019, 45, Olivia Colman, The Favourite
5, 2020, 50, Renee Zellweger, Judy
Download this example data file (CSV, 1KB) and open it with Microsoft Excel or Google Sheets to see how it appears in a spreadsheet.
A data file may require special statistical software to open it. See the Working with files module for information about other file extensions and the software required to open them.
Tools for cleaning data
You can manually sort data into the correct fields but if you have a lot of data it can be very time consuming. You can use formulas and functions in software to do the task much faster.
Tool | Freely available | Licensed version available via UQ | On Library computers | Guides | Tutorials | UQ Library training |
---|---|---|---|---|---|---|
OpenRefine | Yes | No | Yes | OpenRefine Documentation | Introduction to Google Refine (YouTube, 6m47s) | Yes |
R | Yes | No | Yes | Essential R resources | Cleaning bad data in R (LinkedIn Learning course) requires a UQ login | Yes |
Microsoft Excel | No | Get Office 365 for personal use | Yes | Top 10 ways to clean your data | Cleaning data imported into Excel (LinkedIn Learning, 5m13s) requires a UQ login | Yes |
Our Text mining and text analysis guide has a lot of information about preparing text for analysis.
Duration: Approximately 30 minutes
Graduate attributes
Knowledge and skills you can gain to contribute to your Graduate attributes:
Independence and creativity
Critical judgement
Ethical and social understanding
Check your knowledge
Check what you know about this topic:
Support at UQ
Access UQ services to assist you with personal or study-related issues.