Working with data

Aims and objectives

This module will:

  • explain what data is
  • examine how data is used
  • explore ways to analyse data.

After completing this module, you will be able to:

  • find, clean and use data
  • evaluate and select tools to analyse and display data.

4. Clean data

Data needs to be in a usable format to conduct an analysis. This often requires some preparation and “cleaning” of the data.Cleaning refers to the removal or modification of incorrect, incomplete or irrelevant data. Even open datasets may require some cleaning.

Structured or unstructured

Structured data is data that is well-organised. It follows a specific order, in a specific format. The data is categorised by different characteristics. Unstructured data has a range of information that is not organised into separate categories. It needs to be structured into a machine-readable format so that a computer can “read” the data.

Preparing data

You may have to sort or standardise the data. Dates, names and addresses are common data that require preparation.

Survey respondents may have provided a range of answers to a question:

e.g. "Where do you live?"

  • Brisbane, QLD, Australia
  • Alice Springs, NT
  • Hobart, Aus

You would need to sort the responses into separate fields and standard formats e.g.:

Alice SpringsNorthern TerritoryAustralia

Data formats

Data that you analyse by computer must be in a format that software can open and read. The delimited format, where each variable is separated by a comma, tab or space, can be imported into software for analysis.

The *.csv (comma-separated value) extension is useful because it maintains field separations across software. If you have created a table in Microsoft Excel, and want to export it to another type of software, it is better to save the file as a csv.

Save an Excel file as a csv:

  1. Click File
  2. Select Save As
  3. Then select the CSV (Comma delimited) *.csv option from the list
  4. Choose the location to save the file
  5. Click Save.

A csv file may not import correctly into other software if there are:

  • blanks lines
  • special (non-standard) characters e.g. # & @
  • missing commas between fields.

Data in a "clean" csv file should appear like this example of female Oscar winners:

Index, Year, Age, Name, Movie
1, 2016, 26, Brie Larson, Room
2, 2017, 28, Emma Stone, La La Land
3, 2018, 60, Frances McDormand, Three Billboards Outside Ebbing Missouri
4, 2019, 45, Olivia Colman, The Favourite
5, 2020, 50, Renee Zellweger, Judy

Download this example data file (CSV, 1KB) and open it with Microsoft Excel or Google Sheets to see how it appears in a spreadsheet.

A data file may require special statistical software to open it. See the Working with files module for information about other file extensions and the software required to open them.

Tools for cleaning data

You can manually sort data into the correct fields but if you have a lot of data it can be very time consuming. You can use formulas and functions in software to do the task much faster.

ToolFreely availableLicensed version available via UQOn Library computersGuidesTutorialsUQ Library training 
OpenRefineYesNoYesOpenRefine DocumentationIntroduction to Google Refine (YouTube, 6m47s)Yes
RYesNoYesEssential R resourcesCleaning bad data in R (LinkedIn Learning course) requires a UQ loginYes
Microsoft ExcelNoGet Office 365 for personal useYesTop 10 ways to clean your dataCleaning data imported into Excel (LinkedIn Learning, 5m13s) requires a UQ loginYes

Our Text mining and text analysis guide has a lot of information about preparing text for analysis.

Duration:   Approximately 30 minutes

Graduate attributes

Knowledge and skills you can gain to contribute to your Graduate attributes:

 Independence and creativity

 Critical judgement

 Ethical and social understanding

Check your knowledge

Check what you know about this topic:

Take the quiz

Support at UQ

Access UQ services to assist you with personal or study-related issues.