Overview
- cleaning is cool
- names
- data hoarding
- redundancy
Data Janitor
Keeping data clean
Someone has to do it

Hygiene is a habit
- consistency > intensity
- keeps things presentable
- why should data be different?
Cleaning your data
- locate corrupt data
- remove or repair corrupt data
- create a procedure to prevent similar corruption
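
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the record layout (a dict with "name" and "age") and the rule for what counts as corrupt are assumptions made up for the example, as are the function names.

```python
def is_corrupt(record):
    """Step 1: locate. Assumed rule: name must be a non-empty string,
    age must be a non-negative integer."""
    name = record.get("name")
    age = record.get("age")
    return (
        not isinstance(name, str)
        or not name.strip()
        or not isinstance(age, int)
        or age < 0
    )


def repair(record):
    """Step 2: repair if possible, otherwise return None so the caller drops it."""
    fixed = dict(record)
    name = fixed.get("name")
    if isinstance(name, str):
        fixed["name"] = name.strip()
    age = fixed.get("age")
    if isinstance(age, str) and age.strip().isdigit():
        fixed["age"] = int(age.strip())
    return None if is_corrupt(fixed) else fixed


def clean(records):
    """Run steps 1 and 2 over a list of records."""
    cleaned = []
    for record in records:
        if is_corrupt(record):
            record = repair(record)
        if record is not None:
            cleaned.append(record)
    return cleaned


def validate_on_entry(record):
    """Step 3: prevent. Reject corrupt records before they are stored."""
    if is_corrupt(record):
        raise ValueError(f"rejected corrupt record: {record!r}")
    return record


records = [
    {"name": "Bob", "age": 42},
    {"name": "James", "age": "37"},  # repairable: age stored as text
    {"name": "", "age": 5},          # not repairable: no name
]
cleaned = clean(records)
```

Running the same check at entry time (step 3) is what turns cleaning from a one-off chore into a habit.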
Is hygiene a hassle?
- not if performed regularly
- will make other work more pleasant
- think about the future
Bad Names
- vague
- redundant
- misleading
- hard to read
Comparison
- bigPaper.doc vs. economicEffectsOfPiracy.doc
- estFINAL.txt vs. quarterly_estimate.txt
Naming Conventions
- consistent delimiters
- keep words short
- explain what is in the file
- don't use spaces
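
Two of these conventions are easy to check mechanically. The sketch below is one possible checker, assuming that "_", "-" and "." are the delimiters in use; the function name `name_problems` is hypothetical. Whether a name is short enough or actually explains the file's contents is left to human judgment.

```python
def name_problems(filename):
    """Return a list of convention problems found in a file name."""
    # Drop the extension so its dot is not counted as a delimiter.
    stem = filename.rsplit(".", 1)[0]
    problems = []
    if " " in filename:
        problems.append("contains spaces")
    # Assumed delimiter set; adjust to your own convention.
    delimiters = {ch for ch in stem if ch in "_-."}
    if len(delimiters) > 1:
        problems.append("mixed delimiters")
    return problems


good = name_problems("quarterly_estimate.txt")
bad = name_problems("big paper-final_v2.doc")
```

Here `good` is an empty list, while `bad` reports both spaces and mixed delimiters.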
Keep it Simple
What do you really need?
Scope
- know what you are looking for
- remove the extra stuff
Redundancy
- don't want it in your papers
- don't want it in your conversations
- don't want it in your data
Redundant data
Name | NameMid | name2 | nameFull |
Bob | Bobby | Rob | Bob Robertson |
James | Danger | 007 | James Bond |
Darth | N/A | Anakin | Darth Vader |
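
One way to spot this kind of redundancy is to test whether a column can be recomputed from another. The sketch below uses the rows from the table above and checks a single assumed rule: a column is redundant if it always equals the first word of nameFull. The helper name `derivable_from_full` is hypothetical.

```python
rows = [
    {"Name": "Bob", "NameMid": "Bobby", "name2": "Rob", "nameFull": "Bob Robertson"},
    {"Name": "James", "NameMid": "Danger", "name2": "007", "nameFull": "James Bond"},
    {"Name": "Darth", "NameMid": "N/A", "name2": "Anakin", "nameFull": "Darth Vader"},
]


def derivable_from_full(rows, column):
    """True if every value in `column` equals the first word of nameFull."""
    return all(row["nameFull"].split()[0] == row[column] for row in rows)


name_is_redundant = derivable_from_full(rows, "Name")
name2_is_redundant = derivable_from_full(rows, "name2")

# Drop the column that can always be recomputed from nameFull.
slim_rows = [{k: v for k, v in row.items() if k != "Name"} for row in rows]
```

Under this rule the Name column is redundant, because it can be rebuilt from nameFull at any time, while name2 carries information found nowhere else.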
Don't repeat yourself
- repetition: good for security
- repetition: bad for workflows