__STYLES__
Tools used in this project
Housing Data SQL Project

About this project

Introduction

The main objective of this project is to clean data in SQL. The data includes information about sale dates, addresses, owner names, value, size, etc.

Data

The first five rows of the original data are shown in the following table:

undefinedundefinedundefined

Objectives

  1. Standardize SaleDate column format.
  2. Populate PropertyAddress column.
  3. Split PropertyAddress and OwnerAddress into individual columns (Address, City, State).
  4. Change 'Y' and 'N' to 'Yes' and 'No' in SoldAsVacant column.
  5. Remove duplicates.
  6. Delete unused columns.

1. Standardize SaleDate column format

The date format in the SaleDate column includes date and hour, but only the date it's needed:

--To create a new column named "SaleDateConverted" in NashvilleHousing table
ALTER TABLE 
    HousingData.dbo.NashvilleHousing
ADD 
    SaleDateConverted Date

-- The clean date is assigned to the new column SaleDateConverted
UPDATE 
    HousingData.dbo.NashvilleHousing      
SET 
    SaleDateConverted = CONVERT(Date, SaleDate)


SELECT   
    SaleDateConverted
FROM 
    HousingData.dbo.NashvilleHousing

Result

undefined

2. Populate PropertyAddress column

There are some null values in the PropertyAddress column. To fill the empty cells, we are going to look for other rows with the same ParcelID but different UniqueID, that indicates that it is the same person, the same PropertyAddress.

UPDATE      
    a
SET  
    PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress) 
FROM 
    HousingData.dbo.NashvilleHousing AS a
JOIN HousingData.dbo.NashvilleHousing AS b
    ON a.ParcelID = b.ParcelID
    AND a.[UniqueID ] <> b.[UniqueID ]
WHERE
    a.PropertyAddress is null

3. Split PropertyAddress and OwnerAddress into individual columns (Address, City, State)

The PropertyAddress includes address and city, but it is more useful if they are in different columns: PropertySplitAddress and PropertySplitCity.

ALTER TABLE 
    HousingData.dbo.NashvilleHousing     
ADD                                       
    OwnerSplitAddress Nvarchar(255),
      OwnerSplitCity Nvarchar(255),
      OwnerSplitState Nvarchar(255)
    PropertySplitAddress Nvarchar(255),
      PropertySplitCity Nvarchar(255)
UPDATE 
    HousingData.dbo.NashvilleHousing       
SET 
    --It takes the part 3 of OwnerAddress string, then it assigns the substring to   OwnerSplitAddress
    OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3), 
    --It takes the part 2 of OwnerAddress string, then it assigns the substring to OwnerSplitAddress           
    OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2), 
    --It takes the part 1 of OwnerAddress string, then it assigns the substring to OwnerSplitAddress                                                                         
      OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1)   
      --It takes the part 2 of PropertyAddress string, then it assigns the substring to PropertySplitAddress
        PropertySplitAddress = PARSENAME(REPLACE(PropertyAddress, ',', '.'), 2),
    --It takes the part 1 of PropertyAddress string, then it assigns the substring to PropertySplitCity
      PropertySplitCity = PARSENAME(REPLACE(PropertyAddress, ',', '.'), 1)                                                                      

Result:

undefined

4. Change 'Y' and 'N' to 'Yes' and 'No' in SoldAsVacant column

In SoldAsVacant column, some cells has 'Y' or 'N' instead of 'Yes' or 'No'. With the next query, the 'Y and 'N' are transformed to 'Yes' and 'No':

UPDATE 
    HousingData.dbo.NashvilleHousing                            
SET SoldAsVacant = CASE
                         WHEN 
          -- It looks for Y's in SoldAsVacant and then changes it for Yes
                              SoldAsVacant = 'Y' THEN 'Yes'    
                         WHEN
          -- It looks for N's in SoldAsVacant and then changes it for No
                              SoldAsVacant = 'N' THEN 'No'     
                         ELSE
                              SoldAsVacant
                         END

5. Remove duplicates

A Common Table Expression (CTE) is used for removing duplicate rows. The ROW_NUMBER() function is used to generate the row numbers by partition. PARTITION BY function is used to create partitions of rows based on multiple columns: ParcelID, PropertyAddress, SalePrice, SaleDate, and LegalReference.

WITH RowNumCTE AS(
    SELECT *,
           ROW_NUMBER() OVER(                        
             PARTITION BY                               
                       PropertyAddress,             
                 ParcelID,
                       SalePrice,
                       SaleDate,
                       LegalReference
                       ORDER BY
                               UniqueID) AS row_num   
    FROM 
        HousingData.dbo.NashvilleHousing
        )

DELETE
FROM
    RowNumCTE
WHERE row_num > 1

Result:

The above query generates a new column 'row_num' with a '1' or a '2': the '1' indicates that it is only one row with that data, the '2' indicates that row is repeated. Then those rows are deleted.

6. Delete unused columns

In this last query, some columns that are not useful for further analysis are removed from the data table. These columns are: OwnerAddress, TaxDistrict, PropertyAddress, and SaleDate.

ALTER TABLE
    HousingData.dbo.NashvilleHousing
DROP COLUMN
    OwnerAddress,
      TaxDistrict,
      PropertyAddress,
      SaleDate

SELECT *
FROM 
    HousingData.dbo.NashvilleHousing

Conclussions and Results

This was a data cleansing focused project in SQL. Some of the things done include: cleaning and formatting dates, populating columns with missing data, splitting columns, remove duplicates and delete columns.

Some of the functions used include: ALTER TABLE - ADD - DROP COLUMN, DELETE, UPDATE - SET, JOIN, CASE - WHEN - THEN - ELSE, CTE, etc.

The final data table:

undefinedundefinedAfter all this data cleaning process, the data will be much easier to work with because it is now in a more suitable format.

Discussion and feedback(0 comments)
2000 characters remaining
Cookie SettingsWe use cookies to enhance your experience, analyze site traffic and deliver personalized content. Read our Privacy Policy.