__STYLES__
The main objective of this project is to clean data in SQL. The data includes information about sale dates, addresses, owner names, value, size, etc.
The first five rows of the original data are shown in the following table:
The date format in the SaleDate column includes date and hour, but only the date it's needed:
--To create a new column named "SaleDateConverted" in NashvilleHousing table
ALTER TABLE
HousingData.dbo.NashvilleHousing
ADD
SaleDateConverted Date
-- The clean date is assigned to the new column SaleDateConverted
UPDATE
HousingData.dbo.NashvilleHousing
SET
SaleDateConverted = CONVERT(Date, SaleDate)
SELECT
SaleDateConverted
FROM
HousingData.dbo.NashvilleHousing
There are some null values in the PropertyAddress column. To fill the empty cells, we are going to look for other rows with the same ParcelID but different UniqueID, that indicates that it is the same person, the same PropertyAddress.
UPDATE
a
SET
PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM
HousingData.dbo.NashvilleHousing AS a
JOIN HousingData.dbo.NashvilleHousing AS b
ON a.ParcelID = b.ParcelID
AND a.[UniqueID ] <> b.[UniqueID ]
WHERE
a.PropertyAddress is null
The PropertyAddress includes address and city, but it is more useful if they are in different columns: PropertySplitAddress and PropertySplitCity.
ALTER TABLE
HousingData.dbo.NashvilleHousing
ADD
OwnerSplitAddress Nvarchar(255),
OwnerSplitCity Nvarchar(255),
OwnerSplitState Nvarchar(255)
PropertySplitAddress Nvarchar(255),
PropertySplitCity Nvarchar(255)
UPDATE
HousingData.dbo.NashvilleHousing
SET
--It takes the part 3 of OwnerAddress string, then it assigns the substring to OwnerSplitAddress
OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
--It takes the part 2 of OwnerAddress string, then it assigns the substring to OwnerSplitAddress
OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
--It takes the part 1 of OwnerAddress string, then it assigns the substring to OwnerSplitAddress
OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1)
--It takes the part 2 of PropertyAddress string, then it assigns the substring to PropertySplitAddress
PropertySplitAddress = PARSENAME(REPLACE(PropertyAddress, ',', '.'), 2),
--It takes the part 1 of PropertyAddress string, then it assigns the substring to PropertySplitCity
PropertySplitCity = PARSENAME(REPLACE(PropertyAddress, ',', '.'), 1)
In SoldAsVacant column, some cells has 'Y' or 'N' instead of 'Yes' or 'No'. With the next query, the 'Y and 'N' are transformed to 'Yes' and 'No':
UPDATE
HousingData.dbo.NashvilleHousing
SET SoldAsVacant = CASE
WHEN
-- It looks for Y's in SoldAsVacant and then changes it for Yes
SoldAsVacant = 'Y' THEN 'Yes'
WHEN
-- It looks for N's in SoldAsVacant and then changes it for No
SoldAsVacant = 'N' THEN 'No'
ELSE
SoldAsVacant
END
A Common Table Expression (CTE) is used for removing duplicate rows. The ROW_NUMBER() function is used to generate the row numbers by partition. PARTITION BY function is used to create partitions of rows based on multiple columns: ParcelID, PropertyAddress, SalePrice, SaleDate, and LegalReference.
WITH RowNumCTE AS(
SELECT *,
ROW_NUMBER() OVER(
PARTITION BY
PropertyAddress,
ParcelID,
SalePrice,
SaleDate,
LegalReference
ORDER BY
UniqueID) AS row_num
FROM
HousingData.dbo.NashvilleHousing
)
DELETE
FROM
RowNumCTE
WHERE row_num > 1
The above query generates a new column 'row_num' with a '1' or a '2': the '1' indicates that it is only one row with that data, the '2' indicates that row is repeated. Then those rows are deleted.
In this last query, some columns that are not useful for further analysis are removed from the data table. These columns are: OwnerAddress, TaxDistrict, PropertyAddress, and SaleDate.
ALTER TABLE
HousingData.dbo.NashvilleHousing
DROP COLUMN
OwnerAddress,
TaxDistrict,
PropertyAddress,
SaleDate
SELECT *
FROM
HousingData.dbo.NashvilleHousing
This was a data cleansing focused project in SQL. Some of the things done include: cleaning and formatting dates, populating columns with missing data, splitting columns, remove duplicates and delete columns.
Some of the functions used include: ALTER TABLE - ADD - DROP COLUMN, DELETE, UPDATE - SET, JOIN, CASE - WHEN - THEN - ELSE, CTE, etc.
After all this data cleaning process, the data will be much easier to work with because it is now in a more suitable format.