__STYLES__
Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed. In this project, I have done a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. I will first perform some basic data cleaning and then look for insights in the data to devise strategies for driving growth and retention.
The dataset contains web scraped data of more than 10k Play Store apps and can be found here. Dataset is contained in a single table named GPlaystore and each record in the dataset consists of 13 columns, which are used to describe a given app. The columns are quite self explanatory and are namely the app name, Category, User Rating, Number of Reviews, Size in MB, Number of Installs, Type (Free or Paid), Price(if applicable), Content Rating (Suitability for various Audience), Genre, Last Updated Date, Current Version of the app and Android Version it supports.
Data cleaning is one of the most essential subtask for any data science project. Although it can be a very tedious process, it's worth should never be undermined. The main approach taken by me for data cleaning in this project is to inspect every column for it's null and unique values and accordingly identify potential problems and then handle them. So, below I have done some inspection and thus cleaning for any column, if required.
There are no null or inconsistent values
Checked distinct values for category column Most categories are in the form of words separated by an _(underscore). In this case, it is clear from the distinct values that 1.9 is a wrong input for this column as a category being 1.9 does not make any sense. But if there were a lot of distinct categories, then this manual search would become tedious and also susceptible to error. So, I use pattern matching to identify inconsistent data, which is both accurate and efficient. Based on a simple lookup on google play store, the category of this record is changed to LIFESTYLE.
This column is stored in PostgreSQL using numeric data type constraint, so there is no need to check for any non-numeric values like strings. However, it is wise to check for ratings that are not-a-number or go beyond 5.There are 1474 records with NaN rating and 1 record with 19 rating. So, I replace all NaN with null and search the actual rating of the app with 19 rating on play store and update it using following queries.
It is already stored as a numeric column. So just checked for null, NaN and negative values, and no issues found.
Size column is represented using string as it contains data like 1.2M (1.2 MB), 8.5k (8.5kB) and ‘varies with device’ (the storage memory varies from device to device). So, I execute the following query to check inconsistent data and get only one problematic record to be treated. So, I look up the size of this app on Play store and update it with the below query.
The install values are present like 500+, 5,00,000+ etc. So I check the distinct values present in the dataset, and there is one value ‘Free’ that is inconsistent. This row has been problematic in previous data cleaning steps too! A closer look at the other values in the row shows that the whole row contains inconsistent data. The remaining part of the row is shown below. Since this is only one row, there is no point in spending time cleaning all the values in this row, as it will keep getting highlighted again and again as I proceed with data cleaning. So, I decide to drop this record altogether.
It contains only three distinct values, Paid, Free and NaN. I search for the record that has type NaN and replace it with free as per information from Google Play Store.
All the prices are either 0 or in the format of $4.5, so there are no inconsistent values. This column is currently of character data type, so I have removed $ before the values so that I can later use these values for numerical operations by casting them into numeric data type.
This column does not contain any inconsistent or invalid values. Below are the distinct values in this column.
The genres column does not contain any inconsistent data.
This column contains date in string format as January 7, 2016. So, I have converted it into the standard date format. Finally, I have altered the column data type price to Numeric from Char and Last Updated to Date from Char.
I could not change these data types before, because then PostgreSQL would not have imported data in to the table from external CSV file. In other words, if I had given numeric data type to Price column when I was creating the table in PostgreSQL, it would have not taken any value because it had $ in front of the digits. So I imported the data, keeping Price column as character type, cleaned it and then changed the data type to numeric. Similar operation is done for Last Updated Column.
Now the data is finally ready for analysis. So, in the following section I have extracted some valuable insights from the data using SQL queries.
Insights: Education apps hold the baton with 4.4 average rating and the highest number of reviews which make up this average value. However, before making this claim, I need to drill down further and investigate the genres that come under Education, to see if there a particular genre causing highest value for average rating.
Furthermore, Gaming apps also seem to represent equally popular category as it has an average rating of 4.3 with the highest number of total reviews of all the categories (I separately executed a query to verify this). Social and shopping apps also represent their importance, given an average rating of 4.3 and high number of reviews.
Overall, only looking at average rating is not going to be much insightful, because the total number of reviews that make up rating is equally important. Education tops my list because in 39595786 reviews, its minimum rating is not below 3.5.
So, as a result, I will now drill down on Education and Game apps to uncover more hidden insights.
Insights: As is clear from the above table, under Education category, apps specifically built for Education take 98% of the reviews which cause Education apps to have the highest rating. Other Education apps built for Brain Games, Pretend Play, Music, Video, Creativity, Action and Adventure have negligible contributions to the grand average as compared to these two genres.
As per the last output, if a company want to launch a new app under Education Category, then that app should specifically target education genre. But the above analysis is not enough to arrive at this conclusion.
Every company runs on a business model, making revenue one the most important things to be considered before making any decisions.
Calculating revenue is simple, as I need to multiply the Price by number of installs. **But in this dataset, the number of installs is not an exact number, but something of the form 10000+. **So, I really have no option to get the exact number of installs for any app. To just get an overall idea, I have decided to drop the + in every column and just take the base value as number of installs for getting an estimate of the revenue. Insights: Based upon an estimated analysis, Pretend Play genre has highest rating and that too under the paid apps version. It has much higher installs as compared to Paid Education genre apps. Education genre has the highest installs under its free versions and that accounts for almost all the reviews and rating this it gets.
Actionable Recommendation: Pretend Play is the genre to go for someone looking to introduce a new app under Education category with an aim to generate high revenues. For example, private firms, startups etc.
If an app is to be launched as a part of an initiative to promote free learning, then Education specific apps are recommended under Education category. For example, by a government program.
This query includes RANK function also, because there a lot of records under GAME category, so ranking them based on total revenue gives us a kind of easy to spot value when looking for highest, lowest or any other ranked revenue. Insights: The data itself answers the question that Action Genre is the way to go for new apps under GAME category. It has the highest number of installs and reviews under all paid genres and generates the highest revenue of all.
If one thinks about the analysis done above in steps 2 and 3, the main factor that shapes actionable insights is the total revenue that is being generated by paid versions of different category apps. So, it is time to drill it down further.
Insights: Based on the overall analysis, Arcade; Action & Adventure generates the highest revenue under FAMILY category of all the categories for its paid apps. Similarly, one can check about other categories also.
But still, one cannot be confident regarding these recommendations because this data does not have the exact number of installs for every app, just an estimate of that has been used for this analysis. As the next step, I have tried to analyze the data from a different angle to gain other set of possible insights.
Insights: Apps having content for everyone account for about 80% of the apps available on Google Play Store. And overall, in all the Content Ratings, as obvious free apps hold the major chunk of the released products.
To further study the impact of Content Rating on app business, I have modified the previous query to also include content rating in the analysis and see if the revenue generating app genres and categories also fall inline with the above analysis. Just added the content rating column in the query, rest everything is same as previous query. Insights: The above table presents a much clearer picture regarding Content Rating. Most top revenue apps are the ones that are for everyone. But a new entrant in app market need to refer the above analysis if it is aspiring to launch something under GAME or DATING category. All other category apps shall be the ones focused on content for everyone.
So, some final insights that can be derived from the above analysis are:
Last but not the least, there are a lot of potential insights which can still be uncovered from this dataset. In case anyone reading this project has any suggestion for me, please feel free to reach out and I will be more than happy to discuss and improve on my project.
Thanks a lot for taking time to look at my project and being a patient audience.