📊 Analyzing the Rise and Fall of Programming Languages with Stack Overflow Data in R
Proud to share my recent project where I delved into a decade's worth of Stack Overflow questions to discern the popularity trends of programming languages. Here's a brief glimpse:
🔹 Objective: Determine which programming languages are gaining traction and which ones are waning in terms of usage and popularity.
🔹 Data Source: Stack Overflow's open data from the Stack Exchange Data Explorer, comprising over 16M questions.
🔹 Key Findings:
R and Python are on a steady rise, with Python especially showing a significant upward trend.
JavaScript remains a popular choice, but traditional heavyweights like Java and C# are seeing a decline.
Tags related to newer tools and technologies like Angular and Node.js are seeing increased activity, indicating their growing relevance in the developer community.
🔹 Visualization Tools: Leveraged R's ggplot2 for comprehensive visualizations, helping in clear trend identification.
Here is the Project
How can we tell what programming languages and technologies are used by the most people? How about what languages are growing and which are shrinking, so that we can tell which are most worth investing time in?
One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We’re going to use open data from the Stack Exchange Data Explorer to examine how the relative popularity of languages like R, Python, Java and JavaScript has changed over time.
Each Stack Overflow question has a tag, which marks a question to describe its topic or technology. For instance, there’s a tag for languages like R or Python, and for packages like ggplot2 or pandas.
Our data has 4 columns and about 40k rows. Each row holds a year, a tag, the number of questions with that tag in that year, and the total number of questions in that year.
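To make the layout concrete, here is a minimal sketch of a data frame with the same four columns. The name `toy` is hypothetical, and its rows are copied from outputs shown later in this post; the real data set is much larger. The last two lines are an assumed sanity check that each year carries a single `year_total` value.

```r
# Toy stand-in for the real data (hypothetical name); values copied from this post's outputs.
toy <- data.frame(
  year       = c(2008, 2008, 2009, 2010),
  tag        = c(".net", "r", "r", "r"),
  number     = c(5910, 8, 524, 2270),
  year_total = c(58390, 58390, 343868, 694391)
)
str(toy)

# Count the distinct year_total values within each year; each count should be 1.
totals_per_year <- tapply(toy$year_total, toy$year, function(x) length(unique(x)))
all(totals_per_year == 1)
```

Running the same check on the full data would flag any year whose total was recorded inconsistently.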
unique(data$year)
## [1] 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
length(unique(data$tag))
## [1] 4080
So the data covers 2008 to 2018 (11 years) and contains 4080 unique tags.
Next, we add a new column that gives each tag's share of all questions asked in that year. Despite the name number_percentage, it is a fraction between 0 and 1, not a percentage.
library(dplyr)
data = mutate(data, number_percentage = (number / year_total))
head(data)
## year tag number year_total number_percentage
## 1 2008 .htaccess 54 58390 9.248159e-04
## 2 2008 .net 5910 58390 1.012160e-01
## 3 2008 .net-2.0 289 58390 4.949478e-03
## 4 2008 .net-3.5 319 58390 5.463264e-03
## 5 2008 .net-4.0 6 58390 1.027573e-04
## 6 2008 .net-assembly 3 58390 5.137866e-05
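As a quick check of the arithmetic, the share can be recomputed by hand for two of the rows above. This is only a sketch in base R; the names `number`, `share` and `share_pct` are illustrative, and the conversion to percent is an added convenience, not part of the original analysis.

```r
# Question counts for two 2008 tags and the 2008 total, copied from the output above.
number     <- c(".htaccess" = 54, ".net" = 5910)
year_total <- 58390

# Share of all 2008 questions (a fraction between 0 and 1).
share <- number / year_total
share

# The same values as percentages, rounded for readability.
share_pct <- round(100 * share, 2)
share_pct
```

The `.net` tag comes out at roughly 10% of all 2008 questions, matching the 1.012160e-01 in the table.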
R_over_time = filter(data, tag == "r")
head(R_over_time)
## year tag number year_total number_percentage
## 1 2008 r 8 58390 0.0001370098
## 2 2009 r 524 343868 0.0015238405
## 3 2010 r 2270 694391 0.0032690516
## 4 2011 r 5845 1200551 0.0048685978
## 5 2012 r 12221 1645404 0.0074273552
## 6 2013 r 22329 2060473 0.0108368321
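The trend in these six rows can also be checked numerically. The sketch below copies the 2008 to 2013 shares from the output above into a vector (`r_share` is an illustrative name) and uses `diff()` to test whether every year-over-year change is positive. It covers only the years printed here, not the full series.

```r
# Share of questions tagged "r" for 2008-2013, copied from the output above.
r_share <- c("2008" = 0.0001370098, "2009" = 0.0015238405, "2010" = 0.0032690516,
             "2011" = 0.0048685978, "2012" = 0.0074273552, "2013" = 0.0108368321)

# Year-over-year differences; a positive value means the share grew that year.
yearly_change <- diff(r_share)
yearly_change

# TRUE only if the share rose in every one of these years.
always_rising <- all(yearly_change > 0)
always_rising
```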
As we can see, the share of questions tagged R is increasing rapidly over the years, but let’s visualize it to have a closer look.
library(ggplot2)
ggplot(R_over_time, aes(x = year, y = number_percentage)) + geom_line(color = "blue")
As we can see, R's share of questions increases continuously over the years, which suggests it is worth learning and practicing. Let’s look at the ggplot2 and dplyr tags too.
s_tags = c("dplyr", "r", "ggplot2")
s_over_time = filter(data, tag %in% s_tags)
ggplot(s_over_time, aes(x = year, y = number_percentage, color = tag)) + geom_line()
As we can see, ggplot2 and dplyr are also growing, but they have far fewer questions than R itself.
sorted_tags = arrange(summarise(group_by(data, tag), item_total = sum(number)), desc(item_total))
head(sorted_tags)
## # A tibble: 6 × 2
## tag item_total
## <chr> <int>
## 1 javascript 1632049
## 2 java 1425961
## 3 c# 1217450
## 4 php 1204291
## 5 android 1110261
## 6 python 970768
So JavaScript, Java and C# have the most questions over the whole period.
s_tags = c("javascript","java","c#","php","android","python")
s_over_time = filter(data, tag %in% s_tags)
ggplot(s_over_time, aes(x = year, y = number_percentage, color = tag)) + geom_line() + coord_cartesian(xlim = c(2008, 2018), ylim = c(0, 0.1))
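As we can see, JavaScript and Python take a growing share of questions over the years, while C#, PHP and Java lose share, which suggests (but does not prove) that their usage is declining. Android is a less clear case: its net change from 2008 to 2018 is positive (see the table of biggest changes below), so any decline applies only to the later years.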
s_tags = c("r","python","powerbi","excel")
s_over_time = filter(data, tag %in% s_tags)
ggplot(s_over_time, aes(x = year, y = number_percentage, color = tag)) + geom_line()
As we can see, both R and Python are growing, but Python has a larger share of questions, which suggests it is more widely used.
s_tags = c("r","python")
s_over_time = filter(data, tag %in% s_tags)
ggplot(s_over_time, aes(x = year, y = number_percentage, fill = tag)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  theme_minimal()
Biggest Changes over the years
library(viridis)
## Warning: package 'viridis' was built under R version 4.2.3
## Loading required package: viridisLite
increases = data %>%
  group_by(tag) %>%
  summarize(change = number_percentage[which.max(year)] - number_percentage[which.min(year)])
head(arrange(increases, desc(change)))
## # A tibble: 6 × 2
## tag change
## <chr> <dbl>
## 1 python 0.0633
## 2 javascript 0.0598
## 3 android 0.0597
## 4 angular 0.0283
## 5 r 0.0265
## 6 node.js 0.0261
tail(arrange(increases,desc(change)))
## # A tibble: 6 × 2
## tag change
## <chr> <dbl>
## 1 windows -0.0206
## 2 sql-server -0.0213
## 3 c++ -0.0267
## 4 asp.net -0.0555
## 5 c# -0.0733
## 6 .net -0.0929
As we can see, python has the largest increase in share over the period, while .net, c# and asp.net have the largest decreases.
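The change column is each tag's share in its latest year minus its share in its earliest year, so a tag that first appears after 2008 is measured from its own first year. The sketch below applies the same `which.max(year)` and `which.min(year)` logic in base R to the six rows for the r tag printed earlier (2008 to 2013 only, so the result differs from the 2008 to 2018 figure in the table). The names `r_rows`, `first_share`, `last_share` and `change` are illustrative, and the rows are shuffled on purpose to show that row order does not matter.

```r
# Rows for the "r" tag, 2008-2013, copied from the earlier output and deliberately shuffled.
r_rows <- data.frame(
  year = c(2011, 2008, 2013, 2009, 2012, 2010),
  number_percentage = c(0.0048685978, 0.0001370098, 0.0108368321,
                        0.0015238405, 0.0074273552, 0.0032690516)
)

# which.min / which.max return the row position of the earliest / latest year.
first_share <- r_rows$number_percentage[which.min(r_rows$year)]  # share in 2008
last_share  <- r_rows$number_percentage[which.max(r_rows$year)]  # share in 2013

# Net change in share between the first and last year present.
change <- last_share - first_share
change
```

Because only the two endpoints enter the calculation, a tag that rose and then fell can still show a positive change.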
increased = c("python", "javascript", "android", "angular", "r", "node.js")
ggplot(filter(increases, tag %in% increased), aes(x = reorder(tag, change), y = change, fill = tag)) + geom_col() + theme_minimal() + scale_fill_manual(values = viridis(6))