Warning: High_Density of Bar plots; And, Apologies for it, I couldn’t get time to improve the visualization / aesthetics
library(tidyverse)
library(highcharter)
library(scales)
library(cowplot)
library(viridis)
library(wordcloud)
library(tidytext)
library(RColorBrewer)
library(ggrepel)
pal <- brewer.pal(9,"BuGn")
# loosely based on @kailex's code snippet
make_multi_automl_plot <- function(start = 'Q9_Part_1', end = "Q9_Part_8") {
survey_aml %>%
filter(!is.na(automl)) %>%
select(start:end,automl) %>%
pivot_longer(cols = start:end, values_to = "Edu") %>%
drop_na() %>%
select(-name) %>%
filter(Edu != "None") %>%
group_by(Edu, automl) %>%
summarize(freq = n()) %>%
ungroup() %>%
group_by(Edu) %>%
mutate(tot = sum(freq)) %>%
ungroup() %>% #View()
ggplot(aes(reorder_within(automl, -freq, Edu), y=freq, fill = factor(Edu))) +
geom_bar(stat = "identity") +
scale_x_reordered() +
facet_wrap(~ Edu, scales = "free") +
scale_fill_viridis(discrete = TRUE, alpha=0.7, option="D") +
theme_minimal() +
labs(x = "", y = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="none") +
geom_label_repel(aes(label = paste0(freq, " - ", round(freq/tot*100), "%")),
colour = "gray20",
size=2.5, vjust = -0.1,
nudge_y = 0.25,
direction = "y",
angle = 90,
segment.size = 0.2,
fill = "white"
)
}
make_single_automl_plot <- function(col = "Q9") {
survey_aml %>%
filter(!is.na(automl)) %>%
select(col,automl) %>%
# pivot_longer(cols = start:end, values_to = "Edu") %>%
#drop_na() %>%
#select(-name) %>%
rename(Edu = col) %>%
filter(Edu != "None") %>%
group_by(Edu, automl) %>%
summarize(freq = n()) %>%
ungroup() %>%
group_by(Edu) %>%
mutate(tot = sum(freq)) %>%
ungroup() %>% #View()
ggplot(aes(reorder_within(automl, -freq, Edu), y=freq, fill = factor(Edu))) +
geom_bar(stat = "identity") +
scale_x_reordered() +
facet_wrap(~ Edu, scales = "free") +
scale_fill_viridis(discrete = TRUE, alpha=0.7, option="D") +
theme_minimal() +
labs(x = "", y = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="none") +
geom_label_repel(aes(label = paste0(freq, " - ", round(freq/tot*100), "%"), group = factor(Edu)),
colour = "gray20",
size=2.5, vjust = -0.1,
nudge_y = 0.25,
direction = "y",
angle = 90,
segment.size = 0.2,
fill = "white"
)
}
Companies like Google, H2O, DataRobot are making huge investments with AutoML as their front cover. In fact, they managed to do well on Gartner’s magic quadrant and secure venture funding due the same fact because it sounds revolutionary.
But the real question here is, How receptive the Data science community has been in adopting these AutoML tools? Are they really being used in real life or just being a marketing material? These questions are quite hard to answer at any level.
This notebook tries to leverage a few questions in the Kaggle 2019 Survey to understand the who and what part of AutoML.
Considering AutoML itself a very small niche, I’ve attempted to carve out the niche from this Huge Survey.
Instead of treating this as a mere analysis on AUTOML behavior, this notebook can be extended with multiple practical purposes by the AutoML Providers like Google/H2O/Datarobot
Let’s get started with this bumpy ride of Bar charts!
We’re trying to create a flag
where if the respondent has answered any AutoML tool for Q33
(other than None
) then they are part of AutoML Niche (True
) and if not, they’re not (False
)
Only 9% of the entire survey respondents are part of AutoML Niche. While 26% explicitly mentioned they don’t use any AutoML, 64% of respondents didn’t care to answer this.
survey_aml %>%
count(automl) %>%
mutate(percent = round(n / sum(n),2)*100) %>%
mutate(automl = ifelse(is.na(automl),"I didn't care to respond",automl)) %>%
rename('automl_niche' ='automl' )
## # A tibble: 3 x 3
## automl_niche n percent
## <chr> <int> <dbl>
## 1 FALSE 5175 26
## 2 TRUE 1840 9
## 3 I didn't care to respond 12702 64
Thus going forward, we’ll use only the cases where they’re either part of (True
) of AutoML Niche or Not (False
) - leaving out the ones that didn’t respond any
If you have been watching this space, this plot actually reveals a very important aspect which is open-source - the leading solutions in this are primarily open source or like Google AutoML available as trial with a click of a button.
I think that’s a very important aspect for any enterpise to break into the ML developer ecosystem these days.
survey_aml %>%
filter(automl) %>%
select(starts_with("Q33")) %>%
pivot_longer(cols = 'Q33_Part_1':'Q33_Part_12', values_to = "AutoML") %>%
drop_na() %>%
select(-name) %>%
count(AutoML) %>%
hchart("bar",hcaes(x = AutoML, y = n, color = n))
## Warning: `parse_quosure()` is deprecated as of rlang 0.2.0.
## Please use `parse_quo()` instead.
## This warning is displayed once per session.
26% of male respondents are part of AutoML Niche while 25% of Female respondents are part of it. It doesn’t strike any significant difference in the type of gender and being AutoML enthusiast.
45% of The Youngest age group of respondents (18-21) has embraced AutoML while in no other age cohort other than 22-24 age bracket AutoML adoption is more than 30% .
In fact, the ideal age bracket 25 - 34 where the repondents are concentrated, AutoML Adoption is around 23-24%.
This fact can also be correlated that AutoML is definitely at its nascent stage but the younger ones are open to emrbace (probably putting their ego of data scientist aside)
survey_aml %>%
filter(!is.na(automl)) %>%
select(automl, Q3) %>%
count(Q3,automl) %>%
ungroup() %>%
pivot_wider(names_from = automl,
values_from = n) %>%
mutate(t2f_ratio = round(`TRUE`/`FALSE`,3),
total = `TRUE` + `FALSE`) %>%
arrange(desc(t2f_ratio)) -> survey_aml_country
Looking at the below table, it’s easier to say that there’s not a major Kaggle-dominating country in the top 10 countries where automl_adoption-to-no_automl ratio is high. I think, the point here is exactly the same. It’s not the established ML markets being open for the change but countries like Nigeria, Taiwan, Indonesia have been good at receiving this new wave of AutoML.
It’s also hidden in this data that 55% of Indian Respondents (who are also a strong Kaggle Dominating community) have been part of AutoML Niche
## # A tibble: 10 x 5
## Q3 `FALSE` `TRUE` t2f_ratio total
## <chr> <int> <int> <dbl> <int>
## 1 Tunisia 5 8 1.6 13
## 2 Morocco 11 12 1.09 23
## 3 Pakistan 28 27 0.964 55
## 4 Viet Nam 21 20 0.952 41
## 5 Indonesia 36 27 0.75 63
## 6 Egypt 18 13 0.722 31
## 7 Nigeria 65 42 0.646 107
## 8 Philippines 10 6 0.6 16
## 9 Taiwan 68 40 0.588 108
## 10 Kenya 23 13 0.565 36
x <- c("Country", "Total", "AutoML Yes-No Ratio")
y <- sprintf("{point.%s}", c("Q3", "total", "t2f_ratio"))
tltip <- tooltip_table(x, y)
survey_aml_country %>%
hchart("scatter", hcaes(x = total, y = t2f_ratio,
size = t2f_ratio,
color = total)) %>%
hc_xAxis(type = "logarithmic") %>%
hc_tooltip(useHTML = TRUE, headerFormat = "", pointFormat = tltip)
This again doesn’t provide much significant variability between different cohorts but what stands apart is the fact that organizations where they predominantly generate insights and also have got models in PROD for 2+ years the adoption of AutoML seems high.
The roles where they’re on working on improving SOTA models and also managing Data Infrastructure is where AutoML is used more.
As it’s obvious, The places where the money spent on cloud is lesser is also where the AutoML adoption is lesser.
Kagglers / Respondents who seem to be keeping themselves updated from Blogs such as Towards Data science and Traditional Publications are the least favored ones to use AutoML
HackerNews-following data scientists and Podcast listeners form the most in AutoML Niche Community.
Whatsoever reason it might be, 39% Data scientists who learn from Dataquest
have embraced AutoML. Guess, this must be news for Marketing Departments of these AutoML Providers that they can advertise on these platforms.
Data Scientists who learn from Coursera are the ones still seem to be least welcoming with AutoML, Quite old-school perhaps. Speaking of which, Data scientists who learn from traditional schools are also least receptive.
This is the most important part of this piece I’d say. This is also the place where ML Engineers love to live forever. The algorithms.
The below plot is quite evident in speaking where AutoML users predominatly come from and their interest areas. It’s Model Architectures, Hyperparameter tuning, ML Pipelines - all automated.
Almost ~44% of data scientists who use GANs (heavily) are the ones who’ve used AutoML. The same goes with other algorithms that require heavy lifting.
This plot is a nice validation to see the idea of the company - Google that pushes AutoML and also offers TPUs.
Almost 78% of those who never used TPUs also never used AutoMLs. Once again resonating the fact that AutoML being preferred by those who normally do heavylifting of algorithms than simply running Linear Models.
48% of Those who use Spark MLLib are part of the AutoML Niche while 72% of those who use Scikit-Learn don’t use AutoML. While it could have a lot of overlaps between different libraries.
This below plot carves out a nice picture of modern deep learning framework users form the better part of AutoML niche than conventional ML framework users.
Percentages might not give the real picture in this case. So, let’s see through the numbers. Cloud platforms like AWS, GCP, Azure are market leaders. But with this it seems AWS users are the ones not very keen on AutoML while other Cloud Platform users seem to adopt AutoML Tools and solutions.
# Hosted Notebooks
Approx 33% of the biggest Hosted Notebook provider (Kaggle & Google Colab) users tend to be part of AutoML niche
While other providers may not be as big as the top 2, it’s a clear sign that those who are ready to use hosted network at least more than a quarter of them have already adopted AutoML
What comes as surprise is that % of R users who use AutoML is 4 percent pt. more than Python numbers. While it could be attributed to the very large base number of Python uses in this survey, it’d be an interesting exercise to investigate if it’s actually the natural tendency of R users or the limitations of R itself make R programmers embrace AutoML than their Python counterparts
In fact it also seems to be the case of programmers who code in Java, Matlab, C++ and C
Perhaps, the verastility of Python isn’t letting Python developers to explore less-geeky territories? Needs a Randomized experiment to answer that ;)
All these below bars indicate an important part which is ecosystem - like, If you’re part of GCP already, you’re more likely to embrace their AutoML tool because you’ve already entered the system.
As a result of this analysis, we could manage to conclude a few personas that prefer AutoML
Young Audience who’s willing to try out new things including new technology that can challenge their egos
ML Engineers who work on powerful heavy lifting tasks
Cloud-Friendly Data scientists who have already emrbaced some cloud provider especially GCP/Azure
Data scientists who leverage Hosted Notebooks (kinda Cloud)
Data Scientits whose role involve having Model in Production and also been doing it for quite some time
ML Engineers working on improving SOTA models
Usage of Clustering Techniques for Persona Creation
Creating this as a supervised problem to identify indicators that’s driving positive tendency towards AutoML
Create an AutoML Niche survey in the developing countries as identified above (where we’ve high Automl-to-No_AutoML ratio) to take those learnings and forward it elswhere
Create a randomized experiment with the two set of audience we created to establish causality and other driving behaviors
Thanks to @Kailex’s Notebook (https://www.kaggle.com/kailex/education-languages-and-salary) that helped me get started quickly. Few base-codes used here are loosely borrowed from there, Thanks!