Warning: high density of bar plots ahead. Apologies for that; I couldn’t find time to improve the visualization and aesthetics.

Start

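# Loosely based on @kailex's code snippet.
library(tidyverse)  # dplyr, tidyr, ggplot2
library(tidytext)   # reorder_within(), scale_x_reordered()
library(viridis)    # scale_fill_viridis()
library(ggrepel)    # geom_label_repel()

# Faceted bar chart of AutoML use for a multi-select question whose answer
# options are spread over the columns start..end (one facet per option).
make_multi_automl_plot <- function(start = "Q9_Part_1", end = "Q9_Part_8") {

  survey_aml %>%
    filter(!is.na(automl)) %>%
    # all_of() makes the lookup of the column names held in start/end explicit
    select(all_of(start):all_of(end), automl) %>%
    pivot_longer(cols = -automl, values_to = "Edu") %>%
    drop_na() %>%
    select(-name) %>%
    filter(Edu != "None") %>%
    # freq = respondents per (option, automl) pair; tot = respondents per option
    count(Edu, automl, name = "freq") %>%
    group_by(Edu) %>%
    mutate(tot = sum(freq)) %>%
    ungroup() %>%
    ggplot(aes(reorder_within(automl, -freq, Edu), y = freq, fill = factor(Edu))) +
    geom_bar(stat = "identity") +
    scale_x_reordered() +
    facet_wrap(~ Edu, scales = "free") +
    scale_fill_viridis(discrete = TRUE, alpha = 0.7, option = "D") +
    theme_minimal() +
    labs(x = "", y = "") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
    # label each bar with its count and its share within the facet
    geom_label_repel(
      aes(label = paste0(freq, " - ", round(freq / tot * 100), "%")),
      colour       = "gray20",
      size         = 2.5,
      vjust        = -0.1,
      nudge_y      = 0.25,
      direction    = "y",
      angle        = 90,
      segment.size = 0.2,
      fill         = "white"
    )
}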
  
# Same chart for a single-answer question stored in one column.
make_single_automl_plot <- function(col = "Q9") {

  survey_aml %>%
    filter(!is.na(automl)) %>%
    # all_of() makes the lookup of the column name held in col explicit
    select(all_of(col), automl) %>%
    rename(Edu = all_of(col)) %>%
    filter(Edu != "None") %>%  # also drops rows where the answer is NA
    # freq = respondents per (answer, automl) pair; tot = respondents per answer
    count(Edu, automl, name = "freq") %>%
    group_by(Edu) %>%
    mutate(tot = sum(freq)) %>%
    ungroup() %>%
    ggplot(aes(reorder_within(automl, -freq, Edu), y = freq, fill = factor(Edu))) +
    geom_bar(stat = "identity") +
    scale_x_reordered() +
    facet_wrap(~ Edu, scales = "free") +
    scale_fill_viridis(discrete = TRUE, alpha = 0.7, option = "D") +
    theme_minimal() +
    labs(x = "", y = "") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
    # label each bar with its count and its share within the facet
    geom_label_repel(
      aes(label = paste0(freq, " - ", round(freq / tot * 100), "%"), group = factor(Edu)),
      colour       = "gray20",
      size         = 2.5,
      vjust        = -0.1,
      nudge_y      = 0.25,
      direction    = "y",
      angle        = 90,
      segment.size = 0.2,
      fill         = "white"
    )
}

Objective

Companies like Google, H2O and DataRobot are investing heavily in AutoML and putting it front and center. They have done well on Gartner’s Magic Quadrant and secured venture funding on the strength of it, because it sounds revolutionary.

But the real question is: how receptive has the data science community been to these AutoML tools? Are they really being used in real life, or are they just marketing material? These questions are hard to answer at any level.

This notebook tries to leverage a few questions in the Kaggle 2019 Survey to understand the who and what part of AutoML.

Since AutoML itself is a very small niche, I’ve attempted to carve that niche out of this huge survey.

Implications/Applications of this Kaggle Notebook

Rather than being treated as a mere analysis of AutoML behavior, this notebook can be extended for several practical purposes by AutoML providers like Google, H2O and DataRobot.

Let’s get started with this bumpy ride of Bar charts!

Identifying AutoML Niche

We create a flag: if a respondent selected any AutoML tool for Q33 (other than None), they are part of the AutoML niche (True); if they answered only None, they are not (False).
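
The flagging rule can be sketched as follows. This is a minimal illustration in base R so that it runs without the survey file; the toy data frame, the `Q33_Part_*` column names and the tool labels are assumptions, not the notebook’s actual data or code.

```r
# Toy stand-in for the survey: one column per Q33 answer option (assumed layout).
survey <- data.frame(
  Q33_Part_1 = c("Google AutoML", NA, NA, NA),
  Q33_Part_2 = c(NA, "H2O Driverless AI", NA, NA),
  Q33_Part_3 = c(NA, NA, "None", NA),
  stringsAsFactors = FALSE
)

# Keep only the Q33 columns, as a character matrix.
q33 <- as.matrix(survey[, grep("^Q33_Part_", names(survey)), drop = FALSE])

# A respondent answered Q33 if at least one part is filled in.
answered <- rowSums(!is.na(q33)) > 0

# A respondent uses AutoML if any filled-in part is something other than "None".
uses_tool <- rowSums(!is.na(q33) & q33 != "None") > 0

# TRUE = in the niche, FALSE = answered only "None", NA = did not answer.
automl_niche <- ifelse(answered, uses_tool, NA)
print(automl_niche)
```

Keeping non-respondents as `NA` rather than `FALSE` keeps the three groups apart, so the non-respondents can be dropped later with a simple `!is.na()` filter.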

AutoML Niche Stats

Only 9% of all survey respondents are part of the AutoML niche. While 26% explicitly said they don’t use any AutoML tool, 64% of respondents didn’t answer the question at all.

## # A tibble: 3 x 3
##   automl_niche                 n percent
##   <chr>                    <int>   <dbl>
## 1 FALSE                     5175      26
## 2 TRUE                      1840       9
## 3 I didn't care to respond 12702      64
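
As a quick check, the percent column can be re-derived from the counts in the table above. The sketch below uses base R and only those three counts; the shortened label "No response" is mine.

```r
# Counts copied from the table above.
counts <- c("FALSE" = 5175, "TRUE" = 1840, "No response" = 12702)

# Share of each group among all respondents, rounded to whole percent.
percent <- round(100 * counts / sum(counts))
print(percent)

# Share of the niche among only those who answered the question.
answered <- counts[c("FALSE", "TRUE")]
niche_share <- round(100 * answered["TRUE"] / sum(answered))
print(niche_share)
```

Among respondents who did answer, roughly 26% fall in the niche. That is the overall baseline the per-group percentages in the following sections can be read against.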

Going forward, we’ll use only the respondents who are either part of the AutoML niche (True) or not (False), leaving out those who didn’t respond at all.

What are these AutoML Tools actually?

If you have been watching this space, this plot reveals a very important aspect: open source. The leading solutions are primarily open source or, like Google AutoML, available as a trial at the click of a button.

I think that’s a very important factor for any enterprise trying to break into the ML developer ecosystem these days.

Gender - Nothing Significant

26% of male respondents are part of the AutoML niche, against 25% of female respondents. Gender makes no meaningful difference to being an AutoML enthusiast.

Age - The Youngest ones embrace

Country Index - The Emerging ones emerge together

Looking at the table below, it’s easy to see that none of the major Kaggle-dominating countries is among the top 10 by automl_adoption-to-no_automl ratio. I think that is exactly the point: it’s not the established ML markets that are open to the change, but countries like Nigeria, Taiwan and Indonesia that have been receptive to this new wave of AutoML.

Also hidden in this data: 55% of Indian respondents (a strong, Kaggle-dominating community) are part of the AutoML niche.

## # A tibble: 10 x 5
##    Q3          `FALSE` `TRUE` t2f_ratio total
##    <chr>         <int>  <int>     <dbl> <int>
##  1 Tunisia           5      8     1.6      13
##  2 Morocco          11     12     1.09     23
##  3 Pakistan         28     27     0.964    55
##  4 Viet Nam         21     20     0.952    41
##  5 Indonesia        36     27     0.75     63
##  6 Egypt            18     13     0.722    31
##  7 Nigeria          65     42     0.646   107
##  8 Philippines      10      6     0.6      16
##  9 Taiwan           68     40     0.588   108
## 10 Kenya            23     13     0.565    36
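
The `t2f_ratio` column is simply the TRUE count divided by the FALSE count. A minimal base R sketch using five rows copied from the table above; the column names `no_automl`, `automl` and `automl_share` are renamed or added here for illustration.

```r
# Five rows copied from the table above (FALSE -> no_automl, TRUE -> automl).
country <- data.frame(
  Q3        = c("Tunisia", "Morocco", "Pakistan", "Nigeria", "Taiwan"),
  no_automl = c(5, 11, 28, 65, 68),
  automl    = c(8, 12, 27, 42, 40)
)

# Ratio of AutoML users to non-users, to three significant digits.
country$t2f_ratio <- signif(country$automl / country$no_automl, 3)

# Respondents per country, and the niche share in whole percent.
country$total <- country$automl + country$no_automl
country$automl_share <- round(100 * country$automl / country$total)

# Highest ratio first, as in the table.
print(country[order(-country$t2f_ratio), ])
```

Showing `total` next to the ratio matters: Tunisia tops the list with only 13 respondents, so its ratio would shift noticeably if one or two answers changed.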

Incorporate machine learning methods into their business?

Again, there isn’t much variability between cohorts, but one thing stands out: AutoML adoption seems high in organizations that predominantly generate insights, and in those that have had models in production for 2+ years.

Role at Work

AutoML is used more in roles that involve improving SOTA models and managing data infrastructure.

Money for Cloud

Unsurprisingly, where less money is spent on cloud, AutoML adoption is also lower.

Favorite media sources that report on data science topics

Learning Sources

ML Tools - Model Arch, ML Pipelines, Hyperparams

This is the most important part of this piece I’d say. This is also the place where ML Engineers love to live forever. The algorithms.

The plot below makes it evident where AutoML users predominantly come from and what their interests are: model architectures, hyperparameter tuning and ML pipelines, all automated.

Algorithms? GANs, Evolutionary Approaches, Transformers

About 44% of data scientists who use GANs (heavily) have also used AutoML. The same goes for other algorithms that require heavy lifting.

TPUs ~ AutoML Niche

This plot nicely validates the strategy of Google, the company that both pushes AutoML and offers TPUs.

Almost 78% of those who never used TPUs also never used AutoML. This again suggests that AutoML is preferred by those who normally do the heavy lifting with algorithms rather than simply running linear models.

Spark ~ AutoML

48% of those who use Spark MLlib are part of the AutoML niche, while 72% of those who use Scikit-learn don’t use AutoML, although there could be a lot of overlap between users of different libraries.

The plot below paints a clear picture: users of modern deep learning frameworks make up a larger part of the AutoML niche than users of conventional ML frameworks.

Cloud ~ AutoML

Percentages might not give the real picture in this case, so let’s look at the raw numbers. Cloud platforms like AWS, GCP and Azure are the market leaders. Yet AWS users seem not very keen on AutoML, while users of other cloud platforms seem to adopt AutoML tools and solutions.

Hosted Notebooks

Language Preferences

What comes as a surprise is that the share of R users who use AutoML is 4 percentage points higher than that of Python users. While this could be attributed to the very large base of Python users in this survey, it’d be an interesting exercise to investigate whether it’s a natural tendency of R users or whether the limitations of R itself make R programmers embrace AutoML more than their Python counterparts.

In fact, the same seems to hold for programmers who code in Java, Matlab, C++ and C.

Perhaps the versatility of Python keeps Python developers from exploring less-geeky territories? It needs a randomized experiment to answer that ;)

Few more Bars!

All the bars below point to one important factor, the ecosystem: if you’re already on GCP, you’re more likely to embrace its AutoML tool because you’ve already entered the system.

FIN

From this analysis, we can draw out a few personas that prefer AutoML

Way Forward from here - Future Developments

Credit

Thanks to @Kailex’s notebook (https://www.kaggle.com/kailex/education-languages-and-salary), which helped me get started quickly. Some of the base code used here is loosely borrowed from there. Thanks!