1 About the Project

We are members of the MBA Class of 2018 from the University of Illinois Gies College of Business. This is our capstone project to complete our concentration in Business Data Analytics. We were excited about this project not only because we wanted to learn more about an exciting topic, crowdfunding, but also because we knew it would be the perfect blend of business strategy and data analytics. We set out to answer two primary questions:

Can we use machine learning to predict which Kickstarter projects will be funded?

Can we discover any interesting insights that could help future creators bring their projects to life?

1.1 Tech Stack

While most academic projects come pre-fabricated with existing data and clear direction, this one did not. We did it from scratch! Here is our end-to-end tech stack:

  • Collaborative programming: GitHub, RStudio, Atom & Z shell
  • Web Scraping: Python, Beautiful Soup
  • HTML Parsing: Python, RegEx
  • Database Management: SQLite, SQL
  • Exploratory Analysis: R
  • Visualizations: ggplot2
  • Machine Learning: R
  • Final Report: R Markdown

1.2 Data Collection and Cleaning

We collected 30GB of raw data over several months from Kickstarter using our custom Python web-scraping and parsing program. These data included 50,596 projects and 120 variables pertaining to the projects and their creators, from the launch of Kickstarter in 2009 through December 2013.

Of these projects, possible outcome states included failed, successful, suspended, canceled, and purged. We excluded 1,246 projects with suspended, canceled, or purged outcome states. We excluded four additional projects whose funding state is inaccurate in terms of the amount_pledged (e.g., state is listed as failed when the amount_pledged exceeded the goal). The final dataset for analysis contained 49,350 projects.

Of the original 120 variables, 41 contained meaningful information for analyses. Selecting from and transforming these, along with additional computations, resulted in 29 variables used in subsequent exploratory analysis and machine learning. These variables are described in the Data Dictionary below.

1.3 Crowdfunding

The phenomenon of crowdfunding, an alternative financing approach, involves raising funds for new business ventures via small amounts of capital from a large number of individuals. Crowdfunding is a relatively new phenomenon enabled by wide access to social media and internet-based financial technology services (Fintech). It makes obtaining funding more accessible for entrepreneurs and small businesses, as compared to traditional banking and lending services.

Little academic research has been conducted on crowdfunding, and there are many interesting areas for investigation. From a financial perspective, it is disrupting the small- and medium- enterprise (SME) lending market. Economically, it may be changing the prevalence and makeup of SMEs. In terms of marketing, it gives consumers a greater say in the products they would like to see available but also exposes them to increased risk. Regarding information and technology, it is enabling innovation through a public platform.

2 Key Findings

2.1 Machine Learning Predictions

We set out on this project to answer one question, “Can we use machine learning to predict which Kickstarter projects will get funded?” As it turns out, we can!

Using relatively simple machine learning techniques and feature engineering, we raised our baseline accuracy from 56% to 70% (a 25% improvement)!

First, it is essential to establish a baseline metric for success. Given that projects in our collected data were funded at a rate of 56%, the simplest possible model would predict that every project gets funded and be correct 56% of the time. For our predictions to have any value, we must beat this benchmark. During our exploratory analysis, we discovered a variable with exciting implications: staff_pick. As machine learning applications become more widely used and increase in efficacy, the benchmark is often, “Can it do better than a human?” staff_pick serves as a proxy for the best human judgment has to offer, with 84% of picked projects receiving funding (see Kickstarter’s “Projects We Love”).
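The summary below can be reproduced with a simple dplyr pipeline along these lines (a sketch; df_engr is our cleaned data frame, and the column names match the output):

```r
library(dplyr)

# Funding rate for staff picks vs. everything else
df_engr %>%
  group_by(staff_pick) %>%
  summarise(count = n(), avg_funded = mean(funded))
```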

## # A tibble: 2 x 3
##   staff_pick count avg_funded
##        <dbl> <int>      <dbl>
## 1          0 44081      0.524
## 2          1  5269      0.842

Kickstarter staff clearly have an eye for promising projects. However, predicting project success anywhere near this rate is likely impossible. Projects fortunate enough to be a staff_pick get prominently featured on the website, in newsletters, and on blogs. We do not doubt that such promotion materially affects the chance a project is funded.

Scoping our target prediction accuracy range to 56-84%, we began constructing models. We built LASSO regression and decision tree models. Incorporating variables about the campaign, such as goal, rewards, category, and usa, along with information from the creator’s profile, such as social_media_count, both models achieved an accuracy rate of ~70%, right in the middle of our target range.


2.2 percent_funded

This histogram of percent_funded was one of the most interesting we saw during the project. Notice that almost every project that achieved ~75% of its goal made it to, or well past, 100% (we found an outlier at 4 million percent). Virtually no projects fall just short of their goal!

This generates significant insight into the workings of Kickstarter’s business model. Each project has two key financial stakeholders, the project creator and Kickstarter itself (which collects a 5% fee on funded projects), each willing to pull whatever levers it can to avoid the worst-case scenario: a 99%-funded project. You can imagine an almost-funded creator calling all their family and friends, or even opening their own wallet, to get a project across the finish line. Kickstarter will unleash all its marketing power through its almost-funded page, newsletters, and emails to make the goal.

This insight should have a direct impact on a creator’s goal-setting strategy. While we have seen that higher goals have lower funding rates, we actually encourage creators to set aggressive goals for reasonable absolute sums. For example, if a creator believes their project needs $1,000 in funding and can most likely procure the necessary backers, they should strongly consider raising the goal to $1,200-$1,300. In so doing, they would allow regular backer support to raise the first $1,000 and then let their business partner (Kickstarter) aggressively market the project to raise the additional $200+, thereby covering the 5% fee and more.
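To make the arithmetic concrete, here is a quick sketch (the 5% fee is Kickstarter’s, per the text; the dollar figures are the hypothetical ones above):

```r
true_need <- 1000            # what the project actually requires
goal      <- 1200            # the more aggressive published goal
fee       <- 0.05            # Kickstarter's fee on funded projects

net <- goal * (1 - fee)      # 1140, comfortably above true_need
net >= true_need             # TRUE
```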

Initially, we thought percent funded would be a good continuous outcome variable for predictive modeling. It is more precise than binary classification, which required selecting a somewhat arbitrary decision threshold. However, plotting the histogram shows that the data has a non-normal distribution. Because of this pattern, percent_funded was unsuitable for predictive modeling. This distribution also prevented conducting any natural experiments, such as regression discontinuity, to determine differences between barely-failed and barely-successful projects.


2.3 In-progress Causality

During our exploratory analysis, we discovered many interesting variables. However, it quickly became apparent that our project has a complicating confounding variable: the point in time the project was observed. Our data were collected from each project as it exists now: completed.

The significance of this is that as we build our models to predict the outcome of a project at its launch, we have access to information from the “future,” beyond just the outcome variable funded. Examples include comments_count, updates_count, and whether a project gets featured on the almost-funded page. We will use comments_count to dive a bit deeper.

We can see an obvious correlation: as comments_count increases, so does the chance of funding. Unsurprisingly (this is data analytics, after all), we find ourselves wrestling with questions of causation. “Do comments cause a project to get funded?” “Or are great projects that are destined for funding simply prone to receiving more comments?”

Answering this question would require tracking projects over time while they are active and performing controlled experiments. For example, we could build two virtually identical projects and begin seeding the comments of only one, allowing us to identify the causal impact of seeding a project’s comment section.

Luckily, we don’t have to validate every hypothesis with statistical rigor to begin making business decisions. Our intuition tells us that the impact of comments_count is surely a mixture of causes. We can, therefore, recommend that creators, in addition to designing a compelling project, find a backer or two who are passionate about the project and encourage them to start a discussion on the project page. It might be just what you need to tip the scales in your favor.

3 Exploratory Analysis

3.1 Ex Post Facto

Ex post facto variables are those generated after the start of the project. These are interesting to examine and can provide valuable insight into Kickstarter; however, they are not appropriate for use in our predictive models, as they are pseudo outcome variables. Some, such as comments, can provide direction to a project creator on what to do mid-project to increase their chance of funding.


3.1.1 backers_count

backers_count is a powerful predictor of project funding. We can see that the distribution resembles a logistic function. We found it surprising that even at the top 5% of backers_count there are still projects that are not funded. We hypothesize that these projects have an extremely large goal.


3.1.2 comments_count

The vast majority of projects received fewer than 20 comments. Chance of Funding increases substantially as comments increase. The most notable feature is that receiving as few as two comments can increase Chance of Funding by 30%+. We hypothesize that a good project has a causal relationship with receiving more comments. The correlation alone is enough to advise any creator to make a concerted effort to start a conversation in the comments section of their project.


3.1.3 updates_count

Another solid indicator of funding is updates_count. Just as with the other ex post facto variables, the causality is likely reversed, as creators are probably more willing to update a project that is getting traction. Given the continued improvements throughout the deciles, it is surely worth regularly providing updates to finish off the funding, or possibly move well past the 100% funded mark.


3.1.4 spotlight

We were surprised to find a variable with 100% predictive power occurring in over 20,000 projects. We dug deeper and found that spotlight denotes projects featured on Kickstarter’s recently funded page! (See Kickstarter Spotlight.)
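A sketch of the dplyr call that would produce the summary below, in the same style as the chunks shown later in this report (df_engr assumed):

```r
df_engr %>%
  group_by(spotlight) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(spotlight)
```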

## # A tibble: 2 x 3
##   spotlight `n()` `mean(funded)`
##       <dbl> <int>          <dbl>
## 1         0 21831              0
## 2         1 27519              1

3.2 Day Zero

Day Zero variables are any which can be observed and/or controlled at the start of the project. These are the most important for our predictive models as they allow us to predict a project’s funding before any Kickstarter activity.


3.2.1 goal

One of the most obvious, and ultimately significant, predictors is goal. It shows a clear downward trend in funding success as the goal amount increases. While this is intuitive, it is worth noting that the relationship does not appear linear. For this reason, we used the quantile function to account for this distribution.
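As an illustrative sketch, binning goal into the goal_20 quantile variable used later in our models could be done with dplyr’s ntile (the exact implementation here is an assumption):

```r
library(dplyr)

# Split goal into 20 equal-sized quantile bins so downstream models
# can fit each bin separately instead of assuming a single linear slope
df_engr <- df_engr %>%
  mutate(goal_20 = factor(ntile(goal, 20)))
```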


3.2.2 category

Some categories never fail in this dataset (only considered if n > 50):

  • design/product design (1098 projects)
  • film & video/documentary (2202 projects)
  • film & video/shorts (3513 projects)
  • games/tabletop games (1064 projects)

Most successful parent categories (only considered if n > 100):

  • design: 1,725 projects, 81.2% successful
  • film & video: 12,087 projects, 65.2% successful
  • music: 14,635 projects, 59.2% successful

Least successful parent categories (only considered if n > 100):

  • publishing: 8,725 projects, 39.7% successful
  • technology: 1,638 projects, 46.2% successful
  • games: 4,125 projects, 46.9% successful
## # A tibble: 15 x 10
##    category    count success_rate avg_goal med_perc_funded avg_perc_funded
##    <fct>       <int>        <dbl>    <dbl>           <dbl>           <dbl>
##  1 music       14635         59.2    7589.           102.             117.
##  2 film&video  12087         65.2   22409.           102.             301.
##  3 publishing   8725         39.7    7951.            19.8            336.
##  4 art          6122         51.2    9139.           100              260.
##  5 games        4125         46.9   38009.            41.2           1214.
##  6 design       1725         81.2   12981.           130              487.
##  7 technology   1638         46.2   68119.            47.0            195.
##  8 crafts         61         96.7    3235.           111.             146.
##  9 comics         57        100     11760.           176.             364.
## 10 theater        57        100      5994.           113.             127.
## 11 food           45        100     19737.           116.             147.
## 12 fashion        36        100     18178.           136.             580.
## 13 journalism     15        100     23405            111.             139.
## 14 dance          11        100      3165.           120.             120.
## 15 photography    11        100      9061.           166.             218.
## # ... with 4 more variables: avg_backer_count <dbl>,
## #   med_backer_count <dbl>, avg_contribution <dbl>, med_contribution <dbl>

The distribution of projects by category shows Kickstarter’s intense focus on creative projects. We hypothesize that the minimal appearance of some categories suggests Kickstarter’s classification system tends to favor large, general groupings. It may also be arbitrary in some instances, as many dance and photography projects could readily be placed in art.

***

The chance of funding does not follow a similar pattern to the category frequency distribution. In fact, the sparsely populated categories have near-perfect funding rates. Further investigation into these anomalies, such as exploring correlation with variables like spotlight, may reveal a selection bias toward obscure classifications.


3.2.3 launched_at

The number of projects increased exponentially from 2009 to 2012 and seems to have grown more gradually after 2012. We only collected data through December 2013 and anticipate continued growth in subsequent years.

We explored funded by mo_launched to see if seasonality impacts Kickstarter. The most dramatic dips occur in May and December. This is consistent with our understanding of financial markets in general: they slow down in early summer and have much lower volume around the holiday season.
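A sketch of the monthly grouping behind this seasonality check (assuming df_engr with mo_launched and funded columns):

```r
df_engr %>%
  group_by(mo_launched) %>%
  summarise(count = n(), funded_rate = mean(funded)) %>%
  arrange(mo_launched)
```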


3.2.4 country

The majority of projects are based in the United States. Domestic projects have a success rate about 7% higher than international projects. We believe two factors drive this difference. First, Kickstarter is a U.S.-based company and will, therefore, better meet the needs of its domestic customers. Second, crowdfunding requires a critical mass of people to support a project ecosystem. As backers are most likely to fund projects in their own country, any new regional expansion will have lower success rates while that critical mass develops.

## # A tibble: 5 x 3
##   country count funded_rate
##   <fct>   <int>       <dbl>
## 1 US      47007       0.561
## 2 GB       1986       0.492
## 3 CA        278       0.464
## 4 AU         54       0.556
## 5 NZ         25       0.52
## 
## Call:
## lm(formula = funded ~ usa, data = df_country)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5610 -0.5610  0.4390  0.4390  0.5096 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.49040    0.01026  47.814  < 2e-16 ***
## usa          0.07058    0.01051   6.717 1.88e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 49348 degrees of freedom
## Multiple R-squared:  0.0009133,  Adjusted R-squared:  0.0008931 
## F-statistic: 45.11 on 1 and 49348 DF,  p-value: 1.88e-11

3.2.5 photo_key

There are very few projects that do not have at least a photo. Consequently, a t-test shows no significant difference between having and not having a photo. This is probably not because photos don’t matter, but rather because the sample with no photo is too small to have statistical power.

## # A tibble: 2 x 3
##   photo_key `n()` `mean(funded)`
##       <dbl> <int>          <dbl>
## 1         0    25          0.68 
## 2         1 49325          0.558
## 
##  Welch Two Sample t-test
## 
## data:  funded by photo_key
## t = 1.2854, df = 24.026, p-value = 0.2109
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07413234  0.31899802
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6800000       0.5575672

3.2.6 video_status

We hypothesized that video_status would be a powerful predictor, as it is a proxy for whether or not the project has a video. The t-test shows video_status to have a statistically significant impact on the success of the project.

df_engr %>%
  group_by(video_status) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(video_status)
## # A tibble: 2 x 3
##   video_status `n()` `mean(funded)`
##          <dbl> <int>          <dbl>
## 1            0  8882          0.408
## 2            1 40468          0.591
t.test(funded ~ video_status, data = df_engr)
## 
##  Welch Two Sample t-test
## 
## data:  funded by video_status
## t = -31.728, df = 13074, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1940137 -0.1714361
## sample estimates:
## mean in group 0 mean in group 1 
##        0.407791        0.590516

3.2.7 Social media connectedness

Social media shows an impact. Facebook seems to be the strongest, and YouTube has a negative coefficient. Our hypothesis is that Facebook and Twitter may be used for promotion, while creators focusing on YouTube may over-rely on their product content. Yet the most successful creators have all three, which supports the idea that YouTube is effective when paired with a comprehensive social media campaign.

#facebook
df_engr %>%
  group_by(facebook) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(facebook)
## # A tibble: 2 x 3
##   facebook `n()` `mean(funded)`
##      <dbl> <int>          <dbl>
## 1        0 36654          0.541
## 2        1 12696          0.605
#twitter
df_engr %>%
  group_by(twitter) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(twitter)
## # A tibble: 2 x 3
##   twitter `n()` `mean(funded)`
##     <dbl> <int>          <dbl>
## 1       0 46741          0.554
## 2       1  2609          0.617
#youtube
df_engr %>%
  group_by(youtube) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(youtube)
## # A tibble: 2 x 3
##   youtube `n()` `mean(funded)`
##     <dbl> <int>          <dbl>
## 1       0 45570          0.560
## 2       1  3780          0.528
#social_media_count
df_engr %>%
  group_by(social_media_count) %>%
    summarise(n(), mean(funded)) %>%
  ungroup(social_media_count)
## # A tibble: 4 x 3
##   social_media_count `n()` `mean(funded)`
##   <fct>              <int>          <dbl>
## 1 0                  34484          0.543
## 2 1                  11226          0.594
## 3 2                   3061          0.581
## 4 3                    579          0.615
df_engr %>% 
  ggplot(aes(x = social_media_count, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean", fill = "#332288") +
  labs(title = "Funding By Social Media Count",
       x="Social Media Count",
       y="Chance of Funding")


3.2.8 campaign_duration

Interestingly, campaign duration has an inverse relationship with the likelihood of receiving funding; longer campaigns are associated with higher failure rates.


3.2.9 description_length

We believe that a description can have a significant impact on a project’s funding. In this project’s next steps, we hope to extract more meaningful variables from the text analysis we have conducted. However, even in its most basic form, description_length shows a clear trend: it improves the chance of funding and then levels off. This implies that putting in the effort to write a detailed description is worthwhile. However, excessive wordiness and/or novel-style descriptions quickly hit diminishing returns.


3.2.10 rewards

We suspected that how a creator structures their reward scheme would also have a significant impact on the project. Due to time constraints and the complex nature of its nested data structure, for this iteration we explored only the scheme’s length. We see a steep slope that flattens, with a final jump at the end. This tells a story: the first bucket consists primarily of projects without rewards, and backers are not impressed. By the fifth bucket, we see diminishing returns, likely due to unnecessary detail and complexity. In the last bin, it looks like some creators go the extra mile, and their backers appreciate it.

df_engr %>% 
  ggplot(aes(x = reward_length_10, y = funded)) +
  stat_summary(geom = "bar", fun.y = "mean", fill = "#332288") +
    labs(
    title = "Funding By Rewards",
    x="Length of Rewards", 
    y="Chance of Funding")

Not Fully Explored: More granular location variables would require more cleaning and may produce regional insights.

  • location_name
  • location_state
  • location_type
  • fx_Rate
  • profile_blurb
  • profile_state

Rejected: We looked at these, yet did not find them to be predictive.

  • project_id
  • disable_communication
# $project_id
# A random identifier; cannot easily observe a pattern
range(db_cleaned$project_id)
## [1]      21109 2147466649

4 Text Analysis

full_description contains the complete project description from Kickstarter. Unstructured data like this require more cleaning and transformation to be useful, but have the potential to be a rich source of information. Our application of text analysis had four primary motives:

  1. Examine word frequency with word counts
  2. Visualize word frequency with wordclouds
  3. Construct topic models
  4. Binary classification to predict project funding status

4.1 Word Frequency

We began by transforming the strings of text in full_description into a data frame with one word per row. Then we removed English stop words, common words that carry little semantic meaning and are thus immaterial to analyses (e.g., “and”, “the”, “of”). Finally, we determined word counts for the entire dataset.
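A minimal tidytext sketch of that pipeline (column names follow the text; the exact chunk is illustrative):

```r
library(dplyr)
library(tidytext)

# One token (word) per row, drop English stop words, then count
word_counts <- df_engr %>%
  unnest_tokens(word, full_description) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```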

***

Next, we examined the correlation between word proportions of successful and failed project descriptions. Word proportion represents the percentage of time that a given word is used out of the total number of words in the document. In this case, the documents are the collection of all successful project descriptions and all failed project descriptions. We observed, both visually and in terms of Pearson’s correlation coefficient, that the terms used in successful and failed project descriptions were overwhelmingly similar.

## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Failed Projects
## t = 1167.9, df = 111150, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.961139 0.962025
## sample estimates:
##       cor 
## 0.9615845

4.2 Wordclouds

Another way to visualize word frequency is by constructing wordclouds, which scale the size of text of a word to match its frequency in the document relative to other words’ frequencies. We constructed a wordcloud for the descriptions from the entire dataset. Unsurprisingly, “project”, “kickstarter”, and “goal” were among the most frequent terms used.


Wordclouds can be a useful way to observe differences in word variety and frequency between different groups of documents. Although they cannot be used in subsequent modeling, they are a tool for understanding unstructured text data and formulating hypotheses. Therefore, we grouped our dataset into documents:

  • by year to identify trends over time, and
  • by funded to identify differences between successful and failed projects

Prior to generating the wordclouds, we also created a custom set of stop words to weed out common terms in our dataset that could mask points of distinction between documents.
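A sketch of how such a custom stop-word list can be layered on top of tidytext’s defaults (the example words come from the frequency analysis above; the object name is illustrative):

```r
library(dplyr)
library(tidytext)

# Dataset-specific terms that dominate every document and mask differences
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("project", "kickstarter", "goal"), lexicon = "custom")
)
```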

In the wordclouds by year, we see that music was the most prevalent category in 2009, but film began to emerge as the predominant category in 2010-2011. In 2012-2013, games appeared as the biggest category. These wordclouds also give us a hint regarding the variety of projects. From 2009-2011, the wordclouds become larger and word frequency is less concentrated around the same terms. Abruptly, in 2012, the projects seem to become less disparate, but in 2013 variety increases again. This suggests that the degree of project variety on Kickstarter may be cyclical; perhaps artists and entrepreneurs in the same field turn to Kickstarter after hearing about colleagues’ successes. However, more years of data are needed to verify the hypothesis of three-year periodicity.

## [1] 2009

## [1] 2010

## [1] 2011

## [1] 2012

## [1] 2013

***

In the wordclouds by funding status, we observed a high degree of similarity in both terms and frequency between successful and failed projects. Books seemed more likely to fail, given the higher prevalence of “book” in the failed wordcloud. There also seemed to be more variety in the successful projects’ wordcloud, perhaps indicating richer project descriptions. But generally, high word frequency may not be the best delineator of successful versus failed projects.

## [1] "failed"

## [1] "successful"


4.3 Inverse Document Frequency

Sometimes the best way to determine points of difference between two similar documents is to look at the terms that are unique to each, rather than the most frequent terms. For example, two books written by the same author would likely generate similar wordclouds, yet the unique characters and places in the books would enable us to detect which book is which.

To see if this might be the case in our collection of successful and failed projects, we examined the term frequency-inverse document frequency (tf-idf). tf measures how common a term is; idf decreases the weight placed on terms used commonly across the collection and increases the weight placed on terms that are rare in the collection overall (i.e., common in only a few documents). To remove nonsensical words from the analysis, we only considered words with a frequency greater than 500, which is a reasonably low cutoff in a dataset with 700,000+ unique terms.
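With tidytext, the tf-idf weighting and the frequency cutoff described above can be sketched as follows (word_counts_by_outcome is a hypothetical data frame of per-outcome word counts):

```r
library(dplyr)
library(tidytext)

project_tfidf <- word_counts_by_outcome %>%  # columns: outcome, word, n
  filter(n > 500) %>%                        # drop rare/nonsensical words
  bind_tf_idf(word, outcome, n) %>%
  arrange(desc(tf_idf))                      # most distinctive terms first
```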

The results of this analysis suggest that board games and film are likely to be successful (dice, unlocked, filmmaker(s), expansion, boards, filmmaking, premiere). However, although the games category overall had a high success rate, it appears that games involving war and violence were less likely to receive funding (weapon, battles, security, agent), as were online games (multiplayer, server, playable, modes, animations).


4.4 Topic Modeling

The analyses in the previous section have focused on the “bag-of-words” approach and word frequency as a method for natural language processing, the means by which computers make sense of human language. Although this is a common and useful approach, there are other useful ways to describe text data.

One such method is topic modeling. Topic models assume that word or groups of words (called n-grams) which appear frequently together in a dataset are explained by underlying, unobserved groups (called topics). By examining word or n-gram overlap in the documents comprising a dataset, these topics can be detected. Although the computer cannot provide a semantic label for the topics, a human who is familiar with the dataset could examine the top words and determine a theme.

4.5 Latent Dirichlet allocation

We chose Latent Dirichlet allocation (LDA) as our statistical model for topic detection. LDA examines text by word frequency and co-occurrence in documents, which in our case are individual project descriptions. LDA assumes that each document covers a small number of topics and that each topic uses a small set of words frequently, which makes it well suited to assigning documents to topics.

To feed data into the model, we first processed the text to transform it to lowercase, remove punctuation, and remove stop words. In this section, we also performed word stemming, which groups together words that have the same root but different suffixes. This process helps ensure that words with the same semantic meaning, but different verb conjugations and the like, are assessed as the same word. As a result, our output shows some incomplete word stems.
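Stemming is why truncated stems like “donat” and “rais” appear in the topic output below; a quick sketch using the SnowballC stemmer (one common choice, named here as an assumption):

```r
library(SnowballC)

# Inflected forms collapse to a shared (sometimes incomplete) stem
wordStem(c("filming", "filmed", "films"))    # all become "film"
wordStem(c("donate", "donation", "donated"))
```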

After processing the text, we used it to generate documents, a vocabulary of terms in the dataset, and metadata to construct the model. Consistent with our tf-idf analysis above, we only considered terms that appeared in at least 500 documents. We ran iterations of the LDA model specifying both 20 and 40 topics. The model did not reach convergence over 10 or 20 iterations; however, meaningful topics emerged with 20 iterations over 40 topics.
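The topic summaries that follow (Highest Prob / FREX / Lift / Score) match the labeling format of the stm package, so the fit may have looked roughly like this sketch (object names are assumptions):

```r
library(stm)

# 40 topics, capped at 20 EM iterations (the run that produced
# meaningful topics despite not reaching full convergence)
fit <- stm(documents = processed$documents,
           vocab      = processed$vocab,
           K          = 40,
           max.em.its = 20)

labelTopics(fit, topics = c(6, 18, 37))
```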

***

Visualizing the results of our topic model, we see some meaningful topics emerge, some centered on the mechanisms of the platform, and others identifying product categories or subcategories. For example, Topic 18 could be labeled Funding Requests and includes terms like “goal”, “donate”, “money”, “raise”, and “reach”.

## Topic 18 Top Words:
##       Highest Prob: goal, donat, will, kickstart, money, rais, reach 
##       FREX: donat, goal, reach, amount, rais, money, dollar 
##       Lift: --noth, deadlin, incent, exceed, donat, fundrais, reach 
##       Score: goal, donat, pledg, kickstart, money, reach, rais

***

On the other hand, Topic 37 seems to describe a certain subcategory of Design and could be labeled Graphic Design with terms like “design”, “print”, “edit”, “poster”, “shirt”, and “sticker”.

## Topic 37 Top Words:
##       Highest Prob: print, edit, will, design, sign, limit, poster 
##       FREX: print, poster, sign, edit, limit, paper, sticker 
##       Lift: ink, poster, sticker, print, paper, shirt, sign 
##       Score: print, poster, edit, sign, design, color, paper

***

The theme of the projects is clear from some topics, although the type of project is not easily distinguished. For example, Topic 6 is about Education, but could span many types of projects.

## Topic 6 Top Words:
##       Highest Prob: school, learn, student, children, kid, educ, program 
##       FREX: student, school, children, educ, teach, kid, teacher 
##       Lift: teacher, classroom, teach, student, educ, children, school 
##       Score: school, student, children, educ, kid, learn, teach

***

We also visualized the correlations between the 40 topics. The green nodes indicate topics, and the dashed lines represent relatedness between topics; the length of a dashed line indicates the degree of overlap between two topics. Our topics are highly related to one another, both in terms of the number of connections and the distance of those connections.

***

In natural language processing, data often arrive with little metadata to categorize the text. Although we have project category in our dataset, we have no mechanism, aside from text mining, to determine topic categorization, which may be highly related to success or failure. Therefore, the results of the LDA model could be useful for classification of successful and unsuccessful projects.

5 Machine Learning Models

While machine learning has achieved an intimidating buzzword status, it is often the easiest and least time-consuming part of a project to implement. This project was no exception. Among the collected projects, 55.9% were funded, so we could achieve that accuracy level simply by predicting that every project is funded. This is, therefore, the baseline our models must surpass to begin providing value. We decided to use two models that we enjoy working with, LASSO and decision trees. There is room to expand the complexity of the project by considering a more robust assortment of models; however, extracting the last bit of accuracy was not our goal. Both models achieved nearly identical accuracy of approximately 70% in the test sample.
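The majority-class baseline is easy to verify directly. The sketch below uses a simulated outcome vector standing in for our data (the real computation runs on our dataset) and computes the accuracy of always predicting the majority class:

```r
# Majority-class baseline: predict "funded" for every project and see
# what accuracy that yields. `funded` is simulated here to mimic our
# observed 55.9% funding rate.
set.seed(1)
funded <- rbinom(10000, size = 1, prob = 0.559)

majority_class    <- as.numeric(mean(funded) >= 0.5)  # 1: most projects fund
baseline_accuracy <- mean(funded == majority_class)
baseline_accuracy  # close to the observed funding rate
```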

5.1 LASSO

LASSO is a type of linear regression model. Its distinguishing feature is a mechanism to avoid overfitting: it adds a penalty on the absolute size of each coefficient, which shrinks uninformative coefficients to exactly zero. Based on prior experience with LASSO, we expected it to discard many of the features we gave it. We were surprised to see nonzero coefficients for every variable! This is likely driven by both the size of our dataset (which increases the statistical significance of any factor) and the care we took engineering variables with predictive power. During our exploratory analysis, we observed that many of the continuous variables did not appear to have a linear relationship with funding. We therefore used the quantile versions of goal, description_length, and reward_length. This lets LASSO treat each quantile group independently of the others, approximating a more complex functional form.
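The binning itself is a one-liner in base R. This sketch, with a simulated right-skewed `goal` (names are illustrative rather than our exact code), builds a ventile factor the way a variable like `goal_20` can be constructed:

```r
# Cut a continuous variable into ventiles (20 equal-count bins) so the
# LASSO receives one dummy per bin rather than a single linear slope.
set.seed(1)
goal <- exp(rnorm(5000, mean = 8, sd = 1.5))  # right-skewed, like real goals

breaks  <- quantile(goal, probs = seq(0, 1, by = 0.05))
goal_20 <- cut(goal, breaks = breaks, include.lowest = TRUE)

nlevels(goal_20)  # 20 bins, each holding ~5% of projects
```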

# Outcome vector and dummy-coded design matrix for glmnet
train_y <- train$funded
train_x <- model.matrix(funded ~ 
  campaign_duration +
  usa +
  social_media_count +
  photo_key +
  video_status +
  mo_launched +
  category +
  goal_20 +
  description_length_10 +
  reward_length_10, data = train)


# Same design matrix for the test set
test_y <- test$funded
test_x <- model.matrix(funded ~ 
  campaign_duration +
  usa +
  social_media_count +
  photo_key +
  video_status +
  mo_launched +
  category +
  goal_20 +
  description_length_10 +
  reward_length_10, data = test)

# alpha = 1 selects the LASSO penalty; lambda is chosen by cross-validation
cvfit <- cv.glmnet(x = train_x, y = train_y, alpha = 1)
coef(cvfit, s = "lambda.min")
## 71 x 1 sparse Matrix of class "dgCMatrix"
##                                                     1
## (Intercept)                               0.664914095
## (Intercept)                               .          
## campaign_duration                        -0.001999695
## usa                                       0.130496091
## social_media_count1                       0.007694106
## social_media_count2                      -0.018798148
## social_media_count3                      -0.012401064
## photo_key                                -0.325786082
## video_status                              0.114512423
## mo_launched02                             0.012042709
## mo_launched03                             0.009070564
## mo_launched04                            -0.005003108
## mo_launched05                            -0.032815830
## mo_launched06                            -0.020027388
## mo_launched07                            -0.042405213
## mo_launched08                            -0.044413291
## mo_launched09                            -0.020568204
## mo_launched10                            -0.019607878
## mo_launched11                            -0.022795763
## mo_launched12                            -0.030831188
## categorycomics                            0.404898773
## categorycrafts                            0.305221540
## categorydance                             0.296892593
## categorydesign                            0.291811031
## categoryfashion                           0.469312227
## categoryfilm&video                        0.175696086
## categoryfood                              0.470958744
## categorygames                            -0.032333145
## categoryjournalism                        0.518167715
## categorymusic                             0.086738376
## categoryphotography                       0.448134568
## categorypublishing                       -0.092499720
## categorytechnology                        0.050521804
## categorytheater                           0.434534961
## goal_20(500,750]                         -0.057788659
## goal_20(750,1e+03]                       -0.117671705
## goal_20(1e+03,1.5e+03]                   -0.131514952
## goal_20(1.5e+03,1.8e+03]                 -0.129787711
## goal_20(1.8e+03,2e+03]                   -0.179015169
## goal_20(2e+03,2.5e+03]                   -0.201020201
## goal_20(2.5e+03,3e+03]                   -0.205210875
## goal_20(3e+03,3.5e+03]                   -0.222480198
## goal_20(3.5e+03,4.5e+03]                 -0.263936072
## goal_20(4.5e+03,5e+03]                   -0.325905098
## goal_20(5e+03,5.2e+03]                   -0.392920567
## goal_20(5.2e+03,7e+03]                   -0.343285549
## goal_20(7e+03,8e+03]                     -0.349492016
## goal_20(8e+03,1e+04]                     -0.414197836
## goal_20(1e+04,1.2e+04]                   -0.433992904
## goal_20(1.2e+04,1.6e+04]                 -0.466483730
## goal_20(1.6e+04,2.5e+04]                 -0.525832603
## goal_20(2.5e+04,5e+04]                   -0.614624699
## goal_20(5e+04,2.15e+07]                  -0.752777020
## description_length_10(754,1.11e+03]       0.047569711
## description_length_10(1.11e+03,1.44e+03]  0.094708324
## description_length_10(1.44e+03,1.81e+03]  0.115029118
## description_length_10(1.81e+03,2.22e+03]  0.138664965
## description_length_10(2.22e+03,2.74e+03]  0.166152398
## description_length_10(2.74e+03,3.46e+03]  0.170785197
## description_length_10(3.46e+03,4.58e+03]  0.175105438
## description_length_10(4.58e+03,6.64e+03]  0.228388846
## description_length_10(6.64e+03,1.4e+05]   0.287361125
## reward_length_10(2.62e+03,3.71e+03]       0.063860414
## reward_length_10(3.71e+03,4.61e+03]       0.111644718
## reward_length_10(4.61e+03,5.49e+03]       0.139014324
## reward_length_10(5.49e+03,6.43e+03]       0.181298204
## reward_length_10(6.43e+03,7.52e+03]       0.198659617
## reward_length_10(7.52e+03,8.86e+03]       0.211436713
## reward_length_10(8.86e+03,1.08e+04]       0.259518926
## reward_length_10(1.08e+04,1.45e+04]       0.271572314
## reward_length_10(1.45e+04,1.37e+05]       0.351476224

Accuracy of LASSO in Test Set

## [1] 0.7014184
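This accuracy comes from thresholding the model's predicted values at 0.5 and comparing against the truth. A base-R sketch with simulated predictions (the real ones would come from `predict(cvfit, newx = test_x, s = "lambda.min")`):

```r
# Turn model output into 0/1 predictions at a 0.5 cutoff and measure
# accuracy. `truth` and `pred_prob` are simulated stand-ins for the
# test outcomes and the model's predicted values.
set.seed(1)
truth     <- rbinom(1000, 1, 0.559)
pred_prob <- ifelse(truth == 1, runif(1000, 0.3, 1.0), runif(1000, 0.0, 0.7))

pred_class <- as.numeric(pred_prob > 0.5)
accuracy   <- mean(pred_class == truth)
accuracy
```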

5.2 Decision Tree

A decision tree is a classification model that finds breakpoints in the data, splitting the (remaining) observations in a binary fashion at each node. One of its major advantages is that it does not assume variables behave linearly. As such, you will notice we did not place the continuous variables into quantiles for the tree; it finds the breakpoints itself.
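To see what "finding breakpoints" means, here is a toy single-split search in base R. It is not rpart's actual algorithm (rpart uses impurity measures such as Gini across all variables at every node), but it shows why no linearity assumption is required:

```r
# Toy breakpoint search on one predictor: try each observed value as a
# threshold and keep the split with the best classification accuracy.
set.seed(1)
goal   <- runif(2000, 100, 50000)
funded <- as.numeric(goal < 5000 | runif(2000) < 0.2)  # low goals fund more

best_split <- function(x, y) {
  candidates <- sort(unique(x))
  acc <- vapply(candidates, function(thresh) {
    pred <- as.numeric(x < thresh)
    max(mean(pred == y), mean(pred != y))  # either side may be the "1" class
  }, numeric(1))
  candidates[which.max(acc)]
}

best_split(goal, funded)  # recovers a threshold near the true break at 5000
```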

set.seed(1)
# Fit a classification tree on the raw (un-binned) continuous variables
tree <- rpart(funded ~ 
  campaign_duration +
  usa +
  social_media_count +
  photo_key +
  video_status +
  mo_launched +
  category +
  goal +
  description_length +
  reward_length, data = train)

5.2.1 Simple tree

In this simple tree, we used the default complexity parameter of 1%. This limits how many nodes the decision tree will grow. While this is not our most accurate model, its simplicity clearly illustrates how the tree works. We also found it impressive that, using only three levels of splits (on goal, reward length, and category), it was able to predict funding with ~65% accuracy.

Accuracy of Simple Tree in Test Set

## [1] 0.6521783

5.2.2 Complex tree

For a more detailed tree, we reduced the complexity parameter to 0.01%. In so doing, we encourage the tree to make many more splits that show even slight predictive power. It is important to note that if we lowered the complexity parameter enough, the tree would find a way to perfectly predict every in-sample observation; however, that would be a classic case of overfitting the model.

As you can see, that is one crazy tree. To avoid overfitting, we underwent a process called “pruning,” which selects the complexity parameter with the lowest cross-validated error. Although the pruned tree still has many branches, it is much more streamlined than the prior iteration. This is the tree we use to make our final predictions.

# Select the complexity parameter with the lowest cross-validated error
index <- which.min(tree$cptable[ , "xerror"])
tree_min <- tree$cptable[index, "CP"]

# Prune back to that cp and plot the result
pruned_tree <- prune(tree, cp = tree_min)
prp(pruned_tree, extra = 1, box.palette = "auto")
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

Accuracy of Complex Tree in Test Set

## [1] 0.7011145

6 Data Dictionary

| Name | Description | Type | Values |
|------|-------------|------|--------|
| funded | Amount pledged compared to goal by deadline | factor | 0: failed; 1: successful |
| comments_count | Number of comments users post during campaign | integer | 0 - 393041 |
| goal | Goal set at beginning of campaign, in local currency | numeric | 0.01 - 21474836 |
| updates_count | Number of times page was updated during campaign | integer | 0 - 301 |
| backers_count | Number of backers that contributed to the project | integer | 0 - 87142 |
| full_description | Text description of the project | character | N/A |
| campaign_duration | Days between launch and deadline | numeric | 1.5 - 91.96 |
| avg_contribution | Mean amount pledged per backer, in local currency | numeric | 1 - 9606 |
| percent_funded | Percent of goal received (%) | numeric | 0 - 4153501 |
| spotlight | If successful, indicates if project is featured | factor | 0: no spotlight; 1: spotlight |
| staff_pick | Staff selected to receive ‘Projects We Love’ badge | factor | 0: no badge; 1: ‘Projects We Love’ badge |
| usa | Indicates location in the US or in another country | factor | 0: other countries; 1: USA |
| social_media | Indicates if creator provided any links to social media | factor | 0: no links to social media; 1: one or more links |
| facebook | Indicates if creator linked to Facebook | factor | 0: no Facebook link; 1: Facebook link provided |
| twitter | Indicates if creator linked to Twitter | factor | 0: no Twitter link; 1: Twitter link provided |
| youtube | Indicates if creator linked to YouTube | factor | 0: no YouTube link; 1: YouTube link provided |
| social_media_count | Number of social media links provided by creator | factor | 0, 1, 2, 3 |
| photo_key | Indicates if the project page had a photo | factor | 0: no photo; 1: has photo |
| video_status | Indicates if the project page had a video | factor | 0: no video; 1: has video |
| reward_length | Number of characters in reward structure description | integer | 76 - 136827 |
| description_length | Number of characters in full project description | integer | 0 - 140229 |
| date_launched | Date of project launch (yyyy-mm-dd) | Date | 2009-04-24 to 2013-12-18 |
| mo_yr_launched | Month and year of project launch (mm-yyyy) | Date | 01-2010 to 12-2013 |
| yr_launched | Year of project launch (yyyy) | Date | 2009 - 2013 |
| mo_launched | Month of project launch (mm) | factor | 01 - 12 |
| goal_20 | Ventile assigned to goal | factor | [0.01,500], (500,750], (750,1e+03], (1e+03,1.5e+03], (1.5e+03,1.8e+03], (1.8e+03,2e+03], (2e+03,2.5e+03], (2.5e+03,3e+03], (3e+03,3.5e+03], (3.5e+03,4.5e+03], (4.5e+03,5e+03], (5e+03,5.2e+03], (5.2e+03,7e+03], (7e+03,8e+03], (8e+03,1e+04], (1e+04,1.2e+04], (1.2e+04,1.6e+04], (1.6e+04,2.5e+04], (2.5e+04,5e+04], (5e+04,2.15e+07] |
| description_length_10 | Decile assigned to full description length | factor | [0,754], (754,1.11e+03], (1.11e+03,1.44e+03], (1.44e+03,1.81e+03], (1.81e+03,2.22e+03], (2.22e+03,2.74e+03], (2.74e+03,3.46e+03], (3.46e+03,4.58e+03], (4.58e+03,6.64e+03], (6.64e+03,1.4e+05] |
| reward_length_10 | Decile assigned to reward description length | factor | [76,2.62e+03], (2.62e+03,3.71e+03], (3.71e+03,4.61e+03], (4.61e+03,5.49e+03], (5.49e+03,6.43e+03], (6.43e+03,7.52e+03], (7.52e+03,8.86e+03], (8.86e+03,1.08e+04], (1.08e+04,1.45e+04], (1.45e+04,1.37e+05] |
| category | One of 15 buckets categorizing project field | factor | art, comics, crafts, dance, design, fashion, film&video, food, games, journalism, music, photography, publishing, technology, theater |

6.1 Acknowledgements

The following resources were invaluable to the completion of the project: