Using mySQL to perform preliminary EDA and to query data
- sam33frodon
- Jan 20, 2021
- 10 min read
Updated: Jan 31, 2021
library(RMySQL)
sqlQuery <- function (query) {
# creating DB connection object with RMysql package
DB <- dbConnect(MySQL(),
user='TTTTT',
password = 'TTTTT',
dbname = 'kickstarter',
host = '127.0.0.1')
# send Query to obtain result set
rs <- dbSendQuery(DB, query)
# get elements from result sets and convert to dataframe
result <- fetch(rs, -1)
# close db connection
dbDisconnect(DB)
# return the dataframereturn(result)
}
cons <- dbListConnections(MySQL())
for(con in cons)
dbDisconnect(con)
First, we need to check how many tables in the database.
sqlQuery("
SHOW tables;
")
## Tables_in_kickstarter
## 1 campaign
## 2 category
## 3 country
## 4 currency
## 5 sub_category
There are fives tables, namely: campaign, category, country, currency, and sub_category Check the content of individual table.
sqlQuery("
SHOW COLUMNS
FROM campaign
")
## Field Type Null Key Default Extra
## 1 id bigint NO PRI <NA>
## 2 name text YES <NA>
## 3 sub_category_id bigint YES MUL <NA>
## 4 country_id bigint YES MUL <NA>
## 5 currency_id bigint YES MUL <NA>
## 6 launched datetime YES <NA>
## 7 deadline datetime YES <NA>
## 8 goal double YES <NA>
## 9 pledged double YES <NA>
## 10 backers bigint YES <NA>
## 11 outcome text YES <NA>
The description of each column as follow
ID: unique project
ID name: project name
sub_category_id: what industry/category was the project in?
country_id: id number of country of origin
currency_id: currency funding was given in
launched: date fundraising began
deadline: when target amount must be raised by
goal: desired amount of funding
pledged: how much was promised (whether or not the goal was reached)
backers: how many people contributed to the campaign?
outcome: the status of project (successful/failed/…)?
sqlQuery("
SHOW COLUMNS
FROM category
")
## Field Type Null Key Default Extra
## 1 id bigint NO PRI <NA>
## 2 name text YES <NA>
There are only two columns : the id of category and the category name.
sqlQuery("
SHOW COLUMNS
FROM country;
")
## Field Type Null Key Default Extra
## 1 id bigint NO PRI <NA>
## 2 name text YES <NA>
There are only two columns : the id of country and the country name.
sqlQuery("
SHOW COLUMNS
FROM currency;
")
## Field Type Null Key Default Extra
## 1 id bigint NO PRI <NA>
## 2 name text YES <NA>
There are also only two columns : the id of currency and the currency name
Exploring each table in data base
How many records in the table campaign?
sqlQuery("
SELECT COUNT(*)
FROM campaign;
")
## COUNT(*)
## 1 15000
sqlQuery("
SELECT COUNT(id)
FROM campaign;
")
15000 id, because id is primary key => not replicate The table campaign contains 15000 rows. To display the first 15 rows
sqlQuery("
SELECT *
FROM campaign
LIMIT 15;
")
## id name
## 1 1 Ragdolls
## 2 2 One To Ten
## 3 3 Future Gun - A short film about kids and advanced hardware
## 4 4 Fourth Wave Apparel—Vintage Fashion for the Modern Feminist
## 5 5 The Rich Lifestyle
## 6 6 Perils Galore!
## 7 7 In a Dark Dark House - Directed by Larry Moss
## 8 8 CHARLIEFOXTROTNY
## 9 9 Fat Englishmen - Beer and Sumo
## 10 10 bear elements old-fashioned shave and beard care for men
## 11 11 DigIndustrial Art
## 12 12 Expanding Farm to service the local Food Bank
## 13 13 Church Hive. Church hiring made easy!
## 14 14 The Visual Media Directory Project
## 15 15 Too Sweet - The Not So Serious Side to Diabetes
## sub_category_id country_id currency_id launched
## 1 23 2 2 2013-04-25 00:00:00
## 2 47 1 1 2015-11-16 00:00:00
## 3 24 2 2 2013-08-28 00:00:00
## 4 52 2 2 2014-07-22 00:00:00
## 5 9 2 2 2016-08-24 00:00:00
## 6 27 2 2 2012-09-25 00:00:00
## 7 20 2 2 2014-07-01 00:00:00
## 8 23 2 2 2014-05-21 00:00:00
## 9 54 2 2 2015-03-03 00:00:00
## 10 12 2 2 2016-04-14 00:00:00
## 11 97 2 2 2014-06-19 00:00:00
## 12 107 2 2 2016-06-13 00:00:00
## 13 40 2 2 2016-09-01 00:00:00
## 14 26 2 2 2012-03-07 00:00:00
## 15 10 2 2 2014-01-16 00:00:00
## deadline goal pledged backers outcome
## 1 2013-05-25 00:00:00 15000.00 20.00 3 failed
## 2 2015-12-16 00:00:00 223.68 413.81 23 successful
## 3 2013-09-27 00:00:00 5000.00 1497.00 28 failed
## 4 2014-08-21 00:00:00 6000.00 8795.00 218 successful
## 5 2016-09-28 00:00:00 2000000.00 2.00 2 failed
## 6 2012-11-01 00:00:00 2500.00 230.00 5 failed
## 7 2014-08-15 00:00:00 55000.00 9316.00 45 failed
## 8 2014-06-20 00:00:00 6000.00 161.00 4 failed
## 9 2015-04-02 00:00:00 50000.00 0.00 0 failed
## 10 2016-05-14 00:00:00 12000.00 306.00 11 failed
## 11 2014-07-09 00:00:00 30000.00 111.00 4 failed
## 12 2016-07-13 00:00:00 2500.00 405.00 5 failed
## 13 2016-10-01 00:00:00 16800.00 101.00 2 failed
## 14 2012-05-06 00:00:00 10000.00 2.00 2 failed
## 15 2014-02-15 00:00:00 2500.00 2525.00 6 successful
This table contains country_id and currency_id, the country_id can be found in table country. the id of currency appears in the table campaign and the table country.
How many subcategory, country, and currency ?
sqlQuery("
SELECT
COUNT(DISTINCT sub_category_id) as subidnb,
COUNT(DISTINCT country_id) as countrynb,
COUNT(DISTINCT currency_id) as currencynb
FROM campaign;
")
## subidnb countrynb currencynb
## 1 157 22 13
launched date and deadline date (datetime)
sqlQuery("
SELECT launched AS launched_date,
deadline AS deadline
FROM campaign
ORDER BY launched_date DESC
LIMIT 10;
")
## launched_date deadline
## 1 2018-01-02 00:00:00 2018-03-03 00:00:00
## 2 2018-01-01 00:00:00 2018-01-31 00:00:00
## 3 2018-01-01 00:00:00 2018-02-11 00:00:00
## 4 2018-01-01 00:00:00 2018-01-31 00:00:00
## 5 2017-12-30 00:00:00 2018-01-29 00:00:00
## 6 2017-12-29 00:00:00 2018-01-28 00:00:00
## 7 2017-12-29 00:00:00 2018-01-28 00:00:00
## 8 2017-12-28 00:00:00 2018-01-29 00:00:00
## 9 2017-12-27 00:00:00 2018-01-26 00:00:00
## 10 2017-12-27 00:00:00 2018-01-31 00:00:00
There is data for 2017 and 2018.
sqlQuery("
SELECT launched AS launched_date,
deadline AS deadline
FROM campaign
ORDER BY launched_date ASC
LIMIT 10;
")
## launched_date deadline
## 1 2009-05-06 00:00:00 2009-08-04 00:00:00
## 2 2009-05-07 00:00:00 2009-07-02 00:00:00
## 3 2009-05-10 00:00:00 2009-06-16 00:00:00
## 4 2009-05-12 00:00:00 2009-05-22 00:00:00
## 5 2009-06-04 00:00:00 2009-07-16 00:00:00
## 6 2009-06-05 00:00:00 2009-09-02 00:00:00
## 7 2009-06-20 00:00:00 2009-07-01 00:00:00
## 8 2009-06-30 00:00:00 2009-07-16 00:00:00
## 9 2009-07-14 00:00:00 2009-09-05 00:00:00
## 10 2009-07-28 00:00:00 2009-10-02 00:00:00
Data covers periods from 2009-05-06 to 2018-01-02 The year span is between 2009 to 2018. To check
sqlQuery("
SELECT DISTINCT(YEAR(launched)) as launchedyears
FROM campaign
ORDER BY launchedyears;")
## launchedyears
## 1 2009
## 2 2010
## 3 2011
## 4 2012
## 5 2013
## 6 2014
## 7 2015
## 8 2016
## 9 2017
## 10 2018
Similarly, we can check the year span for deadline.
sqlQuery("
SELECT DISTINCT(YEAR(deadline)) as deadlineyears
FROM campaign
ORDER BY deadlineyears;
")
## deadlineyears
## 1 2009
## 2 2010
## 3 2011
## 4 2012
## 5 2013
## 6 2014
## 7 2015
## 8 2016
## 9 2017
## 10 2018
We should extract year, month, and day of month.
sqlQuery("
SELECT COUNT(*)
FROM campaign
WHERE YEAR(launched) = YEAR(deadline);
")
## COUNT(*)
## 1 13923
There are 13923 projects that the launched data and the deadline were in the same year.
sqlQuery("
SELECT COUNT(*)
FROM campaign
WHERE YEAR(launched) <> YEAR(deadline);
")
## COUNT(*)
## 1 1077
Therefore, there are 1077 projects whom the launched year and the deadline year were different.
sqlQuery("
SELECT id, DATEDIFF(deadline, launched) as daynb, goal, pledged, backers, outcome
FROM campaign
ORDER BY daynb DESC
LIMIT 10
")
## id daynb goal pledged backers outcome
## 1 6082 92 50000 1025.00 2 failed
## 2 6407 91 1500 1900.00 40 successful
## 3 10511 91 185 196.00 9 successful
## 4 13590 91 5000 951.00 17 failed
## 5 14870 91 10000 15.00 3 failed
## 6 4637 91 7500 6030.00 26 failed
## 7 14111 91 1000 2597.68 122 successful
## 8 550 91 1300 80.00 3 failed
## 9 5931 91 10000 10049.99 131 successful
## 10 6448 91 55000 5780.00 27 failed
The max duration is 91-92 days.
sqlQuery("
SELECT id, DATEDIFF(deadline, launched) as daynb, goal, pledged, backers, outcome
FROM campaign
ORDER BY daynb ASC
LIMIT 10
")
## id daynb goal pledged backers outcome
## 1 3936 1 20 20.00 1 successful
## 2 12640 1 8888 7670.00 86 failed
## 3 3233 1 100000000 0.00 0 failed
## 4 9672 1 100 320.00 47 successful
## 5 2260 1 500 500.00 7 successful
## 6 11246 1 8888 8385.00 71 failed
## 7 8311 2 400 464.66 34 successful
## 8 667 2 100 12701.00 101 successful
## 9 7947 3 25000 1.00 1 failed
## 10 10022 3 5000 26.00 2 failed
There some projects that launched and end at the same day Project 3233 had a goal of 100000000 and pledged 0 ==> failed.
sqlQuery("
SELECT id, goal
FROM campaign
ORDER BY goal DESC
LIMIT 10;
")
## id goal
## 1 12761 100000000
## 2 3233 100000000
## 3 7118 100000000
## 4 176 100000000
## 5 9944 100000000
## 6 8150 88767573
## 7 14319 44336067
## 8 6145 17708936
## 9 4574 11086826
## 10 3101 10000000
What is the minimum goal and the maximum goal?# the year span is between 2009 to 2018 it seems date the date column only contain date with the format yyyy-mm-dd (we have to change that).
sqlQuery("
SELECT MIN(goal) as min_goal,
MAX(goal) as max_goal
FROM campaign;
")
# min_goal max_goal
## 1 0.75 1e+08
The minimum is 0.75 (we do not know what currency yet) The maximun goal is 100 000 000 We can find which currency.
sqlQuery("
SELECT name, currency_id, MAX(goal)
FROM campaign;
")
## name currency_id MAX(goal)
## 1 Ragdolls 2 1e+08
There are five projects that had a goal of 1e+08 (currency id is 2).mm
sqlQuery("
SELECT name
FROM currency
WHERE id = 2
")
## name
## 1 USD
Therefore, the maximum goal was 1e+08 USD. This query will be easier once we join tables. The goal 0.75 might be mistake?
sqlQuery("
SELECT name, currency_id, MIN(pledged)
FROM campaign;
")
## name currency_id MIN(pledged)
## 1 Ragdolls 2 0
There are some money for this project. it might be not a mistake
Similarly, we can find the maximum and the minimum of pledged.
sqlQuery("
SELECT MIN(pledged) as min_pledged,
MAX(pledged) as max_pledged
FROM campaign;
")
## min_pledged max_pledged
## 1 0 5408917
There is project that did not get any money. The maximum pledged is 5408917.
Category table
sqlQuery("
SELECT *
FROM category
")
## id name
## 1 1 Publishing
## 2 2 Film & Video
## 3 3 Music
## 4 4 Food
## 5 5 Design
## 6 6 Crafts
## 7 7 Games
## 8 8 Comics
## 9 9 Fashion
## 10 10 Theater
## 11 11 Art
## 12 12 Photography
## 13 13 Technology
## 14 14 Dance
## 15 15 Journalism
There are 15 categories.
SubCategory table
sqlQuery("
SELECT *
FROM sub_category
")
## id name category_id
## 1 1 Poetry 1
## 2 2 Narrative Film 2
## 3 3 Music 3
## 4 4 Film & Video 2
## 5 5 Restaurants 4
## 6 6 Food 4
## 7 7 Drinks 4
## 8 8 Product Design 5
## 9 9 Documentary 2
## 10 10 Nonfiction 1
## 11 11 Indie Rock 3
## 12 12 Crafts 6
## 13 13 Games 7
## 14 14 Tabletop Games 7
## 15 15 Design 5
## 16 16 Comic Books 8
## 17 17 Art Books 1
## 18 18 Fashion 9
## 19 19 Childrenswear 9
## 20 20 Theater 10
## 21 21 Comics 8
## 22 22 DIY 6
## 23 23 Webseries 2
## 24 24 Animation 2
## 25 25 Food Trucks 4
## 26 26 Public Art 11
## 27 27 Illustration 11
## 28 28 Photography 12
## 29 29 Pop 3
## 30 30 People 12
## 31 31 Art 11
## 32 32 Family 2
## 33 33 Fiction 1
## 34 34 Accessories 9
## 35 35 Rock 3
## 36 36 Hardware 13
## 37 37 Software 13
## 38 38 Weaving 6
## 39 39 Gadgets 13
## 40 40 Web 13
## 41 41 Jazz 3
## 42 42 Ready-to-wear 9
## 43 43 Festivals 10
## 44 44 Video Games 7
## 45 45 Anthologies 8
## 46 46 Publishing 1
## 47 47 Shorts 2
## 48 48 Electronic Music 3
## 49 49 Radio & Podcasts 1
## 50 50 Apps 13
## 51 51 Cookbooks 4
## 52 52 Apparel 9
## 53 53 Metal 3
## 54 54 Comedy 2
## 55 55 Hip-Hop 3
## 56 56 Periodicals 1
## 57 57 Dance 14
## 58 58 Technology 13
## 59 59 Painting 11
## 60 60 World Music 3
## 61 61 Photobooks 12
## 62 62 Drama 2
## 63 63 Architecture 5
## 64 64 Young Adult 1
## 65 65 Latin 3
## 66 66 Mobile Games 7
## 67 67 Flight 13
## 68 68 Fine Art 12
## 69 69 Action 2
## 70 70 Playing Cards 7
## 71 71 Makerspaces 13
## 72 72 Punk 3
## 73 73 Thrillers 2
## 74 74 Children's Books 1
## 75 75 Audio 15
## 76 76 Performance Art 11
## 77 77 Ceramics 11
## 78 78 Vegan 4
## 79 79 Graphic Novels 8
## 80 80 Fabrication Tools 13
## 81 81 Performances 14
## 82 82 Sculpture 11
## 83 83 Sound 13
## 84 84 Stationery 6
## 85 85 Print 15
## 86 86 Farmer's Markets 4
## 87 87 Events 4
## 88 88 Classical Music 3
## 89 89 Graphic Design 5
## 90 90 Spaces 4
## 91 91 Country & Folk 3
## 92 92 Wearables 13
## 93 93 Mixed Media 11
## 94 94 Journalism 15
## 95 95 Movie Theaters 2
## 96 96 Animals 12
## 97 97 Digital Art 11
## 98 98 Horror 2
## 99 99 Knitting 6
## 100 100 Small Batch 4
## 101 101 Installations 11
## 102 102 Community Gardens 4
## 103 103 DIY Electronics 13
## 104 104 Embroidery 6
## 105 105 Camera Equipment 13
## 106 106 Jewelry 9
## 107 107 Farms 4
## 108 108 Conceptual Art 11
## 109 109 Fantasy 2
## 110 110 Webcomics 8
## 111 111 Experimental 10
## 112 112 Science Fiction 2
## 113 113 Puzzles 7
## 114 114 R&B 3
## 115 115 Music Videos 2
## 116 116 Calendars 1
## 117 117 Video 15
## 118 118 Plays 10
## 119 119 Blues 3
## 120 120 Bacon 4
## 121 121 Faith 3
## 122 122 Live Games 7
## 123 123 Woodworking 6
## 124 124 Places 12
## 125 125 Footwear 9
## 126 126 3D Printing 13
## 127 127 Academic 1
## 128 128 Zines 1
## 129 129 Musical 10
## 130 130 Workshops 14
## 131 131 Photo 15
## 132 132 Immersive 10
## 133 133 Letterpress 6
## 134 134 Gaming Hardware 7
## 135 135 Candles 6
## 136 136 Television 2
## 137 137 Space Exploration 13
## 138 138 Couture 9
## 139 139 Nature 12
## 140 140 Robots 13
## 141 141 Typography 5
## 142 142 Crochet 6
## 143 143 Translations 1
## 144 144 Textiles 11
## 145 145 Pottery 6
## 146 146 Interactive Design 5
## 147 147 Video Art 11
## 148 148 Quilts 6
## 149 149 Glass 6
## 150 150 Pet Fashion 9
## 151 151 Printing 6
## 152 152 Romance 2
## 153 153 Civic Design 5
## 154 154 Kids 3
## 155 155 Literary Journals 1
## 156 156 Taxidermy 6
## 157 157 Literary Spaces 1
## 158 158 Chiptune 3
## 159 159 Residencies 14
There are 159 subcategories.
Comments