Using SQL to know which project tags are most frequently found on FLOSS projects
- sam33frodon
- Feb 1, 2021
- 2 min read
Updated: Feb 11, 2021
library(RMySQL)
For information on FLOSS project (https://en.wikipedia.org/wiki/Free_and_open-source_software)
sqlQuery <- function (query) {
# creating DB connection object with RMysql package
DB <- dbConnect(MySQL(), user='root', password = 'AAAAAA', dbname = 'test', host = '127.0.0.1')
# send Query to obtain result set
rs <- dbSendQuery(DB, query)
# get elements from result sets and convert to dataframe
result <- fetch(rs, -1)
# close db connection
dbDisconnect(DB)
# return the dataframe
return(result)
}
cons <- dbListConnections(MySQL())
for(con in cons)
dbDisconnect(con)
sqlQuery("
SHOW tables;
")
## Tables_in_test
## 1 fc_project_tags
We can discover the total number of project-tag combinations, keeping in mind that a project can have multiple tags
sqlQuery("SELECT COUNT(*)
FROM fc_project_tags;")
## COUNT(*)
## 1 353400
We can calculate the total number of projects
sqlQuery("SELECT count(DISTINCT project_id)
FROM fc_project_tags;")
## count(DISTINCT project_id)
## 1 46510
How many unique items are in our dataset?
sqlQuery("SELECT count(DISTINCT tag_name)
FROM fc_project_tags;")
## count(DISTINCT tag_name)
## 1 11006
sqlQuery("SELECT tag_name, COUNT(project_id)
FROM fc_project_tags
GROUP BY 1
HAVING COUNT(project_id) >= 2325
ORDER BY 2 DESC;")
## tag_name COUNT(project_id)
## 1 GPL 21182
## 2 POSIX 16875
## 3 Linux 16288
## 4 C 10292
## 5 OS Independent 10180
## 6 Software Development 9619
## 7 Internet 8100
## 8 Windows 7572
## 9 Java 6394
## 10 Web 6267
## 11 English 6000
## 12 C++ 5893
## 13 Libraries 5740
## 14 PHP 5452
## 15 Unix 5100
## 16 Mac OS X 4823
## 17 multimedia 4815
## 18 Communications 4450
## 19 Perl 4244
## 20 Python 4191
## 21 LGPL 3525
## 22 Utilities 3298
## 23 Dynamic Content 3199
## 24 GPLv3 2876
## 25 Networking 2821
## 26 Scientific/Engineering 2679
## 27 Games/Entertainment 2528
## 28 BSD 2497
## 29 Desktop Environment 2335
Comentarios