Kaggle is a very popular data science competition platform. On it, you can compete on a range of data analysis problems and practice your skills with real datasets from many industries.
This article introduces 10 datasets, ranging from beginner to advanced level. They are fun to work with and good practice before an interview.
Let's take a look!
01. Titanic Dataset (Beginner)
The Titanic dataset is one of the most popular datasets on Kaggle and a good introduction to the platform. The competition's training file has 891 records and 12 columns, and it contains information about the passengers who traveled on the Titanic.
The goal is to predict whether a passenger survived based on their characteristics. For example, the data shows that women survived at a much higher rate than men.
The variables in this dataset are
Age
Gender
Number of siblings or spouses aboard
Ticket class (first class, second class, third class)
Embarkation location (Cherbourg, Queenstown, Southampton)
Passenger ticket number
......
There are already many tutorials online on how to handle this dataset. If you want to challenge yourself, try to predict the survival rate of passengers boarding the ship at different locations.
Link to the Titanic dataset.
https://www.kaggle.com/c/titanic
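As a starting point for the embarkation challenge mentioned above, the survival rate per group can be computed with a simple group-by. The sketch below uses only the Python standard library and a handful of made-up rows. The column names Survived, Sex, and Embarked (with port codes C, Q, S) are assumed to match the competition's train.csv; the values are illustrative, not real passenger records.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only (not real passengers).
# Column names are assumed to match Kaggle's train.csv.
SAMPLE_CSV = """Survived,Sex,Embarked
1,female,C
0,male,S
1,female,S
0,male,Q
1,male,C
0,male,S
"""


def survival_rate_by(rows, column):
    """Return {group value: fraction of rows in that group with Survived == 1}."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        key = row[column]
        total[key] += 1
        survived[key] += int(row["Survived"])
    return {key: survived[key] / total[key] for key in total}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rates_by_port = survival_rate_by(rows, "Embarked")
rates_by_sex = survival_rate_by(rows, "Sex")
print(rates_by_port)
print(rates_by_sex)
```

To run this on the real data, replace the in-memory sample with the downloaded file, for example by passing an opened train.csv to csv.DictReader.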
02. Iris Dataset (Beginner)
This dataset is a classic multiclass classification problem. The objective is to predict which of three species (Iris setosa, Iris versicolor, or Iris virginica) a flower belongs to from attributes such as sepal length and sepal width.
For example, the petals of Iris setosa are much shorter than those of the other two species. If the petals are shorter than about 2 cm, the flower almost certainly belongs to Iris setosa.
The variables in this data set are as follows.
Sepal length
Sepal width
Petal length
Petal width
......
Again, there are many tutorials available for working with this dataset. One of the most popular is "Using Scikit-learn on the Iris dataset". It is a very good tutorial for beginners, as it shows how to use Scikit-learn and relies on pre-built functionality that makes it easy to train a model.
Link to the Iris dataset.
https://www.kaggle.com/uciml/iris
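Before reaching for a library, it can help to see how far a hand-written rule gets. The following is a minimal sketch of a two-threshold classifier on petal length and petal width. The thresholds (2.5 cm and 1.75 cm) are hand-picked assumptions for illustration, not values taken from the article, and the six rows are a small sample in the dataset's column order.

```python
# A few sample rows in the dataset's column order:
# (sepal length, sepal width, petal length, petal width, species), all in cm.
SAMPLE_ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
]


def classify(petal_length, petal_width):
    """Rule-based guess; both thresholds are hand-picked assumptions."""
    if petal_length < 2.5:
        return "setosa"
    if petal_width < 1.75:
        return "versicolor"
    return "virginica"


predictions = [classify(row[2], row[3]) for row in SAMPLE_ROWS]
correct = sum(pred == row[4] for pred, row in zip(predictions, SAMPLE_ROWS))
accuracy = correct / len(SAMPLE_ROWS)
print(predictions, accuracy)
```

A rule like this makes a useful baseline: a trained model should do at least as well on the full dataset.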
03. Train Dataset (Beginner)
The train dataset is also a very popular dataset on Kaggle. This dataset contains information about passengers riding Amtrak trains that travel between Boston and Washington, DC.
The goal is to predict whether a passenger will get off at a particular stop. For example, the data can show whether passengers are more likely to get off in Baltimore than in Philadelphia.
The variables in the dataset are as follows.
Age
Rail type (road, freight)
Weekend or holiday
......
Based on these variables, there are multiple ways to predict whether someone will get off at a particular stop.
Link to train dataset.
https://www.kaggle.com/c/train-occupancy-prediction/data
04. Boston Housing Dataset (Beginner)
The Boston Housing Dataset contains information about housing in the Boston area. The classic version has 506 records and 14 variables, and the usual goal is to predict the median home value in each area, which is a regression problem. You can also turn it into a classification problem by grouping prices into three categories: expensive, normal, and cheap.
The variables include
Per capita crime rate
Pupil-teacher ratio
Average number of rooms per dwelling
......
If you are interested in the field of data science, this dataset is a good one to try. The content is interesting and not too difficult.
Link to the Boston Housing Dataset.
https://www.kaggle.com/c/boston-housing
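Grouping prices into expensive, normal, and cheap means turning a numeric price into a label. The sketch below shows one way to do that, assuming prices are expressed in thousands of dollars. The function name price_category, the cut-offs of 20 and 30, and the sample values are all arbitrary assumptions chosen for illustration.

```python
def price_category(price, low=20.0, high=30.0):
    """Map a home price (in $1000s) to a label; the cut-offs are arbitrary."""
    if price < low:
        return "cheap"
    if price < high:
        return "normal"
    return "expensive"


# Made-up prices in $1000s, for illustration only.
sample_prices = [24.0, 21.6, 34.7, 16.5, 50.0]
labels = [price_category(p) for p in sample_prices]
print(labels)
```

In practice, cut-offs are often chosen from the data itself (for example, the lower and upper thirds of the price distribution) so that the three categories are roughly balanced.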
05. Alcohol and Drug Relationships (Intermediate)
The Alcohol and Drug Relationships dataset is a great dataset for practicing data visualization skills. It contains information about the interactions between different drugs.
The goal is to predict whether two drugs will interact with each other based on their chemical structures. For example, a record can indicate whether a pair such as ibuprofen and paracetamol interacts. (Ibuprofen is a nonsteroidal anti-inflammatory drug, or NSAID; paracetamol is generally not classified as one.)
The variables in the dataset include
Drug A structure (compound)
Drug B structure (compound)
Drug A and B activity (yes/no)
......
To practice, try creating charts that show the interactions between different drugs.
Link to the Alcohol and Drugs dataset.
https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018
06. Wisconsin Breast Cancer (Intermediate)
For those who are more experienced in data science, the Wisconsin Breast Cancer dataset is a great challenge. It contains 569 records with 30 numeric features computed from digitized images of breast tissue samples.
The goal is to predict whether a tumor is malignant or benign based on these features.
For example, you can see from the dataset that malignant tumors tend to have a larger mean radius and area than benign ones.
The variables in the dataset are
Radius
Texture
Perimeter
......
There are some tutorials online on how to handle this dataset. If you want to challenge yourself, try predicting the diagnosis from the size-related features (radius, perimeter, area) alone.
Link to the Wisconsin Breast Cancer Dataset.
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
07. Pima Indians Diabetes (Intermediate)
This dataset is about predicting diabetes. It contains 768 patient records, each with 8 medical measurements and an outcome label.
The variables include
Number of pregnancies
Glucose level
Blood pressure
BMI
Age
......
The goal of this challenge is to predict whether a patient will develop diabetes within five years. This is a good way to practice your skills with binary classification problems.
Link to the Pima Indians Diabetes dataset.
https://www.kaggle.com/uciml/pima-indians-diabetes-database
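When practicing binary classification, it is worth computing a baseline first so that a model's accuracy has something to be compared against. The following is a minimal sketch of a majority-class baseline. The labels are made up (1 = diabetes, 0 = no diabetes) and do not come from the dataset; the helper names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_label(labels):
    """Return the most common label in the given list."""
    return Counter(labels).most_common(1)[0][0]


# Made-up outcome labels for illustration (1 = diabetes, 0 = no diabetes).
y_train = [0, 0, 1, 0, 1, 0, 0, 1]
y_test = [0, 1, 0, 0, 1]

majority = majority_label(y_train)
baseline_predictions = [majority] * len(y_test)  # always predict the majority class
baseline_accuracy = accuracy(y_test, baseline_predictions)
print(majority, baseline_accuracy)
```

Any classifier trained on the real data should beat this number; if it does not, the features or the training setup deserve a second look.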
08. Amazon Review Dataset (Intermediate)
The Amazon reviews dataset is good for practicing text analysis. It contains reviews of products on Amazon.com.
This dataset is interesting in that there are both positive and negative reviews. The goal of the dataset is to predict whether the reviews are positive or negative.
The variables are
Review text (a string)
Sentiment label (positive or negative)
There are also many tutorials on how to handle this dataset. To make it more difficult, try building your own sentiment classifier from the raw review text instead of following a tutorial.
Link to the Amazon review dataset.
https://www.kaggle.com/bittlingmayer/amazonreviews
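The first step in any text analysis is getting the label and the review text apart. The sketch below assumes the files use a fastText-style layout, with one review per line prefixed by __label__1 (negative) or __label__2 (positive); check the dataset page to confirm this before relying on it. The two sample lines are invented for illustration.

```python
from collections import Counter


def parse_line(line):
    """Split '__label__N review text' into (is_positive, text).

    Assumes __label__1 means negative and __label__2 means positive.
    """
    label, _, text = line.strip().partition(" ")
    if label not in ("__label__1", "__label__2"):
        raise ValueError(f"unexpected label: {label!r}")
    return label == "__label__2", text


# Invented sample lines in the assumed format (not real reviews).
SAMPLE_LINES = [
    "__label__2 Great sound: I love these headphones, they work perfectly.",
    "__label__1 Broke in a week: Terrible quality, do not buy.",
]

parsed = [parse_line(line) for line in SAMPLE_LINES]

# Very rough bag-of-words: lowercase, split on whitespace, count tokens.
word_counts = Counter(
    word for _, text in parsed for word in text.lower().split()
)
print(parsed[0][0], parsed[1][0], word_counts["do"])
```

Word counts like these are the simplest possible features; a real model would also need proper tokenization and punctuation handling.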
09. MNIST Handwritten Digit Recognition (Advanced)
This dataset contains images of handwritten digits, each 28x28 pixels in size. The original MNIST dataset has 60,000 training instances and 10,000 test instances; the Kaggle competition linked below splits the same kind of images into 42,000 training and 28,000 test instances.
The goal is to classify each image as the digit (0 to 9) it shows. For this type of problem, a convolutional neural network (CNN) is usually used.
There are many tutorials online on how to approach this type of problem, so I suggest you start with the basics and then move on to more advanced methods.
Link to the MNIST handwritten digit dataset.
https://www.kaggle.com/c/digit-recognizer
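A CNN expects a two-dimensional image, while a CSV file stores each image as one flat row. The sketch below assumes each image arrives as 784 pixel values (0 to 255) in row-major order, which is how the competition's CSV files are commonly described, and reshapes it into a 28x28 grid scaled to the range 0 to 1. The image here is synthetic, not a real digit.

```python
SIDE = 28


def to_image(flat_pixels, side=SIDE):
    """Reshape a flat row-major pixel list into side x side rows, scaled to [0, 1]."""
    if len(flat_pixels) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(flat_pixels)}")
    return [
        [flat_pixels[r * side + c] / 255.0 for c in range(side)]
        for r in range(side)
    ]


# Synthetic example: a blank image with one bright vertical stroke
# in column 14, covering rows 4 through 23.
flat = [0] * (SIDE * SIDE)
for r in range(4, 24):
    flat[r * SIDE + 14] = 255

image = to_image(flat)
print(len(image), len(image[0]), image[10][14])
```

Scaling pixel values to a small range is a common preprocessing step because it tends to make neural network training more stable.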
10. CIFAR-100 (Advanced)
The CIFAR-100 dataset is great for practicing your machine learning skills. The dataset contains 60,000 color images in 100 classes (600 images per class), such as apples, bicycles, and tigers. Each image is 32x32 pixels and has three color channels (red, green, and blue).
The goal is to predict which of the 100 classes each image belongs to.
The variables in the dataset are
Pixels
Red channel
Green channel
Blue channel
......
There are many tutorials on how to tackle this challenge. To make it more difficult, try predicting the labels of images that have been distorted or transformed in some way.
Link to the CIFAR-100 dataset.
https://www.kaggle.com/fedesoriano/cifar100
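The pixel and channel variables are easier to work with once you know how they are laid out. The sketch below assumes the layout used by the original CIFAR Python files: each image is a flat list of 3,072 values, with the 1,024 red values first, then green, then blue, each plane in row-major order. Whether the linked Kaggle copy uses exactly this layout is an assumption to verify. The image is synthetic.

```python
SIDE = 32
PLANE = SIDE * SIDE  # number of values in one color channel


def split_channels(flat):
    """Split a flat 3072-value image into (red, green, blue) planes."""
    if len(flat) != 3 * PLANE:
        raise ValueError(f"expected {3 * PLANE} values, got {len(flat)}")
    return flat[:PLANE], flat[PLANE:2 * PLANE], flat[2 * PLANE:]


def pixel_at(flat, row, col):
    """Return the (r, g, b) tuple of the pixel at the given row and column."""
    idx = row * SIDE + col
    return flat[idx], flat[PLANE + idx], flat[2 * PLANE + idx]


# Synthetic image: pure red everywhere, except one blue pixel at row 0, column 1.
flat = [255] * PLANE + [0] * PLANE + [0] * PLANE
flat[1] = 0                # clear red at (0, 1)
flat[2 * PLANE + 1] = 255  # set blue at (0, 1)

red, green, blue = split_channels(flat)
print(pixel_at(flat, 0, 0), pixel_at(flat, 0, 1))
```

Most deep learning libraries expect either a channels-first or a channels-last array, so knowing the stored order tells you which reshape or transpose is needed.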
Concluding remarks.
The 10 datasets listed in this article are a great way to hone your data analysis skills. If you are just starting out, you can try some of the simpler datasets first and progress from easy to hard.
Reference link.
https://towardsdatascience.com/10-datasets-from-kaggle-you-should-practice-on-to-improve-your-data-science-skills-6d671996177