
The Most Comprehensive Guide to K-Means Clustering You’ll Ever Need

Source: 分析大师 | 2019-08-19 | Published by: BOB体育娱乐平台之家

I love working on recommendation engines. Whenever I come across a recommendation engine on a website, I can’t wait to break it down and understand how it works underneath. It’s one of the many great things about being a data scientist!

What truly fascinates me about these systems is how we can group similar items, products, and users together. This grouping, or segmenting, works across industries. And that’s what makes the concept of clustering such an important one in data science.

Clustering helps us understand our data in a unique way – by grouping things together into – you guessed it – clusters.

In this article, we will cover k-means clustering and its components comprehensively. We’ll look at clustering, why it matters, and its applications, and then deep dive into k-means clustering (including how to perform it in Python on a real-world dataset).

Let’s kick things off with a simple example. A bank wants to give credit card offers to its customers. Currently, it looks at the details of each customer and, based on this information, decides which offer should be given to which customer.

Now, the bank can potentially have millions of customers. Does it make sense to look at the details of each customer separately and then make a decision? Certainly not! It is a manual process and will take a huge amount of time.

So what can the bank do? One option is to segment its customers into different groups. For instance, the bank can group the customers into three segments based on their income.

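As a rough illustration of income-based segmentation, here is a minimal sketch in plain Python. The thresholds, customer IDs, incomes, and the `income_group` helper are all made up for this example and do not come from the article. Note that this is rule-based bucketing with hand-picked cut-offs; a clustering algorithm would instead derive the groups from the data itself.

```python
# Hypothetical income thresholds (illustrative assumptions, not from the article).
LOW_MAX = 30_000    # incomes below this count as "low"
HIGH_MIN = 80_000   # incomes at or above this count as "high"


def income_group(income):
    """Return the segment label for a single annual income."""
    if income < LOW_MAX:
        return "low"
    if income < HIGH_MIN:
        return "average"
    return "high"


# Made-up customers as (customer_id, annual_income) pairs.
customers = [
    ("C1", 12_000),
    ("C2", 45_000),
    ("C3", 95_000),
    ("C4", 28_500),
    ("C5", 80_000),
]

# Collect customer IDs under their segment label.
segments = {}
for customer_id, income in customers:
    segments.setdefault(income_group(income), []).append(customer_id)

print(segments)
```

With the groups in hand, the bank would only need one offer per key in `segments` rather than one per customer.
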
Can you see where I’m going with this? The bank can now make three different strategies or offers, one for each group. Instead of creating different strategies for individual customers, it only has to make three strategies. This reduces the effort as well as the time.

The groups described above are known as clusters, and the process of creating these groups is known as clustering. Formally, we can say that:

Clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data.

Can you guess which type of learning problem clustering is? Is it a supervised or unsupervised learning problem?

Think about it for a moment and make use of the example we just saw. Got it? Clustering is an unsupervised learning problem!

Let’s say you are working on a project where you need to predict the sales of a big mart, or a project where your task is to predict whether a loan will be approved or not.

We have a fixed target to predict in both of these situations. In the sales prediction problem, we have to predict Item_Outlet_Sales based on outlet_size, outlet_location_type, etc., and in the loan approval problem, we have to predict Loan_Status based on the gender, marital status, income of the customers, etc.

So, when we have a target variable to predict based on a given set of predictors or independent variables, such problems are called supervised learning problems.

Now, there might be situations where we do not have any target variable to predict. Such problems, without any fixed target variable, are known as unsupervised learning problems. In these problems, we only have the independent variables and no target/dependent variable.

In clustering, we do not have a target to predict. We look at the data and then try to club similar observations and form different groups. Hence it is an unsupervised learning problem.

We now know what clusters are and the concept of clustering. Next, let’s look at the properties of these clusters which we must consider while forming the clusters.

How about another example? We’ll take the same bank as before, which wants to segment its customers. For simplicity, let’s say the bank only wants to use income and debt to make the segmentation. They collected the customer data and used a scatter plot to visualize it.

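To make the unsupervised setting concrete, the sketch below groups made-up (income, debt) pairs with no target column at all, using a small hand-written k-means loop. The data values, the fixed starting centroids, and the helper names `closest` and `kmeans` are assumptions for illustration only. Library implementations such as scikit-learn's `KMeans` pick starting centroids differently (by default with k-means++), so treat this as a sketch of the idea rather than a reference implementation. It needs Python 3.8+ for `math.dist`.

```python
import math

# Made-up customers as (income, debt) pairs, in thousands. No target column.
points = [
    (20, 5), (22, 7), (25, 6),
    (60, 30), (62, 28), (65, 33),
    (100, 10), (105, 12), (98, 8),
]


def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = [closest(p, centroids) for p in points]

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place

        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return labels, centroids


# Assumption: fixed starting centroids so the run is deterministic.
start = [points[0], points[3], points[6]]
labels, centroids = kmeans(points, start)
print(labels)
print(centroids)
```

The only inputs are the two independent variables; the group labels are produced by the algorithm rather than supplied in advance, which is what separates this from the sales and loan prediction problems.
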
Original article: https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/