AI and Data Science: Practical Implementation of Two Machine Learning Algorithms in Python
K-nearest neighbours (classification, supervised learning) and k-means clustering (unsupervised learning)
1. K-Nearest Neighbours
The k-nearest neighbours (KNN) algorithm is a non-parametric, supervised learning method that uses proximity (a distance metric such as Euclidean distance) to classify or predict the group of an individual data point. It is one of the most popular, useful, and simplest classifiers used in machine learning, and the same idea also works for regression.
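To make the "proximity" idea concrete, here is a minimal from-scratch sketch (the function name `knn_predict` and the toy data are my own, not part of any library): classify a new point by a majority vote among its k nearest training points under Euclidean distance.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest samples
    nearest = np.argsort(distances)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny toy data: two well-separated groups
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # nearest neighbours are mostly class 0
```

This is only an illustration; scikit-learn's `KNeighborsClassifier`, used below, adds efficient neighbour search and many options on top of the same idea.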
The Iris dataset is an ideal benchmark for beginners learning classification. It contains 150 samples of iris flowers, with 50 samples for each of three species: Iris setosa, Iris versicolor, and Iris virginica. Each sample has four numerical features (measured in cm): sepal length, sepal width, petal length, and petal width. Visualizations often show that Iris setosa is linearly separable from the other two species, while versicolor and virginica have some overlap.
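The separability claim is easy to check yourself. Here is a small sketch (the output filename `iris_petals.png` is arbitrary) that plots petal length against petal width by species and prints the gap between setosa and the rest:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Petal length (column 2) vs petal width (column 3), coloured by species
fig, ax = plt.subplots(figsize=(6, 4))
for label, name in enumerate(iris.target_names):
    ax.scatter(X[y == label, 2], X[y == label, 3], label=name)
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.legend()
fig.savefig("iris_petals.png")

# Setosa's petal lengths do not overlap with the other two species
print(X[y == 0, 2].max(), X[y != 0, 2].min())
```

The largest setosa petal length is smaller than the smallest petal length of the other two species, which is why setosa is so easy to separate.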
```python
# Step 1: Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# from sklearn.preprocessing import StandardScaler  # for simplicity we do not use it
```
```python
# Step 2: Load the Iris dataset
iris = load_iris()
```
```python
# Step 3 (optional): Inspect five sample rows
# Create a DataFrame with the features
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add the target column to the DataFrame
df['target'] = iris.target
# Optional: make it more readable by adding the actual species names instead of just 0, 1, 2
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# View the first 5 rows (including the target)
df.head()
```
|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | species |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |
```python
# Step 4: Assign the data to variables
X = iris.data    # Features (sepal length, etc.) - independent variables (feature matrix)
y = iris.target  # Target labels (species) - dependent variable (target vector)
# Supervised learning labels: 0 = Iris setosa, 1 = Iris versicolor, 2 = Iris virginica
```
```python
# Step 5: Split the data into training and testing sets
# A 75:25 split is used here (80:20 is also a common choice).
# The random_state ensures reproducibility of the split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```
Optional (but recommended) step: standardize the features. KNN is distance-based, so features should be on a similar scale; a `StandardScaler` would handle this. We skip it here to keep the example simple, and all four Iris features are already measured in centimetres.
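For reference, if we did want to scale, a sketch of the usual pattern follows: fit the scaler on the training data only, then transform both splits before fitting KNN (fitting the scaler on all the data would leak test information into training).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
print(f"Accuracy with scaling: {knn.score(X_test_scaled, y_test):.2%}")
```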
```python
# Step 6: Create and train the KNN classifier
# We set n_neighbors (k) to 5, a common starting point.
k = 5
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)  # fit (train) the model
```
```python
# Step 7: Make predictions on the test set
y_pred = knn.predict(X_test)
```
```python
# Step 8: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy for k={k}: {accuracy:.2%}")
```
```
Model accuracy for k=5: 100.00%
```
A nice result: the model reaches 100.00% accuracy for k=5 on this particular test split.
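Perfect accuracy on one small test split can be optimistic, so it is worth checking several values of k with cross-validation, which averages accuracy over multiple train/test splits. A small sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
mean_scores = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation gives a more robust estimate than a single split
    mean_scores[k] = cross_val_score(knn, iris.data, iris.target, cv=5).mean()
    print(f"k={k}: mean CV accuracy = {mean_scores[k]:.2%}")
```

On Iris, all of these k values score well; on harder datasets the choice of k matters much more (small k overfits, large k oversmooths).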
```python
# Step 9: Test with a new, single data point (supplied by us)
new_observation = np.array([[5, 3, 1, 1.2]])  # example measurements in cm
# new_observation_scaled = scaler.transform(new_observation)
predicted_class = knn.predict(new_observation)  # if we had scaled, we would pass new_observation_scaled
predicted_species = iris.target_names[predicted_class]
print(f"\nFeatures: {new_observation[0]} --> Predicted species: {predicted_species[0]}")
```
```
Features: [5. 3. 1. 1.2] --> Predicted species: setosa
```
```python
# Repeat Step 9 with another data point
new_observation = np.array([[4, 0.3, 4, 0.2]])  # example measurements in cm
# new_observation_scaled = scaler.transform(new_observation)
predicted_class = knn.predict(new_observation)  # if we had scaled, we would pass new_observation_scaled
predicted_species = iris.target_names[predicted_class]
print(f"\nFeatures: {new_observation[0]} --> Predicted species: {predicted_species[0]}")
```
```
Features: [4. 0.3 4. 0.2] --> Predicted species: versicolor
```
2. K-Means Clustering
K-Means Clustering is a popular unsupervised machine learning algorithm.
The “K” in K-Means represents the number of clusters (groups) you want the algorithm to find, and the “Means” refers to averaging the data points in each cluster to find its centre. The step-by-step process is: initialization, assignment, update (the “mean”), and repetition until the clusters converge or a pre-specified stopping point is reached.
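The steps above can be sketched in plain NumPy (a toy illustration with my own `kmeans` function; scikit-learn's implementation, used below, is far more robust, with smarter initialization and handling of edge cases):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update (the "mean"): recompute each centroid as its cluster's average
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until converged (centroids stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two obvious blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centers = kmeans(X, k=2)
print(labels)  # the first two points share one cluster, the last two the other
```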
Here we will use scikit-learn and the Iris data.
```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Step 1: Load the data
iris = load_iris()
X = iris.data         # Features: sepal/petal length and width
y_true = iris.target  # We save the answers to check our work later!
# (We will not use this in training or testing, as this is unsupervised learning.)

# Step 2: Apply K-Means (3 clusters for the 3 iris species)
kmeans = KMeans(n_clusters=3, random_state=42, n_init="auto")
y_kmeans = kmeans.fit_predict(X)

# Step 3 (optional): Compare the K-Means clusters to the actual species
df = pd.DataFrame({'Actual Species': y_true, 'K-Means Cluster': y_kmeans})
# Create a cross-tabulation table to see how the groups align
comparison_table = pd.crosstab(df['Actual Species'], df['K-Means Cluster'])
print(comparison_table)

centers = kmeans.cluster_centers_

# Step 4: Visualize the clusters (using the first two features: sepal length/width)
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', edgecolor='k', s=100)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X', label='Centroids')
plt.title("K-Means Clustering on the Iris Dataset")
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Sepal Width (cm)")
plt.legend()
plt.show()
```
| Actual Species | K-Means Cluster 0 | K-Means Cluster 1 | K-Means Cluster 2 |
|---|---|---|---|
| 0 | 50 | 0 | 0 |
| 1 | 0 | 48 | 2 |
| 2 | 0 | 14 | 36 |
How to Read the Table
Reading this table tells you exactly where K-Means succeeded and where it struggled:
The Perfect Match (Row 0): Look at the first row. All 50 of the actual Iris Setosa flowers were placed into K-Means Cluster 0. None of them were placed in Cluster 1 or 2. This means K-Means perfectly identified this species without ever being told what it was!
The “Confusion” (Rows 1 and 2): Look at the second row (Iris Versicolor). 48 of them were placed into Cluster 1, but 2 of them accidentally ended up in Cluster 2.
The Overlap: Now look at the third row (Iris Virginica). 36 were correctly grouped into Cluster 2, but 14 were lumped into Cluster 1. This tells you exactly what we saw on the scatter plot: Versicolor and Virginica have very similar physical measurements, so the algorithm had a harder time drawing a clean boundary between them.
The “Label Shuffle” Rule
When you look at your own output, remember that the column headers (0, 1, 2) are completely arbitrary. Cluster 2 in K-Means does not necessarily mean Species 2. You are just looking for the highest numbers in each row and column to figure out which K-Means bucket corresponds to which actual species.