import arff
import torch
import copy
import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from torch import nn, optim
import torch.nn.functional as F
- arff: Used to work with datasets in the ARFF format (common in machine learning). The THz micromobility dataset used here is stored in ARFF files.
- torch: A deep learning library for creating and training neural networks.
- copy: Allows deep copying of objects. This might be used later for duplicating models, data structures, etc.
- numpy: Fundamental library for numerical computations.
- pandas: For handling tabular data in DataFrames.
- seaborn and matplotlib: Visualization libraries to create graphs and charts.
- sklearn: Tools for machine learning, including splitting data into training and testing sets.
- torch.nn, optim, and F: Submodules of PyTorch for defining models (nn), optimization (optim), and functional-style operations such as activations and losses (F, short for torch.nn.functional).
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%matplotlib inline: Ensures that matplotlib plots are displayed directly inside the Jupyter notebook.
%config InlineBackend.figure_format = 'retina': Renders plots at a higher resolution for better clarity.
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))
sns.set(): Customizes the appearance of Seaborn plots with:
- style='whitegrid': Adds a grid background to plots.
- palette='muted': Sets muted colors as the default palette.
- font_scale=1.2: Scales up font sizes in plots for better readability.
HAPPY_COLORS_PALETTE: A custom color palette. The call to sns.set_palette() makes it the active palette, replacing the 'muted' palette chosen in sns.set().
rcParams['figure.figsize'] = 12, 8
Configures the default size of all plots to be 12 inches wide and 8 inches tall.
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
RANDOM_SEED = 42: A fixed seed value for reproducibility, so that runs with the same code and environment draw the same random numbers.
np.random.seed(): Seeds NumPy's global random number generator.
torch.manual_seed(): Seeds PyTorch's random number generator.
Together these make the random steps in data preprocessing and training repeatable, although some GPU operations can still be nondeterministic.
output: <torch._C.Generator at 0x10a91bb30>
The output <torch._C.Generator at 0x10a91bb30> is generated by the line torch.manual_seed(RANDOM_SEED). It’s an object that represents the state of PyTorch's random number generator. This is normal and indicates that the seed has been successfully set.
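The effect of seeding can be checked with a small sketch. It uses only NumPy so it runs without PyTorch; the variable names are illustrative and not part of the original notebook. torch.manual_seed() plays the same role for PyTorch's generator.

```python
import numpy as np

RANDOM_SEED = 42

# Seed, then draw three random numbers.
np.random.seed(RANDOM_SEED)
first_draw = np.random.rand(3)

# Re-seeding with the same value restarts the sequence.
np.random.seed(RANDOM_SEED)
second_draw = np.random.rand(3)

# Drawing again without re-seeding continues the sequence instead.
third_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))  # same seed, same numbers
print(np.array_equal(second_draw, third_draw))  # later draws differ
```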
import arff
# Load the ARFF files
with open('Train70%alloriginal.arff', 'r') as f:
    train_dataset = arff.load(f)
with open('Test30%alloriginal.arff', 'r') as f:
    test_dataset = arff.load(f)
- The arff library is used to read ARFF files, which are structured data files commonly used in machine learning (e.g., WEKA datasets). arff.load(f) reads and parses the contents of the ARFF file into Python objects.
train_data = train_dataset['data']
train_attributes = train_dataset['attributes']
test_data = test_dataset['data']
test_attributes = test_dataset['attributes']
The parsed ARFF file is a dictionary with keys such as data and attributes:
- data: Contains the actual dataset rows (features and labels).
- attributes: Describes the column names and types in the dataset (e.g., features and labels).
Printing Data
# Print the first 14 rows of the training data
for row in train_data[:14]:
    print(row)
This loop iterates through the first 14 rows of the training data and prints them. From the screenshots, we can see that each row contains numeric values corresponding to the features of a trace, and that the last value in each row is a label: 1 indicates oscillation and 2 indicates non-blocked.
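Because the label sits in the last position of each row, features and label can be separated with simple slicing. The sketch below uses made-up toy rows (not values from the dataset) and a hypothetical label_names lookup to illustrate the idea.

```python
# Toy rows standing in for train_data: feature values followed by a label.
# These numbers are invented for illustration only.
toy_rows = [
    [-52.06, -52.05, -52.04, 1.0],
    [-48.10, -48.20, -48.30, 2.0],
]

# Hypothetical lookup from numeric label to a readable name.
label_names = {1: 'oscillation', 2: 'non-blocked'}

# Split every row into (features, label); the label is the last value.
parsed = [(row[:-1], int(row[-1])) for row in toy_rows]

for features, label in parsed:
    print(len(features), 'features ->', label_names[label])
```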
Function load_arff_to_dataframe
- Reads an ARFF file using the arff module.
- Extracts the attributes (column names) and data (rows) from the ARFF file.
- Converts these into a Pandas DataFrame, with attribute names as column headers.
- Optionally ensures the target column (e.g., "Target") is treated as a categorical variable, which is useful for classification tasks.
attribute_names = [attr[0] for attr in dataset['attributes']]
Extracts column names from the attributes field of the ARFF file.
data = [list(row) for row in dataset['data']]
Converts the rows into a list of lists, making them compatible with Pandas.
df = pd.DataFrame(data, columns=attribute_names)
Creates a DataFrame with the extracted column names and data.
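The conversion steps can be combined into one helper. The following is a minimal sketch, not the notebook's actual load_arff_to_dataframe: it assumes the ARFF file has already been parsed into a dictionary with attributes and data keys, and the names arff_dict_to_dataframe, target_column, and toy_dataset are hypothetical. The toy dictionary is hand-built so the example runs without an ARFF file.

```python
import pandas as pd

def arff_dict_to_dataframe(dataset, target_column=None):
    """Convert a parsed ARFF dictionary into a DataFrame (illustrative sketch)."""
    # Column names are the first element of each attribute entry.
    attribute_names = [attr[0] for attr in dataset['attributes']]
    # Rows become a list of lists so pandas can consume them.
    data = [list(row) for row in dataset['data']]
    df = pd.DataFrame(data, columns=attribute_names)
    # Optionally treat the target column as categorical.
    if target_column is not None and target_column in df.columns:
        df[target_column] = df[target_column].astype('category')
    return df

# Hand-built stand-in for a parsed ARFF file (assumed structure, invented values).
toy_dataset = {
    'attributes': [('att1', 'NUMERIC'), ('att2', 'NUMERIC'), ('Target', ['1', '2'])],
    'data': [[-52.06, -52.05, '1'], [-48.10, -48.20, '2']],
}

toy_df = arff_dict_to_dataframe(toy_dataset, target_column='Target')
print(toy_df.dtypes)
```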
print(train.head())
print(test.head())
Displays the first 5 rows of both datasets.
# Concatenate train and test DataFrames
df = pd.concat([train, test])
# Shuffle the DataFrame (shuffle rows randomly)
df = df.sample(frac=1.0)
# Check the shape of the combined and shuffled DataFrame
print(df.shape)
pd.concat([train, test]): Combines train and test vertically (row-wise).
.sample(frac=1.0): Shuffles all rows randomly while keeping all data (frac=1.0 means 100% of rows are sampled).
df.shape: Displays the shape of the resulting DataFrame.
Output: (480, 1501)
This means the train and test DataFrames combined have 480 rows and 1501 columns, which matches expectations based on our dataset structure.
# Use the test DataFrame on its own (pd.concat with a single frame keeps the same rows)
df = pd.concat([test])
# Shuffle the DataFrame (shuffle rows randomly)
df = df.sample(frac=1.0)
# Check the shape of the shuffled test DataFrame
print(df.shape)
pd.concat([test]): Only includes test (no actual concatenation because test is the sole DataFrame).
.sample(frac=1.0): Shuffles the rows in the test DataFrame.
df.shape: Displays the shape of the resulting DataFrame.
Output: (144, 1501)
This reflects that the test DataFrame alone has 144 rows and 1501 columns, consistent with our dataset structure.
# Use the train DataFrame on its own (pd.concat with a single frame keeps the same rows)
df = pd.concat([train])
# Shuffle the DataFrame (shuffle rows randomly)
df = df.sample(frac=1.0)
# Check the shape of the shuffled train DataFrame
print(df.shape)
This reflects that the train DataFrame alone has 336 rows and 1501 columns, again consistent with our dataset structure.
print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")
# Concatenate train and test DataFrames
df = pd.concat([train, test], axis=0)
# Shuffle the combined DataFrame (randomizes the row order)
df = df.sample(frac=1.0, random_state=42)
# Check the shape of the concatenated and shuffled DataFrame
print(f"Combined DataFrame shape: {df.shape}")
# Display the first 5 rows of the shuffled DataFrame
print(df.head())
Check the shapes of train and test to verify they match the expected dimensions:
Train shape: (336, 1501)
Test shape: (144, 1501)
Combined DataFrame shape: (480, 1501)
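The same checks can be written as assertions. The sketch below uses two small toy DataFrames (invented values, hypothetical names toy_train and toy_test) to show that concatenation adds the row counts and that shuffling only reorders rows. The reset_index(drop=True) call is an optional extra, not part of the original code; it replaces the shuffled index labels with a clean 0..n-1 index.

```python
import pandas as pd

# Toy stand-ins for train and test (invented values).
toy_train = pd.DataFrame({'att1': [0.1, 0.2, 0.3], 'target': [1, 2, 1]})
toy_test = pd.DataFrame({'att1': [0.4, 0.5], 'target': [2, 1]})

# Row-wise concatenation: row counts add up, columns stay the same.
combined = pd.concat([toy_train, toy_test], axis=0)

# Shuffle with a fixed random_state so the order is repeatable,
# then drop the old index labels.
shuffled = combined.sample(frac=1.0, random_state=42).reset_index(drop=True)

print(combined.shape)
print(shuffled.shape)
# Shuffling reorders rows but does not add, drop, or change them.
print(sorted(shuffled['att1']) == sorted(combined['att1']))
```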
CLASS_OSCILLATION = 1
class_names = ['Oscillation','Nonblocked']
CLASS_OSCILLATION = 1: A constant is defined to represent the "Oscillation" class with a value of 1. This is useful if you want to refer to the class in your code later without hardcoding the number.
class_names = ['Oscillation', 'Nonblocked']: A list of class names is defined. These are human-readable labels for the numeric target values (1 and 2). They are used to make the plot more interpretable.
new_columns = list(df.columns)
new_columns[-1] = 'target'
df.columns = new_columns
This renames the last column of your DataFrame to target, which is presumably the column containing the class labels (1 or 2).
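An equivalent way to rename only the last column is DataFrame.rename, which returns a new DataFrame and leaves the original untouched. This is an alternative sketch on a toy DataFrame with invented values, not the notebook's own approach.

```python
import pandas as pd

# Toy DataFrame whose last column holds the class label (invented values).
toy = pd.DataFrame(
    [[0.1, 0.2, 1], [0.3, 0.4, 2]],
    columns=['att1', 'att2', 'Target'],
)

# Map the current name of the last column to 'target'.
renamed = toy.rename(columns={toy.columns[-1]: 'target'})

print(list(renamed.columns))
```

This assumes the column names are unique; if another column shared the last column's name, it would be renamed as well.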
print(df.target.value_counts())
value_counts() counts the occurrences of each unique value in the target column.
target
2 240
1 239
Name: count, dtype: int64
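There are 240 examples of class 2 (Nonblocked) and 239 examples of class 1 (Oscillation), so the two classes are nearly balanced. These counts sum to 479, one fewer than the 480 rows reported earlier; value_counts() skips missing values by default, so one row may have a missing label.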
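To make the counts easier to read, the numeric labels can be mapped to class names. The sketch below works on a toy target column with invented values, and the label_to_name dictionary is a hypothetical helper that mirrors the order of class_names.

```python
import pandas as pd

class_names = ['Oscillation', 'Nonblocked']
# Hypothetical mapping from numeric label to class name (1 -> first, 2 -> second).
label_to_name = {1: class_names[0], 2: class_names[1]}

# Toy target column (invented values).
toy = pd.DataFrame({'target': [1, 2, 2, 1, 2]})

# Count each label, then relabel the index with readable names.
counts = toy['target'].value_counts()
named_counts = counts.rename(index=label_to_name)

print(named_counts)
```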
ax = sns.countplot(x=df.target)
ax.set_xticklabels(class_names);
This creates a bar plot using Seaborn. Each bar represents the count of rows for a particular class (1 or 2) in the target column.
The plot_time_series_class function plots a time series for a given class, with a smoothed version of the series (rolling mean) and a shaded region that represents the variability around the smoothed curve (a band of two rolling standard deviations, referred to here as the confidence interval).
def plot_time_series_class(data, class_name, ax, n_steps=10):
- data: A single time-series sequence (e.g., one row or column from your dataset). This represents a specific example of the time-series values for one class (e.g., Oscillation).
- class_name: The name of the class being plotted, such as "Oscillation" or "Nonblocked". This will be used to label the plot.
- ax: The matplotlib Axes object where the plot will be drawn.
- n_steps: The number of steps for the rolling window. This determines how smooth the rolling average and standard deviation will appear.
time_series_df = pd.DataFrame(data)
Converts the input time series into a DataFrame, making it easier to calculate rolling statistics. For example, if you pass a row of time-series data for "Oscillation," this step creates a structured DataFrame with one column containing the series.
Example from our dataset: If our row for "Oscillation" looks like this:
[-52.062580, -52.060561, -52.058553, ..., -52.322579]
time_series_df will look like:
           0
0 -52.062580
1 -52.060561
2 -52.058553
...
smooth_path = time_series_df.rolling(n_steps).mean()
Computes the rolling mean over n_steps. This smooths the raw time-series data, reducing noise and revealing trends in "Oscillation" or "Nonblocked."
Example: With n_steps=10, each value in the smoothed series is the average of the current value and the 9 values before it, so the first 9 entries are NaN. This helps us to see the trend more clearly.
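A small worked example makes the windowing concrete. It uses a window of 3 instead of 10 and invented values so the averages can be checked by hand; the variable names are illustrative.

```python
import pandas as pd

n_steps = 3  # small window so the arithmetic is easy to follow
toy_series_df = pd.DataFrame([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])

smooth = toy_series_df.rolling(n_steps).mean()

# The first n_steps - 1 entries are NaN because the window is not yet full.
# Index 2 averages 10, 12, 11; index 3 averages 12, 11, 13; and so on.
print(smooth[0].tolist())
```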
path_deviation = 2 * time_series_df.rolling(n_steps).std()
Calculates the rolling standard deviation over n_steps and multiplies it by 2. This quantifies variability in the time series and is used to create a confidence interval.
Example: If the rolling standard deviation is 0.5, then path_deviation will be 1.0. The confidence interval will extend ±1.0 from the smoothed line.
under_line = (smooth_path - path_deviation)[0]
over_line = (smooth_path + path_deviation)[0]
Computes the lower (under_line) and upper (over_line) bounds of the confidence interval by subtracting path_deviation from, or adding it to, the smoothed series. The [0] selects the single column, so each bound is a Series.
For Oscillation:
- under_line might represent the minimum expected oscillation values over time.
- over_line might represent the maximum expected oscillation values over time.
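The band computation can be traced on a short toy series. The sketch below repeats the same rolling calls with a window of 3 and invented values; the band DataFrame is only a convenience for printing and is not part of the original function.

```python
import pandas as pd

n_steps = 3  # small window so the numbers are easy to check by hand
time_series_df = pd.DataFrame([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

smooth_path = time_series_df.rolling(n_steps).mean()
# Rolling sample standard deviation, doubled to set the band half-width.
path_deviation = 2 * time_series_df.rolling(n_steps).std()

# Selecting column 0 turns each one-column DataFrame into a Series.
under_line = (smooth_path - path_deviation)[0]
over_line = (smooth_path + path_deviation)[0]

# Side-by-side view of lower bound, smoothed value, and upper bound.
band = pd.DataFrame({'under': under_line, 'smooth': smooth_path[0], 'over': over_line})
print(band)
```

For the window 1, 2, 3 the mean is 2 and the sample standard deviation is 1, so the band at that position runs from 0 to 4.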
ax.plot(smooth_path, linewidth=2)
Plots the smoothed time series as a line on the graph. This gives a clean view of the overall trend in the data.
ax.fill_between(
    path_deviation.index,
    under_line,
    over_line,
    alpha=.125
)
ax.set_title(class_name)
- Fills the region between under_line and over_line to visualize the variability (or confidence interval) around the smoothed series. For Oscillation: This shaded area shows how much the oscillation values deviate from the trend.
- Sets the title of the plot to the class name (e.g., "Oscillation" or "Nonblocked"). This helps identify which class the time series belongs to.
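Putting the fragments together, the whole function might look like the sketch below. The body is assembled from the lines discussed above. The synthetic toy_trace, the Agg backend (chosen so the code runs without a display), and the output of a single Axes are assumptions made for illustration, not part of the original notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_time_series_class(data, class_name, ax, n_steps=10):
    # One-column DataFrame so rolling statistics are easy to compute.
    time_series_df = pd.DataFrame(data)
    # Smoothed curve and a band of two rolling standard deviations.
    smooth_path = time_series_df.rolling(n_steps).mean()
    path_deviation = 2 * time_series_df.rolling(n_steps).std()
    under_line = (smooth_path - path_deviation)[0]
    over_line = (smooth_path + path_deviation)[0]
    # Draw the smoothed line, shade the band, and label the plot.
    ax.plot(smooth_path, linewidth=2)
    ax.fill_between(path_deviation.index, under_line, over_line, alpha=.125)
    ax.set_title(class_name)

# Synthetic trace (invented, not from the dataset): a noisy sine around -52.
rng = np.random.default_rng(42)
toy_trace = -52.0 + 0.1 * np.sin(np.linspace(0, 6 * np.pi, 200)) + rng.normal(0, 0.02, 200)

fig, ax = plt.subplots()
plot_time_series_class(toy_trace, 'Oscillation', ax)
```

In a notebook, the figure would be displayed inline after the cell runs; with the Agg backend it can be saved with fig.savefig instead.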