astronomicAL.active_learning

class astronomicAL.active_learning.active_learning.ActiveLearningModel(src, df, label)

This class handles the Machine Learning aspect of the codebase.

Based on the user's settings, the required features will be extracted from the data and split into train, validation and test sets. The user can then specify which classifiers and query functions they would like to use in the active learning process. The results at each stage will be displayed in various widgets and plots, allowing the user to select the correct label for the newly queried source. Each instance will train a separate one-vs-rest classifier.

Parameters:
  • src (ColumnDataSource) – The shared data source which holds the current selected source.

  • df (DataFrame) – The shared dataframe which holds all the data.

  • label (str) – The string alias of the label that will be the positive case in the one-vs-rest classifier.
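
As a rough illustration of the one-vs-rest setup described above, the sketch below turns multi-class labels into a binary target for a single positive label. The mapping labels_to_ints and the helper to_one_vs_rest are hypothetical names used only for this example; they are not taken from astronomicAL.

```python
# Hypothetical mapping from string aliases to integer labels (illustrative only).
labels_to_ints = {"Star": 0, "Galaxy": 1, "QSO": 2}


def to_one_vs_rest(y, label_alias, mapping):
    """Return 1 where a source has the positive label and 0 otherwise."""
    positive = mapping[label_alias]
    return [1 if value == positive else 0 for value in y]


y_all = [0, 1, 2, 1, 0]
y_binary = to_one_vs_rest(y_all, "Galaxy", labels_to_ints)
print(y_binary)  # [0, 1, 0, 1, 0]
```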

df

The shared dataframe which holds all the data.

Type:

DataFrame

src

The shared data source which holds the current selected source.

Type:

ColumnDataSource

_label

The label that will be the positive case in the one-vs-rest classifier.

Type:

int

_label_alias

The string alias of _label.

Type:

str

_training

Flag for whether the training process has begun.

Type:

bool

_assigned

Flag for whether the user has assigned a label to the current queried source.

Type:

bool

retrain

Flag for whether the class is retraining a previous model from within config.settings["classifiers"].

Type:

bool

scaler

The scaler used to standardise features across the train, val and test sets according to the training set. NOTE: Only initialised if config.settings["scale_data"] is True.

Type:

sklearn.preprocessing.RobustScaler

_show_test_results

Flag for whether to render the test set results column in the panel method.

Type:

bool

_seen_test_results

Flag for indicating whether the user has viewed the test results of the classifier.

Type:

bool

_show_caution

Flag for whether to show the test set caution column to the user when trying to view the test set results.

Type:

bool

_seen_caution

Flag for indicating whether the user has viewed the test results caution page.

Type:

bool

_max_x

The maximum value of the x axis of the train, val and metric plots. This is set as the min(μ(x)+4*σ(x), max(x)).

Type:

float

_max_y

The maximum value of the y axis of the train, val and metric plots. This is set as the min(μ(y)+4*σ(y), max(y)).

Type:

float

_min_x

The minimum value of the x axis of the train, val and metric plots. This is set as the max(μ(x)-4*σ(x), min(x)).

Type:

float

_min_y

The minimum value of the y axis of the train, val and metric plots. This is set as the max(μ(y)-4*σ(y), min(y)).

Type:

float
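
The four axis limits above follow the same clipping rule: take the mean plus or minus four standard deviations, but never go beyond the actual data range. A minimal sketch of that rule is shown here. The helper name axis_limits is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics


def axis_limits(values, n_sigma=4):
    """Clip plot limits to mean +/- n_sigma std devs, bounded by the data range."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    lower = max(mu - n_sigma * sigma, min(values))
    upper = min(mu + n_sigma * sigma, max(values))
    return lower, upper


# 99 ordinary values and one extreme outlier.
x = [1.0] * 99 + [1000.0]
min_x, max_x = axis_limits(x)

# The lower limit stays at the data minimum, while the upper limit is pulled
# well below the outlier so it does not dominate the plot.
print(min_x, max_x)
```

The effect is that a handful of extreme sources cannot stretch the plot so far that the bulk of the data collapses into a corner.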

_model_output_data_tr

Dictionary containing the plotting data [config.settings["default_vars"][0], config.settings["default_vars"][1], metric, y, pred] required for the train and metric plots.

Type:

dict

_model_output_data_val

Dictionary containing the plotting data [config.settings["default_vars"][0], config.settings["default_vars"][1], y, pred] required for the val plot.

Type:

dict

_accuracy_list

Dictionary containing the train and validation accuracy scores versus the number of points.

Type:

dict

_f1_list

Dictionary containing the train and validation F1 scores versus the number of points.

Type:

dict

_precision_list

Dictionary containing the train and validation precision scores versus the number of points.

Type:

dict

_recall_list

Dictionary containing the train and validation recall scores versus the number of points.

Type:

dict

_train_scores

Dictionary containing the current scores for the training set.

Type:

dict

_val_scores

Dictionary containing the current scores for the validation set.

Type:

dict

_test_scores

Dictionary containing the current scores for the testing set.

Type:

dict

corr_train

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the training sources that are currently predicted correctly.

Type:

ColumnDataSource

incorr_train

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the training sources that are currently predicted incorrectly.

Type:

ColumnDataSource

corr_val

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the validation sources that are currently predicted correctly.

Type:

ColumnDataSource

incorr_val

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the validation sources that are currently predicted incorrectly.

Type:

ColumnDataSource

queried_points

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of the current queried points.

Type:

ColumnDataSource

full_labelled_data

Dictionary containing the id and y values of all labelled points during training.

Type:

dict

assign_label_group

The group of buttons containing the possible labels for labelling during training.

Type:

Panel RadioButtonGroup Widget

assign_label_button

The button for assigning the selected label from assign_label_group to the currently queried source.

Type:

Panel Button Widget

show_queried_button

The button for making the current queried point the current selected point.

Type:

Panel Button Widget

classifier_dropdown

A dropdown menu showing all the classifiers initialised in astronomicAL.extensions.models.

Type:

Panel Select Widget

query_strategy_dropdown

A dropdown menu showing all the query strategies initialised in astronomicAL.extensions.query_strategies.

Type:

Panel Select Widget

starting_num_points

Set the number of initial randomly selected points to train on.

Type:

Panel IntInput Widget

classifier_table_source

The collection of all the currently selected classifier and query strategy pairs.

Type:

ColumnDataSource

classifier_table

The table for visualising classifier_table_source.

Type:

DataTable

add_classifier_button

The button for appending the currently selected values from classifier_dropdown and query_strategy_dropdown to classifier_table_source.

Type:

Panel Button Widget

remove_classifier_button

The button for removing the last entry from classifier_table_source.

Type:

Panel Button Widget

start_training_button

The button for beginning the training of a classifier using the selected parameters from classifier_table_source and starting_num_points.

Type:

Panel Button Widget

next_interation_button

The button to begin the next iteration of the Active Learning process. Only visible after assigning a label to the currently queried point.

Type:

Panel Button Widget

checkpoint_button

The button to save the current model and parameters required to recreate current set up.

Type:

Panel Button Widget

request_test_results_button

The button to request the current classifier's results for the test set.

Type:

Panel Button Widget

_return_to_train_view_button

The button displayed in the test set caution window, allowing the user to return to the train and validation results without seeing the test set results.

Type:

Panel Button Widget

_stop_caution_show_checkbox

A checkbox for whether the user wants to disable the test set caution window from appearing when they want to view the test set.

Type:

Panel Checkbox Widget

_view_test_results_button

The button to show the test set results to the user. If _show_caution is True, this button will show the test set caution window instead.

Type:

Panel Button Widget

_queried_is_selected

Flag for whether the current queried point is also the current selected point.

Type:

bool

setup_row

A row containing all the classifier setup settings required before the training process has begun.

Type:

Panel Row

panel_row

A row containing all the visualisation aspects of the ActiveLearningModel view.

Type:

Panel Row

conf_mat_tr_tn

The current number of true negatives in the classifier's current prediction of the training set.

Type:

str

conf_mat_tr_fn

The current number of false negatives in the classifier's current prediction of the training set.

Type:

str

conf_mat_tr_fp

The current number of false positives in the classifier's current prediction of the training set.

Type:

str

conf_mat_tr_tp

The current number of true positives in the classifier's current prediction of the training set.

Type:

str

conf_mat_val_tn

The current number of true negatives in the classifier's current prediction of the validation set.

Type:

str

conf_mat_val_fn

The current number of false negatives in the classifier's current prediction of the validation set.

Type:

str

conf_mat_val_fp

The current number of false positives in the classifier's current prediction of the validation set.

Type:

str

conf_mat_val_tp

The current number of true positives in the classifier's current prediction of the validation set.

Type:

str

conf_mat_test_tn

The current number of true negatives in the classifier's current prediction of the test set.

Type:

str

conf_mat_test_fn

The current number of false negatives in the classifier's current prediction of the test set.

Type:

str

conf_mat_test_fp

The current number of false positives in the classifier's current prediction of the test set.

Type:

str

conf_mat_test_tp

The current number of true positives in the classifier's current prediction of the test set.

Type:

str
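
The conf_mat_* attributes above hold the four cells of a binary confusion matrix as strings. The sketch below shows one way such counts could be derived from true and predicted labels; the helper confusion_counts is a hypothetical name, and the class may compute these values differently.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false negatives/positives for binary labels (1 = positive)."""
    counts = {"tn": 0, "fn": 0, "fp": 0, "tp": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 1:
            counts["fn"] += 1  # positive source predicted as negative
        elif pred == 1:
            counts["fp"] += 1  # negative source predicted as positive
        else:
            counts["tn"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)

# The attributes are typed as str, so the counts are converted for display.
conf_mat = {name: str(value) for name, value in counts.items()}
print(conf_mat)  # {'tn': '2', 'fn': '1', 'fp': '1', 'tp': '2'}
```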

all_al_data

A dataframe containing a subset of df with only the required features for training.

Type:

DataFrame

x_train

A dataframe containing all the training input data.

Type:

DataFrame

y_train

A dataframe containing all the training labels.

Type:

DataFrame

id_train

A dataframe containing all the training ids.

Type:

DataFrame

x_val

A dataframe containing all the validation input data.

Type:

DataFrame

y_val

A dataframe containing all the validation labels.

Type:

DataFrame

id_val

A dataframe containing all the validation ids.

Type:

DataFrame

x_test

A dataframe containing all the test input data.

Type:

DataFrame

y_test

A dataframe containing all the test labels.

Type:

DataFrame

id_test

A dataframe containing all the test ids.

Type:

DataFrame

x_al_train

The data that the classifier is training on.

Type:

Numpy Array

y_al_train

The labels for the data the classifier is training on.

Type:

Numpy Array

id_al_train

The ids of the data the classifier is training on.

Type:

DataFrame

x_pool

The data of the sources in the pool that are available to query from.

Type:

Numpy Array

y_pool

The labels of the sources in the pool that are available to query from.

Type:

Numpy Array

id_pool

The ids of the sources in the pool that are available to query from.

Type:

DataFrame

query_index

The current index of x_pool that contains the current queried point.

Type:

int

learner

The current classifier that is being trained. If multiple classifiers exist in classifier_table_source, then learner will be a ModAL Committee.

Type:

ModAL ActiveLearner

assign_global_data()

Assign the current train, validation and test sets to the shared ml_data dictionary so that they can be used by other classifiers.

Return type:

None

create_pool(preselected=None)

Create the pool used for query points during active learning. The training set will be split into the pool and the classifier’s training set. The number in the classifier’s training set has already been set by the user and these points will be chosen randomly from the pool.

Return type:

None
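
The pool creation described above can be pictured as a random partition of the training set. The following is a minimal sketch using NumPy arrays and a fixed seed; the variable names mirror the attributes of this class, but the code is illustrative and does not cover the preselected argument.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

x_train = np.arange(20, dtype=float).reshape(10, 2)  # 10 sources, 2 features
y_train = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])
starting_num_points = 3  # in the app this value comes from a widget

# Randomly choose the initial training points, without replacement.
initial_idx = rng.choice(len(x_train), size=starting_num_points, replace=False)

x_al_train = x_train[initial_idx]
y_al_train = y_train[initial_idx]

# Everything else stays in the pool and remains available for querying.
x_pool = np.delete(x_train, initial_idx, axis=0)
y_pool = np.delete(y_train, initial_idx)

print(x_al_train.shape, x_pool.shape)  # (3, 2) (7, 2)
```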

exclude_unclassified_labels(df_data_x, df_data_y, excluded)

Remove any sources that have a label that is not being trained on.

Parameters:
  • df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.

  • df_data_y (DataFrame) – A dataframe containing the label corresponding to df_data_x.

  • excluded (str) – The label which should be removed from df_data_x and df_data_y.

Returns:

  • data_x (DataFrame) – A subset of df_data_x which has had all rows with label excluded removed.

  • data_y (DataFrame) – A subset of df_data_y which has had all rows with label excluded removed.

  • excluded_x (DataFrame) – A subset of df_data_x which only has rows with label excluded.

  • excluded_y (DataFrame) – A subset of df_data_y which only has rows with label excluded.
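
A minimal pandas sketch of this kind of filtering is given below. It assumes the labels live in a column called "label" and that both dataframes share the same index; the helper name split_off_label is hypothetical.

```python
import pandas as pd


def split_off_label(df_x, df_y, excluded, label_col="label"):
    """Separate rows carrying the excluded label from the rest."""
    mask = df_y[label_col] == excluded  # True for rows to exclude
    return df_x[~mask], df_y[~mask], df_x[mask], df_y[mask]


df_x = pd.DataFrame({"g-r": [0.1, 0.5, 0.9, 0.3]})
df_y = pd.DataFrame({"label": ["Star", "Unknown", "Galaxy", "Unknown"]})

data_x, data_y, excluded_x, excluded_y = split_off_label(df_x, df_y, "Unknown")
print(len(data_x), len(excluded_x))  # 2 2
```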

generate_features(df)

Create the feature combinations that the user specified.

Parameters:

df (DataFrame) – A dataframe containing all of the dataset.

Returns:

  • df (DataFrame) – An expanded version of df that includes the feature combinations.

  • df_al (DataFrame) – A dataframe containing a subset of df with only the required features for training.
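
To make the two return values concrete, here is a sketch that builds pairwise differences between columns, which is one plausible kind of feature combination. The choice of subtraction, the column names and the helper add_pairwise_differences are all assumptions for illustration; the actual combinations depend on the user's settings.

```python
from itertools import combinations

import pandas as pd


def add_pairwise_differences(df, base_cols):
    """Return (expanded df, training-only df) with a-b columns added."""
    df = df.copy()  # avoid mutating the caller's dataframe
    new_cols = []
    for a, b in combinations(base_cols, 2):
        name = f"{a}-{b}"
        df[name] = df[a] - df[b]
        new_cols.append(name)
    df_al = df[list(base_cols) + new_cols]  # only the features used for training
    return df, df_al


df = pd.DataFrame(
    {"id": [1, 2], "g": [20.0, 19.5], "r": [19.0, 19.0], "i": [18.5, 18.0]}
)
df_full, df_al = add_pairwise_differences(df, ["g", "r", "i"])
print(list(df_al.columns))  # ['g', 'r', 'i', 'g-r', 'g-i', 'r-i']
```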

iterate_AL()

Iterate through one iteration of active learning.

Return type:

None

panel(button_update=False)

Create the active learning tab panel.

Returns:

panel_row – The panel is housed in a row which can then be rendered by the respective Dashboard.

Return type:

Panel Row

query_new_point()

Query the most informative point from the training pool based on the chosen query metric.

Return type:

None
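
What counts as "most informative" depends on the query strategy selected. As one common example, least-confidence uncertainty sampling picks the pool point whose highest class probability is lowest. The sketch below assumes the class probabilities for the pool are already available as an array; the helper least_confident_index is a hypothetical name and this is not necessarily the strategy in use.

```python
import numpy as np


def least_confident_index(proba):
    """Index of the pool point whose top class probability is lowest."""
    uncertainty = 1.0 - proba.max(axis=1)
    return int(np.argmax(uncertainty))


# Rows are pool points, columns are class probabilities from a classifier.
pool_proba = np.array(
    [
        [0.95, 0.05],  # confident
        [0.55, 0.45],  # close to the decision boundary
        [0.20, 0.80],
    ]
)
query_index = least_confident_index(pool_proba)
print(query_index)  # 1
```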

remove_from_pool(id=None)

Remove the current queried source from the active learning pool.

Return type:

None
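
Conceptually, removing the queried source means dropping the same row from the pool's features, labels and ids so that it cannot be queried again. A minimal sketch with NumPy arrays follows; note that the class documents id_pool as a DataFrame, so plain arrays are used here only to keep the example short.

```python
import numpy as np

x_pool = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
y_pool = np.array([0, 1, 0])
id_pool = np.array([101, 102, 103])  # assumption: ids held in an array
query_index = 1

queried_id = id_pool[query_index]

# Drop the queried row from all three pool containers.
x_pool = np.delete(x_pool, query_index, axis=0)
y_pool = np.delete(y_pool, query_index)
id_pool = np.delete(id_pool, query_index)

print(queried_id, id_pool)  # 102 [101 103]
```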

save_model(checkpoint=False)

Save the current classifier(s) as a joblib file to the models/ directory. The classifier filename will include the classifier(s) used and corresponding query function(s). If training a committee, a new directory will be created where each of the committee models will be saved.

Parameters:

checkpoint (bool, default = False) – Flag whether or not the model is saving a checkpoint. If True the filename will include the current size of the training set, the current validation F1-score as well as the time and date to allow for easy tracking and organisation of models.

Return type:

None
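
The checkpoint naming idea can be sketched as follows. The exact filename layout used by save_model is not specified here, so the format string, the helper checkpoint_filename and the example classifier and query strategy names are all hypothetical.

```python
from datetime import datetime


def checkpoint_filename(classifiers, query_strategies, n_train, val_f1, when):
    """Build an illustrative checkpoint path from the current training state."""
    pairs = "_".join(
        f"{clf}-{qs}" for clf, qs in zip(classifiers, query_strategies)
    )
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"models/{pairs}_{n_train}_{val_f1:.3f}_{stamp}.joblib"


name = checkpoint_filename(
    ["RForest"],
    ["UncertaintySampling"],
    n_train=25,
    val_f1=0.8123,
    when=datetime(2021, 3, 1, 12, 30, 0),
)
print(name)  # models/RForest-UncertaintySampling_25_0.812_20210301_123000.joblib
```

Including the training set size and validation score in the name makes it possible to compare checkpoints at a glance without loading them.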

scale_data(x_train, x_val, x_test, x_cols)

Scale the features of the data according to the training set.

A RobustScaler is used to limit the impact of outliers on the data.

Parameters:
  • x_train (DataFrame) – A dataframe containing the training set. All subsequent data will be scaled according to this data.

  • x_val (DataFrame) – A dataframe containing the validation set.

  • x_test (DataFrame) – A dataframe containing the testing set.

  • x_cols (list of str) – List containing all the column names in x_train, x_val and x_test.

Returns:

  • data_x_tr (DataFrame) – A dataframe containing the normalised training set.

  • data_x_val (DataFrame) – A dataframe containing the normalised validation set.

  • data_x_test (DataFrame) – A dataframe containing the normalised testing set.
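
The key point of this method is that the scaler is fitted on the training set only and then applied unchanged to the validation and test sets. A minimal sketch with scikit-learn's RobustScaler, which centres on the median and scales by the interquartile range, might look like this. The column names and the helper transform are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

x_cols = ["g-r", "r-i"]
x_train = pd.DataFrame(
    {"g-r": [0.0, 1.0, 2.0, 3.0, 100.0], "r-i": [1.0, 2.0, 3.0, 4.0, 5.0]}
)
x_val = pd.DataFrame({"g-r": [2.0], "r-i": [3.0]})
x_test = pd.DataFrame({"g-r": [4.0], "r-i": [7.0]})

# Fit on the training set only, so no validation/test information leaks in.
scaler = RobustScaler()
scaler.fit(x_train[x_cols])


def transform(frame):
    """Apply the fitted scaler and keep the dataframe's columns and index."""
    scaled = scaler.transform(frame[x_cols])
    return pd.DataFrame(scaled, columns=x_cols, index=frame.index)


data_x_tr, data_x_val, data_x_test = (
    transform(frame) for frame in (x_train, x_val, x_test)
)
print(data_x_test)
```

In this toy data the outlier of 100.0 in the training set barely affects the scaling, because the median and interquartile range are insensitive to a single extreme value.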

setup_learners()

Initialise the classifiers used during active learning.

The classifiers used have already been chosen by the user.

Return type:

None

setup_panel()

Create the panel which will house all the classifier setup options.

Returns:

self.panel_row – The panel is housed in a row which can then be rendered by the respective Dashboard.

Return type:

Panel Row

show_queried_point()

Assign the classifier’s current queried point as the current selected source.

Return type:

None

split_x_y_ids(df_data)

Separate the data into X and [y,ids] dataframes.

Parameters:

df_data (DataFrame) – A dataframe containing all the training features, the label column and the id column.

Returns:

  • df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.

  • df_data_y_ids (DataFrame) – A dataframe containing only the label and id columns corresponding to df_data_x.
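
A short pandas sketch of this separation is shown below. The column names "label" and "id" and the helper split_features are assumptions for illustration; in practice the names come from the user's settings.

```python
import pandas as pd


def split_features(df_data, label_col="label", id_col="id"):
    """Split a dataframe into features and a [label, id] dataframe."""
    df_data_y_ids = df_data[[label_col, id_col]]
    df_data_x = df_data.drop(columns=[label_col, id_col])
    return df_data_x, df_data_y_ids


df_data = pd.DataFrame(
    {
        "id": [11, 12, 13],
        "g-r": [0.1, 0.5, 0.9],
        "r-i": [0.2, 0.3, 0.4],
        "label": [0, 1, 0],
    }
)
df_data_x, df_data_y_ids = split_features(df_data)
print(list(df_data_x.columns), list(df_data_y_ids.columns))
# ['g-r', 'r-i'] ['label', 'id']
```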

split_y_ids(y_id_train, y_id_val, y_id_test)

Split label and id columns into separate dataframes.

Parameters:
  • y_id_train (DataFrame) – Dataframe containing label and id columns of the training set.

  • y_id_val (DataFrame) – Dataframe containing label and id columns of the validation set.

  • y_id_test (DataFrame) – Dataframe containing label and id columns of the test set.

Returns:

  • data_y_tr (DataFrame) – Dataframe containing only the label column of y_id_train.

  • data_id_tr (DataFrame) – Dataframe containing only the id column of y_id_train.

  • data_y_val (DataFrame) – Dataframe containing only the label column of y_id_val.

  • data_id_val (DataFrame) – Dataframe containing only the id column of y_id_val.

  • data_y_test (DataFrame) – Dataframe containing only the label column of y_id_test.

  • data_id_test (DataFrame) – Dataframe containing only the id column of y_id_test.

train_val_test_split(df_data_x, df_data_y, excluded_x, excluded_y, train_ratio, val_ratio)

Split data into train, validation and test sets. The method uses stratified sampling to ensure each set has the correct distribution of points.

Parameters:
  • df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.

  • df_data_y (DataFrame) – A dataframe containing the labels corresponding to df_data_x.

  • train_ratio (float) – The ratio of all the total dataset that should be used for the training set.

  • val_ratio (float) – The ratio of all the total dataset that should be used for the validation set.

Returns:

  • x_train (DataFrame) – A subset of df_data_x which will be used for training a model.

  • y_train (DataFrame) – A dataframe containing the labels corresponding to x_train.

  • x_val (DataFrame) – A subset of df_data_x which will be used for validating a model.

  • y_val (DataFrame) – A dataframe containing the labels corresponding to x_val.

  • x_test (DataFrame) – A subset of df_data_x which will be used for testing a model.

  • y_test (DataFrame) – A dataframe containing the labels corresponding to x_test.
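
One common way to obtain a stratified three-way split is to call scikit-learn's train_test_split twice: once to carve off the training set and once to divide the remainder into validation and test sets. The sketch below follows that approach under the assumption that the labels sit in a column named "label". The helper stratified_three_way_split is hypothetical, and the handling of excluded_x and excluded_y is not shown.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(df_x, df_y, train_ratio, val_ratio, seed=0):
    """Split into train/val/test, where both ratios refer to the full dataset."""
    test_ratio = 1.0 - train_ratio - val_ratio
    x_train, x_rest, y_train, y_rest = train_test_split(
        df_x, df_y,
        train_size=train_ratio,
        stratify=df_y["label"],
        random_state=seed,
    )
    # Convert val_ratio into a share of what is left after removing the train set.
    val_share = val_ratio / (val_ratio + test_ratio)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest,
        train_size=val_share,
        stratify=y_rest["label"],
        random_state=seed,
    )
    return x_train, y_train, x_val, y_val, x_test, y_test


df_x = pd.DataFrame({"feature": range(100)})
df_y = pd.DataFrame({"label": [0] * 70 + [1] * 30})

x_train, y_train, x_val, y_val, x_test, y_test = stratified_three_way_split(
    df_x, df_y, train_ratio=0.6, val_ratio=0.2
)
print(len(x_train), len(x_val), len(x_test))  # 60 20 20
```

Because both splits are stratified, each subset keeps roughly the 70/30 class balance of the full dataset.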