astronomicAL.active_learning

class astronomicAL.active_learning.active_learning.ActiveLearningModel(src, df, label)

This class handles the Machine Learning aspect of the codebase.

Based on the user's settings, the required features are extracted from the data and split into train, validation and test sets. The user can then specify which classifiers and query functions to use in the active learning process. The results at each stage are displayed in various widgets and plots, allowing the user to select the correct label for the newly queried source. Each instance trains a separate one-vs-rest classifier.

Parameters
  • src (ColumnDataSource) – The shared data source which holds the current selected source.

  • df (DataFrame) – The shared dataframe which holds all the data.

  • label (str) – The string alias of the label that will be the positive case in the one-vs-rest classifier.
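
As a rough illustration of the one-vs-rest setup described above, the sketch below maps multi-class labels onto a binary target for a single positive class. The helper name `to_one_vs_rest` and the sample labels are hypothetical and are not part of astronomicAL; the class itself works on DataFrames rather than plain lists.

```python
def to_one_vs_rest(labels, positive_label):
    """Map multi-class labels to 1 (positive class) or 0 (every other class)."""
    return [1 if lab == positive_label else 0 for lab in labels]


# Hypothetical integer class labels for six sources.
labels = [0, 2, 1, 2, 0, 2]

# Treat class 2 as the positive case; all other classes become the "rest".
binary = to_one_vs_rest(labels, positive_label=2)
print(binary)  # [0, 1, 0, 1, 0, 1]
```

One such binary target would exist per ActiveLearningModel instance, each with a different positive label.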

df

The shared dataframe which holds all the data.

Type

DataFrame

src

The shared data source which holds the current selected source.

Type

ColumnDataSource

_label

The label that will be the positive case in the one-vs-rest classifier.

Type

int

_label_alias

The string alias of _label.

Type

str

_training

Flag for whether the training process has begun.

Type

bool

_assigned

Flag for whether the user has assigned a label to the current queried source.

Type

bool

retrain

Flag for whether the class is retraining a previous model from within config.settings["classifiers"].

Type

bool

scaler

The scaler used to standardise features across the train, val and test sets according to the training set. NOTE: Only initialised if config.settings["scale_data"] is True.

Type

sklearn.preprocessing.RobustScaler

_show_test_results

Flag for whether to render the test set results column in panel method.

Type

bool

_seen_test_results

Flag for indicating whether the user has viewed the test results of the classifier.

Type

bool

_show_caution

Flag for whether to show the test set caution column to the user when trying to view the test set results.

Type

bool

_seen_caution

Flag for indicating whether the user has viewed the test results caution page.

Type

bool

_max_x

The maximum value of the x axis of the train, val and metric plots. This is set as the min(μ(x)+4*σ(x), max(x)).

Type

float

_max_y

The maximum value of the y axis of the train, val and metric plots. This is set as the min(μ(y)+4*σ(y), max(y)).

Type

float

_min_x

The minimum value of the x axis of the train, val and metric plots. This is set as the max(μ(x)-4*σ(x), min(x)).

Type

float

_min_y

The minimum value of the y axis of the train, val and metric plots. This is set as the max(μ(y)-4*σ(y), min(y)).

Type

float
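
The four axis limits above follow the same pattern: take a window of four standard deviations around the mean, then clip it to the observed data range. A minimal sketch of that calculation is shown here. The function name `clipped_axis_range` is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator is used.

```python
import statistics


def clipped_axis_range(values, n_sigma=4):
    """Return (lower, upper) axis limits clipped to the observed data range."""
    mu = statistics.fmean(values)
    # Assumption: population standard deviation; the source does not specify.
    sigma = statistics.pstdev(values)
    lower = max(mu - n_sigma * sigma, min(values))
    upper = min(mu + n_sigma * sigma, max(values))
    return lower, upper


# Without outliers the limits are simply the data range.
print(clipped_axis_range([1, 2, 3, 4, 5]))  # (1, 5)

# With a single extreme outlier the upper limit falls below the maximum,
# so the outlier does not stretch the plot.
lower, upper = clipped_axis_range([1.0] * 99 + [1000.0])
```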

_model_output_data_tr

Dictionary containing the plotting data [config.settings["default_vars"][0], config.settings["default_vars"][1], metric, y, pred] required for the train and metric plots.

Type

dict

_model_output_data_val

Dictionary containing the plotting data [config.settings["default_vars"][0], config.settings["default_vars"][1], y, pred] required for the val plot.

Type

dict

_accuracy_list

Dictionary containing the train and validation accuracy scores as a function of the number of points.

Type

dict

_f1_list

Dictionary containing the train and validation F1 scores as a function of the number of points.

Type

dict

_precision_list

Dictionary containing the train and validation precision scores as a function of the number of points.

Type

dict

_recall_list

Dictionary containing the train and validation recall scores as a function of the number of points.

Type

dict

_train_scores

Dictionary containing the current scores for the training set.

Type

dict

_val_scores

Dictionary containing the current scores for the validation set.

Type

dict

_test_scores

Dictionary containing the current scores for the testing set.

Type

dict

corr_train

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the training sources that are currently predicted correctly.

Type

ColumnDataSource

incorr_train

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the training sources that are currently predicted incorrectly.

Type

ColumnDataSource

corr_val

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the validation sources that are currently predicted correctly.

Type

ColumnDataSource

incorr_val

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the validation sources that are currently predicted incorrectly.

Type

ColumnDataSource

queried_points

The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of the current queried points.

Type

ColumnDataSource

full_labelled_data

Dictionary containing the id and y values of all labelled points during training.

Type

dict

assign_label_group

The group of buttons containing the possible labels for labelling during training.

Type

Panel RadioButtonGroup Widget

assign_label_button

The button for assigning the selected label from assign_label_group to the currently queried source.

Type

Panel Button Widget

show_queried_button

The button for making the current queried point the current selected point.

Type

Panel Button Widget

classifier_dropdown

A dropdown menu showing all the classifiers initialised in astronomicAL.extensions.models.

Type

Panel Select Widget

query_strategy_dropdown

A dropdown menu showing all the query strategies initialised in astronomicAL.extensions.query_strategies.

Type

Panel Select Widget

starting_num_points

Set the number of initial randomly selected points to train on.

Type

Panel IntInput Widget

classifier_table_source

The collection of all the currently selected classifier and query strategy pairs.

Type

ColumnDataSource

classifier_table

The table for visualising classifier_table_source.

Type

DataTable

add_classifier_button

The button for appending the currently selected values from classifier_dropdown and query_strategy_dropdown to classifier_table_source.

Type

Panel Button Widget

remove_classifier_button

The button for removing the last entry from classifier_table_source.

Type

Panel Button Widget

start_training_button

The button for beginning the training of a classifier using the selected parameters from classifier_table_source and starting_num_points.

Type

Panel Button Widget

next_interation_button

The button to begin the next iteration of the Active Learning process. Only visible after assigning a label to the currently queried point.

Type

Panel Button Widget

checkpoint_button

The button to save the current model and the parameters required to recreate the current setup.

Type

Panel Button Widget

request_test_results_button

The button to request the current classifier's results for the test set.

Type

Panel Button Widget

_return_to_train_view_button

The button displayed in the test set caution window, allowing the user to return to the train and validation results without seeing the test set results.

Type

Panel Button Widget

_stop_caution_show_checkbox

A checkbox for whether the user wants to disable the test set caution window from appearing when they want to view the test set.

Type

Panel Checkbox Widget

_view_test_results_button

The button to show the test set results to the user. If _show_caution is True, this button will show the test set caution window instead.

Type

Panel Button Widget

_queried_is_selected

Flag for whether the current queried point is also the current selected point.

Type

bool

setup_row

A row containing all the classifier setup settings required before the training process has begun.

Type

Panel Row

panel_row

A row containing all the visualisation aspects of the ActiveLearningModel view.

Type

Panel Row

conf_mat_tr_tn

The number of true negatives in the classifier's current prediction of the training set.

Type

str

conf_mat_tr_fn

The number of false negatives in the classifier's current prediction of the training set.

Type

str

conf_mat_tr_fp

The number of false positives in the classifier's current prediction of the training set.

Type

str

conf_mat_tr_tp

The number of true positives in the classifier's current prediction of the training set.

Type

str

conf_mat_val_tn

The number of true negatives in the classifier's current prediction of the validation set.

Type

str

conf_mat_val_fn

The number of false negatives in the classifier's current prediction of the validation set.

Type

str

conf_mat_val_fp

The number of false positives in the classifier's current prediction of the validation set.

Type

str

conf_mat_val_tp

The number of true positives in the classifier's current prediction of the validation set.

Type

str

conf_mat_test_tn

The number of true negatives in the classifier's current prediction of the test set.

Type

str

conf_mat_test_fn

The number of false negatives in the classifier's current prediction of the test set.

Type

str

conf_mat_test_fp

The number of false positives in the classifier's current prediction of the test set.

Type

str

conf_mat_test_tp

The number of true positives in the classifier's current prediction of the test set.

Type

str

all_al_data

A dataframe containing a subset of df with only the required features for training.

Type

DataFrame

x_train

A dataframe containing all the training input data.

Type

DataFrame

y_train

A dataframe containing all the training labels.

Type

DataFrame

id_train

A dataframe containing all the training ids.

Type

DataFrame

x_val

A dataframe containing all the validation input data.

Type

DataFrame

y_val

A dataframe containing all the validation labels.

Type

DataFrame

id_val

A dataframe containing all the validation ids.

Type

DataFrame

x_test

A dataframe containing all the test input data.

Type

DataFrame

y_test

A dataframe containing all the test labels.

Type

DataFrame

id_test

A dataframe containing all the test ids.

Type

DataFrame

x_al_train

The data that the classifier is training on.

Type

Numpy Array

y_al_train

The labels for the data the classifier is training on.

Type

Numpy Array

id_al_train

The ids of the data the classifier is training on.

Type

DataFrame

x_pool

The data of the sources in the pool that are available to query from.

Type

Numpy Array

y_pool

The labels of the sources in the pool that are available to query from.

Type

Numpy Array

id_pool

The ids of the sources in the pool that are available to query from.

Type

DataFrame

query_index

The current index of x_pool that contains the current queried point.

Type

int

learner

The current classifier that is being trained. If multiple classifiers exist in classifier_table_source, then learner will be a ModAL Committee.

Type

ModAL ActiveLearner

assign_global_data()

Assign the current train, validation and test sets to the shared ml_data dictionary so that they can be used by other classifiers.

Returns

Return type

None

create_pool(preselected=None)

Create the pool used for query points during active learning. The training set is split into the pool and the classifier's initial training set. The size of the classifier's initial training set has already been set by the user, and these points are chosen randomly from the pool.

Returns

Return type

None
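
The split performed by create_pool can be pictured with the sketch below: a fixed number of randomly chosen points seed the classifier's training set, and everything else stays in the pool. The helper `split_pool`, its `seed` argument and the integer ids are hypothetical; the role of the `preselected` argument is not documented here, so it is left out.

```python
import random


def split_pool(train_ids, n_initial, seed=None):
    """Randomly pick n_initial ids to train on; the rest form the query pool."""
    rng = random.Random(seed)
    initial = rng.sample(train_ids, n_initial)
    chosen = set(initial)
    # Keep the original ordering for the points that remain in the pool.
    pool = [i for i in train_ids if i not in chosen]
    return initial, pool


# Ten hypothetical source ids, three of which seed the classifier.
initial, pool = split_pool(list(range(10)), n_initial=3, seed=0)
print(len(initial), len(pool))  # 3 7
```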

exclude_unclassified_labels(df_data_x, df_data_y, excluded)

Remove any sources that have a label that is not being trained on.

Parameters
  • df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.

  • df_data_y (DataFrame) – A dataframe containing the label corresponding to df_data_x.

  • excluded (str) – The label which should be removed from df_data_x and df_data_y.

Returns

  • data_x (DataFrame) – A subset of df_data_x which has had all rows with label excluded removed.

  • data_y (DataFrame) – A subset of df_data_y which has had all rows with label excluded removed.

  • excluded_x (DataFrame) – A subset of df_data_x which only has rows with label excluded.

  • excluded_y (DataFrame) – A subset of df_data_y which only has rows with label excluded.
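
The four return values can be read as a partition of the input by label. The following sketch shows that partition on plain lists rather than DataFrames; the function name `exclude_label`, the feature values and the label strings are all hypothetical.

```python
def exclude_label(rows_x, rows_y, excluded):
    """Partition features and labels into kept rows and rows with the excluded label."""
    kept_x, kept_y, excl_x, excl_y = [], [], [], []
    for x, y in zip(rows_x, rows_y):
        if y == excluded:
            excl_x.append(x)
            excl_y.append(y)
        else:
            kept_x.append(x)
            kept_y.append(y)
    return kept_x, kept_y, excl_x, excl_y


# Hypothetical features and labels for three sources.
features = [[0.1, 1.2], [0.4, 0.9], [0.7, 0.3]]
labels = ["Star", "Unknown", "Galaxy"]

data_x, data_y, excluded_x, excluded_y = exclude_label(features, labels, "Unknown")
print(data_y)      # ['Star', 'Galaxy']
print(excluded_y)  # ['Unknown']
```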

generate_features(df)

Create the feature combinations that the user specified.

Parameters

df (DataFrame) – A dataframe containing all of the dataset.

Returns

  • df (DataFrame) – An expanded version of df that includes the feature combinations.

  • df_al (DataFrame) – A dataframe containing a subset of df with only the required features for training.

iterate_AL()

Iterate through one iteration of active learning.

Returns

Return type

None

panel(button_update=False)

Create the active learning tab panel.

Returns

panel_row – The panel is housed in a row which can then be rendered by the respective Dashboard.

Return type

Panel Row

query_new_point()

Query the most informative point from the training pool based on the chosen query metric.

Returns

Return type

None
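
What counts as "most informative" depends on the query strategy selected in query_strategy_dropdown. As one common example, and not necessarily the strategy in use, uncertainty sampling picks the pool point whose predicted probability is closest to 0.5 for a binary classifier. The function name and the probabilities below are hypothetical.

```python
def most_uncertain_index(positive_probs):
    """Return the index of the pool point the classifier is least sure about."""
    # Uncertainty is 1 minus the probability of the most likely class.
    uncertainties = [1.0 - max(p, 1.0 - p) for p in positive_probs]
    return max(range(len(uncertainties)), key=uncertainties.__getitem__)


# Hypothetical positive-class probabilities for four pool points.
probs = [0.95, 0.10, 0.55, 0.80]
query_index = most_uncertain_index(probs)
print(query_index)  # 2
```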

remove_from_pool(id=None)

Remove the current queried source from the active learning pool.

Returns

Return type

None

save_model(checkpoint=False)

Save the current classifier(s) as a joblib file to the models/ directory. The classifier filename will include the classifier(s) used and the corresponding query function(s). If training a committee, a new directory will be created in which each of the committee models will be saved.

Parameters

checkpoint (bool, default = False) – Flag for whether the model is being saved as a checkpoint. If True, the filename will include the current size of the training set, the current validation F1-score, and the time and date, to allow for easy tracking and organisation of models.

Returns

Return type

None

scale_data(x_train, x_val, x_test, x_cols)

Scale the features of the data according to the training set.

A RobustScaler is used to limit the impact of outliers on the data.

Parameters
  • x_train (DataFrame) – A dataframe containing the training set. All subsequent data will be scaled according to this data.

  • x_val (DataFrame) – A dataframe containing the validation set.

  • x_test (DataFrame) – A dataframe containing the testing set.

  • x_cols (list of str) – List containing all the column names in x_train, x_val and x_test.

Returns

  • data_x_tr (DataFrame) – A dataframe containing the normalised training set.

  • data_x_val (DataFrame) – A dataframe containing the normalised validation set.

  • data_x_test (DataFrame) – A dataframe containing the normalised testing set.
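
To make "scaled according to the training set" concrete, the sketch below reproduces what a robust scaler does by default for a single feature: it centres on the training median and divides by the training interquartile range, then reuses those two statistics for the other sets. This is a dependency-free illustration, not the sklearn implementation; the function names and sample values are hypothetical.

```python
import statistics


def fit_robust_scaler(train_column):
    """Compute the median and interquartile range from the training data only."""
    median = statistics.median(train_column)
    q1, _, q3 = statistics.quantiles(train_column, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # avoid dividing by zero for constant features
    return median, iqr


def transform(column, median, iqr):
    """Scale any set with statistics that were fitted on the training set."""
    return [(v - median) / iqr for v in column]


# Hypothetical feature values; 100.0 is an outlier in the training set.
train = [1.0, 2.0, 3.0, 4.0, 100.0]
val = [3.0, 5.0]

median, iqr = fit_robust_scaler(train)
train_scaled = transform(train, median, iqr)
val_scaled = transform(val, median, iqr)
print(median, iqr)  # 3.0 2.0
print(val_scaled)   # [0.0, 1.0]
```

Because the median and interquartile range barely move when an outlier is added, the bulk of the data keeps a sensible scale, which is the motivation given above for choosing a RobustScaler.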

setup_learners()

Initialise the classifiers used during active learning.

The classifiers used have already been chosen by the user.

Returns

Return type

None

setup_panel()

Create the panel which will house all the classifier setup options.

Returns

self.panel_row – The panel is housed in a row which can then be rendered by the respective Dashboard.

Return type

Panel Row

show_queried_point()

Assign the classifier’s current queried point as the current selected source.

Returns

Return type

None

split_x_y_ids(df_data)

Separate the data into X and [y,ids] dataframes.

Parameters

df_data (DataFrame) – A dataframe containing all the training features, the label column and the id column.

Returns

  • df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.

  • df_data_y_ids (DataFrame) – A dataframe containing only the label and id columns corresponding to df_data_x.

split_y_ids(y_id_train, y_id_val, y_id_test)

Split label and id columns into separate dataframes.

Parameters
  • y_id_train (DataFrame) – Dataframe containing label and id columns of the training set.

  • y_id_val (DataFrame) – Dataframe containing label and id columns of the validation set.

  • y_id_test (DataFrame) – Dataframe containing label and id columns of the test set.

Returns

  • data_y_tr (DataFrame) – Dataframe containing only the label column of y_id_train.

  • data_id_tr (DataFrame) – Dataframe containing only the id column of y_id_train.

  • data_y_val (DataFrame) – Dataframe containing only the label column of y_id_val.

  • data_id_val (DataFrame) – Dataframe containing only the id column of y_id_val.

  • data_y_test (DataFrame) – Dataframe containing only the label column of y_id_test.

  • data_id_test (DataFrame) – Dataframe containing only the id column of y_id_test.

train_val_test_split(df_data_x, df_data_y, excluded_x, excluded_y, train_ratio, val_ratio)

Split data into train, validation and test sets. The method uses stratified sampling to ensure each set has the correct distribution of points.

Parameters
  • df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.

  • df_data_y (DataFrame) – A dataframe containing the labels corresponding to df_data_x.

  • train_ratio (float) – The fraction of the total dataset that should be used for the training set.

  • val_ratio (float) – The fraction of the total dataset that should be used for the validation set.

Returns

  • x_train (DataFrame) – A subset of df_data_x which will be used for training a model.

  • y_train (DataFrame) – A dataframe containing the labels corresponding to x_train.

  • x_val (DataFrame) – A subset of df_data_x which will be used for validating a model.

  • y_val (DataFrame) – A dataframe containing the labels corresponding to x_val.

  • x_test (DataFrame) – A subset of df_data_x which will be used for testing a model.

  • y_test (DataFrame) – A dataframe containing the labels corresponding to x_test.
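
Stratified sampling here means that each class is split separately, so the three sets keep the class proportions of the full dataset. A minimal sketch of the idea follows, returning row indices rather than DataFrames. The function name `stratified_split` and its `seed` argument are hypothetical, the test set is assumed to receive whatever remains after the train and validation fractions, and the handling of excluded_x and excluded_y is not covered.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio, val_ratio, seed=None):
    """Split row indices into train/val/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_ratio)
        n_val = round(len(indices) * val_ratio)
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Assumption: the remainder of each class goes to the test set.
        test += indices[n_train + n_val:]
    return train, val, test


# Hypothetical imbalanced dataset: 80 negatives and 20 positives.
labels = [0] * 80 + [1] * 20
train, val, test = stratified_split(labels, train_ratio=0.6, val_ratio=0.2, seed=1)
print(len(train), len(val), len(test))  # 60 20 20
```

In this example every set keeps the original 4:1 class balance, for instance 48 negatives and 12 positives in the training set.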