astronomicAL.active_learning
- class astronomicAL.active_learning.active_learning.ActiveLearningModel(src, df, label)
This class handles the Machine Learning aspect of the codebase.
Based on the user's settings, the required features are extracted from the data and split into train, validation and test sets. The user can then specify which classifiers and query functions to use in the Active Learning process. The results at each stage are displayed in various widgets and plots, allowing the user to select the correct label for the newly queried source. Each instance trains a separate one-vs-rest classifier.
- Parameters:
src (ColumnDataSource) – The shared data source which holds the current selected source.
df (DataFrame) – The shared dataframe which holds all the data.
label (str) – The string alias of the label that will be the positive case in the one-vs-rest classifier.
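The one-vs-rest setup described above can be illustrated with a minimal sketch. It assumes plain Python lists in place of the shared DataFrame, and the helper name to_one_vs_rest is hypothetical, not part of astronomicAL:

```python
def to_one_vs_rest(labels, positive_label):
    """Map multi-class labels to 1 (positive class) or 0 (every other class)."""
    return [1 if label == positive_label else 0 for label in labels]


# Hypothetical integer labels for five sources; label 2 is the positive case.
labels = [0, 2, 1, 2, 0]
binary = to_one_vs_rest(labels, positive_label=2)
```

Each ActiveLearningModel instance would see only a binary target of this kind for its own label.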
- df
The shared dataframe which holds all the data.
- Type:
DataFrame
- src
The shared data source which holds the current selected source.
- Type:
ColumnDataSource
- _label
The label that will be the positive case in the one-vs-rest classifier.
- Type:
int
- _label_alias
The string alias of _label.
- Type:
str
- _training
Flag for whether the training process has begun.
- Type:
bool
- _assigned
Flag for whether the user has assigned a label to the current queried source.
- Type:
bool
- retrain
Flag for whether the class is retraining a previous model from within config.settings["classifiers"].
- Type:
bool
- scaler
The scaler used to standardise features across the train, val and test sets according to the training set. NOTE: Only initialised if config.settings["scale_data"] is True.
- Type:
sklearn.preprocessing.RobustScaler
- _show_test_results
Flag for whether to render the test set results column in the panel method.
- Type:
bool
- _seen_test_results
Flag for indicating whether the user has viewed the test results of the classifier.
- Type:
bool
- _show_caution
Flag for whether to show the test set caution column to the user when trying to view the test set results.
- Type:
bool
- _seen_caution
Flag for indicating whether the user has viewed the test results caution page.
- Type:
bool
- _max_x
The maximum value of the x axis of the train, val and metric plots. This is set as the min(μ(x)+4*σ(x), max(x)).
- Type:
float
- _max_y
The maximum value of the y axis of the train, val and metric plots. This is set as the min(μ(y)+4*σ(y), max(y)).
- Type:
float
- _min_x
The minimum value of the x axis of the train, val and metric plots. This is set as the max(μ(x)-4*σ(x), min(x)).
- Type:
float
- _min_y
The minimum value of the y axis of the train, val and metric plots. This is set as the max(μ(y)-4*σ(y), min(y)).
- Type:
float
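The four axis limits above clip the plotted range to four standard deviations around the mean, without ever extending past the data itself. A minimal sketch of that rule follows. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator is used:

```python
from statistics import mean, pstdev


def clipped_axis_range(values, n_sigma=4):
    """Return (lower, upper): mean -/+ n_sigma * std, clipped to the data range."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    lower = max(mu - n_sigma * sigma, min(values))
    upper = min(mu + n_sigma * sigma, max(values))
    return lower, upper


# Twenty-four ordinary values plus one extreme outlier.
x = [1.0] * 12 + [2.0] * 12 + [1000.0]
min_x, max_x = clipped_axis_range(x)
```

Here the lower limit falls back to the smallest data value, while the upper limit stops short of the outlier.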
- _model_output_data_tr
Dictionary containing the plotting data [config.settings["default_vars"][0], config.settings["default_vars"][1], metric, y, pred] required for the train and metric plots.
- Type:
dict
- _model_output_data_val
Dictionary containing the plotting data [config.settings["default_vars"][0], config.settings["default_vars"][1], y, pred] required for the val plot.
- Type:
dict
- _accuracy_list
Dictionary containing the training and validation accuracy scores recorded against the number of training points.
- Type:
dict
- _f1_list
Dictionary containing the training and validation F1 scores recorded against the number of training points.
- Type:
dict
- _precision_list
Dictionary containing the training and validation precision scores recorded against the number of training points.
- Type:
dict
- _recall_list
Dictionary containing the training and validation recall scores recorded against the number of training points.
- Type:
dict
- _train_scores
Dictionary containing the current scores for the training set.
- Type:
dict
- _val_scores
Dictionary containing the current scores for the validation set.
- Type:
dict
- _test_scores
Dictionary containing the current scores for the testing set.
- Type:
dict
- corr_train
The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the training sources that are currently predicted correctly.
- Type:
ColumnDataSource
- incorr_train
The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the training sources that are currently predicted incorrectly.
- Type:
ColumnDataSource
- corr_val
The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the validation sources that are currently predicted correctly.
- Type:
ColumnDataSource
- incorr_val
The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of all the validation sources that are currently predicted incorrectly.
- Type:
ColumnDataSource
- queried_points
The config.settings["default_vars"][0] and config.settings["default_vars"][1] values of the current queried points.
- Type:
ColumnDataSource
- full_labelled_data
Dictionary containing the id and y values of all labelled points during training.
- Type:
dict
- assign_label_group
The group of buttons containing the possible labels for labelling during training.
- Type:
Panel RadioButtonGroup Widget
- assign_label_button
The button for assigning the selected label from assign_label_group to the currently queried source.
- Type:
Panel Button Widget
- show_queried_button
The button for making the current queried point the current selected point.
- Type:
Panel Button Widget
- classifier_dropdown
A dropdown menu showing all the classifiers initialised in astronomicAL.extensions.models.
- Type:
Panel Select Widget
- query_strategy_dropdown
A dropdown menu showing all the query strategies initialised in astronomicAL.extensions.query_strategies.
- Type:
Panel Select Widget
- starting_num_points
Set the number of initial randomly selected points to train on.
- Type:
Panel IntInput Widget
- classifier_table_source
The collection of all the currently selected classifier and query strategy pairs.
- Type:
ColumnDataSource
- classifier_table
The table for visualising classifier_table_source.
- Type:
DataTable
- add_classifier_button
The button for appending the currently selected values from classifier_dropdown and query_strategy_dropdown to classifier_table_source.
- Type:
Panel Button Widget
- remove_classifier_button
The button for removing the last entry from classifier_table_source.
- Type:
Panel Button Widget
- start_training_button
The button for beginning the training of a classifier using the selected parameters from classifier_table_source and starting_num_points.
- Type:
Panel Button Widget
- next_interation_button
The button to begin the next iteration of the Active Learning process. Only visible after assigning a label to the currently queried point.
- Type:
Panel Button Widget
- checkpoint_button
The button to save the current model and the parameters required to recreate the current setup.
- Type:
Panel Button Widget
- request_test_results_button
The button to request the current classifier's results for the test set.
- Type:
Panel Button Widget
- _return_to_train_view_button
The button displayed in the test set caution window, allowing the user to return to the train and validation results without seeing the test set results.
- Type:
Panel Button Widget
- _stop_caution_show_checkbox
A checkbox for whether the user wants to disable the test set caution window from appearing when they want to view the test set.
- Type:
Panel Checkbox Widget
- _view_test_results_button
The button to show the test set results to the user. If _show_caution is True, this button will show the test set caution window instead.
- Type:
Panel Button Widget
- _queried_is_selected
Flag for whether the current queried point is also the current selected point.
- Type:
bool
- setup_row
A row containing all the classifier setup settings required before the training process has begun.
- Type:
Panel Row
- panel_row
A row containing all the visualisation aspects of the ActiveLearningModel view.
- Type:
Panel Row
- conf_mat_tr_tn
The current number of true negatives in the classifier's current prediction of the training set.
- Type:
str
- conf_mat_tr_fn
The current number of false negatives in the classifier's current prediction of the training set.
- Type:
str
- conf_mat_tr_fp
The current number of false positives in the classifier's current prediction of the training set.
- Type:
str
- conf_mat_tr_tp
The current number of true positives in the classifier's current prediction of the training set.
- Type:
str
- conf_mat_val_tn
The current number of true negatives in the classifier's current prediction of the validation set.
- Type:
str
- conf_mat_val_fn
The current number of false negatives in the classifier's current prediction of the validation set.
- Type:
str
- conf_mat_val_fp
The current number of false positives in the classifier's current prediction of the validation set.
- Type:
str
- conf_mat_val_tp
The current number of true positives in the classifier's current prediction of the validation set.
- Type:
str
- conf_mat_test_tn
The current number of true negatives in the classifier's current prediction of the test set.
- Type:
str
- conf_mat_test_fn
The current number of false negatives in the classifier's current prediction of the test set.
- Type:
str
- conf_mat_test_fp
The current number of false positives in the classifier's current prediction of the test set.
- Type:
str
- conf_mat_test_tp
The current number of true positives in the classifier's current prediction of the test set.
- Type:
str
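The confusion-matrix counts above are the quantities from which accuracy, precision, recall and F1 are conventionally derived. The sketch below shows the standard formulas. The function name and the dictionary keys are hypothetical, and the source does not state how the class itself computes or stores its scores:

```python
def scores_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from binary confusion counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"acc": accuracy, "prec": precision, "rec": recall, "f1": f1}


# The attributes above are typed as str, so convert to int before computing.
counts = {"tn": "50", "fp": "10", "fn": "5", "tp": "35"}
scores = scores_from_confusion(**{k: int(v) for k, v in counts.items()})
```

The zero-denominator guards keep the helper usable early in training, when a classifier may not yet predict any positives.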
- all_al_data
A dataframe containing a subset of df with only the required features for training.
- Type:
DataFrame
- x_train
A dataframe containing all the training input data.
- Type:
DataFrame
- y_train
A dataframe containing all the training labels.
- Type:
DataFrame
- id_train
A dataframe containing all the training ids.
- Type:
DataFrame
- x_val
A dataframe containing all the validation input data.
- Type:
DataFrame
- y_val
A dataframe containing all the validation labels.
- Type:
DataFrame
- id_val
A dataframe containing all the validation ids.
- Type:
DataFrame
- x_test
A dataframe containing all the test input data.
- Type:
DataFrame
- y_test
A dataframe containing all the test labels.
- Type:
DataFrame
- id_test
A dataframe containing all the test ids.
- Type:
DataFrame
- x_al_train
The data that the classifier is training on.
- Type:
Numpy Array
- y_al_train
The labels for the data the classifier is training on.
- Type:
Numpy Array
- id_al_train
The ids of the data the classifier is training on.
- Type:
DataFrame
- x_pool
The data of the sources in the pool that are available to query from.
- Type:
Numpy Array
- y_pool
The labels of the sources in the pool that are available to query from.
- Type:
Numpy Array
- id_pool
The ids of the sources in the pool that are available to query from.
- Type:
DataFrame
- query_index
The current index of x_pool that contains the current queried point.
- Type:
int
- learner
The current classifier that is being trained. If multiple classifiers exist in classifier_table_source, then learner will be a ModAL Committee.
- Type:
ModAL ActiveLearner
- assign_global_data()
Assign the current train, validation and test sets to the shared ml_data dictionary so that they can be used by other classifiers.
- Return type:
None
- create_pool(preselected=None)
Create the pool used for query points during active learning. The training set will be split into the pool and the classifier’s training set. The number in the classifier’s training set has already been set by the user and these points will be chosen randomly from the pool.
- Return type:
None
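As a rough illustration of the split that create_pool describes, the sketch below randomly draws the starting points and leaves the remainder as the pool. It works on a plain list of ids rather than the class's arrays, and the function name and seed argument are hypothetical:

```python
import random


def split_initial_and_pool(ids, n_initial, seed=0):
    """Randomly pick n_initial ids for the starting training set; the rest form the pool."""
    rng = random.Random(seed)
    initial = set(rng.sample(ids, n_initial))
    train_ids = [i for i in ids if i in initial]
    pool_ids = [i for i in ids if i not in initial]
    return train_ids, pool_ids


source_ids = list(range(100))  # hypothetical source ids
train_ids, pool_ids = split_initial_and_pool(source_ids, n_initial=5)
```

In the class itself, the number of starting points comes from the starting_num_points widget.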
- exclude_unclassified_labels(df_data_x, df_data_y, excluded)
Remove any sources that have a label that is not being trained on.
- Parameters:
df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.
df_data_y (DataFrame) – A dataframe containing the label corresponding to df_data_x.
excluded (str) – The label which should be removed from df_data_x and df_data_y.
- Returns:
data_x (DataFrame) – A subset of df_data_x which has had all rows with label excluded removed.
data_y (DataFrame) – A subset of df_data_y which has had all rows with label excluded removed.
excluded_x (DataFrame) – A subset of df_data_x which only has rows with label excluded.
excluded_y (DataFrame) – A subset of df_data_y which only has rows with label excluded.
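The four return values can be pictured as a simple partition of the paired features and labels. This is a minimal sketch using lists instead of DataFrames, with a hypothetical helper name and made-up label aliases:

```python
def exclude_label(rows_x, rows_y, excluded):
    """Partition paired feature rows and labels on whether the label equals `excluded`."""
    data_x, data_y, excluded_x, excluded_y = [], [], [], []
    for x, y in zip(rows_x, rows_y):
        if y == excluded:
            excluded_x.append(x)
            excluded_y.append(y)
        else:
            data_x.append(x)
            data_y.append(y)
    return data_x, data_y, excluded_x, excluded_y


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # hypothetical feature rows
labels = ["Star", "Unknown", "Galaxy"]           # hypothetical label aliases
data_x, data_y, excluded_x, excluded_y = exclude_label(features, labels, "Unknown")
```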
- generate_features(df)
Create the feature combinations that the user specified.
- Parameters:
df (DataFrame) – A dataframe containing all of the dataset.
- Returns:
df (DataFrame) – An expanded version of df that includes the feature combinations.
df_al (DataFrame) – A dataframe containing a subset of df with only the required features for training.
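The source does not list which feature combinations are available. Purely as an assumed example, the sketch below adds the pairwise difference of every pair of columns, using dictionaries in place of DataFrame rows. The helper name and the "a-b" column naming are hypothetical:

```python
from itertools import combinations


def add_pairwise_differences(rows, columns):
    """Return copies of rows with an extra 'a-b' entry for every pair of columns."""
    expanded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for a, b in combinations(columns, 2):
            new_row[f"{a}-{b}"] = row[a] - row[b]
        expanded.append(new_row)
    return expanded


rows = [{"g": 20.5, "r": 19.0, "i": 18.25}]  # hypothetical magnitudes
expanded = add_pairwise_differences(rows, ["g", "r", "i"])
```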
- iterate_AL()
Iterate through one iteration of active learning.
- Return type:
None
- panel(button_update=False)
Create the active learning tab panel.
- Returns:
panel_row – The panel is housed in a row which can then be rendered by the respective Dashboard.
- Return type:
Panel Row
- query_new_point()
Query the most informative point from the training pool based on the chosen query metric.
- Return type:
None
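Which point counts as "most informative" depends on the selected query strategy. One common choice is classification uncertainty, sketched below under the assumption that predicted class probabilities are available for every pool point. The function name is hypothetical, and this is not necessarily the metric used in any given session:

```python
def most_uncertain_index(probabilities):
    """Return the index of the pool point whose top class probability is lowest."""
    uncertainty = [1.0 - max(p) for p in probabilities]
    return max(range(len(uncertainty)), key=uncertainty.__getitem__)


# Hypothetical [P(negative), P(positive)] predictions for four pool points.
pool_proba = [[0.95, 0.05], [0.60, 0.40], [0.10, 0.90], [0.52, 0.48]]
query_index = most_uncertain_index(pool_proba)
```

The last point is selected because the classifier is least decided about it.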
- remove_from_pool(id=None)
Remove the current queried source from the active learning pool.
- Return type:
None
- save_model(checkpoint=False)
Save the current classifier(s) as a joblib file to the models/ directory. The classifier filename will include the classifier(s) used and the corresponding query function(s). If training a committee, a new directory will be created in which each of the committee models will be saved.
- Parameters:
checkpoint (bool, default = False) – Flag for whether the model is being saved as a checkpoint. If True, the filename will include the current size of the training set, the current validation F1-score, and the time and date, to allow for easy tracking and organisation of models.
- Return type:
None
- scale_data(x_train, x_val, x_test, x_cols)
Scale the features of the data according to the training set.
A RobustScaler is used to limit the impact of outliers on the data.
- Parameters:
x_train (DataFrame) – A dataframe containing the training set. All subsequent data will be scaled according to this data.
x_val (DataFrame) – A dataframe containing the validation set.
x_test (DataFrame) – A dataframe containing the testing set.
x_cols (list of str) – List containing all the column names in x_train, x_val and x_test.
- Returns:
data_x_tr (DataFrame) – A dataframe containing the normalised training set.
data_x_val (DataFrame) – A dataframe containing the normalised validation set.
data_x_test (DataFrame) – A dataframe containing the normalised testing set.
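With its default settings, a RobustScaler centres each feature on its median and divides by its interquartile range. The sketch below reproduces that idea for a single column using only the standard library, fitting on the training values and reusing those statistics for the validation values. The function names are hypothetical, and the class itself uses sklearn.preprocessing.RobustScaler rather than this code:

```python
from statistics import median, quantiles


def fit_robust(column):
    """Return the (median, interquartile range) of one training feature column."""
    q1, _, q3 = quantiles(column, n=4, method="inclusive")
    return median(column), q3 - q1


def transform_robust(column, centre, iqr):
    """Scale values with statistics fitted on the training set only."""
    scale = iqr if iqr else 1.0  # avoid division by zero for constant columns
    return [(v - centre) / scale for v in column]


train = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical feature with one outlier
val = [2.5, 3.0]
centre, iqr = fit_robust(train)
train_scaled = transform_robust(train, centre, iqr)
val_scaled = transform_robust(val, centre, iqr)
```

The outlier barely affects the median and quartiles, so the ordinary values keep a sensible scale, which is the motivation given above for choosing a robust scaler.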
- setup_learners()
Initialise the classifiers used during active learning.
The classifiers used have already been chosen by the user.
- Return type:
None
- setup_panel()
Create the panel which will house all the classifier setup options.
- Returns:
self.panel_row – The panel is housed in a row which can then be rendered by the respective Dashboard.
- Return type:
Panel Row
- show_queried_point()
Assign the classifier’s current queried point as the current selected source.
- Return type:
None
- split_x_y_ids(df_data)
Separate the data into X and [y,ids] dataframes.
- Parameters:
df_data (DataFrame) – A dataframe containing all the training features, the label column and the id column.
- Returns:
df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.
df_data_y_ids (DataFrame) – A dataframe containing only the label and id columns corresponding to df_data_x.
- split_y_ids(y_id_train, y_id_val, y_id_test)
Split label and id columns into separate dataframes.
- Parameters:
y_id_train (DataFrame) – Dataframe containing label and id columns of the training set.
y_id_val (DataFrame) – Dataframe containing label and id columns of the validation set.
y_id_test (DataFrame) – Dataframe containing label and id columns of the test set.
- Returns:
data_y_tr (DataFrame) – Dataframe containing only the label column of y_id_train.
data_id_tr (DataFrame) – Dataframe containing only the id column of y_id_train.
data_y_val (DataFrame) – Dataframe containing only the label column of y_id_val.
data_id_val (DataFrame) – Dataframe containing only the id column of y_id_val.
data_y_test (DataFrame) – Dataframe containing only the label column of y_id_test.
data_id_test (DataFrame) – Dataframe containing only the id column of y_id_test.
- train_val_test_split(df_data_x, df_data_y, excluded_x, excluded_y, train_ratio, val_ratio)
Split data into train, validation and test sets. The method uses stratified sampling to ensure each set has the correct distribution of points.
- Parameters:
df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.
df_data_y (DataFrame) – A dataframe containing the labels corresponding to df_data_x.
excluded_x (DataFrame) – The feature rows whose label was excluded, presumably the excluded_x output of exclude_unclassified_labels.
excluded_y (DataFrame) – The labels corresponding to excluded_x.
train_ratio (float) – The fraction of the total dataset that should be used for the training set.
val_ratio (float) – The fraction of the total dataset that should be used for the validation set.
- Returns:
x_train (DataFrame) – A subset of df_data_x which will be used for training a model.
y_train (DataFrame) – A dataframe containing the labels corresponding to x_train.
x_val (DataFrame) – A subset of df_data_x which will be used for validating a model.
y_val (DataFrame) – A dataframe containing the labels corresponding to x_val.
x_test (DataFrame) – A subset of df_data_x which will be used for testing a model.
y_test (DataFrame) – A dataframe containing the labels corresponding to x_test.
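Stratified sampling, as mentioned in the method description, means splitting each class separately so that every set keeps roughly the same class proportions. A minimal sketch on index lists follows. The function name, the seed argument and the rounding rule are assumptions, and how the real method handles excluded_x and excluded_y is not shown:

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio, val_ratio, seed=0):
    """Return (train, val, test) index lists that preserve per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_ratio)
        n_val = round(len(indices) * val_ratio)
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains is the test set
    return train, val, test


labels = [0] * 80 + [1] * 20  # hypothetical imbalanced binary labels
train, val, test = stratified_split(labels, train_ratio=0.6, val_ratio=0.2)
```

The test fraction is implied as 1 - train_ratio - val_ratio, which matches the method signature taking only the two ratios.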