astronomicAL.active_learning¶
- class astronomicAL.active_learning.active_learning.ActiveLearningModel(src, df, label)¶
This class handles the Machine Learning aspect of the codebase.
Based on the user's settings, the required features are extracted from the data and split into train, validation and test sets. The user can then specify which classifiers and query functions to use in the active learning process. The results at each stage are displayed in various widgets and plots, allowing the user to select the correct label for the newly queried source. Each instance trains a separate one-vs-rest classifier.
- Parameters
src (ColumnDataSource) – The shared data source which holds the current selected source.
df (DataFrame) – The shared dataframe which holds all the data.
label (str) – The string alias of the label that will be the positive case in the one-vs-rest classifier.
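As a rough illustration of the one-vs-rest setup, the sketch below maps string labels to binary targets for a single positive label. The helper name `to_one_vs_rest` and the example labels are hypothetical and are not part of astronomicAL.

```python
def to_one_vs_rest(labels, positive_label):
    """Return 1 where a label equals positive_label and 0 otherwise."""
    return [1 if label == positive_label else 0 for label in labels]


# Hypothetical labels; "Star" plays the role of the positive case.
labels = ["Star", "Galaxy", "QSO", "Star"]
binary_targets = to_one_vs_rest(labels, "Star")
print(binary_targets)  # [1, 0, 0, 1]
```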
- df¶
The shared dataframe which holds all the data.
- Type
DataFrame
- src¶
The shared data source which holds the current selected source.
- Type
ColumnDataSource
- _label¶
The label that will be the positive case in the one-vs-rest classifier.
- Type
int
- _label_alias¶
The string alias of _label.
- Type
str
- _training¶
Flag for whether the training process has begun.
- Type
bool
- _assigned¶
Flag for whether the user has assigned a label to the current queried source.
- Type
bool
- retrain¶
Flag for whether the class is retraining a previous model from within config.settings[“classifiers”].
- Type
bool
- scaler¶
The scaler used to standardise features across the train, val and test sets according to the training set. NOTE: Only initialised if config.settings[“scale_data”] is True.
- Type
sklearn.preprocessing.RobustScaler
- _show_test_results¶
Flag for whether to render the test set results column in the panel method.
- Type
bool
- _seen_test_results¶
Flag for indicating whether the user has viewed the test results of the classifier.
- Type
bool
- _show_caution¶
Flag for whether to show the test set caution column to the user when trying to view the test set results.
- Type
bool
- _seen_caution¶
Flag for indicating whether the user has viewed the test results caution page.
- Type
bool
- _max_x¶
The maximum value of the x axis of the train, val and metric plots. This is set as the min(μ(x)+4*σ(x), max(x)).
- Type
float
- _max_y¶
The maximum value of the y axis of the train, val and metric plots. This is set as the min(μ(y)+4*σ(y), max(y)).
- Type
float
- _min_x¶
The minimum value of the x axis of the train, val and metric plots. This is set as the max(μ(x)-4*σ(x), min(x)).
- Type
float
- _min_y¶
The minimum value of the y axis of the train, val and metric plots. This is set as the max(μ(y)-4*σ(y), min(y)).
- Type
float
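The four axis limits above follow the same clipping rule: take the mean plus or minus four standard deviations, but never go beyond the observed minimum or maximum. A minimal sketch of that rule is shown here. The function name `clipped_axis_limits` is hypothetical, and the use of the population standard deviation is an assumption, since the estimator is not stated above.

```python
import statistics


def clipped_axis_limits(values, n_sigma=4):
    """Return (lower, upper) limits clipped to mean +/- n_sigma * std."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation; the library may use another estimator.
    std = statistics.pstdev(values)
    lower = max(mean - n_sigma * std, min(values))
    upper = min(mean + n_sigma * std, max(values))
    return lower, upper


# 99 well-behaved points plus one extreme outlier.
x = [float(i % 10) for i in range(99)] + [1000.0]
min_x, max_x = clipped_axis_limits(x)
```

With this rule a single extreme outlier no longer stretches the plot, because the upper limit falls well below the outlier's value.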
- _model_output_data_tr¶
Dictionary containing the plotting data [config.settings[“default_vars”][0], config.settings[“default_vars”][1], metric, y, pred] required for the train and metric plots.
- Type
dict
- _model_output_data_val¶
Dictionary containing the plotting data [config.settings[“default_vars”][0], config.settings[“default_vars”][1], y, pred] required for the val plot.
- Type
dict
- _accuracy_list¶
Dictionary containing the train and validation accuracy scores recorded against the number of training points.
- Type
dict
- _f1_list¶
Dictionary containing the train and validation F1 scores recorded against the number of training points.
- Type
dict
- _precision_list¶
Dictionary containing the train and validation precision scores recorded against the number of training points.
- Type
dict
- _recall_list¶
Dictionary containing the train and validation recall scores recorded against the number of training points.
- Type
dict
- _train_scores¶
Dictionary containing the current scores for the training set.
- Type
dict
- _val_scores¶
Dictionary containing the current scores for the validation set.
- Type
dict
- _test_scores¶
Dictionary containing the current scores for the testing set.
- Type
dict
- corr_train¶
The config.settings[“default_vars”][0] and config.settings[“default_vars”][1] values of all the training sources that are currently predicted correctly.
- Type
ColumnDataSource
- incorr_train¶
The config.settings[“default_vars”][0] and config.settings[“default_vars”][1] values of all the training sources that are currently predicted incorrectly.
- Type
ColumnDataSource
- corr_val¶
The config.settings[“default_vars”][0] and config.settings[“default_vars”][1] values of all the validation sources that are currently predicted correctly.
- Type
ColumnDataSource
- incorr_val¶
The config.settings[“default_vars”][0] and config.settings[“default_vars”][1] values of all the validation sources that are currently predicted incorrectly.
- Type
ColumnDataSource
- queried_points¶
The config.settings[“default_vars”][0] and config.settings[“default_vars”][1] values of the current queried points.
- Type
ColumnDataSource
- full_labelled_data¶
Dictionary containing the id and y values of all labelled points during training.
- Type
dict
- assign_label_group¶
The group of buttons containing the possible labels for labelling during training.
- Type
Panel RadioButtonGroup Widget
- assign_label_button¶
The button for assigning the selected label from assign_label_group to the currently queried source.
- Type
Panel Button Widget
- show_queried_button¶
The button for making the current queried point the current selected point.
- Type
Panel Button Widget
- classifier_dropdown¶
A dropdown menu showing all the classifiers initialised in astronomicAL.extensions.models.
- Type
Panel Select Widget
- query_strategy_dropdown¶
A dropdown menu showing all the query strategies initialised in astronomicAL.extensions.query_strategies.
- Type
Panel Select Widget
- starting_num_points¶
Set the number of initial randomly selected points to train on.
- Type
Panel IntInput Widget
- classifier_table_source¶
The collection of all the currently selected classifier and query strategy pairs.
- Type
ColumnDataSource
- classifier_table¶
The table for visualising classifier_table_source.
- Type
DataTable
- add_classifier_button¶
The button for appending the currently selected values from classifier_dropdown and query_strategy_dropdown to classifier_table_source.
- Type
Panel Button Widget
- remove_classifier_button¶
The button for removing the last entry from classifier_table_source.
- Type
Panel Button Widget
- start_training_button¶
The button for beginning the training of a classifier using the selected parameters from classifier_table_source and starting_num_points.
- Type
Panel Button Widget
- next_interation_button¶
The button to begin the next iteration of the Active Learning process. Only visible after assigning a label to the currently queried point.
- Type
Panel Button Widget
- checkpoint_button¶
The button to save the current model and the parameters required to recreate the current setup.
- Type
Panel Button Widget
- request_test_results_button¶
The button to request the current classifier's results for the test set.
- Type
Panel Button Widget
- _return_to_train_view_button¶
The button displayed in the test set caution window, allowing the user to return to the train and validation results without seeing the test set results.
- Type
Panel Button Widget
- _stop_caution_show_checkbox¶
A checkbox for whether the user wants to disable the test set caution window from appearing when they want to view the test set.
- Type
Panel Checkbox Widget
- _view_test_results_button¶
The button to show the test set results to the user. If _show_caution is True, this button will show the test set caution window instead.
- Type
Panel Button Widget
- _queried_is_selected¶
Flag for whether the current queried point is also the current selected point.
- Type
bool
- setup_row¶
A row containing all the classifier setup settings required before the training process has begun.
- Type
Panel Row
- panel_row¶
A row containing all the visualisation aspects of the ActiveLearningModel view.
- Type
Panel Row
- conf_mat_tr_tn¶
The number of true negatives in the classifier's current prediction of the training set.
- Type
str
- conf_mat_tr_fn¶
The number of false negatives in the classifier's current prediction of the training set.
- Type
str
- conf_mat_tr_fp¶
The number of false positives in the classifier's current prediction of the training set.
- Type
str
- conf_mat_tr_tp¶
The number of true positives in the classifier's current prediction of the training set.
- Type
str
- conf_mat_val_tn¶
The number of true negatives in the classifier's current prediction of the validation set.
- Type
str
- conf_mat_val_fn¶
The number of false negatives in the classifier's current prediction of the validation set.
- Type
str
- conf_mat_val_fp¶
The number of false positives in the classifier's current prediction of the validation set.
- Type
str
- conf_mat_val_tp¶
The number of true positives in the classifier's current prediction of the validation set.
- Type
str
- conf_mat_test_tn¶
The number of true negatives in the classifier's current prediction of the test set.
- Type
str
- conf_mat_test_fn¶
The number of false negatives in the classifier's current prediction of the test set.
- Type
str
- conf_mat_test_fp¶
The number of false positives in the classifier's current prediction of the test set.
- Type
str
- conf_mat_test_tp¶
The number of true positives in the classifier's current prediction of the test set.
- Type
str
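To make the confusion matrix attributes concrete, here is a small sketch of how the four counts could be derived for a binary one-vs-rest prediction. The function `confusion_counts` and the example arrays are hypothetical, and the final conversion to strings only mirrors the str type documented above. It is not taken from the library's implementation.

```python
def confusion_counts(y_true, y_pred):
    """Count tn, fp, fn, tp for binary labels (1 = positive class, 0 = rest)."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
# The attributes are documented as str, so convert the counts for display.
conf_mat_display = {name: str(value) for name, value in counts.items()}
print(conf_mat_display)  # {'tn': '2', 'fp': '1', 'fn': '1', 'tp': '2'}
```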
- all_al_data¶
A dataframe containing a subset of df with only the required features for training.
- Type
DataFrame
- x_train¶
A dataframe containing all the training input data.
- Type
DataFrame
- y_train¶
A dataframe containing all the training labels.
- Type
DataFrame
- id_train¶
A dataframe containing all the training ids.
- Type
DataFrame
- x_val¶
A dataframe containing all the validation input data.
- Type
DataFrame
- y_val¶
A dataframe containing all the validation labels.
- Type
DataFrame
- id_val¶
A dataframe containing all the validation ids.
- Type
DataFrame
- x_test¶
A dataframe containing all the test input data.
- Type
DataFrame
- y_test¶
A dataframe containing all the test labels.
- Type
DataFrame
- id_test¶
A dataframe containing all the test ids.
- Type
DataFrame
- x_al_train¶
The data that the classifier is training on.
- Type
Numpy Array
- y_al_train¶
The labels for the data the classifier is training on.
- Type
Numpy Array
- id_al_train¶
The ids of the data the classifier is training on.
- Type
DataFrame
- x_pool¶
The data of the sources in the pool that are available to query from.
- Type
Numpy Array
- y_pool¶
The labels of the sources in the pool that are available to query from.
- Type
Numpy Array
- id_pool¶
The ids of the sources in the pool that are available to query from.
- Type
DataFrame
- query_index¶
The current index of x_pool that contains the current queried point.
- Type
int
- learner¶
The current classifier that is being trained. If multiple classifiers exist in classifier_table_source, then learner will be a ModAL Committee.
- Type
ModAL ActiveLearner
- assign_global_data()¶
Assign the current train, validation and test sets to the shared ml_data dictionary so that they can be used by other classifiers.
- Returns
- Return type
None
- create_pool(preselected=None)¶
Create the pool from which points are queried during active learning. The training set is split into the pool and the classifier’s training set. The size of the classifier’s training set has already been set by the user, and these points are chosen randomly from the pool.
- Returns
- Return type
None
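The pool creation described above can be pictured as a random partition of the training ids. The sketch below is only an illustration under that reading. The function name `split_initial_training_and_pool` and the `seed` argument are hypothetical, and the `preselected` argument of the real method is not modelled.

```python
import random


def split_initial_training_and_pool(train_ids, starting_num_points, seed=None):
    """Randomly pick the initial training ids; the remainder forms the pool."""
    rng = random.Random(seed)
    initial_ids = rng.sample(train_ids, starting_num_points)
    chosen = set(initial_ids)
    pool_ids = [i for i in train_ids if i not in chosen]
    return initial_ids, pool_ids


train_ids = list(range(20))
initial_ids, pool_ids = split_initial_training_and_pool(train_ids, 5, seed=0)
```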
- exclude_unclassified_labels(df_data_x, df_data_y, excluded)¶
Remove any sources that have a label that is not being trained on.
- Parameters
df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.
df_data_y (DataFrame) – A dataframe containing the label corresponding to df_data_x.
excluded (str) – The label which should be removed from df_data_x and df_data_y.
- Returns
data_x (DataFrame) – A subset of df_data_x which has had all rows with label excluded removed.
data_y (DataFrame) – A subset of df_data_y which has had all rows with label excluded removed.
excluded_x (DataFrame) – A subset of df_data_x which only has rows with label excluded.
excluded_y (DataFrame) – A subset of df_data_y which only has rows with label excluded.
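A minimal pandas sketch of the four-way split described above follows. The function name `exclude_label`, the `label_col` argument and the example column names are assumptions made for illustration. The library's own column handling may differ.

```python
import pandas as pd


def exclude_label(df_data_x, df_data_y, excluded, label_col="label"):
    """Split features and labels into kept rows and rows carrying `excluded`."""
    is_excluded = df_data_y[label_col] == excluded
    data_x, data_y = df_data_x[~is_excluded], df_data_y[~is_excluded]
    excluded_x, excluded_y = df_data_x[is_excluded], df_data_y[is_excluded]
    return data_x, data_y, excluded_x, excluded_y


# Hypothetical feature and label frames sharing the same index.
features = pd.DataFrame({"g-r": [0.1, 0.5, 0.9, 1.3]})
labels = pd.DataFrame({"label": ["Star", "Unknown", "Galaxy", "Unknown"]})
data_x, data_y, excluded_x, excluded_y = exclude_label(features, labels, "Unknown")
```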
- generate_features(df)¶
Create the feature combinations that the user specified.
- Parameters
df (DataFrame) – A dataframe containing all of the dataset.
- Returns
df (DataFrame) – An expanded version of df that includes the feature combinations.
df_al (DataFrame) – A dataframe containing a subset of df with only the required features for training.
- iterate_AL()¶
Run a single iteration of active learning.
- Returns
- Return type
None
- panel(button_update=False)¶
Create the active learning tab panel.
- Returns
panel_row – The panel is housed in a row which can then be rendered by the respective Dashboard.
- Return type
Panel Row
- query_new_point()¶
Query the most informative point from the training pool based on the chosen query metric.
- Returns
- Return type
None
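What counts as "most informative" depends on the query strategy selected. As one common example, and not necessarily the strategy used here, least-confidence uncertainty sampling for a binary classifier picks the pool point whose positive-class probability is closest to 0.5. The function name and the probabilities below are hypothetical.

```python
def least_confident_index(positive_probabilities):
    """Return the index whose positive-class probability is closest to 0.5."""
    return min(
        range(len(positive_probabilities)),
        key=lambda i: abs(positive_probabilities[i] - 0.5),
    )


# Hypothetical predicted probabilities for four sources in the pool.
pool_probabilities = [0.95, 0.10, 0.55, 0.80]
query_index = least_confident_index(pool_probabilities)
print(query_index)  # 2
```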
- remove_from_pool(id=None)¶
Remove the current queried source from the active learning pool.
- Returns
- Return type
None
- save_model(checkpoint=False)¶
Save the current classifier(s) as a joblib file to the models/ directory. The classifier filename will include the classifier(s) used and the corresponding query function(s). If training a committee, a new directory will be created in which each of the committee's models will be saved.
- Parameters
checkpoint (bool, default = False) – Flag for whether the model is being saved as a checkpoint. If True, the filename will include the current size of the training set, the current validation F1-score, and the time and date, to allow for easy tracking and organisation of models.
- Returns
- Return type
None
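The exact filename layout is not specified above, so the following is only a hypothetical sketch of how a checkpoint name could combine the classifier, query function, training set size, validation F1-score and timestamp. The function `checkpoint_filename`, the separator, the timestamp format and the example names are all assumptions.

```python
from datetime import datetime


def checkpoint_filename(classifier, query_strategy, train_size, val_f1, now=None):
    """Build a descriptive checkpoint filename (hypothetical layout)."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    return f"{classifier}_{query_strategy}_{train_size}_{val_f1:.2f}_{timestamp}.joblib"


# A fixed timestamp keeps the example reproducible.
name = checkpoint_filename(
    "RForest", "UncertaintySampling", 150, 0.8734, now=datetime(2021, 3, 1, 12, 30, 0)
)
print(name)  # RForest_UncertaintySampling_150_0.87_20210301_123000.joblib
```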
- scale_data(x_train, x_val, x_test, x_cols)¶
Scale the features of the data according to the training set.
A RobustScaler is used to limit the impact of outliers on the data.
- Parameters
x_train (DataFrame) – A dataframe containing the training set. All subsequent data will be scaled according to this data.
x_val (DataFrame) – A dataframe containing the validation set.
x_test (DataFrame) – A dataframe containing the testing set.
x_cols (list of str) – List containing all the column names in x_train, x_val and x_test.
- Returns
data_x_tr (DataFrame) – A dataframe containing the normalised training set.
data_x_val (DataFrame) – A dataframe containing the normalised validation set.
data_x_test (DataFrame) – A dataframe containing the normalised testing set.
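The key point of the description above is that the scaler is fitted on the training set only and then applied to all three sets. A minimal sketch using scikit-learn's RobustScaler is given below. The wrapper name `scale_with_training_statistics`, the extra returned scaler and the toy data are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler


def scale_with_training_statistics(x_train, x_val, x_test, x_cols):
    """Fit a RobustScaler on x_train only, then transform all three sets."""
    scaler = RobustScaler()
    scaler.fit(x_train[x_cols])
    scaled = [
        pd.DataFrame(scaler.transform(x[x_cols]), columns=x_cols, index=x.index)
        for x in (x_train, x_val, x_test)
    ]
    return scaled[0], scaled[1], scaled[2], scaler


x_cols = ["mag"]
# The training set contains one outlier (100.0); median is 3.0 and IQR is 2.0.
x_train = pd.DataFrame({"mag": [1.0, 2.0, 3.0, 4.0, 100.0]})
x_val = pd.DataFrame({"mag": [3.0]})
x_test = pd.DataFrame({"mag": [5.0]})
data_x_tr, data_x_val, data_x_test, scaler = scale_with_training_statistics(
    x_train, x_val, x_test, x_cols
)
```

Because RobustScaler centres on the median and scales by the interquartile range, the outlier in the training data has little effect on how the validation and test values are transformed.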
- setup_learners()¶
Initialise the classifiers used during active learning.
The classifiers used have already been chosen by the user.
- Returns
- Return type
None
- setup_panel()¶
Create the panel which will house all the classifier setup options.
- Returns
self.panel_row – The panel is housed in a row which can then be rendered by the respective Dashboard.
- Return type
Panel Row
- show_queried_point()¶
Assign the classifier’s current queried point as the current selected source.
- Returns
- Return type
None
- split_x_y_ids(df_data)¶
Separate the data into X and [y,ids] dataframes.
- Parameters
df_data (DataFrame) – A dataframe containing all the training features, the label column and the id column.
- Returns
df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.
df_data_y_ids (DataFrame) – A dataframe containing only the label and id columns corresponding to df_data_x.
- split_y_ids(y_id_train, y_id_val, y_id_test)¶
Split label and id columns into separate dataframes.
- Parameters
y_id_train (DataFrame) – Dataframe containing label and id columns of the training set.
y_id_val (DataFrame) – Dataframe containing label and id columns of the validation set.
y_id_test (DataFrame) – Dataframe containing label and id columns of the test set.
- Returns
data_y_tr (DataFrame) – Dataframe containing only the label column of y_id_train.
data_id_tr (DataFrame) – Dataframe containing only the id column of y_id_train.
data_y_val (DataFrame) – Dataframe containing only the label column of y_id_val.
data_id_val (DataFrame) – Dataframe containing only the id column of y_id_val.
data_y_test (DataFrame) – Dataframe containing only the label column of y_id_test.
data_id_test (DataFrame) – Dataframe containing only the id column of y_id_test.
- train_val_test_split(df_data_x, df_data_y, excluded_x, excluded_y, train_ratio, val_ratio)¶
Split data into train, validation and test sets. The method uses stratified sampling so that each set preserves the label distribution of the full dataset.
- Parameters
df_data_x (DataFrame) – A dataframe containing only the features used for machine learning.
df_data_y (DataFrame) – A dataframe containing the labels corresponding to df_data_x.
train_ratio (float) – The fraction of the total dataset that should be used for the training set.
val_ratio (float) – The fraction of the total dataset that should be used for the validation set.
- Returns
x_train (DataFrame) – A subset of df_data_x which will be used for training a model.
y_train (DataFrame) – A dataframe containing the labels corresponding to x_train.
x_val (DataFrame) – A subset of df_data_x which will be used for validating a model.
y_val (DataFrame) – A dataframe containing the labels corresponding to x_val.
x_test (DataFrame) – A subset of df_data_x which will be used for testing a model.
y_test (DataFrame) – A dataframe containing the labels corresponding to x_test.
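One way to obtain a stratified three-way split is to call scikit-learn's train_test_split twice. The sketch below assumes that approach, which may not be how the method is implemented. It ignores the excluded_x and excluded_y arguments, and the function name, the `label` column and the `seed` argument are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(df_x, df_y, train_ratio, val_ratio, seed=0):
    """Split into train/val/test sets, stratifying on the label column."""
    # First carve off the training set.
    x_train, x_rest, y_train, y_rest = train_test_split(
        df_x, df_y, train_size=train_ratio, stratify=df_y["label"], random_state=seed
    )
    # val_ratio is a fraction of the whole dataset, so rescale it for the remainder.
    val_share = val_ratio / (1.0 - train_ratio)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, train_size=val_share, stratify=y_rest["label"], random_state=seed
    )
    return x_train, y_train, x_val, y_val, x_test, y_test


# Toy dataset with an 80/20 class imbalance.
df_x = pd.DataFrame({"feature": range(100)})
df_y = pd.DataFrame({"label": [0] * 80 + [1] * 20})
x_train, y_train, x_val, y_val, x_test, y_test = stratified_three_way_split(
    df_x, df_y, train_ratio=0.6, val_ratio=0.2
)
```

The second call needs the rescaled fraction because the remainder after the first split is smaller than the full dataset.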