Adding Custom Features to Your Model
The features given to a model are often the deciding factor in how accurately it can predict. This is arguably even more true when using a method such as Active Learning, where you may be training on only a tiny fraction of your entire dataset.
Custom Features
Custom generated features can be added as a new function to astronomicAL.extensions.feature_generation.
There are some requirements when declaring a new feature generation function:
The new function must have two input parameters:
    df - The DataFrame containing the entire dataset.
    n - The number of features involved in the operation.
    Note: If your particular operation does not easily scale to more than two features at a time, you can simply not make use of the n input parameter inside your function. However, you must still include n as an input parameter, even if you don’t use it.
The function must return the following:
    df - The updated DataFrame with the newly generated features.
    generated_features - A list containing the names of the generated feature columns.
The created function must be added as a new value to the oper dictionary within the get_oper_dict function, with a brief string key identifying the operation.
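The input and output requirements above can be checked mechanically before a function is registered. The sketch below is one way to do that; `check_feature_function` and the toy `doubled` function are hypothetical helpers written for this illustration and are not part of AstronomicAL.

```python
import inspect

import pandas as pd


def check_feature_function(func, df, n=2):
    """Hypothetical helper: verify func follows the (df, n) -> (df, list) contract."""
    # The first two parameters must be named df and n
    params = list(inspect.signature(func).parameters)
    assert params[:2] == ["df", "n"], f"expected (df, n), got {params}"

    # Call on a copy so the caller's DataFrame is left untouched
    result = func(df.copy(), n)
    assert isinstance(result, tuple) and len(result) == 2, "must return two values"

    new_df, generated_features = result
    assert isinstance(new_df, pd.DataFrame), "first return value must be a DataFrame"
    assert isinstance(generated_features, list), "second return value must be a list"

    # Every reported feature name must exist as a column
    missing = [name for name in generated_features if name not in new_df.columns]
    assert not missing, f"columns missing from the DataFrame: {missing}"
    return True


def doubled(df, n):
    """Toy function that ignores n but still accepts it, as the note requires."""
    df["2*x"] = 2 * df["x"]
    return df, ["2*x"]


print(check_feature_function(doubled, pd.DataFrame({"x": [1, 2, 3]})))  # True
```

Running the check on a copy of the data keeps a faulty function from corrupting the real dataset while it is being developed.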
Within the function, you can generate the combinations of features using:
    base_features = config.settings["features_for_training"]
    combs = list(combinations(base_features, n))
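To make the output of `combinations` concrete, the short sketch below replaces `config.settings["features_for_training"]` with a hypothetical list of band names. The names `g`, `r`, and `i` are illustrative assumptions, not values taken from AstronomicAL.

```python
from itertools import combinations

# Hypothetical stand-in for config.settings["features_for_training"]
base_features = ["g", "r", "i"]

# Every unordered pair of base features, in the order they appear in the list
combs = list(combinations(base_features, 2))
print(combs)  # [('g', 'r'), ('g', 'i'), ('r', 'i')]
```

Each combination appears only once and follows the order of the input list, so an order-sensitive operation such as subtraction yields `g-r` but never `r-g`.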
Creating Colours
Given the prevalence of photometry data in astronomical datasets, the most common additional features to create are colours. In AstronomicAL, these are provided by the default subtract (a-b) operation with a combination value of 2, so each colour is the difference between one pair of features.
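As a rough illustration of what a pairwise subtraction produces, the sketch below builds colours from three hypothetical magnitude columns. `subtract_sketch` is not AstronomicAL's own `subtract` function: it takes the base features as an explicit argument instead of reading them from `config`, and the `a-b` column naming is an assumption.

```python
from itertools import combinations

import pandas as pd


def subtract_sketch(df, n, base_features):
    """Illustrative only: add one difference column per combination of features.

    base_features is passed explicitly here; in AstronomicAL it would come
    from config.settings["features_for_training"].
    """
    generated_features = []
    for comb in combinations(base_features, n):
        name = "-".join(comb)  # e.g. "g-r" (naming scheme is an assumption)
        generated_features.append(name)
        if name not in df.columns:  # skip columns that already exist
            col = df[comb[0]]
            for feature in comb[1:]:
                col = col - df[feature]  # running difference
            df[name] = col
    return df, generated_features


# Hypothetical magnitudes in three bands
df = pd.DataFrame({"g": [20.0, 19.5], "r": [19.0, 19.0], "i": [18.5, 18.0]})
df, colours = subtract_sketch(df, 2, ["g", "r", "i"])
print(colours)  # ['g-r', 'g-i', 'r-i']
print(df["g-r"].tolist())  # [1.0, 0.5]
```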
Example: Max(a, b)
In this example, we will show how to create a new max function. Although the features produced by this specific function may not be particularly useful for improving a model’s performance, it works well as an example.
    import numpy as np
    from itertools import combinations

    # `config` is AstronomicAL's settings module; it is assumed to be already imported in feature_generation.py


    def max_oper(df, n):  # The function must include the parameters df and n

        np.random.seed(0)  # set random seed if required for reproducibility

        base_features = config.settings["features_for_training"]  # get the base features chosen by the user

        combs = list(combinations(base_features, n))  # a list of all the combinations of n base_features

        cols = list(df.columns)  # all the columns in the dataset
        generated_features = []  # the list that will keep track of all the new feature names

        for comb in combs:  # loop over all combination tuples

            # This loop creates the feature name string for the dataframe
            new_feature_name = "max("  # start of feature name
            for i in range(n):  # loop over each feature in the tuple
                new_feature_name = new_feature_name + f"{comb[i]}"  # add each feature in the operation
                if i != (n - 1):
                    new_feature_name = new_feature_name + ","  # separate features with a comma
                else:
                    new_feature_name = new_feature_name + ")"

            generated_features.append(new_feature_name)  # add the new feature name, which has the form max(f_1,f_2,...,f_n)

            if new_feature_name not in cols:  # if the feature already exists in the data, don't recalculate

                # This loop applies the operation over all the features in the combination and adds the result as a new column in the dataframe
                for i in range(n):  # loop over each individual feature in comb
                    if i == 0:
                        df[new_feature_name] = df[comb[i]]  # add the new column and set its value to the starting feature (without this you will get a KeyError)
                    else:
                        df[new_feature_name] = np.maximum(df[new_feature_name], df[comb[i]])  # calculate the running maximum

        return df, generated_features  # The function must return the updated dataframe and the list of generated features
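Before registering a function like this, it can help to run it on a small DataFrame outside the application. The sketch below does so with a compact variant of the same idea. Everything in it is an assumption made for illustration: `max_oper_compact` is a hypothetical name, the `SimpleNamespace` object stands in for AstronomicAL's `config` module, and the column values are made up.

```python
from itertools import combinations
from types import SimpleNamespace

import pandas as pd

# Stand-in for AstronomicAL's config module (assumption for this sketch only)
config = SimpleNamespace(settings={"features_for_training": ["g", "r", "i"]})


def max_oper_compact(df, n):
    """Hypothetical compact variant of max_oper using a row-wise pandas max."""
    base_features = config.settings["features_for_training"]
    generated_features = []
    for comb in combinations(base_features, n):
        name = "max(" + ",".join(comb) + ")"  # e.g. "max(g,r)"
        generated_features.append(name)
        if name not in df.columns:  # don't recalculate existing features
            # skipna=False propagates NaN, matching the behaviour of np.maximum
            df[name] = df[list(comb)].max(axis=1, skipna=False)
    return df, generated_features


df = pd.DataFrame({"g": [20.0, 19.5], "r": [19.0, 21.0], "i": [18.5, 22.0]})
df, names = max_oper_compact(df, 2)
print(names)  # ['max(g,r)', 'max(g,i)', 'max(r,i)']
print(df["max(g,r)"].tolist())  # [20.0, 21.0]
```

The row-wise `max` replaces the running-maximum loop; `skipna=False` is passed because pandas skips missing values by default, whereas `np.maximum` returns NaN when either input is NaN.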
Finally, add the new entry to the oper dictionary, referencing the function by name without parentheses or parameters:
    def get_oper_dict():
        oper = {
            "subtract (a-b)": subtract,
            "add (a+b)": add,
            "multiply (a*b)": multiply,
            "divide (a/b)": divide,
            "max(a,b)": max_oper,  # Newly created function
        }
        return oper
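The reason the dictionary stores the function without parameters is that the value is the function object itself, which is called later with whatever DataFrame and `n` are in use at that point. The sketch below shows the pattern with a hypothetical stand-in function, `add_pair`. How AstronomicAL itself looks up and calls the entry is not shown in this guide, so the dispatch line here is an assumption.

```python
import pandas as pd


def add_pair(df, n):
    """Hypothetical stand-in for a feature generation function."""
    df["a+b"] = df["a"] + df["b"]
    return df, ["a+b"]


def get_oper_dict():
    # Values are the function objects themselves: no parentheses, no arguments
    return {"add (a+b)": add_pair}


oper = get_oper_dict()
selected = "add (a+b)"  # for example, the key a user picks from a menu

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
df, generated = oper[selected](df, 2)  # the function is only called here
print(generated)  # ['a+b']
print(df["a+b"].tolist())  # [11, 22]
```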
And that is all that is required. The new max_oper function is now available to use in AstronomicAL.
