# Getting Started

## Contents

- {ref}`what-is-learning-machine-for`
- {ref}`overview`
- {ref}`install`
- {ref}`quick-start`

(what-is-learning-machine-for)=
## What is Learning Machine (LM) for?

Learning Machine (LM) supports machine learning data preprocessing, model construction, and experiment configuration. LM accepts configuration files in YAML format, allowing users to easily define processing pipelines and manage versions. It also provides built-in, general-purpose, and widely used processing engines for convenient processing.

(overview)=
## Overview

Learning Machine consists of several major components.

- Data engine
- Model (work in progress)
- Params (work in progress)

#### Data engine

The data engine processes input data. Users can stack processing engines to build complex pipelines. By default, the data engine supports processing with ```pandas.DataFrame```.

#### Model

Model provides a consistent interface (train, validate, test).

#### Params

Params provides default Optuna tuning parameters.

(install)=
## Install

#### As a package (local)

```
git clone https://github.com/devhoodit/learning-machine.git
pip install -e .
```

#### As a directory

```
git clone https://github.com/devhoodit/learning-machine.git
```

(quick-start)=
## Quick start

#### Process data

We will start with the [titanic dataset](https://www.kaggle.com/c/titanic/data). The titanic dataset contains some NaN values and categorical data.

```python
import seaborn as sns

data = sns.load_dataset('titanic')
# pclass duplicates the class column, so drop it
data = data.drop(["pclass"], axis=1)
print(data.head(3))
print()
data.info()
```

```
   survived     sex   age  sibsp  parch     fare embarked  class    who  \
0         0    male  22.0      1      0   7.2500        S  Third    man
1         1  female  38.0      1      0  71.2833        C  First  woman
2         1  female  26.0      0      0   7.9250        S  Third  woman

   adult_male deck  embark_town alive  alone
0        True  NaN  Southampton    no  False
1       False    C    Cherbourg   yes  False
2       False  NaN  Southampton   yes   True

RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   sex          891 non-null    object
 2   age          714 non-null    float64
 3   sibsp        891 non-null    int64
 4   parch        891 non-null    int64
 5   fare         891 non-null    float64
 6   embarked     889 non-null    object
 7   class        891 non-null    category
 8   who          891 non-null    object
 9   adult_male   891 non-null    bool
 10  deck         203 non-null    category
 11  embark_town  889 non-null    object
 12  alive        891 non-null    object
 13  alone        891 non-null    bool
dtypes: bool(2), category(2), float64(2), int64(3), object(5)
```

```age``` and ```deck``` contain NaN values, and ```sex```, ```embarked```, ```class```, ```who```, ```adult_male```, ```deck```, ```embark_town```, and ```alone``` are categorical. We will drop the rows with NaN values and encode the categorical columns into a one-hot representation.

We will build the processing pipeline with Learning Machine's data engine. A data engine is created with its initial settings and processes data when you pass data into the engine instance; a pipeline is built by flowing data through multiple data engines sequentially.
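Here is a minimal sketch of this create-then-call pattern, using ```DropNARow``` (one of the built-in engines used in the full example below). It assumes that a single engine is called on data the same way as the composite ```SequentialEngine``` shown later:

```python
import seaborn as sns

import learning_machine.engine as lm_engine

data = sns.load_dataset('titanic')

# 1. create an engine with its initial settings
dropna_engine = lm_engine.DropNARow(cols=["age"])

# 2. process data by calling the engine instance on the dataframe
#    (assumption: single engines are callable like SequentialEngine)
data = dropna_engine(data)
```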
The following example code shows commonly used data engines and the important points to note when using them.

```python
import learning_machine.engine as lm_engine

category_cols = ["sex", "embarked", "class", "who", "adult_male", "deck", "embark_town", "alone"]

# OneHotEncoder returns a NEW dataframe, so we need to concat it with the original dataframe
onehot_engine = lm_engine.OneHotEncoder(cols=category_cols)
# concat the original data with the new dataframe
concat_engine = lm_engine.ConcatDFs([onehot_engine])
# drop rows with NaN values
dropna_engine = lm_engine.DropNARow(cols=["age"])
# drop the processed columns; these columns are already encoded
dropcol_engine = lm_engine.DropColumns(cols=category_cols)
# apply the engines sequentially
seq_engine = lm_engine.SequentialEngine([concat_engine, dropna_engine, dropcol_engine])

# process data
data = seq_engine(data)
data.info()
```

```
Index: 714 entries, 0 to 890
Data columns (total 34 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   survived            714 non-null    int64
 1   age                 714 non-null    float64
 2   sibsp               714 non-null    int64
 3   parch               714 non-null    int64
 4   fare                714 non-null    float64
 5   alive               714 non-null    object
 6   onehot_female       714 non-null    float64
 7   onehot_male         714 non-null    float64
 8   onehot_C            714 non-null    float64
 9   onehot_Q            714 non-null    float64
 10  onehot_S            714 non-null    float64
 11  onehot_nan          714 non-null    float64
 12  onehot_First        714 non-null    float64
 13  onehot_Second       714 non-null    float64
 14  onehot_Third        714 non-null    float64
 15  onehot_child        714 non-null    float64
 16  onehot_man          714 non-null    float64
 17  onehot_woman        714 non-null    float64
 18  onehot_False        714 non-null    float64
 19  onehot_True         714 non-null    float64
 20  onehot_A            714 non-null    float64
 21  onehot_B            714 non-null    float64
 22  onehot_C            714 non-null    float64
 23  onehot_D            714 non-null    float64
 24  onehot_E            714 non-null    float64
 25  onehot_F            714 non-null    float64
 26  onehot_G            714 non-null    float64
 27  onehot_nan          714 non-null    float64
 28  onehot_Cherbourg    714 non-null    float64
 29  onehot_Queenstown   714 non-null    float64
 30  onehot_Southampton  714 non-null    float64
 31  onehot_nan          714 non-null    float64
 32  onehot_False        714 non-null    float64
 33  onehot_True         714 non-null    float64
dtypes: float64(30), int64(3), object(1)
```

The NaN rows are dropped and the one-hot encoding is done successfully.

The following is a brief description of each engine.

- ```OneHotEncoder```: One-hot encoding using the [scikit-learn OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). The engine returns a new dataframe in which each column is named ```onehot_{category_name}```.
- ```ConcatDFs```: Applies each given engine and concatenates the returned dataframes with the original data. This engine is used to merge the resulting dataframes produced by engines.
- ```DropNARow```: Drops rows with NA values in the specified columns.
- ```DropColumns```: Drops the specified columns.
- ```SequentialEngine```: Applies engines sequentially: data -> engine1 -> data1 -> engine2 -> data2. This is useful for grouping multiple engines into a single composite engine.

Other engines can be found [here](api/engine.md).

#### Build a data engine from a config file

The above example builds the data engine in code, but it can also be built from a config file. Learning Machine supports YAML-format config files.

```yaml
# config.yaml
data_engine:
  - ConcatDFs:
      - OneHotEncoder:
          cols:
            - sex
            - embarked
            - class
            - who
            - adult_male
            - deck
            - embark_town
            - alone
  - DropNARow:
      cols:
        - age
  - DropColumns:
      cols:
        - sex
        - embarked
        - class
        - who
        - adult_male
        - deck
        - embark_town
        - alone
```

As you might have noticed, each config entry is simply the name of a data engine followed by its arguments as key-value pairs. There is one exception: the ```ConcatDFs``` entry is a list of engines. Since ```ConcatDFs``` takes engines as its arguments, it requires this special format, and some other engines might likewise require a specialized format.
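For illustration, here is a sketch of a ```ConcatDFs``` entry with more than one sub-engine. It assumes ```ConcatDFs``` accepts any number of engines, mirroring the list it takes in code; the column choices here are arbitrary:

```yaml
# sketch: each sub-engine produces a dataframe, and all of them
# are concatenated with the original data
data_engine:
  - ConcatDFs:
      - OneHotEncoder:
          cols:
            - sex
      - OneHotEncoder:
          cols:
            - embarked
```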
Now, we will build the engine from the config file.

```python
from learning_machine import create_from_config

bundle = create_from_config("config.yaml")
engine = bundle.data_engine
```

```create_from_config``` returns a bundle (other components can also be built from the config), so we need to take the data engine from it. In the config file, the engines are specified as a list; ```create_from_config``` automatically applies a ```SequentialEngine``` to chain them into a single engine. Additionally, the individual engines can be accessed through the ```bundle.data_engines``` attribute.
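To close, here is a minimal sketch of using the bundle on the titanic data. It assumes that ```bundle.data_engines``` is an iterable holding the individual engines in config order; that detail is an assumption, not something stated above:

```python
import seaborn as sns

from learning_machine import create_from_config

data = sns.load_dataset('titanic').drop(["pclass"], axis=1)

bundle = create_from_config("config.yaml")

# the composite engine runs the whole pipeline in one call
processed = bundle.data_engine(data)

# assumption (not stated above): bundle.data_engines holds the
# individual engines in config order, so applying them one by one
# should be equivalent to calling the composite engine
stepwise = data
for engine in bundle.data_engines:
    stepwise = engine(stepwise)
```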