While most of your machine learning applications consume data from the data lake, you may encounter a scenario where you need an ad hoc import of a CSV file to supply supplemental data for a machine learning model. Best practice is to automate the import of periodic or updated data into the data lake, but sometimes a one-time-use or static dataset is all that is required.
- Access to an Infor CloudSuite
- User privileges to Infor Coleman AI
- Delimited data on a local machine
- Optional Campus courses:
  - Infor OS: Foundation for Multi-Tenant – Part 1
  - Infor OS: Foundation for Multi-Tenant – Part 2
  - Coleman AI Platform: Enablement Overview
The datasets section of Coleman is the staging area for data that machine learning will consume. Importing files is straightforward, with a few points to watch.
- Verify data structure
Importing data blindly is risky. It is good practice to understand your data before import: know its structure and its data types. Some attention to this on the front end will save effort on the development side. For import into Coleman, your data must be in a delimited form, usually CSV. Tab-separated data can also be read, or a custom delimiter can be used. Knowing which variables (columns) are numeric and which are strings is also important. Coleman will make some assumptions about data types, but in the end you are the keeper of the data and should verify them.
Additionally, pay attention to any special datatypes like date formats and time formats. These formats are often read in as strings and it takes some processing to treat them as date/time data.
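One way to check column types before upload is a small local script. The following is a minimal sketch using only the Python standard library; the type labels (`integer`, `double`, `string`) and the function names are illustrative assumptions and are not taken from Coleman, whose own inference may differ.

```python
import csv
import io

def infer_type(values):
    """Return 'integer', 'double', or 'string' for a list of raw cell strings."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "string"
    # Try the strictest numeric type first, then fall back.
    for name, cast in (("integer", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

def column_types(text, delimiter=","):
    """Guess a type for each column of delimited text that has a header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    # zip(*body) transposes rows into columns; it assumes rows of equal length.
    return {name: infer_type(col) for name, col in zip(header, zip(*body))}

sample = "id,price,sold_on\n1,9.99,2021-03-04\n2,12.50,2021-03-05\n"
print(column_types(sample))
# {'id': 'integer', 'price': 'double', 'sold_on': 'string'}
```

Note that the date column comes back as a string here, which mirrors the point above: dates and times usually need an explicit format before they can be treated as date/time data.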
Select the appropriate delimiter before data input. You will also have the opportunity to define any custom data formats. These are particularly important for date and time data, as there are many accepted formats for such data. Identifying the format early will help you process the data later. While certainly not an exhaustive list, common datetime formats can be found here for reference.
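To identify a date format before import, you can test a column's values against a few candidate patterns. This sketch uses Python's `datetime.strptime` directives, which are an assumption for local checking only and are not necessarily the format syntax Coleman expects; the candidate list is hypothetical and should be adapted to your data.

```python
from datetime import datetime

# Hypothetical candidate list; extend it with the formats you expect.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]

def matching_formats(values, formats=CANDIDATE_FORMATS):
    """Return every candidate format that parses all of the given strings."""
    matches = []
    for fmt in formats:
        try:
            for v in values:
                datetime.strptime(v, fmt)
        except ValueError:
            continue  # at least one value does not fit this format
        matches.append(fmt)
    return matches

# Both month-first and day-first fit, so the format is ambiguous.
print(matching_formats(["03/04/2021", "12/05/2021"]))
# ['%m/%d/%Y', '%d/%m/%Y']

# A day value above 12 rules out month-first.
print(matching_formats(["03/04/2021", "25/12/2021"]))
# ['%d/%m/%Y']
```

When more than one format matches, the data alone cannot settle the question, and you need to confirm the convention with whoever produced the file.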
Your data should also be structured such that each variable is a column and each observation (or case) is a row. Your data may optionally have a header which gives names to each variable.
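A quick structural check is to confirm that every row has the same number of fields as the header. The sketch below is one possible way to do that with Python's `csv` module; the function name `ragged_rows` is illustrative, and the check assumes the first line of the file is the header.

```python
import csv
import io

def ragged_rows(text, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) != len(header)
    ]

sample = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
print(ragged_rows(sample))
# [(3, 2), (4, 4)]
```

An empty result suggests the file is rectangular. A long list of mismatches often means the wrong delimiter was chosen rather than that the data is damaged.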
It is also best practice to have done some data investigation at this point. Understanding the extent of missing values, value distribution, and correlations in your data will help develop better machine learning models.
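As a starting point for this kind of investigation, the following sketch counts missing values per column and reports the mean, minimum, and maximum for columns that are entirely numeric. It uses only the standard library, treats an empty cell as missing (an assumption; your data may use other markers such as `NA`), and the name `profile` is illustrative.

```python
import csv
import io
import statistics

def profile(text, delimiter=","):
    """Summarize missing values and basic numeric statistics per column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            # Short rows yield None for absent fields; treat those as empty.
            columns[name].append((row[name] or "").strip())

    report = {}
    for name, values in columns.items():
        present = [v for v in values if v != ""]
        entry = {"missing": len(values) - len(present)}
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = []  # not a purely numeric column
        if numbers:
            entry["mean"] = statistics.mean(numbers)
            entry["min"] = min(numbers)
            entry["max"] = max(numbers)
        report[name] = entry
    return report

sample = "age,city\n30,Oslo\n,Lima\n50,\n"
print(profile(sample))
# {'age': {'missing': 1, 'mean': 40.0, 'min': 30.0, 'max': 50.0}, 'city': {'missing': 1}}
```

For larger files or for correlations between columns, a dataframe library is usually more convenient; this sketch is only meant to show the kind of check worth running before import.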
- Import data
In the datasets section of Coleman, click “+Add” and select import from “File”. Use the navigator to find your file. Select the appropriate delimiter and identify any custom data type formats. You will also need to tell Coleman if your file has a header.
- Examine Metadata
Press the preview button after selecting your data file. This should give you a brief glimpse of the data below and open up the Metadata tab. Select Metadata and view your data types. Coleman has made some guesses as to your data types (double/string/float/boolean etc.), and you can change any that need adjusting. This is also where you can rename variables as desired, particularly if your data did not have a header. You can always modify data types in a quest, but setting them correctly on import saves both time and computation.
- Save and Examine
When you are happy with your metadata, click the save icon and Coleman will process and load the data. Once processing is complete, the Statistics tab becomes available, giving you some insight into the structure of your data. This is a great place to double-check for missing values and to understand the spread and shape of your data.
- Use your data
Your dataset is now ready for use in a Coleman quest. Import data blocks in any quest will now list your new dataset in the dropdown.