A processor is an ETL-like concept highly optimized for getting data into Shooju. Processors extract data from external sources or from existing Shooju series, transform it to fit the Shooju Series concept, and load it into Shooju. The documentation explains how the processor works and lays out key assumptions. The settings are JSON-formatted objects that store key values such as URLs, passwords, and tokens. The code is written in Python and can be edited and run via Shooju Web. Most processors use Launchers to run at the appropriate time, while others are run manually or through Uploaders.
Keep in mind that processors write series through Jobs.
Documentation
The documentation is often written by the business user or analyst to describe the business requirements of the processor (e.g. where to pull data from, how often it is updated at the source, what timezone it is in). The processor developer often edits the documentation to add key implementation notes (e.g. parsing issues, potential dangers). The documentation is most often viewed and edited online using Shooju Web.
Settings
Settings are used to separate processor parameters that may change from the core code, which shouldn’t change often. The settings are written by the business analyst or the processor developer. Examples include:
- URLs
- FTP credentials
- Lookups (e.g. country code to country name)
- Additional field data to apply to all series (e.g. unit)
- RegEx for parsing data
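As an illustration of the examples above, a settings object and the code that reads it might look like the following. This is a minimal sketch: every key name, value, and the `parse_row` helper are hypothetical and do not represent a Shooju schema. It only shows how keeping the URL, lookup, common fields, and regular expression in JSON lets them change without touching the code.

```python
import json
import re

# Hypothetical settings; key names are illustrative, not a Shooju schema.
SETTINGS_JSON = """
{
  "url": "https://example.com/data.csv",
  "country_lookup": {"US": "United States", "DE": "Germany"},
  "common_fields": {"unit": "USD"},
  "row_regex": "^(?P<country>[A-Z]{2}),(?P<value>[0-9.]+)$"
}
"""

settings = json.loads(SETTINGS_JSON)
row_pattern = re.compile(settings["row_regex"])


def parse_row(line):
    """Parse one 'CC,value' line using the regex and lookup from settings."""
    match = row_pattern.match(line)
    if match is None:
        return None  # line does not fit the expected layout
    fields = dict(settings["common_fields"])  # fields applied to all series
    code = match["country"]
    # Fall back to the raw code when the lookup has no entry for it.
    fields["country"] = settings["country_lookup"].get(code, code)
    return {"fields": fields, "value": float(match["value"])}
```

Changing the unit or adding a country then only requires editing the settings object, not the parsing function.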
Roles
Roles are permissions set at the processor level. Users who are members of a team added to one of the categories below have the corresponding permissions:
- Launchers: Able to start/kill/revoke jobs.
- Callers: Able to call /processors/<processor_id>/call/<func_name> via the API.
- Expression Executors: Able to use processor functions in an expression context.
- Admins: Able to change processor settings, create or edit launchers, edit documentation, etc.
- Code Editors: Able to inspect and edit processor code.
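For the Callers role, the call path can be assembled as sketched below. The helper name `call_path` is hypothetical, and the text does not state the base URL, HTTP method, or authentication scheme, so the sketch only builds the path and escapes each segment.

```python
from urllib.parse import quote


def call_path(processor_id, func_name):
    """Build the call path from the Callers role, escaping each segment."""
    return "/processors/{}/call/{}".format(
        quote(processor_id, safe=""), quote(func_name, safe="")
    )


# Hypothetical processor ID and function name.
example = call_path("my_proc", "get_rates")
```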
Code
The code is the core part of the processor that actually does the work. Processors are written in Python and are generally 10 to 500 lines long, with the average around 150. The processor developer writes the code.
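The extract, transform, and load stages described for processors can be sketched in plain Python as follows. This is not the Shooju client API: the function names, the `prices:` series ID format, and the `fake_write` callable are all hypothetical stand-ins, with the write callable taking the place of a write made through a job.

```python
import csv
import io
from datetime import date


def extract(raw_text):
    """Extract: read rows from CSV text (stands in for an external source)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: group rows into series keyed by a hypothetical series ID."""
    series = {}
    for row in rows:
        series_id = "prices:" + row["ticker"]
        entry = series.setdefault(
            series_id, {"fields": {"ticker": row["ticker"]}, "points": []}
        )
        entry["points"].append(
            (date.fromisoformat(row["date"]), float(row["close"]))
        )
    return series


def load(series, write):
    """Load: hand each series to a write callable and count the series."""
    for series_id, entry in series.items():
        write(series_id, entry["fields"], entry["points"])
    return len(series)


RAW = "ticker,date,close\nABC,2024-01-02,10.5\nABC,2024-01-03,10.75\nXYZ,2024-01-02,99.0\n"
written = {}


def fake_write(series_id, fields, points):
    """Hypothetical stand-in for writing a series through a job."""
    written[series_id] = (fields, points)


count = load(transform(extract(RAW)), fake_write)
```

Keeping the three stages as separate functions makes each one testable on its own, which helps when a source changes its layout.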
Launchers
Processors can be run manually, through Uploaders, or, most frequently, through launchers. Each processor can have multiple launchers. Each launcher specifies how often it runs, using CRON syntax, and how it runs: as a job or as a trigger. As the name suggests, a job launcher immediately starts a job and tries to import data. A trigger launcher checks whether a job should be started and, based on the logic in the code, either starts the job or waits until the next time the trigger runs.
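The difference between the two launcher types can be sketched as below. The launcher dictionaries, their key names, and the freshness check are assumptions made for illustration; the actual trigger logic lives in each processor's code and may test something entirely different.

```python
from datetime import datetime

# Hypothetical launcher definitions using standard five-field cron syntax.
LAUNCHERS = [
    {"cron": "0 6 * * *", "run_as": "job"},         # every day at 06:00
    {"cron": "*/15 * * * *", "run_as": "trigger"},  # every 15 minutes
]


def trigger_should_start_job(source_updated_at, last_job_started_at):
    """Return True when the source changed after the last job began."""
    if last_job_started_at is None:
        return True  # nothing imported yet, so start a job
    return source_updated_at > last_job_started_at


def on_launch(launcher, source_updated_at, last_job_started_at):
    """Decide what one launcher firing does: 'start_job' or 'wait'."""
    if launcher["run_as"] == "job":
        return "start_job"  # job launchers always start a job
    if trigger_should_start_job(source_updated_at, last_job_started_at):
        return "start_job"
    return "wait"  # check again at the next scheduled trigger run
```

A frequently scheduled trigger with a cheap check like this avoids starting full import jobs when the source has not changed.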
Jobs
All writes to Series in Shooju must be done through a job. A job is identified by a sequential numerical identifier referred to as the job ID. The higher the job ID, the more recent the job. Jobs serve several purposes in Shooju:
Named Separation of Writes
Most data in Shooju comes in batches (an update from the IMF, NYSE closing prices, etc). Jobs logically separate writes into these batches, identify each batch by its job ID, and give it a name such as “NYSE Close Prices”.
Storing Job Metadata
Jobs contain metadata that helps explain what the job wrote and how: who started the job, when the job started and ended, how many series were written by the job, how many series/points/fields were changed, added, or removed as part of the job, etc.
Preventing Write Conflicts
Series’ points or fields can only be changed by a job whose job ID is the same as or higher than that of the job that last changed them. This ensures that newer jobs always get preferential treatment in a conflict. Consider a case where two jobs that started seconds apart try to import the same 10 series, but the second one has a slightly more up-to-date version of the data. Shooju guarantees that by the time both jobs are done, all 10 series will have the values set by the second job (the one with the higher job ID), even if the first job wrote the series last (because of slower processing, for example).
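The precedence rule can be modeled with a toy in-memory store, shown below. This is a sketch of the rule as stated, not of how Shooju implements it; the `SeriesStore` class, the job IDs 101 and 102, and the series names are hypothetical.

```python
class SeriesStore:
    """Toy in-memory store that applies the job-ID precedence rule."""

    def __init__(self):
        self._data = {}  # series_id -> (job ID of last accepted write, value)

    def write(self, job_id, series_id, value):
        """Accept the write only if job_id >= the job ID that last wrote."""
        last = self._data.get(series_id)
        if last is not None and job_id < last[0]:
            return False  # an older job cannot overwrite a newer job's data
        self._data[series_id] = (job_id, value)
        return True

    def read(self, series_id):
        """Return the current value of a series."""
        return self._data[series_id][1]


store = SeriesStore()
series_ids = ["series_{}".format(i) for i in range(10)]

# The second job (higher ID) happens to finish its writes first.
for series_id in series_ids:
    store.write(102, series_id, "newer")

# The first job (lower ID) writes last because it was slower.
late_results = [store.write(101, s, "older") for s in series_ids]
final_values = [store.read(s) for s in series_ids]
```

Every late write from job 101 is rejected, so all 10 series end up with the values from job 102 regardless of the order in which the writes arrived.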
Version Control
Shooju stores snapshots at the level of the job instead of the level of the write to facilitate comparisons across jobs. All changes are stored unless this feature is expressly turned off when registering a new job.
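To see why job-level snapshots make comparisons easier, consider the toy model below. It is only a sketch under assumptions: the `SnapshotHistory` class and its methods are hypothetical, and Shooju's actual storage format is not described here. Several writes within one job collapse into a single snapshot, so a comparison is always between two jobs.

```python
class SnapshotHistory:
    """Toy history that keeps one snapshot of a series per job."""

    def __init__(self):
        self._snapshots = {}  # job_id -> points as of the end of that job

    def record(self, job_id, points):
        """Store the state for job_id; repeated writes in a job replace it."""
        self._snapshots[job_id] = dict(points)

    def diff(self, old_job_id, new_job_id):
        """Return {date: (old, new)} for points that differ between two jobs."""
        old = self._snapshots[old_job_id]
        new = self._snapshots[new_job_id]
        return {
            key: (old.get(key), new.get(key))
            for key in sorted(set(old) | set(new))
            if old.get(key) != new.get(key)
        }


history = SnapshotHistory()
history.record(101, {"2024-01": 1.0, "2024-02": 2.0})
history.record(102, {"2024-01": 1.0, "2024-02": 2.5})  # first write in job 102
history.record(102, {"2024-01": 1.0, "2024-02": 2.6, "2024-03": 3.0})  # later write, same job
changes = history.diff(101, 102)
```

The intermediate value 2.5 never appears in the comparison, because only the state at the end of job 102 is kept.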
Uploaders
An uploader is a way for a user to launch processors that operate on one or more files. An administrator sets up an Uploader Preset by:
- associating it with a processor,
- giving permissions to users or teams to use it,
- overriding any processor settings that should differ when this uploader is used to launch it,
- adding descriptive language for users of the uploader.
A user then runs the uploader through Shooju Web by:
- opening the uploader tool,
- choosing the Uploader Preset,
- uploading the file(s) they want to use,
- optionally doing a test run,
- giving the upload job a descriptive name.
Keep in mind that uploaders launch processors that write series through Jobs. All data must be written through a job, and uploaders are not an exception.
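The way an Uploader Preset replaces processor settings for a single launch can be sketched as a dictionary merge. The function name, the setting keys, and the choice of a shallow merge are assumptions; the text does not describe how nested settings are combined.

```python
def effective_settings(processor_settings, preset_overrides):
    """Return processor settings with the uploader preset's values applied.

    A shallow merge is an assumption made for this sketch.
    """
    merged = dict(processor_settings)  # copy so the originals stay untouched
    merged.update(preset_overrides)
    return merged


# Hypothetical settings and overrides.
processor_settings = {"url": "https://example.com/feed", "unit": "USD", "sheet": "Data"}
preset_overrides = {"unit": "EUR"}
launch_settings = effective_settings(processor_settings, preset_overrides)
```

Because the merge works on a copy, the processor's own settings remain unchanged for launches that do not go through this preset.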
Summary of configurable fields
Field | Description |
description | Processor description |
async_mode | If true, write requests that occur during the HTTP API call are executed asynchronously |
batch_size_num | Maximum number of updates/inserts to run at a time |
code_saved_by | Last person who updated the code |
code_saved_date | Date the code was last saved |
disable_sara_auto_resolve | If true, SARA auto-resolve is disabled for the IOPS that are created |
id | Processor reference key |
last_job_date | Date of the last run |
last_job_num | Number of the last run |
last_trigger_date | Date of the last trigger run |
loads_proc_obj | Linked processors that are needed for execution |
long_running_job_threshold_num | Maximum execution time before a long-running IOPS (LRT) is started |
ltnj_threshold_num | Maximum time without execution before an IOPS due to inactivity (LTNJ) is started |
next_schedule_date | Date of the next run |
next_schedule_description | Description of the next run |
next_schedule_type | Type of the next run: JOB or TRIGGER |
no_history | If true, previous versions of a sid are not saved; if false, each sid version is stored |
run_code_from_linked_account | If the code has its source in a linked account, the link to that processor |
save_description | Last comment saved in version control |
schedules_num | Number of launchers associated with the processor (JOBs or TRIGGERs) |
series_prefix | Prefix of the sid space in which this processor can create, alter, and delete series |
status | Processor status: Production, Build, Cancel, or Validation |
tags | Tags (concepts or keywords) associated with the processor; may be useful for sorting and searching |
updated_date | Date of the last update |
urgency_score_num | Processor urgency when IOPS are present, on a scale of 0 to 5 |