Big Data/Machine Learning
Infobahn is an implementation partner for the Sparkflows Big Data and Machine Learning platform, which enables building end-to-end Big Data predictive applications. Users can build complex analytics, Machine Learning models and data pipelines in minutes on Apache Spark. Every organization dealing with Big Data faces the challenge of extracting enough value from it quickly, and Sparkflows is built to address that challenge.
It does so by providing 140+ Operators for Data Profiling/Cleaning, Machine Learning, NLP, OCR and Visualization. The operators are combined in an intelligent Workflow Editor. Sparkflows also provides Dashboards for rich visualizations, and it has both batch and streaming engines running on Apache Spark.
Sparkflows connects to various big data sources (HDFS, Hive, HBase, Kafka, Elasticsearch, etc.) and handles both structured and unstructured data.
- 140+ Processors running on Apache Spark providing Data Profiling, Machine Learning, ETL, NLP, OCR and Visualization.
- An intelligent Workflow Editor providing Schema Inference, Schema Propagation and Interactive Execution.
- Machine Learning covering Classification, Clustering and Regression, with complex feature generation using an array of processors.
- Streaming Analytics with Spark Streaming, connectors to Kafka, Flume and Twitter.
- Data Cleaning, Data Profiling and ETL covering Summary Statistics, SQL, Row Filters, Column Filters, Joins, etc., with the ability to write SQL, Scala and Python within the workflow.
- Reading and writing various file formats, including CSV/TSV, Avro, Parquet, JSON, PDF, Images, etc.
- Reading from and writing to various sources, including Hive, RDBMS/JDBC, HBase, Cassandra, Solr and Elasticsearch.
- NLP using OpenNLP and StanfordNLP. OCR using Tesseract.
- Powerful Visualization with Processors, plus Streaming Dashboards for streaming data.
- Workflow Scheduling.
- Smooth deployment on the edge node of an Apache Spark cluster, on-premises or in the cloud.
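
To make the data-preparation terms in the list above more concrete (Row Filter, Column Filter, Join, Summary Statistics), here is a minimal sketch in plain Python. It is not Sparkflows code and uses no Sparkflows or Spark API; the sample records and the function names `row_filter`, `column_filter`, `inner_join` and `summary_statistics` are hypothetical, chosen only to show what each step conceptually does to a small in-memory dataset.

```python
from statistics import mean

# Hypothetical sample data; these names and values are illustrative only.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 120.0, "status": "paid"},
    {"order_id": 2, "customer_id": 11, "amount": 35.5, "status": "cancelled"},
    {"order_id": 3, "customer_id": 10, "amount": 80.0, "status": "paid"},
    {"order_id": 4, "customer_id": 12, "amount": 15.0, "status": "paid"},
]
customers = [
    {"customer_id": 10, "region": "EU"},
    {"customer_id": 11, "region": "US"},
    {"customer_id": 12, "region": "US"},
]


def row_filter(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return [r for r in rows if predicate(r)]


def column_filter(rows, columns):
    """Keep only the named columns in every row."""
    return [{c: r[c] for c in columns} for r in rows]


def inner_join(left, right, key):
    """Join two row lists on a shared key, keeping only matching rows."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def summary_statistics(rows, column):
    """Compute count, min, max and mean for one numeric column."""
    values = [r[column] for r in rows]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Chain the steps the way a small workflow would.
paid = row_filter(orders, lambda r: r["status"] == "paid")
slim = column_filter(paid, ["customer_id", "amount"])
joined = inner_join(slim, customers, "customer_id")
stats = summary_statistics(joined, "amount")
```

On a Spark cluster the same steps would run as distributed operations over much larger datasets; the sketch only shows the shape of the transformations.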