AI - Orc
Optimizing large-scale data processing with the self-describing, type-aware ORC file format and its Java and C++ libraries, developed under the Apache umbrella.
- Name
- Orc - https://github.com/apache/orc
About Orc
Orc is a file format project under the Apache umbrella that provides Java and C++ libraries for reading and writing the Optimized Row Columnar (ORC) file format. The ORC file format is designed for Hadoop workloads and is optimized for large streaming reads, with integrated support for finding required rows quickly. It is a self-describing, type-aware columnar format: its type information and internal indexes let readers process only the values each query needs.
The ORC project includes a C++ reader and writer and a Java reader and writer. The two libraries are completely independent of each other, and each can read all versions of ORC files. Bugs are tracked in Apache Jira, and releases are available from the Apache ORC downloads page and from Maven Central.
To build ORC, you'll need Java 17 or higher, Maven 3.9.6 or higher, and cmake 3.12 or higher. The build instructions cover several variants: a release version with debug information, a debug version, a release version without debug information, the Java library only, or the C++ library only.
Because the format is type-aware and carries internal indexes, a reader can read, decompress, and process only the values required for the current query, which is what makes both streaming reads and quick searches efficient. The format supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.
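To make the idea of index-based skipping concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not ORC's actual on-disk layout or library API: the `Stripe` class and the `build_stripes` and `scan_greater_than` functions are hypothetical names, and the only "index" kept is a min/max pair per column for each group of rows.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Stripe:
    """Toy stand-in for a group of rows stored column by column (hypothetical)."""
    columns: Dict[str, List[int]]      # column name -> values in this stripe
    stats: Dict[str, Tuple[int, int]]  # column name -> (min, max) index entry


def build_stripes(rows: List[Dict[str, int]], stripe_size: int) -> List[Stripe]:
    """Split rows into stripes, pivot each to columns, and record min/max per column."""
    stripes = []
    for start in range(0, len(rows), stripe_size):
        chunk = rows[start:start + stripe_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        stripes.append(Stripe(columns, stats))
    return stripes


def scan_greater_than(stripes: List[Stripe], column: str, threshold: int):
    """Return (matching values, number of stripes whose data was actually read)."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        _, col_max = stripe.stats[column]
        if col_max <= threshold:
            continue  # the index shows no value here can match, so skip the data
        stripes_read += 1
        # Only the requested column is touched; other columns are never read.
        matches.extend(v for v in stripe.columns[column] if v > threshold)
    return matches, stripes_read


rows = [{"id": i, "score": i * 10} for i in range(12)]
stripes = build_stripes(rows, stripe_size=4)
values, stripes_read = scan_greater_than(stripes, "score", 75)
print(values, stripes_read)
```

In this sketch only the last of the three stripes is read, because the min/max entries for the first two rule them out. The real format keeps richer, type-specific statistics at several levels, so treat this purely as an illustration of the principle.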
Additionally, the ORC project offers optional AVX512 support, controlled at two points. At compile time, the BUILD_ENABLE_AVX512 cmake option determines whether AVX512 code is compiled into the binaries. At run time, the ORC_USER_SIMD_LEVEL environment variable determines whether that SIMD-optimized code is dispatched. Together these enable SIMD optimization on hardware that supports it.
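The interaction between the compile-time option and the run-time variable can be sketched as a two-gate check. The Python below is only an illustration of that logic, not the library's dispatch code: `effective_simd_level` is a hypothetical helper, the values "AVX512" and "NONE" (with "NONE" as the default) are assumed from the upstream README, and the real library would presumably also check what the CPU supports.

```python
import os
from typing import Mapping, Optional


def effective_simd_level(compiled_with_avx512: bool,
                         env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: AVX512 is used only if it was compiled in AND requested.

    compiled_with_avx512 models BUILD_ENABLE_AVX512=ON at build time.
    The environment lookup models ORC_USER_SIMD_LEVEL at run time.
    """
    env = os.environ if env is None else env
    requested = env.get("ORC_USER_SIMD_LEVEL", "NONE")  # assumed default: NONE
    if compiled_with_avx512 and requested == "AVX512":
        return "AVX512"
    return "NONE"


# Requested at run time but not compiled in: falls back to the non-SIMD path.
print(effective_simd_level(False, {"ORC_USER_SIMD_LEVEL": "AVX512"}))
# Compiled in and requested: the SIMD path is selected.
print(effective_simd_level(True, {"ORC_USER_SIMD_LEVEL": "AVX512"}))
```

The point of the sketch is that neither setting is sufficient on its own: a binary built without AVX512 cannot use it regardless of the environment, and a binary built with it still needs the run-time opt-in.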