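Avro uses JSON to define data types and communication protocols, and serializes data in a compact binary format. Other projects in the Hadoop ecosystem, such as HBase and Hive, also use it to transfer data between client and server. An example Avro schema definition: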
{
  "namespace": "com.abc.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
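To make "compact binary" concrete, here is a minimal sketch that hand-encodes one User value following the binary encoding rules in the Avro specification (zig-zag varints for int and long, length-prefixed UTF-8 for strings, a branch index before each union value). The function names `zigzag_varint`, `encode_string`, and `encode_user` are hypothetical helpers written for this illustration and are not part of any Avro library. The sketch assumes the field and union order of the User schema shown here.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an int/long: zig-zag first, then a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small negatives become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is a long byte count followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_user(name, favorite_number, favorite_color) -> bytes:
    """Encode a User record: fields are simply concatenated in schema order."""
    out = encode_string(name)
    # favorite_number is the union ["int", "null"]: branch 0 = int, branch 1 = null
    if favorite_number is None:
        out += zigzag_varint(1)
    else:
        out += zigzag_varint(0) + zigzag_varint(favorite_number)
    # favorite_color is the union ["string", "null"]: branch 0 = string, branch 1 = null
    if favorite_color is None:
        out += zigzag_varint(1)
    else:
        out += zigzag_varint(0) + encode_string(favorite_color)
    return out


encoded = encode_user("Alyssa", 256, None)
print(encoded.hex())   # 0c416c7973736100800402
print(len(encoded))    # 11
```

The record takes 11 bytes because field names are not written at all; the reader is expected to have the schema. In practice a library would do this work, so the sketch is only meant to show what ends up on the wire.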
Its advantages are:
Disadvantages:
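A schema is represented as a JSON object. Schemas are built from primitive types and complex types, and each complex type has its own attributes. By combining these types, users can define rich custom data structures.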
The primitive types are:
Type | Description |
---|---|
null | no value |
boolean | a binary value |
int | 32-bit signed integer |
long | 64-bit signed integer |
float | single precision (32-bit) IEEE 754 floating-point number |
double | double precision (64-bit) IEEE 754 floating-point number |
bytes | sequence of 8-bit unsigned bytes |
string | unicode character sequence |
Complex types: Avro defines six complex types, each with its own attributes. The six complex types and their binary encodings are:
Type | Encoding |
---|---|
Records | encoded as the concatenation of the encodings of its fields, in the order they are declared |
Enums | encoded as an int representing the zero-based position of the symbol in the schema |
Arrays | encoded as a series of blocks. A block with count 0 indicates the end of the array. block: {long count, items} |
Maps | encoded as a series of blocks. A block with count 0 indicates the end of the map. block: {long count, key/value pairs} |
Unions | encoded by first writing a long value indicating the zero-based position within the union of the schema of its value. The value is then encoded per the indicated schema within the union. |
Fixed | encoded using the number of bytes declared in the schema |
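The block-based rows of the table can be sketched in a few lines. The following is an illustrative encoder for enums, arrays, and maps; `encode_long`, `encode_enum`, `encode_array`, and `encode_map` are hypothetical names chosen for this example, and the sketch always writes a single block followed by the zero-count terminator, which is one valid layout among several the format allows.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag plus base-128 varint, the encoding Avro uses for int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_enum(symbols, symbol) -> bytes:
    """An enum value is the zero-based position of its symbol."""
    return encode_long(symbols.index(symbol))


def encode_array(items, encode_item) -> bytes:
    """One block {count, items...}, then a block with count 0 as terminator."""
    out = b""
    if items:
        out += encode_long(len(items))
        for item in items:
            out += encode_item(item)
    return out + encode_long(0)


def encode_map(mapping, encode_value) -> bytes:
    """Like an array, but each entry is a string key followed by its value."""
    out = b""
    if mapping:
        out += encode_long(len(mapping))
        for key, value in mapping.items():
            out += encode_string(key) + encode_value(value)
    return out + encode_long(0)


print(encode_array([3, 27], encode_long).hex())            # 04063600
print(encode_enum(["A", "B", "C", "D"], "D").hex())        # 06
print(encode_map({"a": 1}, encode_long).hex())             # 0202610200
```

An empty array or map is just the single terminator byte 00, which is why the sketch skips the counted block when there is nothing to write.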
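A Record type takes the following attributes:
name (required): the name of the record
namespace (optional): the namespace, comparable to a package name in Java
doc (optional): documentation for this record
aliases (optional): alternate names for this record
fields (required): a list of fields; each field requires at least a name and a type, and may also carry attributes such as doc and default
For example, here is the Record definition in employee.avsc: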
{
  "type": "record",
  "namespace": "com.aaa",
  "name": "Employee",
  "doc": "Employee avro schema",
  "fields": [
    { "name": "id", "type": "string", "doc": "employee id"},
    { "name": "first_name", "type": "string", "default": "", "doc": "employee first name"},
    { "name": "last_name", "type": "string", "default": ""},
    { "name": "age", "type": "int"}
  ]
}
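Because an .avsc file is plain JSON, the required attributes of a record can be checked with the standard library alone. The sketch below is a hypothetical helper, `check_record`, that covers only the required attributes (name, fields, and each field's name and type); it is not a full schema validator and is not taken from any Avro library. The embedded schema is a trimmed copy of the Employee record used for illustration.

```python
import json

# A trimmed Employee schema, embedded as a string for the example.
EMPLOYEE_AVSC = """
{
  "type": "record",
  "namespace": "com.aaa",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "first_name", "type": "string", "default": ""},
    {"name": "last_name", "type": "string", "default": ""},
    {"name": "age", "type": "int"}
  ]
}
"""


def check_record(schema: dict) -> list:
    """Return a list of problems with the required record attributes."""
    problems = []
    if schema.get("type") != "record":
        problems.append("type must be 'record'")
    if not isinstance(schema.get("name"), str):
        problems.append("record needs a string 'name'")
    fields = schema.get("fields")
    if not isinstance(fields, list):
        problems.append("record needs a 'fields' list")
        return problems
    for i, field in enumerate(fields):
        for key in ("name", "type"):
            if key not in field:
                problems.append(f"field #{i} is missing '{key}'")
    return problems


schema = json.loads(EMPLOYEE_AVSC)  # raises json.JSONDecodeError on malformed JSON
print(check_record(schema))                                   # []
print(check_record({"type": "record", "name": "Broken"}))     # ["record needs a 'fields' list"]
print(schema["namespace"] + "." + schema["name"])             # com.aaa.Employee
```

Parsing with `json.loads` first is useful in its own right: a missing or trailing comma in the schema file is rejected before any Avro tooling sees it.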
A logical type adds an extra attribute, logicalType, to a primitive or complex type to describe how its values should be interpreted.
For example:
{
"type": "int",
"logicalType": "date"
}
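For the date logical type, the underlying int holds the number of days since the Unix epoch (1 January 1970). A minimal sketch of the conversion in both directions, using hypothetical helper names `date_to_avro_int` and `avro_int_to_date`:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_to_avro_int(d: date) -> int:
    """Days since the Unix epoch, the value stored for logicalType 'date'."""
    return (d - EPOCH).days


def avro_int_to_date(days: int) -> date:
    """Inverse conversion: stored int back to a calendar date."""
    return EPOCH + timedelta(days=days)


print(date_to_avro_int(date(2020, 1, 1)))   # 18262
print(avro_int_to_date(18262))              # 2020-01-01
```

The stored value is still an ordinary int, so a reader that does not recognise the logicalType attribute can fall back to treating it as a plain number.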
Reference: https://avro.apache.org/docs/current/spec.html#Logical+Types