Introduction to Avro Schema

Introduction to Avro

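Avro uses JSON to define data types and communication protocols, and serializes data in a compact binary format. Other projects in the Hadoop ecosystem, such as HBase and Hive, also use it to transfer data between client and server. An example Avro schema definition: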

{"namespace": "com.abc.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
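Because the schema is ordinary JSON, it can be inspected with any JSON parser. The following is a minimal sketch, not part of any Avro library: it loads the User schema with Python's standard json module and reports each field's type and whether it is nullable. The helper name describe_fields is hypothetical and introduced only for illustration.

```python
import json

# The User schema shown above, kept as a plain JSON string.
USER_SCHEMA = """
{"namespace": "com.abc.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
"""


def describe_fields(schema_json):
    """Return (name, type, nullable) tuples for a record schema (hypothetical helper)."""
    schema = json.loads(schema_json)
    rows = []
    for field in schema["fields"]:
        ftype = field["type"]
        # A JSON array denotes a union; the field is nullable if "null" is one branch.
        nullable = isinstance(ftype, list) and "null" in ftype
        rows.append((field["name"], ftype, nullable))
    return rows


if __name__ == "__main__":
    for name, ftype, nullable in describe_fields(USER_SCHEMA):
        print(name, ftype, "nullable" if nullable else "required")
```

This only treats the schema as data; validating or encoding records against it would normally be left to an Avro implementation.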

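Avro's advantages:

  1. Binary messages, so performance and efficiency are good
  2. Schemas are described in JSON
  3. A schema can change over time (schema evolution)

Disadvantages:

  1. Mainstream languages are supported, but support for other languages is lacking
  2. Without Avro tooling the data cannot be printed directly, because it is serialized in a compact binary form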

Primitive types

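A schema is represented in JSON. Schemas are built from primitive types and complex types, and each complex type has its own attributes. By combining these types, users can define rich custom data structures.

The primitive types are: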

Type      Description
null      no value
boolean   a binary value (true or false)
int       32-bit signed integer
long      64-bit signed integer
float     single precision (32-bit) IEEE 754 floating-point number
double    double precision (64-bit) IEEE 754 floating-point number
bytes     sequence of 8-bit unsigned bytes
string    Unicode character sequence
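To make the primitive types more concrete, here is a minimal sketch of how a few of them are written in Avro's binary encoding, based on the Avro specification: int and long use variable-length zig-zag coding, a string is a long byte count followed by UTF-8 data, a double is 8 little-endian bytes, and a boolean is a single byte. The function names are hypothetical and this is not a replacement for a real Avro library.

```python
import struct


def encode_long(n):
    """Encode an int/long with zig-zag plus variable-length coding."""
    # Zig-zag maps small negative and positive numbers to small unsigned values.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        # Emit the low 7 bits and set the high bit to signal "more bytes follow".
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_double(x):
    """A double is written as 8 bytes, little-endian."""
    return struct.pack("<d", x)


def encode_boolean(b):
    """A boolean is a single byte: 1 for true, 0 for false."""
    return b"\x01" if b else b"\x00"


if __name__ == "__main__":
    for value in (0, -1, 1, -64, 64):
        print(value, encode_long(value).hex())
    print(encode_string("foo").hex())
```

Small integers take a single byte, which is one reason the binary form stays compact.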

Complex types

Avro defines six complex types, each with its own attributes. The six complex types and their binary encodings are:

Type      Encoding
Records   encoded as the concatenation of the encodings of its fields
Enums     encoded as an int representing the zero-based position of the symbol in the schema
Arrays    encoded as a series of blocks; each block is a long count followed by that many items. A block with count 0 indicates the end of the array.
Maps      encoded as a series of blocks; each block is a long count followed by that many key/value pairs. A block with count 0 indicates the end of the map.
Unions    encoded by first writing a long value indicating the zero-based position within the union of the schema of its value. The value is then encoded per the indicated schema within the union.
Fixed     encoded using the number of bytes declared in the schema
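The encoding rules above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a full encoder: the function names are hypothetical, the array is always written as a single block, and the record layout is hard-coded for the User schema shown earlier (unions ["int", "null"] and ["string", "null"]). Ints and longs use the zig-zag variable-length coding described in the Avro specification.

```python
def encode_long(n):
    """Zig-zag plus variable-length coding used for int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_union(branch_index, encoded_value):
    """Union: zero-based branch position, then the value encoded per that branch."""
    return encode_long(branch_index) + encoded_value


def encode_array(items, encode_item):
    """Array: one block (count + items) if non-empty, then a block with count 0."""
    out = b""
    if items:
        out += encode_long(len(items))
        out += b"".join(encode_item(item) for item in items)
    return out + encode_long(0)


def encode_user(name, favorite_number, favorite_color):
    """Record: concatenation of the field encodings, in schema order."""
    out = encode_string(name)
    # Branch 0 is the value type, branch 1 is "null" (which encodes to zero bytes).
    if favorite_number is None:
        out += encode_union(1, b"")
    else:
        out += encode_union(0, encode_long(favorite_number))
    if favorite_color is None:
        out += encode_union(1, b"")
    else:
        out += encode_union(0, encode_string(favorite_color))
    return out


if __name__ == "__main__":
    print(encode_user("Al", 7, None).hex())
    print(encode_array([1, 2, 3], encode_long).hex())
```

Note that the record itself adds no framing bytes: without the schema, a reader cannot tell where one field ends and the next begins.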

Record type

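A record supports the following attributes:

  • name (required): the name of the record

  • namespace (optional): the namespace, comparable to a package name in Java

  • doc (optional): documentation for the record

  • aliases (optional): alternative names for the record

  • fields (required): a list of fields, each with the following attributes:

    • name (required): the name of the field
    • doc (optional): documentation for the field
    • type (required): the schema (type) of the field
    • default (optional): the default value of the field
    • order (optional): the sort order; one of ascending (the default), descending, or ignore
    • aliases (optional): alternative names for the field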

For example, here is the record definition in employee.avsc:

{
        "type": "record",
        "namespace": "com.aaa",
        "name": "Employee",
        "doc": "Employee avro schema",
        "fields": [
          { "name": "id", "type": "string", "doc": "employee id"},
          { "name": "first_name", "type": "string", "default": "", "doc": "employee first name"},
          { "name": "last_name", "type": "string", "default": ""},
          { "name": "age", "type": "int"}
        ]
}
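According to the Avro specification, a field's default is only used when reading data that lacks the field, which is what makes schema evolution possible. The sketch below mimics that behaviour with plain Python dicts rather than real Avro decoding; apply_defaults is a hypothetical helper, and the schema is a syntactically valid copy of the Employee record embedded for self-containment.

```python
import json

# A valid JSON copy of the Employee record schema, embedded for illustration.
EMPLOYEE_SCHEMA = json.loads("""
{
  "type": "record",
  "namespace": "com.aaa",
  "name": "Employee",
  "doc": "Employee avro schema",
  "fields": [
    {"name": "id", "type": "string", "doc": "employee id"},
    {"name": "first_name", "type": "string", "default": "", "doc": "employee first name"},
    {"name": "last_name", "type": "string", "default": ""},
    {"name": "age", "type": "int"}
  ]
}
""")


def apply_defaults(schema, datum):
    """Fill fields missing from datum with the schema's defaults (hypothetical helper)."""
    result = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in datum:
            result[name] = datum[name]
        elif "default" in field:
            # The field is absent from the data, so fall back to the declared default.
            result[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return result


if __name__ == "__main__":
    print(apply_defaults(EMPLOYEE_SCHEMA, {"id": "e1", "age": 30}))
```

Fields without a default, such as id and age here, must be present in the data; adding a new field to a schema is therefore safest when the field declares a default.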

Logical types

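A logical type annotates a primitive or complex type with an extra attribute, logicalType, that supplements the meaning of the underlying type.

For example: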

{
  "type": "int",
  "logicalType": "date"
}
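In the Avro specification, the date logical type annotates an int that stores the number of days since the Unix epoch, 1 January 1970. A minimal sketch of that conversion using only Python's standard library follows; the function names are hypothetical and no Avro library is involved.

```python
from datetime import date, timedelta

# The Unix epoch, the reference point for the date logical type.
EPOCH = date(1970, 1, 1)


def date_to_avro_int(d):
    """Convert a date to the int stored on the wire: days since the epoch."""
    return (d - EPOCH).days


def avro_int_to_date(days):
    """Convert the stored int back to a date."""
    return EPOCH + timedelta(days=days)


if __name__ == "__main__":
    stored = date_to_avro_int(date(2000, 1, 1))
    print(stored, avro_int_to_date(stored))
```

A reader that does not understand the logicalType attribute can still read the value as a plain int, since the underlying encoding is unchanged.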

Reference: https://avro.apache.org/docs/current/spec.html#Logical+Types