Scala: How to create a custom org.apache.spark.sql.types.StructType schema object from a JSON file programmatically
I have to create a custom org.apache.spark.sql.types.StructType schema object from the information in a JSON file. The JSON file can be anything; I have parameterized its path within a properties file.

This is what the properties file looks like:
```properties
// Path to the schema of the output file (by default, the schema of the target Parquet is inferred).
// If present, the schema must be in JSON format, applicable to a DataFrame (see StructType.fromJson).
schema.parquet=/users/xxxx/desktop/generated_schema.json
writing.mode=overwrite
separator=;
header=false
```
The file generated_schema.json looks like this:
{"type" : "struct","fields" : [ {"name" : "codigo","type" : "string","nullable" : true}, {"name":"otro", "type":"string", "nullable":true}, {"name":"vacio", "type":"string", "nullable":true},{"name":"final","type":"string","nullable":true} ]}
This is how I thought I could solve it:
```scala
import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.spark.sql.types.{DataType, StructType}

val path: Path = new Path(mra_schema_parquet)
val fileSystem = path.getFileSystem(sc.hadoopConfiguration)
val inputStream: FSDataInputStream = fileSystem.open(path)
val schema_json = Stream.cons(inputStream.readLine(), Stream.continually(inputStream.readLine))

System.out.println("schema_json looks like " + schema_json.head)

val mySchemaStructType: DataType = DataType.fromJson(schema_json.head)
/* After this line, mySchemaStructType has 4 StructField objects inside it,
   the same ones that appear in schema_json. */
logger.info(mySchemaStructType)

val myStructType = new StructType()
myStructType.add("mySchemaStructType", mySchemaStructType)
/* After this line, myStructType has 0 StructFields! There must be a bug here:
   myStructType should have the 4 StructFields that represent the loaded JSON
   schema. How can I construct the necessary StructType object? */

myDF = loadCSV(sqlContext, path_input_csv, separator, myStructType, header)
System.out.println("myDF.schema.json looks like " + myDF.schema.json)
inputStream.close()

df.write
  .format("com.databricks.spark.csv")
  .option("header", header)
  .option("delimiter", delimiter)
  .option("nullValue", "")
  .option("treatEmptyValuesAsNulls", "true")
  .mode(saveMode)
  .parquet(pathParquet)
```
When the code runs the last line, .parquet(pathParquet), this exception is thrown:
**parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root { }**
The output of the code is this:
```
16/11/11 13:57:04 INFO AnotherCSVtoParquet$: Job started using properties file: /users/aisidoro/desktop/mra-csv-converter/parametrizacion.properties
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: path_input_csv /users/aisidoro/desktop/mra-csv-converter/cds_glcs.csv
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: path_output_parquet /users/aisidoro/desktop/output900000
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: mra_schema_parquet /users/aisidoro/desktop/mra-csv-converter/generated_schema.json
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: writting_mode overwrite
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: separator ;
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: header false
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: Attention! Applying mra_schema_parquet /users/aisidoro/desktop/mra-csv-converter/generated_schema.json
schema_json looks like {"type" : "struct","fields" : [ {"name" : "codigo","type" : "string","nullable" : true}, {"name":"otro", "type":"string", "nullable":true}, {"name":"vacio", "type":"string", "nullable":true},{"name":"final","type":"string","nullable":true} ]}
16/11/11 13:57:12 INFO AnotherCSVtoParquet$: StructType(StructField(codigo,StringType,true), StructField(otro,StringType,true), StructField(vacio,StringType,true), StructField(final,StringType,true))
16/11/11 13:57:13 INFO AnotherCSVtoParquet$: loadCSV. header false, inferSchema false, pathCSV /users/aisidoro/desktop/mra-csv-converter/cds_glcs.csv, separator ;
myDF.schema.json looks like {"type":"struct","fields":[]}
```
The schema_json object and the myDF.schema.json object should have the same content, shouldn't they? That did not happen: the loaded schema has 4 fields, but the DataFrame's schema is empty ({"type":"struct","fields":[]}), and I think that empty schema is what triggers the error.
Finally, the job crashes with this exception:
**parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root { }**
The fact is that if I do not provide the JSON schema file, the job performs fine, just with the inferred schema...

Can anyone help me? I want to create Parquet files starting from a CSV file and a JSON schema file.

Thank you.
The dependencies are:
```xml
<spark.version>1.5.0-cdh5.5.2</spark.version>
<databricks.version>1.5.0</databricks.version>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>${spark.version}</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-csv_2.10</artifactId>
  <version>${databricks.version}</version>
</dependency>
```
UPDATE
I can see there is an open issue about this.

Since you said you want a custom schema, you can do it like this:
```scala
import org.apache.spark.sql.types.{StringType, StructType}

val schema = (new StructType)
  .add("field1", StringType)
  .add("field2", StringType)

sqlContext.read.schema(schema).json("/json/file/path").show
```
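Note that the chained calls above work because StructType is immutable: `add` returns a new StructType instead of modifying the receiver. That is also why the question's `myStructType.add("mySchemaStructType", mySchemaStructType)` left `myStructType` empty: the returned value was discarded (and even if it had been kept, it would have produced a single nested field rather than four top-level ones). A minimal sketch of the difference:

```scala
import org.apache.spark.sql.types.{StringType, StructType}

val empty = new StructType()
empty.add("codigo", StringType)                  // returns a NEW StructType; result discarded
println(empty.fields.length)                     // 0 -- `empty` is unchanged

val populated = empty.add("codigo", StringType)  // keep the returned value instead
println(populated.fields.length)                 // 1
```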
You can also create a nested JSON schema, as shown below.

For example:
{ "field1": { "field2": { "field3": "create", "field4": 1452121277 } } } val schema = (new structtype) .add("field1", (new structtype) .add("field2", (new structtype) .add("field3", stringtype) .add("field4", longtype) ) )