scala - How to create a custom org.apache.spark.sql.types.StructType schema object from a JSON file programmatically


I have to create a custom org.apache.spark.sql.types.StructType schema object with the info from a JSON file. The JSON file can be anything, so I have parametrized it within a property file.

This is how the property file looks:

    // Path to the schema of the output file (by default the schema is inferred from the destination Parquet). If it exists, the schema will be in JSON format, applicable to a DataFrame (see StructType.fromJson)
    schema.parquet=/users/xxxx/desktop/generated_schema.json
    writing.mode=overwrite
    separator=;
    header=false

The file generated_schema.json looks like this:

    {
      "type": "struct",
      "fields": [
        {"name": "codigo", "type": "string", "nullable": true},
        {"name": "otro", "type": "string", "nullable": true},
        {"name": "vacio", "type": "string", "nullable": true},
        {"name": "final", "type": "string", "nullable": true}
      ]
    }

So, this is how I thought I could solve it:

    val path: Path = new Path(mra_schema_parquet)
    val filesystem = path.getFileSystem(sc.hadoopConfiguration)
    val inputstream: FSDataInputStream = filesystem.open(path)
    val schema_json = Stream.cons(inputstream.readLine(), Stream.continually(inputstream.readLine))

    System.out.println("schema_json looks " + schema_json.head)

    val myschemastructtype: DataType = DataType.fromJson(schema_json.head)
    /* After this line, myschemastructtype has 4 StructField objects inside it, the same ones that appear in schema_json */
    logger.info(myschemastructtype)

    val mystructtype = new StructType()
    mystructtype.add("myschemastructtype", myschemastructtype)
    /*
      After this line, mystructtype has 0 StructFields! Here must be the bug:
      mystructtype should have the 4 StructFields that represent the loaded schema JSON!
      This must be the error! How can I construct the necessary StructType object?
    */

    mydf = loadcsv(sqlContext, path_input_csv, separator, mystructtype, header)
    System.out.println("mydf.schema.json looks " + mydf.schema.json)
    inputstream.close()

    df.write
      .format("com.databricks.spark.csv")
      .option("header", header)
      .option("delimiter", delimiter)
      .option("nullValue", "")
      .option("treatEmptyValuesAsNulls", "true")
      .mode(savemode)
      .parquet(pathparquet)

When the code runs the last line, .parquet(pathparquet), this exception happens:

**parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root { }**

The output of the code is this:

    16/11/11 13:57:04 info anothercsvtoparquet$: job started using propertie file: /users/aisidoro/desktop/mra-csv-converter/parametrizacion.properties
    16/11/11 13:57:05 info anothercsvtoparquet$: path_input_csv /users/aisidoro/desktop/mra-csv-converter/cds_glcs.csv
    16/11/11 13:57:05 info anothercsvtoparquet$: path_output_parquet  /users/aisidoro/desktop/output900000
    16/11/11 13:57:05 info anothercsvtoparquet$: mra_schema_parquet /users/aisidoro/desktop/mra-csv-converter/generated_schema.json
    16/11/11 13:57:05 info anothercsvtoparquet$: writting_mode overwrite
    16/11/11 13:57:05 info anothercsvtoparquet$: separator ;
    16/11/11 13:57:05 info anothercsvtoparquet$: header false
    16/11/11 13:57:05 info anothercsvtoparquet$: attention! aplying mra_schema_parquet  /users/aisidoro/desktop/mra-csv-converter/generated_schema.json
    schema_json looks {"type" : "struct","fields" : [ {"name" : "codigo","type" : "string","nullable" : true}, {"name":"otro", "type":"string", "nullable":true}, {"name":"vacio", "type":"string", "nullable":true},{"name":"final","type":"string","nullable":true} ]}
    16/11/11 13:57:12 info anothercsvtoparquet$: structtype(structfield(codigo,stringtype,true), structfield(otro,stringtype,true), structfield(vacio,stringtype,true), structfield(final,stringtype,true))
    16/11/11 13:57:13 info anothercsvtoparquet$: loadcsv. header false, inferschema false pathcsv /users/aisidoro/desktop/mra-csv-converter/cds_glcs.csv separator ;
    mydf.schema.json looks {"type":"struct","fields":[]}

The schema_json object and the mydf.schema.json object should have the same content, shouldn't they? But that did not happen. I think this is what triggers the error.

Finally the job crashes with an exception:

**parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root { }**

The fact is that if I do not provide any JSON schema file, the job performs fine, but with this schema...

Can anybody help me? I just want to create Parquet files starting from a CSV file and a JSON schema file.

Thank you.

The dependencies are:

    <spark.version>1.5.0-cdh5.5.2</spark.version>
    <databricks.version>1.5.0</databricks.version>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>${spark.version}</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>${databricks.version}</version>
    </dependency>

Update

I can see there is an open issue:

https://github.com/databricks/spark-csv/issues/61

Since you said custom schema, you can do it like this:

    val schema = (new StructType).add("field1", StringType).add("field2", StringType)
    sqlContext.read.schema(schema).json("/json/file/path").show
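One detail of the chained calls above is worth spelling out: `StructType` is immutable, so `add` returns a new `StructType` and leaves the receiver untouched. The following is a minimal sketch of that behaviour, assuming it is pasted into a spark-shell (or compiled with spark-sql on the classpath); the names `emptySchema` and `chainedSchema` are made up for the illustration.

```scala
import org.apache.spark.sql.types.{StringType, StructType}

// Calling add() and ignoring the result leaves the original schema empty.
val emptySchema = new StructType()
emptySchema.add("codigo", StringType)   // the returned StructType is discarded
println(emptySchema.fields.length)      // prints 0

// Keeping the value returned by add() gives the populated schema.
val chainedSchema = new StructType()
  .add("codigo", StringType)
  .add("otro", StringType)
println(chainedSchema.fields.length)    // prints 2
```

This is the likely reason a schema built with a standalone `add` call still serializes as `{"type":"struct","fields":[]}`.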


You can create a nested JSON schema as shown below.

For example:

    {
      "field1": {
        "field2": {
          "field3": "create",
          "field4": 1452121277
        }
      }
    }

The matching schema:

    val schema = (new StructType)
      .add("field1", (new StructType)
        .add("field2", (new StructType)
          .add("field3", StringType)
          .add("field4", LongType)
        )
      )
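Coming back to the original question, a schema that is already stored as JSON does not need to be rebuilt field by field. `DataType.fromJson` returns a `DataType`, and when the JSON describes a struct that value is itself a `StructType`. Here is a minimal sketch, assuming spark-sql is on the classpath (for example in a spark-shell); the schema text is inlined rather than read from the file, and the names `schemaJson` and `loadedSchema` are chosen only for this example.

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Same content as generated_schema.json from the question, inlined for the sketch.
val schemaJson =
  """{"type":"struct","fields":[
    |  {"name":"codigo","type":"string","nullable":true},
    |  {"name":"otro","type":"string","nullable":true},
    |  {"name":"vacio","type":"string","nullable":true},
    |  {"name":"final","type":"string","nullable":true}
    |]}""".stripMargin

// Narrow the parsed DataType to StructType, failing loudly on anything else.
val loadedSchema: StructType = DataType.fromJson(schemaJson) match {
  case struct: StructType => struct
  case other =>
    throw new IllegalArgumentException(s"Expected a struct schema but got: $other")
}

println(loadedSchema.fieldNames.mkString(","))  // prints codigo,otro,vacio,final
```

The resulting `loadedSchema` could then be passed to `sqlContext.read.schema(...)` in the same way as the hand-built schemas above. When the schema file spans several lines, the whole file content should be handed to `DataType.fromJson`, not only its first line.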
