R - Extract two parts of a string


Assume I have the following string (a filename):

a <- "x/zheb100/tkn_var29380_timely_p1.txt" 

which consists of several parts (part p1 is shown here),

or this one:

b <- "x/zheb100/zhn_var29380_timely.txt" 

which consists of only one part (so there is no need for a p label).

How can I extract the identifier, i.e. the 3 letters before varxxxxx (tkn in case 1, zhn in case 2), plus the part identifier, if available?

So the result should be:

case 1: tkn_p1
case 2: zhn

I know how to extract the first identifier, but I cannot handle the second one at the same time.

My approach so far:

sub(".*(.{3})_var29380_timely(.{3}).*","\\1\\2", a)
sub(".*(.{3})_var29380_timely(.{3}).*","\\1\\2", b)

But this incorrectly adds .tx in the second case.

You are not using anchors, and you are matching any 3 characters right after timely without checking what those characters are (. matches any character).

I suggest:
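To make the problem concrete, here is a small sketch that runs the original pattern on both inputs. The variable names `pat_orig`, `res_a` and `res_b` are introduced here for illustration only.

```r
a <- "x/zheb100/tkn_var29380_timely_p1.txt"
b <- "x/zheb100/zhn_var29380_timely.txt"

# The pattern from the question: group 2 is *any* 3 characters after "timely"
pat_orig <- ".*(.{3})_var29380_timely(.{3}).*"

res_a <- sub(pat_orig, "\\1\\2", a)  # group 2 captures "_p1"
res_b <- sub(pat_orig, "\\1\\2", b)  # group 2 captures ".tx" taken from ".txt"

print(res_a)  # "tkn_p1"
print(res_b)  # "zhn.tx"
```

In the second case there is no part label, so the unconstrained `.{3}` simply takes the first three characters of the file extension.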

sub("^.*/([a-z]{3})_var\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a) 

Details:

  • ^ - start of string
  • .*/ - any part of the string up to and including the last /
  • ([a-z]{3}) - 3 ASCII lowercase letters captured into group 1
  • _var\\d+_timely - _var + 1 or more digits + _timely
  • (_[^_.]+)? - an optional group 2 capturing _ + 1 or more chars other than _ and .
  • \\. - a dot
  • [^.]* - 0 or more chars other than .
  • $ - end of string.

The replacement pattern contains 2 backreferences to the two capturing groups; they insert the captured contents into the resulting string.

R demo:
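As a rough illustration of what each backreference contributes, the sketch below applies the same pattern with `\\1` and `\\2` separately. The names `pat`, `g1_a`, `g2_a`, `g1_b` and `g2_b` are hypothetical and not part of the original answer.

```r
a <- "x/zheb100/tkn_var29380_timely_p1.txt"
b <- "x/zheb100/zhn_var29380_timely.txt"
pat <- "^.*/([a-z]{3})_var\\d+_timely(_[^_.]+)?\\.[^.]*$"

# Group 1: the 3-letter identifier
g1_a <- sub(pat, "\\1", a)  # "tkn"
g1_b <- sub(pat, "\\1", b)  # "zhn"

# Group 2: the optional part label, including its leading underscore
g2_a <- sub(pat, "\\2", a)  # "_p1"
g2_b <- sub(pat, "\\2", b)  # "" because the optional group did not match
```

Because group 2 is optional, its backreference expands to an empty string when no part label is present, which is why `"\\1\\2"` yields just the identifier in the second case.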

a <- "x/zheb100/tkn_var29380_timely_p1.txt"
a2 <- sub("^.*/([a-z]{3})_var\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)
a2
[1] "tkn_p1"

b <- "x/zheb100/zhn_var29380_timely.txt"
b2 <- sub("^.*/([a-z]{3})_var\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", b)
b2
[1] "zhn"
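One caveat worth keeping in mind: `sub()` returns its input unchanged when the pattern does not match, so a filename that does not follow the naming scheme would pass through silently. Below is a minimal sketch of a vectorized wrapper that returns `NA` in that case. The function name `extract_id` and the third filename are assumptions made for this example.

```r
# Hypothetical helper: returns the identifier (plus part label, if any),
# or NA when the filename does not follow the expected scheme.
extract_id <- function(x) {
  pat <- "^.*/([a-z]{3})_var\\d+_timely(_[^_.]+)?\\.[^.]*$"
  ifelse(grepl(pat, x), sub(pat, "\\1\\2", x), NA_character_)
}

files <- c(
  "x/zheb100/tkn_var29380_timely_p1.txt",
  "x/zheb100/zhn_var29380_timely.txt",
  "x/zheb100/readme.txt"  # made-up name that does not match the scheme
)

ids <- extract_id(files)
print(ids)  # "tkn_p1" "zhn" NA
```

Checking with `grepl()` first makes non-matching inputs explicit instead of returning the full path as if it were an identifier.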
