regex - Rearranging the structure of many txt files and then merging them in one data frame


I would appreciate any help a lot!

I have ~4.5k txt files that look like this:

    simple statistics using mspa parameters: 8_3_1_1 on input file: 20130815 104359  875  000000 0528 0548_result.tif
      mspa-class [color]:  foreground/data pixels [%]  frequency
    ============================================================
         core(s) [green]:               --                   0
         core(m) [green]:      48.43/13.45                   1
         core(l) [green]:               --                   0
           islet [brown]:       3.70/ 1.03                  20
      perforation [blue]:       0.00/ 0.00                   0
            edge [black]:      30.93/ 8.59                  11
           loop [yellow]:       9.66/ 2.68                   6
            bridge [red]:       0.00/ 0.00                   0
         branch [orange]:       7.28/ 2.02                  40
       background [grey]:       --- /72.22                  11
         missing [white]:            0.00                    0

I want to read the txt files from a directory into R and perform a rearranging task on them before merging them together.

The values in the txt files can change, so in places where there is 0.00 now, there could be a relevant number in other files (so I need those). For fields where there is -- now, it would be good if the script could test whether there is -- or a number. If there is --, it should turn it into NA. On the other hand, real 0.00 values are of value and I need them. There is only one value for the missing [white] column (or row, here); that value should be copied into both columns, foreground % and data pixels %.
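The rules above for a single "foreground/data pixels [%]" field can be sketched as a small helper. This is only an illustration: the function name `split_pct` and the output names `foreground` and `data_pixels` are made up here, and it assumes that a run of dashes always means "no value".

```r
# Hypothetical helper: split one "foreground/data pixels [%]" field into two numbers.
split_pct <- function(x) {
  x <- trimws(x)
  # A field without "/" (the "missing" row) holds one value that is used for both columns
  parts <- if (grepl("/", x, fixed = TRUE)) {
    strsplit(x, "/", fixed = TRUE)[[1]]
  } else {
    c(x, x)
  }
  parts <- trimws(parts)
  # Runs of dashes ("--", "---") mean "no value"; a real 0.00 is kept as 0
  parts[grepl("^-+$", parts)] <- NA
  setNames(as.numeric(parts), c("foreground", "data_pixels"))
}

split_pct("48.43/13.45")  # two numbers
split_pct("--")           # both NA
split_pct("--- /72.22")   # NA for foreground, 72.22 for data pixels
split_pct("0.00")         # the single value copied to both columns
```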

The general rearranging I need is to make all the data available as columns, with only one row per txt file. For every row of data in the txt file here, there should be three columns in the output file (foreground %, data pixels %, and frequency for every color). The name of the row should be the image name mentioned at the beginning of the file, here: 20130815 104359 875 000000 0528 0548

The rest can be omitted.

The output should look something like this:
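Pulling the image name out of the header line could look roughly like the sketch below. It assumes the name always follows "file:" and ends in "_result.tif" or ".tif", as in the example file; the variable names are only illustrative.

```r
# Header line copied from the example file above
header <- "simple statistics using mspa parameters: 8_3_1_1 on input file: 20130815 104359  875  000000 0528 0548_result.tif"

# Keep what follows "file:" and drop an optional "_result" suffix plus the ".tif" extension
img_name <- sub(".*file:\\s+(.*?)(_result)?\\.tif\\s*$", "\\1", header, perl = TRUE)
img_name
```

The lazy `(.*?)` together with the optional `(_result)?` group is what keeps the suffix out of the captured name.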

http://i.stack.imgur.com/wxcgg.png

I am working on this simultaneously but am not sure which direction to take. Any help is more than welcome!

Best, Moritz

This puts it in the format you want, I think, but your example doesn't match the image, so I can't be sure:

    (lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
    # [1] "/users/rawr/desktop/image001.txt" "/users/rawr/desktop/image002.txt"
    # [3] "/users/rawr/desktop/image003.txt"

    lapply(lf, function(xx) {
      rl <- readLines(con <- file(xx), warn = FALSE)
      close(con)
      ## assuming the file name comes after "file: " and runs until the end of the string
      ## and ends in .tif
      img_name <- gsub('.*file:\\s+(.*)\\.tif', '\\1', rl[1])
      ## remove every line up to and including the ===== line
      rl <- rl[-(1:grep('==', rl))]
      ## remove leading whitespace
      rl <- gsub('^\\s+', '', rl)

      ## split the remaining lines on larger chunks of whitespace
      mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))
      ## more cleaning, setting attributes, etc
      mat[mat == '--'] <- NA
      mat <- cbind(image_name = img_name, `colnames<-`(t(mat[, 2]), mat[, 1]))
      as.data.frame(mat)
    })

I created three files using your example and made each one a little different to show how this would work on a directory with several files:

    # [[1]]
    #                                        image_name core(s) [green]: core(m) [green]: core(l) [green]: islet [brown]:   perforation [blue]: edge [black]: loop [yellow]: bridge [red]: branch [orange]: background [grey]: missing [white]:
    #   1 20130815 104359  875  000000 0528 0548_result             <NA>      48.43/13.45             <NA>     3.70/ 1.03          0.00/ 0.00     30.93/ 8.59     9.66/ 2.68    0.00/ 0.00       7.28/ 2.02         --- /72.22             0.00
    #
    # [[2]]
    #                                        image_name core(s) [green]: core(m) [green]: core(l) [green]: islet [brown]:   perforation [blue]: edge [black]: loop [yellow]: bridge [red]: branch [orange]: background [grey]: missing [white]:
    #   1 20139341 104359  875  000000 0528 0548_result               23      48.43/13.45               23           <NA>          0.00/ 0.00     30.93/ 8.59     9.66/ 2.68    0.00/ 0.00       7.28/ 2.02         --- /72.22             0.00
    #
    # [[3]]
    #                                        image_name core(s) [green]: core(m) [green]: core(l) [green]: islet [brown]: perforation [blue]:  edge [black]: loop [yellow]: bridge [red]: branch [orange]: background [grey]: missing [white]:
    #   1 20132343 104359  875  000000 0528 0548_result             <NA>             <NA>             <NA>           <NA>                <NA>    30.93/ 8.59     9.66/ 2.68    0.00/ 0.00       7.28/ 2.02               <NA>             0.00

Edit

I made a few changes to extract all the info:

    (lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
    # [1] "/users/rawr/desktop/image001.txt" "/users/rawr/desktop/image002.txt"
    # [3] "/users/rawr/desktop/image003.txt"

    res <- lapply(lf, function(xx) {
      rl <- readLines(con <- file(xx), warn = FALSE)
      close(con)
      img_name <- gsub('.*file:\\s+(.*)\\.tif', '\\1', rl[1])
      rl <- rl[-(1:grep('==', rl))]
      rl <- gsub('^\\s+', '', rl)
      mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))
      dat <- as.data.frame(mat, stringsAsFactors = FALSE)
      ## split the "foreground/data pixels" field (column V2) into two columns
      tmp <- `colnames<-`(do.call('rbind', strsplit(dat$V2, '[-\\/\\s]+', perl = TRUE)),
                          c('foreground','data pixels'))
      dat <- cbind(dat[, -2], tmp, image_name = img_name)
      dat[] <- lapply(dat, as.character)
      ## fields that held only dashes are empty strings by now
      dat[dat == ''] <- NA
      names(dat)[1:2] <- c('mspa-class','frequency')

      ## one wide row per file, then tidy the column names
      zzz <- reshape(dat, direction = 'wide', idvar = 'image_name', timevar = 'mspa-class')
      names(zzz)[-1] <- gsub('(.*)\\.(.*) (?:.*)', '\\2_\\1', names(zzz)[-1], perl = TRUE)
      zzz
    })

Here is the result (I transformed it into a long matrix so it is easier to read. The real results are in a wide data frame, one for each file):

    `rownames<-`(matrix(res[[1]]), names(res[[1]]))
    #                         [,1]
    # image_name              "20130815 104359  875  000000 0528 0548_result"
    # core(s)_frequency       "0"
    # core(s)_foreground      "NA"
    # core(s)_data pixels     "NA"
    # core(m)_frequency       "1"
    # core(m)_foreground      "48.43"
    # core(m)_data pixels     "13.45"
    # core(l)_frequency       "0"
    # core(l)_foreground      "NA"
    # core(l)_data pixels     "NA"
    # islet_frequency         "20"
    # islet_foreground        "3.70"
    # islet_data pixels       "1.03"
    # perforation_frequency   "0"
    # perforation_foreground  "0.00"
    # perforation_data pixels "0.00"
    # edge_frequency          "11"
    # edge_foreground         "30.93"
    # edge_data pixels        "8.59"
    # loop_frequency          "6"
    # loop_foreground         "9.66"
    # loop_data pixels        "2.68"
    # bridge_frequency        "0"
    # bridge_foreground       "0.00"
    # bridge_data pixels      "0.00"
    # branch_frequency        "40"
    # branch_foreground       "7.28"
    # branch_data pixels      "2.02"
    # background_frequency    "11"
    # background_foreground   "NA"
    # background_data pixels  "72.22"
    # missing_frequency       "0"
    # missing_foreground      "0.00"
    # missing_data pixels     "0.00"

With your sample data:

    lf <- list.files('~/desktop/data', pattern = '.txt', full.names = TRUE)
    ## re-run the lapply() call above on this lf to rebuild res

    `rownames<-`(matrix(res[[1]]), names(res[[1]]))

    #                         [,1]
    # image_name              "20130815 103704  780  000000 0372 0616"
    # core(s)_frequency       "0"
    # core(s)_foreground      "NA"
    # core(s)_data pixels     "NA"
    # core(m)_frequency       "1"
    # core(m)_foreground      "54.18"
    # core(m)_data pixels     "15.16"
    # core(l)_frequency       "0"
    # core(l)_foreground      "NA"
    # core(l)_data pixels     "NA"
    # islet_frequency         "11"
    # islet_foreground        "3.14"
    # islet_data pixels       "0.88"
    # perforation_frequency   "0"
    # perforation_foreground  "0.00"
    # perforation_data pixels "0.00"
    # edge_frequency          "1"
    # edge_foreground         "34.82"
    # edge_data pixels        "9.75"
    # loop_frequency          "1"
    # loop_foreground         "4.96"
    # loop_data pixels        "1.39"
    # bridge_frequency        "0"
    # bridge_foreground       "0.00"
    # bridge_data pixels      "0.00"
    # branch_frequency        "20"
    # branch_foreground       "2.89"
    # branch_data pixels      "0.81"
    # background_frequency    "1"
    # background_foreground   "NA"
    # background_data pixels  "72.01"
    # missing_frequency       "0"
    # missing_foreground      "0.00"
    # missing_data pixels     "0.00"
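
The question also asks for everything in one data frame, which the code above stops short of. A minimal sketch of that last step follows, assuming every file yields the same set of columns. `res_demo` is a toy stand-in for `res` with only two of the columns, and the names `img_a` and `img_b` are invented for the illustration.

```r
# Toy stand-ins for two elements of `res` (one-row, all-character data frames)
res_demo <- list(
  data.frame(image_name = "img_a", `core(m)_foreground` = "48.43",
             islet_frequency = "20",
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(image_name = "img_b", `core(m)_foreground` = NA_character_,
             islet_frequency = "11",
             check.names = FALSE, stringsAsFactors = FALSE)
)

# Stack the one-row data frames; rbind() matches the columns by name
merged <- do.call(rbind, res_demo)

# Everything except image_name was read as text, so convert it to numeric
merged[-1] <- lapply(merged[-1], as.numeric)

# Use the image name as the row name, as the question asks
rownames(merged) <- merged$image_name
merged$image_name <- NULL
merged
```

With the real data, `do.call(rbind, res)` would take the place of `do.call(rbind, res_demo)`. If some files turn out to have different columns, `rbind()` will stop with an error rather than silently misalign them.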
