regex - Rearranging the structure of many txt files and then merging them in one data frame
I would appreciate your help a lot!
I have ~4.5k txt files that look like this:
    simple statistics using mspa parameters: 8_3_1_1 on input file: 20130815 104359 875 000000 0528 0548_result.tif
    mspa-class [color]:   foreground/data pixels [%]   frequency
    ============================================================
    core(s) [green]:               --                  0
    core(m) [green]:          48.43/13.45              1
    core(l) [green]:               --                  0
    islet [brown]:             3.70/ 1.03             20
    perforation [blue]:        0.00/ 0.00              0
    edge [black]:             30.93/ 8.59             11
    loop [yellow]:             9.66/ 2.68              6
    bridge [red]:              0.00/ 0.00              0
    branch [orange]:           7.28/ 2.02             40
    background [grey]:         --- /72.22             11
    missing [white]:              0.00                 0
I want to read the txt files from a directory into R and perform a rearranging task on them before merging them together.
The values in the txt files can change, so in places where there is 0.00 now, there could be a relevant number in other files (so I need those). For fields where there is -- now, it would be good if the script could test whether there is -- or a number. If there is --, it should turn them into NAs. On the other hand, real 0.00 values are of value and I need them. There is only one value for the missing [white] column (or row here); that value should be copied into both columns, foreground% and data pixels%.
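The rules above for a single value field can be sketched as a small helper. This is only an illustration under assumptions: `parse_pct` and the output names `foreground` and `data_pixels` are hypothetical and do not appear in the original post.

```r
# Hypothetical helper (not from the original post): split one
# "foreground/data pixels" field into two numeric values.
parse_pct <- function(x) {
  # Split on "/" and strip surrounding whitespace from each part
  parts <- trimws(strsplit(x, "/", fixed = TRUE)[[1]])
  # A lone value (the "missing [white]" row) is copied to both columns
  if (length(parts) == 1) parts <- rep(parts, 2)
  # Parts made up only of dashes ("--", "---") become NA; a real 0.00 is kept
  parts[grepl("^-+$", parts)] <- NA
  setNames(as.numeric(parts), c("foreground", "data_pixels"))
}

parse_pct("48.43/13.45")  # two numbers
parse_pct("--- /72.22")   # NA for foreground, a number for data pixels
parse_pct("--")           # NA in both columns
parse_pct("0.00")         # the single value copied to both columns
```

Testing for dashes explicitly, rather than relying on a failed numeric conversion, keeps genuine 0.00 values distinct from missing ones and avoids coercion warnings.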
The general rearranging I need is to make all the data available as columns, with one row per txt file. For every row of data in the txt file here, there should be three columns in the output file (foreground%, data pixels% and frequency for every color). The name of the row should be the image name mentioned at the beginning of the file, here: 20130815 104359 875 000000 0528 0548
The rest can be omitted.
The output should look like this: [image]
I am working on this simultaneously, but I am not sure which direction to take. Any help is more than welcome!
Best, Moritz
This puts it in the format you want, I think, but your example doesn't match the image, so I can't be sure:
    (lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
    # [1] "/users/rawr/desktop/image001.txt" "/users/rawr/desktop/image002.txt"
    # [3] "/users/rawr/desktop/image003.txt"

    lapply(lf, function(xx) {
      rl <- readLines(con <- file(xx), warn = FALSE)
      close(con)

      ## assuming the file name comes after "file: ", runs until the end of
      ## the string, and ends in .tif
      img_name <- gsub('.*file:\\s+(.*)\\.tif', '\\1', rl[1])

      ## remove every line up to and including the ===== line
      rl <- rl[-(1:grep('==', rl))]

      ## remove leading whitespace
      rl <- gsub('^\\s+', '', rl)

      ## split the remaining lines on larger chunks of whitespace
      mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))

      ## more cleaning, setting attributes, etc
      mat[mat == '--'] <- NA
      mat <- cbind(image_name = img_name, `colnames<-`(t(mat[, 2]), mat[, 1]))
      as.data.frame(mat)
    })
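The parsing steps can also be tried on in-memory lines, without any files on disk. This is a minimal sketch: the sample lines are typed in by hand and their exact column spacing is an assumption, since the original whitespace was not preserved. Dropping the `_result` suffix is an optional extra step, added here because the row name requested in the question does not include it.

```r
# Sample lines typed in by hand; the exact spacing is an assumption
rl <- c(
  "simple statistics using mspa parameters: 8_3_1_1 on input file: 20130815 104359 875 000000 0528 0548_result.tif",
  "mspa-class [color]:   foreground/data pixels [%]   frequency",
  "============================================================",
  "  core(s) [green]:            --             0",
  "  islet [brown]:          3.70/ 1.03        20",
  "  missing [white]:           0.00            0"
)

# Image name: the text between "file: " and ".tif" on the first line
img_name <- sub(".*file:\\s+(.*)\\.tif.*", "\\1", rl[1])

# Optional: drop the "_result" suffix so only the image id remains
img_id <- sub("_result$", "", img_name)

# Drop the header: everything up to and including the "====" line
body <- rl[-seq_len(grep("==", rl, fixed = TRUE)[1])]

# Trim leading whitespace, then split on runs of two or more spaces;
# single spaces inside a field such as "3.70/ 1.03" are left alone
fields <- strsplit(sub("^\\s+", "", body), "\\s{2,}")
mat <- do.call(rbind, fields)
mat
```

Splitting on two or more whitespace characters is what keeps values like `3.70/ 1.03` and labels like `islet [brown]:` in one piece, so it only works while the columns stay separated by at least two spaces.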
I created three files using your example and made each one a little different to show how this would work on a directory with several files:
    # [[1]]
    #                                    image_name core(s) [green]: core(m) [green]: core(l) [green]: islet [brown]: perforation [blue]: edge [black]: loop [yellow]: bridge [red]: branch [orange]: background [grey]: missing [white]:
    # 1 20130815 104359 875 000000 0528 0548_result             <NA>      48.43/13.45             <NA>     3.70/ 1.03          0.00/ 0.00   30.93/ 8.59     9.66/ 2.68    0.00/ 0.00       7.28/ 2.02         --- /72.22             0.00
    #
    # [[2]]
    #                                    image_name core(s) [green]: core(m) [green]: core(l) [green]: islet [brown]: perforation [blue]: edge [black]: loop [yellow]: bridge [red]: branch [orange]: background [grey]: missing [white]:
    # 1 20139341 104359 875 000000 0528 0548_result               23      48.43/13.45               23           <NA>          0.00/ 0.00   30.93/ 8.59     9.66/ 2.68    0.00/ 0.00       7.28/ 2.02         --- /72.22             0.00
    #
    # [[3]]
    #                                    image_name core(s) [green]: core(m) [green]: core(l) [green]: islet [brown]: perforation [blue]: edge [black]: loop [yellow]: bridge [red]: branch [orange]: background [grey]: missing [white]:
    # 1 20132343 104359 875 000000 0528 0548_result             <NA>             <NA>             <NA>           <NA>                <NA>   30.93/ 8.59     9.66/ 2.68    0.00/ 0.00       7.28/ 2.02               <NA>             0.00
Edit
I made a few changes to extract more info:
    (lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
    # [1] "/users/rawr/desktop/image001.txt" "/users/rawr/desktop/image002.txt"
    # [3] "/users/rawr/desktop/image003.txt"

    res <- lapply(lf, function(xx) {
      rl <- readLines(con <- file(xx), warn = FALSE)
      close(con)

      img_name <- gsub('.*file:\\s+(.*)\\.tif', '\\1', rl[1])
      rl <- rl[-(1:grep('==', rl))]
      rl <- gsub('^\\s+', '', rl)

      mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))
      dat <- as.data.frame(mat, stringsAsFactors = FALSE)

      ## split the "foreground/data pixels" field into two columns; a single
      ## value is recycled into both columns and "--" becomes an empty string
      tmp <- `colnames<-`(do.call('rbind', strsplit(dat$V2, '[-\\/\\s]+', perl = TRUE)),
                          c('foreground', 'data pixels'))

      dat <- cbind(dat[, -2], tmp, image_name = img_name)
      dat[] <- lapply(dat, as.character)
      dat[dat == ''] <- NA
      names(dat)[1:2] <- c('mspa-class', 'frequency')

      ## one row per file, three columns per mspa class
      zzz <- reshape(dat, direction = 'wide', idvar = 'image_name', timevar = 'mspa-class')
      names(zzz)[-1] <- gsub('(.*)\\.(.*) (?:.*)', '\\2_\\1', names(zzz)[-1], perl = TRUE)
      zzz
    })
Here is the result (I transformed it into a long matrix to make it easier to read. The real results are in a wide data frame, one for each file):
    `rownames<-`(matrix(res[[1]]), names(res[[1]]))
    #                         [,1]
    # image_name              "20130815 104359 875 000000 0528 0548_result"
    # core(s)_frequency       "0"
    # core(s)_foreground      "NA"
    # core(s)_data pixels     "NA"
    # core(m)_frequency       "1"
    # core(m)_foreground      "48.43"
    # core(m)_data pixels     "13.45"
    # core(l)_frequency       "0"
    # core(l)_foreground      "NA"
    # core(l)_data pixels     "NA"
    # islet_frequency         "20"
    # islet_foreground        "3.70"
    # islet_data pixels       "1.03"
    # perforation_frequency   "0"
    # perforation_foreground  "0.00"
    # perforation_data pixels "0.00"
    # edge_frequency          "11"
    # edge_foreground         "30.93"
    # edge_data pixels        "8.59"
    # loop_frequency          "6"
    # loop_foreground         "9.66"
    # loop_data pixels        "2.68"
    # bridge_frequency        "0"
    # bridge_foreground       "0.00"
    # bridge_data pixels      "0.00"
    # branch_frequency        "40"
    # branch_foreground       "7.28"
    # branch_data pixels      "2.02"
    # background_frequency    "11"
    # background_foreground   "NA"
    # background_data pixels  "72.22"
    # missing_frequency       "0"
    # missing_foreground      "0.00"
    # missing_data pixels     "0.00"
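The `reshape()` step that turns one row per class into one row per file may be easier to follow on a tiny example. The sketch below uses made-up stand-in data (`img_a`, two classes, two value columns) and a simpler renaming pattern than the one in the answer; both are assumptions for illustration only.

```r
# Stand-in long data: one row per class for a single (made-up) image
long <- data.frame(
  image_name = "img_a",
  class      = c("core(m)", "islet"),
  frequency  = c("1", "20"),
  foreground = c("48.43", "3.70"),
  stringsAsFactors = FALSE
)

# Wide format: one row per image_name, one column per (value, class) pair.
# reshape() names the new columns "<value>.<class>", e.g. "frequency.islet".
wide <- reshape(long, direction = "wide",
                idvar = "image_name", timevar = "class")

# Swap the two halves of each name: "frequency.islet" -> "islet_frequency"
names(wide)[-1] <- sub("^([^.]+)\\.(.*)$", "\\2_\\1", names(wide)[-1])

names(wide)
```

The renaming relies on the value column names (`frequency`, `foreground`) containing no dot, so the first dot in each name is always the separator that `reshape()` inserted.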
With your sample data:
    lf <- list.files('~/desktop/data', pattern = '\\.txt$', full.names = TRUE)
    `rownames<-`(matrix(res[[1]]), names(res[[1]]))
    #                         [,1]
    # image_name              "20130815 103704 780 000000 0372 0616"
    # core(s)_frequency       "0"
    # core(s)_foreground      "NA"
    # core(s)_data pixels     "NA"
    # core(m)_frequency       "1"
    # core(m)_foreground      "54.18"
    # core(m)_data pixels     "15.16"
    # core(l)_frequency       "0"
    # core(l)_foreground      "NA"
    # core(l)_data pixels     "NA"
    # islet_frequency         "11"
    # islet_foreground        "3.14"
    # islet_data pixels       "0.88"
    # perforation_frequency   "0"
    # perforation_foreground  "0.00"
    # perforation_data pixels "0.00"
    # edge_frequency          "1"
    # edge_foreground         "34.82"
    # edge_data pixels        "9.75"
    # loop_frequency          "1"
    # loop_foreground         "4.96"
    # loop_data pixels        "1.39"
    # bridge_frequency        "0"
    # bridge_foreground       "0.00"
    # bridge_data pixels      "0.00"
    # branch_frequency        "20"
    # branch_foreground       "2.89"
    # branch_data pixels      "0.81"
    # background_frequency    "1"
    # background_foreground   "NA"
    # background_data pixels  "72.01"
    # missing_frequency       "0"
    # missing_foreground      "0.00"
    # missing_data pixels     "0.00"
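Since `res` is a list with one wide data frame per file, a final step is still needed to get the single merged data frame the question asks for. A possible sketch, assuming every file yields the same set of columns; `res_demo` is a made-up stand-in for `res` with only two value columns:

```r
# Stand-in for `res`: one single-row, all-character data frame per file
res_demo <- list(
  data.frame(image_name = "img_a", islet_frequency = "20",
             islet_foreground = "3.70", stringsAsFactors = FALSE),
  data.frame(image_name = "img_b", islet_frequency = "11",
             islet_foreground = NA_character_, stringsAsFactors = FALSE)
)

# Stack the per-file rows; this requires identical column names in each
merged <- do.call(rbind, res_demo)

# Use the image name as the row name, as requested in the question
rownames(merged) <- merged$image_name
merged$image_name <- NULL

# The values were read as text, so convert every column to numeric;
# NA stays NA and "0.00" becomes a real 0
merged[] <- lapply(merged, as.numeric)
merged
```

With the real data, `do.call(rbind, res)` would take the place of the stand-in list, and the result could then be written out with `write.csv()`.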