split train and valid

jstzwj 6 years ago
Parent
commit
af76e4b631
5 changed files with 19341 additions and 1 deletion
  1. +9666 -0  data/train_source.txt
  2. +9665 -0  data/train_target.txt
  3. +0 -0  data/valid_source.txt
  4. +0 -0  data/valid_target.txt
  5. +10 -1  preprocess.py

File diff suppressed because it is too large
+ 9666 - 0
data/train_source.txt


File diff suppressed because it is too large
+ 9665 - 0
data/train_target.txt


+ 0 - 0
data/octoon_source.txt → data/valid_source.txt


+ 0 - 0
data/octoon_target.txt → data/valid_target.txt


+ 10 - 1
preprocess.py

@@ -70,9 +70,18 @@ def generate_dataset(messages, output_path_source, output_path_target):
                     pbar.update()
 
 if __name__ == "__main__":
+    '''
     print('read message')
     msg = read_qq_history_file('data/Octoon 开发组.txt')
     print('filter message')
     filter_msg(msg)
     print('write to file')
-    generate_dataset(msg, 'data/octoon_source.txt', 'data/octoon_target.txt')
+    generate_dataset(msg, 'data/octoon_source.txt', 'data/octoon_target.txt')
+    '''
+
+    print('read message')
+    msg = read_qq_history_file('data/ISOIEC C++ China Unofficial.txt')
+    print('filter message')
+    filter_msg(msg)
+    print('write to file')
+    generate_dataset(msg, 'data/train_source.txt', 'data/train_target.txt')
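
Note on the hunk above: it switches datasets by wrapping the old read/filter/write sequence in a string literal and pasting a second copy below it. A minimal sketch of one alternative follows, in which the three steps live in a single helper and each dataset is a table entry. The names `run_preprocess` and `JOBS` are hypothetical and do not appear in this commit. The `valid` entry assumes the Octoon history is what should feed the renamed `valid_*` files, which the diff suggests but does not state. The pipeline functions are passed in as arguments so the sketch runs on its own.

```python
# Hypothetical table of preprocessing jobs: (history file, source output, target output).
# The 'train' paths are taken from the diff; the 'valid' output paths are an assumption
# based on the octoon_* -> valid_* renames in this commit.
JOBS = {
    'train': ('data/ISOIEC C++ China Unofficial.txt',
              'data/train_source.txt', 'data/train_target.txt'),
    'valid': ('data/Octoon 开发组.txt',
              'data/valid_source.txt', 'data/valid_target.txt'),
}


def run_preprocess(history_path, source_out, target_out, read_fn, filter_fn, write_fn):
    """Run the read -> filter -> write steps from the diff for one history file."""
    print('read message')
    msg = read_fn(history_path)
    print('filter message')
    # The diff ignores filter_msg's return value, so it is assumed to filter in place.
    filter_fn(msg)
    print('write to file')
    write_fn(msg, source_out, target_out)
    return msg


# Inside preprocess.py the call could then look like this (not executed here,
# because those functions are defined in that file):
#   run_preprocess(*JOBS['train'], read_qq_history_file, filter_msg, generate_dataset)
```

With this layout, adding or switching a dataset means editing one table entry rather than commenting out a block of code.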

Some files were not shown because too many files changed in this diff